Tuesday, July 31, 2018

ZABBIX: the second server

Tried to launch ZABBIX with all the fixes in the role of the server-monitoring server, and it started to segfault. This turned out to be a bug in error logging.

There is another (big?) problem there:
it looks like the poller needs to be altered further to support asynchronous checks for agents. Or some kind of locking occurs there.

The strange thing is that simple checks are slow too. Or it might be something global like DNS.

This installation is rather small: 

And sure enough, the server is 97% idle.

upd: nothing interesting. The new server was not permitted by IP filters to access hosts via SNMP, and for agents there was also no permission for it to request data.

Thursday, July 26, 2018

ZABBIX: what does a cat do when it has free time?

Subj. ZABBIX does quite the opposite.

I have successfully changed the code to allow as many preprocessor managers as I want.

But whatever I did, I was stuck at 60k NVPS.
My initial thought was that it's the preprocessor manager that slows things down, so I rewrote it to run 4 managers.

But then I saw that running 4 managers spreads the CPU load between them but gives no NVPS increase.

Whatever I thought of, I couldn't find a reason for such behavior. But today it struck me: it's the worker threads, and everything is waiting for them.

So I straced a worker thread.

.. semop....
Ok, is it GLOBAL config LOCK again?

NO, it's mutex 0, the mutex dedicated to logging.
But I don't see anything from the workers in the logs, and the log level is 3.

And guess what?
Let me show you the main loop of worker code:
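(The original listing is gone; what follows is my approximate reconstruction from the 3.4 sources of the part that matters. Treat names and details as approximate, not verbatim.)

/* preproc_worker.c, main loop of a preprocessing worker (approximate reconstruction) */
for (;;)
{
	zbx_handle_log();	/* takes the LOG mutex and checks/rotates the log file
				   on EVERY iteration, in EVERY worker process */

	if (SUCCEED != zbx_ipc_socket_read(&socket, &message))
	{
		zabbix_log(LOG_LEVEL_CRIT, "cannot read preprocessing service request");
		exit(EXIT_FAILURE);
	}

	worker_process_request(&socket, &message);	/* the actual preprocessing work */
	zbx_ipc_message_clean(&message);
}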

I commented it out. It does log rotation. With locking. Inside a loop that may be running in 100 threads and whose task is to process an item as fast as possible.

So all the worker threads kept trying to take the LOG lock and rotate the log file if it had grown bigger than the configured size. I guess it's a developer's mistake.

So, without log handling the workers can run a bit faster, and the server achieved approximately 85-90k NVPS while being only 25% idle.

So perhaps that's the limit for a dual E5645 system with 12 physical cores (24 in total if we count the hyper-threaded ones).

During all these tests I also found a preprocessor manager priority flaw: it was likely to build up queues before the buffer was full. And once the buffer was full it would work OK, but it would slow down the pollers.

Now it's just fine. The queue is close to zero. Each of the managers writes about 110k items every 5 seconds, so in total it is 110k * 4 threads / 5 seconds = 88k NVPS.

Wednesday, July 25, 2018

ZABBIX: crash testing, records #2

....failed very fast; with lots of processes the system starts to swap

fixed the buffer; the system does fine with 20 SNMP threads, about 25k NVPS

significant queueing with 43k required NVPS, trying 30 threads


didn't help; changed the delay from 90 to 120, zabbix somehow stopped working normally, restarted.

doesn't work either, reduced the number of pollers to 20
queueing, 33k NVPS.....
raised pollers back to 40....

OK, 30k NVPS OF SNMP ONLY - NO PROBLEM
changed the SNMP delay back to 90s.... let's look
it's still queueing but just above 10 sec, at a planned load of 43k NVPS

reduced the SNMP delay to 100 sec, pinger still off, let's see
ok, the system is stable with completely zero queues at 33k NVPS of pure SNMP load.

Now, let's test FPING with 100 threads:
the old version could handle the load without big problems with up to 1 sec (1 sec had queueing in the first column)

NMAP: 100 threads works worse than FPING
high system load, 32% idle

Reduced NMAP threads to 20:
it does just as well as 100 FPING threads, but system idle is 62%

Reduced NMAP threads to 10: it's worse, 20 seems more optimal; with 10 it's 82% idle

Now try fping with 20 threads:

Normal SNMP looks like this:
0:00.26 /usr/local/sbin/zabbix_server: poller #1 [got 249 values in 5.005509 sec, getting values]                                
 7942 zabbix    20   0 13,670g 271628 213924 S   0,3  1,7   0:00.26 /usr/local/sbin/zabbix_server: poller #4 [got 175 values in 4.851671 sec, getting values]                                
 7943 zabbix    20   0 13,670g 265808 208076 S   0,3  1,6   0:00.25 /usr/local/sbin/zabbix_server: poller #5 [got 142 values in 4.914348 sec, getting values]                                
 7946 zabbix    20   0 13,670g 268228 210508 S   0,3  1,6   0:00.25 /usr/local/sbin/zabbix_server: poller #8 [got 175 values in 5.141629 sec, getting values]                                
 7947 zabbix    20   0 13,670g 274664 216944 S   0,3  1,7   0:00.26 /usr/local/sbin/zabbix_server: poller #9 [got 158 values in 5.044240 sec, getting values]                                
 7949 zabbix    20   0 13,670g 264984 207320 S   0,3  1,6   0:00.24 /usr/local/sbin/zabbix_server: poller #11 [got 184 values in 5.038127 sec, getting values]                               
 7951 zabbix    20   0 13,670g 273344 215680 S   0,3  1,7   0:00.27 /usr/local/sbin/zabbix_server: poller #13 [got 138 values in 4.924217 sec, getting values]                               
 7953 zabbix    20   0 13,670g 287580 229860 S   0,3  1,8   0:00.27 /usr/local/sbin/zabbix_server: poller #15 [got 172 values in 5.133177 sec, getting values]                               
 7955 zabbix    20   0 13,670g 258828 201032 S   0,3  1,6   0:00.24 /usr/local/sbin/zabbix_server: poller #17 [got 142 values in 4.982964 sec, getting values]            

140-150 values per thread per 5 seconds
tried 300 threads



fping synthetic results:
 fixed fping
real    0m42,646s
user    0m0,236s
sys    0m0,336s


total CPU usage: 0,572s

fping normal

real    1m33,304s
user    0m0,104s
sys    0m0,276s
total CPU usage: 0,380s

nmap
real    0m3,252s
user    0m0,220s
sys    0m0,076s
total CPU usage: 0,296s




ZABBIX: crashing, getting SIGSEGVs and so on - diary records

And I still see the server failing (segv sometimes).
It's not nearly as stable as the production one, but typically these problems are quite easy to catch.

The problem now is that a server restart takes more and more time (about 2-3 minutes).



The preprocessing manager problem



The questions:

"something impossible has happened" logs

why is SNMP slow to be re-enabled? - figure out why the unreachable poller threads are waiting and doing nothing



huge load after some time!!!! net_snmp_close()???

during the night there was a huge number of "must use snmp_select_socket2" (or similar) messages - it seems that under certain conditions I leave SNMP sessions open.
Too many sessions cause high load, and what seems to be the select() call stops working, so essentially the whole gathering process breaks.
TO DO: add some profiling, look at where I may be leaving sessions unclosed, consider removing the (select) code and replacing it with a stupidly simple wait call.


morning: huge SNMP queue; the threads seem to work normally, I haven't found any abnormalities, it's just that 10 of them are not enough. (Actually, they work much longer than at the start, their run time has risen from 6-9 seconds to 14-25, but that might be due to the high system load.)

Raised the number of SNMP pollers to 25, starting at 08:08
08:58 flight normal, queue 0

it seems something strange happened at 11:24-11:34; apparently the server crashed at that time, and there is a long SNMP queue left in the browser
11:41 started it again ... actually the server is alive, but it's swapping and there are 16M records in the preprocessing manager's queue.

Maybe that's just too many hosts???
Maybe the load from fpings is too big
Now I've disabled about 20k hosts, reduced the number of threads to 20 for both the pinger and the poller, and restarted the server; let's see what happens


With only half of the load (and the number of DB syncers increased to 40), the system seems to be doing fine for at least an hour now. And this is the moment when I finally see some SNMP data gathered in the graphs.

I will watch it for another 2 hours, then start gradually adding new devices.
UPD: so far everything is fine, no queueing, system 90% idle


Another problem to consider: devices marked as not SNMP-accessible stay in that state too long, and I've never seen any processing done by the unreachable pollers.
I expect I've broken some fetching logic in the DC functions, and that's why it takes so long to mark devices as SNMP-accessible again.


So my theory is: 1. I didn't have enough data syncer processes enabled before, so they were constantly busy filling the value cache (ClickHouse is not really fast for single-host queries). Actually, I would suggest filling the cache at startup in some bulk way - that would mean one query returning millions of rows. Unfortunately, I am not sure that this is what zabbix and its triggers really need.


The biggest thing bothering me right now is occasional preprocessing manager queueing. I am almost sure that the queueing depends on two factors: system load and the number of worker threads (actually it might be just one of them, as these two factors are closely linked). There is one big optimization that could be done in the SNMP poller - enabling bulk querying. It would save some CPU ticks and network PPS and might be easier on network devices, but it will not really speed things up, since each SNMP thread will anyway be waiting on timeouts of inaccessible devices, which take much longer than making 100 queries to an accessible device, and those happen in parallel in the asynchronous model.

Another big change and improvement that might be made is switching from fping to nmap for host checking. Question one is how valuable packet loss rates are for understanding device reachability. My thought is that it might be more valuable to send 10 pings one every 3 seconds than to send them all at once every 30 seconds. With such a way of checking, the packet loss rate would be calculated in a trigger as an average over the last 5-10 checks. To speed up the reaction when a device becomes inaccessible, the last 2-3 checks must be considered.


ok, morning, 3:17. I see that the installation has crashed: the preprocessing manager caught a SIGSEGV.

All pingers have been off (I've set StartPingers=0).

I really believe it's the preprocessing manager that is to blame, in particular the queueing of items. So I'd like to fix it in one of the following ways:
- find an answer: why is queueing happening at all???
- do not allow queueing
- after a certain amount of queueing, throttle the processes to wait until the preprocessing manager is free (if it's a load problem, which it seems not to be)

I have one nice thing: I know that running zabbix in debug mode almost immediately causes the queueing. So I need to trace down the whole decision process to understand why the hell it queues the messages.


I see that it spends most of its time processing IPC requests of type 2 (probably the results of the pollers' work). Added extra flush and history flush logging to see if it's happening at all. As the next step I will add the _flush call result.









The GitHub sources and naming

The sources are in two projects on GitHub.
Cleaned and checked sources and patch files are at
https://github.com/miklert/zabbix
At the least, a separate clickhouse patch will be there soon.

The very unstable development version is at
https://github.com/miklert/xe-rabbix

Now it's based on 3.4.9 code.

A few words about compiling zabbix: you'll need the MariaDB-shared library, not just the client and server libraries and development files. I spent quite some time figuring that out.

For clickhouse, compile with the curl library.

Regarding the name: we had to call it something internally, and every monitoring system must have something in common with ZABBIX in its name.

So at first we called it xerabbix [herabbiks], which came from a soft-sounding Russian word for 'dick' plus ZABBIX, but that was too rude, so after a month of development it transformed into xe-rabbix, now pronounced [iks-e-rabbiks], leaving the rudeness in the past.

Nowadays we interpret the name as the eXtra Edition of zabbix.

Tuesday, July 24, 2018

ZABBIX: Configuration cache, DC_poller_get_items, DCpoller_requeue_items, queues

A test machine was able to achieve a steady 51k NVPS.

That is more than enough. The test version could run for more than a week at a stable 16k NVPS with no memory leaks. So we planned to put the fixed ZABBIX code into production.

The old installation was ZABBIX version 2.1. The migration job included lots of other work like changing the OS and fixing automation. A long night's work.

Just after the server was upgraded to the new OS, we started the database and the server and .... OOOPS, it could only do 5k NVPS. The CPU was 90% idle at that time. We have seen this already, right? To finish the job, production was left running on the test machine for the time being.
 
OK, let's figure out what's going on on the new server.
These are the statistics strace shows for a poller thread.



It clearly shows the thread waiting for its turn to get a lock.


Now a few ideas about locking.

ZABBIX holds a huge configuration structure in memory, organized in several tables (perhaps lists or trees).

Apart from that, there are 5 queues of tasks, one per polling type (ICMP, POLLER, PROXY and so on).
In each queue the structures are ordered and (perhaps) hashed by next check time. The structures reference items.

So each time something accesses the configuration, it takes the global lock via a mutex and semop, reads/updates the configuration, and then unlocks.
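In dbconfig.c this boils down to the LOCK_CACHE / UNLOCK_CACHE pattern. A simplified illustration of what a poller-facing function does (heavily trimmed, not the real code; the actual function is DCconfig_get_poller_items()):

/* simplified illustration of the locking pattern in dbconfig.c (not verbatim) */
int	DCconfig_get_poller_items(unsigned char poller_type, DC_ITEM *items)
{
	int	num = 0;

	LOCK_CACHE;		/* the single global config mutex, backed by semop() */

	/* walk the one shared per-type queue, pick items whose nextcheck <= now */
	/* ... */

	UNLOCK_CACHE;		/* only now can any other poller/pinger/syncer get in */

	return num;
}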



 The global lock domain picture:


After analyzing the dbconfig.c and dbcache.c code, I decided that for some operations the global lock might be avoided. To be precise, all operations from the poller and pinger are safe to go in parallel as long as they use and lock their own queues:



I've decided to split each queue type into 4 pieces. To keep host-to-thread persistence, items are distributed between the queues by a (hostid % 4) hash.
To maintain polling thread persistence, threads are bound to queues by their own hash (procnum % 4).

There should be at least 8 threads of a given type (better 16 or more), so that at any time at least one thread can be requesting data from the DC cache while the others are doing the polling job.
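A minimal sketch of the idea (my own illustration, not the actual patch; types are simplified):

#include <pthread.h>
#include <stdint.h>

#define QUEUES_PER_TYPE	4

typedef struct
{
	void		*queue;		/* the per-queue structure ordered by nextcheck */
	pthread_mutex_t	lock;		/* protects only this sub-queue, not the whole config */
}
dc_subqueue_t;

static dc_subqueue_t	poller_queues[QUEUES_PER_TYPE];

/* items of one host always land in the same sub-queue */
static int	item_queue(uint64_t hostid)	{ return (int)(hostid % QUEUES_PER_TYPE); }

/* a poller thread is pinned to "its" sub-queue by its process number */
static int	poller_queue(int procnum)	{ return procnum % QUEUES_PER_TYPE; }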

So the result?

But before that, here's one funny twist in the situation:

After fixing the queues and making them work, I saw that the server performance was still no better.

Strace showed that the pollers still spent most of their time in semop calls waiting for the global lock (mutex ZBX_MUTEX_CONFIG, 4).


2512:20180723:121820.512 In DCget_user_macro() macro:'{$SNMP_COMMUNITY}'
zabbix_server [2512]: zbx_mutex_lock: global config lock attepmt!
zabbix_server [2512]: zbx_mutex_lock: global config unlocked

After doing some profiling I found that the problem was in a macro nobody really needs: it's the macro for the SNMP community, and we only have one community anyway!

And then I even thought that maybe ALL the locking problems were due to this macro, and a week of coding had just been due to my lack of knowledge of ZABBIX.

So I reverted to the no-queue ZABBIX version. It showed that performance is indeed much better without the macro, but only 2 times better.

Still not good enough, so we were going the right way with the queues.

As we are still in development, let's do a quick fix:

update items set snmp_community='isread' where snmp_community like '{$SNMP_COMMUNITY}';

(for production it's probably better to use the API or UI to fix the templates).


The following pic best describes the first launch of ZABBIX with queues:


On the same server which was doing 5k NVPS before, ZABBIX showed a steady 60k NVPS.

And finally, for the first time ever, I could load the machine to 50% with ZABBIX.

Why not more? Because then the next limiting thing comes into play: the preprocessing manager.

ZABBIX: preprocessor manager queue buffer

This happened for the first time right after I launched asynchronous SNMP polling, which immediately raised NVPS from 2-3k to 7-10k.

First I noticed a significant delay between polling the data and seeing it in the ClickHouse database. Doing tcpdumps and selects from Clickhouse revealed that the data was getting stuck inside the zabbix server daemon. Since tcpdump showed that the data was being gathered from the network on time, the pollers were not to blame.

Looking at the top -c output, I saw that the preprocessor manager has queue information in its process title (set via setproctitle), and it was constantly growing.

So my first job was to find out why it was growing. As this was a development machine, I had logging set to debug all the time.

The logs didn't show me where the preprocessing manager was spending all its time.
So I added some extra logs and fixed log.c to increase the logging time precision to 1/10000 s.
I found nothing, except the strange fact that all operations take 190-222 microseconds. And that was interesting, because it meant something slow happens each time I write information to the logs. Right, it's the logging itself.

After I switched off logging, the queueing disappeared, so I thought I'd solved the problem.

However, a week or so later, when the system started to reach 20k NVPS, I saw the preprocessor manager start queueing again.

Queueing is a good thing, as it provides a buffer for overload times. But in my case it was happening endlessly: the queue kept growing until ZABBIX couldn't allocate any more memory.

I had to start the investigation again. And this time it was tricky, since I couldn't log every operation: logging introduced much more overhead and many more problems than the problem I was looking for.

Finally, from reading old logs and looking at the source, the problem was found:

The preprocessing manager has a socket where it accepts two types of data. First, there are requests from the pollers to process newly collected items. These requests are placed in a queue, and the preprocessing manager then distributes preprocessing tasks to the worker threads.
The second type of message is a worker's response with the data that has been preprocessed.

With big batches of collected items (at that time they were 4k in size, later increased to 32k), the preprocessor manager had no choice but to queue items constantly: when a big chunk of data arrived on the IPC socket, it read all of it and then sent item-processing tasks to the worker threads. As there are many more items in a batch than there are worker threads, it had to queue most of the batch. Then it had to read from the socket again, since to release a worker thread for its next item, the worker has to submit its result first.

(Note, corrected later: about a month after, I accidentally realized that message sizes were limited to 256 items and the only factor in the queueing was processing speed.)

I altered the code to open a second socket, so the preprocessing manager can decide which data to read: values from the pollers or results from the workers.

This allowed it to survive peak-load times, when processing of new data was going slower than the pollers could collect it. And the bigger the queue was, the more processing time was needed (or it might be that swapping started on the test system; I had some memory constraints there).

The fix works the following way: the preprocessing manager works the usual way until it gets 1M records in the queue. Then it stops reading requests from the pollers and only sends data to the workers, until there is less than 1M in the queue again.
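A minimal sketch of that main-loop logic (my illustration only; the real code lives in the preprocessing manager, and message_ready()/handle_*() below are hypothetical helpers):

#define QUEUE_THROTTLE_LIMIT	1000000		/* 1M queued values */

for (;;)
{
	/* worker results are always read, so workers never starve */
	if (message_ready(worker_socket))
		handle_worker_result(worker_socket);

	/* poller requests are read only while the queue is below the limit;   */
	/* otherwise the pollers block on their request and polling slows down */
	if (queued_values < QUEUE_THROTTLE_LIMIT && message_ready(poller_socket))
		handle_poller_request(poller_socket);

	flush_queue_to_workers();
}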

The pollers have to wait until their request is processed, so at peak times when the queue is full, polling for new data slows down. This is much better than eventually crashing and losing the collected data. I've seen the queue reach 20M items.

The preprocessor manager is the only bottleneck left for now, so I'm considering either duplicating it or splitting its functionality into two threads.

One thread would process the requests and the other the results.

Making two equally functional threads seems more difficult, as we may end up in a situation where dependent items are processed on different threads, making them wait endlessly for the other item to arrive. Or they would have to share common storage and use mutexes to arbitrate access to that storage safely.



ZABBIX: the results

The initial goal was to optimize hardware usage: we thought we would be able to reduce 21 servers to 6 without sacrificing service. There were also plans to start gathering some new data while stopping the gathering of other data we don't need.

We wanted to be able to keep history for at least 3 months, or half a year, by offloading some history data to another MySQL server.

In fact, the changes allowed us to achieve much more.

First of all, the clickhouse offloading and SNMP improvements alone allow everything to run on the same machine. So we need two, for redundancy.


Fixing the queueing problem allowed the overall speed to rise from 5-6k NVPS to 62k NVPS on the same server. I certainly believe that fixing the preprocessing manager bottleneck would allow a stable 100k NVPS on the same hardware.

Some server numbers:

The current machine is 65% idle, with only one CPU core fully loaded by the thread running the preprocessing manager.

The slow-CPU-core problem is solvable with a higher-frequency, more modern CPU.

The test server running a single E3-1280 Xeon could achieve a stable 50k without the queue optimizations.

It actually performs much better than the older system with dual E5645s.


I've tried to run the zabbix server on a machine with AMD CPUs with 32 cores in total. I was expecting to see overall performance rise. But performance was very poor: something like 2-3k NVPS.

It might be that such problems also come from the number of items we have to process. Perhaps with a lower number of items there would be better results.

And one more funny thing: having a {$MACRO} of any kind in a key, community or any other item parameter degrades polling speed 2 times, as DC_macro_resolve takes the global config lock.



But let's return to the goals. The initial goals are now achievable on a mediocre laptop. Mine, with an i5-5200U CPU and an SSD drive, could steadily monitor at 5k NVPS with clickhouse installed locally.

On a modern server with fast cores one can plan to achieve 75k NVPS in production / 120k NVPS stable.

There is a question: why would one need such capacity?

It is important even for current tasks: we reduced the polling interval to 10 seconds to have fast reaction, and we were able to poll all SNMP every 5 minutes.

Even if it's questionable whether anyone really needs such speeds, I would suggest using the improvements to decrease polling intervals 10-15 times.
There are a lot of possibilities that such HD monitoring gives.

First of all, analyzing event correlation gets much better with higher precision. Second, you get a chance to log things that happen fast. For example, just after starting the new monitoring we could finally discover what causes spontaneous switch availability problems. Thanks to the higher precision, we could find exactly for how long, where, and which group of switches was in outage.

So, having 20 times more monitoring capacity lets you think about the new possibilities you can achieve on the same installation. Eventually you discover or invent something that will create business value.






Monday, July 23, 2018

Clickhouse vs Elasticsearch

A short and very dishonest comparison of these two systems.

There are different testing techniques, and each system describes how best to test it. So, to get results that somehow show how the systems would behave under a Zabbix history load, I tried to reproduce the exact type and pattern of load they would be under for Zabbix history saving and retrieval.

I used the HTTP interfaces of both, and curl, to submit and extract data. Data was submitted in chunks of 1000 rows: this is exactly what ZABBIX does. For data retrieval I did several tests: extracting the last value, the last 5 values, and all values for an item.

Both databases were empty before the test. In total I added about 30 million records to each one. That's quite a low number for understanding long-term behavior.

There was already about one month of test Zabbix data collected in Clickhouse, so I did some tests on our Clickhouse install to see the degradation of data retrieval on large amounts of data and a long history term. I didn't see any performance drop at all. I had decided I would do the same for Elasticsearch if it outperformed Clickhouse on small amounts of data. That didn't happen.

For the test I used freshly installed, untuned databases.

I didn't do a MySQL test for two reasons: there are many MySQL vs Clickhouse comparisons showing the better performance and data compression of the latter.

Here, and here, or try to google for a dozen others.

And the other reason is a fact we've learned from our existing Zabbix install: MySQL performance degrades dramatically on large volumes of data; partitioning helps, but it still keeps getting slower.


So the test graph:



A few comments:
1. The only metric where Elastic beat Clickhouse is CPU usage on data fetch, and the difference is quite small. Considering how much CPU Elastic needs for storing data, Clickhouse uses less CPU overall.

2. Please pay attention to how little disk, IOPS and CPU clickhouse needs to write data. Also, due to compression, Clickhouse needs 11 times less drive space than Elastic. Compared to MySQL, it's 18 times less disk space.

So, the short conclusion: I see NO reason NOT to use Clickhouse.



ZABBIX: clickhouse details

At high rates (more than 1k NVPS) of new values being written to history, installing clickhouse alongside the MySQL server (and probably any other supported OLTP database) makes a BIG difference.

Clickhouse requires virtually no CPU and disk IO for writing data to the drive.

Let's start with the disadvantages of clickhouse:

The biggest one is the merge delay. On systems where sharding is used, data may appear on graphs only after 5-20 minutes. This does not happen when there is only one node, or several nodes with the same information.

That merge delay is acceptable for analytics data, and the rest of the functionality, like triggers, is not affected.

Clickhouse uses a little bit more CPU on data fetch. Since it is only marginally more than Elastic, and considering the CPU expense Elastic incurs on data save, this can hardly be called a disadvantage.

With clickhouse, there is no need to clean history tables and no need for trends. The housekeeper won't clean data in Clickhouse (use the DB's own methods to delete old partitions).

The impact of Clickhouse is invaluable: it uses almost no CPU or disk IOPS for storing large amounts of data. Loads of 50-100k new rows per second are something that can be handled on an ordinary spindle SATA drive.

Installing it alongside MySQL on the same machine frees lots of MySQL resources (or those of any other OLTP database) that were needed for storing history data, while clickhouse itself uses almost no resources at all. I would put it this way: you will barely notice clickhouse or mysql in top or htop on a system processing 20-30k NVPS of data.

Installing clickhouse is rather simple: there is nice Yandex documentation about it: https://clickhouse.yandex/docs/en/getting_started/ . You can build from source, or there are packages for the most popular Linux flavors.

Once clickhouse is installed, you need to create the table for storing history data. It's easy to do via the clickhouse client (https://clickhouse.yandex/docs/en/interfaces/http_interface/). Launch clickhouse-client and execute the queries:

CREATE TABLE zabbix.history ( day Date,  itemid UInt32,  clock DateTime,  ns UInt32,  value Int64,  value_dbl Float64,  value_str String) ENGINE = MergeTree(day, (itemid, clock), 8192);

CREATE TABLE zabbix.history_buffer ( day Date,  itemid UInt32,  clock DateTime,  ns UInt32,  value Int64,  value_dbl Float64,  value_str String) ENGINE = Buffer(zabbix, history, 16, 30, 100, 50000, 1000000, 1000000, 10000000) ;

This will create the table and a buffer table for effective data merging.
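(Two notes here: the zabbix database must already exist in clickhouse; if it does not, run CREATE DATABASE zabbix; first. And for reference, the Buffer engine arguments are Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes), so with the values above each of the 16 buffer layers is flushed to zabbix.history roughly once it has been sitting for 100 seconds or has accumulated 1M rows / 10 MB, whichever comes first.)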

That's it. Clickhouse is ready to accept data.

Then prepare zabbix:
apply the clickhouse patch (it patches both the server and the frontend) and set up the config files. In zabbix_server.conf:


HistoryStorageURL=http://localhost:8123
HistoryStorageTypes=uint,dbl,str,text
HistoryStorageType=clickhouse 
HistoryStorageTableName=zabbix.history_buffer


In the frontend configuration (zabbix.conf.php):

global $HISTORY;
$HISTORY['url']   = 'http://localhost:8123';
$HISTORY['types'] = ['uint','dbl','str','text'];
$HISTORY['storagetype']   = 'clickhouse'; 

$HISTORY['tablename']   = 'zabbix.history_buffer';


Then you can use SQL to check whether data is actually arriving at the clickhouse server, by doing a count() or selecting the latest data:

select count(*) from zabbix.history_buffer;
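or, to eyeball the newest rows, something like this (an example query, adjust to taste):

select itemid, clock, value, value_dbl from zabbix.history_buffer order by clock desc limit 10;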

Make sure you query the history_buffer table to see the latest data.
To make ZABBIX support the clickhouse storage engine, download and apply the clickhouse patch, or download the complete sources with the patches already applied.

No log history support, sorry folks, I've never needed it.

So, that's it. Until you get really high rates and volumes of data, clickhouse works just fine in its default configuration.


Sunday, July 22, 2018

ZABBIX: fping vs nmap replacement and "distributed" pinging

Eventually, at some point, I decided that nmap would be better than fping in terms of speed.

A test on a file of 3.5k hosts showed that nmap finishes the job about 2.5 times faster: 40 seconds versus 100 seconds.
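For reference, the comparison was along these lines (my reconstruction; the exact options may have differed):

time fping -f hosts.txt > /dev/null
time nmap -sn -n -iL hosts.txt > /dev/null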

I made the changes in /lib/icmpping.c, so perhaps it will work for the proxy as well.

The only shortcoming of nmap is that it doesn't actually do a loss check, only an accessibility check with round-trip time calculation.

I haven't found a way to tell nmap to send 10 packets and report the loss percentage as well as min/max/avg statistics. This was a dead end for some time, but then the idea of "time-distributed" monitoring came up.

The idea: instead of sending N packets in a row within quite a short time span during the test period P, it's better to send 1 packet every P/N seconds and have the pings distributed evenly, then calculate the loss in triggers.

The idea was crazy with the old system: its limit was 120 seconds, and that was distributed over 10 proxies. But with the new enhancements in code and processing, the system on its own achieved less than a 1.5-second interval between checks of a single host.

That's quite remarkable. In total it's about 28k NVPS.

The checks are set up to test host accessibility once every 10 seconds, which is what is done right now, and triggers are set up to react to 3 consecutive ping losses. So, in the worst case, the reaction time is 40 seconds. That's OK for most outages and reliable enough not to fire on some sudden packet loss.
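An illustrative trigger for that setup (the host and item key here are placeholders, not our real ones): {host:icmpping.count(#3,0)}=3, i.e. fire when all of the last 3 checks returned 0 (lost).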


The last thing to note: I left the system the possibility of using fping, and it's quite simple: for simple ICMP checks, if there is 1 packet in the icmpping parameters then nmap is used; if there are more, then fping is used. This can potentially be misleading for administrators, and they might forget about it. For our production system the fping option is not used.