Wednesday, May 23, 2018

ZABBIX: the FPING problem

Zabbix does its accessibility checks with the fping utility.

First of all, if you use domain names for host checks, make sure at a large volume that DNS is cached, preferably on the same machine. It can be a local DNS cache, or you can simply put all the hosts into /etc/hosts, as I did. Otherwise fping will spend 2-10 ms resolving each name, and it does it host by host, synchronously, so resolving will probably take even longer than the accessibility checks themselves. Or use nscd, which does the job pretty well, so you don't have to deal with manual "hosts" caching.
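For example, a couple of /etc/hosts entries (the names and addresses here are made up, of course) are enough to take DNS out of the loop entirely:

```
10.0.1.1    sw-access-0001
10.0.1.2    sw-access-0002
```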


On my test machine, adding all 36k access-level switches caused global Zabbix stagnation, and the check queues grew to more than 10 minutes.

 


Doing some profiling showed the first obvious reason: the system didn't have enough pinger processes.

OK, let's add 100 of them...
And now there is a CPU problem. Juggling 100 pinger processes isn't easy on the CPUs.


So, what's going on?
nmap scans all 36k switches in a little under 12 seconds:


Starting Nmap 7.60 ( https://nmap.org ) at 2018-04-22 09:10 +05
Nmap done: 36773 IP addresses (35534 hosts up) scanned in 11.63 seconds



and in the worst-case scenario (no hosts up) it's 20 seconds:

Starting Nmap 7.60 ( https://nmap.org ) at 2018-04-22 09:36 +05
Nmap done: 36773 IP addresses (0 hosts up) scanned in 19.76 seconds
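For the record, that was a plain ICMP ping sweep, roughly like this (the target range is an example, not the real one):

```
# host discovery only (-sn), ICMP echo probes (-PE), no reverse DNS (-n)
nmap -sn -n -PE 10.0.0.0/16
```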


According to the documentation, fping does exactly the same: it fires all the ICMP packets at once and then waits for responses. But for some reason it takes far too long.

To figure out why, I left only one accessibility-check pinger and did a traffic dump, which showed a very interesting picture: there is a 10 ms inter-packet delay between the ICMP requests fping sends (RTFM — it's actually mentioned in the Zabbix manual).


That is not good. Pinging 1000 devices will take 10 ms × 1000 devices × 3 attempts = 30000 ms = 30 seconds, and we need to check 36k devices. Sure, threading will help, but would we really want to waste half a minute?

note: older versions of fping used a 20 ms delay. I found a claim somewhere on the net that the reason was not to lose packets on 100 Mbit links, which is kind of odd nowadays.


It's actually defined in fping.c and can be hard-coded or changed via configure options:

 /* maxima and minima */
#ifdef FPING_SAFE_LIMITS
#define MIN_INTERVAL 1 /* in millisec */
#define MIN_PERHOST_INTERVAL 10 /* in millisec */
#else
#define MIN_INTERVAL 0
#define MIN_PERHOST_INTERVAL 0
#endif
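For the record, the rebuild is roughly this (a sketch: the exact configure switch name varies between fping releases, so check ./configure --help first; alternatively just zero both minima in fping.c):

```
# build fping with the safe limits compiled out
./configure --disable-safe-limits
make && sudo make install
```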

So, after recompiling fping without any delays, I put the new version into production, and things got much better:



And... something told me I should double-check everything.

So it didn't take long to find that Zabbix can ping only 128 hosts each time it opens a pipe to fping. Not really much. Considering that each invocation also spends resources on system calls, which could be optimized as well, I did some source-code digging and found that the limit is hard-coded:

#define MAX_JAVA_ITEMS      32
#define MAX_SNMP_ITEMS      128
#define MAX_POLLER_ITEMS    128   /* MAX(MAX_JAVA_ITEMS, MAX_SNMP_ITEMS) */
#define MAX_PINGER_ITEMS    128



I tried raising MAX_POLLER_ITEMS along with MAX_SNMP_ITEMS to 4k, but that led to a zabbix_server crash on the first attempt to call a pinger thread.

Some debugging showed the reason was memory allocation (it's almost the same issue I had with ng_netgraph 6 years ago).

But that will be a different story.

Anyway, even just fixing fping brings a significant reduction in system load (I measured about 30% of CPU attributable to fping), as you get much faster ping processing and less threading.

On ~500 hosts the fixed fping finishes 3 times faster (-C5 -t1000) than the original one.

