Wednesday, May 30, 2018

ZABBIX: 50k NVPS, the final test results and the new bottleneck

Two new bottlenecks appeared.

The first one: the 100 history writers seem to be slow at putting data into clickhouse. But that is not certain, because there is problem number two:

The problem: I could only squeeze 35k NVPS out of the server.
That is still remarkable, considering our current need is 15k NVPS at most, and the existing old system barely does 5k NVPS.

BUT - the server is only 20-30% utilized, so why aren't we doing at least 50k?

Playing with the number of threads doesn't help: more threads means more time spent in each iteration.

top shows they are sleeping most of the time, waiting for things to happen, which is good.

OK, let's ask strace - strace, what are they waiting for?



Interesting, it's semop.

OK, strange that it's neither SNMP nor MySQL handling. The sources show it's mutex locking used to access the pool of tasks and to return them after processing.

This is quite bad news - it means the system is interlocked due to internal architecture limitations. And architecture is usually something that is very hard to change, since everything around it is built to work with that architecture.

What most likely happens is the following: the procedure that finds the next tasks to be polled requires some processing time, so I assume it imposes a limit which is now lower than the polling capacity of the rest of the system.

Each poller spends most of its time waiting for its chance to lock the mutex and call DCconfig_get_poller_items.

So there is a processing limit: how fast a single CPU can extract items to be polled, plus some overhead for init procedures and for locking and unlocking the mutexes.
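
To see why the batch size matters, here is a tiny toy model of that contention - my own illustration, not Zabbix code, with made-up thread and item counts: a hundred "pollers" pull work from a shared counter guarded by a single mutex, and the bigger the batch taken per lock, the fewer lock round-trips each of them needs.

#include <pthread.h>
#include <stdio.h>

#define TOTAL_ITEMS 1000000L
#define POLLERS     100
#define BATCH       128L             /* try 128 vs 8192 vs 32768 */

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;
static long next_item = 0;

/* stand-in for DCconfig_get_poller_items(): take up to BATCH items under the lock */
static long get_poller_items(void)
{
    pthread_mutex_lock(&cfg_lock);
    long n = TOTAL_ITEMS - next_item;
    if (n > BATCH)
        n = BATCH;
    next_item += n;
    pthread_mutex_unlock(&cfg_lock);
    return n;
}

static void *poller(void *arg)
{
    (void)arg;
    long n;
    while (0 < (n = get_poller_items()))
        ;                            /* a real poller would do SNMP/agent checks here */
    return NULL;
}

int main(void)
{
    pthread_t t[POLLERS];
    for (int i = 0; i < POLLERS; i++)
        pthread_create(&t[i], NULL, poller, NULL);
    for (int i = 0; i < POLLERS; i++)
        pthread_join(t[i], NULL);
    printf("lock taken ~%ld times\n", (TOTAL_ITEMS + BATCH - 1) / BATCH);
    return 0;
}

Compile with gcc -O2 -pthread; raising BATCH there is the moral equivalent of the 8k to 32k change below.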

Let's try increasing the number of items taken per poll 4 times, to reduce the number of inits and locks: raise 8k to 32k.
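
In my already patched sources that is just bumping the batch-size defines; the stock value is only 128 (see the fping post below), and showing the change on exactly these macros is my assumption about where the earlier patch lives:

#define MAX_SNMP_ITEMS      32768    /* was 8192 in my patched build, 128 in stock Zabbix */
#define MAX_POLLER_ITEMS    32768    /* MAX(MAX_JAVA_ITEMS, MAX_SNMP_ITEMS) */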

I was pessimistic about whether this would help, but if it did, and ZABBIX could achieve 50k NVPS, it would be a good point to stop optimizing.

And, as often happens, I was wrong - it helped a lot:

And the queue status is:

About 20% of hosts have SNMP intentionally blocked at the firewall level, to force timeouts and make ZABBIX's life more difficult. The current processing returns CONFIG_ERROR for SNMP items of inaccessible hosts, which makes ZABBIX think there is a problem. I am thinking of replacing this with TIMEOUT, which would mark the host as inaccessible, but that's a separate story.

Let's check a random device, random port:



It's nice - an almost exact 75-second interval.

And yes, there is a 5-25 minute gap between the moment you send data to clickhouse and the moment you see it, due to the clickhouse cluster merging interval.

This is eliminated if there is only one node and you request the data from the buffer, but in our case there are 4 nodes, so usually the most recent data is not visible yet.

And this is clickhouse monitoring of the 40k NVPS and 50k NVPS tests, showing very little increase in processing connections (don't look at the peaks - that's the ValueCache filling after a restart).




For the latter I had to increase the number of dbSyncers to 200, as otherwise they couldn't handle the initial ValueCache fill in parallel with sending data. After reaching its safe queue size, the preprocessing manager stopped processing data from the request sockets, slowing down data polling.

The test shows that 50k NVPS is achievable on this machine. But while monitoring the situation for an hour, I saw occasional queuing and processing delays of 200-300k items for 5 or 10 seconds.

I would say 50k is a marginal, non-production number (until you get a better server, or perhaps fewer items and less CPU time spent on retrieving tasks). To go really beyond that you need a machine with higher per-core speed; how many cores you get is not that critical as long as you have more than two.

Apart from the locks, the next problem will be the preprocessing manager thread. And this is the REAL problem number two. In my installation it was showing the following statistics:

 preprocessing manager #1 [queued 0, processed 192824 values, idle 3.223292 sec during 5.662645 sec]

A rough approximation and a 20% safety margin give a figure of about 80k NVPS at which it would run with no idle time (192824 values in 5.66 - 3.22 = 2.44 seconds of busy time is roughly 79k values per second).

As a final conclusion, I'd say the setup is capable of handling up to 40k NVPS in production, and that is about 3 times more than we need now, so it's just fine for production and a good point to stop optimizing.

NB:
However, it would be interesting to play with the architecture and see what could be done there. Giving more functionality to the poller threads and reducing communication with the central storage is what I would do. Perhaps pollers could take tasks for execution for the next 3-4 minutes instead of a single check, schedule the checks on their own, and submit several results at a time to the preprocessing manager. The pollers would still have to return the tasks afterwards to sync with the new configuration. Perhaps the same is achievable with several local proxies, but that is out of scope for now.
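
A rough, speculative sketch of what I mean - toy code of my own, not a Zabbix patch, with made-up names like take_tasks_until() and return_tasks() standing in for the config cache calls:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define HORIZON_SEC 180                 /* take work for the next ~3 minutes */
#define MAX_BATCH   4096

struct task { int itemid; time_t next_check; double value; };

/* dummy stand-ins for the config cache / IPC; the real ones would take the lock once */
static int take_tasks_until(time_t deadline, struct task *buf, int cap)
{
    (void)deadline; (void)cap;
    buf[0].itemid = 1;
    buf[0].next_check = time(NULL) + 1;
    return 1;                           /* pretend one check is due within the horizon */
}
static void submit_results_bulk(const struct task *buf, int n) { (void)buf; printf("sent %d results\n", n); }
static void return_tasks(const struct task *buf, int n) { (void)buf; (void)n; }

static void poller_iteration(void)
{
    struct task tasks[MAX_BATCH];
    int n = take_tasks_until(time(NULL) + HORIZON_SEC, tasks, MAX_BATCH);   /* one lock */

    for (int i = 0; i < n; i++) {
        time_t now = time(NULL);
        if (tasks[i].next_check > now)
            sleep((unsigned int)(tasks[i].next_check - now));   /* schedule locally */
        tasks[i].value = 42.0;          /* ...the real SNMP/agent check would go here... */
    }

    submit_results_bulk(tasks, n);      /* several results per message, not one by one */
    return_tasks(tasks, n);             /* hand tasks back to pick up config changes */
}

int main(void) { poller_iteration(); return 0; }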

Wednesday, May 23, 2018

ZABBIX: the FPING problem

Zabbix does its accessibility checks with the fping utility.

First of all, if you use domain names for host checks, make sure on a large volume that DNS is cached, preferably on the same machine. It can be a local DNS cache, or - as I did - all hosts put into the /etc/hosts file. Otherwise fping will spend 2-10 ms resolving each name, and it does so host by host, synchronously, so resolving will probably take even longer than the accessibility checks themselves. Or use nscd, which does the job pretty well, so you don't have to deal with manual "hosts" caching.
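
A quick way to check whether resolution is actually cached - an illustrative snippet of mine, nothing from Zabbix or fping, and the hostname in the example is made up: a lookup served from /etc/hosts or a local cache comes back in well under a millisecond, an uncached one in the 2-10 ms range mentioned above.

/* gcc -O2 resolve_time.c -o resolve_time && ./resolve_time sw-access-0001.example.net */
#include <stdio.h>
#include <time.h>
#include <netdb.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return fprintf(stderr, "usage: %s hostname\n", argv[0]), 1;

    struct addrinfo hints = { 0 }, *res = NULL;
    hints.ai_family = AF_UNSPEC;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = getaddrinfo(argv[1], NULL, &hints, &res);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("getaddrinfo(%s): %s, %.2f ms\n", argv[1], rc ? gai_strerror(rc) : "ok", ms);

    if (res != NULL)
        freeaddrinfo(res);
    return 0;
}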


On my test machine, adding all 36k access-level switches caused global zabbix stagnation, and the check queues grew to more than 10 minutes.

 


Some profiling showed the first obvious reason: the system doesn't have enough pinger processes.

OK, let's add 100 of them...
And now there is a CPU problem. Managing 100 threads isn't easy on the CPUs.


So, what's up?
nmap scans all 36k switches in a little less than 12 seconds:


Starting Nmap 7.60 ( https://nmap.org ) at 2018-04-22 09:10 +05
Nmap done: 36773 IP addresses (35534 hosts up) scanned in 11.63 seconds



and in the worst-case scenario (no hosts are available) it's about 20 seconds:

Starting Nmap 7.60 ( https://nmap.org ) at 2018-04-22 09:36 +05
Nmap done: 36773 IP addresses (0 hosts up) scanned in 19.76 seconds


According to the documentation, fping does exactly the same - it fires all the ICMP packets at once and then waits for responses. But for some reason it takes too long.

To figure it out, I left only one accessibility-check pinger and did a traffic dump, which showed a very interesting picture: there is a 10 ms inter-packet delay between the ICMP requests fping sends (RTFM - it's actually written in the ZABBIX manual).


That is not good. Pinging 1000 devices will take 10 ms * 1000 devices * 3 attempts = 30000 ms = 30 seconds, and we need to check 36k devices. Sure, threading will help, but do we want to waste half a minute?

Note: the old version of fping uses a 20 ms delay. I found somewhere on the net that the reason is to avoid losing packets on 100 Mbit links, which is kind of odd nowadays.


It's actually defined in fping.c and may be hard-coded or changed via configure options:

 /* maxima and minima */
#ifdef FPING_SAFE_LIMITS
#define MIN_INTERVAL 1 /* in millisec */
#define MIN_PERHOST_INTERVAL 10 /* in millisec */
#else
#define MIN_INTERVAL 0
#define MIN_PERHOST_INTERVAL 0
#endif

So, after recompiling fping without any delays, I put the new version into production and things got much better:



And... something told me I should check everything.

It didn't take long to find that zabbix can ping only 128 hosts each time it opens a pipe to fping. Not really much. Considering that spawning fping costs some system calls and resources, and that this might also be optimized, I did some source code research and found that the limit is hard-coded:

#define MAX_JAVA_ITEMS      32
#define MAX_SNMP_ITEMS      128
#define MAX_POLLER_ITEMS    128     /* MAX(MAX_JAVA_ITEMS, MAX_SNMP_ITEMS) */
#define MAX_PINGER_ITEMS    128
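
To put that limit in perspective, here is a toy illustration of mine (not the actual Zabbix pinger code, and the target addresses are just examples): one fping run handles one batch of targets fed over a pipe, so a 128-item cap means roughly 300 fping invocations, fork/exec included, to cover 36k hosts.

#include <stdio.h>

#define BATCH_SIZE 128                  /* MAX_PINGER_ITEMS */

/* feed one batch of targets to a single fping run;
 * fping reads targets from stdin when none are given on the command line */
static int ping_batch(char **hosts, int n)
{
    FILE *fp = popen("fping -C3 -t1000 2>&1", "w");
    if (NULL == fp)
        return -1;
    for (int i = 0; i < n && i < BATCH_SIZE; i++)
        fprintf(fp, "%s\n", hosts[i]);
    return pclose(fp);                  /* fping's own output goes to our stdout */
}

int main(void)
{
    char *hosts[] = { "192.0.2.1", "192.0.2.2", "192.0.2.3" };  /* example targets */
    return ping_batch(hosts, 3) < 0;
}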



I tried changing MAX_POLLER_ITEMS along with MAX_SNMP_ITEMS to 4k, but that led to a zabbix_server crash on the first attempt to run the pinger thread.

Some debugging showed the reason was memory allocation (almost the same issue I had with ng_netgraph 6 years ago).

But that will be a different story.

Anyway, even just fixing fping significantly reduces the system load (I measured about 30% of CPU related to fping), as you get much faster ping processing and less threading.

On ~500 hosts the fixed fping finishes 3 times faster (-C5 -t1000) than the original one.


Tuesday, May 1, 2018