Tuesday, June 19, 2018

ZABBIX: time wasters

During all these Zabbix optimizations there were several situations that showed me what the cost of experience really is: lots of time wasted on nothing.

First one: when I was writing the Clickhouse history processing, I found out that the Zabbix history processes were constantly busy requesting data from Clickhouse.

My initial thought was that it was the ValueCache being filled. But it never stopped, even after an hour of running.

Measurements showed that in the same time span a Zabbix installation working on a MySQL DB did about 600 requests to history. When Zabbix was switched to Clickhouse, it did almost 16k requests. It would have done more, but CPU was the limiting factor.

Both the MySQL and Clickhouse history tables were empty.

The problem was in the Clickhouse response handling. When no data was found, both DB back-ends returned no data, but unlike the MySQL driver, I had coded mine to return a FAIL status on the absence of data. It might have been a typo or it might have been a decision, I don't remember now.

The Zabbix history library treats FAIL as a DB failure, so it retries the same request endlessly.
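A minimal sketch of the bug, with hypothetical names (the real Zabbix history driver API is more involved):

```c
#include <assert.h>

#define SUCCEED  0
#define FAIL    -1

/* hypothetical fetch: returns the number of rows, or -1 on a transport error */
static int ch_fetch_rows(int rows_in_table, int transport_error)
{
    return transport_error ? -1 : rows_in_table;
}

/* buggy variant: an empty result is mapped to FAIL, so the history
 * library treats it as a DB failure and retries the request forever */
static int read_history_buggy(int rows, int err)
{
    return (ch_fetch_rows(rows, err) <= 0) ? FAIL : SUCCEED;
}

/* fixed variant: only a real error is FAIL; zero rows is a valid answer */
static int read_history_fixed(int rows, int err)
{
    return (ch_fetch_rows(rows, err) < 0) ? FAIL : SUCCEED;
}
```

With empty history tables, the buggy variant returns FAIL on every request, which is exactly the endless-retry pattern described above.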

That's it. Two or three days of hassle.


The second one:
This one spanned two or three weeks. It started when my development laptop couldn't handle more than 1k NVPS. Beyond that, I saw preprocessing queue growth and delays in item processing.

At that point I decided to switch to some real server hardware. Then some other things took time: fixing the API scripts that manage objects (there are more than 50k of them), refactoring code, Git, and so on.

So, yesterday, when I put the refactored asynchronous SNMP lib on the production server, I saw queuing again after about 5k NVPS.

Half a day, the evening, and the early morning gave no result: above 3-5k NVPS, preprocessing just couldn't handle it all.

Looking through the code and the logs didn't give any results. I could only see that constant item queuing was consuming 100% of the preprocessing manager thread's time.

At some point I decided to add extra time precision to the logging, as it wasn't clear what was taking most of the time. Zabbix debug logging trims timestamps to milliseconds.

I fixed log.c, and it showed an interesting picture: almost all logged operations take about 0.2 milliseconds (150-210 microseconds, to be precise).
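A sketch of that kind of change, assuming a gettimeofday-based timestamp (the real log.c formatting code differs):

```c
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* format the current time with microsecond precision, e.g. "12:34:56.789012";
 * stock Zabbix debug logging only prints down to milliseconds */
static void log_timestamp(char *buf, size_t len)
{
    struct timeval tv;
    struct tm tm;
    size_t n;

    gettimeofday(&tv, NULL);
    localtime_r(&tv.tv_sec, &tm);

    n = strftime(buf, len, "%H:%M:%S", &tm);
    snprintf(buf + n, len - n, ".%06ld", (long)tv.tv_usec);
}
```

With six digits after the decimal point, two log lines 150-210 microseconds apart no longer collapse into the same millisecond.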

Which means that something slow happens every time the server logs a line. Well, that's the answer: it's the logging itself that slows everything down.

Switching logging to anything less than level 4 (DEBUG) eliminates all the queuing and the slow SNMP processing completely.
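The reason lowering the level helps so dramatically is that a disabled debug line costs almost nothing: the level check runs before any formatting or I/O. A generic sketch (not the actual Zabbix macro; the counter stands in for the real write):

```c
#include <assert.h>

#define LOG_LEVEL_DEBUG 4

static int log_level = 3;      /* runtime log level; below 4 disables debug */
static int lines_emitted = 0;  /* stands in for the actual write to the log */

/* the guard runs before any argument formatting or file I/O, so a disabled
 * debug line costs only one integer comparison */
#define dbg_log(...)                                                          \
    do {                                                                      \
        if (log_level >= LOG_LEVEL_DEBUG)                                     \
            lines_emitted++; /* real code formats and writes the line here */ \
    } while (0)
```

At level 4, every one of those guards falls through into formatting and a synchronous write, and with thousands of values per second that adds up to the 150-210 microseconds per line seen above.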

The fun thing is that I have known this for more than 10 years.

Debug logging has always been expensive in terms of resources.

Especially for network hardware: switching logging on moves a lot of processing from the data plane to the control plane, and it typically ends with losing control of the device.

And even for servers, where you have plenty of CPU, it is something that should be used with care and an understanding of the probable outcome.

NB: several days after that I saw preprocessing manager queuing again, so I'll write about that separately later.


ZABBIX: asynchronous SNMP polling



After fixing the FPING problem, I switched on the SNMP polling threads, and it turned out to be an even bigger problem than the accessibility checks: for each device we do only 5 pings but poll 122 items via SNMP.

Switching on SNMP polling on the test server immediately led to SNMP poller queuing. To keep a healthy balance, I left only 500 devices in the test monitoring installation.

SNMP rate measurements, traffic dumps, and code investigation showed a few facts:
  • SNMP polling is synchronous in Zabbix
  • the SNMP poller is very feature-rich
  • it looks like the Zabbix guys have had a lot of experience with old and new devices and packet sizes
  • they also process auto-discovered hosts differently from manually added ones.
So I spent a few days and wrote an asynchronous SNMP module, based on the original Zabbix SNMP code and the async SNMP example from the net-snmp docs. Additionally, I had to make a few changes in the following modules:
  • in the DC library, which fetches SNMP tasks for the poller, I removed a few conditions that limited fetch results to one host only, and another condition that did some response size calculation and could also reduce the number of selected items
  • I had to change the memory allocation for the ITEMs arrays, which were static before the modification. Since these array sizes went up from 128 items to 4096, they probably couldn't fit on the stack anymore.
  • I was lucky that the DC library returns ITEMs for checking ordered by host. I assume it's better to request one item per host at a time, so having the resulting ITEMs array grouped by host is very handy for processing.
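The one-item-per-host idea above can be sketched as a scheduling pass over a host-ordered array (the types and names here are hypothetical, not the real Zabbix DC structures):

```c
#include <assert.h>
#include <string.h>

/* hypothetical item: the real Zabbix item structures carry much more state */
typedef struct {
    const char *host;
    const char *oid;
    int         sent;
} item_t;

/* one scheduling pass: pick at most one not-yet-sent item per host;
 * relies on 'items' being ordered by host, as the DC fetch returns them */
static int schedule_pass(item_t *items, int n, int *picked, int max)
{
    const char *last_host = NULL;
    int count = 0;

    for (int i = 0; i < n && count < max; i++) {
        if (items[i].sent)
            continue;
        if (NULL != last_host && 0 == strcmp(items[i].host, last_host))
            continue;   /* this host already got an item in this pass */
        items[i].sent = 1;
        picked[count++] = i;
        last_host = items[i].host;
    }
    return count;
}
```

Because the array is grouped by host, one linear scan per pass is enough; each pass sends one request to every host that still has pending items, spreading the load instead of hammering a single device.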
Now, a few problems I ran into:
  • I had to learn to write proper C code and read up on the libraries; a lot of time was spent not understanding how net-snmp handles PDUs. There were perhaps another 20 coding problems, which were easy to catch.
  • When a lot of async SNMP pollers are working, and under some other circumstances, the preprocessing manager process reports (via its process title) a huge task backlog, and there are significant delays in the data, visible on graphs (later I learned that this is a separate problem).

What further optimizations could be made:
  • don't request items from a host if the previous N items have failed (this could be done either in fping or, better, in the check logic)
  • sometimes I get a process crash when adding/removing a bunch of devices; I need to figure that out (not true anymore, I haven't had a crash in the poller since)
  • actually, I need to compare the "internal" change with an outside script uploading SNMP results via the trapper interface. Maybe it's feasible to do it outside, given the amount of code modifications and possible problems, including bugs and upgrade issues.


So far the results are inspiring: 20-30 times faster than before. In the same poll time, throughput increased from 200 values to 4k-6k values at no extra CPU cost.

ZABBIX: the preprocessing manager



All went fine until asynchronous SNMP processing was put on a test server with about 10k hosts to poll.

At the point when the server was polling about 5k SNMP values per second, the pre-processing manager queue started to grow.

At the same time I saw a significant CPU increase in the SNMP polling processes, and the timings became much worse.

I tried playing a bit with the number of threads doing polling, preprocessing, and DB syncing, and realized the following:
  • the problem doesn't appear when I have only one or two SNMP pollers; I believe the reason is the fact that two pollers can hardly reach 5k SNMP NVPS
  • it is the same problem I saw on my laptop; at that moment I thought slow CPU was the reason
  • it doesn't depend on any other threads or their count
  • it doesn't depend on history database read speeds; switching reading off doesn't change anything
  • the more values the system is able to gather, the worse it gets: starting 6-10 SNMP pollers leads to fast preprocessing manager queue growth, and the bigger the queue gets, the fewer items get processed
  • after 100k items in the queue, Zabbix processes only 0-200 items every 5 seconds
  • typically, after 2-60 minutes, Zabbix crashes with SIGSEGV (signal 11) in the preprocessing manager
The picture: a 1M+ item queue, pollers consuming up to 15% CPU and working 3-10 times longer than normal.

Overall, it seems I have two alternative ways to solve this:
  • figure out what the problem is and fix it
  • downgrade to the 3.0 line (I'll probably have to rewrite some fixes, since the preprocessing manager was introduced only in 3.4; all the API scripts might also need to be fixed)

Sure, I'll try the first one first. The nice thing is that the problem is quite easy to reproduce, and there is helpful debug output.

Update: the second way doesn't fit, as 3.0 doesn't have the history interface and the possibility of offloading history data to BigData storage.

Update: the issue is fixed; it was another time waster caused by a lack of experience.