Tuesday, June 19, 2018

ZABBIX: time wasters

During all this Zabbix optimizations there was several situations where I could understand what is the cost of experience: it's wasting lots of time on nothing.

First one: when i was writing the Clickhouse history processing I've found out that Zabbix history process were constantly busy with requesting data from the Clickhouse. 

My initial thought was that it's a ValueCache filling. But it's never had stopped even after one hour run.

Measurements shown that in the same time span installation of Zabbix  working on Mysql DB did about 600 requests to history. When the Zabbix was switched to Clickhouse, it did almost 16k requests. It would have done more, but CPU was the  limiting factor. 

Both Mysql and Clickhouse history tables where empty. 

The problem was in the Clickhouse response handling. When no data has been found, both DB back-end were returning no data, but unlike it's done in Mysql driver, i coded it to return FAIL status on absence of data. It might be have been a typo or might be a decision, i don't remember now.

In Zabbix history library there is FAIL processing as DB fail. So it tries to retry the same request endlessly . 

That is it. Two or three days of hassle.


The second one:
This one spanned for two or three weeks. It's started from my development laptop couldn't handle more then 1k NVPS. After that i saw preprocessing queue growth and delays of items processing.

At that time I've decided to switch to some real server hardware. Then some other things took time, like fixing API scripts to manage objects (there are more then 50k of them), refactoring code, Git, so on.

So, yesterday when i put refactored fixed asynchronous SNMP lib to the production server, after about 5k NVPS i saw queuing.

Half a day and evening, and early morning gave no result - after 3-5k NVPS preprocessing just couldn't handle all that.

Looking through the code, logging didn't gave any results. I could see that constant items queuing were consuming 100% of the preprocessor manager thread time. 

At some point i decided to add extra time precision to logging, as it wasn't clear what takes most of the time. Zabbix debug logging trims time to milliseconds.

I've fixed log.c, and it showed interesting picture - almost all logged operations takes 0.2 milliseconds (150-210 mSec to be precise). 

Which mean that something long happens each time when the server logs a line. Well, this is the answer - it's logging itself that slows everything.

Switching logging to something less then level 4 (DEBUG) eliminates all the queuing and slow SNMP processing completely.

The fun thing  i know this for more than 10 years. 

Debug logging has always been an expensive thing  in terms of resources. 

Especially for network hardware. Switching logging on leads to moving lots of processing from data plane to control plane and typically it ends up with loosing  the device control. 

And even for servers, where you have plenty of CPU, it is something that should be used with care and understanding of probable outcome.

NB: however several days after that i saw preprocessing manager queuing again, so i'll write separately about that later


No comments:

Post a Comment