Saturday, September 1, 2018

ZABBIX: the story

Long, long ago, perhaps two months ago, I started a one-week project that now looks like it will be finished only by this fall....


So, ZABBIX.

At work we use it to monitor all the nice infrastructure we have. It so happened that the monitoring service kept growing but had no true owner for the last four years. So I decided to get my hands on it.

There were several reasons:
  • technical problems (hangs, slow speed, short history period)
  • it was quite outdated

but the primary ones were:
  • it's fun
  • it's new knowledge
  • because "I can"
  • it's a real task that brings a lot of value to the business

So, before I started doing anything, there was a time for thinking, circling around, drawing schemes, learning the new versions of ZABBIX and getting access to the production monitoring servers. That led me to the understanding that monitoring consists of two major parts: real-time and analytical.

At the end of May 2018, at an internal hackathon, I completed the ClickHouse history module. I would call this the most important change and a "must have" for everyone who doesn't use Elasticsearch. But even those who do, please read the comparison. Overall, ClickHouse works very well as a historical storage. It is quite easy to set up and it makes a big difference.

To make ZABBIX support the ClickHouse storage, you have to be ready to compile ZABBIX from sources with patches. ClickHouse must also be set up along with the initial data schema. Look here to see how to do it.

But the story didn't end there. After completing the history storage part, some other problems came into view. One was the number of servers we used for monitoring. We had twenty-one (!), fifteen of which were proxies to keep up with the load.

A long and winding road led me to optimize that significantly. I was under the impression that most problems came from the constant forking of fping processes. After some research and efforts to reduce the number of forks per second (fps :) ), I built the nmap solution along with the "time-distributed pinging" idea. Overall, that made it possible to do 30-40 times fewer forks while doing checks three times faster and spending about half the CPU ticks on accessibility checks.
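The "time-distributed pinging" idea can be sketched roughly like this. It is only an illustration under my own assumptions, not the actual patch: hosts are spread over time slots by a hash of their address, and each slot's batch would then be handed to a single scanner process, so there is one fork per batch instead of one per host. The names build_schedule and slot_start_times are hypothetical.

```python
import zlib


def build_schedule(hosts, slots):
    """Spread hosts over `slots` batches using a stable hash of the address.

    Assumption: one batch is later checked by one scanner process,
    so the number of forks per interval equals `slots`, not len(hosts).
    """
    schedule = [[] for _ in range(slots)]
    for host in hosts:
        # crc32 is deterministic across runs, unlike the built-in hash() for str
        slot = zlib.crc32(host.encode("utf-8")) % slots
        schedule[slot].append(host)
    return schedule


def slot_start_times(interval, slots):
    """Start offset (in seconds) of every slot inside one check interval."""
    return [i * interval / slots for i in range(slots)]
```

With a 60 second interval and 4 slots the batches would start at 0, 15, 30 and 45 seconds, so the load is smoothed over the interval instead of arriving in one burst.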


Then came SNMP querying. Compared to accessibility checks, we have ten times more items to poll. There are almost 3.5 million items now. And polling them all in a reasonable time frame couldn't be done with only one server, even with 1000 threads.

I had previously done several projects with mass SNMP polling in Perl and Python, and had tried AsyncIO and plain threading, so I already knew that asynchronous polling would do the trick. A description of the asynchronous polling idea is in this article.
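To give a feeling for why asynchronous polling scales better than a thread per request, here is a minimal asyncio sketch. It is not the ZABBIX code (that is written in C): fake_snmp_get is a hypothetical stand-in that only sleeps instead of sending a real SNMP request, and the limits are made-up numbers.

```python
import asyncio


async def fake_snmp_get(host, oid):
    """Hypothetical stand-in for a real SNMP GET; it only simulates latency."""
    await asyncio.sleep(0.01)
    return f"{host}:{oid}"


async def poll_all(targets, max_in_flight=100, timeout=1.0):
    """Poll all (host, oid) pairs from one thread, many requests in flight."""
    sem = asyncio.Semaphore(max_in_flight)  # cap concurrent requests

    async def poll_one(host, oid):
        async with sem:
            try:
                return await asyncio.wait_for(fake_snmp_get(host, oid), timeout)
            except asyncio.TimeoutError:
                return None  # a slow device does not block the others

    # gather() keeps the results in the same order as `targets`
    return await asyncio.gather(*(poll_one(h, o) for h, o in targets))
```

A single thread keeps up to max_in_flight requests waiting on the network at the same time, so the time is spent waiting for devices and not on switching between a thousand threads.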

To implement mass polling, many things had to be done in the ZABBIX server beforehand. ZABBIX is somewhat ready for batch item processing. But there are many places in the code where processing is done one by one, quite inefficiently. I named this part of the job "BULKiing".

Once batch processing of items was completed, it also became apparent that there are two major architectural bottlenecks. One is the preprocessor manager with its "queuing problem" and the fact that all items are processed by a single thread; the other is the global locking of the configuration. Both problems lead to a situation where systems with slow CPU cores do not perform well running ZABBIX under high load. Such systems show low NVPS processing numbers while most CPUs are 80-90% idle, waiting for internal locks.

Both bottlenecks are limited by single CPU core speed. To keep the preprocessor manager from getting stuck under high load, I changed the way it processes data so that it can prioritize tasks and avoid queuing. Now it can decide what is better to do: get new results from pollers or send results to workers. Details are here.
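The kind of decision the manager makes can be illustrated with a toy policy. This is a guess at the general shape, not the real logic from the patch: the function name, the thresholds and the exact priority order are all assumptions for illustration.

```python
def choose_action(queued, idle_workers, pending_input, max_queue):
    """Pick the next step for a single-threaded manager loop.

    queued        - results already accepted and waiting for a worker
    idle_workers  - workers that can take a task right now
    pending_input - new results offered by pollers
    max_queue     - assumed cap on the internal queue length
    """
    # Draining the queue first keeps it short and the workers busy.
    if queued > 0 and idle_workers > 0:
        return "dispatch"
    # Accept new results only while there is room for them.
    if pending_input > 0 and queued < max_queue:
        return "receive"
    # Nothing useful to do: workers are busy and the queue is full or empty.
    return "wait"
```

For example, choose_action(5, 2, 10, 100) picks "dispatch", while choose_action(0, 2, 10, 100) picks "receive".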

To solve "global locking", I tried enlarging the batches of items to poll from 8k to 32k, which gave about a 10k NVPS performance increase and allowed us to marginally exceed the initial goal of 50k NVPS. But after we finally put the modified version of ZABBIX into production, the problem with slow cores and global locking appeared again.
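A rough way to see why bigger batches ease the lock contention, assuming that the configuration lock is taken once per batch and that 8k and 32k mean 8,000 and 32,000 items (both are assumptions for this sketch, not measured facts):

```python
import math


def lock_acquisitions(total_items, batch_size):
    """How many times the lock is taken to walk through all items once,
    assuming exactly one acquisition per batch."""
    return math.ceil(total_items / batch_size)


small = lock_acquisitions(3_500_000, 8_000)   # 438 acquisitions per pass
large = lock_acquisitions(3_500_000, 32_000)  # 110 acquisitions per pass
```

Under these assumptions one pass over 3.5 million items needs about four times fewer trips through the global lock, which is the single-core-bound part of the work.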

Even though it could be solved with a faster CPU, I decided to solve this one too. This would make it possible to fully utilize older server hardware with slower CPUs but a large number of cores. Details are here.

And then, to resolve another "bottleneck" problem, I changed the architecture so that more than one manager process handles the data. Overall, together with all the previous fixes, this raised total processing capacity from 5k to 88k NVPS on the same hardware, additionally eliminating the need for the proxies we had put in for load sharing.

The same code showed 118k NVPS on a test machine with faster cores.

Lastly, I will be doing some UI speed work, because the UI is not functional under stress, and even in normal conditions it takes 5-10 seconds to refresh panels.

And to make things complete, there are several short notes which are more about mood and fun and strange situations, architecture questions, lack of knowledge,  coincidences, links to sources, some work records, and so on.

nb: All the numbers are given for our installation of 47k hosts/4M items. I assume many of the problems are unlikely to appear on installations with fewer objects.

nb2: The primary reason I am writing this is to be able to catch a glimpse of the engineering mood and fun later. I accidentally found my 6-year-old records from the time we were launching a new BRAS type. Reading them was kind of a joy, reliving that time of being consumed with engineering, ideas, solutions, sleepless nights. So I am writing now so that I'll have something to read in 2024.


