Friday, September 7, 2018

ZABBIX: the "UI meltdown prototype"


Zabbix's UI speed problems have the same root as its database problems: it tries to combine two worlds:

 - the online situation,
 - logs and statistics data from a huge OLTP database.

Combining these two gives fairly good informativeness, but it is slow due to the need to select big volumes of data.

So to make things really fast and to have monitoring that works when everything else fails (which is exactly when we need online monitoring most), we need to split them.

The "analytical" and slow part of the data - acknowledge statuses, messages, confirmations, alert data - must not be used for online reports.

The proper place for such information is the database.

Preferably not the OLTP one but a BigData store, which is already used for history purposes. But since there isn't that much of this data, it fits into the OLTP DB OK for now.

So finally I've got some time to implement the idea.


The Idea points:

Firstly, the server depends even less on the database.

Secondly, online monitoring will depend less on the database, living only in the server's memory, which has very nice implications that I'll write about later.

Thirdly, in crisis times (huge network outages), when we need fast online info the most, the database is having a hard time due to the large number of event and problem update records going into it. So we'd rather have a fast, out-of-DB way to know the current monitoring state, and analyze the events and problems later.

The coding

I have created one more queue in the configuration cache, named "problems". The queue is updated on each event when Zabbix's problem export feature is invoked; on recovery events, problems are deleted from the queue.
 
Since I needed indexing by id and no sorting at all, I decided to go with Zabbix's hashmap library, hashing on the eventid.
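The queue can be sketched like this - a standalone illustration with made-up struct fields and a toy chained hash table, not Zabbix's actual zbx_hashset_t code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative in-memory "problems" table keyed by eventid.
 * Fields and sizes are made up for the sketch. */

#define PROB_BUCKETS 1024

typedef struct problem {
    uint64_t eventid;     /* hash key */
    uint64_t triggerid;
    int      clock;       /* problem start time */
    int      severity;
    char     name[256];
    struct problem *next; /* chaining for collisions */
} problem_t;

static problem_t *buckets[PROB_BUCKETS];

static size_t prob_hash(uint64_t eventid)
{
    return (size_t)(eventid % PROB_BUCKETS);
}

/* called when a PROBLEM event is exported */
void problems_add(uint64_t eventid, uint64_t triggerid, int clock,
                  int severity, const char *name)
{
    problem_t *p = calloc(1, sizeof(*p));
    p->eventid = eventid;
    p->triggerid = triggerid;
    p->clock = clock;
    p->severity = severity;
    strncpy(p->name, name, sizeof(p->name) - 1);
    size_t h = prob_hash(eventid);
    p->next = buckets[h];
    buckets[h] = p;
}

problem_t *problems_find(uint64_t eventid)
{
    for (problem_t *p = buckets[prob_hash(eventid)]; p != NULL; p = p->next)
        if (p->eventid == eventid)
            return p;
    return NULL;
}

/* called when the matching RECOVERY event arrives */
void problems_remove(uint64_t eventid)
{
    problem_t **pp = &buckets[prob_hash(eventid)];
    while (*pp != NULL) {
        if ((*pp)->eventid == eventid) {
            problem_t *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
        pp = &(*pp)->next;
    }
}
```

Hashing on the eventid keeps both the insert on a problem event and the delete on its recovery event at O(1), which is all this queue needs.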

In the trapper code I added one more message type:
{"request":"problems.get"}

Processing is rather simple: iterate over the problems hashset and build a huge JSON response.

I added locking there, since events are registered from different threads (I assume it's the preprocessing manager which calculates triggers), and export seems to happen somewhere else (an export thread?).
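A minimal sketch of that handler, with an illustrative problem struct and lock (the real code iterates the configuration-cache hashset, and the names here are made up):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the "problems.get" trapper handler: walk the in-memory
 * problem list under a lock and build one big JSON response. */

typedef struct {
    uint64_t eventid;
    int      clock;
    int      severity;
    const char *name;
} problem_t;

static pthread_mutex_t problems_lock = PTHREAD_MUTEX_INITIALIZER;

/* writes {"response":"success","data":[...]} into buf, returns length */
size_t problems_get_json(const problem_t *problems, size_t count,
                         char *buf, size_t buflen)
{
    /* events are registered from other threads, so hold the lock
     * for the whole iteration */
    pthread_mutex_lock(&problems_lock);

    size_t off = (size_t)snprintf(buf, buflen,
                                  "{\"response\":\"success\",\"data\":[");
    for (size_t i = 0; i < count; i++) {
        if (off >= buflen)
            break; /* buffer full, response truncated */
        off += (size_t)snprintf(buf + off, buflen - off,
                                "%s{\"eventid\":\"%llu\",\"clock\":%d,"
                                "\"severity\":%d,\"name\":\"%s\"}",
                                i ? "," : "",
                                (unsigned long long)problems[i].eventid,
                                problems[i].clock, problems[i].severity,
                                problems[i].name);
    }
    if (off < buflen)
        off += (size_t)snprintf(buf + off, buflen - off, "]}");

    pthread_mutex_unlock(&problems_lock);
    return off;
}
```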

And it was the easiest part.

Fixing the UI was quite an undertaking.

The Zabbix team follows the MVC pattern, so there are three scripts to fix, and the different types of API calls are spread among all three of them.

And there are separate classes to learn for server communication and HTML construction.

Actually, it is very nice from a code and structure point of view - a "heaven for a perfectionist" programmer, as it should be - but I wasn't quite ready to fit it all into my brain at once. Whatever, it was fun anyway.

The result: I was able to get stable UI response times below 100 msec, with about 50-70 msec of response wait time. I also realized that simply removing all those nice pop-up boxes with problem descriptions and acknowledge statuses is enough to get an OK response time under DB load during a huge host outage, retrieving the data from the DB without the trapper. That is closer to 2000 msec, but still acceptable. So a kind of "easy" fix is also possible.



The problems: sometimes PHP cannot parse the output from the Zabbix server. Dumps show that PHP gets all the data - I can see the buffer in the PHP logs - but the parser doesn't parse it. Due to the rare nature of the issue, I couldn't find its roots yet.

Another small disadvantage is the problem start time: as I keep the time of the event, it usually reflects the time when zabbix_server was last restarted, which happens a lot in development but rarely in production.



The big advantage: it's fast. No, it's just _really_ fast!

It feels just like opening a static page. And it really doesn't need php-fpm resources, which means we can give up the separate frontend machines; not that critical, but still good.

So, this is a full win in terms of prototype and usability, but it's a bit non-production right now: the fixes have broken the "problems" panel, and the changes were made by altering existing widgets, whereas they should have been added as separate ones.


Some tests:



I've tested two important online monitoring widgets - problems and problems by severity - under an "average collapse" situation (10k hosts inaccessible).
Numbers are the time in msec until the widget has been rendered. For the trapper-based widgets there is no difference between the OK and the 10k-fail situations.

On the full 50k+ collapse, neither report could fit into 300 seconds.

I decided not to waste time making nginx wait for fpm forever just to find out that it might render in, say, 12 minutes.

I'd say that 2 seconds is acceptable, 10 is bad, and more than 10 means no chance for production.


And there is one other great thing about this: now we can get the online Zabbix state without the DB.

If and when Zabbix's OLTP mass updates, like events and problems, go to a BigData DB, it will be very, very close to becoming a real clustered server:

The idea: two (or more) servers could use the trapper interface to make sure both are functional and split the host load between them: odd hostids to one server, even ids to the other. Both servers would have the full item database in both DB and memory, but one condition should be added to the code which distributes tasks to pollers: don't touch the other server's hosts while that server is alive. When it dies, its hosts are activated and polled.
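The ownership rule fits in a few lines (function and parameter names are made up for illustration; peer liveness would come from the trapper heartbeat):

```c
#include <stdint.h>

/* Sketch of the two-server split: hostid parity decides the owner;
 * if the peer is dead, the surviving server takes everything. */

/* returns 1 if server `my_id` (0 or 1) should poll this host */
int host_is_mine(uint64_t hostid, int my_id, int peer_alive)
{
    int owner = (int)(hostid % 2); /* odd -> server 1, even -> server 0 */

    if (!peer_alive)
        return 1;                  /* take over the dead peer's hosts */

    return owner == my_id;
}
```

The nice property is that no shared state is needed for the split itself: both servers compute the same owner from the hostid, and only the liveness check requires communication.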

Sure, there are complications - LLD items, active agents, zabbix senders - but I think it's all solvable.

Saturday, September 1, 2018

ZABBIX: the story

Long, long ago - perhaps two months ago by now - I started a one-week project that seems likely to be finished only by this fall....


So, ZABBIX.

At work we use it for monitoring all the nice infrastructure we have. And it so happened that the monitoring service kept growing but had no true owner for the last four years. So I decided to put my hands on it.

There were several reasons:
  • technical problems (hangs, slow speed, short history period)
  • it was quite outdated

but the primary ones were:
  • it's fun,
  • it's new knowledge,
  • because "I can",
  • it's a real task that creates a lot of value for the business

So, before I started to do something, there was a time for thinking, circling around, drawing schemes, learning the new versions of ZABBIX, and getting access to the production monitoring servers. That led me to the understanding that monitoring consists of two major parts: real-time and analytical.

At the end of May 2018, at an internal hackathon event, I completed the ClickHouse history module. I would name this the most important change and a "must have" for everyone who doesn't use ElasticSearch. But even those who do, please read the comparison. Overall, ClickHouse works very well as a historical storage. It is quite easy to set up and it makes a big difference.

To make ZABBIX support ClickHouse storage, you have to be ready to compile ZABBIX from sources with patches. ClickHouse itself must also be set up along with the initial data scheme. Look here for how to do it.

But the story didn't finish there. After completing the history storage part, some other problems came into my view, namely the number of servers we used for monitoring. We had twenty-one(!), fifteen of which were proxies to keep up with the load.

A long and curvy road led me to optimize that significantly. I was under the impression that most problems came from the constant forking of the fping process. After some research and efforts to reduce the number of forks per second (fps :) ), I did the nmap solution along with the "time-distributed pinging" idea. Overall, that allowed 30-40 times fewer forks while doing the checks three times faster, and spending about half as many CPU ticks on accessibility checks.
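My reconstruction of the "time-distributed pinging" part (a sketch, not the actual patch): each host gets a fixed offset inside the check interval, so the pings are spread evenly across it instead of all batches firing at once:

```c
#include <stdint.h>

/* Illustrative: derive a stable per-host offset (in seconds) inside
 * the accessibility-check interval from the hostid, so the nmap
 * batches are spread over the whole interval. */
int ping_offset(uint64_t hostid, int interval_s)
{
    return (int)(hostid % (uint64_t)interval_s);
}
```

A host is then scheduled at `interval_start + ping_offset(hostid, interval)`, which flattens the fork-rate spikes without any extra bookkeeping.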


Then it was SNMP querying. Compared to the accessibility checks, we have ten times more items to poll - almost 3.5 million items now. Polling them all in a reasonable time frame couldn't be done with only one server, even with 1000 threads.
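A back-of-the-envelope calculation shows why; the 60-second polling interval and ~50 ms average SNMP round-trip used here are my assumed numbers for illustration, not measurements:

```c
/* Why 1000 synchronous threads don't cover 3.5M items.
 * Assumed (illustrative) numbers: 60 s polling interval,
 * ~50 ms average SNMP round-trip per synchronous request. */

long required_polls_per_sec(long items, long interval_s)
{
    return items / interval_s;        /* 3.5M / 60 ~= 58k polls/s needed */
}

long sync_capacity_per_sec(long threads, long rtt_ms)
{
    return threads * (1000 / rtt_ms); /* each thread does ~20 polls/s */
}
```

Under these assumptions, 1000 threads give roughly 20k polls per second against a required ~58k, so synchronous polling falls short by about a factor of three; asynchronous polling removes the per-request wait from the equation.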

I had done several projects with mass SNMP polling in Perl and Python before, trying AsyncIO and plain threading, so I already knew that asynchronous polling would do the trick. A description of the asynchronous polling idea is in this article.

To implement mass polling, many things had to be done in the ZABBIX server beforehand. ZABBIX is somewhat ready for batch item processing, but there are many places in the code where processing is done one-by-one, quite inefficiently. I named this part of the job "BULKiing".

When batch processing of items was completed, it also became apparent that there are two major architectural bottlenecks. One is the preprocessor manager, with its "queuing problem" and the fact that all items are processed by a single thread; the other is the global locking of the configuration. Both problems lead to a situation where systems with slow CPU cores perform rather poorly running ZABBIX under high load. Such systems show low processing NVPS numbers while most CPUs sit 80-90% idle, waiting on internal locks.

Both bottlenecks are limited by the speed of a single CPU core. To keep the preprocessor manager from getting stuck under high load, I changed the way it processes data so it can prioritize tasks and avoid queuing. Now it can decide what is better to do: get new results from pollers or send results to workers. Details are here.

To work around the "global locking", I enlarged the batches of items to poll from 8k to 32k, which gave about a 10k NVPS performance increase and allowed us to marginally exceed the initial goal of 50k NVPS. But after we finally put the modified version of ZABBIX into production, the problem of the slow core and global locking appeared again.

Even though it could be solved with a faster CPU, I decided to solve it properly as well. This allows fully utilizing older server hardware with slower CPUs but a larger number of cores. Details are here.

And then, to resolve the other "bottleneck" problem, I changed the architecture to process data with more than one preprocessing manager. Overall, together with all the previous fixes, this raised the total processing capacity from 5k to 88k NVPS on the same hardware, additionally eliminating the need for the proxies we had put in for load sharing.

The same code showed 118k NVPS on the test machine with faster cores.

Lastly, I will be doing some UI-speed-related work, since the UI is not functional under stress situations, and even in normal conditions it takes 5-10 seconds to refresh the panels.

And to make things complete, there are several short notes which are more about mood and fun: strange situations, architecture questions, lack of knowledge, coincidences, links to sources, some work records, and so on.

nb: All the numbers are given for our installation of 47k hosts / 4M items. I assume many of the problems are not likely to appear on installations with fewer objects.

nb2: The primary reason I write this is the possibility to catch a glimpse of the engineering mood and fun later. I accidentally found my six-year-old records from the time we were launching a new BRAS type. Reading them was a kind of joy, reliving that time of being consumed by engineering, ideas, solutions, and sleepless nights. So I am writing now so that I'll have something to read in 2024.