Friday, September 7, 2018

ZABBIX: the "UI meltdown prototype"


Zabbix's UI speed problems have the same root as the database problems - it tries to combine two worlds:

 - the online (real-time) picture
 - logs and statistics from a huge OLTP database.

Combining the two gives fairly informative views, but it is slow because large volumes of data have to be selected.

So to make things really fast, and to have monitoring that works when everything else fails (which is exactly when we need online monitoring most), we need to split them.

The "analytical" and slow part of the data - acknowledge statuses, messages, confirmations, alert data - must not be used for the online reports.

The proper place for such information is the database.

Preferably not the OLTP one, but the BigData storage that is already used for history. But since there is not that much of this data, it fits the OLTP DB fine for now.

So finally I've got some time to implement the idea.


The idea, point by point:

Firstly, the server depends even less on the database.

Secondly, online monitoring will not depend on the database at all - it lives only in the server's memory, which has very nice implications that I'll write about later.

Thirdly, in crisis times (huge network outages), when we need fast online info the most, the database is having a hard time because of the large number of event and problem update records being written to it, so we'd rather have a fast out-of-DB way to know the current monitoring state and take our time analyzing events and problems later.

The coding

I have created one more queue in the configuration cache, named "problems". The queue is updated on each event, at the point where Zabbix's problem export feature is invoked. On recovery events, problems are deleted from the queue.

Since I needed indexing by id and no sorting at all, I decided to go with Zabbix's hashset library and hash by eventid.

In the trapper code I added processing of one more message type:
{"request":"problems.get"}

Processing is rather simple - iterate over the problems hashset and build a huge JSON response.

I added locking there, since events are registered from different threads (I assume it's the preprocessing manager that calculates triggers) and the export seems to happen somewhere else (an export thread?).
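Roughly, the handler looks like the sketch below. This is only an illustration: the lock macros and the problem struct are made-up names, the real code keeps more fields per problem, but the zbx_hashset_* and zbx_json_* calls are the standard Zabbix library ones.

/* sketch: serialize the in-memory problems hashset to JSON for the trapper */
static void send_problems(zbx_socket_t *sock)
{
        zbx_hashset_iter_t      iter;
        problem_t               *problem;       /* hypothetical struct in the config cache */
        struct zbx_json         j;

        zbx_json_init(&j, ZBX_JSON_STAT_BUF_LEN);
        zbx_json_addarray(&j, "problems");

        LOCK_PROBLEMS;                           /* events are added from other threads */
        zbx_hashset_iter_reset(&problems, &iter);
        while (NULL != (problem = (problem_t *)zbx_hashset_iter_next(&iter)))
        {
                zbx_json_addobject(&j, NULL);
                zbx_json_adduint64(&j, "eventid", problem->eventid);
                zbx_json_adduint64(&j, "clock", problem->clock);
                zbx_json_addstring(&j, "name", problem->name, ZBX_JSON_TYPE_STRING);
                zbx_json_close(&j);
        }
        UNLOCK_PROBLEMS;

        zbx_json_close(&j);
        zbx_tcp_send(sock, j.buffer);            /* one big response back to the frontend */
        zbx_json_free(&j);
}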

And it was the easiest part.

Fixing the UI was quite an undertaking.

The Zabbix team used the MVC approach, so there are three scripts to fix, and the different types of API calls are spread among all three of them.

And there are separate classes to learn for server communication and HTML construction.

Actually, it is very nice from the code and structure point of view - a "perfectionist's heaven", as it should be - but I wasn't quite ready to fit it all into my brain at once. It was fun anyway.

The result: I got a stable UI response time under 100 ms, with about 50-70 ms of wait time. I also realized that just removing all those nice pop-up boxes with problem descriptions and acknowledge statuses is enough to get an acceptable response time when retrieving data from the DB without the trapper, even with the DB loaded during a huge host outage. That is closer to 2000 ms, but still acceptable, so a kind of "easy" fix is also possible.



The problems: sometimes PHP cannot parse the output from the Zabbix server. Dumps show that PHP receives all the data - I can see the buffer in the PHP logs - but the parser fails on it. Since it happens rarely, I haven't found the root cause yet.

Another small disadvantage is the problem start time: as I keep the time the event was registered, it usually reflects the time when zabbix_server was last restarted, which happens a lot during development but rarely in production.



The big advantage - it's fast. No, it's just _really_ fast!

It feels just like opening a static page. And it really doesn't need php-fpm resources, which means we can give up the separate frontend machines - not that critical, but nice.

So, this is a full win in terms of prototype and usability, but it's not quite production-ready yet: the fixes have broken the "problems" panel, and the changes are made by patching existing widgets, whereas they should be added as separate ones.


Some tests:



I've tested two important online monitoring widgets - "problems" and "problems by severity" - under an "average collapse" situation (10k hosts inaccessible).
Numbers are the time in msec until the widget is rendered. For the trapper-based widgets there is no difference between the OK and the 10k-fail situation.

On a full 50k+ collapse, neither report could fit into 300 seconds.

I decided not to waste time making nginx wait for fpm forever just to find out that it might render in, say, 12 minutes.

I'd say that 2 seconds is acceptable, 10 is bad, more than 10 - no chance for production.


And there is one other great thing about this - now we can get online zabbix state without db.

If and when zabbix OLTP mass updates like events and problems will go to BigData DB, then it will be very-very close to becoming a real clustered server:

The idea: Two (or more) servers could use trapper interface to make sure they both functional and split hosts load between them: odd hostsids to one server, even ids to the other. Both serves will have full items database in both db and memory, but on code which distributes tasks to poller one condition should be added - not use the other's servers hosts tasks until it is alive. So when it dies a hosts will be activated and polled
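Just to make the distribution rule concrete, here is a tiny illustration in C - pure pseudocode of the idea above, not anything from the Zabbix source:

/* sketch of the host-split rule for a two-server "cluster" */
int host_is_mine(zbx_uint64_t hostid, int my_index, int peer_alive)
{
        if (hostid % 2 == (zbx_uint64_t)my_index)
                return 1;                /* my half of the hosts */

        return peer_alive ? 0 : 1;       /* take over the peer's half only when it is down */
}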

Sure, there are complications - LLD items, active agents, zabbix senders - but I think they are solvable.

Saturday, September 1, 2018

ZABBIX: the story

Long, long ago - perhaps two months ago by now - I started a one-week project that now looks like it will only be finished by this fall...


So, ZABBIX.

At work we use it for monitoring all the nice infrastructure we have. It so happened that the monitoring service kept growing but hadn't had a true owner for the last four years. So I decided to put my hands on it.

There were several reasons:
  • technical problems (hangs, slow speed, short history period)
  • it was quite outdated

but the primary ones were:
  • it's fun,
  • it's new knowledge,
  • because "I can",
  • it's a real task that creates a lot of value for the business.

So, before I started to do anything, there was a time of thinking, circling around, drawing schemes, learning the new versions of ZABBIX and getting access to the production monitoring servers. That led me to the understanding that monitoring consists of two major parts - real-time and analytical.

At the end of May 2018, at an internal hackathon, I completed the Clickhouse history module. I would name this the most important change and a "must have" for everyone who doesn't use Elasticsearch. But even those who do, please read the comparison. Overall, Clickhouse works very well as historical storage. It is quite easy to set up and it makes a big difference.

To make ZABBIX support Clickhouse storage you have to be ready to compile ZABBIX from sources with the patches. Also, Clickhouse must be set up along with the initial data schema. Look here for how to do it.

But the story didn't end there. After completing the history storage part, another problem came into view: the number of servers we used for monitoring. We had twenty-one (!), fifteen of which were proxies kept just to cope with the load.

A long and curvy road led me to optimize that significantly. I was under the impression that most problems came from the constant forking of fping processes. After some research and effort to reduce the number of forks per second (fps :) ), I did the nmap solution along with the "time distributed pinging" idea. Overall that allowed 30-40 times fewer forks while doing checks three times faster, and cut the CPU spent on accessibility checks roughly in half.


Then came SNMP querying. Compared to accessibility checks we have ten times more items to poll - almost 3.5 million items now. Polling them all in a reasonable time frame couldn't be done with only one server, even with 1000 threads.

Before this I had done several mass SNMP polling projects in Perl and Python, trying asyncio and plain threading, so I already knew that asynchronous polling would do the trick. A description of the asynchronous polling idea is in this article.

To implement mass polling, many things had to be done in the ZABBIX server beforehand. ZABBIX is somewhat ready for batch item processing, but there are many places in the code where processing is done one item at a time, quite inefficiently. I named this part of the job "BULKiing".

When batch processing of items was completed, it also became apparent that there are two major architectural bottlenecks. One is the preprocessor manager with its "queuing problem" and the fact that all items pass through a single thread; the other is the global configuration lock. Both problems mean that systems with slow CPU cores perform poorly running ZABBIX under high load: such systems show low NVPS numbers while most CPUs sit 80-90% idle, waiting on internal locks.

Both bottlenecks are limited by the speed of a single CPU core. To keep the preprocessor manager from getting stuck under high load, I changed the way it processes data so it can prioritize tasks and avoid queuing. Now it can decide what is better to do - take new values from the pollers or feed data to the workers. Details are here.

To work around the global locking, I first tried enlarging the batches of items to poll from 8k to 32k, which gave about a 10k NVPS performance increase and allowed us to marginally exceed the initial goal of 50k NVPS. But after we finally put the modified ZABBIX into production, the slow-core / global-locking problem appeared again.

Even though it could be solved with a faster CPU, I decided to fix it properly. This allows fully utilizing older server hardware with slower CPUs but a large number of cores. Details are here.

And then, to resolve the other bottleneck, I changed the architecture so that data is processed by more than one preprocessing manager. Overall, together with all the previous fixes, this raised total processing capacity from 5k to 88k NVPS on the same hardware, additionally eliminating the need for the proxies we had added for load sharing.

The same code showed 118k NVPS on the test machine with faster cores.

Lastly, I will be doing some UI speed work, since the UI is not functional under stress and even in normal conditions it takes 5-10 seconds to refresh the panels.

And to make things complete, there are several short notes that are more about mood, fun and strange situations, architecture questions, lack of knowledge, coincidences, links to sources, some work records, and so on.

nb: All the numbers are given for our installation of 47k hosts / 4M items. I assume many of the problems are unlikely to appear on installations with fewer objects.

nb2: The primary reason I write this is to catch a glimpse of the engineering mood and fun. I accidentally found my six-year-old notes from the time we were launching a new BRAS type. Reading them was the joy of reliving a time of being consumed by engineering, ideas, solutions, sleepless nights. So I am writing now so that I'll have something to read in 2024.



Friday, August 31, 2018

ZABBIX: DZSA (direct zabbix server access)

The essential UI problem is that it uses the API.

The problem is efficiency, especially during downtimes - especially when the DB is slooow under 100k+ problem, event and trigger updates. (Remember the crash?) The most important widgets look like this most of the time:

Solution:
keep a list of active problems on the Zabbix server side and refresh it on the go as problems appear or close.

Fetch it via the trapper interface.
Build the "current status" widgets out of it, avoiding the API as much as possible.

Side effect:
keep a list of "internal problems" so the monitoring system administrator can see what's going wrong in the monitoring itself, and use the same idea to build that interface.

The progress

server->trapper->php->browser "hello world" prototype seems to be working:




Let's see how it all works together once completed, a bit later.

Friday, August 17, 2018

ZABBIX: nmap, fping, collapse reasons

Today I finally had time to figure out what happened during the Zabbix crash two days ago.


The reason so many hosts stopped being accessible was an nmap failure. I fixed it by switching back to fping, but I had to figure out why nmap had stopped working.

During the crash we tried to fix a network accessibility problem, and someone left a NAT rule mapping ICMP traffic (for most networks) to the proper source address.

And this is important: I don't know the exact reason, but it seems the NAT rule on the POSTROUTING stage WAS being applied to the traffic, and the returning traffic for some reason could not be received by the same socket. I am not sure whether it went directly to the socket without being de-NATted, or whether it couldn't reach the socket after being de-NATted by iptables. It doesn't really matter.

Fping remained functional under the same conditions because it was already sending traffic from the correct address.

For nmap the situation is a bit more complicated: when the -S option is used, it doesn't actually send any traffic. To be precise, no traffic leaves the system towards the desired destination. Only setting the outgoing interface helps (perhaps one more reason to learn how RAW sockets behave).

So to use a source address with nmap, the outgoing interface must either be discovered by the Zabbix server or be set in the configuration. The latter seems simpler.

And one more thing to consider: the nmap parameters were actually wrong, so alongside ICMP it was sending packets to ports 443 and 80 for host discovery. This is not right, as it may harm slow devices, so I've fixed the nmap options.


Thursday, August 16, 2018

ZABBIX: v4 in production and the first production collapse

We've got v4 working in production now.




While we were trying to fix the nmap source address problem, we accidentally disrupted network connectivity, which led to a mass system collapse.

The core reason for the collapse is slow MySQL update speed. This, in turn, has roots in ZABBIX's habit of logging everything to the MySQL database.

And third, by architecture, the Zabbix server and the frontend exchange data through MySQL.

What happens is:
network collapse -> lots of updates (problems, events, etc.) -> DB collapse -> slow frontend -> more DB load due to page refreshes.

Some thoughts:

1. Exchange online data (existing problems) via some in-memory method.
2. Reduce DB load and event logging as much as possible, and (probably) send events to Clickhouse.

I have also found that there is a huge trends_uint table. I thought it was the housekeeper's job to maintain it; since there is HousekeepingFrequency=0 in the config file, I hadn't checked it before. So I did "update items set trends=0" to make sure.

But let's return to collapsing under high load.
Both 1) and 2) need to be fixed. But something also has to be done about the PHP part of the frontend widgets, which seems to be non-functional with a high number of inaccessible hosts.


Wednesday, August 15, 2018

ZABBIX: upgrade to v4

So far I am very glad to have started the upgrade. There are many changes in the code.

The most important one: they redid the locking. It's kind of different from what I did, but perhaps more correct.

Instead of splitting the queues into pieces, there are now separate locks for reading and writing. That sounds promising, as reads can now run in parallel. Perhaps it's good, but testing hasn't shown any performance increase yet.

And I don't remember this from before:

It's from DCconfig_poller_get_items - the function that retrieves items from the queues for checking. I think it's just not right.

It's much better to do asynchronous requests in every type of check. Set some reasonable timeouts and just _forget_ about priorities, different timings and unreachable pollers.

I tried it for SNMP and saw no negative impact.

Whatever the number of unreachable hosts, the thread waits only a single 2-second timeout, whether it is one host or eight thousand.

A disadvantage of async pollers is that, in theory, they take a bit longer for a batch of hosts than they would for a single item. Each batch of items may take from 2 to 8 seconds to process, so it's not good for polling data more often than every 10 seconds.

But according to my tests and experience, it's not worse than the existing system; under comparable NVPS it will be very fast, just don't expect 1-2 second polling delays from a system processing 20-30k NVPS.


Friday, August 10, 2018

ZABBIX upgrade to v4, some pure load test

As the final part of the work I want to upgrade everything to the v4 line, to be able to support the changes for as long as possible with minimum effort (and to act as an early adopter and bug fixer for Zabbix).

I have also decided to do some load tests of pure, untouched, virgin Zabbix.

I took a clean v4.0.0 alpha 9 and applied the Clickhouse changes.

By the way, I saw some minor Elastics changes, but more importantly, they seem to have implemented file offloading of events, history, trends and perhaps something else.

That is valuable. Perhaps if I had found this earlier, the Clickhouse offloading would have been implemented outside of the main server process, as a file parser. On the other hand, the current implementation is in-process, efficient and also frontend-compatible.

The reason to do the load test is to find out whether out-of-the-box Zabbix can already do what we want it to do.

This time I tried to fetch as much SNMP traffic as possible:

pic: (1) the system is 44% loaded, but (2) it is collecting only 21-22k new values per 5 seconds, i.e. 5-6k NVPS. It spends 10-12 seconds (4) waiting for configuration locks, while under normal conditions a poller process spends 4-5 seconds per 50-60 values. In total, there are 3500 (!) (4) poller threads running. And they have also exhausted most of the memory.

I couldn't make it run more than 4k threads out of the box, but assuming linear CPU growth it could go up to 8k threads after some sysctl fixes and gather 12k NVPS in the best case.

Then I decided to test pure accessibility check speed again. As I mentioned before, I could achieve 28k NVPS of pure 1-packet accessibility checks on the same machine with the fixed ICMP module, which uses nmap instead of fping.


OK, pure fping. The system is quite loaded: load average is 466 :), no idle time, 10-11k NVPS.

So, Zabbix alone cannot do what we need out of the box, and there is a reason to apply the fping and SNMP patches, since they really make it possible to collect 15 times more data on the same hardware.




Thursday, August 9, 2018

Wow, That's Funny! 100k NVPS exceeded!

The subject is inspired by one of the most popular TED videos on YouTube.


I was recently load testing all the new code together and found a funny thing:

after a certain point, lowering the item delay leads to dramatic degradation of the preprocessor manager.


I added statistics logging to the worker threads and found out that they are idle most of the time, so this time it looks like the preprocessor manager itself is the problem.

After reducing the item delay to something less than 30 seconds, I saw the preprocessor managers drop their processing speed from around 80,000 items to 500-800 items per 5 seconds.

pic: semop 4 is shown, but the real problem is semop 1

Perhaps it's still locking. There aren't that many semop operations according to strace, but visually it's the operation the process freezes on most. I couldn't find a simple way to calculate wait time via strace or something similar, so I had to rely on my eyes. (Update: RTFM - the -w and -c options combined produce strace statistics based on wall-clock time.)

First I decided preproc_sync_configuration was guilty, but that was the wrong path.

After an hour of research I found another bottleneck. And this one is also a global, architectural one - the history cache lock.

The situation is almost the same as with the configuration lock: access to the history cache is globally locked, which leads to the "single CPU core" problem.

With LOTS of results coming from pollers, the threads become limited by the speed of a single core when putting data into the cache and reading it back.

The solution? Well, nothing new - split it into 4 independent caches, each locked separately and each accessed by its own set of threads.

Most of dbcache.c has to be fixed, since it's not only the cache that has to be split into 4 parts - there are also lots of static variables declared there.
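The idea itself is simple; here is a sketch of it with made-up helper names (the real dbcache.c code is different, and in the end this experiment failed, as described below):

/* sketch: an item always maps to the same history-cache partition,
 * so the hot add path takes only that partition's lock */
#define HC_PARTS 4

static int hc_part_for_item(zbx_uint64_t itemid)
{
        return (int)(itemid % HC_PARTS);
}

static void hc_add_values(zbx_uint64_t itemid, const void *values, int values_num)
{
        int part = hc_part_for_item(itemid);

        lock_history_cache_part(part);     /* the other 3 partitions stay available */
        append_to_partition(part, itemid, values, values_num);
        unlock_history_cache_part(part);
}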

Not sure if this is something needed for production, but it might be a good prototyping experiment.

There is also a question hanging: why does performance rise when I split the preprocessing manager into 4 on Xeon E5645 CPUs? (Upd: perhaps the answer is that the preprocessor manager has lots of other processing work to do, and splitting it across different CPUs gives some combined profit.)

Whatever.

Just to make sure the job of maximizing NVPS is complete, I fixed dbcache.c and the HistoryCache locking to support 4 history caches, to see whether this is _the last_ bottleneck.

I did what I assume is safe locking: rare and non-critical operations still lock the entire cache, but the most critical ones lock only their part of it.

Data flushes from the preprocessor manager to the cache, where speed is needed most, are done by locking one of the four sub-caches, so such additions can go in parallel. Then I also had to fix lots of static variables in dbcache.c, as they are now used in parallel.

But I have to admit I failed at this experiment. I spent three days of free time and still got a very unreliable daemon that kept crashing after 1-2 minutes.

So for the moment I am giving up on this; perhaps I'll return to it later.

pic: four preprocessor managers doing 118k NVPS combined (note: idle time is calculated incorrectly)



And there is one more reason.

I returned to the test machine with the E3-1280. And wow - Zabbix could poll, process and save 118k NVPS (pic) with the fixed queuing and 4 preprocessor managers. I would say it's 105-110k stable, while still having 18-20% idle. And that's kind of a lot. My initial goal was to reach 50k. Considering we need 10k on one instance and 2k on the other, it's enough capacity to grow for a while.








Wednesday, August 1, 2018

ZABBIX: multiple preprocessing managers

I decided to do this because one thread of a Xeon E5645 couldn't handle the whole preprocessing job, and it might also be needed on faster CPUs at higher loads.

First, some thoughts I had beforehand:
  • 1. there might be a problem if two dependent items go to different managers, so they would be waiting for one another (deadlock)
  • 2. there must be some kind of worker distribution.
My first idea was to split requests and responses between threads, but then I decided it would be much easier to solve 1 and 2.

Since there was already host-to-queue persistence in the queuing version, I did thread-to-preprocmanager persistence the same way (procnum%4). So all items from a given host end up at the same preprocessing manager.

Workers are distributed among the preprocessing managers the same way, by hashing their procnum.
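For illustration only, the two persistence rules boil down to something like this (names are mine, not the actual patch):

/* all items of one host go to the same preprocessing manager,
 * and each worker serves exactly one manager */
#define PREPROC_MANAGERS 4

static int manager_for_item(zbx_uint64_t hostid)
{
        return (int)(hostid % PREPROC_MANAGERS);
}

static int manager_for_worker(int worker_procnum)
{
        return worker_procnum % PREPROC_MANAGERS;
}

Keeping a host's items on one manager also keeps dependent items together, which avoids the deadlock from thought 1.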


So, after fixing the worker log handling problem, the preprocessors could handle many more requests. I saw a stable value of around 110k items per 5 seconds per manager, giving 88k NVPS altogether. I thought of pushing for 100k, but decided not to waste more time on this old hardware and to test something more up-to-date later.

By the way, measuring preprocessor manager statistics is how I prefer to measure NVPS. I assume the server collects a bit more from the network: it makes requests to inaccessible hosts, which return no data and never reach the preprocessing manager.

So that's it. Splitting the preprocessing manager is essential if you need more than 50k NVPS stable on modern CPUs. This is what was actually limiting the test server with the Xeon E3-1280.

Tuesday, July 31, 2018

ZABBIX: the second server

I tried to launch ZABBIX with all the fixes in the role of server monitoring; it started to segfault. This was a bug in error logging.

There is another (big?) problem there:
it looks like the poller needs to be further altered to support asynchronous checks for agents. Or some kind of locking occurs.

The strange thing is that simple checks are slow too. Or it might be something global, like DNS.

This installation is rather small:

And the server is, sure enough, 97% idle.

upd: nothing interesting - the new server was not permitted by IP filters to access hosts via SNMP, and for agents it also had no permission to request data.

Thursday, July 26, 2018

ZABBIX: what does a cat do when it has free time?

Subj. ZABBIX does quite the opposite.

I successfully changed the code to have as many preprocessor managers as I want.

But whatever I did, I was stuck at 60k NVPS.
My initial thought was that the preprocessor manager was slowing things down, so I rewrote it to have 4 managers.

But then I saw that running 4 managers spreads the CPU load between them but gives no NVPS increase.

However hard I thought, I couldn't find a reason for such behavior. But today it struck me - it's the worker threads, and the managers are waiting for them.

So I straced a worker thread.

...semop...
OK, is it the GLOBAL config LOCK again?

No, it's mutex 0, the mutex dedicated to logging.
But I don't see anything from workers in the logs, and the log level is 3.

And guess what?
Let me show you the main loop of the worker code:
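The screenshot isn't reproduced here, so this is a hedged reconstruction from memory. I assume the call in question is zbx_handle_log() from Zabbix's log.c (the function that checks the file size and rotates the log under the log mutex); the rest of the loop is paraphrased:

/* reconstructed worker main loop - simplified, helper names are illustrative */
for (;;)
{
        /* checks whether the log file needs rotation and takes the global
         * log mutex to do so; with ~100 workers spinning on items, that one
         * mutex becomes the bottleneck */
        zbx_handle_log();

        message = read_task_from_manager();
        preprocess_item_value(message);
        send_result_to_manager(message);
}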

I commented it out. It does log rotation. With locking. Inside a loop that may be running in 100 threads and whose job is to process each item as fast as possible.

So all the worker threads kept trying to take the LOG lock and rotate the log file if it had grown bigger than the configured size. I guess it's a developer's mistake.

So, without the log handling, the workers got a bit faster and the server achieved approximately 85-90k NVPS, while having only 25% idle left.

So, perhaps, that's the limit for a dual E5645 system with 12 physical cores (24 in total if we count the hyper-threaded ones).

During all these tests I also found a preprocessor manager priority flaw - it was likely to let the queue grow before the buffer was full. When the buffer is full it works OK, but then it slows down the pollers.

Now it's just fine. The queue is close to zero. Each of the managers writes about 110k items every 5 seconds, so in total it is 110k * 4 threads / 5 seconds = 88k NVPS.

Wednesday, July 25, 2018

ZABBIX: crashtesting, records #2

....failed very fast; with lots of processes the system starts to swap

fixed the buffer; the system does fine with 20 SNMP threads, about 25k NVPS

significant queuing at the required 43k NVPS, trying 30 threads


didn't help; changed delay from 90 to 120, Zabbix somehow stopped working normally, restarted.

doesn't work either, reduced number of pollers to 20
queuing, 33k NVPS...
raised pollers back to 40...

OK, 30k NVPS OF SNMP ONLY - NO PROBLEM
changed SNMP delay back to 90s... let's look
still queuing, but just above 10 sec, at the planned load of 43k NVPS

reduced SNMP delay to 100 sec, pinger still off, let's see
OK, the system is stable with completely zero queues at 33k NVPS of pure SNMP load.

Now let's test FPING with 100 threads:
the old version could handle the load without big problems with intervals down to 1 sec (1 sec showed queuing in the first column)

NMAP: 100 threads works worse than FPING
high system load, 32% idle

Reduced NMAP threads to 20:
it does just as well as 100 FPING threads, but system idle is 62%

Reduced NMAP threads to 10: it's worse, 20 seems more optimal; idle is 82%

Now try fping in 20 threads:

The normal SNMP is:
0:00.26 /usr/local/sbin/zabbix_server: poller #1 [got 249 values in 5.005509 sec, getting values]                                
 7942 zabbix    20   0 13,670g 271628 213924 S   0,3  1,7   0:00.26 /usr/local/sbin/zabbix_server: poller #4 [got 175 values in 4.851671 sec, getting values]                                
 7943 zabbix    20   0 13,670g 265808 208076 S   0,3  1,6   0:00.25 /usr/local/sbin/zabbix_server: poller #5 [got 142 values in 4.914348 sec, getting values]                                
 7946 zabbix    20   0 13,670g 268228 210508 S   0,3  1,6   0:00.25 /usr/local/sbin/zabbix_server: poller #8 [got 175 values in 5.141629 sec, getting values]                                
 7947 zabbix    20   0 13,670g 274664 216944 S   0,3  1,7   0:00.26 /usr/local/sbin/zabbix_server: poller #9 [got 158 values in 5.044240 sec, getting values]                                
 7949 zabbix    20   0 13,670g 264984 207320 S   0,3  1,6   0:00.24 /usr/local/sbin/zabbix_server: poller #11 [got 184 values in 5.038127 sec, getting values]                               
 7951 zabbix    20   0 13,670g 273344 215680 S   0,3  1,7   0:00.27 /usr/local/sbin/zabbix_server: poller #13 [got 138 values in 4.924217 sec, getting values]                               
 7953 zabbix    20   0 13,670g 287580 229860 S   0,3  1,8   0:00.27 /usr/local/sbin/zabbix_server: poller #15 [got 172 values in 5.133177 sec, getting values]                               
 7955 zabbix    20   0 13,670g 258828 201032 S   0,3  1,6   0:00.24 /usr/local/sbin/zabbix_server: poller #17 [got 142 values in 4.982964 sec, getting values]            

140-150 values per thread per 5 seconds
tried to run 300 threads



fping synthetic results:
 fixed fping
real    0m42,646s
user    0m0,236s
sys    0m0,336s


total CPU usage: 0,572s

fping normal

real    1m33,304s
user    0m0,104s
sys    0m0,276s
total CPU usage: 0,380s

nmap
real    0m3,252s
user    0m0,220s
sys    0m0,076s
total CPU usage: 0,296s




ZABBIX: crashing, getting segfaults and so on - diary records

And still I see the server failing (a segfault sometimes).
It's not nearly as stable as the production one, but typically these problems are quite easy to catch.

The problem now is that a server restart takes more and more time (about 2-3 minutes).



The preprocess manager problem



The questions:

"something impossible has happened" type log messages

why is SNMP slow to get re-enabled? - figure out why the unreachable-poller threads are waiting and doing nothing



huge load after some time!!!! net_snmp_close()???

During the night there was a huge number of "must use snmp_select_socket2" (or similar) messages - it seems that under certain conditions I leave SNMP sessions open.
Too many sessions cause high load, and what seems to be the select() call stops working, so essentially the whole gathering process breaks.
TO DO: add some profiling, look where I may be leaving sessions unclosed, consider removing the (select) code and replacing it with a stupid simple wait call.


Morning: huge SNMP queue. It seems the threads work normally, I haven't found any abnormalities, just 10 of them is not enough. (Actually, they now run much longer than at the start - their run time has risen from 6-9 seconds to 14-25 - but that might be due to high system load.)

Raised the number of SNMP pollers to 25, starting at 08:08.
08:58 - flight normal, queue 0.

Something strange seems to have happened at 11:24-11:34; apparently the server crashed at that time, there is a long SNMP queue left in the browser.
11:41 - started it again... actually the server is alive, but it's swapping and there are 16M records in the preprocessing manager's queue.

Maybe that's just too many hosts???
Maybe the load from fping is too big.
Now I have disabled about 20k hosts, reduced the number of threads to 20 for both pinger and poller, and restarted the server; let's see what happens.


With only half of the load (and the number of DB syncers increased to 40) the system seems to be doing fine, for at least an hour so far. And this is when I finally see some SNMP data gathered in the graphs.

I will watch it for another 2 hours, then start gradually adding new devices.
UPD: so far everything is fine, no queuing, system 90% idle.


Another problem to consider: devices marked as not SNMP-accessible stay in that state too long, and I've never seen any processing done by the unreachable pollers.
I expect I've broken some fetching logic in the DC functions, and that is why it takes so long to mark devices as SNMP-accessible again.


So my theory is: 1. I didn't have enough data syncer processes enabled before, so they were constantly busy filling the value cache (ClickHouse is not really fast for single-host queries). Actually, I would suggest filling the cache at startup in some bulk way - that would mean one query returning millions of values. Unfortunately, I am not sure whether that is what Zabbix and its triggers really need.


The biggest thing bothering me right now is the occasional preprocessing manager queuing. I am almost sure the queuing depends on two factors: system load and the number of working threads (actually it might be just one of them, as these two factors are closely linked). There is one big optimization that could be done in the SNMP poller - enabling bulk querying. It would save some CPU ticks and network PPS and might be easier on the network devices, but it will not really speed things up, since each SNMP thread will anyway be waiting on inaccessible-device timeouts, which take longer than making 100 queries to an accessible device, and those queries happen in parallel in the asynchronous model anyway.

Another big change and improvement that might be made is switching from fping to nmap for host checks. Question one is how valuable packet loss rates are for understanding device reachability. My thought is that it might be more valuable to send one ping every 3 seconds, ten times, than to send them all at once every 30 seconds. With that way of checking, the packet loss rate would be calculated in a trigger as an average over the last 5-10 checks. To speed up the reaction when a device becomes inaccessible, the last 2-3 checks must be considered.


OK, morning, 3:17. I see that the installation has crashed - the preprocessing manager caught a segfault.

All pingers have been off (I set StartPingers=0).

I really believe it's the preprocessing manager that is to blame, in particular the queuing of items. So I'd like to fix it in one of the following ways:
- find an answer - why is queuing happening at all???
- do not allow queuing
- after a certain amount of queuing, throttle the processes to wait until the preprocessing manager is free (if it's a load problem, which it doesn't seem to be)

I have one nice thing going: I know that running Zabbix in debug mode almost immediately causes the queuing. So I need to trace the whole decision process to understand why the hell it queues the messages.


I see that it spends most of its time processing IPC requests of type 2 (probably the results of the pollers' work). Added extra flush and history flush logging to see whether it happens at all. As a next step I will add the _flush call result.









The GitHub sources and naming

The sources are in two projects on GitHub.
Cleaned and checked sources and patch files are at
https://github.com/miklert/zabbix
At least the separate clickhouse patch will be there soon.

The very unstable development version is at
https://github.com/miklert/xe-rabbix

Now it's based on 3.4.9 code.

A few words about compiling Zabbix: you'll need the MariaDB-shared library, not just the client and server libraries and development files. I spent quite some time finding that out.

For Clickhouse, compile with the curl library.

Regarding the name: we had to call it something internally, and every monitoring system must have something in common with ZABBIX in its name.

So we first started calling it xerabbix [herabbiks], which came from a soft-sounding Russian word for 'dick' plus ZABBIX, but that was too rude, so after a month of development it transformed into xe-rabbix, now pronounced [iks-e-rabbiks], leaving the rudeness in the past.

Nowadays we interpret the name as the eXtra Edition of Zabbix.

Tuesday, July 24, 2018

ZABBIX: Configuration cache, DC_poller_get_items, DCpoller_requeue_items, queues

A test machine was able to achieve 51k NVPS steady.

That is more than enough. The test version could run for more than a week at 16k NVPS stable, with no memory leaks. So we planned to put the fixed ZABBIX code into production.

The old installation was ZABBIX version 2.1. The migration included lots of other work, like changing the OS and fixing automation. A long night's work.

Just after the server was upgraded to the new OS, we started the database and the server and... OOOPS, it could only do 5k NVPS. The CPU was 90% idle at the time. We have seen this already, right? To finish the job, production was left running on the test machine.

OK, let's figure out what's going on with the new server.
These are the statistics strace shows for a poller thread:



It clearly shows the thread waiting for its turn to take the lock.


Now a few ideas about locking.

ZABBIX holds a huge configuration structure in memory, organized into several tables (perhaps lists or trees).

Apart from that, there are 5 task queues, one per polling type (ICMP, POLLER, PROXY, and so on).
In each queue the entries are ordered and (perhaps) hashed by next check time; the entries reference items.

So, each time something accesses the configuration, it takes the global lock via a mutex and semop, reads/updates the configuration, and then unlocks.



 The global lock domain picture:


After analyzing the dbconfig.c and dbcache.c code, I decided that for some operations the global lock could be avoided. To be precise, all operations from the poller and pinger are safe to run in parallel as long as they use and lock their own queues:



I decided to split each queue type into 4 pieces. To keep host-to-thread persistence, items are distributed across the queues by a (hostid%4) hash.
To maintain polling thread persistence, threads are bound to queues by their own hash (procnum%4).

There should be at least 8 threads of each type (better 16 or more), so that at any time at least one thread can be requesting data from the DC cache while the others are busy polling.
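In code the two hash rules amount to something like this sketch (illustrative names, not the actual patch); a poller then takes only the lock of its own queue part, so the other three parts stay available:

/* sketch of the queue-split persistence rules */
#define QUEUE_PARTS 4

static int queue_for_host(zbx_uint64_t hostid)
{
        return (int)(hostid % QUEUE_PARTS);     /* all items of a host stay in one queue part */
}

static int queue_for_thread(int poller_procnum)
{
        return poller_procnum % QUEUE_PARTS;    /* each poller always works its own queue part */
}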

So the result?

But first, about one funny twist in the story.

After fixing the queues and making them work, I saw that server performance was still no better.

Strace showed that the pollers still spent most of their time in semop calls, waiting for the global lock (mutex ZBX_MUTEX_CONFIG, 4).


2512:20180723:121820.512 In DCget_user_macro() macro:'{$SNMP_COMMUNITY}'
zabbix_server [2512]: zbx_mutex_lock: global config lock attepmt!
zabbix_server [2512]: zbx_mutex_lock: global config unlocked

After doing some profiling I found that the problem was in a macro nobody really needs - it's the macro for the SNMP community, and we only have one community anyway!

And then I even thought that maybe ALL the locking problems were due to this macro, and a week of coding had happened only because of my lack of knowledge of ZABBIX.

So I reverted to the no-queue ZABBIX version. It showed that performance is indeed much better without the macro, but only 2 times better.

Still not good enough, so we were going the right way with the queues.

Since we are still in development, let's do a quick fix:

update items set snmp_community='isread' where snmp_community like '{$SNMP_COMMUNITY}';

(perhaps for production it's better to use API or UI to fix templates).


The following pic best describes the first launch of ZABBIX with queues:


On the same server which was doing 5k NVPS before, ZABBIX showed 60k NVPS steady.

And finally, for the first time ever, I could load the machine to 50% with ZABBIX.

Why not more? Because then the next limiting factor comes into play - the preprocessing manager.

ZABBIX: preprocessor manager queue buffer

This happened for the first time right after I launched asynchronous SNMP polling, which immediately raised NVPS from 2-3k to 7-10k.

First I noticed a significant delay between polling the data and seeing it in the ClickHouse database. Tcpdumps and selects from Clickhouse revealed that the data gets stuck inside the Zabbix server daemon. Since tcpdump showed that the data was gathered from the network in time, the pollers were not to blame.

Looking at the top -c output, I saw that the preprocessor manager reports its queue size via setproctitle, and that number was constantly growing.

So my first job was to find out why it was growing. As this was a development machine, I had logging set to debug all the time.

The logs didn't show me where the preprocessing manager was spending all its time,
so I added some extra logging and fixed log.c to increase the log timestamp precision to 1/10000 s.
I found nothing except the strange fact that every operation takes 190-222 microseconds. And that was interesting, because it meant something slow happens every time a line is written to the log. Right - it's the logging itself.

After I switched off the logging, the queuing disappeared, so I thought I had solved the problem.

However, a week or so later, when the system started to reach 20k NVPS, I saw the preprocessor manager start queuing again.

Queuing is a good thing, as it provides a buffer for overload periods. But in my case it was happening endlessly - the queue kept growing until ZABBIX couldn't allocate any more memory.

I had to start the investigation again. And this time it was tricky, since I couldn't log every operation - the logging introduced far more problems and overhead than the problem I was looking for.

Finally, from reading old logs and looking at the source, the problem was found:

The preprocessing manager has a socket where it accepts two types of data. First, there are requests from pollers to process newly collected items; these are placed in the queue, and the manager then distributes preprocessing tasks to the worker threads.
The second type of message is the workers' responses with the data that has been preprocessed.

With big batches of collected items (4k at that time, later increased to 32k) the preprocessor manager had no choice but to constantly queue items: when a big chunk of data arrived on the IPC socket, it read it all and then sent item-processing tasks to the worker threads. Since there are far more items in a batch than there are worker threads, it had to queue most of the batch. Then it had to read from the socket again, because to release a worker thread for its next item, the worker has to submit its result first.

(Note, corrected later: about a month afterwards I accidentally realized that message sizes were limited to 256 items, and the only factor behind the queuing was processing speed.)

I altered the code to open a second socket, so the preprocessor manager can decide which data to read - values from pollers or results from workers.

This allows surviving load peaks when new data is processed more slowly than the pollers can collect it. And the bigger the queue got, the more processing time was needed (or it might have been that swapping started on the test system - I had some memory constraints there).

The fix works the following way: the preprocessing manager works as usual until there are 1M records in the queue. Then it stops reading requests from pollers and only feeds data to the workers, until the queue drops below 1M again.

Pollers have to wait until their request is processed, so at peak times, when the queue is full, polling for new data slows down. This is much better than eventually crashing along with the collected data. I have seen the queue reach 20M items.
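The throttling rule itself fits in a few lines. This is only a sketch of the logic described above - the helper names and structure are illustrative, only the 1M threshold comes from the text:

/* sketch of one iteration of the preprocessing manager loop */
#define PREPROC_MAX_QUEUED 1000000

static void preprocessing_manager_iteration(void)
{
        /* results from workers are always read, so workers never stall */
        read_worker_socket_and_flush_results();

        /* new values from pollers are accepted only while the queue is sane;
         * above the limit the pollers block on their IPC send and polling slows */
        if (queued_item_count() < PREPROC_MAX_QUEUED)
                read_poller_socket_and_enqueue_values();

        dispatch_queued_items_to_free_workers();
}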

The preprocessor manager is the only bottleneck left for now, so later I'm considering either duplicating it or splitting its functionality into two threads:

one thread would process requests and the other results.

Making two equally functional threads seems more difficult, as we may end up in a situation where dependent items are processed on different threads, making them wait endlessly for the other item to arrive. Or they would have to share common storage and use mutexes to arbitrate access to it safely.



ZABBIX: the results

The initial goal was to optimize hardware usage; we thought we would be able to reduce 21 servers to 6 without sacrificing service. There were also some plans to start gathering new data by stopping collection of other data we don't need.

We wanted to be able to keep history for at least 3 months, or half a year, by offloading some history data to another MySQL server.

In fact the changes achieved much more.

First of all, the Clickhouse offloading and SNMP improvements alone allow everything to run on the same machine. So we need two, for redundancy.


Fixing the queuing problem raised the overall speed from 5-6k NVPS to 62k NVPS on the same server. I certainly believe that fixing the "preprocessing manager" bottleneck would allow 100k NVPS stable on the same hardware.

Some server numbers:

The current machine is 65% idle, with only one core fully loaded - the one running the preprocessor manager thread.

The slow-CPU-core problem is solvable with a higher-frequency / more modern CPU.

The test server running a single Xeon E3-1280 could achieve 50k stable without the queue optimizations.

It actually performs much better than the older system with dual E5645s.


I tried to run the Zabbix server on a machine with AMD CPUs totalling 32 cores. I was expecting overall performance to rise, but performance turned out to be very poor - around 2-3k NVPS.

It might be that such problems also come from the number of items we have to process; with fewer items the results might be better.

And one more funny thing: having a {$MACRO} of any kind in a key, community or any other item parameter degrades polling speed by a factor of 2, since DC_macro_resolve takes the global config lock.



But let's return to the goals. The initial goals are now achievable on a mediocre laptop. Mine, with an i5-5200U CPU and an SSD drive, could steadily monitor at 5k NVPS with Clickhouse installed locally.

On a modern server with fast cores one can plan on 75k NVPS in production / 120k NVPS stable.

There is a question: why would one need such capacity?

It is important even for current tasks: we reduced the polling interval to 10 seconds to get fast reaction, where before we were only able to poll all SNMP once every 5 minutes.

Even if it's questionable whether anyone really needs such speeds, I would suggest using the improvements to decrease polling intervals 10-15 times.
There are a lot of possibilities that such "HD monitoring" gives.

First of all, event correlation analysis gets much better with higher precision. Second, you get a chance to log things that happen fast. For example, just after starting the new monitoring we could finally discover what causes spontaneous switch availability problems: thanks to the higher precision we could find exactly for how long, where, and which group of switches was in outage.

So, having 20 times more monitoring capacity lets you think about new possibilities on the same installation. Eventually you discover or invent something that creates business value.






Monday, July 23, 2018

Clickhouse vs Elasticssearch

A short and not entirely fair comparison of these two systems.

There are different testing techniques, and each system describes how best to test it. So, to get results that somehow show how the systems behave under Zabbix history load, I tried to reproduce the exact type and pattern of load they would see during Zabbix history saving and retrieval.

I used the HTTP interfaces of both, and curl, to submit and extract data. Data was submitted in chunks of 1000 rows - exactly what ZABBIX does. For data retrieval I ran several tests: extracting the last value, the last 5 values, and all values for an item.

Both databases were empty before the test. In total I added about 30 million records to each. That's quite a low number for judging long-term behavior.

There was already about one month of test Zabbix data collected in Clickhouse, so I did some tests on our Clickhouse install to check for data retrieval degradation with large amounts of data and a long history period. I didn't see any performance drop at all. I had decided I would run the same tests against Elasticsearch if it outperformed Clickhouse on small amounts of data. That didn't happen.

For the test I used freshly installed, untuned databases.

I didn't test MySQL, for two reasons: there are many MySQL vs Clickhouse comparisons showing better performance and data compression for the latter.

Here, and here, or try to google for a dozen others.

And the other reason is what we've learned from the existing Zabbix install: MySQL performance degrades dramatically on large volumes of data; partitioning helps, but it still keeps getting slower.


So the test graph:



A few comments:
1. The only metric where Elastics beat Clickhouse is CPU usage on data fetch, and the difference is quite small. Considering how much CPU Elastics needs for storing data, Clickhouse uses less CPU overall.

2. Please note how little disk, IOPS and CPU Clickhouse needs to write data. Also, due to compression, Clickhouse needs 11 times less drive space than Elastics; compared to MySQL it is 18 times less disk space.

So, the short conclusion: I see NO reason NOT to use Clickhouse.



ZABBIX: clickhouse details

At high rates of new values being written to history (more than 1k NVPS), installing Clickhouse alongside the MySQL server (and probably any other supported OLTP database) makes a BIG difference.

Clickhouse requires virtually no CPU and disk IO to write data to the drive.

Let's start with the disadvantages of Clickhouse:

The biggest one is the merge delay. On systems where sharding is used, data may appear on graphs only after 5-20 minutes. This does not happen when there is only one node, or several nodes holding the same information.

That merge delay is acceptable for analytical data, and the rest of the functionality, such as triggers, is not affected.

Clickhouse uses a little more CPU on data fetch. Since it is only marginally more than Elastics, and considering the CPU that Elastics spends on saving data, this can hardly be called a disadvantage.

With Clickhouse there is no need to clean the history tables and no need for trends. The housekeeper won't clean data in Clickhouse (use the DB's own methods to delete old partitions).

The impact of Clickhouse is hard to overstate: it uses almost no CPU or disk IOPS for storing large amounts of data. Loads of 50-100k new rows per second can be handled on an ordinary spinning SATA drive.

Installing it alongside MySQL on the same machine frees up the MySQL resources (or those of perhaps any other OLTP database) that were spent on storing history data, while Clickhouse itself uses almost no resources at all. I would put it this way: you will barely notice Clickhouse or MySQL in top or htop on a system processing 20-30k NVPS.

Installing Clickhouse is rather simple - there is nice Yandex documentation about it: https://clickhouse.yandex/docs/en/getting_started/ . You can build from sources, or there are packages for the most popular Linux flavors.

With Clickhouse installed, you need to create the tables for storing history data. It's easy to do via the clickhouse client (https://clickhouse.yandex/docs/en/interfaces/http_interface/). Launch clickhouse-client and execute the queries:

CREATE TABLE zabbix.history (
    day        Date,
    itemid     UInt32,
    clock      DateTime,
    ns         UInt32,
    value      Int64,
    value_dbl  Float64,
    value_str  String
) ENGINE = MergeTree(day, (itemid, clock), 8192);

CREATE TABLE zabbix.history_buffer (
    day        Date,
    itemid     UInt32,
    clock      DateTime,
    ns         UInt32,
    value      Int64,
    value_dbl  Float64,
    value_str  String
) ENGINE = Buffer(zabbix, history, 16, 30, 100, 50000, 1000000, 1000000, 10000000);

This will create the table and a buffer table for efficient data merging.

That's it. Clickhouse is ready to accept data.

Then prepare Zabbix:
apply the clickhouse patch (it fixes both the server and the frontend) and set up the config files. In zabbix_server.conf:


HistoryStorageURL=http://localhost:8123
HistoryStorageTypes=uint,dbl,str,text
HistoryStorageType=clickhouse 
HistoryStorageTableName=zabbix.history_buffer


In the frontend configuration, zabbix.conf.php:

global $HISTORY;
$HISTORY['url']   = 'http://localhost:8123';
$HISTORY['types'] = ['uint','dbl','str','text'];
$HISTORY['storagetype']   = 'clickhouse'; 

$HISTORY['tablename']   = 'zabbix.history_buffer';


Then you can check with SQL whether data is actually arriving at the Clickhouse server, by doing a count() or selecting the latest data:

select count(*) from zabbix.history_buffer;

Make sure you query the history_buffer table to see the latest data.
To make ZABBIX support the Clickhouse storage engine, download and apply the clickhouse patch, or download the complete sources with the patches already applied.

No log value type support, sorry folks - never needed it.

So, that's it. Until you get really high rates and volumes of data, Clickhouse works just fine in its default configuration.


Sunday, July 22, 2018

ZABBIX: replacing fping with nmap and "distributed" pinging

Eventually, at some point, I decided that nmap would be better than fping in terms of speed.

A test on a file of 3.5k hosts showed that nmap finishes the job about 2.5 times faster: 40 seconds versus 100 seconds.

I made the changes in /lib/icmpping.c, so presumably it will work for the proxy as well.

The only shortcoming of nmap is that it doesn't actually do a loss check, only an accessibility check with round-trip-time calculation.

I haven't found a way to tell nmap to send 10 packets and report the loss percentage as well as min/max/avg statistics. This was a dead end for some time, but then the idea of "time distributed" monitoring came up.

The idea: instead of sending N packets in a row within quite a short time span during the test period P, it's better to send 1 packet every P/N seconds and have the pings distributed evenly, then calculate loss in triggers.

The idea would have been crazy on the old system: its limit was a 120-second interval, and that was with the load distributed over 10 proxies. But with the new enhancements in code and processing, the system on its own achieves less than a 1.5-second interval between checks of a single host.

That is quite remarkable. In total it's about 28k NVPS.

The checks are set up to test host accessibility once every 10 seconds, which is what runs right now, and triggers are set up to react to 3 consecutive lost pings. So in the worst case the reaction time is 40 seconds. That's OK for most outages and reliable enough not to fire on some sudden packet loss.


The last thing to note: I left the system the possibility of using fping, and it's quite simple - for simple ICMP checks, if there is 1 packet in the icmpping parameters then nmap is used; if more, then fping. This might be misleading for administrators, who could forget about it. On our production system the fping option is not used.



Tuesday, June 19, 2018

ZABBIX: time wasters

During all these Zabbix optimizations there were several situations that showed me what the cost of (lacking) experience is: lots of time wasted on nothing.

The first one: when I was writing the Clickhouse history processing, I found out that the Zabbix history processes were constantly busy requesting data from Clickhouse.

My initial thought was that it was the ValueCache being filled. But it never stopped, even after an hour of running.

Measurements showed that in the same time span a Zabbix installation working on a MySQL DB made about 600 requests to history. When Zabbix was switched to Clickhouse, it made almost 16k requests. It would have made more, but the CPU was the limiting factor.

Both the MySQL and Clickhouse history tables were empty.

The problem was in the Clickhouse response handling. When no data was found, both DB back-ends returned nothing, but unlike the MySQL driver, I had coded mine to return a FAIL status when there was no data. It might have been a typo or a deliberate decision, I don't remember now.

In the Zabbix history library, FAIL is treated as a DB failure, so it retries the same request endlessly.

That was it. Two or three days of hassle.


The second one:
This one spanned two or three weeks. It started when my development laptop couldn't handle more than 1k NVPS; beyond that I saw the preprocessing queue grow and item processing get delayed.

At that time I decided to switch to some real server hardware. Then some other things took time, like fixing API scripts to manage objects (there are more than 50k of them), refactoring code, Git, and so on.

So yesterday, when I put the refactored, fixed asynchronous SNMP lib onto the production server, after about 5k NVPS I saw queuing.

Half a day, an evening and an early morning gave no result - beyond 3-5k NVPS the preprocessing just couldn't handle it all.

Looking through the code and the logs gave no results. I could see that the constant item queuing was consuming 100% of the preprocessor manager thread's time.

At some point I decided to add extra time precision to the logging, as it wasn't clear what was taking most of the time. Zabbix debug logging truncates timestamps to milliseconds.

I fixed log.c, and it showed an interesting picture: almost every logged operation takes about 0.2 milliseconds (150-210 microseconds, to be precise).

Which means that something slow happens each time the server logs a line. Well, that was the answer - it's the logging itself that slows everything down.

Switching logging to anything below level 4 (DEBUG) eliminates the queuing and slow SNMP processing completely.

The funny thing is that I have known this for more than 10 years.

Debug logging has always been an expensive thing in terms of resources.

Especially for network hardware: switching logging on moves lots of processing from the data plane to the control plane, and it typically ends with losing control of the device.

And even on servers, where you have plenty of CPU, it is something that should be used with care and an understanding of the probable outcome.

NB: however, several days later I saw the preprocessing manager queuing again, so I'll write about that separately later.


ZABBIX: asynchronous SNMP polling



After fixing the FPING problem, I switched on the SNMP polling threads, and it turned out to be an even bigger problem than the accessibility checks: for each device we do only 5 pings but poll 122 items via SNMP.

Switching on SNMP polling on the test server immediately led to SNMP poller queuing. To keep a healthy balance, I left only 500 devices in the test monitoring installation.

SNMP rate measurements, traffic dumps and code investigation showed a few facts:
  • SNMP polling is synchronous in Zabbix
  • the SNMP poller is very feature-rich
  • it looks like the Zabbix guys have had a lot of experience with old and new devices and packet sizes
  • they also process auto-discovered hosts differently from manually added ones.
So I spent a few days and wrote an asynchronous SNMP module, based on the original Zabbix SNMP code and the async SNMP example from the net-snmp docs (a minimal sketch of that pattern is shown right after this list). Additionally, I had to make a few changes in the following modules:
  • in the DC library which fetches the SNMP tasks for the poller - I removed a few conditions that limited the fetch results to a single host, and another condition that did some response-size calculation and could also reduce the number of selected items
  • I had to change the memory allocation for the ITEM arrays, which were static before the modification; since the array sizes went up from 128 items to 4096, they probably wouldn't fit on the stack anymore
  • I was lucky that the DC library returns the ITEMs for checking ordered by host. I assume it's better to request one item per host at a time, so having the resulting ITEM array grouped by host is very handy for processing.
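For reference, this is the asynchronous net-snmp pattern the module is built around, reduced to a minimal sketch. The address, OID and surrounding function are illustrative and not the actual module code; the net-snmp calls themselves are the standard ones from the async example:

#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>
#include <string.h>
#include <sys/select.h>

/* net-snmp calls this when a response arrives or the request times out */
static int resp_cb(int op, netsnmp_session *sp, int reqid, netsnmp_pdu *pdu, void *magic)
{
        if (NETSNMP_CALLBACK_OP_RECEIVED_MESSAGE == op)
        {
                /* hand pdu->variables over to the item results batch here */
        }
        else
        {
                /* NETSNMP_CALLBACK_OP_TIMED_OUT: mark the host/item unreachable */
        }
        return 1;
}

static void poll_hosts_async(void)
{
        netsnmp_session sess, *ss;
        netsnmp_pdu     *pdu;
        oid             name[MAX_OID_LEN];
        size_t          name_len = MAX_OID_LEN;

        init_snmp("async-poller");

        /* one session and one in-flight GET per host; real code keeps an array of them */
        snmp_sess_init(&sess);
        sess.peername = "192.0.2.1";                  /* illustrative address */
        sess.version = SNMP_VERSION_2c;
        sess.community = (u_char *)"public";
        sess.community_len = strlen("public");
        ss = snmp_open(&sess);

        pdu = snmp_pdu_create(SNMP_MSG_GET);
        read_objid(".1.3.6.1.2.1.1.3.0", name, &name_len);   /* sysUpTime as an example */
        snmp_add_null_var(pdu, name, name_len);
        snmp_async_send(ss, pdu, resp_cb, NULL);      /* returns immediately, no blocking */

        /* one event loop multiplexes every open session on a single select() */
        for (;;)  /* real code leaves the loop when no requests remain outstanding */
        {
                int             numfds = 0, block = 1;
                fd_set          fdset;
                struct timeval  timeout;

                FD_ZERO(&fdset);
                snmp_select_info(&numfds, &fdset, &timeout, &block);

                if (0 < select(numfds, &fdset, NULL, NULL, block ? NULL : &timeout))
                        snmp_read(&fdset);     /* dispatches responses to resp_cb() */
                else
                        snmp_timeout();        /* retries/timeouts, also via resp_cb() */
        }
}

The point of the pattern: one poller thread keeps hundreds of requests in flight and pays a timeout only once per batch, instead of once per unreachable host.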
Now, a few problems I ran into:
  • I had to write proper C code and read up on the libraries - a lot of time was spent not understanding how net-snmp handles PDUs. There were perhaps another 20 coding problems which were easy to catch.
  • When a lot of async SNMP pollers are working, and under some other circumstances, I see the "preprocessing manager" process report huge task buffering via setproctitle (pic), and there are significant data delays visible on the graphs (later I learned this is a separate problem).

What further optimizations could be made:
  • don't request items from a host if the previous N items have failed (this could be done either in fping or, better, in the check logic)
  • sometimes I get a process crash when adding/removing a bunch of devices; I need to figure that out (not true anymore - I haven't had a crash in the poller since)
  • actually, I need to compare the "internal" change against an outside script uploading SNMP results via the trapper interface. Maybe it's feasible to do it outside, given the amount of code modification and the possible problems, including bugs and upgrade issues.


So far the results are inspiring: 20-30 times faster than before - in the same poll time it went from 200 values to 4k-6k values, at no CPU cost.

ZABBIX: the preprocessing manager



All went fine until the asynchronous SNMP polling was put on a test server with about 10k hosts to poll.

At the point when the server was polling about 5k SNMP values per second, the preprocessing manager queue started to grow.

At the same time I saw a significant CPU increase for the SNMP polling processes, and the timings got much worse.

I tried playing a bit with the number of threads doing polling, preprocessing and DB syncing, and realized the following:
  • the problem doesn't appear when I have only one or two SNMP pollers; I believe the reason is that two pollers can hardly reach 5k SNMP NVPS
  • it is the same problem I saw on my laptop; at that moment I thought the reason was the slow CPU
  • it doesn't depend on any other threads or their count
  • it doesn't depend on history database read speed; switching reads off doesn't change anything
  • the more values the system manages to gather, the worse it gets: starting 6-10 SNMP pollers leads to fast preprocessing manager queue growth, and the bigger the queue gets, the fewer items get processed
  • after 100k items in the queue, Zabbix processes only 0-200 items per 5 seconds
  • typically, after 2-60 minutes Zabbix crashes with SIGSEGV (signal 11) in the preprocessor manager
Picture: a 1M+ item queue, pollers consuming up to 15% CPU and working 3-10 times longer than normal

Overall it seems I have two alternative ways to solve this:
  • figure out what the problem is and fix it
  • downgrade to the 3.0 line (I'd probably have to rewrite some fixes), since the preprocessing manager was introduced only in 3.4; all the API scripts might also need fixing in that case.

Sure, I'll try the first one first. It's nice that the problem is quite easy to reproduce, and there is helpful debug output.

Update: the second way doesn't fit, as 3.0 doesn't have the history interface and the possibility of offloading history data to BigData storage.

Update: the issue is fixed; it was another time waster caused by lack of experience.