Friday, August 31, 2018

ZABBIX: DZSA (direct zabbix server access)

The essential problem with the UI is that it works through the API.

The problem is efficiency, especially during outages, when the DB slows to a crawl under 100k+ updates of problems, events, and triggers. (Remember the crash?) At such times the most important widgets look like this:

Solution:
Keep a list of active problems on the Zabbix server side, refreshing it on the go as problems appear or close.

Fetch it via trapper interface.
Build the "current status" widgets from that list, avoiding the API as much as possible.
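A minimal sketch of the wire framing such a fetch would use, assuming the standard Zabbix sender/trapper framing (a "ZBXD" signature, a flags byte, an 8-byte little-endian length, then the JSON payload). The `problems.get` request name is a hypothetical one for this prototype, not an existing trapper request:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Frame a JSON payload with the Zabbix "ZBXD" header:
 * 4 signature bytes, 1 flags byte (0x01 = plain data),
 * 8-byte little-endian payload length, then the payload.
 * Returns a malloc'ed buffer; *out_len receives the total size. */
unsigned char *zbx_frame(const char *json, size_t *out_len)
{
    size_t len = strlen(json);
    unsigned char *buf = malloc(13 + len);

    memcpy(buf, "ZBXD\x01", 5);
    for (int i = 0; i < 8; i++)                      /* little-endian length */
        buf[5 + i] = (unsigned char)((uint64_t)len >> (8 * i));
    memcpy(buf + 13, json, len);

    *out_len = 13 + len;
    return buf;
}
```

The PHP side of the prototype only has to send such a packet to the server's trapper port and read a reply framed the same way.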

Side effect:
A list of "internal problems" lets the monitoring administrator see what is going wrong in the monitoring itself; the same idea can be used to build that interface.

Progress

server->trapper->php->browser "hello world" prototype seems to be working:

Let's see how it all works together a bit later, once it's completed.

Friday, August 17, 2018

ZABBIX: nmap, fping, collapse reasons

Today I finally had time to figure out what happened during the Zabbix crash two days ago.


The reason so many hosts became inaccessible was an nmap failure. I fixed it by switching to fping, but I still had to figure out why nmap had stopped working.

While trying to fix the network accessibility problem during the crash, someone left behind a NAT rule mapping ICMP traffic (for most networks) to the proper source address.

And this is important: I don't know the exact reason, but it seems the NAT rule at the POSTROUTING stage WAS being applied to outgoing traffic, while the returning traffic for some reason could not be received by the same socket. I'm not sure whether it reached the socket without being de-NATted, or couldn't find the socket after being de-NATted by iptables. It doesn't really matter.

Fping remained functional under the same conditions because it was already sending traffic from the correct address.

For nmap the situation is a bit more complicated: when the -S option is used, it doesn't actually send any traffic. To be precise, no traffic with the desired destination leaves the system at all. Only setting the outgoing interface as well helps (perhaps one more reason to learn how raw sockets behave).

So to use a source address with nmap, either the Zabbix server must determine the outgoing interface itself, or the interface has to be set in the configuration. The latter seems simpler.

And one more thing to consider: the nmap parameters were actually wrong, so alongside ICMP it was sending host-discovery packets to ports 443 and 80. That is not right, as it may harm slow devices, so I've fixed the nmap options.


Thursday, August 16, 2018

ZABBIX: v4 in production and the first production collapse

We've got v4 working in production now.

While we were trying to fix the nmap source address problem, we accidentally disrupted network connectivity, which led to a mass system collapse:

The core reason for the collapse is slow MySQL update speed. This, in turn, has its roots in Zabbix's habit of logging everything to the MySQL database.

And third, by architecture, the Zabbix server and frontend exchange data through MySQL.

What happens is:
Net collapse -> lots of updates (problems, events, etc.) -> DB collapse -> slow frontend -> more DB load due to page refreshes.

Some thoughts:

1. Exchange on-line data (existing problems) via some in-memory mechanism.
2. Reduce DB load and event logging as much as possible, and (probably) send events to ClickHouse.

I have also found a huge trends_uint table. I thought it was the housekeeper's job to purge it, and since HousekeepingFrequency=0 is set in the config file I hadn't checked it before. So I did "update items set trends=0" to make sure.

But let's return to the collapse under high load.
Both 1) and 2) need to be fixed. But something also has to be done about the PHP side of the frontend widgets, which seems to become nonfunctional when a high number of hosts is inaccessible.


Wednesday, August 15, 2018

ZABBIX: upgrade to v4

So far I am very glad to have started the upgrade. There are many changes in the code.

The most important for now: they redid the locking. It's somewhat different from what I did, but perhaps more correct.

Instead of splitting the queues into pieces, there are now separate locks for reads and writes. That sounds promising, since reads can now proceed in parallel. Perhaps it's good, but testing hasn't shown any performance increase yet.

And I don't remember seeing this before:

It's from DCconfig_poller_get_items, the function that retrieves items from the queues for checking. I think it's just not right.

It's much better to make requests asynchronously in every type of check: set some reasonable timeouts and just _forget_ about priorities, per-host timing, and unreachable pollers.

I tried it for SNMP and saw no negative impact.

Whatever number of hosts is unreachable, the thread waits only a single 2-second timeout, whether it's one host or eight thousand.
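That shared-timeout behaviour boils down to computing the remaining wait from one fixed deadline rather than per host. A toy helper under that assumption (not the patched poller code):

```c
#include <time.h>

/* Milliseconds left until the shared deadline; never negative.
 * Every unanswered host is covered by the same single deadline,
 * so total waiting is ~2s whether 1 or 8000 hosts stay silent.
 *
 * Usage in a poller loop (sketch):
 *   clock_gettime(CLOCK_MONOTONIC, &deadline); deadline.tv_sec += 2;
 *   while (replies_pending && (ms = remaining_ms(&now, &deadline)) > 0)
 *       poll(fds, nfds, ms);
 */
long remaining_ms(const struct timespec *now, const struct timespec *deadline)
{
    long ms = (deadline->tv_sec - now->tv_sec) * 1000L
            + (deadline->tv_nsec - now->tv_nsec) / 1000000L;
    return ms > 0 ? ms : 0;
}
```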

A disadvantage of async pollers is that, in theory, they work a bit longer on a batch of hosts than they would on a single item. Each batch of items may take 2 to 8 seconds to process, so it's not a good fit for polling intervals shorter than 10 seconds.

But according to my tests and experience it's no worse than the existing system. Under comparable NVPS it will be very fast, but don't expect 1-2 second polling delays from a system processing 20-30k NVPS.


Friday, August 10, 2018

ZABBIX upgrade to v4, some pure load test

As a final part of this work I want to upgrade everything to the v4 line, to be able to support my changes for as long as possible with minimal effort (and to act as an early adopter and bug fixer for Zabbix).

I have also decided to do some load tests of pure, untouched, virgin Zabbix.

I took a clean v4.0.0 alpha 9 and applied the ClickHouse changes.

By the way, I saw some minor Elasticsearch changes, but more importantly, they seem to have implemented file offloading of events, history, trends, and perhaps something else.

That is valuable. Had I found this earlier, the ClickHouse offloading might have been implemented outside of the main server process, as a file parser. On the other hand, the current implementation is in-process, efficient, and frontend-compatible.

The reason to do a load test is to find out whether Zabbix might be able to do out of the box what we want it to do.

This time I tried to fetch as much SNMP traffic as possible:

pic: (1) the system is 44% loaded, but (2) it is collecting only 21-22k new values per 5 seconds, i.e. 5-6k NVPS. It spends 10-12 seconds (3) waiting for configuration locks; under normal conditions a poller process spends 4-5 seconds per 50-60 values. In total there are 3500(!) (4) poller threads running, and they have also exhausted most of the memory.

I couldn't make it run more than 4k threads out of the box, but assuming linear CPU growth it could scale to 8k after some sysctl tuning and gather 12k NVPS in the best case.

Then I decided to test pure accessibility-check speed again. As I mentioned before, I could achieve 28k NVPS of pure 1-packet accessibility checks on the same machine with the fixed ICMP module, which uses nmap instead of fping.


OK, pure fping. The system is quite loaded: load average is 466 :), no idle time, 10-11k NVPS.

So, Zabbix alone cannot do out of the box what we need, and there is a reason to apply the fping and SNMP patches, since they really make it possible to collect 15 times more data on the same hardware.

Thursday, August 9, 2018

Wow, That's Funny! 100k NVPS exceeded!

The subject is inspired by one of the most popular TED YouTube videos.


I was recently load-testing all the new code together and found a funny thing:

Past a certain point, lowering item delays leads to dramatic degradation of the preprocessing manager.


I added statistics logging to the worker threads and found out that they are free most of the time, so this time it looks like the preprocessing manager is the problem.

After reducing item delays below 30 seconds, I saw the preprocessing managers drop their processing speed from around 80,000 items per 5 seconds to 500-800.

pic: semop 4 is shown, but the real problem is semop 1

Perhaps it's still locking. There aren't that many semop operations according to strace, but visually it's the operation the process freezes on most. I couldn't find a simple way to calculate wait time via strace or anything similar, so I had to rely on my eyes. (Update: RTFM - the -w and -c options combined make strace produce statistics based on wall-clock time.)

At first I thought preproc_sync_configuration was guilty, but that was the wrong path.

After an hour of research I found another bottleneck, and this one is also global and architectural: the history cache locks.

The situation is almost the same as with the configuration lock: globally locked access to history storage leads to the "single CPU problem".

With LOTS of results coming from pollers, threads become limited by a single core's speed when putting data into the cache and reading it back out.

The solution? Nothing new: split it into 4 independent caches, each locked separately and accessed by its own set of threads.

Most of dbcache.c has to be changed, as it's not only the cache that has to be split into 4 parts; there are also lots of static variables declared there.
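Reduced to a toy sketch, the partitioning idea looks like this (hypothetical names; the real change touches much of dbcache.c). An item always maps to the same partition, so flushes hitting different partitions proceed in parallel:

```c
#include <pthread.h>
#include <stdint.h>

#define HC_PARTS 4

/* One lock per partition instead of one global cache lock. */
static pthread_mutex_t hc_lock[HC_PARTS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static unsigned long hc_values[HC_PARTS];   /* toy payload: a counter */

/* Stable item -> partition mapping. */
static int hc_part(uint64_t itemid)
{
    return (int)(itemid % HC_PARTS);
}

/* Critical path: locks only 1/4 of the cache, so up to four
 * threads can add values concurrently. Rare maintenance ops
 * would take all four locks (in a fixed order) instead. */
void hc_add_value(uint64_t itemid)
{
    int p = hc_part(itemid);
    pthread_mutex_lock(&hc_lock[p]);
    hc_values[p]++;
    pthread_mutex_unlock(&hc_lock[p]);
}
```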

Not sure this is needed for production, but it might be a good prototyping experiment.

There is also a question hanging: why does performance rise when I split the preprocessing manager into 4 workers on Xeon 5645 CPUs? (Upd: perhaps the answer is that the preprocessing manager has lots of other processing work to do, which, spread across different CPUs, gives some combined profit.)

Whatever.

Just to make sure the job of maximizing NVPS is complete, I fixed dbcache.c and the HistoryCache locking to support 4 history caches, to see whether this is _the last_ bottleneck.

I did what I consider safe locking: rare and non-critical operations still lock the entire cache, but the most critical ones lock only their own part of it.

Data flushes from the preprocessing manager to the cache, where speed matters most, lock one of the four sub-caches, so such additions can go in parallel. I also had to fix lots of static variables in dbcache.c, as they seem to be used in parallel.

But I have to admit I failed in this experiment. I spent 3 days' free time but still got a very unreliable daemon that kept crashing after 1-2 minutes.

So for the moment I am giving up on this; perhaps I'll return to it later.

pic: four preprocessing managers doing 118k NVPS combined (note: idle time is calculated wrong)



And there is one more reason.

I returned to the test machine with an E3-1280. And wow! Zabbix could poll, process, and save 118k NVPS (pic) with fixed queuing and 4 preprocessing managers. I'd say it's 105-110k stable, while still having 18-20% idle. And that's kind of a lot. My initial goal was to reach 50k. Considering we need 10k on one instance and 2k on the other, that's enough capacity to grow for a while.


Wednesday, August 1, 2018

ZABBIX: multiple preprocessing managers

I decided to do this because one thread of a Xeon 5645 couldn't handle all the preprocessing work, and it might also be needed on faster CPUs at higher loads.

First, some thoughts I had beforehand:
  • 1. There might be a problem if two dependent items go to different managers: they would wait for one another (deadlock).
  • 2. There must be some kind of worker distribution.
My first idea was to split requests and responses across threads, but then I decided it would be much easier to solve 1 and 2 instead.

Since the queuing version already had host-to-queue persistence, I did thread-to-preprocessing-manager persistence the same way (procnum % 4). So all items from a given host land on the same preprocessing manager.

Workers are assigned to preprocessing managers the same way, by hashing their procnum.
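One way to read that mapping as code (a toy sketch with made-up names, not the patched source): both poller threads and workers are pinned to a manager by the same `procnum % 4` hash, so every item of a host reaches one manager and dependent items can never deadlock across managers.

```c
#define PREPROC_MANAGERS 4

/* Poller threads are already pinned to queues per host; pinning each
 * poller thread to a manager by the same hash means all items of a
 * host reach the same manager (solves thought 1: no cross-manager
 * waits between dependent items). */
int manager_for_poller(int procnum)
{
    return procnum % PREPROC_MANAGERS;
}

/* Workers are distributed between managers the same way (thought 2),
 * so each manager has its own fixed set of workers. */
int manager_for_worker(int procnum)
{
    return procnum % PREPROC_MANAGERS;
}
```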


So, after fixing the worker log handling problem, the preprocessors could handle many more requests. I saw a stable value of around 110k per 5 seconds per manager, giving 88k NVPS altogether. I thought of pushing for 100k, but decided not to waste time on this old hardware and to test something more up to date later.

By the way, measuring preprocessing manager statistics is how I prefer to measure NVPS. I assume the server collects a bit more from the network: it makes requests to inaccessible hosts, which return no data and never reach the preprocessing manager.

So that's it. Splitting the preprocessing manager is essential to achieve more than 50k NVPS stable, even on modern CPUs. This is what was actually limiting the test server with the Xeon E3-1280.