Tuesday, July 31, 2018

ZABBIX: the second server

Tried to launch ZABBIX with all the fixes on the role of servers monitoring, started to segv.This was a bug in error logging.

There is another (big?)
problem there:
It looks like poller needs to be further altered to support asynchronous checks for agents.Or there some kind of locking occurs.

Strange thing that simple checks are slow too. Or might be global thing like DNS or something

This installation is rather small: 

And the server is sure enough idle 97%.

upd: nothing interesting, new server was not permitted by ip filters to access hosts via SNMP, and for agents there was also no permittion to it to request data. 

Thursday, July 26, 2018

ZABBIX: what does a cat when it have a free time?

subj. ZABBIX does something quite opposite.

I had successfully changed the code to have as many preprocessor managers as i want.

But whatever i did, stuck at 60k nvps.
My initial thought was it's preprocessor manager slows things down. So i've rewrite it to have 4 managers.

But then i saw that running 4 managers spreads CPU load between them, but gives no NVPS increase.

Whatever i thought, i couldn't find a reason to such a behavior. But today it struck me - it's worker threads and workers are waiting for them.

So, i straced worker thread.

.. semop....
Ok, is it GLOBAL config LOCK again?

NO, it's mutex 0, the mutex dedicated for logs.
But i don't see anything from workers in logs and it is loglevel is 3.

And guess what?
Let me show you the main loop of worker code:

I commented it out. It does log rotation. With locking. Inside loop which runs possibly in 100 threads and which task is to process an item as fast as possible.

So all the worker threads was keep trying to put LOG lock and to rotate the log file if it's bigger then config values. I guess it's a developers mistake. 

So, without log handling workers could do a bit faster and server achieved approximately 85-90kNVPS, while having only 25% idle. 

So, perhaps, that's the limit for double  E5645 system with 12 physical cores (and total 24 cores if we count hyper threaded ones).

During all this tests i've also found preprocessor manager priority flaw - it was more likely that it will gather queues before the buffer is full. And when the buffer is full it would be working OK, but it would be slowing down pollers.

Now it's just fine. Queue is close to zero. Each one of managers writing about 110k items each 5 seconds. So totally it is 110k*4threads/5seconds=88kNVPS.

Wednesday, July 25, 2018

ZABBIX: crashtesting, records #2

....failed very fast, on lots of process the system starts to swap

fixed the buffer, the system does fine with 20 snmp threads with, about 25knvps

have significant queuening with 43k required nvps, try to make 30 threads


didn't help, changed delay from 90 to 120, zabbix somehow stopped to work normallu, restarted.

doesn't work either, reduced numbre of pollers to 20
queueing, 33nvps.....
raised pollers back to 40....

OK, 30NVPS ONLY SNMP - NO PROBLEM
changed back delay to snmp to 90s.... lets look
it still queueng but just above 10sec, at planned load of 43knvps

reduced delay of snmp to 100 sec,pinger still off, let' see
ok, the system is stable to complete zero queues at 33kNVPS of clean SNMP load.

Now, let's test the FPING 100 threads:
the old version could handle the load without big broblems with up to 1sec (1sec had quieuieng in the first column)

NMAP: 100 threads works worse then FPING
high system load,  32% idle

Reduced NMAP threads to 20:
It does just like 100 threads FPING but system idle is 62%

Reduced NMAP threads to 10: it's worse, 20 seems more optimal, it's 82% idle

Now try fping in 20 threads:

The normal SNMP is:
0:00.26 /usr/local/sbin/zabbix_server: poller #1 [got 249 values in 5.005509 sec, getting values]                                
 7942 zabbix    20   0 13,670g 271628 213924 S   0,3  1,7   0:00.26 /usr/local/sbin/zabbix_server: poller #4 [got 175 values in 4.851671 sec, getting values]                                
 7943 zabbix    20   0 13,670g 265808 208076 S   0,3  1,6   0:00.25 /usr/local/sbin/zabbix_server: poller #5 [got 142 values in 4.914348 sec, getting values]                                
 7946 zabbix    20   0 13,670g 268228 210508 S   0,3  1,6   0:00.25 /usr/local/sbin/zabbix_server: poller #8 [got 175 values in 5.141629 sec, getting values]                                
 7947 zabbix    20   0 13,670g 274664 216944 S   0,3  1,7   0:00.26 /usr/local/sbin/zabbix_server: poller #9 [got 158 values in 5.044240 sec, getting values]                                
 7949 zabbix    20   0 13,670g 264984 207320 S   0,3  1,6   0:00.24 /usr/local/sbin/zabbix_server: poller #11 [got 184 values in 5.038127 sec, getting values]                               
 7951 zabbix    20   0 13,670g 273344 215680 S   0,3  1,7   0:00.27 /usr/local/sbin/zabbix_server: poller #13 [got 138 values in 4.924217 sec, getting values]                               
 7953 zabbix    20   0 13,670g 287580 229860 S   0,3  1,8   0:00.27 /usr/local/sbin/zabbix_server: poller #15 [got 172 values in 5.133177 sec, getting values]                               
 7955 zabbix    20   0 13,670g 258828 201032 S   0,3  1,6   0:00.24 /usr/local/sbin/zabbix_server: poller #17 [got 142 values in 4.982964 sec, getting values]            

140-150 per thred per 5 seconds
tried to make 300 threads



fping synthetic results:
 fixed fping
real    0m42,646s
user    0m0,236s
sys    0m0,336s


total CPU usage: 0,572s

fping normal

real    1m33,304s
user    0m0,104s
sys    0m0,276s
total CPU usage: 0,380s

nmap
real    0m3,252s
user    0m0,220s
sys    0m0,076s
total CPU usage: 0,296s




ZABBIX: crashing, getting syssegvs and so on -dairy records

And still i see the server is failing (segv sometimes).
It's not nearly as stable as the production one but tipycally these problems are quite easy to catch

The problem now is that server restart is taking more and more time (about 2-3 minutes)



The preprocess manager problem



The questions:

"something impossible happen logs"

why snmp is slow to be enabled? - figure why unreachible threads are waiting for nothing



huge load after some time!!!! net_snmp_close()???

in the night had huge number of must use snmp_select_socket2 (or so messages) - seems that under certain conditions i leave opened SNMP sessions.
Too many sessions cause high load and what seems to be select() call is not working, so essentially  all the gathering process breaks
TO DO: add some profiling, look where i may leave sessions unlclosed, consider removing (select) code and replacing by stupid simple wait call.


morning: huge snmp queue, seems that threads work normaly, i haven't found any abnormalies, just 10 them is not enought. (actually, they work much longer then at start, their run time raized from 6-9 seconds to 14-25, but that might be due high system load)

Raised snmp pollers number to 25, starting at 08:08
8 58 flight normal queue 0

seems something strange has happen at 11 24-11 34, apparently serever has crashed at that time, there is a long queue in SNMP in the browser left
11 41 started again ... actually the server is alive, but is't swapping and there are 16M records in the queue of the preprocess manager.

Myabe that's just too much hosts???
Maybe that's too big load from fpings
Now i disabled about 20k hosts, reduced number of threads to 20 for both pingrer and poller, restarted  the server, let's see what's happen


Having only half of the load (and also number of DB Syncers increased to 40) system is seems to be doing fine for at least 1 hour for now. And this is the time when i finaly see some SNMP data gathered in the graphs.

I will look at it for another 2 hours then will start gradually adding new devices.
UPD so far evth is fine, no queueing, system 90% idle


Another problem to consider - devices marked non - snmp acessible stay too long in such a state and i've never saw any processing done by unreachible pollers.
I exect i've broken some fetching logic in DC functions and that makes it such a long to mark devices as SNMP accessible again


So my theory is: 1.  I haven't enought data syncer processes enabled before so they where constantly busy in filling the value cache (ClickHouse not really fast for single host queries). Acutally i would suggested to fill the cache on the start by some bulky way - this would mean one query with millions of data. Unfortunately, i am not sure if this is what really zabbix and it's triggers need.


 The biggest thing that bothering me right now is occasional preprocessing manager queueng. I am almost sure that the queuing depends on two factors: system load and number of working threads (actually it might be one of them, and this two factors are closely linked). There one big optimization that could be done on the SNMP poller - is to enable bulk queriing, this will save some ticks of CPU and network PPS, might be easier for network devices, but it will not faster things really as anyway each SNMP thread will be waiting for inacessible devices timouts which more longer than making 100 queries to accessible device, and they happen in parralel in the asyncronious model.

Another big change and improvement that might be made is switching from fping to nmap for hosts checking. Question one - is how valuable the packet loss rates for understanding device reachibility. I have a thought that it might be more valuable to make 10 pings each 3 seconds then make them all at once each 30 seconds. In such a way of checking the packet loss rate will be done in a trigger by calculating average on base of last 5-10 checks. To speed up reaction of a device has become non accessible, last 2-3 checks must be considered.


ok, morning, 3 17. I see that the installation has crashed in preprocessing manager caught syssegv.

All pingers has been off (i've set up StartPingers=0)

I really believe that it's preprocessing manager that has to be blamed, in particular,  queueing of items. So, i'd like to fixid to one of the following
- find an answer - why is queuing happening at all???
- do not allow queueng
- after certain amount of queueing throttle the processes to wait till preprocessing manager be free (if it's load problem, which seems not)

I have nice thing - i know that running zabbix in debug mode almost immediately causes the queueuing. So i need to trace down all the decision process to understand, why the hell does it queues the messages


I see that most of it's time it's spend processing IPC requests of type 2 (probably, that the result of the poller's work). added extra flush and history flush logging to see if it's happening at all. On the next step will add _flush call result









the Github and sources, naming

The sources are in the two projects at github
Cleaned and checked sources and patch files are at
https://github.com/miklert/zabbix
At least separate clickhouse patch will be there soon.

The very unstable development version is at
https://github.com/miklert/xe-rabbix

Now it's based on 3.4.9 code.

It's a few words about compiling zabbix - you'll need MariaDB-shared library, not just client and server libriares and development files. I've spent quite a time finding that out.

For clickhouse compile with curl libriary

Regarding the name. We had to call it somehow internally, and every monitoring system must have something common in name with ZABBIX.

So we started to call it xerabbix [herabbiks] first, which appeared from Russian soft-sounding word 'dick' and ZABBIX, but it was too rude, so after a month of development it's transformed to xe-rabbix and now pronounced as [iks-e-rabbiks], leaving  rudeness in the past.

Nowdays we interpret the name as eXtra Edition of zabbix.

Tuesday, July 24, 2018

ZABBIX: Configuration cache, DC_poller_get_items, DCpoller_requeue_items, queues

A test machine was able to achieve 51kNVPS steady..

It more then enough. Test version could run for more then a week at 16kNVPS stable, no memory leaks. So we have  planned to put fixed ZABBIX code into production.

The old installation was ZABBIX 2.1 version. The migrating job included lots of other works like changing OS, fixing automation. Long night work.

Just after the server was upgraded to new OS, we started the database and the server and .... OOOPS it could only do 5kNVPS. The CPU was 90% idle at that time. We have seen this already, right? To finish the job the new server left work for  production on test machine.
 
Ok, lets figure what's going on the new server.
This is statistics that strace shows for a poller thread.



It clealry shows that threads  waiting for it's turn to get a lock .


Now a few ideas about locking.

ZABBIX holds in memory huge configuration structure, which it holds in several tables (perhaps it's lists or trees).

Apart from that there are 5 queues of tasks by polling type (ICMP, POLLER, PROXY, so on).
In each queue there are structures ordered and (perhaps) hashed by next check time. The structures references items.

So, each time something gets access to the configuration, it sets global lock via mutex and semop, reads/updates configuration and then unlocks.



 The global lock domain picture:


After analyzing the dbconfig.c and dbcahce.c code,  i decided, that for some operations global locks might be avoided.  To be precise - all operations from poller and pinger are safe to go in parallel as soon as they use and lock their own queues:



I've decided to split each queue type into 4 pieces. To keep host-per thread persistence items are distributed by queues by (hostid%4) hash.
To maintain polling threads persistence, threads are bounded to queues by their hash (procnum%4).

It should be at least 8 threads of a type (but better 16 threads or more), so at any time at least one thread could request data from DC cache, while others might be doing polling job.

So the result?

But before i'll tell about one funny twist in the situation:

After fixing the queues and making them work i see that the server performance is still not better.

Strace shown that poller still spend most time  semop calls waiting for global lock (mutex ZBX_MUTEX_CONFIG, 4).


2512:20180723:121820.512 In DCget_user_macro() macro:'{$SNMP_COMMUNITY}'
zabbix_server [2512]: zbx_mutex_lock: global config lock attepmt!
zabbix_server [2512]: zbx_mutex_lock: global config unlocked

After doing some profiling i found that  problems was in the macro nobody really needs, it's macro for community, and we only have one community anyway!

And then i even thought that maybe ALL the problems with locking  where due to this macro and one week of coding was just due to my lack of  knowledge of ZABBIX.

So, i reverted to no_queue ZABBIX version. It have shown that performance is much better without the macro, but only 2 times.

Still no good, so were going the right way with queues.

As we are in the development, lets do fast fix:

update items set snmp_comuntiy='isread' where snmp_community like '{$SNMP_COMMUNITY}'

(perhaps for production it's better to use API or UI to fix templates).


The following pic best describes the first launch of ZABBIX with queues:


On the same server which was doing 5k NVPS before,ZABIX shown 60k NVPS steady.

And finally for the first time ever i could load machine to 50% with ZABBIX.

Why no more? Because then next limiting thing comes to play - preprocess manager.

ZABBIX: preprocessor manager queue buffer

This happened first time right after I've launched asynchronous SNMP polling, which immediately raised NVPS from 2-3k to 7-10k

First I've noticed there is a significant data delay from polling to seeing it in the ClickHouse database. Doing tcpdumps and selects from Clickhouse revealed that data stucks inside zabbix server daemon. Since tcpdump shown that from network traffic data was gathered in time, pollers where not to blame.

Looking to the top -c output i saw that preprocessor manger has queue parameter information set by setproctitle which where constantly growing.

So my first job to find why it was growing. As this was development machine i had logging set to debug all the time.

Logs didn't show me where the process manager was spending all its time.
After I added some extra logs and fixed log.c to increase logging time precision to 1/10000s.
I've found nothing, but the strange fact that all the operations takes 190-222microseconds. And that was interesting because it meant something long happens each time i write information to logs. Right, it's logging itself.

After i switched off logging queuing disappeared, so i thought I've solved the problem.

However a week a so later when systems started to reach 20k NVPS i saw that preprocessor manager starts queuing again.

Queuing is a good thing as it provides a buffer for overload times. But in my case it was happening endlessly, queue was growing all the time, until ZABBIX couldn't allocate more data.

I had to start investigation again. And this time it was tricky since i couldn't use logging of every operation as logging introduced much more problems and  overhead then the problem i was looking for.

Finally, from reading old logs, looking at the source, the problem was found:

Process manager has a socket where it accept two types of data. First, there are request from pollers to process newly collected items. Requests then placed to queue. And then preprocess manager distributes tasks for preprocessing to worker threads.
The second type of message is worker responses with the data that has been preprocessed.

With big batches of collected items (at that time they where 4k size, later they increased to 32k) preprocessor manager had no choice but to constantly queue items: when a big chunk of data arrived to the IPC socket, it read all the data and then sent tasks to process items to worker thread. As it is much more items in a batch then worker threads, it had to queue most part of the batch. Then it has to read from socket again since to release worker thread to do it's next item, it has to submit the result first.

(Note, fixed later: about month after i accidentally realized that message sizes where limited to 256 items and the only factor for queuing was processing speed)

I've altered the code to open the second socket so preprocessor manager could decide which data to read - values from pollers or results from workers.

This allowed to survive load peak times when processing of new data was going slower then pollers could collect it.  And the bigger queue was, the more processing time needed (or it might be that swapping started at the test system, i had some memory constraints on the test system).

The fix works the following way: preprocess manager works the usual way until it gets 1M records in the queue. Then it stops reading requests from pollers and only sends data to workers. Until there is less then 1M in the queue.

Pollers have to wait untill their request processed, so on peak times when queue is full, polling for new data slows. This is much better then eventually crashing with data collected. I saw queue reaching 20M items.

Preprocessor manager is the only bottleneck left for now, so i consider later to either duplicate it or split functionality to two threads.

One thread will process requests and the other results.

Making two equally functional thread seems to be more difficult as we may fall in the situation when depended items are processed on different threads, making them wait endlessly for the other item to arrive. Or they have to share common storage and use mutexes to arbiter access to the storage safely.