Tuesday, July 24, 2018

ZABBIX: preprocessor manager queue buffer

This happened for the first time right after I launched asynchronous SNMP polling, which immediately raised NVPS from 2-3k to 7-10k.

First I noticed a significant delay between polling the data and seeing it in the ClickHouse database. Running tcpdump and selecting from ClickHouse revealed that the data was getting stuck inside the zabbix_server daemon. Since tcpdump showed the data was collected from the network in time, the pollers were not to blame.

Looking at the top -c output, I saw that the preprocessor manager's queue counter (reported via setproctitle) was constantly growing.

So my first job was to find out why it was growing. As this was a development machine, I had logging set to debug all the time.

The logs didn't show me where the preprocessing manager was spending all its time, so I added some extra log lines and patched log.c to increase the timestamp precision to 1/10000 s.
Even then I found nothing, except for the strange fact that every operation took 190-222 microseconds. That was interesting, because it meant something slow was happening every time I wrote to the log. Right: it was the logging itself.
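
For reference, here is a minimal sketch of what such a precision bump could look like (this is not the actual log.c patch; the function name and format string are illustrative):

/* Illustrative only: format a log timestamp with 1/10000 s precision
 * instead of milliseconds. Not the actual zabbix log.c code. */
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

static void format_log_time(char *buf, size_t len)
{
        struct timeval  tv;
        struct tm       tm;

        gettimeofday(&tv, NULL);
        localtime_r(&tv.tv_sec, &tm);

        /* tv_usec / 100 gives four digits: tenths of a millisecond */
        snprintf(buf, len, "%04d%02d%02d:%02d%02d%02d.%04ld",
                        tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
                        tm.tm_hour, tm.tm_min, tm.tm_sec,
                        (long)(tv.tv_usec / 100));
}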

After I switched off logging, the queuing disappeared, so I thought I had solved the problem.

However, a week or so later, when the system started to reach 20k NVPS, I saw the preprocessor manager begin queuing again.

Queuing is a good thing, as it provides a buffer for overload periods. But in my case it never stopped: the queue kept growing until ZABBIX couldn't allocate any more memory.

I had to start the investigation again. This time it was tricky, since I couldn't log every operation: logging introduced far more overhead and problems than the problem I was looking for.

Finally, by reading the old logs and looking at the source, I found the problem:

The preprocessing manager has a single IPC socket on which it accepts two types of messages. The first type is requests from pollers to process newly collected items; these requests are placed into a queue, and the manager then distributes the preprocessing tasks to its workers.
The second type is responses from workers carrying the data that has already been preprocessed.

With big batches of collected items (at that time they were 4k in size, later increased to 32k) the preprocessor manager had no choice but to queue items constantly: when a big chunk of data arrived on the IPC socket, it read all of it and then handed out processing tasks to the workers. Since a batch contains far more items than there are workers, most of the batch had to be queued. Then the manager had to read from the socket again, because a worker can only take its next item after it has submitted its previous result back to the manager.
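
To make the bottleneck clearer, here is a schematic sketch of that single-socket flow. The types and helper functions are made up for illustration; this is not the actual Zabbix source:

/* Illustrative sketch of the original single-socket flow, not real Zabbix code:
 * poller batches and worker results arrive mixed on one IPC socket. */
typedef enum { MSG_POLLER_BATCH, MSG_WORKER_RESULT } msg_type_t;

typedef struct
{
        msg_type_t      type;
        int             item_count;    /* up to the batch size, e.g. 4k items */
        /* ... item payload, worker id, etc. ... */
}
ipc_msg_t;

/* hypothetical helpers standing in for the real IPC and queue code */
extern void     ipc_recv(int fd, ipc_msg_t *msg);
extern int      idle_worker(void);                      /* -1 when none */
extern void     send_to_worker(int worker, const ipc_msg_t *msg, int item);
extern void     queue_item(const ipc_msg_t *msg, int item);
extern void     dequeue_to_worker(int worker);
extern void     flush_result(const ipc_msg_t *msg);

void    manager_loop(int ipc_fd)
{
        ipc_msg_t       msg;

        for (;;)
        {
                ipc_recv(ipc_fd, &msg);         /* requests and results come in mixed */

                if (MSG_POLLER_BATCH == msg.type)
                {
                        /* there are far fewer workers than items in a batch,
                         * so most of the batch goes straight into the queue */
                        for (int i = 0; i < msg.item_count; i++)
                        {
                                int     worker = idle_worker();

                                if (-1 != worker)
                                        send_to_worker(worker, &msg, i);
                                else
                                        queue_item(&msg, i);
                        }
                }
                else    /* MSG_WORKER_RESULT */
                {
                        flush_result(&msg);

                        /* the worker becomes free only after its result is read
                         * here, so the manager must keep draining this same
                         * socket while new batches keep piling into the queue */
                        int     worker = idle_worker();

                        if (-1 != worker)
                                dequeue_to_worker(worker);
                }
        }
}

The important point is that there is only one place to read from, so the manager cannot prefer worker results over new poller batches.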

(Note, added later: about a month afterwards I accidentally realized that message sizes were limited to 256 items, and the only factor behind the queuing was processing speed.)

I altered the code to open a second socket, so the preprocessor manager could decide which data to read: values from pollers or results from workers.

This made it possible to survive peak-load periods when new data was being processed more slowly than the pollers could collect it. And the bigger the queue got, the more processing time was needed (or it might have been that the test system started swapping; it had some memory constraints).

The fix works the following way: the preprocessing manager operates as usual until it has 1M records in the queue. Then it stops reading requests from pollers and only dispatches data to workers, until the queue drops below 1M again.
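
A minimal sketch of that backpressure logic, assuming the two separate IPC sockets described above and a hypothetical queued-item counter (this shows the idea, not the actual patch):

/* Illustrative sketch: poll only the sockets we are willing to read from.
 * poller_fd, worker_fd and the 1M limit mirror the description above. */
#include <poll.h>
#include <stddef.h>

#define QUEUE_SOFT_LIMIT        1000000         /* 1M queued items */

static void     manager_wait(int poller_fd, int worker_fd, size_t queued)
{
        struct pollfd   fds[2];
        int             nfds = 0;

        /* worker results are always welcome: they free a worker
         * and let the queue drain */
        fds[nfds].fd = worker_fd;
        fds[nfds].events = POLLIN;
        nfds++;

        /* new poller requests are accepted only while the queue is below
         * the limit; above it the pollers block and polling slows down */
        if (QUEUE_SOFT_LIMIT > queued)
        {
                fds[nfds].fd = poller_fd;
                fds[nfds].events = POLLIN;
                nfds++;
        }

        (void)poll(fds, nfds, -1);

        /* ... the caller then reads from whichever socket is ready ... */
}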

Pollers have to wait until their request is processed, so at peak times, when the queue is full, polling of new data slows down. That is much better than eventually crashing and losing the collected data. I have seen the queue reach 20M items.

The preprocessor manager is the only bottleneck left for now, so later I am considering either duplicating it or splitting its functionality between two threads.

One thread would process requests and the other results.

Making two equally functional threads seems more difficult, as we may end up in a situation where dependent items are processed on different threads, each waiting endlessly for the other item to arrive. Otherwise the threads would have to share common storage and use mutexes to arbitrate access to it safely.
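
For the shared-storage variant, every queue operation would have to be wrapped in a mutex; a minimal pthreads sketch of that pattern (illustrative names, not Zabbix code):

/* Illustrative sketch of a mutex-protected shared queue that two manager
 * threads could use; the real item storage is omitted. */
#include <pthread.h>
#include <stddef.h>

typedef struct
{
        pthread_mutex_t lock;
        size_t          count;
        /* ... the actual item storage ... */
}
shared_queue_t;

void    shared_queue_push(shared_queue_t *q /* , const item_t *item */)
{
        pthread_mutex_lock(&q->lock);
        q->count++;                     /* plus storing the item itself */
        pthread_mutex_unlock(&q->lock);
}

size_t  shared_queue_count(shared_queue_t *q)
{
        size_t  n;

        pthread_mutex_lock(&q->lock);
        n = q->count;
        pthread_mutex_unlock(&q->lock);

        return n;
}

The downside of that design is that the lock itself becomes a contention point on every value.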


