Tuesday, July 24, 2018

ZABBIX: the results

The initial goal was to optimize hardware usage: we thought we would be able to reduce 21 servers to 6 without sacrificing service. There were also plans to start gathering some new data by no longer gathering other data we don't need.

We wanted to be able to keep history for at least 3 months, or even half a year, by offloading some history data to another MySQL server.

In fact, the changes allowed us to achieve much more.

First of all, the ClickHouse offloading and SNMP improvements alone allow us to run everything on a single machine, so we need just two machines, for redundancy.


Having the queuing problem fixed allowed us to raise overall speed from 5-6k NVPS to 62k NVPS on the same server. I firmly believe that fixing the "preprocessing manager" bottleneck would allow a stable 100k NVPS on the same hardware.

Some server numbers:

The current machine is 65% idle, with only one core loaded, by the thread running the preprocessing manager.

The slow CPU core problem is solvable with a higher-frequency, more modern CPU.

A test server running a single Xeon E3-1280 could achieve a stable 50k NVPS without the queue optimizations.

It actually performs much better than the older system with dual E5645s.


I've tried running the Zabbix server on a machine with AMD CPUs with 32 cores in total. I was expecting to see overall performance rise, but performance was very poor: around 2-3k NVPS.

It might be that such problems also come from the number of items we have to process. With a lower number of items the results might be better.

And one more funny thing: having a {$MACRO} of any kind in the key, community or any other item parameter makes polling about two times slower, as DC_macro_resolve takes the global config lock.
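
To see how many items are affected, one could scan the item definitions for user macros. Below is a minimal sketch of such a check. The regular expression is my simplified guess at the {$MACRO} syntax (not the server's own parser), and the item dictionaries with "key" and "community" fields are hypothetical sample data, not an actual Zabbix export format.

```python
import re

# Simplified, assumed pattern for user macros such as {$SNMP_COMMUNITY}
# or {$MACRO:context}. This is not the parser Zabbix itself uses.
USER_MACRO = re.compile(r"\{\$[A-Z0-9_.]+(?::[^}]*)?\}")


def items_with_macros(items):
    """Return the items whose key or SNMP community contains a user macro."""
    flagged = []
    for item in items:
        fields = (item.get("key", ""), item.get("community", ""))
        if any(USER_MACRO.search(field) for field in fields):
            flagged.append(item)
    return flagged


# Hypothetical sample data, for illustration only.
sample = [
    {"key": "ifInOctets[1]", "community": "public"},
    {"key": "ifInOctets[{$IFINDEX}]", "community": "public"},
    {"key": "sysUpTime", "community": "{$SNMP_COMMUNITY}"},
]

for item in items_with_macros(sample):
    print(item["key"], item["community"])
```

Items flagged this way are candidates for replacing the macro with a literal value, if polling speed matters more than convenience.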



But let's return to the goals. The initial goals are now achievable on a mediocre laptop. Mine, with an i5-5200U CPU and an SSD drive, could steadily monitor at 5k NVPS with ClickHouse installed locally.

On a modern server with fast cores one can plan to achieve 75k production / 120k stable NVPS.

There is a question: why would one need such capacity?

It is important even for current tasks: we reduced the polling interval to 10 seconds to have a fast reaction, and then we were able to poll all SNMP every 5 minutes.

Even if it's questionable whether anyone really needs such speeds, I would suggest using the improvements to decrease polling intervals 10-15 times.
Such HD monitoring opens up a lot of possibilities.
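
The arithmetic behind this is simple: NVPS is roughly the number of items divided by the polling interval. A small sketch, assuming every item is polled at the same interval; the item count of 1,500,000 is a made-up number for illustration, while the 5k and 62k NVPS figures are the ones measured above.

```python
def required_nvps(item_count, interval_seconds):
    """New values per second needed to poll every item once per interval."""
    return item_count / interval_seconds


def shortest_interval(item_count, capacity_nvps):
    """Shortest uniform polling interval (seconds) a given capacity allows."""
    return item_count / capacity_nvps


ITEMS = 1_500_000  # hypothetical item count, not taken from our installation

# Polling everything every 5 minutes versus every 30 seconds.
print(required_nvps(ITEMS, 300))
print(required_nvps(ITEMS, 30))

# How far the interval could shrink at the measured 62k NVPS.
print(round(shortest_interval(ITEMS, 62_000), 1))
```

With these assumed numbers, going from a 5 minute to a 30 second interval raises the load from 5k to 50k NVPS, which is why a 10 times shorter interval needs roughly 10 times the capacity.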

First of all, analyzing event correlation gets much better with higher precision. Second, you get a chance to log things that happen fast. For example, just after starting the new monitoring we could finally discover what causes spontaneous switch availability problems. Thanks to the higher precision we could find exactly for how long, where, and which group of switches were in outage.

So, having a monitoring system with 20 times more capacity lets you think about new possibilities you can achieve on the same installation. Eventually you discover or invent something that brings business value.





