Thursday, August 16, 2018

ZABBIX: v4 in production and the first production collapse

We've got v4 running in production now.




While we were trying to fix the nmap source address problem, we accidentally disrupted network connectivity, which led to a mass system collapse.

The root cause of the collapse is slow MySQL update speed. This, in turn, stems from Zabbix's habit of logging everything to the MySQL database.

On top of that, by design the Zabbix server and the frontend exchange data through MySQL.

What happens is:
Network collapse -> lots of updates (problems, events, etc.) -> DB collapse -> slow frontend -> more DB load due to page refreshes.
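
To make the feedback loop concrete, here is a toy model of it. This is only a sketch: the function name, the tick-based model and all the numbers are made up for illustration and are not measurements from our setup. It counts the DB work left unfinished after each tick, and optionally adds extra load whenever a backlog exists, standing in for people refreshing a slow frontend.

```python
def simulate_backlog(arrivals, db_capacity, refresh_cost=0):
    """Return the unfinished DB work (backlog) after each tick.

    arrivals     -- units of new DB work generated per tick (events, problems)
    db_capacity  -- units of work the database can complete per tick
    refresh_cost -- extra work added on every tick that starts with a backlog,
                    a stand-in for page refreshes on a slow frontend
    All values are hypothetical units, not real Zabbix metrics.
    """
    backlog = 0
    history = []
    for new_work in arrivals:
        load = backlog + new_work
        if backlog > 0:
            # The frontend is already slow, so refreshes pile on more load.
            load += refresh_cost
        backlog = max(0, load - db_capacity)
        history.append(backlog)
    return history


if __name__ == "__main__":
    burst = [5, 5, 50, 5, 5]  # a single burst of events on the third tick
    print(simulate_backlog(burst, db_capacity=10))
    print(simulate_backlog(burst, db_capacity=10, refresh_cost=10))
```

Without the refresh load the backlog slowly drains after the burst. With it, the backlog keeps growing even though the burst is over, which is the spiral described above.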

Some thoughts:

1. Exchange online data (current problems) via some in-memory method.
2. Reduce DB load and event logging as much as possible, and (probably) send the events to ClickHouse.
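
A minimal sketch of how the two ideas could fit together, assuming nothing about Zabbix internals: the class name ProblemCache, its methods and the sink callable are all hypothetical. Current problems live in a dict that the frontend could read directly, while the event log is buffered and handed off in batches to a sink, which here stands in for something like a ClickHouse bulk insert.

```python
from collections import deque


class ProblemCache:
    """Hypothetical in-memory store of current problems plus a batched event log."""

    def __init__(self, sink, batch_size=1000):
        self._active = {}         # trigger_id -> current problem info
        self._pending = deque()   # events waiting to be written out
        self._sink = sink         # callable that receives a list of events
        self._batch_size = batch_size

    def on_event(self, trigger_id, host, is_problem, clock):
        """Record a problem or recovery event for a trigger."""
        if is_problem:
            # Keep the original start time if the problem is already known.
            self._active.setdefault(trigger_id, {"host": host, "since": clock})
        else:
            self._active.pop(trigger_id, None)
        self._pending.append((clock, trigger_id, host, int(is_problem)))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def active_problems(self):
        """What the frontend would read: no database query involved."""
        return list(self._active.values())

    def flush(self):
        """Hand all buffered events to the sink as one batch."""
        if not self._pending:
            return
        batch = list(self._pending)
        self._pending.clear()
        self._sink(batch)


if __name__ == "__main__":
    written = []
    cache = ProblemCache(written.append, batch_size=2)
    cache.on_event(1, "web01", True, 100)
    cache.on_event(2, "db01", True, 101)   # second event triggers a batch write
    cache.on_event(1, "web01", False, 102)
    print(cache.active_problems())
    print(len(written))
```

The point of the design is that a burst of events costs one bulk write per batch instead of one update per event, and page refreshes never touch the database for the list of current problems.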

I have also found that the trends_uint table is huge. I thought it was the housekeeper's job to keep it in check. Since HouseKeepingFrequency=0 is set in the config file, I hadn't checked it before. So I ran "update items set trends=0" to make sure.

But let's return to the collapse under high load.
Both 1) and 2) need to be fixed. But something also has to be done about the PHP part of the frontend widgets, which seems to be nonfunctional when a large number of hosts are inaccessible.

