Zabbix's UI speed problems have the same root as its database problems: they try to combine two worlds:
- the online situation
- logs and statistics data pulled from a huge OLTP database.
Combining the two gives pretty good informativeness, but at the same time it is slow due to the need to select big volumes of data.
So to make things really fast, and to have monitoring that works when everything else fails (which is when we need online monitoring most), we need to split it.
The "analytical" and slow part of the data, such as acknowledge statuses, messages, confirmations, and alert data, must not be used for online reports.
The proper place for such information is the database.
Preferably not the OLTP one but the BigData one already used for history purposes. But since there isn't that much of this data, it fits fine in the OLTP DB for now.
So I've finally got some time to implement the idea.
The idea, in points:
Firstly, the server depends even less on the database.
Secondly, online monitoring will depend less on the database: it lives only in the server's memory, which has very nice implications that I'll write about later.
Thirdly, in crisis times (huge network outages), when we need fast online info the most, the database has a hard time due to the large number of event and problem-update records flowing into it. So we'd rather have a fast, out-of-DB way to know the current monitoring state, and take our time later to analyze events and problems.
The coding
I have created one more queue in the configuration cache, named "problems". The queue is updated on each event when Zabbix's problem-export feature is invoked. On recovery events, problems are deleted from the queue.
Since I needed indexing by id and no sorting at all, I decided to go with Zabbix's hashset library and hashed on the eventid.
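As a rough illustration of the queue's behavior, here is a Python sketch of the logic: add on a problem event, delete on recovery, lock around every access. The real patch uses Zabbix's C hashset in shared configuration cache; the class and field names here are made up for the example.

```python
import threading

class ProblemCache:
    """In-memory problem queue keyed by eventid.

    Illustrative stand-in for the hashset used in the actual patch;
    not real Zabbix code."""

    def __init__(self):
        # Events arrive from different threads, the trapper reads
        # from yet another one, so every access is locked.
        self._lock = threading.Lock()
        self._problems = {}  # eventid -> problem details

    def add(self, eventid, host, name, severity, clock):
        """Register a problem when a problem event is exported."""
        with self._lock:
            self._problems[eventid] = {
                "eventid": eventid, "host": host,
                "name": name, "severity": severity, "clock": clock,
            }

    def recover(self, eventid):
        """A recovery event removes the problem from the queue."""
        with self._lock:
            self._problems.pop(eventid, None)

    def snapshot(self):
        """What the trapper iterates to build its JSON response."""
        with self._lock:
            return list(self._problems.values())
```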
In the trapper code I added one more message type:
{"request":"problems.get"}
Processing is rather simple: iterate over the problems hashset and build one huge JSON response.
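A client can query this the same way any Zabbix trapper request works: a "ZBXD" header with a little-endian length, then the JSON body. Below is a minimal Python sketch of such a client; the `problems.get` request is of course the custom one added by this patch, not part of stock Zabbix, and the host/port are placeholders.

```python
import json
import socket
import struct

def build_packet(payload):
    """Frame a JSON payload in the Zabbix trapper protocol:
    "ZBXD" + 0x01 flag + 8-byte little-endian body length + body."""
    body = json.dumps(payload).encode()
    return b"ZBXD\x01" + struct.pack("<Q", len(body)) + body

def trapper_request(host, port, payload, timeout=5):
    """Send one framed request and return the parsed JSON reply.
    Simplified sketch: assumes the 13-byte header arrives in one recv."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(build_packet(payload))
        header = s.recv(13)  # "ZBXD" + flags + 8-byte length
        datalen = struct.unpack("<Q", header[5:13])[0]
        data = b""
        while len(data) < datalen:
            chunk = s.recv(datalen - len(data))
            if not chunk:
                break
            data += chunk
    return json.loads(data)

# Example (hypothetical server address):
# problems = trapper_request("zabbix.example.com", 10051,
#                            {"request": "problems.get"})
```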
I added locking there, since events are registered from different threads (I assume it's the preprocessing manager that calculates triggers) and the export seems to happen somewhere else (an export thread?).
And it was the easiest part.
Fixing the UI was quite a deal.
The Zabbix team uses the MVC ideology, so there are three scripts to fix, and the different types of API calls are spread among all three of them.
And there are separate classes to learn for server communication and HTML construction.
Actually, it is very nice from a code and structure point of view. It's a "heaven for a perfectionist" programmer, as it should be, but I wasn't quite ready to fit it all into my brain at once. Whatever, it was fun anyway.
The result: I got the UI response time stable below 100 msec, with about 50-70 msec of response wait time. I also realized that just removing all those nice pop-up boxes with problem descriptions and acknowledge statuses is enough to get an OK response time when retrieving data from the DB without the trapper, even under the DB load of a huge host outage. That's closer to 2000 msec, but still acceptable, so a kind of "easy" fix is also possible.
The problems: sometimes PHP cannot parse the output from the Zabbix server. Dumps show that PHP receives all the data (I can see the buffer in the PHP logs), but the parser doesn't parse it. Due to the rarity of the issue I haven't found its root cause yet.
Another small disadvantage is the problem start time: since I keep the time of the event, it usually reflects when zabbix_server was last restarted, which happens a lot during development but rarely in production.
The big advantage: it's fast. No, it's just _really_ fast!
It feels like opening a static page. And it really doesn't need php-fpm resources, which means we can give up separate frontend machines. That's not critical, but it's just good.
So, this is a full win in terms of prototype and usability, but it's a bit non-production right now, because the fixes have broken the "problems" panel, and the changes were made by patching existing widgets when they should be added as separate ones.
Some tests:
I've tested two important online monitoring widgets, "problems" and "problems by severity", under an "average collapse" situation (10k hosts inaccessible).
Numbers are the time in msec until the widget was rendered. For the trapper-based widgets there is no difference between the OK and the 10k-fail situations.
In a full 50k+ collapse, neither report could fit into 300 seconds.
I decided not to waste time letting nginx wait for fpm forever just to find out whether it might render in, say, 12 minutes.
I'd say 2 seconds is acceptable, 10 is bad, and more than 10 has no chance in production.
And there is one other great thing about this: now we can get the online Zabbix state without the DB.
If and when Zabbix's OLTP mass updates, like events and problems, move to a BigData DB, it will be very, very close to becoming a real clustered server.
The idea: two (or more) servers could use the trapper interface to make sure both are functional and split the host load between them: odd hostids to one server, even ids to the other. Both servers would have the full items database in both the DB and memory, but one condition should be added to the code that distributes tasks to pollers: don't take the other server's host tasks while that server is alive. When it dies, its hosts are activated and polled.
Sure, there are complications: LLD items, active agents, zabbix senders; but I think it's all solvable.
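The odd/even split with failover can be sketched in a few lines. This is just an illustration of the distribution condition described above, not actual Zabbix poller code; `peer_alive` stands in for whatever trapper-based liveness check the servers would use.

```python
def hosts_to_poll(hostids, server_index, peer_alive, n_servers=2):
    """Return the hostids this server should poll.

    Shard hosts by hostid modulo the server count; take over a
    peer's shard only when that peer stops answering.
    Illustrative sketch, not real Zabbix code."""
    mine = []
    for hostid in hostids:
        shard = hostid % n_servers
        # Poll our own shard; skip the peer's shard while it is alive.
        if shard == server_index or not peer_alive(shard):
            mine.append(hostid)
    return mine
```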