Monday, February 25, 2019

The “monitoring domains” idea


This is one of the most important articles about clustered Zabbix monitoring. It covers common terminology and architecture.
When there is a classical server with a set of proxies, there is a strict host -> server or host -> proxy relation. It is defined by setting the proxy attribute on a host in the Zabbix configuration panels.
But when we have a cluster, we would like to bind a host to a set of servers or proxies. This is what “monitoring domains” are for.
So, a “monitoring domain” is an identifier of a set of proxies and servers that should be used to process a certain host.
A monitoring domain name may look just like an ordinary internet domain name, “service.example.org”, or it may be any string that is valid as a domain name, like “server_farm1_major_hosts”; the first form looks better, though.
Let’s look at “monitoring domains” from a user’s perspective.
Say you have a server in the central office and two proxies in two branches.

In a classic configuration this might be a server and two proxies.
Hosts from branch1 are bound to proxy1, just as hosts in branch2 are monitored by the proxy in branch2. Hosts in the central office don’t have a proxy set, so they are monitored directly by the server.
In a clustered architecture, for such a case you would want to define three domains: central_office, branch1 and branch2. Since a cluster is an option, there are presumably reasons for building a redundant structure.
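
As a rough illustration of the idea above, a monitoring domain can be pictured as a name that maps to a set of nodes, with each host bound to a domain rather than to a single proxy. The domain names come from the example; the node names, host names and the helper function are hypothetical and are not part of Zabbix.

```python
# Hypothetical sketch: a domain names the set of nodes (servers or proxies)
# that are allowed to process the hosts bound to it.
DOMAINS = {
    "central_office": {"server1", "server2"},
    "branch1": {"proxy1a", "proxy1b"},
    "branch2": {"proxy2a", "proxy2b"},
}

# Each host is bound to a domain instead of a single proxy.
HOST_DOMAIN = {
    "hq-router": "central_office",
    "b1-switch": "branch1",
    "b2-switch": "branch2",
}


def candidate_nodes(host, alive):
    """Return the live nodes of the host's domain, sorted for determinism."""
    domain = HOST_DOMAIN[host]
    return sorted(DOMAINS[domain] & set(alive))
```

With both branch1 proxies alive, `candidate_nodes("b1-switch", ...)` returns both of them; if one fails, the host can still be processed by the other.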


Friday, February 22, 2019

Zabbix Cluster Database ID conflict resolution


The major problem for a clustered database is conflict resolution. In general, “an application has to be aware of clustering and must have its own means of conflict resolution”.
I have decided to make the cluster granularity host-based. There are many reasons for it, and the most important one is that, by Zabbix architecture, one host must be processed by a single server. This is due to preprocessing, item dependencies, and less obvious reasons like proper poll planning to prevent host hammering.
Since different servers will update different hosts, there seems to be no source of conflicts or data integrity problems.
The exceptions are logged things like audit and, most importantly, the events and problems tables.
When I tried to launch a “cluster” of two server instances on a single database, the first problem I saw was ID conflicts when adding new events and problems.
I guess that, for database compatibility or maybe for historical reasons, Zabbix doesn’t rely on database ID generation and instead uses its own C function to generate new IDs.
And this is the place where changes are coming. In theory there are several approaches to keeping IDs unique across the cluster:
  • Generating some long unique prefix (or UUID) and building the ID by adding a locally unique counter to it. This approach has the benefit of being automatic (no need to set a SERVER ID manually), but it doesn’t fit the existing database structure.
  • Stepping: each server is assigned a sequential SERVER ID (say, 0…7). A new row ID is generated by taking the lowest multiple of the number of servers that exceeds the current maximum ID and adding the SERVER ID to it. IDs will never be sequential in this case. They will have gaps, but we have 8 bytes for INT now, so who cares? The disadvantage is the need to assign an ID manually to each server.
  • Ranging: each server is assigned its own range of IDs, say:
    • server1: 0…1000000,
    • server2: 1000001…2000000,
    • and so on.
This might be calculated from the server ID, which, like in the previous case, could be sequential (0…7). The disadvantage is that the existing methods and some of the ID generation logic will not work, and more complicated queries will be required to retrieve the initial ID.

So, the only really compatible and feasible way to implement ID generation for Zabbix clustering is stepping. And it turned out to be simple: only a few lines of code to fix, actually.
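
A minimal sketch of the stepping rule described above, written in Python rather than in the C of the actual Zabbix patch. The function name and parameters are hypothetical, and `last_id` is assumed to be the largest ID the server currently sees in the table.

```python
def next_id(last_id, server_id, num_servers):
    """Stepping: lowest multiple of num_servers above last_id, plus server_id."""
    if not 0 <= server_id < num_servers:
        raise ValueError("server_id must be in range 0..num_servers-1")
    # Lowest multiple of num_servers that is strictly greater than last_id.
    base = (last_id // num_servers + 1) * num_servers
    return base + server_id
```

With 8 servers and a current maximum ID of 10, server 3 gets 19 and server 0 gets 16. Every ID produced by a server has the remainder `server_id` modulo `num_servers`, so two servers can never generate the same ID even if they start from the same `last_id`.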

Monday, February 18, 2019

Zabbix cluster and the “CAP theorem”


A Zabbix cluster assumes some sort of database redundancy. Having the ClickHouse patch makes things a bit easier: the amount of database writes is reduced drastically. As ClickHouse is replicated by its nature, there is no problem with history storage.
Consistency:
But for the rest of the data there are some issues as Zabbix keeps state in the database:
  • It uses a kind of redundant trigger state storage: trigger states, events and problems
  • It updates item reachability status in the DB
So running Zabbix on a cluster means dealing with such data inconsistency, as it must be assumed that just before a server went out of service it didn’t update its database, or did only a partial update.
For example, a trigger state is recorded, but no event or problem is generated.
So I did the following: when a new host is assigned to a server, the server resets the host’s item states to “Undefined”, so the next trigger recalculation causes a new trigger value to be written and the initial problem and event records to be generated. This may cause some repeated events and problems, but it guarantees that the item is in a proper state in the DB after the first trigger calculation.
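
The effect of the reset can be sketched as follows. This is only an illustration of the reasoning, not the Zabbix code: the state dictionary, the event list and both function names are hypothetical, and real Zabbix state handling is considerably more involved.

```python
UNDEFINED = "undefined"


def take_over_host(stored_states, trigger_ids):
    """Forget the stored state of every trigger of a newly assigned host."""
    for trigger_id in trigger_ids:
        stored_states[trigger_id] = UNDEFINED


def recalculate(stored_states, trigger_id, new_value, events):
    """Write the state and record an event only when the value changes."""
    if stored_states.get(trigger_id) != new_value:
        stored_states[trigger_id] = new_value
        events.append((trigger_id, new_value))
```

Because an undefined state never equals a calculated value, the first recalculation after a takeover always writes the state and produces an event. That is where the possible repeated events come from, and also why the stored state and the events end up consistent again.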
Availability:
This is up to the cluster design. Since I am on PostgreSQL right now, I have decided to use BDR. It allows each server to have its own database and promises that the databases will become the same “eventually, over time”.
Partition tolerance:
The point of having its own database on each server is to be tolerant of cluster partitioning. I assume the database is installed on the same machine as the Zabbix server or close to it, so in the event of cluster partitioning, server-to-database connectivity stays alive. I am not really sure what will happen to Zabbix if it doesn’t; it definitely has some means of waiting for the database to come back. Perhaps metrics processing stops until then, as the history syncers would wait for the DB to come back to write new trigger states.
Overall:
Zabbix fits very well into a cluster that works by distributing hosts among servers. None of the CAP theorem problems are big issues. I assume that even having history written to an SQL database is still OK as far as CAP considerations go.



Friday, September 7, 2018

ZABBIX: the "UI meltdown prototype"


Zabbix's UI speed problems have the same root as the database problems - they try to combine two worlds:

 - online situation
 - logs and statistics data from a huge OLTP database.

The combination of the two is quite informative, but at the same time it is slow because large volumes of data have to be selected.

So to make things really fast, and to have monitoring that works when everything else fails (which is when we need online monitoring most), we need to split them.

The "analytical" and slow part of the data, like acknowledge statuses, messages, confirmations and alert data, must not be used for online reports.

The proper place for such information is the database.

Preferably not an OLTP one but a BigData one, which is already used for history. But since there is not that much of this data, it fits OK into the OLTP DB for now.

So finally I've got some time to implement the idea.


The Idea points:

Firstly, the server depends even less on the database.

Secondly, online monitoring will depend less on the database and live only in the server's memory, which has very nice implications that I'll write about later.

Thirdly, in times of crisis (huge network outages), when we need fast online info the most, the database will have a hard time because of the large number of event and problem update records going into it. So we'd rather have a fast out-of-DB way to know the current monitoring state, and analyze events and problems later, when we have time.

The coding

I have created one more queue in the configuration cache, named "problems". The queue is updated on every event for which Zabbix's problem export feature is invoked. On recovery events, problems are deleted from the queue.

Since I needed indexing by ID and no sorting at all, I decided to go with Zabbix's hashmap library and used hashing based on the eventid.

In the trapper code I added processing of one more message:
{"request":"problems.get"}

Processing is rather simple: iterate over the problems hashset and create one huge JSON response.

I added locking there, since events are registered from different threads (I assume it's the preprocessing manager that calculates triggers) and the export seems to happen somewhere else (an export thread?).
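
The structure described above can be sketched in Python as a locked map keyed by event ID. This is an analogy for the C code, not the code itself: the class name, the method names and the shape of the JSON response are assumptions made for illustration.

```python
import json
import threading


class ProblemsCache:
    """In-memory set of active problems, indexed by event ID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._problems = {}  # eventid -> problem record (a plain dict)

    def on_problem(self, eventid, record):
        # Called when a problem event is exported.
        with self._lock:
            self._problems[eventid] = record

    def on_recovery(self, eventid):
        # Called on a recovery event; unknown IDs are ignored.
        with self._lock:
            self._problems.pop(eventid, None)

    def problems_get(self):
        # Build one JSON response from everything currently stored.
        with self._lock:
            return json.dumps({"data": list(self._problems.values())})
```

The lock is held for the whole serialization, which keeps the response consistent at the cost of briefly blocking writers; for a prototype with a few thousand problems that trade-off seems acceptable.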

And it was the easiest part.

Fixing the UI was quite a deal.

The Zabbix team used the MVC ideology, so there are three scripts to fix, and different types of API calls are spread among all three of them.

And there are separate classes to learn for the server and for HTML construction.

Actually, it is very nice from the code and structure point of view. It's a "heaven for a perfectionist" programmer, as it should be, but I wasn't quite ready to fit it all into my brain at once. Whatever, it was fun anyway.

The result: I was able to get a stable UI response time of less than 100 msec, about 50-70 msec of which is response wait time. I've also realized that, when retrieving data from the DB without the trapper, it was enough to remove all those nice pop-up boxes with problem descriptions and acknowledge statuses to get an OK response time under DB load during an outage of a huge number of hosts. It's closer to 2000 msec, but still acceptable. So a kind of "easy" fix is also possible.



The problems: sometimes PHP cannot parse the output from the Zabbix server. Dumps show that PHP gets all the data, and I can see the buffer in the PHP logs, but the parser doesn't parse it. Because it happens so rarely, I haven't found the root cause yet.

Another small disadvantage is the problem start time: since I keep the time of the event, it usually reflects the time when zabbix_server was last restarted, which happens a lot during development but rarely in production.



The big advantage: it's fast. No, it's just _really_ fast!

It feels just like opening a static page. And it really doesn't need php-fpm resources, which means we can give up separate frontend machines. That's not critical, but it's good.

So, this is a full win in terms of prototype and usability, but it's not quite production-ready right now, because the fixes have broken the "problems" panel and the changes were made by modifying existing widgets, whereas they should be added as separate ones.


Some tests:



I've tested two important online monitoring widgets, problems and problems by severity, in an "average collapse" situation (10k hosts are inaccessible).
The numbers are the time in msec until the widget has been rendered. For the trapper widgets there is no difference between the OK and the 10k-fail situations.

On the full 50k+ collapse, neither report could be rendered within 300 seconds.

I've decided not to waste time making nginx wait for fpm forever just to find out that it might render in, say, 12 minutes.

I'd say that 2 seconds is acceptable, 10 is bad, and more than 10 has no chance in production.


And there is one other great thing about this: now we can get the online Zabbix state without the DB.

If and when Zabbix's OLTP mass updates, like events and problems, go to a BigData DB, it will be very, very close to becoming a real clustered server:

The idea: two (or more) servers could use the trapper interface to make sure they are both functional and split the host load between them: odd host IDs to one server, even IDs to the other. Both servers will have the full item database in both the DB and memory, but one condition should be added to the code that distributes tasks to pollers: don't take tasks for the other server's hosts while it is alive. So when it dies, its hosts will be activated and polled.
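
The extra condition for the two-server case could look roughly like this. It is a hypothetical sketch: the function name, the numbering of servers as 0 and 1, and the choice of which server gets even IDs are assumptions made for illustration.

```python
def should_poll(hostid, my_index, peer_alive):
    """Decide whether this server should poll the given host.

    Server 0 owns even host IDs and server 1 owns odd ones. A server
    also polls the peer's hosts for as long as the peer is down.
    """
    owns_host = hostid % 2 == my_index
    return owns_host or not peer_alive
```

While both servers are alive, each host is polled by exactly one of them; when one dies, the survivor polls everything.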

Sure, there are complications (LLD items, active agents, Zabbix senders), but I think they're solvable.

Saturday, September 1, 2018

ZABBIX: the story

Long, long ago, perhaps two months ago by now, I started a one-week project that looks like it will be finished only by this fall...


So, ZABBIX.

At work we use it for monitoring all the nice infrastructure we have. It so happened that the monitoring service kept growing but had no true owner for the last four years. So I decided to get my hands on it.

There were several reasons:
  • technical problems (hangs, slow speed, short history period)
  • it was quite outdated

but the primary ones were:
  • it's fun,
  • it's new knowledge,
  • because "I can",
  • it's a real task that brings a lot of value to the business

So, before I started doing anything, there was a time for thinking, circling around, drawing schemes, learning new versions of ZABBIX and getting access to the production monitoring servers. That led me to the understanding that monitoring consists of two major parts: real-time and analytical.

At the end of May 2018, at an internal hackathon event, I completed the ClickHouse history module. I would call this the most important change and a "must have" for everyone who doesn't use Elasticsearch. But even those who do, please read a comparison. Overall, ClickHouse works very well as historical storage. It is quite easy to set up and it makes a big difference.

To make ZABBIX support ClickHouse storage, you have to be ready to compile ZABBIX from sources with patches. ClickHouse must also be set up along with the initial data schema. Look here how to do it.

But the story didn't end there. After completing the history storage part, some other problems came into view. One was the number of servers we used for monitoring. We had twenty-one(!), fifteen of which were proxies to keep up with the load.

A long and winding road led me to optimize that significantly. I was under the impression that most problems came from the constant forking of fping processes. After some research and efforts to reduce the number of forks per second (fps :) ), I did the nmap solution along with the "time distributed pinging" idea. Overall, that made it possible to do 30-40 times fewer forks while doing checks three times faster, and to spend about half as many CPU ticks on accessibility checks.


Then it was SNMP querying. Compared to accessibility checks, we have ten times more items to poll. There are almost 3.5 million items now. And polling them all in a reasonable time frame couldn't be done with only one server, even with 1000 threads.

I had previously done several projects with mass SNMP polling in Perl and Python, and tried AsyncIO and plain threading, so I already knew that asynchronous polling would do the job. A description of the asynchronous polling idea is in this article.

To implement mass polling, many things had to be done in the ZABBIX server beforehand. ZABBIX is somewhat ready for batch item processing, but there are many places in the code where processing is done one by one, quite ineffectively. I named this part of the job "BULKiing".

When batch processing of items was completed, it also became apparent that there are two major architectural bottlenecks. One is the preprocessor manager with its "queuing problem" and the fact that all items are processed by a single thread; the other is global locking of the configuration. Both problems lead to a situation where systems with slow CPU cores do not perform well running ZABBIX under high load. Such systems show low NVPS numbers while most CPUs are 80-90% idle, waiting for internal locks.

Both bottlenecks are limited by single CPU core speed. To keep the preprocessor manager from getting stuck under high load, I changed the way it processes data so that it can prioritize tasks and avoid queuing. Now it can decide what is better to do: get new results from pollers or send results to workers. Details are here.

To solve the "global locking" problem, I tried enlarging the batches of items to poll from 8k to 32k, which gave about a 10k NVPS performance increase and allowed us to marginally exceed the initial goal of 50k NVPS. But after we finally put the modified version of ZABBIX into production, the problem with slow cores and global locking appeared again.

Even though it could be solved with a faster CPU, I decided to solve this as well. This would make it possible to fully utilize older server hardware with slower CPUs but a large number of cores. Details are here.

And then, to resolve another "bottleneck" problem, I changed the architecture to process data with more than one process manager. Overall, together with all the previous fixes, it raised the total processing capacity from 5k to 88k NVPS on the same hardware, additionally eliminating the need for the proxies we had put in for load sharing.

The same code could show 118k NVPS on the faster-core test machine.

Lastly, I will be doing some UI speed work, because the UI is not functional under stress and even in normal conditions it takes 5-10 seconds to refresh panels.

And to make things complete, there are several short notes which are more about mood and fun and strange situations, architecture questions, lack of knowledge,  coincidences, links to sources, some work records, and so on.

nb: All the numbers are given for our installation of 47k hosts/4M items. I assume many of the problems are unlikely to appear on installations with fewer objects.

nb2: The primary reason I write this is the chance to catch a glimpse of the engineering mood and fun. I accidentally found my 6 y/o records from the time we were launching a new BRAS type. Reading them was kind of a joy of reliving that time of being consumed by engineering, ideas, solutions and sleepless nights. So I am writing now so that I'll have something to read in 2024.



Friday, August 31, 2018

ZABBIX: DZSA (direct zabbix server access)

The essential UI problem is that it uses the API.

The problem is efficiency, especially during downtime. Especially when the DB is slooow under 100k+ updates of problems, events and triggers. (Remember the crash?) The most important widgets look like this most of the time:

Solution:
Keep a list of active problems on the Zabbix server side and refresh it on the go as problems appear or close.

Fetch it via the trapper interface.
Build "current status" widgets out of it, avoiding the API as much as possible.
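
For the "fetch it via the trapper interface" step, the framing of a request can be sketched as below. It assumes the classic Zabbix wire format (a "ZBXD" signature, a flags byte of 0x01, an 8-byte little-endian data length, then the JSON body); the helper names are hypothetical, and `problems.get` is the request name the prototype ended up using, not a stock Zabbix request.

```python
import json
import struct

HEADER = b"ZBXD\x01"  # protocol signature plus flags byte


def frame(payload):
    """Wrap a JSON-serializable payload into a trapper packet."""
    body = json.dumps(payload).encode("utf-8")
    return HEADER + struct.pack("<Q", len(body)) + body


def unframe(packet):
    """Extract and decode the JSON body of a trapper packet."""
    if packet[:5] != HEADER:
        raise ValueError("unexpected header")
    (length,) = struct.unpack("<Q", packet[5:13])
    return json.loads(packet[13:13 + length].decode("utf-8"))
```

A client would send `frame({"request": "problems.get"})` over a TCP connection to the server's trapper port and pass the reply through `unframe`; the socket handling is left out to keep the sketch short.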

Side effect:
Have a list of "internal problems" so that the monitoring system administrator can see what's going wrong in the monitoring itself; use the same idea to build such an interface.

The progress

server->trapper->php->browser "hello world" prototype seems to be working:




Let's see how it all works together once completed, a bit later.

Friday, August 17, 2018

ZABBIX: nmap, fping, collapse reasons

Today I finally had time to figure out what happened during the Zabbix crash two days ago.


The reason so many hosts became inaccessible was the nmap failure. I fixed it by switching to fping, but I still had to figure out why nmap had stopped working.

During the crash we tried to fix a network accessibility problem, and someone left a NAT rule to map ICMP traffic (for most networks) to the proper src address.

And this is important: I don't know the exact reason, but it seems that the NAT rule set at the POSTROUTING stage WAS applied to the traffic, and the returning traffic for some reason could not be received by the same socket. I am not sure whether it went directly to the socket without being de-NATted or couldn't reach the socket after being de-NATted by iptables. It doesn't really matter.

Fping kept working under the same conditions because it was already sending traffic from the correct address.

For nmap the situation is a bit more complicated: when the -S option is used, it doesn't actually send any traffic. To be precise, no traffic leaves the system for the desired destination. Only setting the outgoing interface helps (perhaps one more reason to learn the nature of RAW sockets).

So, to use a source address with nmap, either the Zabbix server must know the outgoing interface or the outgoing interface has to be set in the configuration. The latter seems simpler.

And one more thing to consider: the nmap parameters were actually wrong, so alongside ICMP it was sending packets to ports 443 and 80 for host discovery. This is not right, as it may harm slow devices, so I've fixed the nmap options.
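
For illustration, an ICMP-only nmap invocation with an explicit source address and interface could be assembled like this. The exact options used in the patched server are not given here, so this is an assumption: the helper is hypothetical, and the flags are standard nmap ones (-n for no DNS resolution, -sn for no port scan, -PE for ICMP echo discovery only, -S for the source address, -e for the outgoing interface).

```python
def nmap_ping_args(targets, source_addr=None, interface=None):
    """Build an nmap command line that probes hosts with ICMP echo only."""
    args = ["nmap", "-n", "-sn", "-PE"]
    if source_addr:
        args += ["-S", source_addr]
    if interface:
        # Per the note above, -S alone was not enough; the interface is set too.
        args += ["-e", interface]
    return args + list(targets)
```

Giving -PE explicitly replaces nmap's default discovery probes, which is what keeps the packets to ports 443 and 80 from being sent. The resulting list can be passed to `subprocess.run`; the addresses and interface name in any real call depend on the installation.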