Monday, February 25, 2019

The “monitoring domains” idea


This is one of the most important articles about clustered Zabbix monitoring. It covers the common terminology and architecture.
In a classical setup with a server and a set of proxies, there is a strict host -> server or host -> proxy relation. It is established by setting the proxy attribute on a host in the Zabbix configuration panels.
But when we have a cluster, we want to bind a host to a set of servers or proxies, and this is what “monitoring domains” are for.
So, a “monitoring domain” is an identifier of the set of proxies and servers that should process a certain host.
A monitoring domain name may look just like an ordinary internet domain name, such as “service.example.org”, or it may be any string that is valid as a domain name, like “server_farm1_major_hosts”; the first form just looks better.
Let’s look at “monitoring domains” from a user’s perspective.
Say there is a server in the central office and two proxies in two branch offices.

In a classic configuration this might be a server and two proxies.
Hosts from branch1 are bound to proxy1, just as hosts in branch2 are set to be monitored by the proxy in branch2. Hosts in the central office don’t have a proxy set, so they are monitored directly by the server.
In a clustered architecture you would define three domains for such a case – central_office, branch1 and branch2. And since clustering is an option, there are probably reasons to make the structure redundant.
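To make the binding more concrete, here is a tiny illustrative C sketch of the idea: instead of a single proxy reference, a host carries a monitoring domain name, and the domain resolves to the set of servers/proxies allowed to process that host. All structure and field names here are hypothetical, not actual Zabbix code.

#include <stdio.h>
#include <string.h>

/* Hypothetical data model: a host refers to a monitoring domain by name,
   and the domain lists every server/proxy that may process that host. */
typedef struct {
    const char *name;           /* e.g. "branch1.example.org" */
    const char *members[4];     /* servers/proxies in this domain */
    int         members_num;
} monitoring_domain_t;

typedef struct {
    const char *hostname;
    const char *domain;         /* replaces the single proxy attribute */
} host_t;

int main(void)
{
    monitoring_domain_t domains[] = {
        { "central_office.example.org", { "server1", "server2" }, 2 },
        { "branch1.example.org",        { "proxy1a", "proxy1b" }, 2 },
        { "branch2.example.org",        { "proxy2a", "proxy2b" }, 2 },
    };
    host_t host = { "db01", "branch1.example.org" };

    /* resolve the host's domain to the set of candidate pollers */
    for (size_t i = 0; i < sizeof(domains) / sizeof(domains[0]); i++)
    {
        if (0 != strcmp(domains[i].name, host.domain))
            continue;
        for (int j = 0; j < domains[i].members_num; j++)
            printf("%s may be processed by %s\n", host.hostname, domains[i].members[j]);
    }
    return 0;
}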


Friday, February 22, 2019

Zabbix Cluster Database ID conflict resolution


The major problem for a clustered database is conflict resolution. In general, “an application has to be aware of clustering and must have its own means of conflict resolution”.
I have decided to make the cluster granularity host-based. There are many reasons for that, and the most important one is that by Zabbix architecture one host must be processed by the same server. This is due to preprocessing, item dependencies, and less obvious reasons like proper poll planning to prevent host hammering.
Since different servers will update different hosts, there doesn’t seem to be a source of conflicts or data integrity problems.
Except for logged things like the audit table and, most importantly, the events and problems tables.
When I tried to launch a “cluster” – two server instances on a single database – the first problem I saw was ID conflicts when adding new events and problems.
I guess that for database compatibility, or maybe for some historical reasons, Zabbix doesn’t rely on the database to generate IDs and instead uses its own C function to generate new ones.
And this is where the changes go. In theory there are several approaches to maintaining unique IDs across the cluster:
  • Prefixing: generate a long unique prefix (or UUID) and build the ID by appending a locally-unique counter to it. This approach has the benefit of being automatic (no manual SERVER ID setting needed), but it doesn’t fit the existing database structure.
  • Stepping: each SERVER is assigned a sequential ID – SERVER ID (say 0…7). A new row ID is generated as the lowest multiple of the number of servers that exceeds the current maximum, with the SERVER ID added to it. IDs will never be sequential in this case; they will have gaps, but we have 8-byte integers now, who cares? The disadvantage is the need to assign the SERVER ID manually on each server.
  • Ranging: each server is assigned its own range of IDs, say:
    • server1: 0 … 1000000,
    • server2: 1000001 … 2000000,
    • and so on.
The range might be calculated from the server ID, which, as in the previous case, could be a sequential 0…7. The disadvantage is that the existing methods and some of the ID generation logic will not work, and more complicated queries for initial ID retrieval will be required.

So the only really compatible and feasible way to implement ID generation for Zabbix clustering is stepping. And it turned out to be simple – only a few lines of code to fix, actually; a sketch of the idea is below.
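Here is a minimal sketch of the stepping scheme, assuming the number of servers and this node’s SERVER ID are known from configuration; the function and variable names are mine, not the ones in the Zabbix source.

#include <stdint.h>
#include <stdio.h>

/* Stepping: a server only ever produces IDs congruent to its own SERVER ID
   modulo the number of servers, so two servers can never collide. */
static uint64_t next_cluster_id(uint64_t max_id, uint64_t servers_num, uint64_t server_id)
{
    /* lowest multiple of servers_num strictly greater than max_id ... */
    uint64_t base = (max_id / servers_num + 1) * servers_num;

    /* ... plus this server's fixed offset */
    return base + server_id;
}

int main(void)
{
    /* e.g. 3 servers, this one has SERVER ID 1, current max ID in the table is 100 */
    uint64_t id = next_cluster_id(100, 3, 1);
    printf("next id: %llu\n", (unsigned long long)id);  /* 103 */

    id = next_cluster_id(id, 3, 1);
    printf("next id: %llu\n", (unsigned long long)id);  /* 106 - gaps, but unique */
    return 0;
}

With 8-byte IDs the gaps cost nothing, and the scheme keeps the existing “read the max, add to it” logic almost untouched.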

Monday, February 18, 2019

Zabbix cluster and the “CAP theorem”


A Zabbix cluster assumes some sort of database redundancy. Having the clickhouse patch makes things a bit easier – the amount of database writes drops drastically. As clickhouse is replicated by nature, there is no problem with history storage.
Consistency:
But for the rest of the data there are some issues, as Zabbix keeps state in the database:
  • It uses a kind of redundant trigger state storage – trigger states, events and problems.
  • It updates item reachability status in the DB.
So Zabbix running on a cluster must deal with data inconsistency: it has to be assumed that just before a server went out of service it didn’t update its database, or did only a partial update.
For example, a trigger state is recorded, but no event or problem is generated.
So I did the following: when a new host is assigned to a server, the server resets the host’s item states to “Undefined”, so the next trigger recalculation will write a new trigger value and generate the initial problem and event records. This may cause some repeated events and problems, but it guarantees that the item is in a proper state in the DB after the first trigger calculation.
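A minimal sketch of that takeover step, assuming a hypothetical “undefined” state marker and an items table keyed by hostid; the table and column names are illustrative, not the exact Zabbix schema, and the DB call is left as a stub.

#include <stdio.h>

#define ITEM_STATE_UNDEFINED 2   /* hypothetical "undefined" marker */

/* Hypothetical takeover step: when this server becomes responsible for a host,
   push the states of all its items back to "undefined" so that the next trigger
   recalculation writes a fresh trigger value and generates the initial event
   and problem records. */
static void on_host_takeover(unsigned long long hostid)
{
    char sql[160];

    snprintf(sql, sizeof(sql),
             "update items set state=%d where hostid=%llu",
             ITEM_STATE_UNDEFINED, hostid);

    /* db_execute(sql);  -- handed to whatever DB layer the server uses */
    printf("would run: %s\n", sql);
}

int main(void)
{
    on_host_takeover(10084);    /* example hostid */
    return 0;
}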
Availability:
This is up to the cluster design. Since I am on PostgreSQL right now, I have decided to use BDR. It allows each server to have its own database and promises that the databases become the same “eventually over time”.
Partition tolerance:
The point of having its own database server per node is to be tolerant to cluster partitioning. I assume the database is installed on the same machine as the Zabbix server, or close to it, so in the event of cluster partitioning the server-to-database connectivity stays alive. I am not really sure what happens to Zabbix if it doesn’t; it definitely has some means of waiting for the database to come back, and perhaps until then metric processing stops, as history syncers would wait for the DB to return before writing new trigger states.
Overall
Zabbix fits very well into a cluster that works by distributing hosts among servers. None of the CAP theorem problems are big issues. I assume that even writing history to an SQL database is still OK from a CAP standpoint.