Monday, February 18, 2019

Zabbix cluster and the “CAP theorem”


A Zabbix cluster assumes some sort of database redundancy. Having the ClickHouse patch makes things a bit easier: the number of database writes drops drastically. As ClickHouse is by its nature replicated, there is no problem with history storage.
Consistency:
But for the rest of the data there are some issues as Zabbix keeps state in the database:
  • It stores trigger state in a somewhat redundant way: trigger states, events and problems
  • It updates item reachability status in the DB
So running Zabbix on a cluster has to deal with this kind of data inconsistency: it must be assumed that just before a server went out of service it either didn’t update its database at all or made only a partial update.
For example, a trigger state is recorded, but no event or problem is generated.
So I did the following: when a new host is assigned to a server, the server resets that host’s item states to “Undefined”, so the next trigger recalculation causes a new trigger value to be written and the initial problem and event records to be generated. This may cause some repeated events and problems, but guarantees that the item is in a proper state in the DB after the first trigger calculation.
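The reset-on-takeover idea can be illustrated with a toy model. This is only a sketch under my own assumptions, not the actual patch: ToyDB, ToyServer, take_over and recalculate are hypothetical names, and the real Zabbix schema and history syncer logic are far more involved.

```python
from dataclasses import dataclass, field

UNDEFINED = "Undefined"  # hypothetical marker for "state not trusted yet"


@dataclass
class ToyDB:
    """Stand-in for the shared database (not the real Zabbix schema)."""
    trigger_values: dict = field(default_factory=dict)  # trigger id -> value
    events: list = field(default_factory=list)          # (trigger id, value)


@dataclass
class ToyServer:
    """Stand-in for one Zabbix server in the cluster."""
    db: ToyDB
    known: dict = field(default_factory=dict)  # trigger id -> last value seen here

    def take_over(self, trigger_ids):
        # On host assignment, forget whatever the previous owner may have
        # written, so the first recalculation always counts as a change.
        for tid in trigger_ids:
            self.known[tid] = UNDEFINED

    def recalculate(self, tid, new_value):
        # Write the value and an event only when the value changed
        # from this server's point of view.
        if self.known.get(tid, UNDEFINED) == new_value:
            return False
        self.db.trigger_values[tid] = new_value
        self.db.events.append((tid, new_value))
        self.known[tid] = new_value
        return True


if __name__ == "__main__":
    # The failed server recorded the trigger value but never wrote the event.
    db = ToyDB(trigger_values={"t1": "PROBLEM"})
    server = ToyServer(db)
    server.take_over(["t1"])
    server.recalculate("t1", "PROBLEM")
    print(db.events)
```

In this model the partial update left by the failed server is repaired on the first recalculation, at the cost of a possibly duplicated event when the previous owner had in fact written everything.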
Availability:
This is up to the cluster design. Since I am on PostgreSQL right now, I have decided to use BDR. It allows each server to have its own database and promises that the databases will be the same “eventually, over time”.
Partition tolerance:
The point of giving each server its own database is to tolerate cluster partitioning. I assume the database is installed on the same machine as the Zabbix server or close to it, so in the event of a cluster partition the server-to-database connectivity stays alive. I am not really sure what happens to Zabbix if it does not; it definitely has some means of waiting for the database to come back. Until then, metrics processing probably stops, as the history syncers would wait for the DB to return before writing new trigger states.
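To make the "wait for the database to come back" behaviour concrete, here is a generic retry loop. It is a sketch of the pattern only and does not reproduce how Zabbix itself reconnects; wait_for_db and its parameters are hypothetical, and connect is any callable that raises ConnectionError while the database is unreachable.

```python
import time


def wait_for_db(connect, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is exhausted.

    While this loop is blocked, nothing that needs the connection
    (for example writing new trigger states) can make progress.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(delay)  # back off before the next attempt
    raise TimeoutError(
        f"database still unreachable after {max_attempts} attempts"
    ) from last_error
```

The sleep argument is injectable only so that the loop can be exercised without real waiting; with a database on the same machine the loop should rarely run more than once.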
Overall
Zabbix fits very well into a cluster that works by distributing hosts among servers. None of the CAP theorem problems are big issues. I assume that even writing history to an SQL database is still OK from the CAP point of view.
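One possible way to distribute hosts among servers is rendezvous hashing, sketched below. This is an assumption chosen for illustration, not necessarily the assignment scheme used in the cluster described here; owner and distribute are hypothetical helper names.

```python
import hashlib


def owner(host, servers):
    """Pick the server with the highest hash score for this host."""
    if not servers:
        raise ValueError("no servers alive")

    def score(server):
        digest = hashlib.sha256(f"{server}|{host}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(servers, key=score)


def distribute(hosts, servers):
    """Map every host to exactly one of the currently alive servers."""
    return {host: owner(host, servers) for host in hosts}


if __name__ == "__main__":
    hosts = [f"host{i}" for i in range(10)]
    print(distribute(hosts, ["zbx1", "zbx2", "zbx3"]))
    # After zbx2 fails, only the hosts it owned are reassigned.
    print(distribute(hosts, ["zbx1", "zbx3"]))
```

With this scheme a server failure moves only the failed server's hosts, which keeps the number of state resets and repeated events small.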


