Tuesday, September 3, 2019

Glaber modules, Go/Python support thoughts and research



The initial research and lookup on modules and Go/Python possibilities.

Initial research and googling revealed that these are pretty standard features: C functionality can be extended with Go libraries, and there is a way to embed and execute Python code inside C.

Then I finally decided to look deeper at existing Zabbix loadable module functionality.

I had already come across them before, while I was digging into the history syncers because of some problems. At that time I thought it might be better to extend the History interface by making modules, but then I came to the conclusion that modules are far too limited for a fully functional history interface to be implemented through them.

Which is strange.

Yes, because it’s only a small step: add a history read interface, and we would have a modular interface where attaching the proper module is all it takes to get fast and fancy-schmancy features like ClickHouse integration or, perhaps, any other history storage that suits historical time series data.

So this definitely has to be done now. Fortunately, it’s quite simple functionality, and I stress this again: I don’t understand why it wasn’t there right out of the box.

Second, the load_modules procedure. It’s quite obvious and simple, and there are just a few things I need to do there. I’ll need some extra logic to process modules differently, because their callbacks will be different.

Naturally, a module must declare a new API_VERSION, which I’ll call GLABER API VERSION, to be treated and loaded differently from original Zabbix modules.

In the original modules, there is an intermediate structure that holds pointers to the history callbacks responsible for different history types.

I don’t understand the benefit of it. Maybe not yet. But it has a big disadvantage.
I will be extending the list of callbacks: history, polling, and perhaps preprocessing functions might soon be extended by custom community code.
So the intermediate structure would have to change each time a new callback type appears.
That means all the modules would have to be rebuilt from sources on every such change. No good. So far I haven’t found a reason not to use predefined function names, look them up with dlsym, and attach them to a list of callbacks for the corresponding functionality.

After digging through all that and understanding the required Glaber code changes, I decided to start the initial implementation of the Go module.

But before doing something it’s always worth searching: it’s very unlikely that no one has done it already.

And that’s right: there is a nice place, Zabbix Share, with plenty of modules and adaptors, including examples of how to write agent modules in Go, Python and PHP.

It’s not a big difference, since agents use the same module structure and functionality. And some of them have existed for ages. For example, the Golang module dates from 2015!

This is a little bit sad. I’ll say it again: if Zabbix had implemented a full history interface, which would literally take them only a few days, then Zabbix would have been working with any storage for years already.

But ok, we can do it right now. It's never too late.

I have decided to start with Golang. It stands a bit apart from interpreted languages like Python/PHP: it’s compiled to a binary, and it’s faster.

Starting somewhere around Go 1.3-1.5, it’s possible to build modules without the trick of renaming the “main” function, which was mandatory before. Go modules are fully compiled to binary code, so they do not need adapters and can be built directly as shared libraries.

A few notes about Go itself: after looking through its features, I have realized it’s just perfect for most things I’ve already done in Glaber code. It has the very nice concept of channels, which perfectly fits tasks like streaming a flow of metrics to history storage. It has lightweight pseudo-threads named “goroutines”, which make coding async polling really easy: you can write it in a simple synchronous way, and goroutines solve the problem of thread efficiency.
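To make the channel idea a bit more concrete, here is a minimal sketch of streaming metrics to a history writer in batches. It is not Glaber code: the Metric type, BatchWriter, Run and the flush callback are all hypothetical names invented for illustration, and a real writer would send each batch to the storage instead of just recording its size.

```go
package main

import "fmt"

// Metric is a hypothetical stand-in for one history value.
type Metric struct {
	ItemID uint64
	Clock  int64
	Value  float64
}

// BatchWriter drains metrics from in and passes them to flush in slices
// of at most size elements. The remainder is flushed when in is closed.
func BatchWriter(in <-chan Metric, size int, flush func([]Metric)) {
	batch := make([]Metric, 0, size)
	for m := range in {
		batch = append(batch, m)
		if len(batch) == size {
			flush(batch)
			batch = make([]Metric, 0, size)
		}
	}
	if len(batch) > 0 {
		flush(batch)
	}
}

// Run pushes n metrics through a BatchWriter running in its own goroutine
// and returns the sizes of the batches that were flushed.
func Run(n, size int) []int {
	in := make(chan Metric, size)
	done := make(chan struct{})
	var sizes []int
	go func() {
		defer close(done)
		BatchWriter(in, size, func(b []Metric) { sizes = append(sizes, len(b)) })
	}()
	for i := 0; i < n; i++ {
		in <- Metric{ItemID: uint64(i), Clock: int64(i), Value: 1.0}
	}
	close(in) // no more metrics: lets the writer flush the tail and exit
	<-done    // wait for the writer goroutine before reading sizes
	return sizes
}

func main() {
	fmt.Println(Run(10, 4)) // prints [4 4 2]
}
```

The producer side stays plain synchronous code, while the writer runs in its own goroutine. Whether this behaves the same when the Go code is loaded as a shared library from C is exactly what still has to be tested.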

Sure, all of that must be tested and checked first. I have some doubts whether all those nice Go features will work in library mode, when only certain functions are called from C code. But even if they don’t, it’s still worth using Go for code simplification.

Then, memory leaks. This is a concern now. Since there will be lots of new code running, it is something that must be checked well.
Especially in the case of Python. From my previous experience with a similar task with FreeRadius and rlm_perl, it’s sometimes quite hard to eliminate memory leaks: there is too much code and functionality involved to solve the problem in a reasonable time.

Ok, the actual tests. So far I have played a bit with compiling Go code to libraries and have tried to link the libraries into a simple C program. Sometimes it worked, sometimes it didn’t, but it’s enough to conclude that it’s perfectly doable and functional.

So the first plan is to slightly alter module loading and implement support for history interface callbacks without touching the existing module support. Then, after implementing basic reads and writes, I’ll process a few million metrics and look at my primary concerns about Golang:

1. Will Go channels work?
2. Will we see any major memory leaks?

Most likely the existing history_clickhouse will be moved to a separate module as well, to compare C and Go efficiency if needed.

That’s it for now.

Monday, February 25, 2019

The “monitoring domains” idea


This is one of the most important articles about clustered Zabbix monitoring. It’s about common terminology and architecture.
When there is a classical server with a set of proxies, there is a strict host -> server or host -> proxy relation. This is done by setting the proxy attribute of a host in the Zabbix configuration panels.
But when we have a cluster, we would like to bind a host to a set of servers or proxies. And this is what “monitoring domains” are for.
So, a “monitoring domain” is an identifier of a set of proxies and servers which should be used to process a certain host.
A monitoring domain name may look just like an ordinary internet domain name, “service.example.org”, or you may use any kind of string which is valid for a domain name, like “server_farm1_major_hosts”; however, the first one looks better.
Let’s look at “monitoring domains” from a user’s perspective.
Say, one has a server in the central office and two proxies in two branches.

In a classic configuration this might be a server and two proxies.
Hosts from branch1 are bound to proxy1, just as hosts in branch2 are set to be monitored from the proxy in branch2. Hosts from the central office don’t have a proxy set, so they are monitored directly by the server.
In a clustered architecture for such a case you would want to define three domains: central_office, branch1 and branch2. Since a cluster is an option then, perhaps, there are reasons for making a redundant structure.
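As a rough illustration of the idea, the host-to-domain binding could be modelled like this. This is only a sketch under my own assumptions: the Domains and Host types, the Candidates method and all node names (server1, proxy1 and so on) are hypothetical and not taken from the Zabbix configuration.

```go
package main

import "fmt"

// Domains maps a monitoring domain name to the servers and proxies
// that are allowed to process hosts bound to that domain.
type Domains map[string][]string

// Host is bound to a monitoring domain instead of a single proxy.
type Host struct {
	Name   string
	Domain string
}

// Candidates returns the nodes that may process h,
// or nil if the host's domain is not defined.
func (d Domains) Candidates(h Host) []string {
	return d[h.Domain]
}

func main() {
	// Hypothetical layout: each branch has its own proxy plus the
	// central server as a redundant fallback.
	d := Domains{
		"central_office": {"server1", "server2"},
		"branch1":        {"proxy1", "server1"},
		"branch2":        {"proxy2", "server1"},
	}
	h := Host{Name: "router-b1", Domain: "branch1"}
	fmt.Println(d.Candidates(h)) // prints [proxy1 server1]
}
```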


Friday, February 22, 2019

Zabbix Cluster Database ID conflict resolution


The major problem for a clustered database is conflict resolution. In general, “an application has to be aware of clustering and must have its own means of conflict resolution”.
I have decided to make cluster granularity host-based. There are many reasons for it, and the most important one is that, by Zabbix architecture, one host must be processed by the same server. This is due to preprocessing, item dependencies, and less obvious reasons like proper polling planning to prevent host hammering.
Since different servers will update different hosts, there seems to be no source of conflicts or data integrity problems.
The exceptions are logged things like audit and, most importantly, the events and problems tables.
When I tried to launch a “cluster”, two instances of a server on a single database, the first problem I saw was that IDs conflict when adding new events and problems.
I guess that for database compatibility, or maybe for some historical reasons, Zabbix doesn’t rely on database ID generation and instead uses its own C function to generate new IDs.
And this is the place where changes are coming. In theory there are several approaches to maintaining unique IDs across the cluster:
  • Generating some long unique prefix (or UUID) and building the ID by adding a locally unique counter to it. This approach has the benefit of being automatic (no need to set a SERVER ID manually), but it doesn’t fit the existing database structure.
  • Stepping: each server is assigned a sequential ID, the SERVER ID (say 0…7). A new row ID is generated as the lowest multiple of the number of servers that exceeds the last used ID, with the SERVER ID added to that number. IDs will never be sequential in this case. They will have gaps, but we have 8 bytes for an INT now, so who cares? The disadvantage is the need to assign IDs manually to each server.
  • Ranging: each server is assigned its own range of IDs, say:
    • server1: 0…1000000,
    • server2: 1000001…2000000,
    • and so on.
The range might be calculated from the server ID, which, as in the previous case, could be sequential 0…7. The disadvantage is that the existing methods and some logic of ID generation will not work, and more complicated queries for initial ID retrieval will be required.

So the only really compatible and feasible way to implement ID generation for Zabbix clustering is to use stepping. And it turned out to be simple: only a few lines of code to fix, actually.
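The stepping rule described above can be sketched as follows. This is not the actual patch (the real ID generator is a C function inside Zabbix); NextID and its parameters are hypothetical names, and the sketch assumes that SERVER ID values are in the range 0 to servers-1.

```go
package main

import "fmt"

// NextID returns the next row ID for a server using stepping: take the
// lowest multiple of servers that exceeds lastID, then add serverID.
// serverID must be in the range 0..servers-1.
func NextID(lastID, servers, serverID uint64) uint64 {
	base := (lastID/servers + 1) * servers
	return base + serverID
}

func main() {
	const servers = 8
	var last uint64
	for i := 0; i < 3; i++ {
		last = NextID(last, servers, 3)
		fmt.Println(last) // prints 11, then 19, then 27
	}
}
```

Every ID produced this way leaves a remainder equal to the SERVER ID when divided by the number of servers, so two servers can never generate the same ID, even if they start from the same last ID.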

Monday, February 18, 2019

Zabbix cluster and the “CAP theorem”


A Zabbix cluster assumes some sort of database redundancy. Having the ClickHouse patch makes things a bit easier: the amount of database writes is reduced drastically. As ClickHouse is replicated by its nature, there is no problem with history storage.
Consistency:
But for the rest of the data there are some issues, as Zabbix keeps state in the database:
  • It uses a kind of redundant trigger state storage: trigger states, events and problems
  • It updates item reachability status in the DB
So running Zabbix on a cluster must deal with such data inconsistency, as it must be assumed that just before a server went out of service it didn’t update its database, or it made only a partial update:
For example, a trigger state is recorded, but no event or problem generated.
So I did the following: when a new host is assigned to the server, the server resets the host’s item states to “Undefined”, so the next trigger recalculation will cause a new trigger value to be written, and the initial problem and event records will be generated. This may cause some repeated events and problems, but guarantees that the item is in a proper state in the DB after the first trigger calculation.
Availability:
This is up to the cluster design. Since I am on PostgreSQL right now, I have decided to use BDR. It allows each server to have its own database and promises that the databases will be the same “eventually, over time”.
Partitioning tolerance:
The point of having a separate database server for each Zabbix server is to be tolerant to cluster partitioning. I assume the database is installed on the same machine as the Zabbix server or close to it, so in the event of cluster partitioning the server-to-database connectivity stays alive. I’m not really sure what will happen to Zabbix if it doesn’t; it definitely has some means of waiting for the database to come back. And until then it will perhaps stop metrics processing, as history syncers would wait for the DB to come back to write new trigger states.
Overall:
Zabbix fits very well with a cluster which works by distributing hosts among servers. None of the CAP theorem problems are big issues. I assume even having history written to an SQL database is still OK for CAP considerations.