Two new bottlenecks have appeared.
The first one: the 100 history writers seem to be slow at pushing data into ClickHouse. But that is not certain, because of problem number two: I could only squeeze 35k NVPS out of the server.
That is remarkable anyway, considering our current need is 15k NVPS at most, and the existing old system barely manages 5k NVPS.
BUT the server is only 20-30% utilized, so why aren't we doing at least 50k?
Playing with the number of threads doesn't help: the more threads, the more time they spend in each iteration.
top shows they are sleeping most of the time, waiting for things to happen, which is good.
OK, let's ask strace what they are waiting for.
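Something along these lines, attached to one of the poller PIDs (-c makes strace aggregate syscall counts instead of printing each call):

    # summarize syscalls of a running poller for a few seconds, then Ctrl-C
    strace -c -p <poller_pid>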
Interesting: it's semop.
Strange, it's neither the SNMP nor the MySQL handling. The sources show it's mutex locking, used to take tasks from the pool and to return them after processing.
This is quite bad news: it means the system is interlocked due to internal architecture limitations. And architecture is usually very hard to change, since everything around it is built to match that architecture.
What most likely happens is the following: finding the next tasks to poll takes some processing time under a lock, so I assume there is a serialization limit, and it is now below the system's raw polling capacity.
Each poller spends most of its time waiting for its chance to lock the mutex and call DCconfig_get_poller_items.
So there is a processing limit: how fast a single core can extract items to be polled, plus some overhead for init procedures and for locking and unlocking the mutexes.
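In C terms, the pattern looks roughly like this toy model (illustrative, not actual Zabbix code; in the real server the mutex lives inside the config cache and is taken by DCconfig_get_poller_items / DCrequeue_items):

    #include <pthread.h>

    #define BATCH 8192                          /* items per locked fetch */

    typedef struct { int itemid; /* host, key, interface, ... */ } item_t;

    /* stubs standing in for the config-cache functions */
    int  get_poller_items(item_t *items);       /* pick up to BATCH due items  */
    void requeue_items(item_t *items, int n);   /* schedule their next polls   */
    void poll_item(item_t *item);               /* the actual SNMP/agent query */

    static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

    void *poller_thread(void *arg)
    {
        item_t items[BATCH];

        for (;;)
        {
            pthread_mutex_lock(&cache_lock);    /* every poller queues here... */
            int n = get_poller_items(items);
            pthread_mutex_unlock(&cache_lock);

            for (int i = 0; i < n; i++)         /* ...but this part is parallel */
                poll_item(&items[i]);

            pthread_mutex_lock(&cache_lock);    /* ...and queues here again */
            requeue_items(items, n);
            pthread_mutex_unlock(&cache_lock);
        }
        return NULL;
    }

No matter how many pollers you start, the two locked sections serialize them, so the time spent inside the lock per batch sets the hard ceiling.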
Let's try increasing the number of items fetched per poll four times, from 8k to 32k, to reduce the number of inits and locks.
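In this build the batch size is a compile-time constant, so it's a one-line change (MAX_POLLER_ITEMS is the constant's name in the Zabbix sources; stock Zabbix uses a far smaller value, and 8k was already this setup's custom one):

    #define MAX_POLLER_ITEMS 32768    /* was 8192 in this build */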
I was pessimistic about whether this would help, but if it did, and ZABBIX could reach 50k NVPS, it would be a good point to stop optimizing.
And, as often happens, I was wrong: it actually helped a lot:
And the queue status is:
About 20% of the hosts are intentionally denied SNMP at the firewall level, to force timeouts and make ZABBIX's life harder. The current processing returns CONFIG_ERROR for SNMP items of inaccessible hosts, which makes ZABBIX think there is a configuration problem. I am thinking of replacing this with TIMEOUT, which would mark the host as unreachable, but that's a separate story.
Let's check a random device, a random port:
It's nice: an almost exact 75-second interval.
And yes, there is a 5-25 minute gap between the moment you send data to ClickHouse and the moment you can see it, due to the ClickHouse cluster's merge interval.
This is eliminated if there is only one node and you query the data from the Buffer table, but in our case there are 4 nodes, so usually the freshest data is not there.
And this is the ClickHouse monitoring during the 40k NVPS and 50k NVPS tests, showing very little increase in processed connections (ignore the peaks: that's the ValueCache filling after a restart).
For the latter I had to increase the number of DB syncers to 200, as otherwise they couldn't handle the initial ValueCache fill in parallel with sending data: once the queue grew past its safe size, the preprocessing manager stopped taking data from the request sockets, which slowed down polling.
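In zabbix_server.conf terms that is (note: stock Zabbix caps this parameter at 100, so I assume the patched build lifts that limit):

    StartDBSyncers=200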
The test shows that 50k NVPS is achievable on this machine. But while monitoring the situation for an hour, I saw occasional queuing: 200-300k items delayed for 5-10 seconds.
I would say 50k is a marginal, non-production number (until you get a better server, or perhaps fewer items and less CPU time spent retrieving tasks). To go really beyond it you need a machine with a faster core clock; the number of cores is not that critical, as long as you have more than two.
Apart from the locks, the next problem will be the preprocess_manager thread. And this is the REAL problem number two. In my installation it was showing the following statistics:
preprocessing manager #1 [queued 0, processed 192824 values, idle 3.223292 sec during 5.662645 sec]
A rough extrapolation, give or take 20%, says it will reach the point of running with no idle time at about 80k NVPS.
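The arithmetic behind that: the manager was busy for 5.662645 - 3.223292 = 2.44 seconds while processing 192,824 values, which is about 79k values per second at full utilization.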
As a final conclusion, I'd say the setup can handle up to 40k NVPS in production, which is three times more than we need now, so it's fine for production and a good point to stop optimizing.
NB:
However, it would be interesting to play with the architecture and see what could be done there. Giving more functionality to the poller threads and reducing communication with the central storage is what I would do. Perhaps pollers could take tasks for the next 3-4 minutes instead of a single check, schedule the checks on their own, and submit several results at once to the preprocessing manager. The pollers would still have to return tasks to stay in sync with configuration changes. Perhaps the same is achievable with several local proxies, but that is out of scope for now.
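Entirely hypothetical (none of these functions exist in Zabbix), but the poller loop could then look something like this:

    /* hypothetical: one locked call leases every check due in the next
     * LEASE_SEC seconds, then the poller schedules locally, lock-free */
    for (;;)
    {
        lease_t lease = lease_poller_items(poller_type, LEASE_SEC);

        while (!lease_expired(&lease))
        {
            item_t *item = next_due_item(&lease);   /* local scheduling  */
            buffer_result(&results, poll_item(item));

            if (results.count >= FLUSH_MAX)         /* batch submissions */
                send_to_preprocessing(&results);
        }

        return_leased_items(&lease);   /* sync back, pick up config changes */
    }

The global lock would then be taken once per lease instead of once per batch, and the preprocessing manager would receive fewer, larger messages.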