A test machine was able to achieve 51k NVPS steady.
That is more than enough. The test version could run for more than a week at a stable 16k NVPS with no memory leaks. So we planned to put the fixed ZABBIX code into production.
The old installation was ZABBIX version 2.1. The migration included a lot of other work, like changing the OS and fixing automation. A long night's work.
Just after the server was upgraded to the new OS, we started the database and the server and... OOOPS, it could only do 5k NVPS. The CPU was 90% idle at the time. We have seen this already, right? To get the job finished, production was left running on the test machine.
OK, let's figure out what's going on with the new server.
These are the statistics that strace shows for a poller thread.
It clearly shows that the thread is waiting for its turn to get a lock.
Now a few ideas about locking.
ZABBIX holds a huge configuration structure in memory, organized into several tables (lists or trees, perhaps).
Apart from that, there are 5 task queues, one per polling type (ICMP, POLLER, PROXY, and so on).
In each queue the structures are ordered and (perhaps) hashed by next check time; these structures reference items.
So, each time something accesses the configuration, it takes the global lock via a mutex and semop, reads/updates the configuration, and then unlocks.
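A simplified sketch of that pattern, with purely illustrative names (the real lock is ZABBIX's own mutex wrapper over a SysV semaphore, which is why strace shows the threads sleeping in semop):

#include <pthread.h>

/* Illustrative only: a single lock protects the configuration cache and
   all five polling queues, so every reader and writer serializes on it. */
static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER;
static void *queues[5];            /* one task queue per polling type */

/* what every poller effectively does to get its next batch of work */
void get_next_items(int poller_type)
{
    pthread_mutex_lock(&config_lock);     /* ALL threads serialize here */
    /* ... pop due items from queues[poller_type], update the config ... */
    (void)queues[poller_type];            /* placeholder for the real pop */
    pthread_mutex_unlock(&config_lock);
}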
The global lock domain picture:
After analyzing the dbconfig.c and dbcache.c code, I decided that for some operations the global lock could be avoided. To be precise: all operations from the poller and pinger are safe to run in parallel, as long as they use and lock their own queues:
I decided to split each queue type into 4 pieces. To keep host-to-thread persistence, items are distributed across the queues by a (hostid % 4) hash.
To maintain polling thread persistence, threads are bound to queues by their own hash (procnum % 4).
There should be at least 8 threads of each type (but better 16 or more), so that at any time at least one thread can request data from the DC cache while the others are busy polling.
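A minimal sketch of the idea, assuming nothing about the real code layout (QUEUE_SPLIT, queue_lock and the rest are illustrative names, not ZABBIX identifiers):

#include <pthread.h>

#define QUEUE_SPLIT 4                     /* each queue type is split into 4 pieces */

typedef struct
{
    void *items;                          /* ordered / hashed by next check time */
} queue_t;

/* one sub-queue and one lock per (polling type, slice)
   instead of one global lock for everything */
static queue_t         queues[5][QUEUE_SPLIT];
static pthread_mutex_t queue_lock[5][QUEUE_SPLIT];

/* called once at startup, before any pollers run */
void queues_init(void)
{
    int t, s;

    for (t = 0; t < 5; t++)
        for (s = 0; s < QUEUE_SPLIT; s++)
            pthread_mutex_init(&queue_lock[t][s], NULL);
}

/* items of one host always land in the same slice (host-to-thread persistence) */
int item_slice(unsigned long hostid)
{
    return (int)(hostid % QUEUE_SPLIT);
}

/* a polling thread is bound to one slice by its process number */
int poller_slice(int procnum)
{
    return procnum % QUEUE_SPLIT;
}

/* a poller now locks only its own slice; with 8-16+ threads per type,
   some can fetch work from the DC cache while others are out polling */
void get_next_items(int poller_type, int procnum)
{
    int      s = poller_slice(procnum);
    queue_t *q = &queues[poller_type][s];  /* this thread's own slice */

    pthread_mutex_lock(&queue_lock[poller_type][s]);
    /* ... pop items from q whose next check time has come ... */
    (void)q;                               /* placeholder for the real pop */
    pthread_mutex_unlock(&queue_lock[poller_type][s]);
}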
So the result?
But before that, one funny twist in the story:
After fixing the queues and making them work, I saw that server performance was still no better.
Strace showed that the poller still spent most of its time in semop calls, waiting for the global lock (mutex ZBX_MUTEX_CONFIG, 4).
2512:20180723:121820.512 In DCget_user_macro() macro:'{$SNMP_COMMUNITY}'
zabbix_server [2512]: zbx_mutex_lock: global config lock attepmt!
zabbix_server [2512]: zbx_mutex_lock: global config unlocked
After doing some profiling, I found that the problem was in a macro nobody really needs: the macro for the SNMP community, and we only have one community anyway!
And then I even thought that maybe ALL the locking problems were due to this macro, and a week of coding was just down to my lack of knowledge of ZABBIX.
So, I reverted to the no-queue ZABBIX version. It showed that performance is indeed much better without the macro, but only 2 times better.
Still not good enough, so we were going the right way with the queues.
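Roughly what was happening on every SNMP poll while the community was still a macro (a hedged sketch: resolve_user_macro below is a hypothetical stand-in for the DCget_user_macro call seen in the log, and 'isread' is just our community value):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER;

/* hypothetical stand-in for the real macro resolution: looking the
   macro up requires the global configuration lock (ZBX_MUTEX_CONFIG) */
static void resolve_user_macro(const char *macro, char *out, size_t len)
{
    (void)macro;
    pthread_mutex_lock(&config_lock);     /* every poll pays this wait */
    snprintf(out, len, "isread");         /* fetch the macro value     */
    pthread_mutex_unlock(&config_lock);
}

void poll_snmp_item(const char *snmp_community)
{
    char community[64];

    /* with snmp_community = "{$SNMP_COMMUNITY}" this branch runs for
       EVERY poll, so all pollers serialize on the config lock again,
       no matter how well the queues themselves are split */
    if (NULL != strstr(snmp_community, "{$"))
        resolve_user_macro(snmp_community, community, sizeof(community));
    else
        snprintf(community, sizeof(community), "%s", snmp_community);

    /* ... run the actual SNMP query with the resolved community ... */
    (void)community;
}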
As we are still in development, let's do a fast fix:
update items set snmp_community='isread' where snmp_community like '{$SNMP_COMMUNITY}';
(perhaps for production it's better to use the API or UI to fix the templates).
The following pic best describes the first launch of ZABBIX with queues:
On the same server which was doing 5k NVPS before, ZABBIX showed a steady 60k NVPS.
And finally, for the first time ever, I could load a machine to 50% with ZABBIX.
Why not more? Because then the next limiting thing comes into play: the preprocessing manager.