And still I see the server failing (SIGSEGV sometimes).
It's not nearly as stable as the production one, but typically these problems are quite easy to catch.
The problem now is that server restart is taking more and more time (about 2-3 minutes).
The preprocess manager problem
The questions:
"something impossible happen logs"
why snmp is slow to be enabled? - figure why unreachible threads are waiting for nothing
huge load after some time!!!! net_snmp_close()???
In the night there was a huge number of "must use snmp_select_socket2" (or similar) messages - it seems that under certain conditions I leave SNMP sessions open.
Too many sessions cause high load, and what seems to be the select() call stops working, so essentially the whole gathering process breaks.
TO DO: add some profiling, look at where I may leave sessions unclosed, consider removing the (select) code and replacing it with a stupid simple wait call.
Morning: huge SNMP queue. The threads seem to work normally, I haven't found any abnormalities, just 10 of them is not enough. (Actually, they run much longer than at start: their run time rose from 6-9 seconds to 14-25, but that might be due to high system load.)
Raised the SNMP poller count to 25, starting at 08:08.
08:58: flight normal, queue 0.
Something strange seems to have happened at 11:24-11:34; apparently the server crashed at that time, and there is a long SNMP queue left in the browser.
11:41: started again... actually the server is alive, but it's swapping and there are 16M records in the queue of the preprocess manager.
Maybe that's just too many hosts???
Maybe that's too big a load from fpings.
Now I've disabled about 20k hosts, reduced the number of threads to 20 for both pinger and poller, and restarted the server; let's see what happens.
With only half of the load (and the number of DB syncers increased to 40), the system seems to be doing fine for at least an hour now. And this is the first time I finally see some SNMP data gathered in the graphs.
I will watch it for another 2 hours, then start gradually adding new devices.
UPD: so far everything is fine, no queueing, system 90% idle.
Another problem to consider: devices marked non-SNMP-accessible stay too long in that state, and I've never seen any processing done by the unreachable pollers.
I expect I've broken some fetching logic in the DC functions, and that is what makes it take so long to mark devices as SNMP-accessible again.
So my theory is: I didn't have enough data syncer processes enabled before, so they were constantly busy filling the value cache (ClickHouse is not really fast for single-host queries). Actually, I would suggest filling the cache at startup in some bulky way - this would mean one query returning millions of rows. Unfortunately, I am not sure this is what Zabbix and its triggers really need.
The biggest thing bothering me right now is the occasional preprocessing manager queueing. I am almost sure that the queuing depends on two factors: system load and the number of working threads (actually it might be just one of them, since these two factors are closely linked). There is one big optimization that could be done in the SNMP poller - enabling bulk querying. This would save some CPU ticks and network PPS and might be easier on the network devices, but it will not really speed things up: each SNMP thread will anyway be waiting on timeouts for inaccessible devices, which take far longer than making 100 queries to an accessible device, and in the asynchronous model those queries happen in parallel.
Another big change and improvement that might be made is switching from fping to nmap for host checking. Question one is how valuable packet loss rates are for understanding device reachability. I have a thought that it might be more valuable to send 10 pings, one every 3 seconds, than to send them all at once every 30 seconds. With such checking, the packet loss rate would be calculated in a trigger as an average over the last 5-10 checks. To speed up the reaction when a device has become inaccessible, the last 2-3 checks must be considered separately.
OK, morning, 3:17. I see that the installation has crashed: the preprocessing manager caught SIGSEGV.
All pingers have been off (I've set StartPingers=0).
I really believe it's the preprocessing manager that is to blame, in particular the queueing of items. So I'd like to fix it by one of the following:
- find an answer: why is the queuing happening at all???
- do not allow queueing
- after a certain amount of queueing, throttle the processes to wait until the preprocessing manager is free (if it's a load problem, which it seems not to be)
I have one nice thing - I know that running Zabbix in debug mode almost immediately causes the queueing. So I need to trace down the whole decision process to understand why the hell it queues the messages.
I see that most of its time is spent processing IPC requests of type 2 (probably the results of the pollers' work). Added extra flush and history flush logging to see if it's happening at all. As the next step I will add the _flush call result.