Tuesday, June 19, 2018

ZABBIX: asynchronous SNMP polling



After fixing the FPING problem, I switched on the SNMP polling threads, and it turned out to be an even bigger problem than the accessibility checks: for each device we send only 5 pings but poll 122 items via SNMP.

Switching on SNMP polling on the test server immediately led to SNMP poller queuing. To keep a healthy balance, I left only 500 devices in the test monitoring installation.

SNMP rate measurements, traffic dumps, and code investigation showed a few facts:
  • SNMP is synchronous in Zabbix
  • The SNMP poller is very feature-rich
  • It looks like the Zabbix guys have had a lot of experience with old and new devices and packet sizes
  • They also process auto-discovered hosts differently from manually added ones
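
To illustrate why synchronous polling matters here, below is a minimal sketch in Python (not Zabbix's C code, and not using Net-SNMP). The `asyncio.sleep` call stands in for the network round trip of one SNMP request; the names `poll_host`, `poll_all` and `run_demo` are made up for this example. The point is only that an asynchronous poller can have requests to many hosts in flight at once, while a synchronous one waits for each reply before sending the next request.

```python
import asyncio


async def poll_host(host, latency, stats):
    """Simulate one SNMP GET; the sleep stands in for the network round trip."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(latency)  # a synchronous poller would block here
    stats["in_flight"] -= 1
    return host, "ok"


async def poll_all(hosts, latency=0.01):
    """Start a request for every host, then collect all the replies."""
    stats = {"in_flight": 0, "peak": 0}
    replies = await asyncio.gather(*(poll_host(h, latency, stats) for h in hosts))
    return dict(replies), stats["peak"]


def run_demo(n_hosts=20):
    """Poll n_hosts simulated hosts; return (results, peak requests in flight)."""
    hosts = [f"host{i}" for i in range(n_hosts)]
    return asyncio.run(poll_all(hosts))
```

With 20 simulated hosts, all 20 requests are outstanding at the same time, so the total wait is roughly one round trip instead of 20.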
So I spent a few days writing an asynchronous SNMP module, based on the original Zabbix SNMP code and the async SNMP example from the Net-SNMP docs. Additionally, I had to make a few changes in the following modules:
  • In the DC library, which fetches the SNMP tasks for the poller, I removed a few conditions that limited the fetch results to one host only, and another condition that did some response size calculation and could also reduce the number of selected items.
  • I had to change the memory allocation type for the ITEMS arrays, which were static before the modification. Since the array sizes went up from 128 ITEMS to 4096, they probably couldn't fit on the stack anymore.
  • I was lucky that the DC library returns the ITEMS to check ordered by host. I assume it's better to request one item per host at a time, so having the resulting ITEMS array grouped by host is very handy for processing.
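
The "one item per host at a time" idea can be sketched roughly like this. It is an illustration in Python, not the actual module: the `(host, oid)` tuples and the function name `rounds_one_item_per_host` are assumptions for the example. It relies on the input already being ordered by host, as the DC library returns it.

```python
from itertools import groupby, zip_longest


def rounds_one_item_per_host(items):
    """Split host-ordered (host, oid) items into polling rounds.

    Each round contains at most one item per host, so no single device
    gets more than one outstanding request at a time.
    """
    # groupby only merges adjacent equal keys, hence the ordering requirement.
    per_host = [list(group) for _, group in groupby(items, key=lambda it: it[0])]
    rounds = []
    for batch in zip_longest(*per_host):
        # Hosts with fewer items are padded with None; drop the padding.
        rounds.append([it for it in batch if it is not None])
    return rounds
```

For example, with two items for host `a`, one for `b` and three for `c`, the first round holds one item for each of the three hosts, the second round one item each for `a` and `c`, and the third the last item for `c`.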
Now, a few problems I ran into:
  • I had to write proper C code and read up on the libraries. I spent a lot of time not understanding how net_snmp handles PDUs. There were perhaps another 20 coding problems, which were easy to catch.
  • When a lot of async SNMP pollers are running, and under some other circumstances, I see that the "preprocess manager" process reports huge task buffering via set_proc_title (PIC), and there are significant data delays that can be seen on the graphs (I later learned that this is a separate problem).

What further optimizations could be made:
  • Don't request items from a host if the previous N items have failed (this might be done either in fping or, better, in the check logic)
  • Sometimes I get a process crash when adding/removing a bunch of devices; need to figure it out (not true anymore, I haven't had a crash in the poller ever since)
  • Actually, I need to compare the "internal" change with an outside script that uploads SNMP results via the trapper interface. Maybe it's more feasible to do it outside, given the amount of code modifications and the possible problems, including bugs and upgrade problems.
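
The first optimization in the list (stop asking a host after N failed items) might look something like the sketch below. This is a hypothetical helper, not existing Zabbix code: the class name `FailureGate`, the default of 3 failures, and the per-cycle reset are all assumptions made for the illustration.

```python
class FailureGate:
    """Skip a host's remaining items after N consecutive failures in a cycle."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = {}  # host -> consecutive failure count

    def should_poll(self, host):
        """True while the host is still below the failure limit."""
        return self.failures.get(host, 0) < self.max_failures

    def record(self, host, ok):
        """Record one item result; a success resets the host's counter."""
        if ok:
            self.failures[host] = 0
        else:
            self.failures[host] = self.failures.get(host, 0) + 1

    def new_cycle(self):
        """Forget all failures so every host is retried in the next poll cycle."""
        self.failures.clear()
```

The poller would call `should_poll(host)` before sending each request and `record(host, ok)` after each reply or timeout. Resetting at the start of every cycle is one possible policy; without some reset, a host that failed once would never be polled again.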


So far the results are inspiring: 20-30 times faster than before. In the same poll time, throughput increased from 200 values to 4k-6k values at no CPU cost.
