Wednesday, June 6, 2018

ZABBIX: timewasters II

So, there are a few stories to tell:

One is simple: I didn't consider the timeout and retries parameters for SNMP when I was moving to the test server. I thought they were taken from each item's config.

In fact, they are not.

Retries were calculated inside the old SNMP code I replaced. The retry count was 0 or 1 under different circumstances.

Timeout is a global poller thread option set in zabbix_server.conf, which in my case was 10 seconds.

So when I started tests on the new machine, threads started spending 30 seconds on each bulk task iteration.

It turned out that about 20-30% of devices were not accessible via SNMP from the new test server, and Net-SNMP used the system-wide retries value, which was 3. So each thread was waiting for up to 30 seconds, since in such huge bulks of queries there was always an inaccessible host.
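To put rough numbers on this, here is a minimal sketch. It assumes three attempts of 10 seconds each, which matches the 30 seconds observed above, and it assumes hosts fail independently. The bulk size of 50 is a made-up figure for illustration, not a value from my setup.

```python
def worst_case_wait(timeout_s: float, attempts: int) -> float:
    """Seconds a poller spends on a host that never answers."""
    return timeout_s * attempts


def p_bulk_has_dead_host(dead_fraction: float, bulk_size: int) -> float:
    """Probability that at least one host in a bulk is unreachable,
    assuming hosts fail independently (a simplification)."""
    return 1.0 - (1.0 - dead_fraction) ** bulk_size


if __name__ == "__main__":
    # 10 s timeout and 3 attempts: the values from the story above.
    print("worst-case wait, seconds:", worst_case_wait(10, 3))
    # bulk_size=50 is hypothetical, chosen only for illustration.
    for frac in (0.2, 0.3):
        p = p_bulk_has_dead_host(frac, 50)
        print(f"dead fraction {frac}: P(bulk hits a dead host) = {p:.6f}")
```

With 20-30% of hosts unreachable, practically every bulk of that size contains at least one of them, so every iteration pays the full wait.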


The second one may still be relevant.

Background: several times I saw SNMP poller processes "hanging". To be more precise, they were constantly running without doing anything useful.


My assumption was that it was due to bugs in the asynchronous SNMP processing.


The story:

When that happened for the third time, after I had fixed whatever I could in the SNMP part, the investigation started:

First I looked at what the process was doing:

It's polling something. But I don't use poll calls anywhere.

Looking inside /proc/ for the socket ID and listing the socket didn't help: yes, there is a u_str (a Unix stream socket) opened by the thread's PID, and that's it. No other side, nothing.
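For reference, the /proc lookup can be sketched roughly like this. It is a Linux-only illustration, not the exact commands I ran; the function names are my own. It relies on the fact that socket descriptors under /proc/<pid>/fd are symlinks with targets of the form socket:[inode].

```python
import os
import re
import sys

_SOCKET_LINK = re.compile(r"^socket:\[(\d+)\]$")


def parse_socket_inode(link_target: str):
    """Return the inode from a 'socket:[12345]' link target, else None."""
    match = _SOCKET_LINK.match(link_target)
    return int(match.group(1)) if match else None


def socket_inodes(pid: int) -> dict:
    """Map fd number -> socket inode for one process (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed while we were iterating
        inode = parse_socket_inode(target)
        if inode is not None:
            result[int(name)] = inode
    return result


if __name__ == "__main__":
    # Inspect the given PID, or this process if none is given.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for fd, inode in sorted(socket_inodes(pid).items()):
        print(f"fd {fd} -> socket inode {inode}")
```

The inode can then be matched against the Inode column of /proc/net/unix, which is about as far as this approach goes for a Unix socket.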

So, let's look for the poll call in the code. But... there is nothing in the whole Zabbix project. That's fine, it's a library then.

First, Net-SNMP under investigation: nothing there. Actually, it does use a poll call in the agent code, but that's not this case.

I didn't know what other libraries the thread uses, and I ruled out MySQL, since it's the preprocessor workers that write values to the database.

So I decided to go another way: look at the backtrace.

gcore + gdb help a lot.
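A minimal sketch of that combination, wrapped in Python so it can be scripted. The binary path and the /tmp/poller prefix are hypothetical placeholders; gcore writes the core file as <prefix>.<pid>, and gdb is run in batch mode to print the backtraces of all threads.

```python
import subprocess
import sys


def gcore_cmd(pid: int, prefix: str) -> list:
    """Command that dumps a core of a live process to '<prefix>.<pid>'."""
    return ["gcore", "-o", prefix, str(pid)]


def gdb_backtrace_cmd(binary: str, core_path: str) -> list:
    """Command that prints backtraces of all threads non-interactively."""
    return ["gdb", "-batch", "-ex", "thread apply all bt", binary, core_path]


def dump_backtrace(pid: int, binary: str, prefix: str = "/tmp/poller") -> str:
    """Dump a core with gcore, then return gdb's backtrace output."""
    subprocess.run(gcore_cmd(pid, prefix), check=True)
    core_path = f"{prefix}.{pid}"
    done = subprocess.run(
        gdb_backtrace_cmd(binary, core_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return done.stdout


if __name__ == "__main__":
    # Usage: script.py <pid> <path to the poller binary>
    print(dump_backtrace(int(sys.argv[1]), sys.argv[2]))
```

Taking a core first means the stuck process is paused only briefly, and the dump can be inspected later as many times as needed.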

What I got:
Yep. It's MySQL.


Yes, I was right: it doesn't write the values (in my case all the values go to ClickHouse anyway).

And yes, I was wrong: it's the MySQL interaction for marking hosts as dead when they are silent via SNMP.

The reaction: I've upgraded the MySQL server and libraries to MariaDB 10.3.

Since I strongly doubt the update will help, the next thing to try might be switching off the "disabling hosts" feature.

With asynchronous SNMP processing, it's not that important to avoid waiting for an SNMP timeout on a few extra hosts: there will be one in each bulk data request anyway, and it's only 2 seconds.

Or maybe opening up network access to all the devices will help: there would be far fewer MySQL requests for disabling hosts.

