Wednesday, June 6, 2018

ZABBIX: timewasters II

So, there are a few stories to tell:

One is simple: I hadn't considered the SNMP timeout and retries parameters when I was moving to the test server. I thought they were taken from each item's config.

In fact, they are not.

Retries were calculated inside the old SNMP code I replaced: the retry count was 0 or 1, depending on circumstances.

Timeout is a global poller thread option set in zabbix_server.conf, which in my case was 10 seconds.
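For reference, this is the knob in question; a minimal zabbix_server.conf fragment matching my setup (the value is in seconds and applies globally to the poller threads):

```
### zabbix_server.conf
# Global per-check timeout in seconds; poller threads use this value,
# NOT a per-item timeout from the item configuration.
Timeout=10
```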

So, when I started tests on the new machine, threads started spending 30 seconds on each bulk task iteration.

It turned out that about 20-30% of the devices were not accessible via SNMP from the new test server, and Net-SNMP used the system's global retries value, which was 3. So each thread was waiting for up to 30 seconds, as in such huge bulks of queries there was always at least one inaccessible host.
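Those client-side defaults come from Net-SNMP's own configuration, so one way to stop relying on them is to pin them explicitly. A sketch (directive names as documented in snmp.conf(5); check your version):

```
### /etc/snmp/snmp.conf
# Defaults Net-SNMP falls back to when the application does not set
# timeout/retries on the session itself; this is where retries=3 came from.
timeout 1
retries 1
```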


The second one: it might still be relevant.

Backstory: several times I saw SNMP poller processes "hanging". To be more precise, they were constantly running without doing anything useful.


My assumption was that it was due to bugs in the asynchronous SNMP processing.


The story:

When that happened for the third time, after I had fixed whatever I could in the SNMP part, the investigation started:

First I looked at what the process was doing:

It was polling something. But I don't use poll() anywhere.

Looking inside /proc/ for the socket id and listing the socket didn't help: yes, there is a u_str (a Unix stream socket) opened by the thread's pid, and that's it. No other side, nothing.

So, let's look for the poll() call in the code. But... there is nothing in the whole Zabbix project. That's fine, then it's a library.

Net-SNMP was first under investigation: nothing there. Actually, they use a poll() call in the agent code, which is not this case.

I didn't know what other libraries the thread uses, and I ruled out MySQL, since it's the preprocessor workers that write values to the database.

So I decided to go another way and look at the backtrace.

gcore + gdb helps a lot.

What I got is:
Yep. It's MySQL.


Yes, I was right: it doesn't write the values (in my case all the values go to ClickHouse anyway).

And yes, I was wrong: it's the MySQL interaction that marks hosts as dead when they are silent via SNMP.

The reaction: I upgraded the MySQL server and libraries to MariaDB 10.3.

Since I strongly doubt the update will help, the next thing to do might be to switch off the "disabling hosts" feature.

When processing SNMP asynchronously, waiting out an SNMP timeout for a few extra hosts is not that important: there will be one in each bulk data request anyway, and it's only 2 seconds.

Or maybe opening up access to the whole network will help: there would be far fewer MySQL requests for disabling hosts.


Wednesday, May 30, 2018

ZABBIX: 50k NVPS, the final test results and the new bottleneck

Two new bottlenecks appeared.

The first one: 100 history writers seem to be slow at putting data into ClickHouse. But that is not for sure, as there is problem number two:

The problem: I could only squeeze 35k NVPS out of the server.
That is remarkable anyway, considering our current need is 15k NVPS maximum, and the existing old system hardly does 5k NVPS.

BUT: the server is only 20-30% utilized, so why aren't we doing 50k at least?

Playing with the number of threads doesn't help: more threads means more time spent in each iteration.

top shows they are sleeping most of the time, waiting for things to happen, which is good.

OK, let's ask strace: what are they waiting for?



Interesting, it's semop.

OK, strange that it's neither SNMP nor MySQL handling. The sources show it's mutex locking, used to access the pool of tasks and to return them after processing.

This is quite bad news: it means the system is interlocked due to internal architecture limitations. And architecture is usually very hard to change, since everything around it is built to fit it.

What most likely happens is the following: the procedure for finding the next tasks to be polled requires some processing time, so I assume there is a ceiling which is now below the current polling capacity of the system.

Each poller waits most of the time for its chance to lock the mutex and use DCconfig_get_poller_items.

So there is a processing limit: how fast one processor can extract items to be polled, plus some overhead for init procedures and for locking and unlocking mutexes.

Let's try increasing the number of items per poll four times, to reduce the init and lock count: from 8k to 32k.

I was pessimistic about whether this would help, but if it did, and ZABBIX could achieve 50k NVPS, it would be a good point to stop optimizing.

And as often happens, I was wrong: it actually helped a lot:

And the queue status is:

About 20% of the hosts have SNMP intentionally denied at the firewall level, to force timeouts and make ZABBIX's life more difficult. The current processing returns CONFIG_ERROR for SNMP items of inaccessible hosts, which makes ZABBIX think there is a problem. I'm thinking of replacing this with TIMEOUT, which will mark the host as inaccessible, but that's a separate story.

Let's check a random device, random port:



It's nice: an almost precise 75-second interval.

And yes, there is a 5-25 minute gap between the moment you send data to ClickHouse and when you see it, due to the ClickHouse cluster merge interval.

This is eliminated if there is only one node and you request the data from the buffer, but in our case it's 4 nodes, so usually there is no data yet.

And this is ClickHouse monitoring, showing the 40k NVPS and 50k NVPS tests with very little increase in processed connects (don't look at the peaks: that's the ValueCache filling after a restart).




For the latter I had to increase dbSyncers to 200, as otherwise they couldn't handle the initial ValueCache fill in parallel with sending data. After reaching a safe queue size, the preprocessor manager stopped processing data from the request sockets, slowing down data polling.

The test shows that 50k NVPS is achievable on the machine. But while monitoring the situation for an hour, I saw occasional queuing and processing delays of 200-300k items for 5 or 10 seconds.

I would say 50k is a marginal, non-production number (until you get a better server, or perhaps fewer items and less CPU time spent retrieving tasks). To go really beyond that you need a machine with a faster core speed; it's not that critical how many cores you get, as long as you have more than two.

Apart from the locks, the next problem will be the preprocessing manager thread. And this is the REAL problem number two. In my installation it was showing the following statistics:

 preprocessing manager #1 [queued 0, processed 192824 values, idle 3.223292 sec during 5.662645 sec]

A rough approximation, with a 20% safety margin, gives about 80k NVPS as the point where it would run with no idle time.

As a final conclusion, I'd say the setup is capable of handling up to 40k NVPS in production, which is 3 times more than we need now, so it's just fine for production and a good point to stop optimizing.

NB:
However, it would be interesting to play with the architecture and see what could be done there. Giving more functionality to the poller threads and reducing communication with the central storage is what I would do. Perhaps pollers could take tasks for the next 3-4 minutes of execution instead of a single check and schedule the checks on their own, submitting several results to the preprocessor manager. The pollers would still have to return tasks to sync with the new configuration. Perhaps the same is achievable with several local proxies, but that is out of scope for now.

Wednesday, May 23, 2018

ZABBIX: the FPING problem

Zabbix does its accessibility checks with the fping utility.

First of all, if you use domain names for host checks on a large volume, make sure DNS is cached, preferably on the same machine. That can be a local DNS cache; I just put all my hosts into the /etc/hosts file. Otherwise fping will spend 2-10 ms resolving each name, and it does it host by host, synchronously, so it will probably take even longer than the accessibility checks themselves. Or use nscd, which does the job pretty well, so you don't have to deal with manual "hosts" caching.
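The /etc/hosts variant is as simple as it gets; the addresses and switch names below are made up for illustration:

```
### /etc/hosts -- one entry per monitored switch, so fping never hits DNS
10.0.0.1    sw-access-0001.example.net sw-access-0001
10.0.0.2    sw-access-0002.example.net sw-access-0002
```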


On my test machine, adding all 36k access-level switches caused global Zabbix stagnation, and the check queues grew to more than 10 minutes.

 


Some profiling showed the first obvious reason: the system didn't have enough pinger processes.

OK, let's add 100 of them...
And now there is a CPU problem: managing 100 threads isn't easy on the CPUs.


So, what's up?
nmap scans all 36k switches in a little less than 12 seconds:


Starting Nmap 7.60 ( https://nmap.org ) at 2018-04-22 09:10 +05
Nmap done: 36773 IP addresses (35534 hosts up) scanned in 11.63 seconds



and in the worst-case scenario (no hosts are up) it's 20 seconds:

Starting Nmap 7.60 ( https://nmap.org ) at 2018-04-22 09:36 +05
Nmap done: 36773 IP addresses (0 hosts up) scanned in 19.76 seconds


According to the documentation, fping does exactly the same: it fires all the ICMP packets at once and then waits for responses. But for some reason it takes too long.

To figure it out, I left only one accessibility-check pinger and did a traffic dump, which showed a very interesting picture: there is a 10 ms inter-packet delay between the ICMP requests fping sends (RTFM: it's actually written in the ZABBIX manual).


That is not good. Pinging 1000 devices will take 10 ms * 1000 devices * 3 attempts = 30000 ms = 30 seconds, but we need to check 36k devices. Sure, threading will help, but do we want to waste half a minute?

Note: old versions of fping use a 20 ms delay. I've found somewhere on the net that the reason is to avoid losing packets on 100 Mbit links, which is kind of odd nowadays.


It's actually defined in fping.c and may be hard-coded or changed via configure options:

 /* maxima and minima */
#ifdef FPING_SAFE_LIMITS
#define MIN_INTERVAL 1 /* in millisec */
#define MIN_PERHOST_INTERVAL 10 /* in millisec */
#else
#define MIN_INTERVAL 0
#define MIN_PERHOST_INTERVAL 0
#endif

So, after recompiling fping without any delays, I put the new version into production, and it got much better:



And... something told me I should check everything.

So it didn't take long to find that Zabbix can only ping 128 hosts each time it opens a pipe to fping. Not much, really. Considering that spawning a process costs some resources and might also be optimized, I did some source code research and found that the limit is in the code.

#define MAX_JAVA_ITEMS      32
#define MAX_SNMP_ITEMS      128
#define MAX_POLLER_ITEMS    128   /* MAX(MAX_JAVA_ITEMS, MAX_SNMP_ITEMS) */
#define MAX_PINGER_ITEMS    128



I tried to change MAX_POLLER_ITEMS along with MAX_SNMP_ITEMS to 4k, but that led to a zabbix_server crash on the first attempt to call the pinger thread.

Some debugging showed the reason was memory allocation (almost the same thing I had with netgraph 6 years ago).

And that will be a different story.

Anyway, even just fixing fping significantly reduces the system load (I measured 30% of CPU related to fping), as you get much faster ping processing and less threading.

On ~500 hosts, the fixed fping finishes 3 times faster (-C5 -t1000) than the original one.


Tuesday, May 1, 2018

Monday, June 13, 2016

NAMED - end of the story


So, a one-week full-load flight with the new recursor is finished.

Some results:
 - no software-related problems, except that once we experienced some kind of attack when DNS traffic tripled; according to the maintenance team's report, it was an attack on the authoritative servers.

Since they reside on the same hardware as the recursors, that caused significant system degradation, supposedly because named was saturating both CPUs.
Unfortunately, no real debugging and analysis is possible now.
 - CPU load reaches 25-30% at peak time, and PowerDNS is able to use both CPU cores without process blocking.

Some setup details: one cache, with all the logic put into two scripts, nxdomain and preresolve. Some authoritative functions, related to giving different answers to different internal networks, are put in the preresolve script.

Local domains and RFC 1918 (grey) networks are forwarded directly to the authoritative servers, as the root servers have no idea about their delegation (actually, they are site-specific zones); some blacklisted zones are also processed in preresolve. The scripts are in Lua, a C/Perl-like language that is pretty simple and easy to understand. According to the tests, even complicated lookups in Lua are much faster and more effective than doing real lookups (for a blacklisted zone).

Problems: there always are some. The only one is that the most recent version of pdns-recursor doesn't do round-robin DNS balancing correctly, causing overloading of some servers. The previous version worked fine, and we left it in production for now.

The other thing: pdns-recursor also trims UDP answers, so we cannot serve 40-50-server pools with it. BUT, because of the preresolve section, we don't need that anymore: the problematic pool with ~50 servers is divided into 6 networks, and in Lua we can answer with only those servers that are supposed to serve that network segment. In comparison, BIND only allowed site-specific sorting of the pools, returning some servers first, but the whole pool was still in the answer.


Overall: pdns-recursor is a really nice upgrade, with very low memory requirements, simple and efficient. Highly recommended for high recursor loads (5-20 Kps of DNS traffic).

Sunday, October 21, 2012

STP: MSTP with domains

Just finished the MSTP with domains course.

Wonder if someone uses it.

For example, we ended up not using STP at all, because we have many D-Links at the access level, and they don't seem to understand RSTP well at all.

In fact, they can do good old STP, but that's too slow.

Friday, December 30, 2011

2.2 Gigabits


The problem

At full load we noticed 'strange' system CPU usage on the systems where ng_state works, and the support team started to get complaints about delays.

This started in the middle of December.

It took a few days to recognize the problem: the ng_queue process started consuming more processor time and even took the first positions in top.

So, queueing means the ng_state module cannot keep up with the traffic.
But wait, let's look at the graphs again: we have more than 50% of CPU free. (pic)



Here is where the story begins.

OK, I did some tests on the working system and clearly identified the problem: removing the control traffic (all ngctl calls) led to an immediate drop in ng_queue load.

So I asked a question on the FreeBSD forums about this netgraph behavior, and didn't get an answer.

My idea was that some kind of data blocking occurs when ngctl calls (netgraph control messages) are processed; I thought this was a general problem with control messages.

I did more tests and could prove that messages to any of the modules involved in traffic processing led to growth in ng_queue processor time.

OK, if control messages have such a bad nature, then I had to get rid of them: I added an extra 'command' hook to receive commands via data packets and started to use the NgSocket C library to interact with it.

In tests this showed no problems with ng_queue growth under high control traffic.

Fine! 2-3 days and we had that in production.

Guess what?

Yes, the ng_queue problem disappeared, even under high-intensity control traffic.

But... a few hours later I saw that queuing was still there and took some processor time.

By evening, queuing was at about the same level as before, with the same problems.


WTF????

OK, this time I HAD to know why queueing occurs. It took another week of tests, profiling and digging through kernel sources, and the picture became clear:

There are 3 reasons for queueing:
   - we asked for queueing in the module/hook/packet
   - the module is out of stack
   - traffic arrives at an inbound hook while it is being processed by the outbound thread (a loop situation)

And some good news: I found the answer to why queueing occurred. Command messages have to carry a special flag to be processed without blocking, so I put the flag into the module and returned to ngctl command traffic.

This was nowhere near the end of the story.
Queuing didn't disappear.

But the queuing wasn't easy to catch: it appeared some time (5-30 minutes) after putting control traffic on the module and loading it with a few gigabits of payload.

I was switching off parts of the processing and switching them back on again, one by one. I was getting false positive results, and the next day I would think the problem was somewhere else.

At some point I decided the reason was MALLOC. Great, so I switched to UMA. No success.

After one more week of this I had two results that were 100% proven: after the module was restarted, it lived fine without control traffic; after the first IP was put into a service, ng_queue started to grow.

Stuck.

Then I switched to netgraph kernel profiling.

First of all, I added some printfs to see the reason why a packet had been queued.
And this was the first light in the tunnel.

So I realized I was hitting all 3 reasons.

Unbelievable.

OK. Reason one, loop detection: well, this was easy, we have a loop there. It was eliminated quickly.

Reason 2, queueing because we asked for it: this happened because control messages were 'WRITERS'; in other words, the module was blocked while processing them. Add the flag, and it disappeared.

Reason 3: stack. I'm not enough of a programmer to know why we'd run out of stack. But wait, I still remember something: the stack is used to allocate variables defined inside procedures.

So, netgraph is a chain of procedure calls, sometimes up to 10 of them. Not that many to worry about. By that time I only knew that a module can allocate 64 MB via MALLOC; need more? Use UMA.

I expected to have a few megabytes of stack, but that was wrong.

Profiling showed that only 16 kilobytes (yes, only 16384 bytes) were allocated for the stack.

If more than 3/4 of the stack is used, netgraph puts the packet into the queue instead of calling the next processing procedure. So, as soon as stack consumption got close to 12 kB, we were in the queue (toast, dead).

When a packet arrived at ng_state, 5-6 kB of stack were already used. And ng_state, wait, what's in there... no: a 2 kB buffer in one place, and another 2 kB buffer in another place... do I need to continue?

Now things were clear: one 2 kB buffer appeared when I added message processing via the separate hook, and I could just remove it since it's no longer needed; the other 2 kB buffer was... in dead code.

OK, let's recap.

1. There was queuing because of control message blocking
2. There was queuing of classified packets because of looping
3. After fixing 2, there was queuing of classified packets because of the extra 2 kB buffers, twice on the path
4. Because of (3), fixing (2) didn't help
5. When (3) was finally fixed, things got back to normal

and ... no, this wasn't all....

6. Because of the hurry, an old SVN copy was used on 3 systems out of 4. The next day I was looking at a strange graph and was confused: the support team reported great improvements, no complaints, and the ones who had had problems said the network was great now, but I could still see the problems. Maybe they were so sure and glad the problem had been solved yesterday that they just couldn't accept there were still problems... who knows.

So now there is still one interesting question left: how much traffic can we put on one system? Is it the planned 4-5 gigabits (7-8 in duplex), or will we hit the next barrier somewhere?

Next week we'll maybe join the loads from 2 machines into one, which will be about 3-5 gigabits of downstream traffic.