Friday, December 30, 2011

2.2 Gigabits


The problem

At full load we noticed 'strange' system CPU usage on the systems where ng_state runs, and the support team started to get complaints about delays.

This started in the middle of December.

It took a few days to recognize the problem: the ng_queue process starts to consume more and more processor time and even takes the first positions in 'top'.

So, queueing means the ng_state module cannot process traffic directly - packets are deferred to the ng_queue threads.
But wait, let's look at the graphs again - we have more than 50% of CPU free. (pic)



Here is where the story begins.

OK, I did some tests on the working system and clearly pinned down the trigger - removing the control traffic (all ngctl calls) led to an immediate drop in ng_queue load.

So I asked a question on the FreeBSD forums about this netgraph behavior, but didn't get an answer.

My idea was that some kind of data blocking occurs while ngctl calls (netgraph control messages) are processed; I thought this was a general problem with control messages.

More tests proved that messages to any of the modules involved in traffic processing led to growth of ng_queue CPU time.

OK, if control messages have such a bad nature, I had to get rid of them - so I added an extra 'command' hook to receive commands as data packets and started using the NgSocket C library to talk to it.
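Roughly, the client side looked like the minimal sketch below. This is not the production code, just the shape of the approach: the node name "state0:", the hook names and the command string are made up.

/* Minimal sketch of feeding commands to the module through its 'command'
 * hook via an ng_socket node. Not the production code: the node name
 * "state0:", hook names and the command format are made up.
 * Build on FreeBSD with -lnetgraph. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netgraph.h>
#include <netgraph/ng_message.h>
#include <err.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct ngm_connect con;
	int cs, ds;				/* control and data sockets */
	const char cmd[] = "ADD 10.0.0.1";	/* made-up command format */

	/* Create an ng_socket node: cs carries control messages, ds carries data. */
	if (NgMkSockNode("cmdclient", &cs, &ds) < 0)
		err(1, "NgMkSockNode");

	/* Connect our data hook to the module's 'command' hook. */
	memset(&con, 0, sizeof(con));
	snprintf(con.path, sizeof(con.path), "state0:");
	snprintf(con.ourhook, sizeof(con.ourhook), "cmd");
	snprintf(con.peerhook, sizeof(con.peerhook), "command");
	if (NgSendMsg(cs, ".:", NGM_GENERIC_COOKIE, NGM_CONNECT,
	    &con, sizeof(con)) < 0)
		err(1, "NGM_CONNECT");

	/* From here on, commands travel as ordinary data packets,
	 * not netgraph control messages. */
	if (NgSendData(ds, "cmd", (const u_char *)cmd, sizeof(cmd)) < 0)
		err(1, "NgSendData");

	close(ds);
	close(cs);
	return (0);
}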

In tests this showed no ng_queue growth under heavy control traffic.

Fine! 2-3 days and we had it in production.

Guess what?

Yes, the ng_queue problem disappeared, even under high-intensity control traffic.

But... a few hours later I saw that queueing was still there and still taking some processor time.

By the evening, queueing was at about the same level as before, with the same problems.


WTF????

OK, this time I HAD to know why queueing occurs. It took another week of tests, profiling and digging through the kernel sources, and the picture became clear:

There are 3 reasons for queueing:
 - if we asked netgraph to queue data at the module/hook/packet level (see the sketch after this list)
 - if the module is out of stack
 - if traffic arrives on an inbound hook while it is still being processed by an outbound thread (a loop situation)
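For reason 1, these are the per-node and per-hook knobs I mean, roughly as I read them in sys/netgraph/netgraph.h. The hook constructor below is only a sketch, not the ng_state code.

/* Sketch only, not the ng_state code: the netgraph(4) knobs that make the
 * framework serialize or queue work (reason 1 above), used from a node's
 * newhook callback. */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <netgraph/ng_message.h>
#include <netgraph/netgraph.h>

static int
ng_example_newhook(node_p node, hook_p hook, const char *name)
{
	/* Serialize the whole node: every incoming item becomes a "writer". */
	NG_NODE_FORCE_WRITER(node);

	/* Or serialize just this hook. */
	NG_HOOK_FORCE_WRITER(hook);

	/* Or always hand items arriving on this hook to ng_queue instead
	 * of processing them in the caller's context. */
	NG_HOOK_FORCE_QUEUE(hook);

	return (0);
}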

And some good news - I found the answer to why queueing occurred: command messages need a special flag to be processed without blocking. I put the flag into the module and went back to ngctl command traffic.

This was nowhere near the end of the story.
Queueing didn't disappear.

But the queueing wasn't easy to catch - it showed up only some time (5-30 minutes) after putting control traffic on the module and loading it with a few gigabits of payload.

I was switching parts of the processing off and back on again, one by one. I kept getting false positives, and the next day I would decide the problem was somewhere else.

At some point I decided the reason was MALLOC - great, so I switched to UMA. No success.
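For the record, the change itself is small. A minimal sketch, with a hypothetical entry structure, of what moving from malloc(9) to a uma(9) zone looks like in a module:

/* Sketch with made-up names: replacing malloc(9) allocations of per-flow
 * entries with a dedicated uma(9) zone. */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <vm/uma.h>

struct flow_entry {			/* hypothetical per-flow state */
	uint32_t ip;
	uint64_t bytes;
};

static MALLOC_DEFINE(M_FLOW, "flow_entry", "example per-flow entries");
static uma_zone_t flow_zone;

/* malloc(9) style */
static struct flow_entry *
flow_alloc_malloc(void)
{
	return (malloc(sizeof(struct flow_entry), M_FLOW, M_NOWAIT | M_ZERO));
}

/* uma(9) style: create the zone once (e.g. at module load)... */
static void
flow_zone_init(void)
{
	flow_zone = uma_zcreate("flow_entry", sizeof(struct flow_entry),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
}

/* ...then allocate from it on the hot path. */
static struct flow_entry *
flow_alloc_uma(void)
{
	return (uma_zalloc(flow_zone, M_NOWAIT | M_ZERO));
}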

After one more week of this I had two results that were 100% proven: after the module was restarted, it ran fine as long as there was no control traffic; as soon as the first IP was put into a service, ng_queue started to grow.

Stuck.

Then I switched to profiling netgraph in the kernel.

First of all, I added some printfs to log the reason why a packet gets queued.
And this was the first light at the end of the tunnel.

So I realized I was hitting all 3 reasons.

Unbelievable.

OK. Reason 1 - loop detection - well, this was easy: we really did have a loop there. It was eliminated quickly.

Reason 2 - queueing because we asked for it - this was happening because control messages were 'WRITERS'; in other words, the module was blocked while processing them. Add the flag - gone.

Reason 3 - the stack. I'm not enough of a programmer to know why we would run out of stack. But wait, I still remember something: the stack is used to allocate variables defined inside procedures.

So netgraph is a chain of procedure calls, and sometimes there are up to 10 of them. Not that many to worry about. All I knew by that time was that a module can allocate 64 MB via MALLOC - need more? - use UMA.

I was expecting to have a few megabytes of stack, but I was wrong.

Profiling showed that only 16 kilobytes (yes, only 16384 bytes) are allocated for the kernel stack.

If more than 3/4 of the stack is used, netgraph puts the packet into the queue instead of calling the next procedure. So as soon as stack consumption got close to 12 kB - we're in the queue (toast, dead).
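In code it boils down to something like the fragment below - a simplified rendering of the check in ng_base.c as I read it, not a verbatim copy.

/* Simplified rendering of the stack check in ng_base.c (as I read it on
 * FreeBSD 8.x), not a verbatim copy. */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>

static int
out_of_netgraph_stack(void)
{
	size_t total, used, left;

	GET_STACK_USAGE(total, used);	/* kernel stack is ~16 kB by default */
	left = total - used;

	/* Less than a quarter of the stack left (i.e. roughly 12 kB already
	 * used): netgraph queues the item to ng_queue instead of calling
	 * the next node's receive routine directly. */
	return (left * 4 < total);
}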

When a packet arrived at ng_state, 5-6 kB of stack was already used. And ng_state, wait - what's in there... no, it's a 2 kB buffer in one place, and another 2 kB buffer in another place... do I need to continue?
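Just to show the shape of the problem, here is a made-up fragment (not the real ng_state code) of what a 2 kB on-stack buffer in the per-packet path looks like:

/* Made-up fragment, not the real ng_state code: a large on-stack buffer
 * in the data path eats a big chunk of the 16 kB kernel stack on every
 * packet that passes through the node. */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <netgraph/ng_message.h>
#include <netgraph/netgraph.h>

static int
ng_example_rcvdata(hook_p hook, item_p item)
{
	char scratch[2048];	/* 2 kB of the 16 kB stack gone in one call */
	struct mbuf *m;
	int error = 0;

	NGI_GET_M(item, m);
	m_copydata(m, 0, MIN(m->m_pkthdr.len, (int)sizeof(scratch)), scratch);

	/* ... classification work on 'scratch' would go here ... */

	/* Forward the packet (illustrative: real code would pick the
	 * proper outbound hook). */
	NG_FWD_NEW_DATA(error, item, hook, m);
	return (error);
}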

Now things were clear: one of the 2 kB buffers appeared when I added message processing via the separate hook - that one I could simply remove, since it was no longer needed; the other 2 kB buffer was... in dead code.

OK, let's recap.

1. There was queueing because of control message blocking.
2. There was queueing of classified packets because of looping.
3. After fixing (2) there was still queueing of classified packets, because of the extra 2 kB buffers sitting twice on the path.
4. Because of (3), fixing (2) didn't help.
5. When (3) was finally fixed, things got back to normal.

and ... no, this wasn't all....

6. In the hurry, an old SVN copy was deployed on 3 systems out of 4. The next day I was looking at a strange graph and was confused: the support team reported great improvements and no complaints, the people who had had problems said the network was great now, but I could still see the problems. Maybe I was so sure and glad that the problem had been solved the day before that they just couldn't accept there were still problems... who knows.

So one interesting question remains: how much traffic can we put on a single system? Is it the planned 4-5 Gbit/s (7-8 in duplex), or will we hit the next barrier somewhere?

Next week we may join the loads of 2 machines onto one, which will be about 3-5 Gbit/s of downstream traffic.

Thursday, December 1, 2011

The ng_state development cycle is closed now.
Last month we did 3-5 patches, mostly monitoring-related.

Now we've got some activity numbers, and it's clear that at peak times a server serves 7.5K customers at 50% CPU utilization. These are pretty good numbers; they indicate that total server capacity is more than 10K customers.





I am going to switch to a new project now.

One of the most interesting things to do now is to implement the 'global switching' project.
Here are the details:
 - companies lease town-wide L2 connectivity between their locations
 - our network is NOT MPLS-ready and, especially, not VPLS-ready
 - we have PC-based devices for L2-over-L3 (Ethernet over IP) connectivity
 - those devices act like Ethernet repeaters in a hub-and-spoke topology, meaning all traffic is distributed to ALL nodes, regardless of whether a given VLAN is actually needed beyond the hub
 - so we need at least some kind of 'VLAN pruning'
 - it would be much better to switch from hub-and-spoke to a partial mesh topology and prevent L2 looping with a kind of split horizon (a toy sketch of this forwarding rule follows the list)
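The toy fragment below (plain C, all names hypothetical) only illustrates the two rules mentioned above - per-neighbor VLAN pruning plus split horizon between mesh links - and is not a design for the actual devices.

/* Toy illustration only, with hypothetical names: flood a frame to a
 * neighbor only if it subscribed to the frame's VLAN (pruning), and never
 * re-flood between mesh/core links (split horizon against L2 loops). */
#include <stdbool.h>
#include <stdint.h>

struct neighbor {
	bool     is_core;		/* tunnel into the mesh vs. a local customer port */
	uint64_t vlan_map[4096 / 64];	/* bitmap of VLANs this neighbor asked for */
};

static bool
wants_vlan(const struct neighbor *n, uint16_t vid)
{
	return (n->vlan_map[vid / 64] >> (vid % 64)) & 1;
}

/* Should a frame tagged 'vid', received from 'from', be repeated to 'to'? */
static bool
should_forward(const struct neighbor *from, const struct neighbor *to,
    uint16_t vid)
{
	if (to == from)
		return false;		/* never echo a frame back */
	if (from->is_core && to->is_core)
		return false;		/* split horizon: core-to-core is not re-flooded */
	return wants_vlan(to, vid);	/* VLAN pruning */
}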

Sure, I want it to be as simple as possible, so the following things are desired:
 - dynamic exchange of the VLAN lists
 - path-cost accounting in the partial mesh
 - automatic or semi-automatic neighbor discovery

Yes, that sounds like a kind of routing protocol, and in fact, it is.
So I'd like to utilize one of the IGPs currently running in the network core to do the VLAN exchange (or maybe start one more - MP-BGP? OSPFv3?... whatever...).