Networker`s records

Saturday, April 16, 2011

Problems with BRAS high load ... fixed

About two weeks ago we've started to expereince problems with NAS servers is i wrote a week ago.

On peak load hours they were serving almost 300mbits and suffering from 6000+ pipes number. At some point dummynet caused degradation up to 10 times (both by traffic and packet no) Pipes where used for customer traffic policing.

To avoid the 'pipe' problem and get rid of dummynet i've wrote ng_dummynet module. The reason we don't want to use IPFW + tables + ng_ipfw - we are not letting L3 processing at all. Ng dummynet rememres which Ip in which class and sends traffic of the same class (which might be several IPs) to the same ng_car module.
Two days in production showed 20-30% degradation in comparison to dummynet which where very strange and unexpected. In two days i've spent lots of time trying to figure where is the problem, which could end up kerenl tasq process profiling.

But today on the tetst stand we finally detected that degradation happens on more than 1000 customers in new ng_dummynet module.

While doing block by block cut-offs of functionality in the module finally i've found the problem: it was diffent network and host byte order. The module uses hashing by two last bytes, which turned to be first two ip bytes and they are the same for all customers. So, the hashing woked as simple one by one IPs enumeration. Which turned to enumeration of 2-3k IPs in average for each packet.

The module fix worked out: imeddieately CPU tasq time dropped times.
Shure you want the picture:

The interesting thing: traffic from one NIC always processed on one core.

Conclusions: on two NIC system it's dnagerous to work on 2-Core system as traffic starvation may overload all the system. On 4- kernel system (i5) with two NIC's CPU Utilization will not go higher then 50%, on 8-core i7 - not higher then 25%. Of course on i5 and i7 free cores may be loaded with extra something, for example, put another NIC.

Take a look: С2Duo 8500 -> Core I7 (some higher model with 4 cores 3mhz) upgrade

And the last few words: i even started to write all-in one module to avoid multple IP lookups and rely only on netflow engine.
But now it seems to be useless. Unless i do P2P by behaviour recognition.

I think that after final optimizations we can get even more 10-15% CPU time on the same system.

Thursday, April 7, 2011

ng_dummynet

Did some online debugging yesterday.

We've had one system with high load.

As for now i cannot exactly tell what is causing the high CPU load.

During the test i didn't touch the IPFW QUEUE system which is responsible for the policing. In fact, before reinitializing netgraph system experinced 70-80 CPU load (both cores the same 70-80) persent. At that moment there were 35-45K packets.

After netgraph reinit load dropped to 10-15 percent of each core. Pretty strange. The next thing to suspect was number of flows in ng_state. Before reinit there were about 120K. Right after init system collected 40K flows and load was 10-15 percent, in next 15 minutes it's rised to 80K flows, load raised substantioanaly - 2-5 percent.

Interestingly, complete removal of netgrpah processing doesn't result in significant CPU utulisation reduce - same 10-15% on tasq. Which means some part of netgraph is killing the performance.

So, to conclude: i am not sure that the problem is dummynet, but at least eliminating dummynet will allow to overcome 3K users (6k pipes) per system limit.

I've called it ng_dummynet. The thing is in process now. Yesterday i've did skeleton. Thing is supposed to be used together with ng_car which will be created per user basis. For now it is able to pass traffic uplink <-> downlink. Hoping to finish it by tommorow.

Sunday, April 3, 2011

NAMED - almost end of story

03.04 NAMED - end of story
So, one week full-load flight with new recursor is finished.

Some results:
- no software related problems except once we've expereinced some kind of attack when DNS traffic tripled, according to maintenance team report that was attack to the authoritattive servers, since they reside on the same hardware with recursors that caused significant system degradation, supposily, because of named was killing both CPU's, unfortunaly no real debugging and analyzing is possible now.
- CPU load reaches 25-30% in peak time, and the good news powerdns as able to use both CPU cores without proccess blocking

Some setup details: actually now it is one cache, but with all logic put to two scripts - nxdomain and preresolve. Some auth functions related to different answers to different internal networks are put in preresolve scipt.

Local domains and RFC 1918 (grey) networks are forwarded to auth directly as root servers have no idea about their delegation (actually, they are site-specific zones), some black-listed zones are also processed in the preresolve. Scripts are in c-perl-like LUA language, pretty simple and easy to underastnd language. According to tests even complicated lookups in LUA are much more fast end effective then doing real lookups (blacklisted zone).
Problems: they alwayas are. The only problem that most recent version of the pdns-recursor doesn't do round-robin DNS balancing correctly causing overloading of some servers. Previous version works fine, and thow, we left it in the production for now. The other thing, pdn-recursor also trims UDP and we cannot answer 40-50 server pools with it, BUT because of preresolve section we are'nt need it anymore as the problematic pool with ~50 servers is diveded to 6 networks and in LUA we can answer only those servers wich are supposed to serve that network segment, in comparison, BIND allowed only to do site-specific sorting of the pools, returning some servers first in the pool, but anyway all the pool was in the answer.

Overall: pdns-recursor is really nice upgrade, very low memory requerements, simple and efficient. Highly recomended for high recursor loads (5-20Kps of DNS traffic)

Friday, April 1, 2011

who is speaking

I am doing a litle out-of-job project - a callcenter for Itallians, to process paid services (mostly Future-Telling, Destiny, Erotic and so on). Actualy it's almost finished and now handles about 300-400 calls a day.

Some details - dedicated server receives calls via sip from PSTN via a Carrier. Calls are routed to the operators. Some basic accounting, management is done in MySQL.

As the main developer dissaperaed in the middle, i am finishing the thing myself.

Actually this is pretty basic call-center, it's only interested me as i had no hands-in sip and voip systems experience before at all.

Now I've stopped development for a while and determine reason of frequent call drops in the evenings. To have a full picture i've installed ZABBIX localy and keep monitoring some basic aspects like connectivity to the sip provider and server load and other sites as well.

Saturday, March 26, 2011

some news, good and not so good

At the moment we transferred almost 20k happy customers to new technology. 40k pending. The first outcome - about 20% of customers prefer old access methods.

Now we have problems which appeared as result of service simplification in the past. The problem is DUMMYNET. After 6k PIPES system dies. That's it, doesn't matter how much traffic.

So, now we are adding boxes to process traffic, but this is temporary solution. I look forward now to get rid of dummynet and use ng_car for policing.

When we;ve switched from dummynet on pptp servers to ng_car, we could double user capacity per system. Hopefully this will be the case this time again.

Some task analysis: there 2 primary methods to do the policing - one is to put separate traffic multiplexer to thousands of ng_cars, the other is to use well-working ng_state for that.

Advantage of separate multiplexer is it will just work in current setup. Disadvantage - every packet should be switched that way.
Implementation with ng_state allows to process only first packet in the flow, BUT following problems arrives: traffic forwarding and class policing should be done differently then. Actually instead of using different traffic path's for different classes i should add class labels to the packet.

Will do separate multiplexer now, as faster to implement solution and also, it might be not much slower comparing to the second solution as there label lookup will be needed for each packet.

Thursday, March 10, 2011

DNS story, part2

Tests revealed that old djbdns seem to be too old as it processes requests 10 times slower then slow bind does.

The final decicsion and contest winner is powerDNS recursor for recursor part and BIND for authoritative part. And yes, finaly we've split them.

Idea to have 12 caches and split-horizon auth DNS server still continiuoes to be just an idea.

Instead it apperaed that powerDNS recursor has a very nice feature - it can pass all incoming requests to a lua script and all request ended with NX answer to another lua script. This is enought to fullfill all our split horizon demands.

The only exception we don't know destination IP of DNS query in script when nice "packet cache" feature is on. Thats why at the moment two copies of powerDNS recursor is on duty. Not a big problem, two!=twelve.

It took about 10 working days to transfer all recursive payload from named to powerDNS. All went smooth, without server lost.

Some things to notice:

the most recent version of powerDNS recursor doesn't do round robin correct when answering queries with many resources. I didn't have time to figure out why, just downgraded to one minor version, it's ok.
to have all private thing functional ( .local adresses and "grey" networks PTR resolving) recursor should be specifically told where auth server for them is, as root servers will not answer/know about such a resources.
restarts instantly,
memory consumtion as 200-300 megs, ten times less than named.

At the moment we are leaving named as authoritative server. Mostly because of noc duty guys are comfortable to work with it, and absense of high traffic now, most mission-critical requests for customers processed inside recursor without asking auth server.

The CPU usage diagram:

Doesn't look impressive, but please note that user load (middle line) with bind couldn't get more then 50% (one CPU), The user load magnitude after upgrade is because named cleaning it's empty cache (recursive part still on, but no real requests go to bind).

Saturday, February 19, 2011

Bind, djbdns, etc and ISP

Today, instead of polishing my SCE solution to put it to production 4 am in Monday, i had to dig some DNS stuff.

In short, BIND sucks. To be more specific: actually it's not and it's a very good piece of software, which can do alot. But when used in so-called split-horizon setups it becomes very heavy, long starting, memory hungry begemoth

Today i spent 3 hours in trying to understand WTF is going on with one of our DNS servers. The task was even more complicated because of load balancers are on the way. I discovered it by very unpleasant surfing in browser, 'host' test showed that some queries, even cached ones took up to two seconds to resolve. That explained a lot.

So, when i looked at the first ns (say, ns1) stats - i saw this:

thats right, almost 100% resources under named. (See the green IDLE time).
Even at night time.

After doing some profiling ( strace -c ) i've discovered that most of the time named sits in futex syscall, and about 10-20% of them aren't succesifull.

Futex problem means concurrency problem. Switching off one CPU in config did magic thing - two out of three threads dropped futex time to less then 1%, and no more futex unsuccessful calls.

Interestingly, but the third process is almost only futex calls, and half of them failing.

This setup does the job much better - no visible delays occurs even when i put all the load to the server. And significant load drop.

Here is whole picture.

In the center there is the result of named restart - it's easy to see that problem not immediate, it takes 4-5 hours to occur.
On the right the result of switching to one CPU and getting rid of locks is visible - the system becomes 60% free, with no visible signs of degradation, actually,yet.

Troubleshooting is done, somehow it's working. But i need to further. First off all, hardware upgrade needed, as it's only c2duo 6300 CPU, will do on Monday.

But stop, any software can bring even most sophisticated hardware down to it's knees by doing stupid things like BIND does.

It looks like i had to look at something very efficient, for example djbdns, something like nginx.

The problem is that i need views support (split-horizon DNS). Djbdns is the only free alternative able to do that. Others suggest setting extra instance of the server per view, which will be pretty complicated setup in my case.

The other reason to get rid of BIND is for handling load-balanced pools which consists from 40-50 servers: BIND returns them all, and i haven't found a way to reduce number to say, 4-5 resources. And this part is really needed, because there are plenty of dumb devices that fail to resolve big RR pools.

Djbdns looks to be pretty efficient and realy nice solution.
It also splits caching and authoritive parts of the server into two daemons, which is nice industry-stnsart feature,but it is another problem for me.

Authoritive server will not do caching or forwarding, but it does split-horizon.

If i put caching server first (how it's supposed to be), then authoritive will have no info about user IPs which means that split horizon will not work.

Next i've got an idea to substitute root servers on authoritive server and use a patch to answer list of root servers when authoritive part recieves something it's not authoritive for. But it means every client will do two requests instead of one - first will return the IP of cache and then second will do the request to the cache. Not good.

What will definetly work - making one cache for each view. At the moment i have 5 views, and one view doesn't need caching.

So many thoughts that i am stopping at this point. Lets rest a bit.

While looking through the features of the servers i've found intreresting utulities with ProDNS server - they can replay captured DNS traffic on selected DNS - this is an easy way to simulate payload close to real on the new server.