Saturday, February 19, 2011

BIND, djbdns, etc. and an ISP


Today, instead of polishing my SCE solution to put it into production at 4 am on Monday, I had to dig into some DNS stuff.

In short, BIND sucks. To be more specific: actually it doesn't, and it's a very good piece of software that can do a lot. But when used in so-called split-horizon setups it becomes a very heavy, slow-starting, memory-hungry behemoth.

Today I spent 3 hours trying to understand WTF was going on with one of our DNS servers. The task was made even more complicated by the load balancers sitting in the way. I first noticed it as very unpleasant browsing; a 'host' test showed that some queries, even cached ones, took up to two seconds to resolve. That explained a lot.

So, when I looked at the first NS (say, ns1) stats, I saw this:

That's right, almost 100% of resources consumed by named (see the green IDLE time).
Even at night.

After doing some profiling (strace -c) I discovered that most of the time named sits in the futex syscall, and about 10-20% of those calls fail.

A futex problem means a concurrency problem. Switching off one CPU in the config did a magic thing: two out of three threads dropped their futex time to less than 1%, with no more failed futex calls.

Interestingly, the third thread does almost nothing but futex calls, and half of them fail.
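
For reference, a rough sketch of what that looked like command-wise; the pidof lookup is just how I'd find the process on this box, and named's -n flag is one way to limit it to a single worker thread:

    # see which syscalls the running named spends its time in
    strace -c -f -p $(pidof named)      # Ctrl-C after a minute for the summary table
    # restart named with a single worker thread instead of one per CPU
    named -n 1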

This setup does the job much better: no visible delays occur even when I put all the load onto the server, and the load dropped significantly.

Here is the whole picture.


In the center is the result of a named restart; it's easy to see that the problem isn't immediate, it takes 4-5 hours to show up.
On the right you can see the result of switching to one CPU and getting rid of the locks: the system becomes about 60% idle, with no visible signs of degradation, at least so far.

Troubleshooting is done and it's working, somehow. But I need to go further. First of all, a hardware upgrade is needed, as it's only a Core 2 Duo 6300 CPU; I'll do that on Monday.

But wait: any software can bring even the most sophisticated hardware to its knees by doing stupid things, the way BIND does.

It looks like I have to look at something very efficient, for example djbdns, something like nginx but for DNS.

The problem is that I need views support (split-horizon DNS). Djbdns is the only free alternative able to do that. Others suggest setting up an extra instance of the server per view, which would be a pretty complicated setup in my case.

The other reason to get rid of BIND is handling load-balanced pools consisting of 40-50 servers: BIND returns them all, and I haven't found a way to limit the answer to, say, 4-5 records. And this part is really needed, because there are plenty of dumb devices that fail to resolve big RR sets.

Djbdns looks like a pretty efficient and really nice solution.
It also splits the caching and authoritative parts of the server into two daemons, which is a nice, industry-standard feature, but it's another problem for me.

The authoritative server will not do caching or forwarding, but it does do split horizon.

If I put the caching server first (the way it's supposed to be), the authoritative server will have no info about user IPs, which means split horizon will not work.

Next I got an idea: point clients at the authoritative server first and use a patch so that it answers with a list of "root servers" whenever the authoritative part receives a query for something it's not authoritative for. But that means every client will do two requests instead of one: the first returns the IP of the cache and only the second does the actual request against the cache. Not good.

What will definitely work is making one cache for each view. At the moment I have 5 views, and one view doesn't need caching.
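
Roughly what the per-view caches could look like with the stock djbdns tools; the paths, addresses and zone name below are made up for illustration, and on the tinydns side the views would be told apart by %-location lines keyed on each cache's source IP:

    # one dnscache per view, each bound to its own IP
    dnscache-conf dnscache dnslog /etc/dnscache-office 10.0.10.1
    dnscache-conf dnscache dnslog /etc/dnscache-dc     10.0.20.1
    # only that view's clients may talk to "their" cache
    touch /etc/dnscache-office/root/ip/10.0.10
    touch /etc/dnscache-dc/root/ip/10.0.20
    # each cache asks upstream from a distinct source address,
    # so the authoritative server can map it to the right view
    echo 10.0.10.1 > /etc/dnscache-office/env/IPSEND
    echo 10.0.20.1 > /etc/dnscache-dc/env/IPSEND
    # and the internal zone goes straight to the tinydns box
    echo 10.0.0.53 > /etc/dnscache-office/root/servers/example.internal
    echo 10.0.0.53 > /etc/dnscache-dc/root/servers/example.internal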

So many thoughts that I am stopping at this point. Let's rest a bit.

While looking through the features of various servers I found some interesting utilities shipped with the ProDNS server: they can replay captured DNS traffic against a selected DNS server, which is an easy way to simulate a near-real load on the new server.
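
I don't have those utilities at hand, but a similar effect should be achievable with stock tools; this is plain tcpdump/tcpreplay, not the ProDNS utilities, and the interface names are made up:

    # record real client queries on the production resolver
    tcpdump -i eth0 -w dns-queries.pcap udp dst port 53
    # later, replay them towards the test box (addresses may need rewriting
    # with tcprewrite first, since tcpreplay resends packets as captured)
    tcpreplay --intf1=eth1 dns-queries.pcap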

Monday, February 7, 2011

Forwarding interesting traffic


We spent a working day trying to figure out how to implement forwarding of selected traffic in netgraph. No result. So... it took a couple of hours to write one very simple module: fixmac.


Fixmac overwrites the destination MAC. This is forwarding, actually, with one small disadvantage: the MAC of the destination router or system has to be maintained manually (unlike with IPFW, which works with the ARP table, though even IPFW never initiates an ARP lookup).

Fixmac has two hooks (in and out) and processes all traffic both ways. Everything going in -> out gets its MAC replaced; out -> in passes through intact.

It supports one netgraph message, ngctl msg fixmac: setmac 01:02:03:04:05:06, plus stats.
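
A hypothetical ngctl session for wiring it up; the interface, the node name and the stats message name are assumptions here, only setmac is taken from above:

    # create a fixmac instance and hang its "out" hook on em1's lower hook,
    # so rewritten frames leave the box through em1
    ngctl mkpeer em1: fixmac lower out
    ngctl name em1:lower fixmac0
    # "in" then connects to whatever supplies the selected traffic (not shown)
    # point rewritten frames at the next-hop router, MAC looked up by hand
    ngctl msg fixmac0: setmac 01:02:03:04:05:06
    # read the counters (message name assumed)
    ngctl msg fixmac0: getstats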

Never thought I'd ever need to remember this or get any practical use out of it (src: Wikipedia):


This time it was quite helpful: most of my workmates failed to guess which goes first, the Ethernet addresses or the 802.1Q header.
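
For the record, the tagged frame is laid out like this; the 4-byte 802.1Q tag is wedged between the source MAC and the EtherType, so the Ethernet addresses do come first:

    dst MAC (6) | src MAC (6) | 802.1Q tag: TPID 0x8100 + TCI (4) | EtherType (2) | payload | FCS (4)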

Saturday, February 5, 2011

ng_iptable load tests

Did high-load tests on the 2G testbed to see the overall impact, plus simultaneous add/del under high traffic to test stability.

Now, switching back to the state module:
First of all, I need a new testbed config and to check whether bi-directional flow recognition works. The simplified netgraph structure will be:

So, I'm going to use the same per-class netflow matching to do IP-based switching, which means iptable matching is done once per flow; if it were on the main traffic path, then EACH packet would have to be switched.
Most of the important traffic will pass through 4 nodes (if policing is needed) or 3 (no policing). "Learning" traffic may involve many more steps: in the worst case a packet has to pass through 50-100 bpf programs. That is a CPU and delay problem, so only the first 6-8 packets of each flow are to be recognised.

The other thing: it might be desirable to add CAR (rate limiting) to new flows, so flows exceeding a certain limit are automatically marked as junk (will do it later if needed).

One more thing (TODO): a "maintenance switch". Nobody likes it when the system crashes, and it does crash when reconfiguring nodes with high traffic on them. So a maintenance switch must be used to take traffic recognition off-line and make reconfiguration safe, something like the sketch below.
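
Conceptually it could be as simple as one control message on a bypass node; everything below (node and message names) is made up, just to pin the idea down:

    # take the recognition chain out of the traffic path
    ngctl msg mswitch0: bypass 1
    # ...safely reconnect / reconfigure the bpf and iptable nodes here...
    # put it back in line
    ngctl msg mswitch0: bypass 0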

Friday, February 4, 2011

ng_iptable

Oops, tests show that I wrote a kernel-panicker module, not an iptable :)
    UPD: mutexes are to be initialized prior to use :) - FIXED
    UPD: seems to be working; found a bug with duplicate IP entries, writing code to delete all duplicates - FIXED
Now, the mass add/del hammering test...
well, 10k adds/removals... pretty stable... no problems yet


Tuesday, February 1, 2011

ng_iptable

Decided to do a user-IP switch: it will be based on the MUX module and will have three hooks (in, default, intable). A packet from "in" will be switched to the "default" hook if it's not in the table, and to "intable" if it is in the IP table.
It will also accept two additional control messages: ADD <IP> and DELETE <IP>.
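
A hypothetical usage sketch with ngctl; the node type, paths and message spelling below just follow the description above and are not final:

    # hang an iptable instance off em0's lower hook, so wire traffic enters "in"
    ngctl mkpeer em0: iptable lower in
    ngctl name em0:lower iptable0
    # "default" and "intable" then connect to wherever each class is handled (not shown)
    # manage the table at runtime
    ngctl msg iptable0: add 192.168.0.15
    ngctl msg iptable0: delete 192.168.0.15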

OK, renamed and cleaned the module; now it's capable of passing traffic through, but doesn't do the IP lookup yet.
to do:
    - ADD <IP> / DELETE <IP> messages to manage addresses in the table (DONE)
    - take the IP from the packet data (DONE, not tested yet)
    - do the lookup (it has to be impressively efficient) (DONE, but not even tested)

So, to test the thing I will build the opposite setup of iptable and one2many nodes.

This way: