Friday, December 30, 2011

2.2 Gigabits


The problem

At full load we noticed 'strange' system CPU usage on the systems where ng_state runs, and the support team started to get complaints about delays.

This started in the middle of December.

It took a few days to recognize the problem: the ng_queue process starts to consume more and more processor time and even takes the top positions in 'top'.

Now, queueing means the ng_state module cannot process traffic inline.
But wait, let's look at the graphs again - we have more than 50% of CPU free. (pic)



Here is where the story begins.

OK, I did some tests on the production system and clearly isolated the problem: removing the control traffic (all ngctl calls) led to an immediate drop in ng_queue load.

So I asked a question on the FreeBSD forums about netgraph behavior, but didn't get an answer.

My idea was that some kind of data blocking occurs while ngctl calls (netgraph control messages) are processed; I thought this was a general property of control messages.

I did more tests and could show that messages to any of the modules involved in traffic processing lead to growth of ng_queue processing time.

OK, if control messages have such a bad nature, I had to get rid of them - so I added an extra 'command' hook to receive commands via data packets and started to use the NgSocket C library to interact with it.

In tests this showed no ng_queue growth under high control traffic.

Fine! 2-3 days and we had that in production.

Guess what?

Yes, the ng_queue problem disappeared, even under high-intensity control traffic.

But... a few hours later I saw that queueing was still there and taking some processor time.

By the evening, queueing was at about the same level as before, with the same problems.


WTF????

OK, this time I HAD to know why queueing occurs. It took another week of tests, profiling and digging through the kernel sources, and the picture became clear:

There are 3 reasons for queueing:
   - the module/hook/packet explicitly asks for the data to be queued
   - the module is out of stack
   - traffic arrives on the inbound hook while it is being processed by the outbound thread (a loop situation)

And some good news - I found out why queueing occurred: command messages need a special flag to be processed without blocking, so I set the flag in the module and went back to ngctl command traffic.

This was nowhere near the end of the story.
Queueing didn't disappear.

But queueing wasn't easy to catch - it showed up some time (5-30 minutes) after putting control traffic on the module and loading it with a few gigabits of payload.

I was switching parts of the processing off and back on again, one by one. I was getting false positives, and the next day I would decide the problem was somewhere else.

At some point I decided the reason was MALLOC; great, so I switched to UMA - no success.

After one more week of this I had two results that were 100% proven: after the module was restarted, it lived fine without control traffic; and after the first IP was put into a service, ng_queue started to grow.

Stuck.

Then I switched to netgraph kernel profiling.

First of all, I added some printfs to see the reason why a packet gets queued.
And this was the first light at the end of the tunnel.

So I realized I was hitting all 3 reasons.

Unbelievable.

OK. Reason one - loop detection - well, this was easy: we did have a loop there. It was eliminated quickly.

Reason 2 - queueing because we asked for it - this happened because control messages were 'WRITERS'; in other words, the module was blocked while processing them. Add the flag - and it disappeared.

Reason 3 - stack. I am not enough of a programmer to know offhand why we would run out of stack. But wait. I still remember something: the stack is used to allocate variables defined inside procedures.

Now, netgraph is a chain of procedure calls, and sometimes there are up to 10 of them. Not that many to worry about. All I knew by that time was that a module can allocate 64 megs via MALLOC - need more? Use UMA.

I was expecting to have a few megs of stack, but I was wrong.

Profiling showed that only 16 kilobytes (yes, only 16384 bytes) are allocated for the stack.

If more than 3/4 of the stack is used, netgraph puts the packet into the queue instead of calling the next procedure. So as soon as stack consumption gets close to 12 kB - we're in the queue (toast, dead).

When a packet reached ng_state, 5-6 kB of stack was already used. And ng_state itself - wait, what's in there... oh, a 2 kB buffer in one place, and another 2 kB buffer in another place... do I need to continue?

Now things were clear: one 2 kB buffer had appeared when I added message processing via the separate hook - that one I could simply remove, since it's no longer needed - and the other 2 kB buffer was... in dead code.
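
To illustrate the kind of change involved (this is only a rough C sketch with made-up names, not the real ng_state code): a scratch buffer declared inside a function lives on that tiny netgraph stack, while a buffer kept in the node's private data is allocated once on the heap and costs nothing per packet.

    /* Hypothetical sketch. Before: 2 kB taken from the ~16 kB netgraph
     * stack on every call - two of these and the 12 kB limit is near. */
    static void
    process_packet_on_stack(const unsigned char *pkt, int len)
    {
            char scratch[2048];        /* lives on the stack */

            (void)pkt; (void)len; (void)scratch;
            /* ... format/copy into scratch ... */
    }

    /* After: the buffer is part of the per-node private data, allocated
     * once when the node is created, so per-packet stack usage stays
     * in the tens of bytes. */
    struct priv {
            char scratch[2048];        /* lives on the heap with the node */
    };

    static void
    process_packet_off_stack(struct priv *p, const unsigned char *pkt, int len)
    {
            (void)p; (void)pkt; (void)len;
            /* ... format/copy into p->scratch ... */
    }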

OK, let's recap.

1. There was queueing because of control message blocking
2. There was queueing of classified packets because of looping
3. After fixing (2) there was queueing of classified packets because of two extra 2 kB buffers on the path
4. Because of (3), fixing (2) didn't help
5. When (3) was finally fixed, things got back to normal

and... no, that wasn't all...

6. Because of the hurry, an old SVN copy was used on 3 systems out of 4. The next day I was looking at a strange graph and was confused: the support team reported great improvements and no complaints, the people who had had problems said the network was great now, but I could still see the problems. Maybe everyone was so sure and glad the problem had been solved yesterday that they just couldn't accept there were still problems... who knows.

So now there is still one interesting question left - how much traffic can we put on one system? Is it the planned 4-5 Gbit (7-8 in duplex), or will we hit the next barrier somewhere?

Next week we may join the loads from 2 machines into one, which will be about 3-5 Gbit of downstream traffic.

Thursday, December 1, 2011

The ng_state development cycle is closed now.
Last month we did 3-5 patches, mostly monitoring-related.

Now we've got some activity numbers, and it's clear that at peak times a server serves 7.5K customers at 50% CPU utilization. These are pretty good numbers; they indicate a total server capacity of more than 10K customers.





I am going to switch to a new project now.

One of the most interesting tasks required now is to implement the 'global switching' project.
Here are the details:
 - companies lease town-wide L2 connectivity between locations
 - our network is NOT MPLS-ready, let alone VPLS-ready
 - we have PC-based devices for L2-over-L3 (Ethernet over IP) connectivity
 - those devices act like Ethernet repeaters in a hub-and-spoke topology, meaning all traffic is distributed between ALL nodes, regardless of whether a node actually needs a given VLAN
 - so we need at least some kind of 'VLAN pruning'
 - it would be much better to switch from hub-and-spoke to a partial-mesh topology, and prevent L2 loops with a kind of split horizon

Of course I want it to be as simple as possible, so the following things are desired:
 - dynamic exchange of VLAN lists
 - path cost accounting in the partial mesh
 - automatic or semi-automatic neighbor discovery

Yes, that sounds like a kind of routing protocol, and in fact it is.
So I'd like to utilize one of the IGPs already running in the network core to do the VLAN exchange (or maybe start one more - MP-BGP? OSPFv3? ... whatever...).










Wednesday, November 30, 2011

CCNA

I've just completed my first CCNA training as a teacher.
Two weeks: ICND1 and ICND2.

The course is much better now; it has changed for the better since 2006.

I liked the L2 and L3 sections very much - very good explanations, pictures/slides and step-by-step examples.

Frame Relay is still there... I wonder if it's still in use anywhere. Thank God there's no IPX anymore, and there is a very good IPv6 section.

I didn't like the labs. They are designed so that the students never see the devices and use remote connectivity to manage remote equipment.

So I refused to use the proposed remote laboratory and grabbed some out-of-use devices I could find at work.

I got a 3640, one 2811, a few 2960s, one 3400, a 2600, and a 7200 in Dynamips for some tests.

A good mix of routers and switches, including one L2/L3 switch.

So in about 10 days my students had a chance to get real experience - password recovery, device connectivity, console cable connection and terminal program setup/use, IOS upgrades, and so on. We disassembled the 2600 (it likely failed because one memory bank came unlocked), managed to restore it to normal operation, and got to see how the thing does NOT pass its boot tests.

Yes, these are the same labs as proposed in the course, BUT it's real hands-on experience, which is very important both technically and psychologically.

So by day 10, people who hadn't even understood OSI layers were pretty comfortable setting up and running devices, the IP stack, advanced switching features, NAT (via CLI), RIP, OSPF, EIGRP, some redistribution, and basic IPv6 routing.

We also spent half of day 9 implementing their own labs - they brought their ideas, tasks and real-life challenges, and we solved them all.

I wanted to try myself in a teacher's role, and I think it went very well.
The best measure: I can hardly remember the students asking to go to lunch or take a break, or looking bored or tired.

Friday, November 11, 2011

Rate-Limit vs Shaping

... this is the question.

After I published the description of the ng_state module, I got several comments stating that policing is much worse than shaping in ISP applications.

We already went through this in 2008, when we switched from shaping to policing on our PPTP servers. Back then, tests showed that policing gives a much worse user experience at speeds below 256 kbit/s.

Today's lowest CIR is 6 Mbit/s, and most of our customers enjoy CIRs above 10 Mbit/s.

So I decided to retest.

The test was supposed to show whether shaping gives a noticeable positive difference in how network software behaves.

We put one PC behind the policing/shaping device.

About ten people were asked to go through 4 tests: policing and shaping at 3 Mbit/s and 6 Mbit/s. We reduced the speeds because rate-limit artifacts are more noticeable at lower bandwidth.

Testers didn't know which exact test they were performing. After completing all 4 tests, each tester graded every test on a 1..5 scale.

The result was surprising: most testers preferred the 'rate-limit' experience - in about 60% of the same-speed comparisons.

Overall, most testers gave high marks (4 or 5), and the grade totals for 3 and 6 Mbit/s are very close.

So the final result: user experience with rate-limit is at least no worse than with shaping.

Maybe one of the reasons is that CIRs are so high these days that most interactive applications - web surfing, online video, games - never reach the limits.

Graphs, retransmissions and packet loss look 'better' with shaping, but if that doesn't affect the perceived quality of work while being much more expensive in terms of hardware, that's a good reason not to use it. What's more, I've never heard of anyone complaining to the support team that their speed graph isn't smooth enough.

Tuesday, November 1, 2011

ng_state


NG_STATE


The ng_state module implements traffic policing with elements of DPI (deep packet inspection): traffic classification and traffic processing according to the classification results.


What is it? A little bit of history, economics and reasons to use it

Economics and common sense


The solution makes sense for anyone dealing with > 1G of traffic and > 500 users to process and police, who either has no existing solution or needs functions this module provides.

Most likely that means ISP applications or corporate traffic gateways.

To get it up and running you need time and knowledge of FreeBSD administration and network systems integration.
I recommend using Cisco devices or similar if possible: due to their good support, stability and quality they are more likely to give higher service uptime.

Nevertheless, if you are in a situation of very fast growth, on a tight schedule, unable to buy or wait for such a device, have a large amount of traffic, and are good with FreeBSD, then the module and a solution built on it could pay off.

The module might also be interesting if you already have a FreeBSD gateway.



Time


I estimate about 40 man-hours to set up and test the solution, and 40 more to integrate it with external systems. One month to get it running on production systems.

Hardware

4-5 Gbit/s of duplex traffic (4-5G up and down at the same time) is processed on $2k hardware;

10 Gbit/s in duplex - $3.2k to $4k of hardware.



History


At some point the policing feature had to be split out of the existing NAS, because a new type of access appeared. With that split, the policing, proxying, messaging and P2P detection functions had to run on standalone hardware.

Of course, we spent some time looking for something ready out of the box. The best fit was the Cisco SCE. We were ready to spend $30k per processed gigabit, sacrifice some features and wait several months for delivery.

A device was ordered for testing, but the test was delayed by 2.5 months. We finally got it, but didn't like it - the thing required another 3-4 PCs to manage it (it actually turned out later that those same 3-4 PCs could process much more traffic than the device itself), and it couldn't alter traffic, only drop it.

The device we got for testing had one of its two ports out of order, so we spent another 10 days trying to make it work: chasing support, trying other GBICs/configs, chasing again, and so on.
By the time we finished the testing, the whole service was already running on a PC-based solution. It was working OK. It had some bugs, was not optimal, and was CPU hungry.

Another few months of optimization, merging/cutting functions and centralizing processing turned it into the very effective, well-functioning module we have now.




Application

The module implements user bandwidth policing according to the service agreement and traffic type. It is intended to be used on Internet gateways.


Main features

  • User bandwidth policing
  • Possibility of a common CIR for a group of IP addresses
  • Traffic classification and traffic policing according to class
  • Netflow v5 export
  • WEB traffic redirect to display informational messages
  • Flood and attack protection (internal and external)
  • P2P traffic detection by behavior

The module must be placed in the traffic path. This can be done by setting up a standalone server with two NICs inserted into an Ethernet link (a link split), or by installing the module on an existing server.
It is preferable to install the module on a standalone server when bandwidth exceeds 500 Mbit/s or 100 kpps.
The module is functionally close to the Cisco SCE.


Some reference numbers

Number of supported users: 65k.
The module has been tested at loads of about 5-7 Gbit/s, which means handling 4 million flows.
In theory the maximum number of flows is 40-50 million; the user count is limited to 65k only by common sense.


Performance


Performance is really high. Traffic is handled at kernel level and reaches the module with almost no other kernel processing, which is very efficient.
The following should also be considered:

  • NIC type and features
  • CPU type, number of processors
  • traffic characteristics
  • Number of rules in classification module



Performance reference

100 kpps processed on one core of a Core 2 Duo 6800 leads to 55-60% utilization of that core.
A Core i7-960 machine with an Intel 10Gb NIC is capable of processing about 2 Mpps, which is 10-12 Gbit/s total or 5-7 Gbit/s in duplex.


When classification, Netflow and WEB traffic redirection are not required and bandwidth policing is the only task, it is possible and desirable to switch off the flow engine (at the moment this can only be done by patching the module), which gives roughly a 5x performance increase.


In such a configuration a 180 kpps load takes 15% of one core of a Core 2 Duo 6800 CPU.


The mentioned results were seen on average user Internet traffic (40-50% HTTP, another 40% torrents).



System requirements

  • OS FreeBSD 8.x (it also works on 7.x, but 8 is recommended for its better network subsystem)
  • 2-4 GB of memory
  • 64-bit OS (preferred)



Installation


You will need the kernel sources, and you have to be sure the running kernel is the same version as the sources. If it is not, install the kernel sources, then compile and install a new kernel.
Adjust some OS parameters:
 - kernel memory
 - netgraph items (sometimes they help to survive congestion)
Compile the module:

  • Copy the sources to /usr/src/sys/netgraph/state
  • Copy the Makefile to /usr/src/sys/modules/netgraph/state
  • Go to /usr/src/sys/modules/netgraph/state and run make install clean
  • Load the module: kldload ng_state
  • Create your netgraph graph or use my script
  • Using netgraph messages, start setting up services according to your needs
  • Profit



Typical schemes

1. Standalone machine with classification

2. Standalone machine without classification



3. Existing machine


4. Adding redundancy


To add redundancy, use any managed switch with STP/RSTP functionality. Connect a bypass link in parallel to the ng_state server with a higher STP/RSTP cost. This way, whenever L2 connectivity is broken on the primary link (the one with the server), traffic switches over to the bypass link. On some switches, loop detection should be switched off if all the connectivity terminates on the same device.
Remark: all non-IP traffic travels without restrictions between the up1 and down1 hooks. Such traffic is not subject to policing, so all non-IP L2 protocols will run without policing or any chance of loss.


Attention


Promisc mode


When working in Ethernet link split mode, the NICs should be switched to promiscuous mode to disable hardware MAC filtering.
Do this for all NICs participating in the split:
ifconfig em0 promisc


Traffic is processed on kernel space


Under critical load the system may become unmanageable, even from the local console.
IPFW
Since the traffic never reaches the IP processing level, IP-related firewall features won't work; only basic ALLOW/DENY rules will do. man ipfw, please.
MAC autosrc for ng_ether
To disable source MAC substitution in the traffic, disable the feature in ng_ether on both NICs of the split:
ngctl msg em0: setautosrc 0


Classification


Classification feature

The module can get information about the traffic class from external systems or modules. It is assumed that a flow can be classified from its first 10 packets. After a successful pattern match, further packets in the flow get the same class and no more packets are sent to the classifier.
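
Roughly, the idea looks like this (a hypothetical C sketch, not the module's real data structures): the flow record caches the class once a pattern matches, and stops feeding the classifier after the tenth packet.

    #include <stdint.h>

    #define CLASS_UNKNOWN       10   /* "unable to classify yet" */
    #define CLASS_P2P_POSSIBLE   2
    #define MAX_LEARN_PKTS      10

    struct flow {
            uint8_t class_id;        /* cached class, starts as CLASS_UNKNOWN */
            uint8_t pkts_seen;       /* packets already offered to the classifier */
    };

    /* Returns the class for this packet; calls the external classifier
     * (e.g. an ng_bpf chain) only while the flow is still "learning". */
    static int
    classify_packet(struct flow *f, const void *pkt, int len,
        int (*classifier)(const void *, int))
    {
            if (f->class_id != CLASS_UNKNOWN)
                    return (f->class_id);         /* learned: skip the classifier */

            if (f->pkts_seen < MAX_LEARN_PKTS) {
                    int c = classifier(pkt, len); /* pattern match on early packets */

                    f->pkts_seen++;
                    if (c != CLASS_UNKNOWN)
                            f->class_id = c;      /* cache for the rest of the flow */
                    return (f->class_id);
            }

            /* No match within the first packets: treat as P2P-possible. */
            f->class_id = CLASS_P2P_POSSIBLE;
            return (f->class_id);
    }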

As an example, ng_bpf can be used. ng_bpf is capable of matching on the following:

  • tcpdump expressions
  • packet length and headers
  • protocols and ports
  • payload data patterns

man bpf and man ng_bpf, please.
After being classified, the traffic is sent back to the ng_state module to complete the learning process:



Classified traffic processing
The module sends each packet out in its original direction on the up/down hook whose number matches the traffic class. If a packet is not classified yet, or no hook with that class number exists, the traffic is sent out on hook 1 (up1 or down1).


Further processing examples

Overall view of classification process

  • send packets to a different NIC (say, send all WEB traffic to the NIC where a transparent proxy waits to catch it)
  • attach an additional policer to a class hook (ng_car)
  • send P2P to a cheap channel



P2P


The module is capable of detecting P2P traffic by its behavior. Traffic that has already been classified is excluded from detection. Detection is based on the number of connections and traffic intensity. Every 5 seconds a decision is made for each service: is this service (user) a P2P-er or not. For detected P2P-ers, traffic marked as p2p-possible is sent to class 2 for the next 5 seconds.


Flood protection


The module forbids creating more than 4000 flows per service, except for the default service. This protects against floods, port scans and other connection-hungry attacks.
Note: I've tested a few P2P clients and was not able to reach the limit without tuning the client. Typically there are about 300-500 connections for 4-5 resources being downloaded over P2P. Even with tuning I only got up to 3000 flows. For normal use this limit shouldn't be a problem.


Informational messages


... are for displaying informational messages, for example 'out of balance' or an agreement to use the service. All web traffic is sent to a predefined IP address (the destination IP address is rewritten - DNAT, actually).


Managing the module


Service setup


A 'service' is used to group several IPs under a common policy. A service is a set of shared parameters and restrictions, such as the CIR (committed information rate), P2P detection, and the flag to show informational messages. One IP can belong to only one service at a time; when an IP is added to a new service, it is removed from the old one.
The module is controlled via netgraph messages.
Service and IP address management commands:
ngctl msg ng_state: setservice { userid=\"\" ip=192.168.0.1 cir_up=200000 cir_down=200000 fwd=1 fwd_ip=192.168.0.5 mpolicer=2 }
Sets the service of the customer identified by userid.
cir_down, cir_up - CIR in the downstream and upstream directions.
ip - assigns the given IP address to the service.
If you want to assign more addresses to the same service, call the setservice command with the IP and userid only:
ngctl msg state: setservice { userid=\"\" ip=192.168.0.2 }
One command for each IP address.
fwd - switches on redirection of port-80 traffic to display informational messages. Set fwd=2 to enable forwarding, fwd=1 to disable it; if not set, the old value is kept.
fwd_ip - the IP address to send the redirected traffic to. Warning: this cannot be used for a transparent proxy.
mpolicer - P2P traffic detection for the service. To enable it, use mpolicer=2. When not set (mpolicer=1), all traffic belonging to the service is subject to P2P-by-behavior detection.


Service and address query commands


ngctl msg ng_state: getservice { userid=\"\" }
ngctl msg ng_state: getservice { ip=192.168.0.1 }


Shows the service parameters for the user identified by userid or, if only an IP is given, the service is looked up by that IP address.


Statistics


ngctl msg state: servicestat { userid=\"\" }
ngctl msg state: servicestat { ip=192.168.0.5 }
Shows per-service statistics: packets and bytes transmitted/dropped, and the number of flows.


Mass speed check


ngctl msg state: getspeeds N
For mass CIR verification, to synchronize billing data with the module settings. Returns 70-100 CIRs at a time (whatever number fits into a 4 kB buffer).


Hooks description


upX, downX. X is the class number. Packets classified as class X leave the module via hook upX or downX (according to their original direction). If the hook is not connected, up1/down1 is used.
To turn on packet class learning, do:
ngctl msg ng_state: setifupclass {iface=18 class=14}
All packets entering up18 will be classified as class 14. The flow the packet belongs to will also be marked as class 14, and all further packets in that flow will be classified as class 14 without learning. This is how classes are set after a rule match in the bpf classifier.


newflows_up, newflows_down - the hooks that new packets are sent out on. When a packet cannot be classified, it should re-enter the module with class 10 ("unable to classify"), which causes further packets of the flow to be sent to classification. If classification still fails after the 10th packet, the flow is marked class 2 (P2P-possible).


export - for NetFlow v5 export; works the same way as in ng_netflow.


Special features, defaults


Default service


Packets not belonging to any service are assumed to belong to the "default service". Default service parameters can be set via the setservice command with the userid parameter set to "default":
ngctl msg state: setservice { userid=\"default\" cir_up=64000 cir_down=64000 }
The same goes for retrieving the service parameters:
ngctl msg state: getservice { userid=\"default\" }
The default service is meant to be used as the default policy - for example, if its CIR speeds are set to very small values, then no traffic will pass until an IP is assigned to a real service.


Deleting IP addresses from services


Not possible. But you can move an IP back to the default service. Use setservice:
ngctl msg state: setservice { userid=\"default\" ip=127.0.0.2 }


Netflow


Export is done the same way as in ng_netflow. Since the state module does the export in a dedicated thread every 3-5 seconds, the NetFlow stream is somewhat bursty.
In addition, the src_as field is used to pass the number of the service the flow belongs to. Billing can check this number to verify that the correct service is set.


Predefined classes

  • 1 – default class, shouldn't be policed
  • 2 – p2p-possible class
  • 10 – unable to classify yet

Please note
The module is a very specialized instrument and is almost useless on its own. To make it fully functional, it should be set up together with scripts that translate your business logic into module commands.


Thanks
Special thanks to Gleb Smirnoff, whose ng_netflow is used as a base for the module.

Download

here

Sunday, October 30, 2011

ng_state

Working on releasing the ng_state code and description.
Created a Google Code project.
Documentation is almost ready, polishing now.

Tuesday, October 25, 2011

Engrish

I've just read all my posts. My English is ugly. I am sorry about that. I will try to make it better. Thanks.

Wednesday, June 22, 2011

DPI traffic recognition

How it works
To survive congestion and prioritize traffic, we're using a mix of DPI and behavior analysis.

DPI looks at signatures and marks flows with a certain class if they match the patterns.

If a flow doesn't match any pattern within the first 10 packets, it is marked as P2P-possible.
Every 5 seconds a decision is made about each client: is he a P2P-er or not.

For those detected as P2P-ers, the P2P-possible traffic is subject to additional policing.

Why behavior?
 - P2P traffic mutates frequently and its developers do everything to hide it, so DPI is hard to maintain and not very effective
 - such traffic is easily detectable by its behavior: P2P applications tend to create hundreds of `connections` (TCP or UDP), and they are greedy about bandwidth

So, to detect P2P it's enough to simply count flows and the amount of data transferred through them.
We also do a simple optimization: only P2P-possible flows with traffic seen in the last 3 seconds are considered.
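
As a rough C sketch of that decision (the thresholds here are invented for illustration; the real values came from the captured statistics, not from here):

    #include <stdint.h>

    #define P2P_MIN_ACTIVE_FLOWS  200               /* made-up threshold */
    #define P2P_MIN_BYTES         (1024 * 1024)     /* made-up threshold, per window */
    #define ACTIVE_WINDOW_SEC     3

    struct svc_flow {
            uint8_t  p2p_possible;  /* not matched by any DPI pattern */
            uint32_t last_seen;     /* timestamp of the last packet, seconds */
            uint64_t bytes_window;  /* bytes seen in the current 5-second window */
    };

    /* Runs every 5 seconds for each service: count recently active
     * P2P-possible flows and the bytes they carried, then decide. */
    static int
    service_is_p2per(const struct svc_flow *flows, int nflows, uint32_t now)
    {
            int active = 0;
            uint64_t bytes = 0;

            for (int i = 0; i < nflows; i++) {
                    if (!flows[i].p2p_possible)
                            continue;       /* classified traffic is excluded */
                    if (now - flows[i].last_seen > ACTIVE_WINDOW_SEC)
                            continue;       /* idle flow, ignore it */
                    active++;
                    bytes += flows[i].bytes_window;
            }
            return (active >= P2P_MIN_ACTIVE_FLOWS && bytes >= P2P_MIN_BYTES);
    }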

The numbers were picked after 2 days of capturing and analyzing flow statistics.

One of the ways to do that was building graphs from real-life traffic.
This is what some kinds of traffic look like:


Web pages generate up to 100 flows, but they carry a relatively small number of bytes.

Downloads from one host (say, an FTP or HTTP file download) give a relatively small number of flows with a large number of bytes in them.

P2P traffic generates a few hundred flows, with lots of bytes in the active ones.

P2P traffic is very flexible: depending on network conditions it can pull big flows from a single peer, or download only a few hundred kilobytes from each of a thousand peers. We tried per-flow policing while detecting P2P - it gives no result, because P2P software adapts very well to such conditions: if large flows are restricted, it just opens lots of small flows.

Policing traffic by its behavior is a "feedback" system.
So the policing parameters should be chosen so as NOT to let the feedback system flap between extreme states (the RED graph); instead it should settle down to a stable level.

So, by playing with the timings and the amount of regulation, we eventually got the BLUE picture below.



By now the system is pretty effective.
It's possible that P2P will later mutate to look like one of the whitelisted protocols - this is very simple to do - but I think it won't be difficult to detect.

At the moment we weren't able to find such things in real traffic.
We did discover that it's rather popular to use HTTP in games and SSL in applications, but we haven't detected any P2P-like traffic on top of them.

And the last, economic part:

We'd like to buy a ready-made DPI solution, because it has so many patterns that are updated and maintained, BUT all of that COSTS.

Prices are about $10k per gigabit. Our combined BRAS/DPI/SCE device on top of PC hardware is $0.5k per gigabit - plus our time, yes, but that's still more than a 10x difference, and it's easily upgradable and expandable.

Saturday, April 16, 2011

Problems with BRAS high load ... fixed


About two weeks ago we started to experience problems with the NAS servers, as I wrote a week ago.

At peak hours they were serving almost 300 Mbit/s and suffering from having 6000+ pipes. At some point dummynet caused degradation of up to 10 times (both in traffic and in packet count). The pipes were used for customer traffic policing.

To avoid the 'pipe' problem and get rid of dummynet, I wrote the ng_dummynet module. The reason we don't want to use IPFW + tables + ng_ipfw is that we don't allow any L3 processing at all. ng_dummynet remembers which IP is in which class and sends traffic of the same class (which may cover several IPs) to the same ng_car module.
Two days in production showed 20-30% degradation compared to dummynet, which was very strange and unexpected. Over those two days I spent a lot of time trying to figure out where the problem was, which could well have ended in profiling the kernel taskq process.

But today, on the test stand, we finally detected that the degradation in the new ng_dummynet module starts above 1000 customers.

While doing block-by-block cut-offs of functionality in the module, I finally found the problem: it was the difference between network and host byte order. The module hashes on the last two bytes of the address, which turned out to be the first two bytes of the IP - and those are the same for all customers. So the hashing degenerated into a simple one-by-one enumeration of IPs, walking through 2-3k IPs on average for each packet.
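
In plain C the pitfall looks roughly like this (a sketch, not the module's actual code):

    #include <stdint.h>
    #include <arpa/inet.h>  /* ntohl(); the kernel has its own equivalent */

    #define HASH_BUCKETS 65536

    /* Buggy: 'addr_be' comes straight from the packet in network byte order.
     * On a little-endian box the low 16 bits are then the FIRST two octets
     * of the address - identical for all customers in the same /16 - so
     * everything lands in one bucket and lookup degrades to a linear walk. */
    static unsigned
    hash_addr_buggy(uint32_t addr_be)
    {
            return (addr_be & 0xffff) % HASH_BUCKETS;
    }

    /* Fixed: convert to host byte order first, so the hash really uses the
     * last two octets, which do differ between customers. */
    static unsigned
    hash_addr_fixed(uint32_t addr_be)
    {
            return (ntohl(addr_be) & 0xffff) % HASH_BUCKETS;
    }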

The module fix worked: CPU taskq time immediately dropped several times over.
Sure you want the picture:


The interesting thing: traffic from one NIC is always processed on one core.

Conclusions: with two NICs it's dangerous to run on a 2-core system, as traffic starvation may overload the whole machine. On a 4-core system (i5) with two NICs, CPU utilization will not go higher than 50%; on an 8-core i7, not higher than 25%. Of course, on the i5 and i7 the free cores can be loaded with something extra - for example, another NIC.

Take a look: a C2Duo 8500 -> Core i7 (some higher model with 4 cores at 3 GHz) upgrade




And a few last words: I had even started to write an all-in-one module to avoid multiple IP lookups and rely only on the netflow engine.
But now it seems to be unnecessary - unless I do P2P-by-behavior recognition.

I think that after final optimizations we can gain another 10-15% of CPU time on the same system.

Thursday, April 7, 2011

ng_dummynet


Did some online debugging yesterday.

We've had one system with high load.

For now I cannot tell exactly what is causing the high CPU load.

During the test I didn't touch the IPFW QUEUE system, which is responsible for the policing. In fact, before reinitializing netgraph the system experienced 70-80 percent CPU load (both cores at the same 70-80). At that moment there were 35-45K packets per second.

After the netgraph reinit the load dropped to 10-15 percent on each core. Pretty strange. The next thing to suspect was the number of flows in ng_state. Before the reinit there were about 120K. Right after the init the system collected 40K flows and the load was 10-15 percent; over the next 15 minutes it rose to 80K flows, and the load rose by another 2-5 percent.

Interestingly, completely removing netgraph processing doesn't result in a significant CPU utilization drop - the same 10-15% on taskq. Which means some part of netgraph is killing the performance.

So, to conclude: I am not sure that the problem is dummynet, but at least eliminating dummynet will let us overcome the 3K users (6K pipes) per-system limit.

I've called it ng_dummynet. The thing is in progress now. Yesterday I did the skeleton. It is supposed to be used together with ng_car nodes created on a per-user basis. For now it is able to pass traffic uplink <-> downlink. Hoping to finish it by tomorrow.

Sunday, April 3, 2011

NAMED - almost end of story

03.04 NAMED - end of story
So, a one-week full-load flight with the new recursor is finished.

Some results:
 - no software-related problems, except that once we experienced some kind of attack when DNS traffic tripled; according to the maintenance team's report it was an attack on the authoritative servers, and since they reside on the same hardware as the recursors it caused significant system degradation, presumably because named was hogging both CPUs; unfortunately, no real debugging or analysis is possible now
 - CPU load reaches 25-30% at peak time, and the good news is that powerdns is able to use both CPU cores without process blocking

Some setup details: there is actually just one cache now, but with all the logic put into two scripts - nxdomain and preresolve. Some authoritative functions related to giving different answers to different internal networks are implemented in the preresolve script.

Local domains and RFC 1918 ('grey') networks are forwarded directly to the authoritative servers, as the root servers have no idea about their delegation (they are actually site-specific zones); some blacklisted zones are also handled in preresolve. The scripts are in Lua, a C/Perl-like language that is pretty simple and easy to understand. According to the tests, even complicated lookups in Lua are much faster and more effective than doing real lookups (for a blacklisted zone).
Problems: there always are. The main one is that the most recent version of pdns-recursor doesn't do round-robin DNS balancing correctly, causing some servers to be overloaded. The previous version works fine though, so we left it in production for now. The other thing: pdns-recursor also trims UDP answers, so we cannot return 40-50-server pools with it. BUT, thanks to the preresolve hook we don't need that anymore: the problematic pool of ~50 servers is divided into 6 networks, and in Lua we can answer with only those servers that are supposed to serve that network segment. In comparison, BIND only allowed site-specific sorting of the pools, returning some servers first, but the whole pool was still in the answer.


Overall: pdns-recursor is a really nice upgrade - very low memory requirements, simple and efficient. Highly recommended for high recursor loads (5-20K DNS queries per second).

Friday, April 1, 2011

who is speaking


I am doing a little out-of-work project - a call center for Italians, to handle paid services (mostly fortune-telling, destiny, erotic lines and so on). It's actually almost finished and now handles about 300-400 calls a day.

Some details: a dedicated server receives calls over SIP from the PSTN via a carrier. Calls are routed to the operators. Some basic accounting and management is done in MySQL.

As the main developer disappeared in the middle of it, I am finishing the thing myself.

Actually this is a pretty basic call center; it only interested me because I had no hands-on SIP and VoIP experience before at all.

Now I've paused development for a while to determine the reason for frequent call drops in the evenings. To get a full picture I've installed Zabbix locally and keep monitoring some basic things: connectivity to the SIP provider, server load, and other sites as well.

Saturday, March 26, 2011

some news, good and not so good

At the moment we have transferred almost 20k happy customers to the new technology; 40k are pending. The first outcome: about 20% of customers prefer the old access methods.

Now we have problems that appeared as a result of past service simplification. The problem is DUMMYNET. After 6k PIPES the system dies. That's it - it doesn't matter how much traffic.

So now we are adding boxes to process the traffic, but this is a temporary solution. I am now looking to get rid of dummynet and use ng_car for policing.

When we switched from dummynet to ng_car on the PPTP servers, we could double the user capacity per system. Hopefully this will be the case again.

Some task analysis: there are 2 primary ways to do the policing - one is to put a separate traffic multiplexer in front of thousands of ng_car nodes, the other is to use the already well-working ng_state for that.

The advantage of a separate multiplexer is that it will just work in the current setup. The disadvantage: every packet has to be switched that way.
An implementation inside ng_state would allow processing only the first packet of each flow, BUT the following problems arrive: traffic forwarding and class policing would then have to be done differently. Actually, instead of using different traffic paths for different classes I would have to add class labels to the packets.

I will do the separate multiplexer now, as the faster-to-implement solution; it also might not be much slower than the second option, since there a label lookup would be needed for each packet anyway.

Thursday, March 10, 2011

DNS story, part2

Tests revealed that the old djbdns really is too old: it processes requests 10 times slower than even slow BIND does.

The final decision and contest winner: PowerDNS Recursor for the recursive part and BIND for the authoritative part. And yes, we finally split them.

The idea of having 12 caches and a split-horizon authoritative DNS server remains just an idea.

Instead, it turned out that PowerDNS Recursor has a very nice feature: it can pass all incoming requests to a Lua script, and all requests that end with an NX answer to another Lua script. This is enough to fulfill all our split-horizon demands.

The only exception: we don't know the destination IP of a DNS query in the script when the nice "packet cache" feature is on. That's why at the moment two copies of PowerDNS Recursor are on duty. Not a big problem - two != twelve.

It took about 10 working days to transfer all the recursive load from named to PowerDNS. All went smoothly, without losing a single server.

Some things to notice:
  • the most recent version of PowerDNS Recursor doesn't do round robin correctly when answering queries with many records; I didn't have time to figure out why, I just downgraded by one minor version and it's OK
  • to have all the private things functional (.local addresses and PTR resolving for "grey" networks), the recursor must be explicitly told where the authoritative server for them is, as the root servers will not know or answer about such resources
  • it restarts instantly
  • memory consumption is 200-300 megs, ten times less than named

At the moment we are leaving named as the authoritative server, mostly because the NOC duty guys are comfortable working with it and there is no heavy traffic on it now: most mission-critical customer requests are processed inside the recursor without asking the auth server.

The CPU usage diagram:

It doesn't look impressive, but please note that the user load (middle line) with BIND couldn't get above 50% (one CPU). The user load spike after the upgrade is named cleaning out its now-empty cache (the recursive part is still on, but no real requests go to BIND anymore).

Saturday, February 19, 2011

Bind, djbdns, etc and ISP


Today, instead of polishing my SCE solution to put it into production at 4 am on Monday, I had to dig into some DNS stuff.

In short, BIND sucks. To be more specific: actually it doesn't, and it's a very good piece of software which can do a lot. But when used in so-called split-horizon setups it becomes a very heavy, slow-starting, memory-hungry behemoth.

Today I spent 3 hours trying to understand WTF is going on with one of our DNS servers. The task was made even harder by the load balancers in the path. I discovered the problem through very unpleasant browsing; a 'host' test showed that some queries, even cached ones, took up to two seconds to resolve. That explained a lot.

So when I looked at the stats of the first NS (say, ns1), I saw this:

That's right, almost 100% of resources going to named (see the green IDLE time).
Even at night.

After doing some profiling (strace -c) I discovered that most of the time named sits in the futex syscall, and about 10-20% of those calls are unsuccessful.

A futex problem means a concurrency problem. Switching off one CPU in the config did a magic thing: two out of three threads dropped their futex time to less than 1%, and there were no more unsuccessful futex calls.

Interestingly, the third thread is almost nothing but futex calls, and half of them fail.

This setup does the job much better: no visible delays occur even when I put all the load on the server. And a significant load drop.

Here is the whole picture.


In the center is the result of a named restart - it's easy to see the problem is not immediate, it takes 4-5 hours to develop.
On the right, the result of switching to one CPU and getting rid of the locks is visible: the system becomes 60% free, with no visible signs of degradation - so far, anyway.

Troubleshooting is done, and somehow it's working. But I need to go further. First of all, a hardware upgrade is needed, as it's only a C2Duo 6300 CPU; I will do it on Monday.

But wait - any software can bring even the most sophisticated hardware to its knees by doing stupid things the way BIND does.

It looks like I have to look at something very efficient - for example djbdns, something in the spirit of nginx.

The problem is that I need views support (split-horizon DNS). Djbdns is the only free alternative able to do that. Others suggest running an extra instance of the server per view, which would be a pretty complicated setup in my case.

The other reason to get rid of BIND is the handling of load-balanced pools consisting of 40-50 servers: BIND returns them all, and I haven't found a way to reduce the answer to, say, 4-5 records. And this part is really needed, because there are plenty of dumb devices that fail to resolve big RR pools.

Djbdns looks like a pretty efficient and really nice solution.
It also splits the caching and authoritative parts of the server into two daemons, which is a nice industry-standard feature, but it creates another problem for me.

The authoritative server will not do caching or forwarding, but it does do split horizon.

If I put the caching server first (the way it's supposed to be), then the authoritative server will have no info about user IPs, which means split horizon will not work.

Next I got the idea of substituting the root servers on the authoritative server and using a patch to answer with the list of root servers when the authoritative part receives something it's not authoritative for. But that means every client will do two requests instead of one: the first returns the IP of the cache and the second does the actual request to the cache. Not good.

What will definitely work is making one cache per view. At the moment I have 5 views, and one of them doesn't need caching.

So many thoughts that I am stopping at this point. Let's rest a bit.

While looking through the features of the various servers I found some interesting utilities shipped with PowerDNS - they can replay captured DNS traffic against a selected DNS server, which is an easy way to simulate near-real load on a new server.

Monday, February 7, 2011

Forwarding interesting traffic


We spent a working day trying to figure out how to implement forwarding of selected traffic in netgraph. No result. So... it took a couple of hours to write one very simple module: fixmac.


Fixmac overwrites the destination MAC. This is forwarding, actually, with one small disadvantage: you have to manage the MAC of the destination router or system manually (unlike IPFW, which works with the ARP table - though even IPFW never initiates an ARP lookup).

Fixmac has two hooks (in and out) and passes all traffic both ways. Everything going in -> out gets its destination MAC replaced; out -> in is left intact.

It supports one netgraph message - ngctl msg fixmac: setmac 01:02:03:04:05:06 - plus stats.
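
The data-path part is tiny; roughly (a plain C sketch, not the module's mbuf-based code):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define ETHER_ADDR_LEN 6

    struct ether_header {                       /* as in <net/ethernet.h> */
            uint8_t  ether_dhost[ETHER_ADDR_LEN];
            uint8_t  ether_shost[ETHER_ADDR_LEN];
            uint16_t ether_type;
    };

    /* Rewrite the destination MAC of a frame travelling in -> out;
     * 'newmac' is whatever was configured with the setmac message. */
    static void
    fixmac_rewrite(void *frame, size_t len, const uint8_t newmac[ETHER_ADDR_LEN])
    {
            struct ether_header *eh = frame;

            if (len < sizeof(*eh))
                    return;                     /* runt frame, leave it alone */
            memcpy(eh->ether_dhost, newmac, ETHER_ADDR_LEN);
    }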

I never thought I'd ever need to remember or get any practical use out of this (src: Wikipedia):


This time it was quite helpful.
Most of my workmates failed to guess which comes first - the Ethernet addresses or the 802.1Q header.

Saturday, February 5, 2011

ng_iptable load tests

Did high-load tests on the 2G testbed - to see the overall impact, with simultaneous add/del under high traffic to test stability.

Now, switching back to the state module:
First of all, I need a new testbed config and to check whether bi-directional flow recognition works. The simplified netgraph structure will be:

So I am going to use the same per-class netflow matching to do the IP-based switching - which means iptable matching is done once per flow; if it were on the main traffic path, EACH packet would have to be switched.
Most of the important traffic will be sent through 4 nodes (if policing is needed) or 3 (no policing). "Learning" traffic may involve many more steps; in the worst case a packet has to be run through 50-100 bpf programs. This is a CPU and delay problem, so only the first 6-8 packets of each flow are sent for recognition.

Another thing: it might be desirable to add an ng_car to new flows, to automatically mark flows exceeding a certain limit as junk (I will do it later if needed).

One more thing (TODO) - a "maintenance switch". Nobody likes it when the system crashes, and it does crash when nodes are reconfigured with high traffic on them. So a maintenance switch must be used to take traffic recognition offline and thus allow safe reconfiguration.

Friday, February 4, 2011

ng_iptable

Oops, tests show that I wrote a kernel-panicker module, not an iptable :)
    UPD: mutexes are to be initialized prior to use :) - FIXED
    UPD: seems to be working; found a bug with duplicate IP items - writing code to delete all duplicate entries - FIXED
Now, the mass add/del hammering test...
Well, 10k adds/removals... pretty stable... no problems yet


Tuesday, February 1, 2011

ng_iptable

Decided to do a user IP switch: it will be based on the MUX module and will have three hooks (in, default, intable). A packet from "in" will be switched to the "default" hook if its IP is not in the table, and to "intable" if it is.
It will accept two additional control messages: ADD <IP> or DELETE <IP>.
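
Under the hood something like a small hash table would do. This is only a C sketch under my own assumptions - the real module also needs the locking mentioned elsewhere and kernel malloc(9) instead of libc malloc:

    #include <stdint.h>
    #include <stdlib.h>

    #define IPTBL_BUCKETS 4096

    struct ip_entry {
            uint32_t         addr;              /* IPv4, host byte order */
            struct ip_entry *next;
    };

    static struct ip_entry *buckets[IPTBL_BUCKETS];

    static unsigned
    ip_hash(uint32_t addr)
    {
            return (addr ^ (addr >> 16)) % IPTBL_BUCKETS;
    }

    /* ADD <IP>: idempotent insert. */
    static void
    iptable_add(uint32_t addr)
    {
            unsigned h = ip_hash(addr);
            struct ip_entry *e;

            for (e = buckets[h]; e != NULL; e = e->next)
                    if (e->addr == addr)
                            return;             /* already in the table */
            e = malloc(sizeof(*e));
            if (e == NULL)
                    return;
            e->addr = addr;
            e->next = buckets[h];
            buckets[h] = e;
    }

    /* DELETE <IP>. */
    static void
    iptable_del(uint32_t addr)
    {
            struct ip_entry **p = &buckets[ip_hash(addr)];

            while (*p != NULL) {
                    if ((*p)->addr == addr) {
                            struct ip_entry *dead = *p;

                            *p = dead->next;
                            free(dead);
                            return;
                    }
                    p = &(*p)->next;
            }
    }

    /* Per packet: "intable" hook if found, "default" hook otherwise. */
    static int
    iptable_lookup(uint32_t addr)
    {
            struct ip_entry *e;

            for (e = buckets[ip_hash(addr)]; e != NULL; e = e->next)
                    if (e->addr == addr)
                            return (1);
            return (0);
    }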

OK, renamed and cleaned the module; now it's capable of multiplexing traffic, but doesn't do the IP lookup yet.
To do:
    - add IP / delete IP messages and addresses in the table (DONE)
    - take the IP from the packet data (DONE, not checked yet)
    - do the lookup (it has to be impressively efficient) (DONE, but not even checked)

So, to check the thing I will do the opposite setup with iptable and one2many nodes.

This way:



Monday, January 31, 2011

Almost production

Pretty close to putting the module into real use.

tasks

- give it a different name (at the moment the kernel complains the module cannot be loaded, because ng_netflow is already there when the original ng_netflow is loaded). (UPD: fixed; after two nights of figuring out what was wrong I just replaced every occurrence of "netflow" with the new name "netfl2". Changing only the module's self-reported name wasn't enough - it looks like the symbol name of a proc or function is what the kernel uses to reference the module. OK, this is a good reason to switch to another name right now, so the new name will be 'state'.)

- configure it to process traffic the way the current traffic shapers do;
right now they are implemented on Linux:

-A FORWARD -p tcp -d XXXXXXXXX/19 --sport 1024:65535 -m length --length 500: -m hashlimit --hashlimit-mode dstip --hashlimit-above 15/sec --hashlimit-burst 80 --hashlimit-name P2P -j TOS --set-tos 2   (inbound, toward the client)

In human words: all traffic destined to XXXXX, coming from high ports, with packet size 500 and above, exceeding 15 simultaneous connections, gets marked with TOS 2.
So this introduces a kind of problem: there is no connection tracking info in the module yet, so we don't know the number of connections (yet).

The other problem is that only certain users should get this filter, but it's not desirable to add that many BPF programs: with a planned load of up to 6000 customers, you don't want to force a packet through 6000 BPF programs.

So, what can be done:

1. A user switch - an ng_user node which holds an IP table and switches traffic onto two hooks (match, nomatch) depending on whether the IP is in the table.
The user switch might also solve the problem of transparent proxying (and other per-user services), since IP forwarding doesn't work at L2 and there is no L3 traffic processing in our architecture. (DONE)


2. Do the user-switching inside the netflow module itself. The benefit is that a single flow-cache lookup does the job; the disadvantages are loss of flexibility and a mixture of two modules, and... yes, it's against KISS.

Saturday, January 8, 2011

thoughts about next steps

Spent one more day thinking about the most productive way to detect the traffic and make the whole thing really fast, debuggable, manageable and easy to use.

The idea of working with traffic unidirectionally is definitely bad: most traffic signatures can be found in only one direction, but traffic can and should be identified in both.

So the next major fix will be passing traffic through the netflow algorithm bi-directionally, and policing the classes symmetrically in both directions.

This leads to major design changes: two bpf nodes are needed, connected to the two classifying sides, and flows ALWAYS have to go through the netflow engine. This is actually OK, as previously I meant to use the same setup on both traffic sides anyway, and now it's all just being pulled together into one big beast.
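
One way to link the two directions of a flow is to build a direction-independent key (just a C sketch of the idea, not what the module actually does):

    #include <stdint.h>

    struct flow_key {                   /* direction-independent 5-tuple */
            uint32_t lo_addr, hi_addr;  /* numerically ordered endpoints */
            uint16_t lo_port, hi_port;
            uint8_t  proto;
    };

    /* Build the same key for A->B and B->A, so both directions of a flow
     * hit one cache entry and can be policed symmetrically. */
    static struct flow_key
    make_flow_key(uint32_t src, uint16_t sport,
        uint32_t dst, uint16_t dport, uint8_t proto)
    {
            struct flow_key k;

            k.proto = proto;
            if (src < dst || (src == dst && sport <= dport)) {
                    k.lo_addr = src; k.lo_port = sport;
                    k.hi_addr = dst; k.hi_port = dport;
            } else {
                    k.lo_addr = dst; k.lo_port = dport;
                    k.hi_addr = src; k.hi_port = sport;
            }
            return (k);
    }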

One future thought: it isn't that difficult to make a subscriber list and create a class 'prepend' for each subscriber - meaning each subscriber's traffic class could be handled separately. This is not a real need now, just a thought for the future, to include BRAS functionality.

Friday, January 7, 2011

Ok, quite good.
Two classes, each with its own policy.
To make it clear, the class CIRs differ 10 times, so the screenshot is self-explanatory - I can only add that for the test, HTTP traffic is set as class 2 with CIR=100kbps and the default class (all other traffic) has CIR=10kbps. And... FreeBSD comes over FTP, while Ubuntu comes over HTTP.

For convenience, I created a Perl script with a text config to keep tcpdump patterns together with their descriptions and class numbers. Later I'll consider reordering and on-the-fly changes in the same script.

What is next?
- create real patterns: not just ports, but real DPI patterns, or patterns combined with ports
- netflow2 node: a new name and bi-directional functionality
- redefine names, terms, etc. in terms of upstream and downstream



I think the same should be done for classes - but the structure there is actually pretty fixed, so it's possible to create 50 (or 50k) classes and then only change each class's CIR. Not so important now, just something to consider in the future.
Starting with a test setup on real traffic now.
I am altering the testbed to do some ng_car tests, then I will set up ng_cars for the default and classified traffic, and then tests of classification quality will start.

I am still not clear about the general policy for classifying the traffic; most likely it will be classifying the "real-time" traffic and defaulting the rest to low speed.

One more thought: the netflow2 node needs significant rewriting - create upstream and downstream links instead of data and in, think about linking opposite-direction flows (half the classification work), and also think about the possibility of using classes together with subscribers.