I should say that I don't know much about ATS but I have spent some time
looking into similar problems with other servers over the years. Some
ideas below:
On 12/29/2017 3:56 AM, Mateusz Zajakala wrote:
> CPU utilization does not exceed 40% during peak traffic. I also
> checked the number of sockets in connection pending state (SYN_RECV)
> and it never goes above 20, so I suppose accepting incoming
> connections is not the bottleneck.

Note that 40% aggregate CPU on a many-core system can easily hide a
saturated single thread. If your workload funnels much of the work
through one thread, that thread can cap overall throughput. For
example, on your 8-core box a single thread maxing out its core shows
up as only 12.5% of aggregate CPU -- well below your observed 40%.
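One quick way to check for that (plain Linux tooling, nothing
ATS-specific; adjust the process name if yours differs) is to look at
per-thread CPU rather than the aggregate number:

    # Per-thread CPU for the traffic_server process; one thread pinned
    # near 100% is a saturated thread even if the total looks modest.
    top -H -p $(pidof traffic_server)

    # Or sample per-thread usage once per second with pidstat (sysstat):
    pidstat -t -p $(pidof traffic_server) 1

If one thread sits near 100% while the rest idle, that thread is the
one to go look at with pstack (below).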
> What about the number of worker threads? I'm using autoconfig with
> default scale factor (1.5) which on my system (8 cores) creates 27
> threads for traffic_server. Does it make sense to increase the scale
> factor if my CPU utilization is not high? Will this improve the
> overall performance? What about stacksize?
I would recommend first gathering some data along the lines of "ok, so
what _is_ it doing?" rather than theorizing about solutions. For
example, use "pstack" or a similar tool to snapshot the ATS process's
thread stacks at full load. Take a few such samples and look through
them to see what it is up to. If, for example, all the threads are busy
doing work, that is good supporting evidence for making a thread pool
larger. Or is the accept thread always running, indicating the incoming
accept workload has saturated one core? I suspect there are various
counters and such maintained by the ATS code that can be inspected on a
live server -- typically these will give you some idea of what is
happening (e.g. work queuing up waiting on threads).
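A rough sketch of the sampling I have in mind (pstack usually ships
with gdb; the traffic_ctl invocation and metric prefixes are guesses on
my part since, as I said, I don't know ATS well -- check what your
build actually exposes):

    # Grab a handful of stack snapshots under peak load, a couple of
    # seconds apart, then compare them to see where threads spend time.
    pid=$(pidof traffic_server)
    for i in 1 2 3 4 5; do
        pstack "$pid" > "stacks.$i.txt"
        sleep 2
    done

    # If your version ships traffic_ctl, the built-in counters can be
    # dumped on the live server (the prefixes here are my best guess):
    traffic_ctl metric match proxy.process.net
    traffic_ctl metric match proxy.process.http

Five snapshots is arbitrary; the point is just to see whether the same
threads (or the accept thread) show up busy across samples.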
A good way to think through a problem like this is to try to imagine
what the server should be doing under the load you have. Once you have
that mental picture, go look at what it is actually doing and see what's
different.
> How should I go on about finding the cause of some of the clients not
> being able to connect occasionally?
See if you can reproduce the problem yourself with a test client (e.g.
curl/wget). If you can, good: now work to "trace" what is happening
with the packets from that client. You can use a netfilter/tcpdump
filter targeting only its IP or MAC to isolate the traffic you want to
look at from the deluge, with low overhead. This should tell you
whether the stall is occurring at the NIC, in the kernel, or in user
space. To dig into what's going on in user space, use logging (I assume
but don't know for sure that ATS can be made to log the client IP). If
you need more information than the existing logging gives you, add new
code to log whatever your investigation needs.
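As a sketch of what I mean (the URL, interface, and client address are
placeholders for whatever you are actually serving and testing from):

    # From a test client, hit one URL repeatedly and log the status
    # plus connect/total times; a failed connect or a multi-second
    # time_total stands out quickly.
    while true; do
        curl -sS -o /dev/null \
            -w '%{http_code} %{time_connect} %{time_total}\n' \
            http://your-ats-host/some/object
        sleep 1
    done

    # On the server, capture only that client's packets so the trace
    # stays small and cheap to take at full load.
    tcpdump -i eth0 -s 0 -w client.pcap host 192.0.2.10

If the SYN shows up in the capture but the handshake or request never
completes, the problem is on the server side (kernel or ATS); if
nothing arrives at all, look upstream of the box.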
If you can't reproduce the issue with your own client, well, that's not
great, but you can attempt to work "backwards" to a reproduced case by
capturing all (or a decent sample) of the network traffic and then
analyzing it offline to find examples.
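For example, assuming tshark is available on whatever box holds the
capture (and assuming port 80 -- substitute your actual listening
port), you can sift a broad sample for the usual signatures of stalled
connects:

    # Take a broad but bounded capture of proxy traffic for a while.
    tcpdump -i eth0 -w sample.pcap port 80

    # Offline, list retransmissions and resets; the client addresses
    # that show up are good candidates to trace in detail.
    tshark -r sample.pcap -Y tcp.analysis.retransmission
    tshark -r sample.pcap -Y 'tcp.flags.reset==1'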