In addition to all the other ideas given here, you should test 7.1.x. We have 
fixed many issues around origin connectivity there. Not saying it will fix 
this, but it’s worth a shot, and it’s where we are focusing development efforts.

— Leif 

> On Dec 29, 2017, at 12:44 PM, David Boreham <[email protected]> wrote:
> 
> I should say that I don't know much about ATS but I have spent some time 
> looking into similar problems with other servers over the years. Some ideas 
> below:
> 
>> On 12/29/2017 3:56 AM, Mateusz Zajakala wrote:
>> CPU utilization does not exceed 40% during peak traffic. I also checked the 
>> number of sockets in connection
> Note that 40% aggregate CPU on a many-core system can easily hide a saturated 
> single thread. If under your workload the server ends up funneling much of the 
> work through a single thread, that thread can cap overall throughput. For 
> example, on your 8-core box one thread maxing out a core contributes only 12.5% 
> to the aggregate, so it fits comfortably inside your observed 40%. (A quick way 
> to check for this is sketched after the quoted text below.)
>> pending state (SYN_RECV) and it never goes above 20, so I suppose accepting 
>> incoming connections is not the bottleneck.
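> 
> (Here is the per-thread check I mentioned above. It is only a rough sketch, 
> assuming a Linux box -- "top -H -p <pid>" will show you much the same thing. 
> Run it with the traffic_server PID as its argument.)
> 
>     import os, sys, time
> 
>     def thread_cpu(pid):
>         # cpu-seconds consumed so far by each thread of `pid`, from /proc
>         ticks = float(os.sysconf('SC_CLK_TCK'))
>         usage = {}
>         for tid in os.listdir('/proc/%d/task' % pid):
>             with open('/proc/%d/task/%s/stat' % (pid, tid)) as f:
>                 fields = f.read().rsplit(')', 1)[1].split()
>             with open('/proc/%d/task/%s/comm' % (pid, tid)) as f:
>                 name = f.read().strip()
>             utime, stime = int(fields[11]), int(fields[12])  # stat fields 14, 15
>             usage[tid] = (name, (utime + stime) / ticks)
>         return usage
> 
>     pid, interval = int(sys.argv[1]), 5.0
>     before = thread_cpu(pid)
>     time.sleep(interval)
>     after = thread_cpu(pid)
>     deltas = []
>     for tid, (name, now) in after.items():
>         was = before.get(tid, (name, 0.0))[1]
>         deltas.append(((now - was) / interval * 100.0, tid, name))
>     # busiest threads first, as % of one core over the sample window
>     for pct, tid, name in sorted(deltas, reverse=True)[:10]:
>         print('%6.1f%%  tid=%-8s %s' % (pct, tid, name))
> 
> If one thread sits near 100% in that output while the rest are mostly idle, you 
> have found your bottleneck even though the box as a whole reports only 40%.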
>> 
>> What about the number of worker threads? I'm using autoconfig with the default 
>> scale factor (1.5), which on my system (8 cores) creates 27 threads for 
>> traffic_server. Does it make sense to increase the scale factor if my CPU 
>> utilization is not high? Will this improve the overall performance? What 
>> about stacksize?
>> 
> I would recommend first gathering some data along the lines of "OK, so what 
> _is_ it doing?" rather than theorizing about solutions. For example, use 
> "pstack" or a similar tool to snapshot the ATS process's thread stacks at full 
> load. Take a few such samples and look at what the threads are up to. If, say, 
> all the threads are busy doing work, that is good supporting evidence for 
> making a thread pool larger. Or is the accept thread always running, indicating 
> that the incoming accept workload has saturated one core? I suspect there are 
> also various counters maintained by the ATS code that can be inspected on a 
> live server; typically these will give you some idea of what is happening 
> (e.g. work queuing up waiting for threads).
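> 
> (A rough sketch of that sampling idea: it assumes "pstack" is installed and 
> prints gdb-style frames ("#0 ..." lines); gstack, or "thread apply all bt" 
> under gdb, gives the same shape. It takes a few snapshots and tallies the 
> innermost frame of every thread, which is usually enough to tell parked 
> threads from busy ones.)
> 
>     import collections, subprocess, sys, time
> 
>     pid, samples = sys.argv[1], 5
>     top_frames = collections.Counter()
>     for _ in range(samples):
>         out = subprocess.check_output(['pstack', pid]).decode('utf-8', 'replace')
>         for line in out.splitlines():
>             if line.startswith('#0'):        # innermost frame of one thread's stack
>                 top_frames[line.strip()] += 1
>         time.sleep(2)
> 
>     for frame, count in top_frames.most_common(15):
>         print('%4d  %s' % (count, frame))
> 
> Lots of hits in epoll_wait/pthread_cond_wait means those threads are idle; the 
> interesting case is the same worker or accept frames showing up sample after 
> sample.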
> 
> A good way to think through a problem like this is to try to imagine what the 
> server should be doing under the load you have. Once you have that mental 
> picture, go look at what it is actually doing and see what's different.
>> How should I go about finding the cause of some of the clients not being 
>> able to connect occasionally?
> 
> See if you can reproduce the problem yourself with a test client (e.g. 
> curl/wget). If you can, good: now work to "trace" what is happening to that 
> client's packets. You can use a netfilter/tcpdump filter targeting only its IP 
> or MAC to isolate the traffic you care about from the deluge, with low 
> overhead. This should tell you whether the stall is occurring at the NIC, in 
> the kernel, or in user space. To dig into what's going on in user space, use 
> logging (I assume, but don't know for sure, that ATS can be made to log the 
> client IP). If you need more information than the existing logging will give 
> you, add new code to log whatever is useful for your investigation.
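> 
> (For the "test client" part, something like this can run alongside a filtered 
> capture -- e.g. "tcpdump -w client.pcap host <client ip>" on the server -- so 
> that when a request stalls you have its timestamp and local port to look up in 
> the capture. Purely a sketch; the host, port and path are placeholders.)
> 
>     import socket, time
> 
>     HOST, PORT, PATH = '203.0.113.10', 80, '/some/test/object'
>     REQ = ('GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n'
>            % (PATH, HOST)).encode()
> 
>     while True:
>         t0 = time.time()
>         s = socket.socket()
>         s.settimeout(10)
>         try:
>             s.connect((HOST, PORT))
>             connected = time.time()
>             local_port = s.getsockname()[1]  # lets you find this exact flow in the capture
>             s.sendall(REQ)
>             data = b''
>             while True:
>                 chunk = s.recv(65536)
>                 if not chunk:
>                     break
>                 data += chunk
>             print('%.3f ok   connect=%.3fs total=%.3fs port=%d bytes=%d'
>                   % (t0, connected - t0, time.time() - t0, local_port, len(data)))
>         except (socket.timeout, socket.error) as exc:
>             print('%.3f FAIL after %.3fs: %s' % (t0, time.time() - t0, exc))
>         finally:
>             s.close()
>         time.sleep(0.5)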
> 
> If you can't reproduce the issue with your own client, well, that's not great, 
> but you can attempt to work "backwards" to a reproduced case by capturing all, 
> or a decent sample, of the network traffic and then analyzing it offline to 
> find examples.
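> 
> (If it does come to trawling a capture offline, a first pass could be as simple 
> as flagging connection attempts that never got a SYN-ACK back. Again only a 
> sketch: it assumes a classic pcap from "tcpdump -w" on a normal Ethernet 
> interface (not -i any), the third-party dpkt module, and plain HTTP on port 80.)
> 
>     import socket
>     import dpkt   # third-party: pip install dpkt
> 
>     SERVER_PORT = 80
>     syns, synacks = {}, set()
> 
>     with open('capture.pcap', 'rb') as f:
>         for ts, buf in dpkt.pcap.Reader(f):
>             ip = dpkt.ethernet.Ethernet(buf).data
>             if not isinstance(ip, dpkt.ip.IP) or not isinstance(ip.data, dpkt.tcp.TCP):
>                 continue
>             tcp = ip.data
>             syn = tcp.flags & dpkt.tcp.TH_SYN
>             ack = tcp.flags & dpkt.tcp.TH_ACK
>             if tcp.dport == SERVER_PORT and syn and not ack:
>                 # client -> server SYN; remember when we first saw this flow
>                 syns.setdefault((socket.inet_ntoa(ip.src), tcp.sport), ts)
>             elif tcp.sport == SERVER_PORT and syn and ack:
>                 # server -> client SYN-ACK
>                 synacks.add((socket.inet_ntoa(ip.dst), tcp.dport))
> 
>     for (client, port), ts in sorted(syns.items()):
>         if (client, port) not in synacks:
>             print('%s:%d  SYN at %.3f never answered' % (client, port, ts))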
> 
> 
