In addition to all the other ideas given here, you should test 7.1.x. We have fixed many issues around origin connectivity there. Not saying it will fix this, but it’s worth a shot, and it’s where we are focusing development efforts.
— Leif

> On Dec 29, 2017, at 12:44 PM, David Boreham <[email protected]> wrote:
>
> I should say that I don't know much about ATS, but I have spent some time
> looking into similar problems with other servers over the years. Some ideas
> below:
>
>> On 12/29/2017 3:56 AM, Mateusz Zajakala wrote:
>> CPU utilization does not exceed 40% during peak traffic. I also checked the
>> number of sockets in connection
>
> Note that 40% aggregate CPU on a many-core system can easily hide a saturated
> single thread. If under your workload the server ends up doing much of its
> work in a single thread, that can starve overall throughput. E.g. on your
> 8-core box, one thread maxing out a core would only show up as 12.5% --
> well below your observed 40%.
>
>> pending state (SYN_RECV) and it never goes above 20, so I suppose accepting
>> incoming connections is not the bottleneck.
>>
>> What about the number of worker threads? I'm using autoconfig with the
>> default scale factor (1.5), which on my system (8 cores) creates 27 threads
>> for traffic_server. Does it make sense to increase the scale factor if my
>> CPU utilization is not high? Will this improve the overall performance?
>> What about stacksize?
>
> I would recommend first gathering some data along the lines of "OK, so what
> _is_ it doing?" rather than theorizing about solutions. For example, use
> pstack or a similar tool to snapshot the ATS process's thread stacks at full
> load. Take a few such samples and look at them to see what it is up to. If
> you see, for example, all the threads busy doing work, that might be good
> supporting evidence for making a thread pool larger. Or, is the accept
> thread always running (indicating that the incoming accept workload has
> saturated one core)? I suspect there are various counters and such
> maintained by the ATS code that can be inspected on a live server --
> typically these will give you some idea of what is happening (e.g. work is
> queuing up waiting on threads).
>
> A good way to think through a problem like this is to try to imagine what
> the server should be doing under the load you have. Once you have that
> mental picture, go look at what it is actually doing and see what's
> different.
>
>> How should I go about finding the cause of some of the clients not being
>> able to connect occasionally?
>
> See if you can reproduce the problem yourself with a test client (e.g.
> curl/wget). If you can, then good: now work to "trace" what is happening
> with the packets from that client. You can use a netfilter/tcpdump filter
> targeting only its IP or MAC to isolate the traffic you want to look at from
> the deluge, with low overhead. This should tell you whether the stall is
> occurring at the NIC, in the kernel, or in user space. To dig into what's
> going on in user space, use logging (I assume, but don't know for sure, that
> ATS can be made to log the client IP). If you need more information to debug
> than the existing logging gives you, add new code to log useful information
> for your investigation.
>
> If you can't reproduce the issue with your own client, well, that's not
> great, but you can attempt to work "backwards" to a reproduced case by
> capturing all (or a decent sample) of the network traffic, then analyzing it
> offline to find examples.
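
On the per-thread CPU point above: something like this rough sketch (just an
illustration, not from ATS itself; it assumes a Linux box, the proc(5) layout,
and a process named traffic_server, so adjust as needed) will show whether a
single thread is pinning a core while the aggregate stays low:

#!/usr/bin/env python3
# Rough sketch: sample per-thread CPU of a running traffic_server process to
# see whether one thread is pinning a core while aggregate CPU looks low.
# Assumptions: Linux, proc(5) stat layout, process named "traffic_server".
import glob
import subprocess
import time

def thread_ticks(pid):
    """Return {tid: (thread name, utime+stime in clock ticks)}."""
    ticks = {}
    for path in glob.glob("/proc/%s/task/*/stat" % pid):
        tid = path.split("/")[4]
        with open(path) as f:
            data = f.read()
        # The thread name sits in parentheses and may contain spaces.
        name = data[data.index("(") + 1:data.rindex(")")]
        fields = data[data.rindex(")") + 2:].split()
        utime, stime = int(fields[11]), int(fields[12])  # fields 14/15 in proc(5)
        ticks[tid] = (name, utime + stime)
    return ticks

pid = subprocess.check_output(["pidof", "traffic_server"]).split()[0].decode()
interval = 5.0
hz = 100  # USER_HZ; verify with `getconf CLK_TCK`

before = thread_ticks(pid)
time.sleep(interval)
after = thread_ticks(pid)

# Sort threads by how many CPU ticks they burned during the interval.
busiest = sorted(after.items(),
                 key=lambda kv: kv[1][1] - before.get(kv[0], ("", 0))[1],
                 reverse=True)
for tid, (name, total) in busiest[:10]:
    delta = total - before.get(tid, ("", 0))[1]
    print("%8s %-24s %5.1f%% of one core" % (tid, name, 100.0 * delta / hz / interval))

A thread sitting near 100% of one core here (while the box as a whole reads
40%) would support the single-thread-saturation theory.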

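And for the pstack suggestion, a simple sampling loop along these lines (again
only a sketch; pstack availability -- some distros ship it as gstack -- and the
process name are assumptions, and you need permission to attach to the process)
makes it easy to grab a handful of snapshots under load and compare them
afterwards:

#!/usr/bin/env python3
# Rough sketch: take a few pstack snapshots of traffic_server under load so
# the samples can be compared afterwards. Assumes pstack is installed and the
# script runs with enough privilege to attach to the process.
import subprocess
import time

pid = subprocess.check_output(["pidof", "traffic_server"]).split()[0].decode()

for i in range(5):
    result = subprocess.run(["pstack", pid], capture_output=True, text=True)
    with open("ats-stacks-%d.txt" % i, "w") as f:
        f.write(result.stdout)
    time.sleep(2)

print("Wrote 5 samples; look for threads that are always on-CPU or always "
      "blocked in the same place.")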