I have noticed that the root of the problem must be related to the way ATS
downloads the files from the origin:
RX packets:209192729 errors:0 dropped:23521536 overruns:0 frame:0
TX packets:314132718 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:3000
RX bytes:65378739409 (60.8 GiB) TX bytes:457782226108 (426.3 GiB)
Only RX packets are being dropped, not transmitted ones.
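
A quick way to check whether those drops are still climbing under load
(a sketch, assuming the interface is eth2 and the driver exposes its
counters via ethtool; the exact counter names vary by driver):

# sample the drop counters twice, ten seconds apart
ethtool -S eth2 | grep -i drop
sleep 10
ethtool -S eth2 | grep -i drop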
2013/3/22 Philip <[email protected]>
> Balancing the interrupts didn't make the situation better:
> http://i.imgur.com/IH0uSwr.png :/
>
>
> 2013/3/22 Yongming Zhao <[email protected]>
>
>> yeah, you should balance all the eth2-TxRx-* :D
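>>
>> a minimal sketch of what that could look like, assuming no more
>> queues than CPUs (a simplified version of the script further down
>> the thread):
>>
>> j=0
>> for i in $(grep eth2-TxRx /proc/interrupts | awk -F: '{print $1}'); do
>>   # pin queue j to CPU j (hex bitmask, one bit per CPU)
>>   printf "%x" $((1 << j)) > /proc/irq/$i/smp_affinity
>>   let j=j+1
>> done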
>>
>>
>> On 2013-3-22, at 6:17 PM, Philip <[email protected]> wrote:
>>
>> I have a hard time understanding the output of /proc/interrupts since
>> there already seem to be multiple interrupts: "eth2-TxRx-0",
>> "eth2-TxRx-1", etc., but they seem to be balanced pretty poorly. Should
>> I change smp_affinity for all these interrupts or only for the one that
>> is named "eth2"?
>>
>> You can see the output of /proc/interrupts here ->
>> http://i.imgur.com/ZLulmkQ.png
>>
>> Best Regards
>> Philip
>>
>>
>> 2013/3/22 Yongming Zhao <[email protected]>
>>
>>> well, it is easy to identify the irq issue here:
>>> 1. in "top", press "1" to display all CPU details, and press "H" to
>>> display the Traffic Server threads; by default processes are sorted by
>>> CPU usage, descending.
>>> you may see one CPU at full load without any single TS process
>>> accounting for it.
>>>
>>> 2. "cat /proc/interrupts", grep out your 10GE NIC, and check the IRQs.
>>> you need the IRQs on different CPUs for better performance.
>>> you may find that all the IRQs for the NIC are on one CPU, and that is
>>> the CPU at full load, typically CPU0
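>>>
>>> a sketch for step 2, assuming the NIC shows up as eth2:
>>>
>>> grep eth2 /proc/interrupts    # per-CPU interrupt counts, one row per queue
>>> # show which CPUs each queue is currently allowed to fire on
>>> for i in $(grep eth2 /proc/interrupts | awk -F: '{print $1}'); do
>>>   echo -n "IRQ $i -> "; cat /proc/irq/$i/smp_affinity
>>> done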
>>>
>>> just set the smp_affinity for each IRQ. here is an unverified script,
>>> originally a one-liner, reflowed for readability (replace eth1 with
>>> your NIC name):
>>>
>>> j=0
>>> for i in $(grep eth1 /proc/interrupts | awk -F: '{print $1}'); do
>>>   # wrap j back to 0 once it passes the highest CPU index
>>>   test $j -gt $(grep processor /proc/cpuinfo | tail -n 1 | awk '{print $NF}') && let j=0
>>>   # build the mask: bit (j % 32) in hex, plus one ",00000000" group per 32 CPUs
>>>   mask=$(python -c 'print "%X" % (1 << '$((j % 32))')')
>>>   k=$((j / 32)); while [ $k -gt 0 ]; do mask="$mask,00000000"; let k=k-1; done
>>>   echo $mask > /proc/irq/$i/smp_affinity
>>>   let j=j+1
>>> done
>>>
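>>> note: if the irqbalance daemon is running it may rewrite these masks
>>> again; stopping it first is a common precaution (assuming a
>>> sysvinit-style Debian):
>>>
>>> /etc/init.d/irqbalance stop
>>> grep eth1 /proc/interrupts    # re-check the per-CPU counters under load
>>>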
>>>
>>> FYI
>>>
>>> On 2013-3-22, at 6:23 AM, Igor Galić <[email protected]> wrote:
>>>
>>> This may be useful:
>>>
>>> http://kerneltrap.org/mailarchive/linux-netdev/2010/4/15/6274814/thread
>>>
>>> ------------------------------
>>>
>>> Hi Yongming,
>>>
>>> I haven't changed the networking configuraton but I've also noticed that
>>> once the first core is at 100% utilization the server won't answer all ping
>>> requests anymore and has packet loss. This might be a sign that all network
>>> traffic is handled by the first core isn't it?
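>>>
>>> One way to confirm that (a sketch, assuming the sysstat package is
>>> installed): if %irq/%soft sits near 100% on CPU0 while the other
>>> cores idle, the NIC interrupts are not being spread.
>>>
>>> mpstat -P ALL 1 5    # per-CPU usage, five one-second samples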
>>>
>>> You can find a screenshot of the threading output of top here:
>>> http://i.imgur.com/X3te2Ru.png
>>>
>>> Best Regards
>>> Philip
>>>
>>> 2013/3/21 Yongming Zhao <[email protected]>
>>>
>>>> well, given the high network traffic, have you balanced the 10GE NIC
>>>> IRQs across multiple CPUs?
>>>>
>>>> and can you show us the per-thread CPU usage in top?
>>>>
>>>> thanks
>>>>
>>>> On 2013-3-21, at 7:42 PM, Philip <[email protected]> wrote:
>>>>
>>>> I've just upgraded to ATS 3.3.1-dev. The problem is still the same:
>>>> http://i.imgur.com/1pHWQy7.png
>>>>
>>>> The load still goes to one core. (The server is only running ATS.)
>>>>
>>>> 2013/3/21 Philip <[email protected]>
>>>>
>>>>> Hi Igor,
>>>>>
>>>>> I am using ATS 3.2.4, Debian 6 (Squeeze) and a 3.2.13 Kernel.
>>>>>
>>>>> I was using the "traffic_line -r" command to see the number of origin
>>>>> connections growing and htop/atop to see that only one core is 100%
>>>>> utilized. I've already tested the following changes to the configuration:
>>>>>
>>>>> proxy.config.accept_threads -> 0
>>>>>
>>>>> proxy.config.exec_thread.autoconfig -> 0
>>>>> proxy.config.exec_thread.limit -> 120
>>>>>
>>>>> They had no effect; there is still one core that becomes 100%
>>>>> utilized and turns out to be a bottleneck.
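>>>>>
>>>>> For reference, a sketch of the checks and changes above (the metric
>>>>> name and records.config syntax are assumptions based on stock ATS 3.2):
>>>>>
>>>>> # watch the origin-side connection count grow:
>>>>> traffic_line -r proxy.process.http.current_server_connections
>>>>>
>>>>> # the equivalent records.config entries; the thread settings are read
>>>>> # at startup, so restart afterwards (e.g. "traffic_line -L"):
>>>>> CONFIG proxy.config.accept_threads INT 0
>>>>> CONFIG proxy.config.exec_thread.autoconfig INT 0
>>>>> CONFIG proxy.config.exec_thread.limit INT 120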
>>>>>
>>>>> Best Regards
>>>>> Philip
>>>>>
>>>>>
>>>>> 2013/3/21 Igor Galić <[email protected]>
>>>>>
>>>>>> Hi Philip,
>>>>>>
>>>>>> Let's start with some simple data mining:
>>>>>>
>>>>>> which version of ATS are you running?
>>>>>> What OS/Distro/version are you running it on?
>>>>>>
>>>>>> Are you looking at stats_over_http's output to determine what's going
>>>>>> on in ATS?
>>>>>>
>>>>>> -- i
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> I have noticed the following strange behavior: once the number of
>>>>>> origin connections starts to increase and the proxying speed
>>>>>> collapses, the first core is at 100% utilization while the others are
>>>>>> not even close to that. It seems like the origin requests are handled
>>>>>> by the first core only.
>>>>>> Is this expected behavior that can be changed by editing the
>>>>>> configuration, or is this a bug?
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/3/20 Philip <[email protected]>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am running ATS on a pretty large server with two physical six-core
>>>>>>> Xeon CPUs and 22 raw device disks. I want to use that server as a
>>>>>>> frontend for several fileservers. It is currently configured to sit
>>>>>>> in front of two file servers. The load on the ATS server is pretty
>>>>>>> low: about 1-4% disk utilization and 500 Mbps of outgoing traffic.
>>>>>>>
>>>>>>> Once I direct the traffic of the third file server towards ATS
>>>>>>> something strange happens:
>>>>>>>
>>>>>>> - The number of origin connections increases continually.
>>>>>>> - Requests that hit ATS and are not cached are served really slowly
>>>>>>> to the client (about 35 kB/s), while requests that are served from the
>>>>>>> cache are blazingly fast.
>>>>>>>
>>>>>>> The ATS server has a dedicated 10 Gbps port that is not maxed out, no
>>>>>>> CPU core is maxed, there is no swapping, there are no error logs, and
>>>>>>> the origin servers are not heavily utilized either. It feels like
>>>>>>> there are not enough workers to process the origin requests.
>>>>>>>
>>>>>>> Is there anything I can do to check if my theory is right and a way
>>>>>>> to increase the number of origin workers?
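>>>>>>>
>>>>>>> (One quick check of the worker theory, assuming a stock install:
>>>>>>> nlwp is the kernel's thread count for the process.)
>>>>>>>
>>>>>>> ps -o nlwp= -p $(pidof traffic_server)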
>>>>>>>
>>>>>>> Best Regards
>>>>>>> Philip
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Igor Galić
>>>>>>
>>>>>> Tel: +43 (0) 664 886 22 883
>>>>>> Mail: [email protected]
>>>>>> URL: http://brainsware.org/
>>>>>> GPG: 6880 4155 74BD FD7C B515 2EA5 4B1D 9E08 A097 C9AE
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Igor Galić
>>>
>>> Tel: +43 (0) 664 886 22 883
>>> Mail: [email protected]
>>> URL: http://brainsware.org/
>>> GPG: 6880 4155 74BD FD7C B515 2EA5 4B1D 9E08 A097 C9AE
>>>
>>>
>>>
>>
>>
>