Bryan,
Somehow the problem came back after a few hours (not as bad yet)... This is
running 5.3.2 with a few patches (available on Traffic Control GitHub).
Per-node numastat info (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Numa_Hit               831049.44       684360.21      1515409.66
Numa_Miss                   0.00            0.00            0.00
Numa_Foreign                0.00            0.00            0.00
Interleave_Hit            423.24          423.16          846.40
Local_Node             831046.29       683892.71      1514939.01
Other_Node                  3.15          467.50          470.65
# sysctl -a | grep huge
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.hugetlb_shm_group = 0
vm.hugepages_treat_as_movable = 0
vm.nr_overcommit_hugepages = 0
We have ATS set with CONFIG proxy.config.allocator.hugepages = INT 0
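
For reference, here's roughly how that is set and checked on our side (the
traffic_line read-back is from memory, so treat it as a sketch rather than the
exact command we run):

  # records.config
  CONFIG proxy.config.allocator.hugepages INT 0

  # read the value back from the running instance
  traffic_line -r proxy.config.allocator.hugepages
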
On Wed, Aug 17, 2016 at 5:11 PM, Bryan Call <[email protected]> wrote:
> Can you run numastat -v? Looking up try_to_unmap_anon, it looks like the
> kernel is trying to balance memory usage across the NUMA nodes. There
> was a recent change to how memory allocation is done on the aio threads in
> this ticket:
> https://issues.apache.org/jira/browse/TS-4719
>
> Can you also run sudo sysctl -a | grep huge and paste the output?
>
> -Bryan
>
>
>
> On Aug 16, 2016, at 11:00 PM, Steve Malenfant <[email protected]>
> wrote:
>
> Bryan, here's the output during one of those events...
>
> This server's RAM is at about 80% used. Hugepages are not enabled. 256GB of
> RAM cache on a server with 384GB of RAM.
>
> Samples: 756K of event 'cycles', Event count (approx.): 26719576313
> Overhead Shared Object Symbol
> 70.35% [kernel] [k] try_to_unmap_anon
> 2.42% [kernel] [k] find_busiest_group
> 1.34% [kernel] [k] _spin_lock
> 0.73% [kernel] [k] thread_return
> 0.64% libpthread-2.12.so [.] pthread_cond_timedwait@@GLIBC_2.3.2
> 0.52% [kernel] [k] schedule
> 0.50% [kernel] [k] find_next_bit
> 0.42% [kernel] [k] cpumask_next_and
> 0.38% [kernel] [k] _spin_lock_irqsave
>
> [~]# cat /proc/meminfo
> MemTotal: 396790288 kB
> MemFree: 52626016 kB
> Buffers: 151148 kB
> Cached: 2035288 kB
> SwapCached: 0 kB
> Active: 1198652 kB
> Inactive: 1171944 kB
> Active(anon): 222464 kB
> Inactive(anon): 152 kB
> Active(file): 976188 kB
> Inactive(file): 1171792 kB
> Unevictable: 289593148 kB
> Mlocked: 289595216 kB
> SwapTotal: 4095996 kB
> SwapFree: 4095996 kB
> Dirty: 208 kB
> Writeback: 0 kB
> AnonPages: 289777432 kB
> Mapped: 128752 kB
> Shmem: 244 kB
> Slab: 15548892 kB
> SReclaimable: 15454924 kB
> SUnreclaim: 93968 kB
> KernelStack: 17648 kB
> PageTables: 573316 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 202491140 kB
> Committed_AS: 291736828 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 1383676 kB
> VmallocChunk: 34157955876 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 0 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 58368 kB
> DirectMap2M: 1968128 kB
> DirectMap1G: 400556032 kB
>
> On Tue, Aug 16, 2016 at 8:54 PM, Bryan Call <[email protected]> wrote:
>
>> Can you run sudo perf top or sudo perf record during the time you are
>> having a spike in CPU usage?
>>
>> -Bryan
>>
>>
>>
>>
>> On Aug 16, 2016, at 3:09 PM, Steve Malenfant <[email protected]>
>> wrote:
>>
>> We have a condition where 4 caches, every 33-34 minutes, go to high CPU
>> utilization (1000%) and stop accepting incoming connections. This lasts
>> about 30 seconds and then goes away.
>>
>> It seems to occur with some specific origin/content, though I don't know
>> exactly which one yet.
>>
>> I'm including some details (in case it's obvious), but how do we profile
>> a running server to see what's causing this high CPU usage?
>>
>> The load is normal, maybe about 100Mbps (basically idling).
>>
>> top - 17:20:33 up 61 days, 3:32, 4 users, load average: 4.74, 1.20,
>> 0.53
>> Tasks: 616 total, 16 running, 600 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 2.6%us, 22.3%sy, 0.0%ni, 74.9%id, 0.0%wa, 0.0%hi, 0.2%si,
>> 0.0%st
>> Mem: 396790288k total, 342530608k used, 54259680k free, 148764k buffers
>> Swap: 4095996k total, 0k used, 4095996k free, 1618892k cached
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 8703 ats 20 0 282g 275g 39m D 853.9 72.7 12496:54 [ET_NET 0]
>> 61 root 20 0 0 0 0 R 5.6 0.0 5:10.11 ksoftirqd/14
>> 77 root 20 0 0 0 0 R 5.0 0.0 5:15.08 ksoftirqd/18
>> 125 root 20 0 0 0 0 R 5.0 0.0 5:02.48 ksoftirqd/30
>> 13 root 20 0 0 0 0 R 4.6 0.0 5:23.63 ksoftirqd/2
>> 37 root 20 0 0 0 0 R 4.6 0.0 5:10.31 ksoftirqd/8
>> 45 root 20 0 0 0 0 R 4.6 0.0 5:10.36 ksoftirqd/10
>> 93 root 20 0 0 0 0 R 4.6 0.0 5:29.05 ksoftirqd/22
>> 109 root 20 0 0 0 0 R 4.6 0.0 5:06.27 ksoftirqd/26
>> 21 root 20 0 0 0 0 R 4.3 0.0 5:12.82 ksoftirqd/4
>> 29 root 20 0 0 0 0 R 4.3 0.0 5:10.79 ksoftirqd/6
>> 85 root 20 0 0 0 0 R 4.3 0.0 5:18.36 ksoftirqd/20
>> 101 root 20 0 0 0 0 R 4.3 0.0 5:13.54 ksoftirqd/24
>> 4 root 20 0 0 0 0 R 4.0 0.0 5:16.25 ksoftirqd/0
>> 69 root 20 0 0 0 0 R 4.0 0.0 5:09.72 ksoftirqd/16
>> 117 root 20 0 0 0 0 R 3.0 0.0 5:03.74 ksoftirqd/28
>> 53 root 20 0 0 0 0 S 1.7 0.0 5:07.40 ksoftirqd/12
>>
>> Here's a "dstat" output of that server when this is occurring... updated
>> every second.
>>
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
>> usr sys idl wai hiq siq| read writ| recv send| in out | int csw
>> 0 0 98 1 0 0|7940k 10M|6270k 18M| 0 0 | 13k 35k
>> 0 0 99 0 0 0|9738k 4160k|1600k 10M| 0 0 | 12k 34k
>> 0 0 99 0 0 0|3232k 4136k|1390k 5724k| 0 0 | 11k 34k
>> 0 0 100 0 0 0| 0 4096k| 34k 555k| 0 0 | 10k 33k
>> 0 0 99 0 0 0|1316k 9095k|7028k 7240k| 0 0 | 11k 34k
>> 0 1 99 0 0 0| 288k 8658k|6189k 5878k| 0 0 | 29k 36k
>> 0 0 99 1 0 0| 13M 6204k|2870k 17M| 0 0 | 14k 35k
>> 0 0 99 0 0 0| 0 4096k|3009k 3387k| 0 0 | 14k 35k
>> 0 0 99 0 0 0|1028k 6384k|5138k 6047k| 0 0 | 14k 35k
>> 0 0 99 1 0 0|5127k 8192k|4262k 7417k| 0 0 | 13k 34k
>> 0 0 99 0 0 0|2372k 2612k|2011k 4944k| 0 0 | 14k 35k
>> 0 0 99 0 0 0| 0 4172k|1633k 2201k| 0 0 | 13k 35k
>> 0 0 99 0 0 0| 11M 608k|2528k 12M| 0 0 | 13k 35k
>> 0 3 95 1 0 0| 15M 8409k| 11M 28M| 0 0 | 16k 37k
>> 0 3 95 1 0 0|6324k 4244k|7959k 20M| 0 0 | 12k 32k
>> 0 4 96 0 0 0| 256k 2128k| 853k 2823k| 0 0 | 27k 32k
>> 0 3 96 0 0 0|1028k 0 | 472k 562k| 0 0 | 10k 31k
>> 0 3 96 0 0 0|1028k 68k|1457k 1454k| 0 0 | 11k 32k
>> 0 3 97 0 0 0| 0 0 | 316k 317k| 0 0 | 11k 31k
>> 0 3 97 0 0 0|3076k 0 | 415k 3251k| 0 0 |9498 30k
>> 0 3 97 0 0 0|2688k 84k| 16k 2985k| 0 0 |8623 30k
>> 0 3 97 0 0 0| 0 0 | 64k 65k| 0 0 |9121 30k
>> 0 3 97 0 0 0| 0 68k|1607k 1556k| 0 0 | 11k 31k
>> 0 3 97 0 0 0| 0 0 | 18k 120k| 0 0 |7669 28k
>> 0 3 97 0 0 0| 0 0 | 536k 537k| 0 0 |8660 30k
>> 0 4 96 0 0 0| 0 72k| 65k 75k| 0 0 | 26k 30k
>> 0 3 96 0 0 0| 0 2104k|3079k 3093k| 0 0 |9267 30k
>> 0 3 97 0 0 0| 0 108k| 593k 506k| 0 0 |7591 28k
>> 0 3 96 0 0 0|1152k 4096B| 124k 238k| 0 0 |7914 29k
>> 0 3 97 0 0 0| 0 0 |5719B 1574B| 0 0 |8884 29k
>> 0 3 96 0 0 0| 0 224k| 18k 51k| 0 0 |9092 30k
>> 0 3 97 0 0 0| 0 0 | 87k 108k| 0 0 |8998 29k
>> 0 3 96 0 0 0|5328k 0 |1594k 6790k| 0 0 |8499 29k
>> 0 3 97 0 0 0| 0 156k| 44k 203k| 0 0 |8039 29k
>> 0 3 97 0 0 0| 0 0 | 17k 82k| 0 0 |5377 25k
>> 0 4 96 0 0 0| 0 252k|3224B 6858B| 0 0 | 28k 31k
>> 0 3 96 0 0 0| 0 0 |2139B 8535B| 0 0 | 10k 29k
>> 0 3 97 0 0 0|5508k 0 | 20k 5516k| 0 0 |7762 29k
>> 0 3 97 0 0 0| 0 72k| 10k 28k| 0 0 |7026 29k
>> 0 3 97 0 0 0| 0 0 |3782B 712B| 0 0 |5829 29k
>> 0 4 96 0 0 0| 0 264k| 11k 10k| 0 0 |5170 183k
>> 0 4 95 0 0 0| 0 0 | 15k 1415B| 0 0 |4125 245k
>> 0 3 97 0 0 0| 0 0 |5416B 966B| 0 0 |3608 40k
>> 0 3 97 0 0 0| 0 4096B|6233B 3868B| 0 0 |3636 41k
>> 0 3 96 0 0 0| 0 64k|5761B 1594B| 0 0 |3579 41k
>> 2 20 78 0 0 0| 0 80k| 11k 1544B| 0 0 | 16k 3590k
>> 3 26 71 0 0 0| 0 0 | 13k 9997B| 0 0 | 12k 4969k
>> 3 26 71 0 0 0| 0 0 |6685B 1042B| 0 0 | 12k 5017k
>> 3 25 71 0 0 0| 0 0 | 11k 1660B| 0 0 | 12k 4957k
>> 3 25 72 0 0 0| 0 0 |7765B 4463B| 0 0 | 12k 4862k
>> 3 25 72 0 0 0| 0 128k|8377B 1478B| 0 0 | 12k 4875k
>> 3 25 72 0 0 0| 0 0 |4407B 992B| 0 0 | 12k 4814k
>> 3 25 71 0 0 0| 0 0 | 17k 1959B| 0 0 | 12k 4993k
>> 3 26 71 0 0 0| 0 0 |7774B 1478B| 0 0 | 12k 4986k
>> 3 25 72 0 0 0| 0 0 |5728B 1066B| 0 0 | 12k 4921k
>> 3 25 71 0 0 0| 0 108k| 14k 1816B| 0 0 | 19k 4922k
>> 3 25 72 0 0 0| 0 0 |6296B 9096B| 0 0 | 12k 4899k
>> 3 25 72 0 0 0| 0 0 |8974B 1166B| 0 0 | 12k 4914k
>> 3 26 71 0 0 0| 0 0 | 12k 1038B| 0 0 | 12k 5133k
>> 3 25 72 0 0 0| 0 0 |9956B 1668B| 0 0 | 12k 4873k
>> 3 25 72 0 0 0| 0 228k|8025B 2172B| 0 0 | 12k 4841k
>> 3 25 72 0 0 0| 0 0 |7743B 1728B| 0 0 | 12k 4888k
>> 3 26 71 0 0 0| 0 0 |8287B 6061B| 0 0 | 12k 4955k
>> 3 26 70 0 0 0| 0 0 |8702B 1602B| 0 0 | 12k 5064k
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
>> usr sys idl wai hiq siq| read writ| recv send| in out | int csw
>> 3 26 71 0 0 0| 0 0 |8898B 1198B| 0 0 | 12k 4983k
>> 3 25 72 0 0 0| 0 152k| 10k 48k| 0 0 | 17k 4840k
>> 3 25 72 0 0 0| 0 0 |4030B 8837B| 0 0 | 15k 4900k
>> 3 25 72 0 0 0| 0 0 | 12k 2260B| 0 0 | 12k 4909k
>> 3 26 71 0 0 0| 0 0 | 14k 2169B| 0 0 | 12k 5001k
>> 3 26 70 0 0 0| 0 0 | 12k 1470B| 0 0 | 12k 5047k
>> 3 26 71 0 0 0| 0 232k| 14k 1964B| 0 0 | 12k 4967k
>> 3 25 72 0 0 0| 0 0 | 14k 2115B| 0 0 | 12k 4846k
>> 3 25 72 0 0 0| 0 0 | 15k 2144B| 0 0 | 12k 4945k
>> 3 25 72 0 0 0| 0 0 | 15k 1544B| 0 0 | 12k 4862k
>> 3 26 71 0 0 0| 0 0 |6595B 1124B| 0 0 | 12k 4991k
>> 3 27 70 0 0 0| 0 128k| 39k 7960B| 0 0 | 17k 5065k
>> 3 26 71 0 0 0| 0 0 | 13k 11k| 0 0 | 12k 4942k
>> 3 25 72 0 0 0| 0 0 |9035B 2072B| 0 0 | 12k 4881k
>> 3 25 72 0 0 0| 0 0 |9057B 2788B| 0 0 | 12k 4825k
>> 1 5 90 4 0 0| 41M 28M| 49M 104M| 0 0 | 30k 848k
>> 0 1 92 7 0 0| 28M 51M| 46M 166M| 0 0 | 22k 39k
>> 0 0 97 2 0 0| 19M 20M| 18M 48M| 0 0 | 15k 36k
>> 0 0 97 2 0 0| 21M 14M| 13M 41M| 0 0 | 15k 36k
>> 0 0 97 3 0 0| 13M 16M| 21M 33M| 0 0 | 14k 36k
>> 0 0 96 3 0 0| 14M 20M| 21M 42M| 0 0 | 14k 36k
>> 0 1 96 3 0 0| 15M 15M| 16M 32M| 0 0 | 24k 36k
>> 0 0 99 1 0 0| 11M 4351k|3789k 18M| 0 0 | 12k 34k
>> 0 0 99 0 0 0|7182k 0 |1881k 8056k| 0 0 | 11k 33k
>> 0 0 99 0 0 0|3164k 100k|2091k 6605k| 0 0 | 11k 35k
>> 0 0 99 0 0 0|3079k 2647k|6850k 6604k| 0 0 | 11k 33k
>> 0 0 98 1 0 0|8640k 7195k|6742k 17M| 0 0 | 12k 34k
>> 0 0 99 1 0 0| 0 4451k|6093k 7067k| 0 0 | 10k 33k
>> 0 0 99 1 0 0|3908k 4787k|1294k 4556k| 0 0 | 11k 33k
>> 0 0 99 0 0 0|7460k 44k|1505k 9884k| 0 0 |9809 32k
>>
>> Thanks,
>>
>> Steve
>>
>>
>>
>
>