Can you run: numastat -v? Looking at the try_to_unmap_anon symbol, it looks like the kernel is trying to balance memory usage across the NUMA nodes. There was a recent change to how memory allocation is done on the AIO threads in this ticket: https://issues.apache.org/jira/browse/TS-4719
Can you also run sudo sysctl -a | grep huge and paste the output?

-Bryan

> On Aug 16, 2016, at 11:00 PM, Steve Malenfant <[email protected]> wrote:
>
> Bryan, here's the output during one of those events...
>
> This server's RAM is used at about 80%. Hugepages are not enabled. 256GB of RAM
> cache on a server with 384GB RAM.
>
> Samples: 756K of event 'cycles', Event count (approx.): 26719576313
>  Overhead  Shared Object       Symbol
>    70.35%  [kernel]            [k] try_to_unmap_anon
>     2.42%  [kernel]            [k] find_busiest_group
>     1.34%  [kernel]            [k] _spin_lock
>     0.73%  [kernel]            [k] thread_return
>     0.64%  libpthread-2.12.so  [.] pthread_cond_timedwait@@GLIBC_2.3.2
>     0.52%  [kernel]            [k] schedule
>     0.50%  [kernel]            [k] find_next_bit
>     0.42%  [kernel]            [k] cpumask_next_and
>     0.38%  [kernel]            [k] _spin_lock_irqsave
>
> [~]# cat /proc/meminfo
> MemTotal:       396790288 kB
> MemFree:         52626016 kB
> Buffers:           151148 kB
> Cached:           2035288 kB
> SwapCached:             0 kB
> Active:           1198652 kB
> Inactive:         1171944 kB
> Active(anon):      222464 kB
> Inactive(anon):       152 kB
> Active(file):      976188 kB
> Inactive(file):   1171792 kB
> Unevictable:    289593148 kB
> Mlocked:        289595216 kB
> SwapTotal:        4095996 kB
> SwapFree:         4095996 kB
> Dirty:                208 kB
> Writeback:              0 kB
> AnonPages:      289777432 kB
> Mapped:            128752 kB
> Shmem:                244 kB
> Slab:            15548892 kB
> SReclaimable:    15454924 kB
> SUnreclaim:         93968 kB
> KernelStack:        17648 kB
> PageTables:        573316 kB
> NFS_Unstable:           0 kB
> Bounce:                 0 kB
> WritebackTmp:           0 kB
> CommitLimit:    202491140 kB
> Committed_AS:   291736828 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:      1383676 kB
> VmallocChunk:   34157955876 kB
> HardwareCorrupted:      0 kB
> AnonHugePages:          0 kB
> HugePages_Total:        0
> HugePages_Free:         0
> HugePages_Rsvd:         0
> HugePages_Surp:         0
> Hugepagesize:        2048 kB
> DirectMap4k:        58368 kB
> DirectMap2M:      1968128 kB
> DirectMap1G:    400556032 kB
>
> On Tue, Aug 16, 2016 at 8:54 PM, Bryan Call <[email protected]> wrote:
> Can you run sudo perf top or sudo perf record during the time you are having
> a spike in CPU usage?
>
> -Bryan
>
>> On Aug 16, 2016, at 3:09 PM, Steve Malenfant <[email protected]> wrote:
>>
>> We have a condition where four caches go to high CPU utilization (1000%)
>> every 33-34 minutes and stop accepting incoming connections. This lasts
>> about 30 seconds and then goes away.
>>
>> It seems to occur with some specific origin/content, though I don't know
>> exactly which one so far.
>>
>> I'm including some details (in case it's obvious), but how do we profile a
>> running server to see what's causing this high CPU usage?
>>
>> The load is normal, maybe about 100Mbps (basically idling).
>>
>> top - 17:20:33 up 61 days, 3:32, 4 users, load average: 4.74, 1.20, 0.53
>> Tasks: 616 total, 16 running, 600 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 2.6%us, 22.3%sy, 0.0%ni, 74.9%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
>> Mem:  396790288k total, 342530608k used, 54259680k free, 148764k buffers
>> Swap: 4095996k total, 0k used, 4095996k free, 1618892k cached
>>
>>   PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
>>  8703 ats   20   0  282g 275g  39m D 853.9 72.7 12496:54 [ET_NET 0]
>>    61 root  20   0     0    0    0 R   5.6  0.0  5:10.11 ksoftirqd/14
>>    77 root  20   0     0    0    0 R   5.0  0.0  5:15.08 ksoftirqd/18
>>   125 root  20   0     0    0    0 R   5.0  0.0  5:02.48 ksoftirqd/30
>>    13 root  20   0     0    0    0 R   4.6  0.0  5:23.63 ksoftirqd/2
>>    37 root  20   0     0    0    0 R   4.6  0.0  5:10.31 ksoftirqd/8
>>    45 root  20   0     0    0    0 R   4.6  0.0  5:10.36 ksoftirqd/10
>>    93 root  20   0     0    0    0 R   4.6  0.0  5:29.05 ksoftirqd/22
>>   109 root  20   0     0    0    0 R   4.6  0.0  5:06.27 ksoftirqd/26
>>    21 root  20   0     0    0    0 R   4.3  0.0  5:12.82 ksoftirqd/4
>>    29 root  20   0     0    0    0 R   4.3  0.0  5:10.79 ksoftirqd/6
>>    85 root  20   0     0    0    0 R   4.3  0.0  5:18.36 ksoftirqd/20
>>   101 root  20   0     0    0    0 R   4.3  0.0  5:13.54 ksoftirqd/24
>>     4 root  20   0     0    0    0 R   4.0  0.0  5:16.25 ksoftirqd/0
>>    69 root  20   0     0    0    0 R   4.0  0.0  5:09.72 ksoftirqd/16
>>   117 root  20   0     0    0    0 R   3.0  0.0  5:03.74 ksoftirqd/28
>>    53 root  20   0     0    0    0 S   1.7  0.0  5:07.40 ksoftirqd/12
>>
>> Here's a "dstat" output of that server when this is occurring... updated
>> every second.
>>
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
>>  0   0  98   1   0   0|7940k   10M|6270k   18M|   0     0 | 13k   35k
>>  0   0  99   0   0   0|9738k 4160k|1600k   10M|   0     0 | 12k   34k
>>  0   0  99   0   0   0|3232k 4136k|1390k 5724k|   0     0 | 11k   34k
>>  0   0 100   0   0   0|   0  4096k|  34k  555k|   0     0 | 10k   33k
>>  0   0  99   0   0   0|1316k 9095k|7028k 7240k|   0     0 | 11k   34k
>>  0   1  99   0   0   0| 288k 8658k|6189k 5878k|   0     0 | 29k   36k
>>  0   0  99   1   0   0|  13M 6204k|2870k   17M|   0     0 | 14k   35k
>>  0   0  99   0   0   0|   0  4096k|3009k 3387k|   0     0 | 14k   35k
>>  0   0  99   0   0   0|1028k 6384k|5138k 6047k|   0     0 | 14k   35k
>>  0   0  99   1   0   0|5127k 8192k|4262k 7417k|   0     0 | 13k   34k
>>  0   0  99   0   0   0|2372k 2612k|2011k 4944k|   0     0 | 14k   35k
>>  0   0  99   0   0   0|   0  4172k|1633k 2201k|   0     0 | 13k   35k
>>  0   0  99   0   0   0|  11M  608k|2528k   12M|   0     0 | 13k   35k
>>  0   3  95   1   0   0|  15M 8409k|  11M   28M|   0     0 | 16k   37k
>>  0   3  95   1   0   0|6324k 4244k|7959k   20M|   0     0 | 12k   32k
>>  0   4  96   0   0   0| 256k 2128k| 853k 2823k|   0     0 | 27k   32k
>>  0   3  96   0   0   0|1028k    0 | 472k  562k|   0     0 | 10k   31k
>>  0   3  96   0   0   0|1028k   68k|1457k 1454k|   0     0 | 11k   32k
>>  0   3  97   0   0   0|   0     0 | 316k  317k|   0     0 | 11k   31k
>>  0   3  97   0   0   0|3076k    0 | 415k 3251k|   0     0 |9498   30k
>>  0   3  97   0   0   0|2688k   84k|  16k 2985k|   0     0 |8623   30k
>>  0   3  97   0   0   0|   0     0 |  64k   65k|   0     0 |9121   30k
>>  0   3  97   0   0   0|   0    68k|1607k 1556k|   0     0 | 11k   31k
>>  0   3  97   0   0   0|   0     0 |  18k  120k|   0     0 |7669   28k
>>  0   3  97   0   0   0|   0     0 | 536k  537k|   0     0 |8660   30k
>>  0   4  96   0   0   0|   0    72k|  65k   75k|   0     0 | 26k   30k
>>  0   3  96   0   0   0|   0  2104k|3079k 3093k|   0     0 |9267   30k
>>  0   3  97   0   0   0|   0   108k| 593k  506k|   0     0 |7591   28k
>>  0   3  96   0   0   0|1152k 4096B| 124k  238k|   0     0 |7914   29k
>>  0   3  97   0   0   0|   0     0 |5719B 1574B|   0     0 |8884   29k
>>  0   3  96   0   0   0|   0   224k|  18k   51k|   0     0 |9092   30k
>>  0   3  97   0   0   0|   0     0 |  87k  108k|   0     0 |8998   29k
>>  0   3  96   0   0   0|5328k    0 |1594k 6790k|   0     0 |8499   29k
>>  0   3  97   0   0   0|   0   156k|  44k  203k|   0     0 |8039   29k
>>  0   3  97   0   0   0|   0     0 |  17k   82k|   0     0 |5377   25k
>>  0   4  96   0   0   0|   0   252k|3224B 6858B|   0     0 | 28k   31k
>>  0   3  96   0   0   0|   0     0 |2139B 8535B|   0     0 | 10k   29k
>>  0   3  97   0   0   0|5508k    0 |  20k 5516k|   0     0 |7762   29k
>>  0   3  97   0   0   0|   0    72k|  10k   28k|   0     0 |7026   29k
>>  0   3  97   0   0   0|   0     0 |3782B  712B|   0     0 |5829   29k
>>  0   4  96   0   0   0|   0   264k|  11k   10k|   0     0 |5170  183k
>>  0   4  95   0   0   0|   0     0 |  15k 1415B|   0     0 |4125  245k
>>  0   3  97   0   0   0|   0     0 |5416B  966B|   0     0 |3608   40k
>>  0   3  97   0   0   0|   0  4096B|6233B 3868B|   0     0 |3636   41k
>>  0   3  96   0   0   0|   0    64k|5761B 1594B|   0     0 |3579   41k
>>  2  20  78   0   0   0|   0    80k|  11k 1544B|   0     0 | 16k 3590k
>>  3  26  71   0   0   0|   0     0 |  13k 9997B|   0     0 | 12k 4969k
>>  3  26  71   0   0   0|   0     0 |6685B 1042B|   0     0 | 12k 5017k
>>  3  25  71   0   0   0|   0     0 |  11k 1660B|   0     0 | 12k 4957k
>>  3  25  72   0   0   0|   0     0 |7765B 4463B|   0     0 | 12k 4862k
>>  3  25  72   0   0   0|   0   128k|8377B 1478B|   0     0 | 12k 4875k
>>  3  25  72   0   0   0|   0     0 |4407B  992B|   0     0 | 12k 4814k
>>  3  25  71   0   0   0|   0     0 |  17k 1959B|   0     0 | 12k 4993k
>>  3  26  71   0   0   0|   0     0 |7774B 1478B|   0     0 | 12k 4986k
>>  3  25  72   0   0   0|   0     0 |5728B 1066B|   0     0 | 12k 4921k
>>  3  25  71   0   0   0|   0   108k|  14k 1816B|   0     0 | 19k 4922k
>>  3  25  72   0   0   0|   0     0 |6296B 9096B|   0     0 | 12k 4899k
>>  3  25  72   0   0   0|   0     0 |8974B 1166B|   0     0 | 12k 4914k
>>  3  26  71   0   0   0|   0     0 |  12k 1038B|   0     0 | 12k 5133k
>>  3  25  72   0   0   0|   0     0 |9956B 1668B|   0     0 | 12k 4873k
>>  3  25  72   0   0   0|   0   228k|8025B 2172B|   0     0 | 12k 4841k
>>  3  25  72   0   0   0|   0     0 |7743B 1728B|   0     0 | 12k 4888k
>>  3  26  71   0   0   0|   0     0 |8287B 6061B|   0     0 | 12k 4955k
>>  3  26  70   0   0   0|   0     0 |8702B 1602B|   0     0 | 12k 5064k
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
>>  3  26  71   0   0   0|   0     0 |8898B 1198B|   0     0 | 12k 4983k
>>  3  25  72   0   0   0|   0   152k|  10k   48k|   0     0 | 17k 4840k
>>  3  25  72   0   0   0|   0     0 |4030B 8837B|   0     0 | 15k 4900k
>>  3  25  72   0   0   0|   0     0 |  12k 2260B|   0     0 | 12k 4909k
>>  3  26  71   0   0   0|   0     0 |  14k 2169B|   0     0 | 12k 5001k
>>  3  26  70   0   0   0|   0     0 |  12k 1470B|   0     0 | 12k 5047k
>>  3  26  71   0   0   0|   0   232k|  14k 1964B|   0     0 | 12k 4967k
>>  3  25  72   0   0   0|   0     0 |  14k 2115B|   0     0 | 12k 4846k
>>  3  25  72   0   0   0|   0     0 |  15k 2144B|   0     0 | 12k 4945k
>>  3  25  72   0   0   0|   0     0 |  15k 1544B|   0     0 | 12k 4862k
>>  3  26  71   0   0   0|   0     0 |6595B 1124B|   0     0 | 12k 4991k
>>  3  27  70   0   0   0|   0   128k|  39k 7960B|   0     0 | 17k 5065k
>>  3  26  71   0   0   0|   0     0 |  13k   11k|   0     0 | 12k 4942k
>>  3  25  72   0   0   0|   0     0 |9035B 2072B|   0     0 | 12k 4881k
>>  3  25  72   0   0   0|   0     0 |9057B 2788B|   0     0 | 12k 4825k
>>  1   5  90   4   0   0|  41M   28M|  49M  104M|   0     0 | 30k  848k
>>  0   1  92   7   0   0|  28M   51M|  46M  166M|   0     0 | 22k   39k
>>  0   0  97   2   0   0|  19M   20M|  18M   48M|   0     0 | 15k   36k
>>  0   0  97   2   0   0|  21M   14M|  13M   41M|   0     0 | 15k   36k
>>  0   0  97   3   0   0|  13M   16M|  21M   33M|   0     0 | 14k   36k
>>  0   0  96   3   0   0|  14M   20M|  21M   42M|   0     0 | 14k   36k
>>  0   1  96   3   0   0|  15M   15M|  16M   32M|   0     0 | 24k   36k
>>  0   0  99   1   0   0|  11M 4351k|3789k   18M|   0     0 | 12k   34k
>>  0   0  99   0   0   0|7182k    0 |1881k 8056k|   0     0 | 11k   33k
>>  0   0  99   0   0   0|3164k  100k|2091k 6605k|   0     0 | 11k   35k
>>  0   0  99   0   0   0|3079k 2647k|6850k 6604k|   0     0 | 11k   33k
>>  0   0  98   1   0   0|8640k 7195k|6742k   17M|   0     0 | 12k   34k
>>  0   0  99   1   0   0|   0  4451k|6093k 7067k|   0     0 | 10k   33k
>>  0   0  99   1   0   0|3908k 4787k|1294k 4556k|   0     0 | 11k   33k
>>  0   0  99   0   0   0|7460k   44k|1505k 9884k|   0     0 |9809   32k
>>
>> Thanks,
>>
>> Steve
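[A note on the meminfo numbers quoted in the thread: nearly all of the box's RAM is mlocked anonymous memory (the ATS RAM cache), which is the kind of memory try_to_unmap_anon walks when the kernel tries to migrate or reclaim anon pages. A small awk sketch over the posted values makes the proportions explicit; on a live box you would pipe cat /proc/meminfo into the same awk program instead of the heredoc.]

```shell
# Share of total RAM that is anonymous, and how much of it is mlocked,
# using the /proc/meminfo values posted above (sizes are in kB).
awk '/^MemTotal:/ {t=$2} /^AnonPages:/ {a=$2} /^Mlocked:/ {m=$2}
     END {
       printf "AnonPages: %.0f GB of %.0f GB total (%.0f%%)\n", a/1048576, t/1048576, 100*a/t
       printf "Mlocked:   %.0f GB\n", m/1048576
     }' <<'EOF'
MemTotal:       396790288 kB
AnonPages:      289777432 kB
Mlocked:        289595216 kB
EOF
# prints:
#   AnonPages: 276 GB of 378 GB total (73%)
#   Mlocked:   276 GB
```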

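[The dstat trace makes the event easy to spot numerically: context switches (csw, last column) sit around 30-40k/s normally and jump to roughly 5M/s for the duration of the spike. A throwaway filter like the following pulls those samples out of a saved dstat log; it is a sketch that assumes dstat's default column layout shown above, and the 1M/s threshold is an arbitrary choice, not anything from the thread.]

```shell
# Flag dstat samples whose context-switch rate exceeds 1M/s.
# Fields split on '|': cpu | dsk | net | paging | system; the last
# field holds "int csw", and csw may carry a k or M suffix.
awk -F'|' '{
  n = split($NF, sys, " ")
  csw = sys[n]
  mult = (csw ~ /M$/) ? 1000000 : ((csw ~ /k$/) ? 1000 : 1)
  sub(/[kM]$/, "", csw)
  if (csw * mult > 1000000) print "spike:", $0
}' <<'EOF'
 0   0  99   0   0   0|9738k 4160k|1600k   10M|   0     0 | 12k   34k
 3  26  71   0   0   0|   0     0 |8898B 1198B|   0     0 | 12k 4983k
 0   4  95   0   0   0|   0     0 |  15k 1415B|   0     0 |4125  245k
EOF
# prints only the middle sample (csw = 4983k).
```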