Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Tue, 16 Feb 2021 13:36:22 +0100, "Jason A. Donenfeld" said:

> Another anecdote: 5.11.0, 64 gigs of RAM. If I run QEMU/KVM for a VM
> with 16 gigs at the same time as a VMware VM with 16 gigs of RAM,
> kcompactd goes wild and both VMs get really slow. The key here is running
> KVM at the same time as VMware.

Do things operate as expected if there are 2 KVM instances, or 2 VMware instances?
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Mon, Jan 25, 2021 at 07:54:38PM +0100, Tibor Bana wrote:
> Greetings!
>
> I don't know if it still actual, but I am struggling with this problem right
> now and searching the internet for solutions.
> I read the thread and saw that you are struggling to reproduce the problem,
> and I can reproduce it almost every day.
>
> - Install vmware player, and a linux guest.
> - Configure the virtual machine to have a good amount of memory and cpu
> - run resource intensive tasks on the guest
> - when the host used up almost it's all memory and start to reuse caches
>   kcompactd will kick in.
>
> As I know the problem is related to transparent huge pages, but I tried to
> disable it.
> Today I saw the problem again and kcompactd shown an interesting status in
> top. It hasn't used any memory, all zeroes but it used up one core
> completely.
>
> My machine is a core-i7 with 4 physical cores and hyper threading and 24GB
> Memory
> 5.9.11-arch2-1 #1 SMP PREEMPT Sat, 28 Nov 2020 02:07:22 + x86_64 GNU/Linux

Another anecdote: 5.11.0, 64 gigs of RAM. If I run QEMU/KVM for a VM with 16 gigs at the same time as a VMware VM with 16 gigs of RAM, kcompactd goes wild and both VMs get really slow. The key here is running KVM at the same time as VMware.
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
Hi,

Sorry for the delay. I had time to do a full system upgrade yesterday evening, and fortunately Arch Linux already ships 5.10.10, so today I used my computer as usual to test it. I haven't experienced the symptoms, but ever since I disabled transparent huge pages the problem has only shown up sporadically. If I face it again I will let you know.

On Tue, Jan 26, 2021 at 10:17 AM Mel Gorman wrote:
>
> On Mon, Jan 25, 2021 at 07:54:38PM +0100, Tibor Bana wrote:
> > Greetings!
> >
> > I don't know if it still actual, but I am struggling with this problem right
> > now and searching the internet for solutions.
> > I read the thread and saw that you are struggling to reproduce the problem,
> > and I can reproduce it almost every day.
> >
> > - Install vmware player, and a linux guest.
> > - Configure the virtual machine to have a good amount of memory and cpu
> > - run resource intensive tasks on the guest
> > - when the host used up almost it's all memory and start to reuse caches
> >   kcompactd will kick in.
> >
> > As I know the problem is related to transparent huge pages, but I tried to
> > disable it.
> > Today I saw the problem again and kcompactd shown an interesting status in
> > top. It hasn't used any memory, all zeroes but it used up one core
> > completely.
> >
> > My machine is a core-i7 with 4 physical cores and hyper threading and 24GB
> > Memory
> > 5.9.11-arch2-1 #1 SMP PREEMPT Sat, 28 Nov 2020 02:07:22 + x86_64
> > GNU/Linux
> >
> > Hope this can help, to point out the problem.
> >
>
> Is 5.10.10 affected because it included two patches related to halting
> compaction that are relevant.
>
> d20bdd571ee5c9966191568527ecdb1bd4b52368 mm/compaction: stop isolation if too
> many pages are isolated and we have pages to migrate
> 38935861d85a4d9a353d1dd5a156c97700e2765d mm/compaction: count pages and stop
> correctly during page isolation
>
> --
> Mel Gorman
> SUSE Labs
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Mon, 25 Jan 2021 19:54:38 +0100, Tibor Bana said:

> I don't know if it still actual, but I am struggling with this problem right
> now and searching the internet for solutions. I read the thread and saw that
> you are struggling to reproduce the problem, and I can reproduce it almost
> every day.

I'm pretty sure that you have a real bug on your hands. Even if your box is very low on memory, kcompactd should eventually figure out it's not making any progress and wait for the situation to change before trying again.

However, I'm also pretty sure that it's a different one than the one we were chasing, because that one never showed up again once all the patches landed in linux-next, some 18 months before 5.9 was released.
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Mon, Jan 25, 2021 at 07:54:38PM +0100, Tibor Bana wrote:
> Greetings!
>
> I don't know if it still actual, but I am struggling with this problem right
> now and searching the internet for solutions.
> I read the thread and saw that you are struggling to reproduce the problem,
> and I can reproduce it almost every day.
>
> - Install vmware player, and a linux guest.
> - Configure the virtual machine to have a good amount of memory and cpu
> - run resource intensive tasks on the guest
> - when the host used up almost it's all memory and start to reuse caches
>   kcompactd will kick in.
>
> As I know the problem is related to transparent huge pages, but I tried to
> disable it.
> Today I saw the problem again and kcompactd shown an interesting status in
> top. It hasn't used any memory, all zeroes but it used up one core
> completely.
>
> My machine is a core-i7 with 4 physical cores and hyper threading and 24GB
> Memory
> 5.9.11-arch2-1 #1 SMP PREEMPT Sat, 28 Nov 2020 02:07:22 + x86_64 GNU/Linux
>
> Hope this can help, to point out the problem.
>

Is 5.10.10 affected? I ask because it included two patches related to halting compaction that are relevant:

d20bdd571ee5c9966191568527ecdb1bd4b52368 mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
38935861d85a4d9a353d1dd5a156c97700e2765d mm/compaction: count pages and stop correctly during page isolation

--
Mel Gorman
SUSE Labs
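
One rough way to tell whether a running host is at or past the release Mel names is to compare `uname -r` against 5.10.10. This is only a sketch: `version_at_least` is a made-up helper name, it relies on GNU `sort -V`, and a plain version comparison says nothing about distribution kernels that backport or drop individual patches.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeed if version $1 >= version $2.
# Assumes GNU coreutils, whose sort(1) supports -V (version ordering).
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Strip the distribution suffix ("5.9.11-arch2-1" becomes "5.9.11") before comparing.
running=$(uname -r | cut -d- -f1)
if version_at_least "$running" 5.10.10; then
    echo "$running is at or past 5.10.10"
else
    echo "$running predates 5.10.10"
fi
```

For a definitive answer in a kernel git tree, checking whether the two commit IDs above are ancestors of the checked-out revision would be more reliable than comparing version strings.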
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
Greetings!

I don't know if this is still relevant, but I am struggling with this problem right now and searching the internet for solutions. I read the thread and saw that you are struggling to reproduce the problem, and I can reproduce it almost every day.

- Install VMware Player and a Linux guest.
- Configure the virtual machine to have a good amount of memory and CPU.
- Run resource-intensive tasks on the guest.
- When the host has used up almost all of its memory and starts to reuse caches, kcompactd will kick in.

As far as I know the problem is related to transparent huge pages, but I tried disabling them. Today I saw the problem again and kcompactd showed an interesting status in top. It hasn't used any memory, all zeroes, but it used up one core completely.

My machine is a Core i7 with 4 physical cores, hyper-threading and 24GB of memory
5.9.11-arch2-1 #1 SMP PREEMPT Sat, 28 Nov 2020 02:07:22 + x86_64 GNU/Linux

Hope this can help to pinpoint the problem.

Tibor Bana

On Wed, 30 Jan 2019 10:40:20 + Mel Gorman wrote:

> On Tue, Jan 29, 2019 at 11:29:37PM -0500, valdis.kletni...@vt.edu wrote:
> > On Tue, 29 Jan 2019 20:06:39 -0500, valdis.kletni...@vt.edu said:
> > > On Mon, 28 Jan 2019 10:16:27 +0100, Jan Kara said:
> > >
> > > > So my buffer_migrate_page_norefs() is certainly buggy in its current
> > > > incarnation (as a result block device page cache is not migratable at
> > > > all). I've sent Andrew a patch over a week ago but so far it got
> > > > ignored. The patch is attached, can you give it a try whether it
> > > > changes something for you? Thanks!
> > >
> > > Been running with the patch for about 24 hours, haven't seen kcompactd
> > > misbehave. I even fired up a Chrome with a lot of tabs open, a Firefox,
> > > and a kernel build, intentionally drove the system into swapping, and
> > > kcompactd didn't make it into the top 10 on 'top'.
> > >
> > > I'm willing to say put a "tested-by:" on that one, it looks fixed from
> > > here. If there's any remaining bugs, they're ones I can't seem to
> > > trigger...
> >
> > Spoke too soon. Sitting here not stressing the laptop at all, plenty of
> > free memory, and ka-blam.
> >
> > Will keep my eyes open and do the data gathering Mel Gorman wanted - I
> > discovered too late that trace-cmd wasn't installed, and things broke
> > free by themselves (probably not coincidence that I launched a terminal
> > window and then it cleared)
> >
>
> That's unfortunate. I also note that linux-next still has not been
> updated with the latest version of the compaction series. Nevertheless,
> it might be helpful to get the output of
>
> grep -r . /sys/kernel/mm/transparent_hugepage/*
>
> and the trace when the system is in normal use but kcompactd has not
> pegged at 100%. At minimum, I'd like to see what the sources of high-order
> allocations are and the likely causes of wakeups of kcompactd in case
> there are any hints there. Your Kconfig is also potentially useful.
>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs

--
Tibor Bana
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Tue, Jan 29, 2019 at 11:29:37PM -0500, valdis.kletni...@vt.edu wrote:
> On Tue, 29 Jan 2019 20:06:39 -0500, valdis.kletni...@vt.edu said:
> > On Mon, 28 Jan 2019 10:16:27 +0100, Jan Kara said:
> >
> > > So my buffer_migrate_page_norefs() is certainly buggy in its current
> > > incarnation (as a result block device page cache is not migratable at
> > > all). I've sent Andrew a patch over a week ago but so far it got
> > > ignored. The patch is attached, can you give it a try whether it
> > > changes something for you? Thanks!
> >
> > Been running with the patch for about 24 hours, haven't seen kcompactd
> > misbehave. I even fired up a Chrome with a lot of tabs open, a Firefox,
> > and a kernel build, intentionally drove the system into swapping, and
> > kcompactd didn't make it into the top 10 on 'top'.
> >
> > I'm willing to say put a "tested-by:" on that one, it looks fixed from
> > here. If there's any remaining bugs, they're ones I can't seem to
> > trigger...
>
> Spoke too soon. Sitting here not stressing the laptop at all, plenty of free
> memory, and ka-blam.
>
> Will keep my eyes open and do the data gathering Mel Gorman wanted - I
> discovered too late that trace-cmd wasn't installed, and things broke free
> by themselves (probably not coincidence that I launched a terminal window
> and then it cleared)
>

That's unfortunate. I also note that linux-next still has not been updated with the latest version of the compaction series. Nevertheless, it might be helpful to get the output of

    grep -r . /sys/kernel/mm/transparent_hugepage/*

and the trace when the system is in normal use but kcompactd has not pegged at 100%. At minimum, I'd like to see what the sources of high-order allocations are and the likely causes of wakeups of kcompactd in case there are any hints there. Your Kconfig is also potentially useful.

Thanks.

--
Mel Gorman
SUSE Labs
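
The `enabled` and `defrag` files under that directory list every allowed value and mark the active one with square brackets (for example `always madvise [never]`), so the grep output shows at a glance whether THP is really off. Below is a small sketch for pulling out just the active value; `thp_active_setting` is a made-up name and the helper is not something Mel asked for.

```shell
#!/bin/sh
# Hypothetical helper: read a THP sysfs line on stdin and print only the
# bracketed (currently active) option. Lines without brackets print nothing.
thp_active_setting() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Report the active mode of the two main knobs, if this kernel exposes them.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/defrag; do
    if [ -r "$f" ]; then
        printf '%s: %s\n' "$f" "$(thp_active_setting < "$f")"
    fi
done
```

The full `grep -r` output is still the thing to send, since it also covers the khugepaged tunables that have plain numeric values.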
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Tue, 29 Jan 2019 20:06:39 -0500, valdis.kletni...@vt.edu said:
> On Mon, 28 Jan 2019 10:16:27 +0100, Jan Kara said:
>
> > So my buffer_migrate_page_norefs() is certainly buggy in its current
> > incarnation (as a result block device page cache is not migratable at all).
> > I've sent Andrew a patch over a week ago but so far it got ignored. The patch
> > is attached, can you give it a try whether it changes something for you?
> > Thanks!
>
> Been running with the patch for about 24 hours, haven't seen kcompactd
> misbehave. I even fired up a Chrome with a lot of tabs open, a Firefox, and a
> kernel build, intentionally drove the system into swapping, and kcompactd
> didn't make it into the top 10 on 'top'.
>
> I'm willing to say put a "tested-by:" on that one, it looks fixed from here.
> If there's any remaining bugs, they're ones I can't seem to trigger...

Spoke too soon. Sitting here not stressing the laptop at all, plenty of free memory, and ka-blam.

Will keep my eyes open and do the data gathering Mel Gorman wanted - I discovered too late that trace-cmd wasn't installed, and things broke free by themselves (probably not coincidence that I launched a terminal window and then it cleared)

top - 23:24:03 up 2:19, 1 user, load average: 2.70, 2.00, 1.55
Tasks: 221 total, 3 running, 218 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.6 us, 67.3 sy, 0.0 ni, 9.5 id, 0.0 wa, 5.6 hi, 2.0 si, 0.0 st
GiB Mem : 7.6 total, 2.7 free, 3.1 used, 1.8 buff/cache
GiB Swap: 8.0 total, 8.0 free, 0.0 used. 4.1 avail Mem

  PID  PPID %MEM  PR  NI S   VIRT    RES    SHR   SWAP UID %CPU    TIME+ COMMAND
   27     2  0.0  20   0 R   0.0m   0.0m   0.0m   0.0m   0 78.5  2:11.91 kcompactd0
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Mon, 28 Jan 2019 10:16:27 +0100, Jan Kara said:

> So my buffer_migrate_page_norefs() is certainly buggy in its current
> incarnation (as a result block device page cache is not migratable at all).
> I've sent Andrew a patch over a week ago but so far it got ignored. The patch
> is attached, can you give it a try whether it changes something for you?
> Thanks!

Been running with the patch for about 24 hours, haven't seen kcompactd misbehave. I even fired up a Chrome with a lot of tabs open, a Firefox, and a kernel build, intentionally drove the system into swapping, and kcompactd didn't make it into the top 10 on 'top'.

I'm willing to say put a "tested-by:" on that one, it looks fixed from here. If there's any remaining bugs, they're ones I can't seem to trigger...
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Mon, Jan 28, 2019 at 10:16:27AM +0100, Jan Kara wrote:
> On Sun 27-01-19 16:36:34, valdis.kletni...@vt.edu wrote:
> > On Sun, 27 Jan 2019 17:00:27 +0100, Pavel Machek said:
> > > > > I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> > > > > Some more info:
> > > > > 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
> > > > This aspect is curious as it indicates that kcompactd could potentially
> > > > be infinite looping but it's not something I've experienced myself. By
> > > > any chance is there a predictable reproduction case for this?
> > >
> > > I've seen it exactly once, so not sure how reproducible this is. x86-32
> > > machine, running chromium browser, so yes, there was some swapping
> > > involved.
> >
> > I don't have a surefire replicator, but my laptop (x86_64, so it's not a
> > 32-bit only issue) triggers it fairly often, up to multiple times a day.
> > Doesn't seem to be just the Chrome browser that triggers it - usually I'm
> > doing other stuff as well, like a compile or similar. The fact that
> > 'drop_caches' clears it makes me wonder if we're hitting a corner case
> > where cache data isn't being automatically cleared and clogging something
> > up.
>
> So my buffer_migrate_page_norefs() is certainly buggy in its current
> incarnation (as a result block device page cache is not migratable at all).
> I've sent Andrew a patch over a week ago but so far it got ignored. The patch
> is attached, can you give it a try whether it changes something for you?
> Thanks!
>

Definitely worth trying and hopefully both the migration and compaction patches sync up soon. In the event this patch does not help, I would appreciate the following

1) A trace while kcompactd is pegged at 100%

    trace-cmd record -a -e compaction -e migrate -e kmem:mm_page_alloc -e vmscan:mm_vmscan_kswapd_wake -e vmscan:mm_vmscan_kswapd_sleep sleep 10

Compress the resulting trace.dat and email it to me. If it's too big for a reasonable email, drop "-e kmem:mm_page_alloc" from the command line and it should be a more reasonable size. If not, reduce the sleep time to gather a shorter interval.

2) Sample stack traces of kcompactd while pegged at 100%

    echo -n > /tmp/kcompactd-stack; for i in `seq 1 100`; do echo sample $i >> /tmp/kcompactd-stack; cat /proc/`pidof kcompactd0`/stack >> /tmp/kcompactd-stack; done; gzip -f /tmp/kcompactd-stack

And mail me the resulting /tmp/kcompactd-stack.gz

Thanks.

--
Mel Gorman
SUSE Labs
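
Before mailing the samples, a quick tally of which functions show up most often can give a first hint of where kcompactd is spinning. The following is a minimal sketch and not part of Mel's request: `summarise_stack_samples` is a made-up name, and it assumes the `[<0>] symbol+0xoff/0xlen` line format seen in the /proc/PID/stack dumps elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read sampled stacks on stdin and print each symbol
# with the number of frames it appeared in, most frequent first.
summarise_stack_samples() {
    # Keep only frame lines (skipping the "sample N" markers), strip the
    # "+0x.../0x..." offset from the symbol, then count occurrences.
    grep '^\[<' \
        | awk '{ sub(/\+0x.*/, "", $2); if ($2 != "") print $2 }' \
        | sort | uniq -c | sort -rn
}

# Usage, assuming the gzip'ed file produced by the loop above:
#   zcat /tmp/kcompactd-stack.gz | summarise_stack_samples | head
```

Frames common to every sample (kthread, kcompactd, compact_zone) will always top the list; the interesting entries are the deeper ones whose counts differ between a healthy and a pegged kcompactd.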
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On (01/28/19 10:16), Jan Kara wrote:
> On Sun 27-01-19 16:36:34, valdis.kletni...@vt.edu wrote:
> > On Sun, 27 Jan 2019 17:00:27 +0100, Pavel Machek said:
> > > > > I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> > > > > Some more info:
> > > > > 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
> > > > This aspect is curious as it indicates that kcompactd could potentially
> > > > be infinite looping but it's not something I've experienced myself. By
> > > > any chance is there a predictable reproduction case for this?
> > >
> > > I've seen it exactly once, so not sure how reproducible this is. x86-32
> > > machine, running chromium browser, so yes, there was some swapping
> > > involved.
> >
> > I don't have a surefire replicator, but my laptop (x86_64, so it's not a
> > 32-bit only issue) triggers it fairly often, up to multiple times a day.
> > Doesn't seem to be just the Chrome browser that triggers it - usually I'm
> > doing other stuff as well, like a compile or similar. The fact that
> > 'drop_caches' clears it makes me wonder if we're hitting a corner case
> > where cache data isn't being automatically cleared and clogging something
> > up.
>
> So my buffer_migrate_page_norefs() is certainly buggy in its current
> incarnation (as a result block device page cache is not migratable at all).
> I've sent Andrew a patch over a week ago but so far it got ignored. The patch
> is attached, can you give it a try whether it changes something for you?
> Thanks!

Hello Jan,

Just for the record, I'm seeing the same problems on my x86 box [1]. Don't have a reproducer for the issue yet, but will try to test your patch.

Thanks.

[1] https://lore.kernel.org/lkml/20190128085747.GA14454@jagdpanzerIV/T/#u

	-ss
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Sun 27-01-19 16:36:34, valdis.kletni...@vt.edu wrote:
> On Sun, 27 Jan 2019 17:00:27 +0100, Pavel Machek said:
> > > > I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> > > > Some more info:
> > > > 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
> > > This aspect is curious as it indicates that kcompactd could potentially
> > > be infinite looping but it's not something I've experienced myself. By
> > > any chance is there a predictable reproduction case for this?
> >
> > I've seen it exactly once, so not sure how reproducible this is. x86-32
> > machine, running chromium browser, so yes, there was some swapping
> > involved.
>
> I don't have a surefire replicator, but my laptop (x86_64, so it's not a
> 32-bit only issue) triggers it fairly often, up to multiple times a day.
> Doesn't seem to be just the Chrome browser that triggers it - usually I'm
> doing other stuff as well, like a compile or similar. The fact that
> 'drop_caches' clears it makes me wonder if we're hitting a corner case
> where cache data isn't being automatically cleared and clogging something
> up.

So my buffer_migrate_page_norefs() is certainly buggy in its current incarnation (as a result block device page cache is not migratable at all). I've sent Andrew a patch over a week ago but so far it got ignored. The patch is attached, can you give it a try whether it changes something for you? Thanks!

								Honza
--
Jan Kara
SUSE Labs, CR

From 59ab3a8504c35e2215af6c251bdb2a8a1caca1dd Mon Sep 17 00:00:00 2001
From: Jan Kara
Date: Wed, 16 Jan 2019 11:02:48 +0100
Subject: [PATCH] mm: migrate: Make buffer_migrate_page_norefs() actually succeed

Currently, buffer_migrate_page_norefs() was constantly failing because
buffer_migrate_lock_buffers() grabbed reference on each buffer. In fact,
there's no reason for buffer_migrate_lock_buffers() to grab any buffer
references as the page is locked during all our operation and thus
nobody can reclaim buffers from the page. So remove grabbing of buffer
references which also makes buffer_migrate_page_norefs() succeed.

Fixes: 89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()"
Signed-off-by: Jan Kara
---
 mm/migrate.c | 5 -----
 1 file changed, 5 deletions(-)

Andrew, can you please merge this patch? Sadly my previous testing only tested
that page migration in general didn't get broken but I forgot to test whether
the new migrate page callback actually results in more successful migrations
for block device pages. So the bug got only revealed by customer testing. Now
I've reproduced the workload internally and verified that the patch indeed
fixes the issue.

diff --git a/mm/migrate.c b/mm/migrate.c
index a16b15090df3..712b231a7376 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -709,7 +709,6 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
 	/* Simple case, sync compaction */
 	if (mode != MIGRATE_ASYNC) {
 		do {
-			get_bh(bh);
 			lock_buffer(bh);
 			bh = bh->b_this_page;
 
@@ -720,18 +719,15 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
 
 	/* async case, we cannot block on lock_buffer so use trylock_buffer */
 	do {
-		get_bh(bh);
 		if (!trylock_buffer(bh)) {
 			/*
 			 * We failed to lock the buffer and cannot stall in
 			 * async migration. Release the taken locks
 			 */
 			struct buffer_head *failed_bh = bh;
-			put_bh(failed_bh);
 			bh = head;
 			while (bh != failed_bh) {
 				unlock_buffer(bh);
-				put_bh(bh);
 				bh = bh->b_this_page;
 			}
 			return false;
@@ -818,7 +814,6 @@ static int __buffer_migrate_page(struct address_space *mapping,
 	bh = head;
 	do {
 		unlock_buffer(bh);
-		put_bh(bh);
 		bh = bh->b_this_page;
 
 	} while (bh != head);
--
2.16.4
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Sun, 27 Jan 2019 17:00:27 +0100, Pavel Machek said:

> > > I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> > > Some more info:
> > > 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
> > This aspect is curious as it indicates that kcompactd could potentially
> > be infinite looping but it's not something I've experienced myself. By
> > any chance is there a predictable reproduction case for this?
>
> I've seen it exactly once, so not sure how reproducible this is. x86-32
> machine, running chromium browser, so yes, there was some swapping
> involved.

I don't have a surefire replicator, but my laptop (x86_64, so it's not a 32-bit only issue) triggers it fairly often, up to multiple times a day. Doesn't seem to be just the Chrome browser that triggers it - usually I'm doing other stuff as well, like a compile or similar. The fact that 'drop_caches' clears it makes me wonder if we're hitting a corner case where cache data isn't being automatically cleared and clogging something up.

Any particular diagnostic info you want me to get next time it hits? (Am currently on next-20190125, if that matters).
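
Since the spin is intermittent, it can help to notice it without keeping top open. The sketch below measures how many CPU clock ticks a process consumed over an interval by reading fields 14 (utime) and 15 (stime) of /proc/PID/stat twice. The helper names `cpu_ticks` and `ticks_over_interval` are made up for illustration, and the parsing assumes the process name contains no spaces, which holds for kcompactd0.

```shell
#!/bin/sh
# Hypothetical helper: read one /proc/PID/stat line on stdin and print
# utime + stime (fields 14 and 15), measured in clock ticks.
cpu_ticks() {
    awk '{ print $14 + $15 }'
}

# Hypothetical helper: print the ticks consumed by PID ($1) over $2 seconds.
ticks_over_interval() {
    pid=$1
    secs=$2
    before=$(cpu_ticks < "/proc/$pid/stat") || return 1
    sleep "$secs"
    after=$(cpu_ticks < "/proc/$pid/stat") || return 1
    echo $((after - before))
}

# Usage: ticks_over_interval "$(pidof kcompactd0)" 10
```

Dividing the result by `getconf CLK_TCK` times the interval gives the fraction of one core that was used; a value near 1.0 for kcompactd0 would be the moment to grab the stack and trace data.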
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
Hi!

> > > top - 13:38:51 up 1:42, 16 users, load average: 1.41, 1.93, 1.62
> > > Tasks: 182 total, 3 running, 138 sleeping, 0 stopped, 0 zombie
> > > %Cpu(s): 2.3 us, 57.8 sy, 0.0 ni, 39.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> > > KiB Mem: 3020044 total, 2429420 used, 590624 free, 27468 buffers
> > > KiB Swap: 2097148 total, 0 used, 2097148 free. 1924268 cached Mem
> > >
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> > >   608 root      20   0       0      0      0 R  99.6  0.0  11:34.38 kcompactd0
> > >  9782 root      20   0       0      0      0 I   7.9  0.0   0:59.02 kworker/0:
> > >  2971 root      20   0   46624  23076  13576 S   4.3  0.8   2:50.22 Xorg
> >
> > I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> >
> > Some more info:
> >
> > 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
> >
>
> This aspect is curious as it indicates that kcompactd could potentially
> be infinite looping but it's not something I've experienced myself. By
> any chance is there a predictable reproduction case for this?

I've seen it exactly once, so not sure how reproducible this is. x86-32 machine, running chromium browser, so yes, there was some swapping involved.

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Sat, Jan 26, 2019 at 09:56:53PM -0500, valdis.kletni...@vt.edu wrote:
> On Sat, 26 Jan 2019 21:00:05 +0100, Pavel Machek said:
>
> > top - 13:38:51 up 1:42, 16 users, load average: 1.41, 1.93, 1.62
> > Tasks: 182 total, 3 running, 138 sleeping, 0 stopped, 0 zombie
> > %Cpu(s): 2.3 us, 57.8 sy, 0.0 ni, 39.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> > KiB Mem: 3020044 total, 2429420 used, 590624 free, 27468 buffers
> > KiB Swap: 2097148 total, 0 used, 2097148 free. 1924268 cached Mem
> >
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> >   608 root      20   0       0      0      0 R  99.6  0.0  11:34.38 kcompactd0
> >  9782 root      20   0       0      0      0 I   7.9  0.0   0:59.02 kworker/0:
> >  2971 root      20   0   46624  23076  13576 S   4.3  0.8   2:50.22 Xorg
>
> I've noticed this as well on earlier kernels (next-20181224 to 20190115)
>
> Some more info:
>
> 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
>

This aspect is curious as it indicates that kcompactd could potentially be infinite looping but it's not something I've experienced myself. By any chance is there a predictable reproduction case for this?

> I've also seen khugepaged hung up:
>
> cat /proc/29/stack
> [<0>] ___preempt_schedule+0x16/0x18
> [<0>] page_vma_mapped_walk+0x60/0x840
> [<0>] remove_migration_pte+0x67/0x390
> [<0>] rmap_walk_file+0x186/0x380
> [<0>] rmap_walk+0xa3/0xd0
> [<0>] remove_migration_ptes+0x69/0x70
> [<0>] migrate_pages+0xb6d/0xfd8
> [<0>] compact_zone+0xb70/0x1370
> [<0>] compact_zone_order+0xd8/0x120
> [<0>] try_to_compact_pages+0xe5/0x550
> [<0>] __alloc_pages_direct_compact+0x6d/0x1a0
> [<0>] __alloc_pages_slowpath+0x6c9/0x1640
> [<0>] __alloc_pages_nodemask+0x558/0x5b0
> [<0>] khugepaged+0x499/0x810
> [<0>] kthread+0x158/0x170
> [<0>] ret_from_fork+0x3a/0x50
> [<0>] 0x
>
> Looks like something has gone astray with compact_zone.
>

It's a possibility that the buffer aspect of the trace is a red herring and there is some corner case that prevents the migration scanner and free scanner from meeting and exiting compaction. Again, a reproduction case of some sort would be nice, or an indication of how long it takes to trigger.

An update of the series is due which may or may not fix this but if it doesn't, we'll need to start tracing this to see what's going on at the point of failure.

--
Mel Gorman
SUSE Labs
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
Adding Jan Kara to cc due to the fact it appears the lockup is within buffer_migrate_page_norefs which changed recently.

On Sat, Jan 26, 2019 at 09:56:53PM -0500, valdis.kletni...@vt.edu wrote:
> On Sat, 26 Jan 2019 21:00:05 +0100, Pavel Machek said:
>
> > top - 13:38:51 up 1:42, 16 users, load average: 1.41, 1.93, 1.62
> > Tasks: 182 total, 3 running, 138 sleeping, 0 stopped, 0 zombie
> > %Cpu(s): 2.3 us, 57.8 sy, 0.0 ni, 39.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> > KiB Mem: 3020044 total, 2429420 used, 590624 free, 27468 buffers
> > KiB Swap: 2097148 total, 0 used, 2097148 free. 1924268 cached Mem
> >
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> >   608 root      20   0       0      0      0 R  99.6  0.0  11:34.38 kcompactd0
> >  9782 root      20   0       0      0      0 I   7.9  0.0   0:59.02 kworker/0:
> >  2971 root      20   0   46624  23076  13576 S   4.3  0.8   2:50.22 Xorg
>
> I've noticed this as well on earlier kernels (next-20181224 to 20190115)
>
> Some more info:
>
> 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
>
> 2) Typical kcompactd traceback:
>
> cat /proc/27/stack
> [<0>] retint_kernel+0x1b/0x2d
> [<0>] lock_is_held_type+0x1b/0x50
> [<0>] ___might_sleep+0xad/0x220
> [<0>] __might_sleep+0x113/0x130
> [<0>] on_each_cpu_cond_mask+0x12a/0x140
> [<0>] on_each_cpu_cond+0x18/0x20
> [<0>] invalidate_bh_lrus+0x29/0x30
> [<0>] __buffer_migrate_page+0x154/0x340
> [<0>] buffer_migrate_page_norefs+0x14/0x20
> [<0>] move_to_new_page+0x8e/0x360
> [<0>] migrate_pages+0x3cc/0xfd8
> [<0>] compact_zone+0xb70/0x1380
> [<0>] kcompactd_do_work+0x15b/0x500
> [<0>] kcompactd+0x74/0x340
> [<0>] kthread+0x158/0x170
> [<0>] ret_from_fork+0x3a/0x50
> [<0>] 0x
>
> I've also seen khugepaged hung up:
>
> cat /proc/29/stack
> [<0>] ___preempt_schedule+0x16/0x18
> [<0>] page_vma_mapped_walk+0x60/0x840
> [<0>] remove_migration_pte+0x67/0x390
> [<0>] rmap_walk_file+0x186/0x380
> [<0>] rmap_walk+0xa3/0xd0
> [<0>] remove_migration_ptes+0x69/0x70
> [<0>] migrate_pages+0xb6d/0xfd8
> [<0>] compact_zone+0xb70/0x1370
> [<0>] compact_zone_order+0xd8/0x120
> [<0>] try_to_compact_pages+0xe5/0x550
> [<0>] __alloc_pages_direct_compact+0x6d/0x1a0
> [<0>] __alloc_pages_slowpath+0x6c9/0x1640
> [<0>] __alloc_pages_nodemask+0x558/0x5b0
> [<0>] khugepaged+0x499/0x810
> [<0>] kthread+0x158/0x170
> [<0>] ret_from_fork+0x3a/0x50
> [<0>] 0x
>
> Looks like something has gone astray with compact_zone.
>

--
Mel Gorman
SUSE Labs
Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
On Sat, 26 Jan 2019 21:00:05 +0100, Pavel Machek said:

> top - 13:38:51 up 1:42, 16 users, load average: 1.41, 1.93, 1.62
> Tasks: 182 total, 3 running, 138 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 2.3 us, 57.8 sy, 0.0 ni, 39.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> KiB Mem: 3020044 total, 2429420 used, 590624 free, 27468 buffers
> KiB Swap: 2097148 total, 0 used, 2097148 free. 1924268 cached Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>   608 root      20   0       0      0      0 R  99.6  0.0  11:34.38 kcompactd0
>  9782 root      20   0       0      0      0 I   7.9  0.0   0:59.02 kworker/0:
>  2971 root      20   0   46624  23076  13576 S   4.3  0.8   2:50.22 Xorg

I've noticed this as well on earlier kernels (next-20181224 to 20190115)

Some more info:

1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.

2) Typical kcompactd traceback:

cat /proc/27/stack
[<0>] retint_kernel+0x1b/0x2d
[<0>] lock_is_held_type+0x1b/0x50
[<0>] ___might_sleep+0xad/0x220
[<0>] __might_sleep+0x113/0x130
[<0>] on_each_cpu_cond_mask+0x12a/0x140
[<0>] on_each_cpu_cond+0x18/0x20
[<0>] invalidate_bh_lrus+0x29/0x30
[<0>] __buffer_migrate_page+0x154/0x340
[<0>] buffer_migrate_page_norefs+0x14/0x20
[<0>] move_to_new_page+0x8e/0x360
[<0>] migrate_pages+0x3cc/0xfd8
[<0>] compact_zone+0xb70/0x1380
[<0>] kcompactd_do_work+0x15b/0x500
[<0>] kcompactd+0x74/0x340
[<0>] kthread+0x158/0x170
[<0>] ret_from_fork+0x3a/0x50
[<0>] 0x

I've also seen khugepaged hung up:

cat /proc/29/stack
[<0>] ___preempt_schedule+0x16/0x18
[<0>] page_vma_mapped_walk+0x60/0x840
[<0>] remove_migration_pte+0x67/0x390
[<0>] rmap_walk_file+0x186/0x380
[<0>] rmap_walk+0xa3/0xd0
[<0>] remove_migration_ptes+0x69/0x70
[<0>] migrate_pages+0xb6d/0xfd8
[<0>] compact_zone+0xb70/0x1370
[<0>] compact_zone_order+0xd8/0x120
[<0>] try_to_compact_pages+0xe5/0x550
[<0>] __alloc_pages_direct_compact+0x6d/0x1a0
[<0>] __alloc_pages_slowpath+0x6c9/0x1640
[<0>] __alloc_pages_nodemask+0x558/0x5b0
[<0>] khugepaged+0x499/0x810
[<0>] kthread+0x158/0x170
[<0>] ret_from_fork+0x3a/0x50
[<0>] 0x

Looks like something has gone astray with compact_zone.
[regression -next0117] What is kcompactd and why is he eating 100% of my cpu?
Hi!

With the modern web, 100% CPU load is no longer uncommon, but this time chromium is not to blame:

pavel@amd:/data/l/linux-next-32$ uname -a
Linux amd 5.0.0-rc2-next-20190117 #214 SMP Fri Jan 18 09:47:18 CET 2019 i686 GNU/Linux

top - 13:38:51 up 1:42, 16 users, load average: 1.41, 1.93, 1.62
Tasks: 182 total, 3 running, 138 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.3 us, 57.8 sy, 0.0 ni, 39.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 3020044 total, 2429420 used, 590624 free, 27468 buffers
KiB Swap: 2097148 total, 0 used, 2097148 free. 1924268 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  608 root      20   0       0      0      0 R  99.6  0.0  11:34.38 kcompactd0
 9782 root      20   0       0      0      0 I   7.9  0.0   0:59.02 kworker/0:+
 2971 root      20   0   46624  23076  13576 S   4.3  0.8   2:50.22 Xorg

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html