Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 02:18:48PM -0700, Shakeel Butt wrote:
> Yes, you are right. Let's first get this patch tested and after
> confirmation we can update the commit message.

Thanks Shakeel! I've tested your patch, as well as reverting the three
commits that Linus had suggested, and both seem to address the problem
for me as well.

I did see a small number of failures immediately after the VM booted
when testing with the "revert the three commits" approach, but this
appears to be a different failure, one which I had also been seeing
periodically during the bisect and which was no doubt introducing noise
into my testing:

[   28.545018] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
[   28.545018] Modules linked in:
[   28.545018] irq event stamp: 4517759
[   28.545018] hardirqs last enabled at (4517758): [] asm_common_interrupt+0x1e/0x40
[   28.545018] hardirqs last disabled at (4517759): [] sysvec_apic_timer_interrupt+0xb/0x90
[   28.545018] softirqs last enabled at (10634): [] irq_enter_rcu+0x6d/0x70
[   28.545018] softirqs last disabled at (10635): [] asm_call_on_stack+0x12/0x20
[   28.545018] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc6-xfstests-7-g3f3cb48a7d90 #1916
[   28.545018] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[   28.545018] RIP: 0010:__do_softirq+0xa3/0x435
[   28.545018] Code: 00 83 80 ac 07 00 00 01 48 89 44 24 08 c7 44 24 1c 0a 00 00 00 65 66 c7 05 a8 ae 9e 55 00 00 e8 d3 92 3b ff fb b8 ff ff ff ff <48> c7 c3 40 51 00 ab 41 0f bc c7 89 c6 83 c6 01 89 74 24 04 75 6a
[   28.545018] RSP: :b89f000e0f98 EFLAGS: 0202
[   28.545018] RAX: RBX: RCX: 298a
[   28.545018] RDX: RSI: RDI: aa80009d
[   28.545018] RBP: b89f000abda0 R08: 0001 R09:
[   28.545018] R10: 0001 R11: 0046 R12: 0001
[   28.545018] R13: R14: R15: 0080
[   28.545018] FS: () GS:998e5920() knlGS:
[   28.545018] CS: 0010 DS: ES: CR0: 80050033
[   28.545018] CR2: CR3: 00023e012001 CR4: 001706e0
[   28.545018] DR0: DR1: DR2:
[   28.545018] DR3: DR6: fffe0ff0 DR7: 0400
[   28.545018] Call Trace:
[   28.545018]
[   28.545018] asm_call_on_stack+0x12/0x20
[   28.545018]
[   28.545018] do_softirq_own_stack+0x4e/0x60
[   28.545018] irq_exit_rcu+0x9f/0xe0
[   28.545018] sysvec_call_function_single+0x43/0x90
[   28.545018] asm_sysvec_call_function_single+0x12/0x20
[   28.545018] RIP: 0010:acpi_idle_do_entry+0x54/0x70
[   28.545018] Code: ed c3 e9 cf fe ff ff 65 48 8b 04 25 00 6e 01 00 48 8b 00 a8 08 75 ea e8 ba c0 5b ff e9 07 00 00 00 0f 00 2d f8 3d 4e 00 fb f4 <9c> 58 fa f6 c4 02 74 cf e9 5f c2 5b ff cc cc cc cc cc cc cc cc cc
[   28.545018] RSP: :b89f000abe88 EFLAGS: 0202
[   28.545018] RAX: 293b RBX: 998e5564 RCX: 1a12
[   28.545018] RDX: RSI: RDI: aa5fd2b6
[   28.545018] RBP: ab163760 R08: 0001 R09: 000e003c
[   28.545018] R10: 998e582e2340 R11: 0046 R12: 0001
[   28.545018] R13: 0001 R14: ab1637e0 R15:
[   28.545018] ? acpi_idle_do_entry+0x46/0x70
[   28.545018] ? acpi_idle_do_entry+0x46/0x70
[   28.545018] acpi_idle_enter+0x7d/0xb0
[   28.545018] cpuidle_enter_state+0x84/0x2c0
[   28.545018] cpuidle_enter+0x29/0x40
[   28.545018] cpuidle_idle_call+0x111/0x180
[   28.545018] do_idle+0x7b/0xd0
[   28.545018] cpu_startup_entry+0x19/0x20
[   28.545018] secondary_startup_64+0xb6/0xc0

I think this was an issue relating to acpi_idle that others have
reported, but I thought it was fixed before -rc6 was released? In any
case, this kernel is post -rc6, so apparently something else is going on
here, and it is probably unrelated to the regression which Shakeel's
patch is addressing.

					- Ted
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Sat, Sep 26, 2020 at 09:43:25AM +0800, Ming Lei wrote:
> On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> > On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt wrote:
> > >
> > > [snip]
> > >
> > > I don't think you can ignore the flushing. The __free_once() in
> > > ___cache_free() assumes there is a space available.
> > >
> > > BTW do_drain() also has the same issue.
> > >
> > > Why not move slabs_destroy() after we update ac->avail and memmove()?
> >
> > Ming, can you please try the following patch?
> >
> >
> > From: Shakeel Butt
> >
> > [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
> >
> > With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> > kmem_caches for all allocations"), it becomes possible to call kfree()
> > from slabs_destroy(). However, if slabs_destroy() is being called for
> > the array_cache of the local CPU, this opens the potential scenario of
> > infinite recursion, because kfree() called from slabs_destroy() can
> > call slabs_destroy() with the same array_cache of the local CPU. Since
> > the array_cache of the local CPU is not updated before calling
> > slabs_destroy(), it will try to free the same pages.
> >
> > To fix the issue, simply update the cache before calling
> > slabs_destroy().
> >
> > Signed-off-by: Shakeel Butt
> > ---
> >  mm/slab.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/slab.c b/mm/slab.c
> > index 3160dff6fd76..f658e86ec8ce 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
> >  		kmem_cache_free(cachep->freelist_cache, freelist);
> >  }
> >
> > +/*
> > + * Update the size of the caches before calling slabs_destroy as it may
> > + * recursively call kfree.
> > + */
> >  static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
> >  {
> >  	struct page *page, *n;
> > @@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
> >  	spin_lock(&n->list_lock);
> >  	free_block(cachep, ac->entry, ac->avail, node, &list);
> >  	spin_unlock(&n->list_lock);
> > -	slabs_destroy(cachep, &list);
> >  	ac->avail = 0;
> > +	slabs_destroy(cachep, &list);
> >  }
> >
> >  static void drain_cpu_caches(struct kmem_cache *cachep)
> > @@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
> >  	}
> >  #endif
> >  	spin_unlock(&n->list_lock);
> > -	slabs_destroy(cachep, &list);
> >  	ac->avail -= batchcount;
> >  	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
> > +	slabs_destroy(cachep, &list);
> >  }
>
> The issue can't be reproduced after applying this patch:
>
> Tested-by: Ming Lei

Perfect, thank you very much for the confirmation!

Shakeel, can you please resend the patch with the proper fixes tag and
the updated commit log? Please feel free to add
Reviewed-by: Roman Gushchin .

Thank you!

Roman
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt wrote:
> >
> > [snip]
> >
> > I don't think you can ignore the flushing. The __free_once() in
> > ___cache_free() assumes there is a space available.
> >
> > BTW do_drain() also has the same issue.
> >
> > Why not move slabs_destroy() after we update ac->avail and memmove()?
>
> Ming, can you please try the following patch?
>
>
> From: Shakeel Butt
>
> [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
>
> With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> kmem_caches for all allocations"), it becomes possible to call kfree()
> from slabs_destroy(). However, if slabs_destroy() is being called for
> the array_cache of the local CPU, this opens the potential scenario of
> infinite recursion, because kfree() called from slabs_destroy() can
> call slabs_destroy() with the same array_cache of the local CPU. Since
> the array_cache of the local CPU is not updated before calling
> slabs_destroy(), it will try to free the same pages.
>
> To fix the issue, simply update the cache before calling
> slabs_destroy().
>
> Signed-off-by: Shakeel Butt
> ---
>  mm/slab.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 3160dff6fd76..f658e86ec8ce 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
>  		kmem_cache_free(cachep->freelist_cache, freelist);
>  }
>
> +/*
> + * Update the size of the caches before calling slabs_destroy as it may
> + * recursively call kfree.
> + */
>  static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
>  {
>  	struct page *page, *n;
> @@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
>  	spin_lock(&n->list_lock);
>  	free_block(cachep, ac->entry, ac->avail, node, &list);
>  	spin_unlock(&n->list_lock);
> -	slabs_destroy(cachep, &list);
>  	ac->avail = 0;
> +	slabs_destroy(cachep, &list);
>  }
>
>  static void drain_cpu_caches(struct kmem_cache *cachep)
> @@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
>  	}
>  #endif
>  	spin_unlock(&n->list_lock);
> -	slabs_destroy(cachep, &list);
>  	ac->avail -= batchcount;
>  	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
> +	slabs_destroy(cachep, &list);
>  }

The issue can't be reproduced after applying this patch:

Tested-by: Ming Lei

Thanks,
Ming
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 1:56 PM Roman Gushchin wrote:
>
> On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> > On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt wrote:
> > >
> > > [snip]
> > >
> > > I don't think you can ignore the flushing. The __free_once() in
> > > ___cache_free() assumes there is a space available.
> > >
> > > BTW do_drain() also has the same issue.
> > >
> > > Why not move slabs_destroy() after we update ac->avail and memmove()?
> >
> > Ming, can you please try the following patch?
> >
> >
> > From: Shakeel Butt
> >
> > [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
> >
> > With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> > kmem_caches for all allocations"), it becomes possible to call kfree()
> > from slabs_destroy(). However, if slabs_destroy() is being called for
> > the array_cache of the local CPU, this opens the potential scenario of
> > infinite recursion, because kfree() called from slabs_destroy() can
> > call slabs_destroy() with the same array_cache of the local CPU. Since
> > the array_cache of the local CPU is not updated before calling
> > slabs_destroy(), it will try to free the same pages.
> >
> > To fix the issue, simply update the cache before calling
> > slabs_destroy().
> >
> > Signed-off-by: Shakeel Butt
>
> I like the patch and I think it should fix the problem.
>
> However the description above should likely be adjusted a bit.
> It seems that the problem is not necessarily caused by an infinite
> recursion; it can be even simpler.
>
> In cache_flusharray() we rely on the state of ac, which is described
> by ac->avail. In particular we rely on batchcount < ac->avail,
> as we shift the batchcount number of pointers by memmove.
> But if slabs_destroy() is called before that and leads to a change of
> the ac state, it can lead to a memory corruption.
>
> Also, unconditionally resetting ac->avail to 0 in do_drain() after
> calling slabs_destroy() seems to be wrong.
> It explains the double free BUGs we've seen in stacktraces.

Yes, you are right. Let's first get this patch tested and after
confirmation we can update the commit message.
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt wrote:
> >
> > [snip]
> >
> > I don't think you can ignore the flushing. The __free_once() in
> > ___cache_free() assumes there is a space available.
> >
> > BTW do_drain() also has the same issue.
> >
> > Why not move slabs_destroy() after we update ac->avail and memmove()?
>
> Ming, can you please try the following patch?
>
>
> From: Shakeel Butt
>
> [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
>
> With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> kmem_caches for all allocations"), it becomes possible to call kfree()
> from slabs_destroy(). However, if slabs_destroy() is being called for
> the array_cache of the local CPU, this opens the potential scenario of
> infinite recursion, because kfree() called from slabs_destroy() can
> call slabs_destroy() with the same array_cache of the local CPU. Since
> the array_cache of the local CPU is not updated before calling
> slabs_destroy(), it will try to free the same pages.
>
> To fix the issue, simply update the cache before calling
> slabs_destroy().
>
> Signed-off-by: Shakeel Butt

I like the patch and I think it should fix the problem.

However the description above should likely be adjusted a bit.
It seems that the problem is not necessarily caused by an infinite
recursion; it can be even simpler.

In cache_flusharray() we rely on the state of ac, which is described
by ac->avail. In particular we rely on batchcount < ac->avail,
as we shift the batchcount number of pointers by memmove.
But if slabs_destroy() is called before that and leads to a change of
the ac state, it can lead to a memory corruption.

Also, unconditionally resetting ac->avail to 0 in do_drain() after
calling slabs_destroy() seems to be wrong.
It explains the double free BUGs we've seen in stacktraces.

Thanks!

> ---
>  mm/slab.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 3160dff6fd76..f658e86ec8ce 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
>  		kmem_cache_free(cachep->freelist_cache, freelist);
>  }
>
> +/*
> + * Update the size of the caches before calling slabs_destroy as it may
> + * recursively call kfree.
> + */
>  static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
>  {
>  	struct page *page, *n;
> @@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
>  	spin_lock(&n->list_lock);
>  	free_block(cachep, ac->entry, ac->avail, node, &list);
>  	spin_unlock(&n->list_lock);
> -	slabs_destroy(cachep, &list);
>  	ac->avail = 0;
> +	slabs_destroy(cachep, &list);
>  }
>
>  static void drain_cpu_caches(struct kmem_cache *cachep)
> @@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
>  	}
>  #endif
>  	spin_unlock(&n->list_lock);
> -	slabs_destroy(cachep, &list);
>  	ac->avail -= batchcount;
>  	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
> +	slabs_destroy(cachep, &list);
>  }
>
>  /*
> --
> 2.28.0.681.g6f77f65b4e-goog
>
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt wrote:
>
> [snip]
>
> I don't think you can ignore the flushing. The __free_once() in
> ___cache_free() assumes there is a space available.
>
> BTW do_drain() also has the same issue.
>
> Why not move slabs_destroy() after we update ac->avail and memmove()?

Ming, can you please try the following patch?


From: Shakeel Butt

[PATCH] mm: slab: fix potential infinite recursion in ___cache_free

With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
kmem_caches for all allocations"), it becomes possible to call kfree()
from slabs_destroy(). However, if slabs_destroy() is being called for
the array_cache of the local CPU, this opens the potential scenario of
infinite recursion, because kfree() called from slabs_destroy() can
call slabs_destroy() with the same array_cache of the local CPU. Since
the array_cache of the local CPU is not updated before calling
slabs_destroy(), it will try to free the same pages.

To fix the issue, simply update the cache before calling
slabs_destroy().

Signed-off-by: Shakeel Butt
---
 mm/slab.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 3160dff6fd76..f658e86ec8ce 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
 		kmem_cache_free(cachep->freelist_cache, freelist);
 }

+/*
+ * Update the size of the caches before calling slabs_destroy as it may
+ * recursively call kfree.
+ */
 static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
 {
 	struct page *page, *n;
@@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
 	spin_lock(&n->list_lock);
 	free_block(cachep, ac->entry, ac->avail, node, &list);
 	spin_unlock(&n->list_lock);
-	slabs_destroy(cachep, &list);
 	ac->avail = 0;
+	slabs_destroy(cachep, &list);
 }

 static void drain_cpu_caches(struct kmem_cache *cachep)
@@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
 	}
 #endif
 	spin_unlock(&n->list_lock);
-	slabs_destroy(cachep, &list);
 	ac->avail -= batchcount;
 	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
+	slabs_destroy(cachep, &list);
 }

 /*
--
2.28.0.681.g6f77f65b4e-goog
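[Editor's note: the reordering in the patch above can be illustrated with a small userspace sketch. This is NOT kernel code; the toy_* names, the fixed-size array, and the one-shot reentrancy flag (standing in for the real ac->avail >= ac->limit trigger) are all invented here for illustration.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Toy model of the patched ordering in cache_flusharray(): shrink the
 * cache and memmove() the remaining pointers *before* freeing the
 * batch, so that a reentrant free (modeling kfree() called from
 * slabs_destroy()) observes consistent state.
 */
#define TOY_LIMIT 8

struct toy_ac {
	int avail;
	int batchcount;
	int entry[TOY_LIMIT];
};

static int toy_free_count[64];	/* times each object id was "freed" */
static bool toy_reentered;	/* permit exactly one nested flush */

static void toy_flusharray_fixed(struct toy_ac *ac);

/* Freeing a batch may itself flush, like kfree() from slabs_destroy() */
static void toy_slabs_destroy(struct toy_ac *ac, const int *batch, int n)
{
	for (int i = 0; i < n; i++)
		toy_free_count[batch[i]]++;
	if (!toy_reentered) {
		toy_reentered = true;
		toy_flusharray_fixed(ac);	/* nested flush sees updated ac */
	}
}

static void toy_flusharray_fixed(struct toy_ac *ac)
{
	int batch[TOY_LIMIT];
	int n = ac->batchcount < ac->avail ? ac->batchcount : ac->avail;

	memcpy(batch, ac->entry, n * sizeof(int));

	/* Patched ordering: update ac->avail and shift the entries first... */
	ac->avail -= n;
	memmove(ac->entry, &ac->entry[n], ac->avail * sizeof(int));

	/* ...then destroy, so a recursive flush cannot pick the same entries */
	toy_slabs_destroy(ac, batch, n);
}
```

With four cached objects and a batchcount of two, the nested flush drains the remaining two objects rather than re-freeing the first batch, so every object is freed exactly once.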
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 10:48 AM Roman Gushchin wrote:
>
> On Fri, Sep 25, 2020 at 10:35:03AM -0700, Shakeel Butt wrote:
> > On Fri, Sep 25, 2020 at 10:22 AM Shakeel Butt wrote:
> > >
> > > On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds wrote:
> > > >
> > > > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei wrote:
> > > > >
> > > > > git bisect shows the first bad commit:
> > > > >
> > > > >   [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab:
> > > > >   use a single set of kmem_caches for all allocations
> > > > >
> > > > > And I have double checked that the above commit is really the first bad
> > > > > commit for the list corruption issue of 'list_del corruption,
> > > > > e1c241b00408->next is LIST_POISON1 (dead0100)',
> > > >
> > > > That commit doesn't revert cleanly, but I think that's purely because
> > > > we'd also need to revert
> > > >
> > > >   849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
> > > >   74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to (un)account_slab_page()")
> > > >
> > > > too.
> > > >
> > > > Can you verify that a
> > > >
> > > >   git revert 74d555bed5d0 849504809f86 10befea91b61
> > > >
> > > > on top of current -git makes things work for you again?
> > > >
> > > > I'm going to do an rc8 this release simply because we have another VM
> > > > issue that I hope to get fixed - but there we know what the problem
> > > > and the fix _is_, it just needs some care.
> > > >
> > > > So if Roman (or somebody else) can see what's wrong and we can fix
> > > > this quickly, we don't need to go down the revert path, but ..
> > >
> > > I think I have a theory. The issue is happening due to the potential
> > > infinite recursion:
> > >
> > > [ 5060.124412] ___cache_free+0x488/0x6b0
> > > *Second recursion
> > > [ 5060.128666] kfree+0xc9/0x1d0
> > > [ 5060.131947] kmem_freepages+0xa0/0xf0
> > > [ 5060.135746] slab_destroy+0x19/0x50
> > > [ 5060.139577] slabs_destroy+0x6d/0x90
> > > [ 5060.143379] ___cache_free+0x4a3/0x6b0
> > > *First recursion
> > > [ 5060.147896] kfree+0xc9/0x1d0
> > > [ 5060.151082] kmem_freepages+0xa0/0xf0
> > > [ 5060.155121] slab_destroy+0x19/0x50
> > > [ 5060.159028] slabs_destroy+0x6d/0x90
> > > [ 5060.162920] ___cache_free+0x4a3/0x6b0
> > > [ 5060.167097] kfree+0xc9/0x1d0
> > >
> > > ___cache_free() is calling cache_flusharray() to flush the local cpu
> > > array_cache if the cache has more elements than the limit (ac->avail
> > > >= ac->limit).
> > >
> > > cache_flusharray() is removing batchcount number of elements from the
> > > local cpu array_cache and passing them to slabs_destroy() (if the node
> > > shared cache is also full).
> > >
> > > Note that we have not updated the local cpu array_cache size yet and
> > > have called slabs_destroy(), which can call kfree() through
> > > unaccount_slab_page().
> > >
> > > We are on the same CPU and this recursive kfree again checks
> > > (ac->avail >= ac->limit) and calls cache_flusharray() again and
> > > recurses indefinitely.
>
> It's a cool theory! And it explains why we haven't seen it with SLUB.
>
> > I can see two possible fixes. We can either do async kfree of
> > page_obj_cgroups(page) or we can update the local cpu array_cache's
> > size before slabs_destroy().
>
> I wonder if something like this can fix the problem?
> (completely untested).
>
> --
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 684ebe5b0c7a..c94b9ccfb803 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -186,6 +186,7 @@ struct array_cache {
>  	unsigned int limit;
>  	unsigned int batchcount;
>  	unsigned int touched;
> +	bool flushing;
>  	void *entry[];	/*
>  			 * Must have this definition in here for the proper
>  			 * alignment of array_cache. Also simplifies accessing
> @@ -526,6 +527,7 @@ static void init_arraycache(struct array_cache *ac, int limit, int batch)
>  		ac->limit = limit;
>  		ac->batchcount = batch;
>  		ac->touched = 0;
> +		ac->flushing = false;
>  	}
>  }
>
> @@ -3368,6 +3370,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
>  	int node = numa_mem_id();
>  	LIST_HEAD(list);
>
> +	if (ac->flushing)
> +		return;
> +
> +	ac->flushing = true;
> +
>  	batchcount = ac->batchcount;
>
>  	check_irq_off();
> @@ -3404,6 +3411,7 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
>  	spin_unlock(&n->list_lock);
>  	slabs_destroy(cachep, &list);
>  	ac->avail -= batchcount;
> +	ac->flushing = false;
>  	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
>  }
>

I don't think you can ignore the flushing. The __free_once() in
___cache_free() assumes there is a space available.

BTW do_drain() also has the same issue.

Why not move slabs_destroy() after we update ac->avail and memmove()?
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 10:35:03AM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 10:22 AM Shakeel Butt wrote:
> >
> > On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds wrote:
> > >
> > > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei wrote:
> > > >
> > > > git bisect shows the first bad commit:
> > > >
> > > >   [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab:
> > > >   use a single set of kmem_caches for all allocations
> > > >
> > > > And I have double checked that the above commit is really the first bad
> > > > commit for the list corruption issue of 'list_del corruption,
> > > > e1c241b00408->next is LIST_POISON1 (dead0100)',
> > >
> > > That commit doesn't revert cleanly, but I think that's purely because
> > > we'd also need to revert
> > >
> > >   849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
> > >   74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to (un)account_slab_page()")
> > >
> > > too.
> > >
> > > Can you verify that a
> > >
> > >   git revert 74d555bed5d0 849504809f86 10befea91b61
> > >
> > > on top of current -git makes things work for you again?
> > >
> > > I'm going to do an rc8 this release simply because we have another VM
> > > issue that I hope to get fixed - but there we know what the problem
> > > and the fix _is_, it just needs some care.
> > >
> > > So if Roman (or somebody else) can see what's wrong and we can fix
> > > this quickly, we don't need to go down the revert path, but ..
> >
> > I think I have a theory. The issue is happening due to the potential
> > infinite recursion:
> >
> > [ 5060.124412] ___cache_free+0x488/0x6b0
> > *Second recursion
> > [ 5060.128666] kfree+0xc9/0x1d0
> > [ 5060.131947] kmem_freepages+0xa0/0xf0
> > [ 5060.135746] slab_destroy+0x19/0x50
> > [ 5060.139577] slabs_destroy+0x6d/0x90
> > [ 5060.143379] ___cache_free+0x4a3/0x6b0
> > *First recursion
> > [ 5060.147896] kfree+0xc9/0x1d0
> > [ 5060.151082] kmem_freepages+0xa0/0xf0
> > [ 5060.155121] slab_destroy+0x19/0x50
> > [ 5060.159028] slabs_destroy+0x6d/0x90
> > [ 5060.162920] ___cache_free+0x4a3/0x6b0
> > [ 5060.167097] kfree+0xc9/0x1d0
> >
> > ___cache_free() is calling cache_flusharray() to flush the local cpu
> > array_cache if the cache has more elements than the limit (ac->avail
> > >= ac->limit).
> >
> > cache_flusharray() is removing batchcount number of elements from the
> > local cpu array_cache and passing them to slabs_destroy() (if the node
> > shared cache is also full).
> >
> > Note that we have not updated the local cpu array_cache size yet and
> > have called slabs_destroy(), which can call kfree() through
> > unaccount_slab_page().
> >
> > We are on the same CPU and this recursive kfree again checks
> > (ac->avail >= ac->limit) and calls cache_flusharray() again and
> > recurses indefinitely.

It's a cool theory! And it explains why we haven't seen it with SLUB.

> I can see two possible fixes. We can either do async kfree of
> page_obj_cgroups(page) or we can update the local cpu array_cache's
> size before slabs_destroy().

I wonder if something like this can fix the problem?
(completely untested).

--

diff --git a/mm/slab.c b/mm/slab.c
index 684ebe5b0c7a..c94b9ccfb803 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -186,6 +186,7 @@ struct array_cache {
 	unsigned int limit;
 	unsigned int batchcount;
 	unsigned int touched;
+	bool flushing;
 	void *entry[];	/*
 			 * Must have this definition in here for the proper
 			 * alignment of array_cache. Also simplifies accessing
@@ -526,6 +527,7 @@ static void init_arraycache(struct array_cache *ac, int limit, int batch)
 		ac->limit = limit;
 		ac->batchcount = batch;
 		ac->touched = 0;
+		ac->flushing = false;
 	}
 }

@@ -3368,6 +3370,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
 	int node = numa_mem_id();
 	LIST_HEAD(list);

+	if (ac->flushing)
+		return;
+
+	ac->flushing = true;
+
 	batchcount = ac->batchcount;

 	check_irq_off();
@@ -3404,6 +3411,7 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
 	spin_unlock(&n->list_lock);
 	slabs_destroy(cachep, &list);
 	ac->avail -= batchcount;
+	ac->flushing = false;
 	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
 }
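[Editor's note: the reentrancy-guard idea in the diff above can be sketched in userspace. This is NOT kernel code; the toy_* names and the fixed-size array are invented here, and the nested flush is triggered unconditionally rather than by the real ac->avail >= ac->limit check.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Toy model of the guard approach: keep the existing "destroy before
 * updating ac->avail" ordering, but make a nested flush a no-op while
 * a flush is already in progress.
 */
#define TOY_LIMIT 8

struct toy_ac {
	int avail;
	int batchcount;
	bool flushing;		/* mirrors the proposed ac->flushing */
	int entry[TOY_LIMIT];
};

static int toy_free_count[64];	/* times each object id was "freed" */

static void toy_flusharray_guarded(struct toy_ac *ac);

/* Freeing a batch triggers a reentrant flush, like a nested kfree() */
static void toy_slabs_destroy(struct toy_ac *ac, const int *batch, int n)
{
	for (int i = 0; i < n; i++)
		toy_free_count[batch[i]]++;
	toy_flusharray_guarded(ac);
}

static void toy_flusharray_guarded(struct toy_ac *ac)
{
	int batch[TOY_LIMIT];
	int n;

	if (ac->flushing)	/* nested call bails out instead of recursing */
		return;
	ac->flushing = true;

	n = ac->batchcount < ac->avail ? ac->batchcount : ac->avail;
	memcpy(batch, ac->entry, n * sizeof(int));

	/* original ordering: destroy happens before the state update */
	toy_slabs_destroy(ac, batch, n);
	ac->avail -= n;
	memmove(ac->entry, &ac->entry[n], ac->avail * sizeof(int));

	ac->flushing = false;
}
```

The guard does stop the recursion, but as the follow-up in this thread points out, silently skipping the nested flush leaves ___cache_free() without the space its __free_once() expects, which is why the reordering fix was preferred.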
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 10:22 AM Shakeel Butt wrote:
>
> On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds wrote:
> >
> > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei wrote:
> > >
> > > git bisect shows the first bad commit:
> > >
> > >   [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab:
> > >   use a single set of kmem_caches for all allocations
> > >
> > > And I have double checked that the above commit is really the first bad
> > > commit for the list corruption issue of 'list_del corruption,
> > > e1c241b00408->next is LIST_POISON1 (dead0100)',
> >
> > That commit doesn't revert cleanly, but I think that's purely because
> > we'd also need to revert
> >
> >   849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
> >   74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to (un)account_slab_page()")
> >
> > too.
> >
> > Can you verify that a
> >
> >   git revert 74d555bed5d0 849504809f86 10befea91b61
> >
> > on top of current -git makes things work for you again?
> >
> > I'm going to do an rc8 this release simply because we have another VM
> > issue that I hope to get fixed - but there we know what the problem
> > and the fix _is_, it just needs some care.
> >
> > So if Roman (or somebody else) can see what's wrong and we can fix
> > this quickly, we don't need to go down the revert path, but ..
>
> I think I have a theory. The issue is happening due to the potential
> infinite recursion:
>
> [ 5060.124412] ___cache_free+0x488/0x6b0
> *Second recursion
> [ 5060.128666] kfree+0xc9/0x1d0
> [ 5060.131947] kmem_freepages+0xa0/0xf0
> [ 5060.135746] slab_destroy+0x19/0x50
> [ 5060.139577] slabs_destroy+0x6d/0x90
> [ 5060.143379] ___cache_free+0x4a3/0x6b0
> *First recursion
> [ 5060.147896] kfree+0xc9/0x1d0
> [ 5060.151082] kmem_freepages+0xa0/0xf0
> [ 5060.155121] slab_destroy+0x19/0x50
> [ 5060.159028] slabs_destroy+0x6d/0x90
> [ 5060.162920] ___cache_free+0x4a3/0x6b0
> [ 5060.167097] kfree+0xc9/0x1d0
>
> ___cache_free() is calling cache_flusharray() to flush the local cpu
> array_cache if the cache has more elements than the limit (ac->avail
> >= ac->limit).
>
> cache_flusharray() is removing batchcount number of elements from the
> local cpu array_cache and passing them to slabs_destroy() (if the node
> shared cache is also full).
>
> Note that we have not updated the local cpu array_cache size yet and
> have called slabs_destroy(), which can call kfree() through
> unaccount_slab_page().
>
> We are on the same CPU and this recursive kfree again checks
> (ac->avail >= ac->limit) and calls cache_flusharray() again and
> recurses indefinitely.

I can see two possible fixes. We can either do async kfree of
page_obj_cgroups(page) or we can update the local cpu array_cache's
size before slabs_destroy().

Shakeel
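[Editor's note: the stale-state scenario quoted above can be demonstrated with a small userspace sketch. This is NOT kernel code; the toy_* names and the fixed-size array are invented here, and a one-shot flag stands in for the real (ac->avail >= ac->limit) recursion trigger.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Toy model of the problematic ordering: toy_slabs_destroy() runs
 * while ac->avail still counts the objects being flushed, so a nested
 * free re-reads the stale state and frees the same entries again.
 */
#define TOY_LIMIT 8

struct toy_ac {
	int avail;
	int batchcount;
	int entry[TOY_LIMIT];
};

static int toy_free_count[64];	/* times each object id was "freed" */
static bool toy_reentered;	/* permit exactly one nested flush */

static void toy_flusharray_buggy(struct toy_ac *ac);

/* Freeing a batch may itself flush, like recursive kfree() */
static void toy_slabs_destroy(struct toy_ac *ac, const int *batch, int n)
{
	for (int i = 0; i < n; i++)
		toy_free_count[batch[i]]++;
	if (!toy_reentered) {
		toy_reentered = true;
		toy_flusharray_buggy(ac);	/* nested flush sees stale ac */
	}
}

static void toy_flusharray_buggy(struct toy_ac *ac)
{
	int batch[TOY_LIMIT];
	int n = ac->batchcount < ac->avail ? ac->batchcount : ac->avail;

	memcpy(batch, ac->entry, n * sizeof(int));

	/* Buggy ordering: destroy while ac->avail/entry[] are still stale */
	toy_slabs_destroy(ac, batch, n);

	ac->avail -= n;
	memmove(ac->entry, &ac->entry[n], ac->avail * sizeof(int));
}
```

With four cached objects and a batchcount of two, the nested flush copies the same first two entries again, so those objects are "freed" twice: the double-free signature.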
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 09:47:43AM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 9:32 AM Shakeel Butt wrote:
> >
> > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei wrote:
> > >
> > > On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote:
> > > > On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote:
> > > > > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > > > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > > > > > that one of the stack traces had virtio balloon in the trace, I
> > > > > > realized that when I switched the GCE VM type from e1-standard-2 to
> > > > > > n1-standard-2 (where e1 VM's are cheaper because they use
> > > > > > virtio-balloon to better manage host OS memory utilization), the
> > > > > > problem has become much, *much* rarer (and possibly has gone away,
> > > > > > although I'm going to want to run a lot more tests before I say
> > > > > > that conclusively) on my test setup. At the very least, using an
> > > > > > n1 VM (which doesn't have virtio-balloon enabled in the
> > > > > > hypervisor) is enough to unblock ext4 development.
> > > > >
> > > > > and I spoke too soon. A number of runs using -rc6 are now failing
> > > > > even with the n1-standard-2 VM, so virtio-balloon may not be an
> > > > > indicator.
> > > > >
> > > > > This is why debugging this is frustrating; it is very much a
> > > > > heisenbug --- although 5.8 seems to work completely reliably, as do
> > > > > commits before 37f4a24c2469. Anything after that point will show
> > > > > random failures. :-(
> > > >
> > > > It does not make sense to mention 37f4a24c2469, which is reverted in
> > > > 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and
> > > > re-committed as 568f27006577.
> > > >
> > > > However, I can _not_ reproduce the issue by running the same test on
> > > > a kernel built from 568f27006577 directly.
> > > >
> > > > Also you have confirmed that the issue can't be fixed after reverting
> > > > 568f27006577 against v5.9-rc4.
> > > >
> > > > Looks like the real issue (slab list corruption) should be introduced
> > > > between 568f27006577 and v5.9-rc4.
> > >
> > > git bisect shows the first bad commit:
> > >
> > >   [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab:
> > >   use a single set of kmem_caches for all allocations
> > >
> > > And I have double checked that the above commit is really the first bad
> > > commit for the list corruption issue of 'list_del corruption,
> > > e1c241b00408->next is LIST_POISON1 (dead0100)', see the detailed stack
> > > trace and kernel oops log in the following link:
> > >
> > > https://lore.kernel.org/lkml/20200916202026.gc38...@mit.edu/
> >
> > The failure signature is similar to
> > https://lore.kernel.org/lkml/20200901075321.GL4299@shao2-debian/
> >
> > >
> > > And the kernel config is the one (without KASAN) used by Theodore in
> > > the GCE VM, see the following link:
> > >
> > > https://lore.kernel.org/lkml/20200917143012.gf38...@mit.edu/
> > >
> > > The reproducer is xfstests generic/038. In my setting, the test device
> > > is virtio-scsi, and the scratch device is virtio-blk.
>
> Is it possible to check SLUB as well to confirm that the issue is only
> happening on SLAB?

Can you also please check whether passing cgroup.memory=nokmem as a boot
argument fixes the issue?

Thanks!
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds wrote: > > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei wrote: > > > > git bisect shows the first bad commit: > > > > [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a > > single set of > > kmem_caches for all allocations > > > > And I have double checked that the above commit is really the first bad > > commit for the list corruption issue of 'list_del corruption, > > e1c241b00408->next > > is LIST_POISON1 (dead0100)', > > Thet commit doesn't revert cleanly, but I think that's purely because > we'd also need to revert > > 849504809f86 ("mm: memcg/slab: remove unused argument by > charge_slab_page()") > 74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to > (un)account_slab_page()") > > too. > > Can you verify that a > > git revert 74d555bed5d0 849504809f86 10befea91b61 > > on top of current -git makes things work for you again? > > I'm going to do an rc8 this release simply because we have another VM > issue that I hope to get fixed - but there we know what the problem > and the fix _is_, it just needs some care. > > So if Roman (or somebody else) can see what's wrong and we can fix > this quickly, we don't need to go down the revert path, but .. > I think I have a theory. The issue is happening due to the potential infinite recursion: [ 5060.124412] ___cache_free+0x488/0x6b0 *Second recursion [ 5060.128666] kfree+0xc9/0x1d0 [ 5060.131947] kmem_freepages+0xa0/0xf0 [ 5060.135746] slab_destroy+0x19/0x50 [ 5060.139577] slabs_destroy+0x6d/0x90 [ 5060.143379] ___cache_free+0x4a3/0x6b0 *First recursion [ 5060.147896] kfree+0xc9/0x1d0 [ 5060.151082] kmem_freepages+0xa0/0xf0 [ 5060.155121] slab_destroy+0x19/0x50 [ 5060.159028] slabs_destroy+0x6d/0x90 [ 5060.162920] ___cache_free+0x4a3/0x6b0 [ 5060.167097] kfree+0xc9/0x1d0 ___cache_free() is calling cache_flusharray() to flush the local cpu array_cache if the cache has more elements than the limit (ac->avail >= ac->limit). 
cache_flusharray() removes batchcount elements from the local cpu array_cache and passes them to slabs_destroy() (if the node shared cache is also full). Note that we have not yet updated the local cpu array_cache size when slabs_destroy() is called, and slabs_destroy() can call kfree() through unaccount_slab_page(). We are on the same CPU, so this recursive kfree() checks (ac->avail >= ac->limit) again, calls cache_flusharray() again, and recurses indefinitely.
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 9:19 AM Ming Lei wrote: > > git bisect shows the first bad commit: > > [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a > single set of > kmem_caches for all allocations > > And I have double checked that the above commit is really the first bad > commit for the list corruption issue of 'list_del corruption, > e1c241b00408->next > is LIST_POISON1 (dead0100)', That commit doesn't revert cleanly, but I think that's purely because we'd also need to revert 849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()") 74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to (un)account_slab_page()") too. Can you verify that a git revert 74d555bed5d0 849504809f86 10befea91b61 on top of current -git makes things work for you again? I'm going to do an rc8 this release simply because we have another VM issue that I hope to get fixed - but there we know what the problem and the fix _is_, it just needs some care. So if Roman (or somebody else) can see what's wrong and we can fix this quickly, we don't need to go down the revert path, but .. Linus
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 9:32 AM Shakeel Butt wrote: > > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei wrote: > > > > On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote: > > > On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote: > > > > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote: > > > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing > > > > > that one of the stack traces had virtio balloon in the trace, I > > > > > realized that when I switched the GCE VM type from e1-standard-2 to > > > > > n1-standard-2 (where e1 VM's are cheaper because they use > > > > > virtio-balloon to better manage host OS memory utilization), problem > > > > > has become, much, *much* rarer (and possibly has gone away, although > > > > > I'm going to want to run a lot more tests before I say that > > > > > conclusively) on my test setup. At the very least, using an n1 VM > > > > > (which doesn't have virtio-balloon enabled in the hypervisor) is > > > > > enough to unblock ext4 development. > > > > > > > > and I spoke too soon. A number of runs using -rc6 are now > > > > failing even with the n1-standard-2 VM, so virtio-ballon may not be an > > > > indicator. > > > > > > > > This is why debugging this is frustrating; it is very much a heisenbug > > > > --- although 5.8 seems to work completely reliably, as does commits > > > > before 37f4a24c2469. Anything after that point will show random > > > > failures. :-( > > > > > > It does not make sense to mention 37f4a24c2469, which is reverted in > > > 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and re-commited > > > as 568f27006577. > > > > > > However, I can _not_ reproduce the issue by running the same test on > > > kernel built from 568f27006577 directly. > > > > > > Also you have confirmed that the issue can't be fixed after reverting > > > 568f27006577 against v5.9-rc4. 
> > > > > > Looks the real issue(slab list corruption) should be introduced between > > > 568f27006577 and v5.9-rc4. > > > > git bisect shows the first bad commit: > > > > [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a > > single set of > > kmem_caches for all allocations > > > > And I have double checked that the above commit is really the first bad > > commit for the list corruption issue of 'list_del corruption, > > e1c241b00408->next > > is LIST_POISON1 (dead0100)', see the detailed stack trace and > > kernel oops log in the following link: > > > > https://lore.kernel.org/lkml/20200916202026.gc38...@mit.edu/ > > The failure signature is similar to > https://lore.kernel.org/lkml/20200901075321.GL4299@shao2-debian/ > > > > > And the kernel config is the one(without KASAN) used by Theodore in GCE VM, > > see > > the following link: > > > > https://lore.kernel.org/lkml/20200917143012.gf38...@mit.edu/ > > > > The reproducer is xfstests generic/038. In my setting, test device is > > virtio-scsi, and > > scratch device is virtio-blk. Is it possible to check SLUB as well to confirm that the issue is only happening on SLAB?
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 9:19 AM Ming Lei wrote: > > On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote: > > On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote: > > > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote: > > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing > > > > that one of the stack traces had virtio balloon in the trace, I > > > > realized that when I switched the GCE VM type from e1-standard-2 to > > > > n1-standard-2 (where e1 VM's are cheaper because they use > > > > virtio-balloon to better manage host OS memory utilization), problem > > > > has become, much, *much* rarer (and possibly has gone away, although > > > > I'm going to want to run a lot more tests before I say that > > > > conclusively) on my test setup. At the very least, using an n1 VM > > > > (which doesn't have virtio-balloon enabled in the hypervisor) is > > > > enough to unblock ext4 development. > > > > > > and I spoke too soon. A number of runs using -rc6 are now > > > failing even with the n1-standard-2 VM, so virtio-ballon may not be an > > > indicator. > > > > > > This is why debugging this is frustrating; it is very much a heisenbug > > > --- although 5.8 seems to work completely reliably, as does commits > > > before 37f4a24c2469. Anything after that point will show random > > > failures. :-( > > > > It does not make sense to mention 37f4a24c2469, which is reverted in > > 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and re-commited > > as 568f27006577. > > > > However, I can _not_ reproduce the issue by running the same test on > > kernel built from 568f27006577 directly. > > > > Also you have confirmed that the issue can't be fixed after reverting > > 568f27006577 against v5.9-rc4. > > > > Looks the real issue(slab list corruption) should be introduced between > > 568f27006577 and v5.9-rc4. 
> > git bisect shows the first bad commit: > > [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a > single set of > kmem_caches for all allocations > > And I have double checked that the above commit is really the first bad > commit for the list corruption issue of 'list_del corruption, > e1c241b00408->next > is LIST_POISON1 (dead0100)', see the detailed stack trace and > kernel oops log in the following link: > > https://lore.kernel.org/lkml/20200916202026.gc38...@mit.edu/ The failure signature is similar to https://lore.kernel.org/lkml/20200901075321.GL4299@shao2-debian/ > > And the kernel config is the one(without KASAN) used by Theodore in GCE VM, > see > the following link: > > https://lore.kernel.org/lkml/20200917143012.gf38...@mit.edu/ > > The reproducer is xfstests generic/038. In my setting, test device is > virtio-scsi, and > scratch device is virtio-blk. > > > [1] git bisect log: > > git bisect start > # good: [568f2700657794b8258e72983ba508793a658942] blk-mq: centralise related > handling into blk_mq_get_driver_tag > git bisect good 568f2700657794b8258e72983ba508793a658942 > # bad: [f4d51dffc6c01a9e94650d95ce0104964f8ae822] Linux 5.9-rc4 > git bisect bad f4d51dffc6c01a9e94650d95ce0104964f8ae822 > # good: [8186749621ed6b8fc42644c399e8c755a2b6f630] Merge tag > 'drm-next-2020-08-06' of git://anongit.freedesktop.org/drm/drm > git bisect good 8186749621ed6b8fc42644c399e8c755a2b6f630 > # good: [60e76bb8a4e4c5398ea9053535e1fd0c9d6bb06e] Merge tag > 'm68knommu-for-v5.9' of > git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu > git bisect good 60e76bb8a4e4c5398ea9053535e1fd0c9d6bb06e > # bad: [57b077939287835b9396a1c3b40d35609cf2fcb8] Merge tag 'for_linus' of > git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost > git bisect bad 57b077939287835b9396a1c3b40d35609cf2fcb8 > # bad: [0f43283be7fec4a76cd4ed50dc37db30719bde05] Merge branch 'work.fdpic' > of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs > git bisect bad 
0f43283be7fec4a76cd4ed50dc37db30719bde05 > # good: [5631c5e0eb9035d92ceb20fcd9cdb7779a3f5cc7] Merge tag > 'xfs-5.9-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux > git bisect good 5631c5e0eb9035d92ceb20fcd9cdb7779a3f5cc7 > # good: [e3083c3f369700cd1eec3de93b8d8ec0918ff778] media: cafe-driver: use > generic power management > git bisect good e3083c3f369700cd1eec3de93b8d8ec0918ff778 > # bad: [42742d9bde2a8e11ec932cb5821f720a40a7c2a9] mm: thp: replace HTTP links > with HTTPS ones > git bisect bad 42742d9bde2a8e11ec932cb5821f720a40a7c2a9 > # bad: [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a > single set of kmem_caches for all allocations > git bisect bad 10befea91b61c4e2c2d1df06a2e978d182fcf792 > # good: [cfbe1636c3585c1e032bfac512fb8be903fbc913] mm, kcsan: instrument > SLAB/SLUB free with "ASSERT_EXCLUSIVE_ACCESS" > git bisect good cfbe1636c3585c1e032bfac512fb8be903fbc913 > # good: [0f190a7ab78878f9e6c6930fc0f5f92c1250b57d] mm/page_io.c: use > blk_io_schedule() for avoiding task hung in sync io > git
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote: > On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote: > > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote: > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing > > > that one of the stack traces had virtio balloon in the trace, I > > > realized that when I switched the GCE VM type from e1-standard-2 to > > > n1-standard-2 (where e1 VM's are cheaper because they use > > > virtio-balloon to better manage host OS memory utilization), problem > > > has become, much, *much* rarer (and possibly has gone away, although > > > I'm going to want to run a lot more tests before I say that > > > conclusively) on my test setup. At the very least, using an n1 VM > > > (which doesn't have virtio-balloon enabled in the hypervisor) is > > > enough to unblock ext4 development. > > > > and I spoke too soon. A number of runs using -rc6 are now > > failing even with the n1-standard-2 VM, so virtio-ballon may not be an > > indicator. > > > > This is why debugging this is frustrating; it is very much a heisenbug > > --- although 5.8 seems to work completely reliably, as does commits > > before 37f4a24c2469. Anything after that point will show random > > failures. :-( > > It does not make sense to mention 37f4a24c2469, which is reverted in > 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and re-commited > as 568f27006577. > > However, I can _not_ reproduce the issue by running the same test on > kernel built from 568f27006577 directly. > > Also you have confirmed that the issue can't be fixed after reverting > 568f27006577 against v5.9-rc4. > > Looks the real issue(slab list corruption) should be introduced between > 568f27006577 and v5.9-rc4. 
git bisect shows the first bad commit: [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of kmem_caches for all allocations And I have double checked that the above commit is really the first bad commit for the list corruption issue of 'list_del corruption, e1c241b00408->next is LIST_POISON1 (dead0100)', see the detailed stack trace and kernel oops log in the following link: https://lore.kernel.org/lkml/20200916202026.gc38...@mit.edu/ And the kernel config is the one(without KASAN) used by Theodore in GCE VM, see the following link: https://lore.kernel.org/lkml/20200917143012.gf38...@mit.edu/ The reproducer is xfstests generic/038. In my setting, test device is virtio-scsi, and scratch device is virtio-blk. [1] git bisect log: git bisect start # good: [568f2700657794b8258e72983ba508793a658942] blk-mq: centralise related handling into blk_mq_get_driver_tag git bisect good 568f2700657794b8258e72983ba508793a658942 # bad: [f4d51dffc6c01a9e94650d95ce0104964f8ae822] Linux 5.9-rc4 git bisect bad f4d51dffc6c01a9e94650d95ce0104964f8ae822 # good: [8186749621ed6b8fc42644c399e8c755a2b6f630] Merge tag 'drm-next-2020-08-06' of git://anongit.freedesktop.org/drm/drm git bisect good 8186749621ed6b8fc42644c399e8c755a2b6f630 # good: [60e76bb8a4e4c5398ea9053535e1fd0c9d6bb06e] Merge tag 'm68knommu-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu git bisect good 60e76bb8a4e4c5398ea9053535e1fd0c9d6bb06e # bad: [57b077939287835b9396a1c3b40d35609cf2fcb8] Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost git bisect bad 57b077939287835b9396a1c3b40d35609cf2fcb8 # bad: [0f43283be7fec4a76cd4ed50dc37db30719bde05] Merge branch 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs git bisect bad 0f43283be7fec4a76cd4ed50dc37db30719bde05 # good: [5631c5e0eb9035d92ceb20fcd9cdb7779a3f5cc7] Merge tag 'xfs-5.9-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux git bisect good 
5631c5e0eb9035d92ceb20fcd9cdb7779a3f5cc7 # good: [e3083c3f369700cd1eec3de93b8d8ec0918ff778] media: cafe-driver: use generic power management git bisect good e3083c3f369700cd1eec3de93b8d8ec0918ff778 # bad: [42742d9bde2a8e11ec932cb5821f720a40a7c2a9] mm: thp: replace HTTP links with HTTPS ones git bisect bad 42742d9bde2a8e11ec932cb5821f720a40a7c2a9 # bad: [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of kmem_caches for all allocations git bisect bad 10befea91b61c4e2c2d1df06a2e978d182fcf792 # good: [cfbe1636c3585c1e032bfac512fb8be903fbc913] mm, kcsan: instrument SLAB/SLUB free with "ASSERT_EXCLUSIVE_ACCESS" git bisect good cfbe1636c3585c1e032bfac512fb8be903fbc913 # good: [0f190a7ab78878f9e6c6930fc0f5f92c1250b57d] mm/page_io.c: use blk_io_schedule() for avoiding task hung in sync io git bisect good 0f190a7ab78878f9e6c6930fc0f5f92c1250b57d # good: [286e04b8ed7a04279ae277f0f024430246ea5eec] mm: memcg/slab: allocate obj_cgroups for non-root slab pages git bisect good 286e04b8ed7a04279ae277f0f024430246ea5eec # good: [9855609bde03e2472b99a95e869d29ee1e78a751] mm: memcg/slab: use a single set of kmem_caches for all accounted
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote: > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote: > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing > > that one of the stack traces had virtio balloon in the trace, I > > realized that when I switched the GCE VM type from e1-standard-2 to > > n1-standard-2 (where e1 VM's are cheaper because they use > > virtio-balloon to better manage host OS memory utilization), problem > > has become, much, *much* rarer (and possibly has gone away, although > > I'm going to want to run a lot more tests before I say that > > conclusively) on my test setup. At the very least, using an n1 VM > > (which doesn't have virtio-balloon enabled in the hypervisor) is > > enough to unblock ext4 development. > > and I spoke too soon. A number of runs using -rc6 are now > failing even with the n1-standard-2 VM, so virtio-balloon may not be an > indicator. > > This is why debugging this is frustrating; it is very much a heisenbug > --- although 5.8 seems to work completely reliably, as do commits > before 37f4a24c2469. Anything after that point will show random > failures. :-( It does not make sense to mention 37f4a24c2469, which was reverted in 4e2f62e566b5. Later the patch in 37f4a24c2469 was fixed and re-committed as 568f27006577. However, I can _not_ reproduce the issue by running the same test on a kernel built from 568f27006577 directly. Also, you have confirmed that the issue can't be fixed by reverting 568f27006577 from v5.9-rc4. It looks like the real issue (slab list corruption) was introduced between 568f27006577 and v5.9-rc4. thanks, Ming
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Fri, Sep 25, 2020 at 09:14:16AM +0800, Ming Lei wrote: > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote: > > On Thu, Sep 24, 2020 at 08:59:01AM +0800, Ming Lei wrote: > > > > > > The list corruption issue can be reproduced on kvm/qumu guest too when > > > running xfstests(ext4) generic/038. > > > > > > However, the issue may become not reproduced when adding or removing > > > memory > > > debug options, such as adding KASAN. > > > > Can you reliably reprodue this crash? And if so, with what config and > > what kernel version. > > Yeah, it can be reproduced reliably by running xfstests(ext4) > generic/038 over virtio-scsi(test device)/virtio-blk(scratch device). > > The kernel is -rc4, and not test -rc5 yet. > > It is exactly the config you provided, and I just enabled CDROM & ISOFS > against your config for passing cloud-init data via cdrom to VM. > > > One of the reasons why I had gone silent on this bug is that I've been > > trying many, many configurations and configurations which reliably > > reproduced on say, -rc4 would not reproduce on -rc5, and then I would > > get a completely different set of results on -rc6. So I've been > > trying to run a lot of different experiments to try to understand what > > might be going on, since it seems pretty clear this must be a very > > timing-sensitive failure. > > > > I also found that the re-occrance went down significantly if I enabled > > KASAN, and while it didn't go away, I wasn't able to get a KASAN > > failure to trigger, either. Turning off CONFIG_PROVE_LOCKING and a > > *lot* of other debugging configs made the problem vanish in -rc4, but > > that trick didn't work with -rc5 or -rc6. > > The issue can be reproduced reliably in my environment after > disabling LOCKDEP only for disabling KMEMLEAK only. > > But after disabling DEBUG_OBJECTS, it becomes hard to trigger. 
> > > > > Each time I discovered one of these things, I was about to post to the > > e-mail thread, only to have a confirmation test run on a different > > kernel version make the problem go away. In particular, your revert > > helped with -rc4 and -rc6 IIRC, but it didn't help in -rc5. > > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing > > that one of the stack traces had virtio balloon in the trace, I > > realized that when I switched the GCE VM type from e1-standard-2 to > > n1-standard-2 (where e1 VM's are cheaper because they use > > virtio-balloon to better manage host OS memory utilization), problem > > has become, much, *much* rarer (and possibly has gone away, although > > I'm going to want to run a lot more tests before I say that > > conclusively) on my test setup. At the very least, using an n1 VM > > (which doesn't have virtio-balloon enabled in the hypervisor) is > > enough to unblock ext4 development. > > > > Any chance your kvm/qemu configuration might have been using > > virtio-ballon? Because other ext4 developers who have been using > > kvm-xftests have not had any problems > > I don't setup virtio-ballon, see the attached script for setting the VM. 
> > > > > > When I enable PAGE_POISONING, double free on kmalloc(192) is captured: > > > > > > [ 1198.317139] slab: double free detected in cache 'kmalloc-192', objp > > > 89ada7584300^M > > > [ 1198.326651] [ cut here ]^M > > > [ 1198.327969] kernel BUG at mm/slab.c:2535!^M > > > [ 1198.329129] invalid opcode: [#1] SMP PTI^M > > > [ 1198.333776] CPU: 1 PID: 0 Comm: swapper/1 Not tainted > > > 5.9.0-rc4_quiesce_srcu-xfstests #102^M > > > [ 1198.336085] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > > > rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014^M > > > [ 1198.339826] RIP: 0010:free_block.cold.92+0x13/0x15^M > > > [ 1198.341472] Code: 8d 44 05 f0 eb d0 48 63 83 e0 00 00 00 48 8d 54 05 > > > f8 e9 4b 81 ff ff 48 8b 73 58 48 89 ea 48 c7 c7 98 e7 4a 9c e8 20 c3 eb > > > ff <0f> 0b 48 8b 73 58 48 c7 c2 20 e8 4a 9c 48 c7 c7 70 32 22 9c e8 19^M > > > [ 1198.347331] RSP: 0018:982e40710be8 EFLAGS: 00010046^M > > > [ 1198.349091] RAX: 0048 RBX: 89adb6441400 RCX: > > > ^M > > > [ 1198.351839] RDX: RSI: 89adbaa97800 RDI: > > > 89adbaa97800^M > > > [ 1198.354572] RBP: 89ada7584300 R08: 0417 R09: > > > 0057^M > > > [ 1198.357150] R10: 0001 R11: 982e40710aa5 R12: > > > 89adbaaae598^M > > > [ 1198.359067] R13: e7bc819d6108 R14: e7bc819d6100 R15: > > > 89adb6442280^M > > > [ 1198.360975] FS: () GS:89adbaa8() > > > knlGS:^M > > > [ 1198.363202] CS: 0010 DS: ES: CR0: 80050033^M > > > [ 1198.365986] CR2: 55f6a3811318 CR3: 00017adca005 CR4: > > > 00770ee0^M > > > [ 1198.368679] DR0: DR1: DR2: > > > ^M > > > [ 1198.371386] DR3: DR6:
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote: > On Thu, Sep 24, 2020 at 08:59:01AM +0800, Ming Lei wrote: > > > > The list corruption issue can be reproduced on a kvm/qemu guest too when > > running xfstests(ext4) generic/038. > > > > However, the issue may no longer reproduce when adding or removing memory > > debug options, such as KASAN. > > Can you reliably reproduce this crash? And if so, with what config and > what kernel version? Yeah, it can be reproduced reliably by running xfstests(ext4) generic/038 over virtio-scsi(test device)/virtio-blk(scratch device). The kernel is -rc4; I have not tested -rc5 yet. It is exactly the config you provided; I just enabled CDROM & ISOFS on top of your config for passing cloud-init data via cdrom to the VM. > One of the reasons why I had gone silent on this bug is that I've been > trying many, many configurations and configurations which reliably > reproduced on say, -rc4 would not reproduce on -rc5, and then I would > get a completely different set of results on -rc6. So I've been > trying to run a lot of different experiments to try to understand what > might be going on, since it seems pretty clear this must be a very > timing-sensitive failure. > > I also found that the recurrence went down significantly if I enabled > KASAN, and while it didn't go away, I wasn't able to get a KASAN > failure to trigger, either. Turning off CONFIG_PROVE_LOCKING and a > *lot* of other debugging configs made the problem vanish in -rc4, but > that trick didn't work with -rc5 or -rc6. The issue can be reproduced reliably in my environment after disabling only LOCKDEP or only KMEMLEAK. But after disabling DEBUG_OBJECTS, it becomes hard to trigger.
In particular, your revert > helped with -rc4 and -rc6 IIRC, but it didn't help in -rc5. > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing > that one of the stack traces had virtio balloon in the trace, I > realized that when I switched the GCE VM type from e1-standard-2 to > n1-standard-2 (where e1 VM's are cheaper because they use > virtio-balloon to better manage host OS memory utilization), problem > has become, much, *much* rarer (and possibly has gone away, although > I'm going to want to run a lot more tests before I say that > conclusively) on my test setup. At the very least, using an n1 VM > (which doesn't have virtio-balloon enabled in the hypervisor) is > enough to unblock ext4 development. > > Any chance your kvm/qemu configuration might have been using > virtio-ballon? Because other ext4 developers who have been using > kvm-xftests have not had any problems I don't setup virtio-ballon, see the attached script for setting the VM. > > > When I enable PAGE_POISONING, double free on kmalloc(192) is captured: > > > > [ 1198.317139] slab: double free detected in cache 'kmalloc-192', objp > > 89ada7584300^M > > [ 1198.326651] [ cut here ]^M > > [ 1198.327969] kernel BUG at mm/slab.c:2535!^M > > [ 1198.329129] invalid opcode: [#1] SMP PTI^M > > [ 1198.333776] CPU: 1 PID: 0 Comm: swapper/1 Not tainted > > 5.9.0-rc4_quiesce_srcu-xfstests #102^M > > [ 1198.336085] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > > rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014^M > > [ 1198.339826] RIP: 0010:free_block.cold.92+0x13/0x15^M > > [ 1198.341472] Code: 8d 44 05 f0 eb d0 48 63 83 e0 00 00 00 48 8d 54 05 f8 > > e9 4b 81 ff ff 48 8b 73 58 48 89 ea 48 c7 c7 98 e7 4a 9c e8 20 c3 eb ff > > <0f> 0b 48 8b 73 58 48 c7 c2 20 e8 4a 9c 48 c7 c7 70 32 22 9c e8 19^M > > [ 1198.347331] RSP: 0018:982e40710be8 EFLAGS: 00010046^M > > [ 1198.349091] RAX: 0048 RBX: 89adb6441400 RCX: > > ^M > > [ 1198.351839] RDX: RSI: 89adbaa97800 RDI: > > 
89adbaa97800^M > > [ 1198.354572] RBP: 89ada7584300 R08: 0417 R09: > > 0057^M > > [ 1198.357150] R10: 0001 R11: 982e40710aa5 R12: > > 89adbaaae598^M > > [ 1198.359067] R13: e7bc819d6108 R14: e7bc819d6100 R15: > > 89adb6442280^M > > [ 1198.360975] FS: () GS:89adbaa8() > > knlGS:^M > > [ 1198.363202] CS: 0010 DS: ES: CR0: 80050033^M > > [ 1198.365986] CR2: 55f6a3811318 CR3: 00017adca005 CR4: > > 00770ee0^M > > [ 1198.368679] DR0: DR1: DR2: > > ^M > > [ 1198.371386] DR3: DR6: fffe0ff0 DR7: > > 0400^M > > [ 1198.374203] PKRU: 5554^M > > [ 1198.375174] Call Trace:^M > > [ 1198.376165] ^M > > [ 1198.376908] ___cache_free+0x56d/0x770^M > > [ 1198.378355] ? kmem_freepages+0xa0/0xf0^M > > [ 1198.379814]
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote: > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing > that one of the stack traces had virtio balloon in the trace, I > realized that when I switched the GCE VM type from e1-standard-2 to > n1-standard-2 (where e1 VM's are cheaper because they use > virtio-balloon to better manage host OS memory utilization), problem > has become, much, *much* rarer (and possibly has gone away, although > I'm going to want to run a lot more tests before I say that > conclusively) on my test setup. At the very least, using an n1 VM > (which doesn't have virtio-balloon enabled in the hypervisor) is > enough to unblock ext4 development. And I spoke too soon. A number of runs using -rc6 are now failing even with the n1-standard-2 VM, so virtio-balloon may not be an indicator. This is why debugging this is frustrating; it is very much a heisenbug --- although 5.8 seems to work completely reliably, as do commits before 37f4a24c2469. Anything after that point will show random failures. :-( - Ted
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Thu, Sep 24, 2020 at 08:59:01AM +0800, Ming Lei wrote: > > The list corruption issue can be reproduced on a kvm/qemu guest too when > running xfstests(ext4) generic/038. > > However, the issue may no longer reproduce when adding or removing memory > debug options, such as KASAN. Can you reliably reproduce this crash? And if so, with what config and what kernel version? One of the reasons why I had gone silent on this bug is that I've been trying many, many configurations, and configurations which reliably reproduced on, say, -rc4 would not reproduce on -rc5, and then I would get a completely different set of results on -rc6. So I've been trying to run a lot of different experiments to try to understand what might be going on, since it seems pretty clear this must be a very timing-sensitive failure. I also found that the recurrence went down significantly if I enabled KASAN, and while it didn't go away, I wasn't able to get a KASAN failure to trigger, either. Turning off CONFIG_PROVE_LOCKING and a *lot* of other debugging configs made the problem vanish in -rc4, but that trick didn't work with -rc5 or -rc6. Each time I discovered one of these things, I was about to post to the e-mail thread, only to have a confirmation test run on a different kernel version make the problem go away. In particular, your revert helped with -rc4 and -rc6 IIRC, but it didn't help in -rc5. HOWEVER, thanks to a hint from a colleague at $WORK, and realizing that one of the stack traces had virtio balloon in the trace, I realized that when I switched the GCE VM type from e1-standard-2 to n1-standard-2 (where e1 VM's are cheaper because they use virtio-balloon to better manage host OS memory utilization), the problem has become much, *much* rarer (and possibly has gone away, although I'm going to want to run a lot more tests before I say that conclusively) on my test setup.
At the very least, using an n1 VM (which doesn't have virtio-balloon enabled in the hypervisor) is enough to unblock ext4 development. Any chance your kvm/qemu configuration might have been using virtio-ballon? Because other ext4 developers who have been using kvm-xftests have not had any problems > When I enable PAGE_POISONING, double free on kmalloc(192) is captured: > > [ 1198.317139] slab: double free detected in cache 'kmalloc-192', objp > 89ada7584300^M > [ 1198.326651] [ cut here ]^M > [ 1198.327969] kernel BUG at mm/slab.c:2535!^M > [ 1198.329129] invalid opcode: [#1] SMP PTI^M > [ 1198.333776] CPU: 1 PID: 0 Comm: swapper/1 Not tainted > 5.9.0-rc4_quiesce_srcu-xfstests #102^M > [ 1198.336085] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014^M > [ 1198.339826] RIP: 0010:free_block.cold.92+0x13/0x15^M > [ 1198.341472] Code: 8d 44 05 f0 eb d0 48 63 83 e0 00 00 00 48 8d 54 05 f8 e9 > 4b 81 ff ff 48 8b 73 58 48 89 ea 48 c7 c7 98 e7 4a 9c e8 20 c3 eb ff <0f> 0b > 48 8b 73 58 48 c7 c2 20 e8 4a 9c 48 c7 c7 70 32 22 9c e8 19^M > [ 1198.347331] RSP: 0018:982e40710be8 EFLAGS: 00010046^M > [ 1198.349091] RAX: 0048 RBX: 89adb6441400 RCX: > ^M > [ 1198.351839] RDX: RSI: 89adbaa97800 RDI: > 89adbaa97800^M > [ 1198.354572] RBP: 89ada7584300 R08: 0417 R09: > 0057^M > [ 1198.357150] R10: 0001 R11: 982e40710aa5 R12: > 89adbaaae598^M > [ 1198.359067] R13: e7bc819d6108 R14: e7bc819d6100 R15: > 89adb6442280^M > [ 1198.360975] FS: () GS:89adbaa8() > knlGS:^M > [ 1198.363202] CS: 0010 DS: ES: CR0: 80050033^M > [ 1198.365986] CR2: 55f6a3811318 CR3: 00017adca005 CR4: > 00770ee0^M > [ 1198.368679] DR0: DR1: DR2: > ^M > [ 1198.371386] DR3: DR6: fffe0ff0 DR7: > 0400^M > [ 1198.374203] PKRU: 5554^M > [ 1198.375174] Call Trace:^M > [ 1198.376165] ^M > [ 1198.376908] ___cache_free+0x56d/0x770^M > [ 1198.378355] ? 
kmem_freepages+0xa0/0xf0^M > [ 1198.379814] kfree+0x91/0x120^M > [ 1198.382121] kmem_freepages+0xa0/0xf0^M > [ 1198.383474] slab_destroy+0x9f/0x120^M > [ 1198.384779] slabs_destroy+0x6d/0x90^M > [ 1198.386110] ___cache_free+0x632/0x770^M > [ 1198.387547] ? kmem_freepages+0xa0/0xf0^M > [ 1198.389016] kfree+0x91/0x120^M > [ 1198.390160] kmem_freepages+0xa0/0xf0^M > [ 1198.391551] slab_destroy+0x9f/0x120^M > [ 1198.392964] slabs_destroy+0x6d/0x90^M > [ 1198.394439] ___cache_free+0x632/0x770^M > [ 1198.395896] kmem_cache_free.part.75+0x19/0x70^M > [ 1198.397791] rcu_core+0x1eb/0x6b0^M > [ 1198.399829] ? ktime_get+0x37/0xa0^M > [ 1198.401343] __do_softirq+0xdf/0x2c5^M > [ 1198.403010] asm_call_on_stack+0x12/0x20^M > [ 1198.404847] ^M > [ 1198.405799] do_softirq_own_stack+0x39/0x50^M > [ 1198.407621] irq_exit_rcu+0x97/0xa0^M > [
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Thu, Sep 17, 2020 at 10:30:12AM -0400, Theodore Y. Ts'o wrote: > On Thu, Sep 17, 2020 at 10:20:51AM +0800, Ming Lei wrote: > > > > Obviously there is other more serious issue, since 568f27006577 is > > completely reverted in your test, and you still see list corruption > > issue. > > > > So I'd suggest to find the big issue first. Once it is fixed, maybe > > everything becomes fine. > > ... > > Looks it is more like a memory corruption issue, is there any helpful log > > dumped when running kernel with kasan? > > Last night, I ran six VM's using -rc4 with and without KASAN; without > Kasan, half of them hung. With KASAN enabled, all of the test VM's > ran to completion. > > This strongly suggests whatever the problem is, it's timing related. > I'll run a larger set of test runs to see if this pattern is confirmed > today. > > > BTW, I have kvm/qumu auto test which runs blktest/xfstest over > > virtio-blk/virito-scsi/loop/nvme > > with xfs/ext4 every two days, and not see such failure recently, but the > > kernel config is based > > rhel8's config. > > Here is the configs I'm using, with and without KASAN. (With KASAN is > enabled is sent as a diff to avoid running into LKML's e-mail size > restrictrions.) >

The list corruption issue can be reproduced on a kvm/qemu guest too, when running xfstests (ext4) generic/038. However, the issue may stop reproducing when memory debug options are added or removed, such as adding KASAN.
When I enable PAGE_POISONING, double free on kmalloc(192) is captured: [ 1198.317139] slab: double free detected in cache 'kmalloc-192', objp 89ada7584300^M [ 1198.326651] [ cut here ]^M [ 1198.327969] kernel BUG at mm/slab.c:2535!^M [ 1198.329129] invalid opcode: [#1] SMP PTI^M [ 1198.333776] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc4_quiesce_srcu-xfstests #102^M [ 1198.336085] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014^M [ 1198.339826] RIP: 0010:free_block.cold.92+0x13/0x15^M [ 1198.341472] Code: 8d 44 05 f0 eb d0 48 63 83 e0 00 00 00 48 8d 54 05 f8 e9 4b 81 ff ff 48 8b 73 58 48 89 ea 48 c7 c7 98 e7 4a 9c e8 20 c3 eb ff <0f> 0b 48 8b 73 58 48 c7 c2 20 e8 4a 9c 48 c7 c7 70 32 22 9c e8 19^M [ 1198.347331] RSP: 0018:982e40710be8 EFLAGS: 00010046^M [ 1198.349091] RAX: 0048 RBX: 89adb6441400 RCX: ^M [ 1198.351839] RDX: RSI: 89adbaa97800 RDI: 89adbaa97800^M [ 1198.354572] RBP: 89ada7584300 R08: 0417 R09: 0057^M [ 1198.357150] R10: 0001 R11: 982e40710aa5 R12: 89adbaaae598^M [ 1198.359067] R13: e7bc819d6108 R14: e7bc819d6100 R15: 89adb6442280^M [ 1198.360975] FS: () GS:89adbaa8() knlGS:^M [ 1198.363202] CS: 0010 DS: ES: CR0: 80050033^M [ 1198.365986] CR2: 55f6a3811318 CR3: 00017adca005 CR4: 00770ee0^M [ 1198.368679] DR0: DR1: DR2: ^M [ 1198.371386] DR3: DR6: fffe0ff0 DR7: 0400^M [ 1198.374203] PKRU: 5554^M [ 1198.375174] Call Trace:^M [ 1198.376165] ^M [ 1198.376908] ___cache_free+0x56d/0x770^M [ 1198.378355] ? kmem_freepages+0xa0/0xf0^M [ 1198.379814] kfree+0x91/0x120^M [ 1198.382121] kmem_freepages+0xa0/0xf0^M [ 1198.383474] slab_destroy+0x9f/0x120^M [ 1198.384779] slabs_destroy+0x6d/0x90^M [ 1198.386110] ___cache_free+0x632/0x770^M [ 1198.387547] ? 
kmem_freepages+0xa0/0xf0^M [ 1198.389016] kfree+0x91/0x120^M [ 1198.390160] kmem_freepages+0xa0/0xf0^M [ 1198.391551] slab_destroy+0x9f/0x120^M [ 1198.392964] slabs_destroy+0x6d/0x90^M [ 1198.394439] ___cache_free+0x632/0x770^M [ 1198.395896] kmem_cache_free.part.75+0x19/0x70^M [ 1198.397791] rcu_core+0x1eb/0x6b0^M [ 1198.399829] ? ktime_get+0x37/0xa0^M [ 1198.401343] __do_softirq+0xdf/0x2c5^M [ 1198.403010] asm_call_on_stack+0x12/0x20^M [ 1198.404847] ^M [ 1198.405799] do_softirq_own_stack+0x39/0x50^M [ 1198.407621] irq_exit_rcu+0x97/0xa0^M [ 1198.409127] sysvec_apic_timer_interrupt+0x2c/0x80^M [ 1198.410608] asm_sysvec_apic_timer_interrupt+0x12/0x20^M [ 1198.411883] RIP: 0010:default_idle+0x13/0x20^M [ 1198.412994] Code: 89 44 24 20 48 83 c0 22 48 89 44 24 28 eb c7 e8 03 93 ff ff cc cc cc 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 11 ec 55 00 fb f4 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 65 48 8b 04^M [ 1198.418791] RSP: 0018:982e406abec8 EFLAGS: 0246^M [ 1198.420529] RAX: 9bca6ea0 RBX: 0001 RCX: 89aeafa69648^M [ 1198.422884] RDX: 89adb9c0 RSI: RDI: 0116f5b68899^M [ 1198.425283] RBP: 0001 R08: 012531639bb4 R09: ^M [ 1198.427804] R10: 89adb5e1f600 R11: R12: ^M [ 1198.430532] R13: R14: R15: ^M [
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Thu, Sep 17, 2020 at 10:30:12AM -0400, Theodore Y. Ts'o wrote: > On Thu, Sep 17, 2020 at 10:20:51AM +0800, Ming Lei wrote: > > > > Obviously there is other more serious issue, since 568f27006577 is > > completely reverted in your test, and you still see list corruption > > issue. > > > > So I'd suggest to find the big issue first. Once it is fixed, maybe > > everything becomes fine. > > ... > > Looks it is more like a memory corruption issue, is there any helpful log > > dumped when running kernel with kasan? > > Last night, I ran six VM's using -rc4 with and without KASAN; without > Kasan, half of them hung. With KASAN enabled, all of the test VM's > ran to completion.

From your last email, when you run -rc4 with the revert of 568f27006577, you can observe the list corruption easily. So can you enable KASAN on -rc4 with the revert of 568f27006577 and see if it makes a difference?

> > This strongly suggests whatever the problem is, it's timing related. > I'll run a larger set of test runs to see if this pattern is confirmed > today.

It looks like you have enabled lots of other debug options, such as lockdep, which impose a very heavy runtime load. Maybe you can disable all non-KASAN debug options (non-KASAN memory debug options, lockdep, ...), keep only the KASAN debugging, and see if you are lucky.

Thanks, Ming
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Thu, Sep 17, 2020 at 10:20:51AM +0800, Ming Lei wrote: > > Obviously there is other more serious issue, since 568f27006577 is > completely reverted in your test, and you still see list corruption > issue. > > So I'd suggest to find the big issue first. Once it is fixed, maybe > everything becomes fine. > ... > Looks it is more like a memory corruption issue, is there any helpful log > dumped when running kernel with kasan?

Last night, I ran six VM's using -rc4 with and without KASAN; without KASAN, half of them hung. With KASAN enabled, all of the test VM's ran to completion. This strongly suggests whatever the problem is, it's timing related. I'll run a larger set of test runs to see if this pattern is confirmed today.

> BTW, I have kvm/qumu auto test which runs blktest/xfstest over > virtio-blk/virito-scsi/loop/nvme > with xfs/ext4 every two days, and not see such failure recently, but the > kernel config is based > rhel8's config.

Here are the configs I'm using, with and without KASAN. (The config with KASAN enabled is sent as a diff to avoid running into LKML's e-mail size restrictions.)

- Ted

# # Automatically generated file; DO NOT EDIT.
# Linux/x86_64 5.9.0-rc4 Kernel Configuration # CONFIG_CC_VERSION_TEXT="gcc (Debian 10.2.0-7) 10.2.0" CONFIG_CC_IS_GCC=y CONFIG_GCC_VERSION=100200 CONFIG_LD_VERSION=23500 CONFIG_CLANG_VERSION=0 CONFIG_CC_CAN_LINK=y CONFIG_CC_CAN_LINK_STATIC=y CONFIG_CC_HAS_ASM_GOTO=y CONFIG_CC_HAS_ASM_INLINE=y CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_TABLE_SORT=y CONFIG_THREAD_INFO_IN_TASK=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="-xfstests" CONFIG_LOCALVERSION_AUTO=y CONFIG_BUILD_SALT="" CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_HAVE_KERNEL_ZSTD=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set # CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set # CONFIG_KERNEL_ZSTD is not set CONFIG_DEFAULT_INIT="" CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y # CONFIG_WATCH_QUEUE is not set CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_USELIB=y # CONFIG_AUDIT is not set CONFIG_HAVE_ARCH_AUDITSYSCALL=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_GENERIC_IRQ_MIGRATION=y CONFIG_HARDIRQS_SW_RESEND=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y CONFIG_GENERIC_IRQ_RESERVATION_MODE=y CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y # CONFIG_GENERIC_IRQ_DEBUGFS is not set # end of IRQ subsystem CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_INIT=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y 
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # end of Timers subsystem CONFIG_PREEMPT_NONE=y # CONFIG_PREEMPT_VOLUNTARY is not set # CONFIG_PREEMPT is not set CONFIG_PREEMPT_COUNT=y # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set # CONFIG_IRQ_TIME_ACCOUNTING is not set CONFIG_HAVE_SCHED_AVG_IRQ=y # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # CONFIG_PSI is not set # end of CPU/Task time and stats accounting CONFIG_CPU_ISOLATION=y # # RCU Subsystem # CONFIG_TREE_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y CONFIG_TREE_SRCU=y CONFIG_TASKS_RCU_GENERIC=y CONFIG_TASKS_RUDE_RCU=y CONFIG_RCU_STALL_COMMON=y CONFIG_RCU_NEED_SEGCBLIST=y # end of RCU Subsystem CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y # CONFIG_IKHEADERS is not set CONFIG_LOG_BUF_SHIFT=17 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y # # Scheduler features # # CONFIG_UCLAMP_TASK is not set # end of Scheduler features CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y CONFIG_CC_HAS_INT128=y CONFIG_ARCH_SUPPORTS_INT128=y # CONFIG_NUMA_BALANCING is not set CONFIG_CGROUPS=y CONFIG_PAGE_COUNTER=y CONFIG_MEMCG=y CONFIG_MEMCG_SWAP=y CONFIG_MEMCG_KMEM=y CONFIG_BLK_CGROUP=y CONFIG_CGROUP_WRITEBACK=y CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y # CONFIG_CFS_BANDWIDTH is not set # CONFIG_RT_GROUP_SCHED is not set CONFIG_CGROUP_PIDS=y CONFIG_CGROUP_RDMA=y CONFIG_CGROUP_FREEZER=y CONFIG_CPUSETS=y
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Wed, Sep 16, 2020 at 04:20:26PM -0400, Theodore Y. Ts'o wrote: > On Wed, Sep 16, 2020 at 07:09:41AM +0800, Ming Lei wrote: > > > The problem is it's a bit tricky to revert 568f27006577, since there > > > is a merge conflict in blk_kick_flush(). I attempted to do the bisect > > > manually here, but it's clearly not right since the kernel is not > > > booting after the revert: > > > > > > https://github.com/tytso/ext4/commit/1e67516382a33da2c9d483b860ac4ec2ad390870 > > > > > > branch: > > > > > > https://github.com/tytso/ext4/tree/manual-revert-of-568f27006577 > > > > > > Can you send me a patch which correctly reverts 568f27006577? I can > > > try it against -rc1 .. -rc4, whichever is most convenient. > > > > Please test the following revert patch against -rc4. > > Unfortunately the results of the revert is... wierd. > > With -rc4, *all* of the VM's are failing --- reliably. With rc4 with > the revert, *some* of the VM's are able to complete the tests, but > over half are still locking up or failing with some kind of oops. So > that seems to imply that there is some kind of timing issue going on, > or maybe there are multiple bugs in play?

Obviously there is some other, more serious issue, since 568f27006577 is completely reverted in your test, and you still see the list corruption issue. So I'd suggest finding the big issue first. Once it is fixed, maybe everything becomes fine.

...

> > v5.9-rc4 with the revert of 568f27006577: we're seeing a similar > number of VM failures, but the failure signature is different.
> The most common failure is: > > [ 390.023691] [ cut here ] > [ 390.028614] list_del corruption, e1c241b00408->next is > LIST_POISON1 (dead0100) > [ 390.037040] WARNING: CPU: 1 PID: 5948 at lib/list_debug.c:47 > __list_del_entry_valid+0x4e/0x90 > [ 390.045684] CPU: 1 PID: 5948 Comm: umount Not tainted > 5.9.0-rc4-xfstests-1-g6fdef015feba #11 > [ 390.054581] Hardware name: Google Google Compute Engine/Google > Compute Engine, BIOS Google 01/01/2011 > [ 390.063943] RIP: 0010:__list_del_entry_valid+0x4e/0x90 > [ 390.069731] Code: 2e 48 8b 32 48 39 fe 75 3a 48 8b 50 08 48 39 f2 75 > 48 b8 01 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 10 13 12 9b e8 30 2f 8c ff > <0f> 0b 31 c0 c3 48 89 fe 48 c7 c7 48 13 12 9b e8 1c 2f 8c ff 0f 0b > [ 390.088615] RSP: 0018:ae95c6ddba28 EFLAGS: 00010082 > [ 390.094079] RAX: RBX: ce95bfc007d0 RCX: > 0027 > [ 390.101338] RDX: 0027 RSI: a0c9d93d7dc0 RDI: > a0c9d93d7dc8 > [ 390.108659] RBP: e1c241b00408 R08: 006ba6bff7dc R09: > > [ 390.115925] R10: a0c9d3f444c0 R11: 0046 R12: > a0c9d8041180 > [ 390.123186] R13: a0c86c010e00 R14: e1c241b00400 R15: > a0c9d8042240 > [ 390.130637] FS: 7fb227580080() GS:a0c9d920() > knlGS: > [ 390.138860] CS: 0010 DS: ES: CR0: 80050033 > [ 390.144721] CR2: 7ff72d2dfe74 CR3: 0001cdbb8002 CR4: > 003706e0 > [ 390.152022] DR0: DR1: DR2: > > [ 390.159314] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 390.166569] Call Trace: > [ 390.169148] free_block+0xec/0x270 > [ 390.173100] ___cache_free+0x488/0x6b0 > [ 390.177062] kfree+0xc9/0x1d0 > [ 390.181202] kmem_freepages+0xa0/0xf0 > [ 390.185009] slab_destroy+0x19/0x50 > [ 390.188653] slabs_destroy+0x6d/0x90 > [ 390.192339] ___cache_free+0x4a3/0x6b0 > [ 390.196477] kfree+0xc9/0x1d0 > [ 390.199651] kmem_freepages+0xa0/0xf0 > [ 390.203459] slab_destroy+0x19/0x50 > [ 390.207060] slabs_destroy+0x6d/0x90 > [ 390.210784] ___cache_free+0x4a3/0x6b0 > [ 390.214672] ? 
lockdep_hardirqs_on_prepare+0xe7/0x180 > [ 390.219845] kfree+0xc9/0x1d0 > [ 390.222928] put_crypt_info+0xe3/0x100 > [ 390.226801] fscrypt_put_encryption_info+0x15/0x30 > [ 390.231721] ext4_clear_inode+0x80/0x90 > [ 390.235774] ext4_evict_inode+0x6d/0x630 > [ 390.239960] evict+0xd0/0x1a0 > [ 390.243049] dispose_list+0x51/0x80 > [ 390.246659] evict_inodes+0x15b/0x1b0 > [ 390.250526] generic_shutdown_super+0x37/0x100 > [ 390.255094] kill_block_super+0x21/0x50 > [ 390.259066] deactivate_locked_super+0x2f/0x70 > [ 390.263638] cleanup_mnt+0xb8/0x140 > [ 390.267248] task_work_run+0x73/0xc0 > [ 390.270953] exit_to_user_mode_prepare+0x197/0x1a0 > [ 390.277333] syscall_exit_to_user_mode+0x3c/0x210 > [ 390.282171] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 390.287348] RIP: 0033:0x7fb2279a6507 > [ 390.291128] Code: 19 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Wed, Sep 16, 2020 at 07:09:41AM +0800, Ming Lei wrote: > > The problem is it's a bit tricky to revert 568f27006577, since there > > is a merge conflict in blk_kick_flush(). I attempted to do the bisect > > manually here, but it's clearly not right since the kernel is not > > booting after the revert: > > > > https://github.com/tytso/ext4/commit/1e67516382a33da2c9d483b860ac4ec2ad390870 > > > > branch: > > > > https://github.com/tytso/ext4/tree/manual-revert-of-568f27006577 > > > > Can you send me a patch which correctly reverts 568f27006577? I can > > try it against -rc1 .. -rc4, whichever is most convenient. > > Please test the following revert patch against -rc4.

Unfortunately the results of the revert are... weird.

With -rc4, *all* of the VM's are failing --- reliably. With -rc4 with the revert, *some* of the VM's are able to complete the tests, but over half are still locking up or failing with some kind of oops. So that seems to imply that there is some kind of timing issue going on, or maybe there are multiple bugs in play?

So let's review the bidding. We're going to review four commits:

7bf137298cb7: (Parent of 568f27006577) Completely clean, all VM's complete the tests.

568f27006577: Fails reliably. In 9 of the 11 VM's there is nothing on the console; the I/O is just stopped. If I've been able to get to the VM before it gets killed from the timeout, ssh works, but any attempt to do any I/O will hang, which presumably explains why the tests are hanging. In the other two VM's there are hung task timeouts, with stack traces that look like this...

v5.9-rc4: More than half of the VM's are failing --- but at least some are succeeding, which is more than can be said for 568f27006577. There is a *variety* of different sorts of failures. So the fact that we're not seeing the silent hangs in -rc4 is... interesting.

v5.9-rc4 with the revert of 568f27006577: we're seeing a similar number of VM failures, but the failure signature is different.
(More details below, with the stack traces.)

I really don't know what to make of this. It looks like there's something going on in the block layer, based on the fact that 568f27006577 fails reliably, but its predecessor is completely clean. But then things have changed significantly by the time we get to -rc4. I'll do a more in-depth analysis of -rc1 to see if the failure patterns are more similar to 568f27006577 than -rc4. But hopefully you can see something that I'm missing?

Thanks,

- Ted

--- 7bf137298cb7: (Parent of 568f27006577) Completely clean, all VM's complete the tests --- 568f27006577: Fails reliably. In 9 of the 11 VM's there is nothing on the console; the I/O is just stopped. If I've been able to get to the VM before it gets killed from the timeout, ssh works, but any attempt to do any I/O will hang, which presumably explains why the tests are hanging. In the other two VM's there are hung task timeouts, with stack traces that look like this: [14375.634282] INFO: task jbd2/sda1-8:116 blocked for more than 122 seconds. [14375.641679] Not tainted 5.8.0-rc2-xfstests-30545-g568f27006577 #6 [14375.648517] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [14375.656523] jbd2/sda1-8 D0 116 2 0x4000 [14375.656530] Call Trace: [14375.656548] __schedule+0x2cc/0x6e0 [14375.656695] ? sched_clock_cpu+0xc/0xb0 [14375.656699] schedule+0x55/0xd0 [14375.656702] io_schedule+0x12/0x40 [14375.656708] blk_mq_get_tag+0x11e/0x280 [14375.656715] ? __wake_up_common_lock+0xc0/0xc0 [14375.656719] __blk_mq_alloc_request+0xb6/0x100 [14375.656722] blk_mq_submit_bio+0x13f/0x7d0 [14375.656727] ? blk_queue_enter+0x15c/0x510 [14375.656731] submit_bio_noacct+0x48d/0x500 [14375.656737] ? kvm_sched_clock_read+0x14/0x30 [14375.656740] ? submit_bio+0x42/0x150 [14375.656744] submit_bio+0x42/0x150 [14375.656748] ? guard_bio_eod+0x90/0x140 [14375.656754] submit_bh_wbc+0x16d/0x190 [14375.656761] jbd2_journal_commit_transaction+0x70d/0x1f23 [14375.656767] ?
kjournald2+0x128/0x3b0 [14375.656771] kjournald2+0x128/0x3b0 [14375.656777] ? trace_hardirqs_on+0x1c/0xf0 [14375.656781] ? __wake_up_common_lock+0xc0/0xc0 [14375.656785] ? __jbd2_debug+0x50/0x50 [14375.656788] kthread+0x136/0x150 [14375.656792] ? __kthread_queue_delayed_work+0x90/0x90 [14375.656796] ret_from_fork+0x22/0x30 --- v5.9-rc4: More than half of the VM's are failing --- but at least some are succeeding, which
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Tue, Sep 15, 2020 at 06:45:41PM -0400, Theodore Y. Ts'o wrote: > On Tue, Sep 15, 2020 at 03:33:03PM +0800, Ming Lei wrote: > > Hi Theodore, > > > > On Tue, Sep 15, 2020 at 12:45:19AM -0400, Theodore Y. Ts'o wrote: > > > On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote: > > > > Worse, right now, -rc1 and -rc2 is causing random crashes in my > > > > gce-xfstests framework. Sometimes it happens before we've run even a > > > > single xfstests; sometimes it happens after we have successfully > > > > completed all of the tests, and we're doing a shutdown of the VM under > > > > test. Other times it happens in the middle of a test run. Given that > > > > I'm seeing this at -rc1, which is before my late ext4 pull request to > > > > Linus, it's probably not an ext4 related bug. But it also means that > > > > I'm partially blind in terms of my kernel testing at the moment. So I > > > > can't even tell Linus that I've run lots of tests and I'm 100% > > > > confident your one-line change is 100% safe. > > > > > > I was finally able to bisect it down to the commit: > > > > > > 37f4a24c2469: blk-mq: centralise related handling into > > > blk_mq_get_driver_tag > > > > 37f4a24c2469 has been reverted in: > > > > 4e2f62e566b5 Revert "blk-mq: put driver tag when this request is > > completed" > > > > And later the patch is committed as the following after being fixed: > > > > 568f27006577 blk-mq: centralise related handling into > > blk_mq_get_driver_tag > > > > So can you reproduce the issue by running kernel of commit 568f27006577? > > Yes. And things work fine if I try 4e2f62e566b5. > > > If yes, can the issue be fixed by reverting 568f27006577? > > The problem is it's a bit tricky to revert 568f27006577, since there > is a merge conflict in blk_kick_flush(). 
I attempted to do the bisect
> manually here, but it's clearly not right since the kernel is not
> booting after the revert:
>
> https://github.com/tytso/ext4/commit/1e67516382a33da2c9d483b860ac4ec2ad390870
>
> branch:
>
> https://github.com/tytso/ext4/tree/manual-revert-of-568f27006577
>
> Can you send me a patch which correctly reverts 568f27006577? I can
> try it against -rc1 .. -rc4, whichever is most convenient.

Please test the following revert patch against -rc4.

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 53abb5c73d99..24c208d21793 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -219,6 +219,7 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 	struct request *rq, *n;
 	unsigned long flags = 0;
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
+	struct blk_mq_hw_ctx *hctx;
 
 	blk_account_io_flush(flush_rq);
@@ -234,11 +235,13 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 	if (fq->rq_status != BLK_STS_OK)
 		error = fq->rq_status;
 
+	hctx = flush_rq->mq_hctx;
 	if (!q->elevator) {
-		flush_rq->tag = BLK_MQ_NO_TAG;
+		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
+		flush_rq->tag = -1;
 	} else {
 		blk_mq_put_driver_tag(flush_rq);
-		flush_rq->internal_tag = BLK_MQ_NO_TAG;
+		flush_rq->internal_tag = -1;
 	}
 
 	running = &fq->flush_queue[fq->flush_running_idx];
@@ -309,16 +312,12 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	flush_rq->mq_hctx = first_rq->mq_hctx;
 
 	if (!q->elevator) {
+		fq->orig_rq = first_rq;
 		flush_rq->tag = first_rq->tag;
-
-		/*
-		 * We borrow data request's driver tag, so have to mark
-		 * this flush request as INFLIGHT for avoiding double
-		 * account of this driver tag
-		 */
-		flush_rq->rq_flags |= RQF_MQ_INFLIGHT;
-	} else
+		blk_mq_tag_set_rq(flush_rq->mq_hctx, first_rq->tag, flush_rq);
+	} else {
 		flush_rq->internal_tag = first_rq->internal_tag;
+	}
 
 	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
 	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index b1acac518c4e..3945c7f5b944 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -101,6 +101,18 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 	return atomic_read(&hctx->nr_active) < depth;
 }
 
+/*
+ * This helper should only be used for flush request to share tag
+ * with the request cloned from, and both the two requests can't be
+ * in flight at the same time. The caller has to make sure the tag
+ * can't be freed.
+ */
+static inline void blk_mq_tag_set_rq(struct blk_mq_hw_ctx *hctx,
+		unsigned int tag, struct request *rq)
+{
+	hctx->tags->rqs[tag] = rq;
+}
+
 static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Tue, Sep 15, 2020 at 03:33:03PM +0800, Ming Lei wrote: > Hi Theodore, > > On Tue, Sep 15, 2020 at 12:45:19AM -0400, Theodore Y. Ts'o wrote: > > On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote: > > > Worse, right now, -rc1 and -rc2 is causing random crashes in my > > > gce-xfstests framework. Sometimes it happens before we've run even a > > > single xfstests; sometimes it happens after we have successfully > > > completed all of the tests, and we're doing a shutdown of the VM under > > > test. Other times it happens in the middle of a test run. Given that > > > I'm seeing this at -rc1, which is before my late ext4 pull request to > > > Linus, it's probably not an ext4 related bug. But it also means that > > > I'm partially blind in terms of my kernel testing at the moment. So I > > > can't even tell Linus that I've run lots of tests and I'm 100% > > > confident your one-line change is 100% safe. > > > > I was finally able to bisect it down to the commit: > > > > 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag > > 37f4a24c2469 has been reverted in: > > 4e2f62e566b5 Revert "blk-mq: put driver tag when this request is > completed" > > And later the patch is committed as the following after being fixed: > > 568f27006577 blk-mq: centralise related handling into > blk_mq_get_driver_tag > > So can you reproduce the issue by running kernel of commit 568f27006577? Yes. And things work fine if I try 4e2f62e566b5. > If yes, can the issue be fixed by reverting 568f27006577? The problem is it's a bit tricky to revert 568f27006577, since there is a merge conflict in blk_kick_flush(). I attempted to do the bisect manually here, but it's clearly not right since the kernel is not booting after the revert: https://github.com/tytso/ext4/commit/1e67516382a33da2c9d483b860ac4ec2ad390870 branch: https://github.com/tytso/ext4/tree/manual-revert-of-568f27006577 Can you send me a patch which correctly reverts 568f27006577? 
I can try it against -rc1 .. -rc4, whichever is most convenient. > Can you share the exact mount command line for setup the environment? > and the exact xfstest item? It's a variety of mount command lines, since I'm using gce-xfstests[1][2] using a variety of file system scenarios --- but the basic one, which is ext4 using the default 4k block size is failing (they all are failing). [1] https://thunk.org/gce-xfstests [2] https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md It's also not one consistent xfstests which is failing, but it does tend to be tests which are loading up the storage stack with a lot of small random read/writes, especially involving metadata blocks/writes. (For example, tests which run fsstress.) Since this reliably triggers for me, and other people running kvm-xfstests or are running xfstests on their own test environments aren't seeing it, I'm assuming it must be some kind of interesting interaction between virtio-scsi, perhaps with how Google Persistent Disk is behaving (maybe timing related? who knows?). Darrick Wong did say he saw something like it once using Oracle's Cloud infrastructure, but as far as I know it hasn't reproduced since. On Google Compute Engine VM's, it reproduces *extremely* reliably. I expect that if you were to set up gce-xfstests, get a free GCE account with the initial $300 free credits, you could run "gce-xfstests -c ext4/4k -g auto" and it would reproduce within an hour or so. (So under a dollar's worth of VM credits, so long as you notice that it's hung and shut down the VM after gathering debugging data.) The instructions are at [2], and the image xfstests-202008311554 in the xfstests-cloud project is a public copy of the VM test appliance I was using. 
% gcloud compute images describe --project xfstests-cloud xfstests-202008311554 archiveSizeBytes: '1720022528' creationTimestamp: '2020-09-15T15:09:30.544-07:00' description: Linux Kernel File System Test Appliance diskSizeGb: '10' family: xfstests guestOsFeatures: - type: VIRTIO_SCSI_MULTIQUEUE - type: UEFI_COMPATIBLE id: '1558420969906537845' kind: compute#image labelFingerprint: V-2Qgcxt2uw= labels: blktests: g8a75bed e2fsprogs: v1_45_6 fio: fio-3_22 fsverity: v1_2 ima-evm-utils: v1_3_1 nvme-cli: v1_12 quota: g13bb8c2 util-linux: v2_36 xfsprogs: v5_8_0-rc1 xfstests: linux-v3_8-2838-geb439bf2 xfstests-bld: gb5085ab licenseCodes: - '5543610867827062957' licenses: - https://www.googleapis.com/compute/v1/projects/debian-cloud/global/licenses/debian-10-buster name: xfstests-202008311554 selfLink: https://www.googleapis.com/compute/v1/projects/xfstests-cloud/global/images/xfstests-202008311554 sourceDisk: https://www.googleapis.com/compute/v1/projects/xfstests-cloud/zones/us-east1-d/disks/temp-xfstests-202008311554 sourceDiskId: '5824762850044577124' sourceType: RAW status: READY storageLocations: - us Cheers, - Ted P.S. As
Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
Hi Theodore, On Tue, Sep 15, 2020 at 12:45:19AM -0400, Theodore Y. Ts'o wrote: > On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote: > > Worse, right now, -rc1 and -rc2 is causing random crashes in my > > gce-xfstests framework. Sometimes it happens before we've run even a > > single xfstests; sometimes it happens after we have successfully > > completed all of the tests, and we're doing a shutdown of the VM under > > test. Other times it happens in the middle of a test run. Given that > > I'm seeing this at -rc1, which is before my late ext4 pull request to > > Linus, it's probably not an ext4 related bug. But it also means that > > I'm partially blind in terms of my kernel testing at the moment. So I > > can't even tell Linus that I've run lots of tests and I'm 100% > > confident your one-line change is 100% safe. > > I was finally able to bisect it down to the commit: > > 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

37f4a24c2469 has been reverted in: 4e2f62e566b5 Revert "blk-mq: put driver tag when this request is completed"

And the patch was later committed again, after being fixed, as: 568f27006577 blk-mq: centralise related handling into blk_mq_get_driver_tag

So can you reproduce the issue by running the kernel at commit 568f27006577? If yes, can the issue be fixed by reverting 568f27006577?

> > (See below for [1] Bisect log.) > > The previous commit allows the tests to run to completion. With > commit 37f4a24c2469 and later all 11 test scenarios (4k blocks, 1k > blocks, ext3 compat, ext4 w/ fscrypt, nojournal mode, data=journal, > bigalloc, etc.) the VM will get stuck.

Can you share the exact mount command line for setting up the environment, and the exact xfstests item?

Thanks, Ming
REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote:
> Worse, right now, -rc1 and -rc2 is causing random crashes in my
> gce-xfstests framework.  Sometimes it happens before we've run even a
> single xfstests; sometimes it happens after we have successfully
> completed all of the tests, and we're doing a shutdown of the VM under
> test.  Other times it happens in the middle of a test run.  Given that
> I'm seeing this at -rc1, which is before my late ext4 pull request to
> Linus, it's probably not an ext4 related bug.  But it also means that
> I'm partially blind in terms of my kernel testing at the moment.  So I
> can't even tell Linus that I've run lots of tests and I'm 100%
> confident your one-line change is 100% safe.

I was finally able to bisect it down to the commit:

    37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

(See below for [1] Bisect log.)

The previous commit allows the tests to run to completion.  With
commit 37f4a24c2469 and later, in all 11 test scenarios (4k blocks,
1k blocks, ext3 compat, ext4 w/ fscrypt, nojournal mode, data=journal,
bigalloc, etc.), the VM will get stuck.

The symptom is that while running xfstests in a Google Compute Engine
(GCE) VM, the tests just hang.  There are a number of tests where this
is more likely, but it's not unique to a single test.  In most cases
there is nothing; the test just stops running until the test framework
times out after an hour (tests usually complete in seconds, or at most
a few tens of minutes in the worst case) and kills the VM.  In one
case, I did get a report like this.  (See below for [2] stack trace
from 37f4a24c2469.)

I attempted to revert the commit in question against -rc1 and -rc4;
the results can be found on the branches manual-revert-of-blk-mq-patch
and manual-revert-of-blk-mq-patch-rc4 at https://github.com/tytso/ext4.
I don't think I got the revert quite right; with the revert, most of
the test VMs successfully complete, but 2 out of the 11 fail with a
different stack trace.  (See below for [3] stack trace from my
attempted manual revert of 37f4a24c2469.)  But it does seem to confirm
that the test VM hangs are primarily caused by commit 37f4a24c2469.

Does this make any sense as to what might be going on?  I hope it does
for you, since I'm pretty confused about it myself.

Thanks,

					- Ted

[1] Bisect log

git bisect start
# bad: [9123e3a74ec7b934a4a099e98af6a61c2f80bbf5] Linux 5.9-rc1
git bisect bad 9123e3a74ec7b934a4a099e98af6a61c2f80bbf5
# good: [bcf876870b95592b52519ed4aafcf9d95999bc9c] Linux 5.8
git bisect good bcf876870b95592b52519ed4aafcf9d95999bc9c
# bad: [8186749621ed6b8fc42644c399e8c755a2b6f630] Merge tag 'drm-next-2020-08-06' of git://anongit.freedesktop.org/drm/drm
git bisect bad 8186749621ed6b8fc42644c399e8c755a2b6f630
# bad: [2324d50d051ec0f14a548e78554fb02513d6dcef] Merge tag 'docs-5.9' of git://git.lwn.net/linux
git bisect bad 2324d50d051ec0f14a548e78554fb02513d6dcef
# bad: [92c59e126b21fd212195358a0d296e787e444087] Merge tag 'arm-defconfig-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad 92c59e126b21fd212195358a0d296e787e444087
# bad: [cdc8fcb49905c0b67e355e027cb462ee168ffaa3] Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block
git bisect bad cdc8fcb49905c0b67e355e027cb462ee168ffaa3
# good: [ab5c60b79ab6cc50b39bbb21b2f9fb55af900b84] Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect good ab5c60b79ab6cc50b39bbb21b2f9fb55af900b84
# bad: [d958e343bdc3de2643ce25225bed082dc222858d] block: blk-timeout: delete duplicated word
git bisect bad d958e343bdc3de2643ce25225bed082dc222858d
# bad: [53042f3cc411adc79811ba3cfbca5d7a42a7b806] ps3vram: stop using ->queuedata
git bisect bad 53042f3cc411adc79811ba3cfbca5d7a42a7b806
# good: [621c1f42945e76015c3a585e7a9fe6e71665eba0] block: move struct block_device to blk_types.h
git bisect good 621c1f42945e76015c3a585e7a9fe6e71665eba0
# good: [36a3df5a4574d5ddf59804fcd0c4e9654c514d9a] blk-mq: put driver tag when this request is completed
git bisect good 36a3df5a4574d5ddf59804fcd0c4e9654c514d9a
# good: [570e9b73b0af2e5381ca5343759779b8c1ed20e3] blk-mq: move blk_mq_get_driver_tag into blk-mq.c
git bisect good 570e9b73b0af2e5381ca5343759779b8c1ed20e3
# bad: [b5fc1e8bedf8ad2c6381e0df6331ad5686aca425] blk-mq: remove pointless call of list_entry_rq() in hctx_show_busy_rq()
git bisect bad b5fc1e8bedf8ad2c6381e0df6331ad5686aca425
# bad: [37f4a24c2469a10a4c16c641671bd766e276cf9f] blk-mq: centralise related handling into blk_mq_get_driver_tag
git bisect bad 37f4a24c2469a10a4c16c641671bd766e276cf9f
# good: [723bf178f158abd1ce6069cb049581b3cb003aab] blk-mq: move blk_mq_put_driver_tag() into blk-mq.c
git bisect good 723bf178f158abd1ce6069cb049581b3cb003aab
# first bad commit: [37f4a24c2469a10a4c16c641671bd766e276cf9f] blk-mq: centralise related handling into blk_mq_get_driver_tag
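The manual good/bad marking shown in the bisect log can be automated with `git bisect run` whenever a scriptable pass/fail check exists. A minimal self-contained sketch, using a throwaway repo where a marker file named `broken` stands in for the real regression (all names are placeholders):

```shell
set -e
# Build a toy history of 10 commits; commits 7..10 carry the "bug"
# (represented here by the presence of a file named `broken`).
dir=$(mktemp -d); cd "$dir"
git init -q repo; cd repo
git config user.email tester@example.com
git config user.name tester
for i in $(seq 1 10); do
  echo "change $i" >> file.c
  if [ "$i" -ge 7 ]; then touch broken; fi
  git add -A
  git commit -qm "commit $i"
done
# bad = HEAD (commit 10), good = HEAD~9 (commit 1)
git bisect start HEAD HEAD~9
# The test script must exit 0 for a good commit and 1 for a bad one;
# here "good" simply means the marker file is absent.
git bisect run sh -c '! test -f broken' > /dev/null
git bisect log | tail -n 1
```

For the real bug a test script would instead boot the kernel and run the hanging xfstests scenario with a timeout, but the exit-code convention (0 = good, 1 = bad, 125 = skip) is the same.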