Re: bio linked list corruption.
On Tue, Dec 6, 2016 at 12:16 AM, Peter Zijlstra wrote:
>>
>> Of course, I'm really hoping that this shmem.c use is the _only_ such
>> case. But I doubt it.
>
> $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
> 28

Hmm. Most of them seem to be ok, because they use "wait_event()", which
will always remove itself from the wait-queue. And they do it from the
place that allocates the wait-queue.

IOW, the mm/shmem.c case really was fairly special, because it just did
"prepare_to_wait()", and then did a finish_wait() - and not in the
thread that allocated it on the stack.

So it's really that "some _other_ thread allocated the waitqueue on the
stack, and now we're doing a wait on it" that is bad. So the normal
pattern seems to be:

 - allocate wq on the stack
 - pass it on to a waker
 - wait for it

and that's ok, because as part of "wait for it" we will also be
cleaning things up. The reason mm/shmem.c was buggy was that it did

 - allocate wq on the stack
 - pass it on to somebody else to wait for
 - wake it up

and *that* is buggy, because it's the waiter, not the waker, that
normally cleans things up.

Is there some good way to find this kind of pattern automatically, I
wonder?

           Linus
Re: bio linked list corruption.
On 5 December 2016 at 22:33, Vegard Nossum wrote:
> On 5 December 2016 at 21:35, Linus Torvalds wrote:
>> Note for Ingo and Peter: this patch has not been tested at all. But
>> Vegard did test an earlier patch of mine that just verified that yes,
>> the issue really was that wait queue entries remained on the wait
>> queue head just as we were about to return and free it.
>
> The second patch has been running for 1h+ without any problems of any
> kind. I should typically have seen 2 crashes by now. I'll let it run
> overnight to be sure.

Alright, so nearly 12 hours later I don't see either the new warning or
the original crash at all, so feel free to add:

Tested-by: Vegard Nossum .

That said, my 8 VMs had all panicked in some way due to OOMs (which is
new since v4.8), although some got page allocation stalls for >20s and
died because "khugepaged blocked for more than 120 seconds", others got
"Out of memory and no killable processes".

Vegard
Re: bio linked list corruption.
* Peter Zijlstra wrote:

> $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
> 28

This debug facility looks sensible. A couple of minor suggestions:

> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -39,6 +39,9 @@ struct wait_bit_queue {
>  struct __wait_queue_head {
>  	spinlock_t		lock;
>  	struct list_head	task_list;
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> +	int			onstack;
> +#endif

The structure will pack better in the debug-enabled case if 'onstack'
is next to 'lock', as spinlock_t is 4 bytes on many architectures.

> -#define __WAIT_QUEUE_HEAD_INITIALIZER(name) {				\
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> +#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
> +#else
> +#define ___WAIT_QUEUE_ONSTACK(onstack)
> +#endif

Please help readers by showing the internal structure of the definition:

#ifdef CONFIG_DEBUG_WAITQUEUE
# define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
#else
# define ___WAIT_QUEUE_ONSTACK(onstack)
#endif

> +static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait)
> +{
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> +	WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function);
> +#endif
> +}

I'd name this debug_waitqueue_check() or such - as the 'prepare' is a
bit misleading (we don't prepare debugging, we do the debug check
here).

> +config DEBUG_WAITQUEUE
> +	bool "Debug waitqueue"
> +	depends on DEBUG_KERNEL

I'd name it DEBUG_SCHED_WAITQUEUE=y and I'd also make it depend on
CONFIG_DEBUG_SCHED.

LGTM otherwise!

Thanks,

	Ingo
Re: bio linked list corruption.
On Mon, Dec 05, 2016 at 12:35:52PM -0800, Linus Torvalds wrote:
> Adding the scheduler people to the participants list, and re-attaching
> the patch, because while this patch is internal to the VM code, the
> issue itself is not.
>
> There might well be other cases where somebody goes "wake_up_all()"
> will wake everybody up, so I can put the wait queue head on the stack,
> and then after I've woken people up I can return".
>
> Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
> does make sure that everybody is woken up, but the usual autoremove
> wake function only removes the wakeup entry if the process was woken
> up by that particular wakeup. If something else had woken it up, the
> entry remains on the list, and the waker in this case returned with
> the wait head still containing entries.
>
> Which is deadly when the wait queue head is on the stack.

Yes, very much so.

> So I'm wondering if we should make that "synchronous_wake_function()"
> available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
> that uses it.

We could also do some debug code that tracks the ONSTACK-ness and warns
on autoremove. Something like the below, equally untested.

> Of course, I'm really hoping that this shmem.c use is the _only_ such
> case. But I doubt it.

$ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
28

---
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2408e8d5c05c..199faaa89847 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -39,6 +39,9 @@ struct wait_bit_queue {
 struct __wait_queue_head {
 	spinlock_t		lock;
 	struct list_head	task_list;
+#ifdef CONFIG_DEBUG_WAITQUEUE
+	int			onstack;
+#endif
 };
 typedef struct __wait_queue_head wait_queue_head_t;
 
@@ -56,9 +59,18 @@ struct task_struct;
 #define DECLARE_WAITQUEUE(name, tsk)					\
 	wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)
 
-#define __WAIT_QUEUE_HEAD_INITIALIZER(name) {				\
+#ifdef CONFIG_DEBUG_WAITQUEUE
+#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
+#else
+#define ___WAIT_QUEUE_ONSTACK(onstack)
+#endif
+
+#define ___WAIT_QUEUE_HEAD_INITIALIZER(name, onstack) {			\
 	.lock		= __SPIN_LOCK_UNLOCKED(name.lock),		\
-	.task_list	= { &(name).task_list, &(name).task_list } }
+	.task_list	= { &(name).task_list, &(name).task_list },	\
+	___WAIT_QUEUE_ONSTACK(onstack) }
+
+#define __WAIT_QUEUE_HEAD_INITIALIZER(name) ___WAIT_QUEUE_HEAD_INITIALIZER(name, 0)
 
 #define DECLARE_WAIT_QUEUE_HEAD(name) \
 	wait_queue_head_t name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
@@ -80,11 +92,12 @@ extern void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct
 
 #ifdef CONFIG_LOCKDEP
 # define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
-	({ init_waitqueue_head(&name); name; })
+	({ init_waitqueue_head(&name); (name).onstack = 1; name; })
 # define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
 	wait_queue_head_t name = __WAIT_QUEUE_HEAD_INIT_ONSTACK(name)
 #else
-# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name)
+# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
+	wait_queue_head_t name = ___WAIT_QUEUE_HEAD_INITIALIZER(name, 1)
 #endif
 
 static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 9453efe9b25a..746d00117d08 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -156,6 +156,13 @@ void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
 }
 EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
 
+static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait)
+{
+#ifdef CONFIG_DEBUG_WAITQUEUE
+	WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function);
+#endif
+}
+
 /*
  * Note: we use "set_current_state()" _after_ the wait-queue add,
  * because we need a memory barrier there on SMP, so that any
@@ -178,6 +185,7 @@ prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
 	if (list_empty(&wait->task_list))
 		__add_wait_queue(q, wait);
 	set_current_state(state);
+	prepare_debug(q, wait);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait);
@@ -192,6 +200,7 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 	if (list_empty(&wait->task_list))
 		__add_wait_queue_tail(q, wait);
 	set_current_state(state);
+	prepare_debug(q, wait);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
@@ -235,6 +244,7 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 		}
 		set_current_state(state);
 	}
+	prepare_debug(q, wait);
 	spin_unlock_irqrestore(&q->lock, flags);
Re: bio linked list corruption.
On 5 December 2016 at 21:35, Linus Torvalds wrote:
> Note for Ingo and Peter: this patch has not been tested at all. But
> Vegard did test an earlier patch of mine that just verified that yes,
> the issue really was that wait queue entries remained on the wait
> queue head just as we were about to return and free it.

The second patch has been running for 1h+ without any problems of any
kind. I should typically have seen 2 crashes by now. I'll let it run
overnight to be sure.

Vegard
Re: bio linked list corruption.
Adding the scheduler people to the participants list, and re-attaching
the patch, because while this patch is internal to the VM code, the
issue itself is not.

There might well be other cases where somebody goes "wake_up_all()"
will wake everybody up, so I can put the wait queue head on the stack,
and then after I've woken people up I can return".

Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
does make sure that everybody is woken up, but the usual autoremove
wake function only removes the wakeup entry if the process was woken
up by that particular wakeup. If something else had woken it up, the
entry remains on the list, and the waker in this case returned with
the wait head still containing entries.

Which is deadly when the wait queue head is on the stack.

So I'm wondering if we should make that "synchronous_wake_function()"
available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
that uses it.

Of course, I'm really hoping that this shmem.c use is the _only_ such
case. But I doubt it.

Comments?

Note for Ingo and Peter: this patch has not been tested at all. But
Vegard did test an earlier patch of mine that just verified that yes,
the issue really was that wait queue entries remained on the wait
queue head just as we were about to return and free it.

            Linus

On Mon, Dec 5, 2016 at 12:10 PM, Linus Torvalds wrote:
>
> Anyway, can you try this patch instead? It should actually cause the
> wake_up_all() to always remove all entries, and thus the WARN_ON()
> should no longer happen (and I removed the "list_del()" hackery).
>
>            Linus

diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..17beb44e9f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1848,6 +1848,19 @@ alloc_nohuge:	page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
 	return error;
 }
 
+/*
+ * This is like autoremove_wake_function, but it removes the wait queue
+ * entry unconditionally - even if something else had already woken the
+ * target.
+ */
+static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	int ret = default_wake_function(wait, mode, sync, key);
+	list_del_init(&wait->task_list);
+	return ret;
+}
+
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vma->vm_file);
@@ -1883,7 +1896,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		    vmf->pgoff >= shmem_falloc->start &&
 		    vmf->pgoff < shmem_falloc->next) {
 			wait_queue_head_t *shmem_falloc_waitq;
-			DEFINE_WAIT(shmem_fault_wait);
+			DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
 
 			ret = VM_FAULT_NOPAGE;
 			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -2665,6 +2678,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			spin_lock(&inode->i_lock);
 			inode->i_private = NULL;
 			wake_up_all(&shmem_falloc_waitq);
+			WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list));
 			spin_unlock(&inode->i_lock);
 			error = 0;
 			goto out;
Re: bio linked list corruption.
On 5 December 2016 at 20:11, Vegard Nossum wrote:
> On 5 December 2016 at 18:55, Linus Torvalds wrote:
>> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum wrote:
>>
>> Since you apparently can recreate this fairly easily, how about trying
>> this stupid patch?
>>
>> NOTE! This is entirely untested. I may have screwed this up entirely.
>> You get the idea, though - just remove the wait queue head from the
>> list - the list entries stay around, but nothing points to the stack
>> entry (that we're going to free) any more.
>>
>> And add the warning to see if this actually ever triggers (and because
>> I'd like to see the callchain when it does, to see if it's another
>> waitqueue somewhere or what..)
>
> [ cut here ]
> WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0
> Kernel panic - not syncing: panic_on_warn set ...

So I noticed that panic_on_warn just after sending the email and I've
been waiting for it to trigger again. The warning has triggered twice
more without panic_on_warn set and I haven't seen any crash yet.

Vegard
Re: bio linked list corruption.
On Mon, Dec 5, 2016 at 11:11 AM, Vegard Nossum wrote:
>
> [ cut here ]
> WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0

Ok, good. So that's confirmed as the cause of this problem.

And the call chain that I wanted is obviously completely uninteresting,
because it's the call chain on the other side (the page fault side)
that would show the nested wake queue behavior. I was just being stupid
about it.

I wonder if we have any other places where we just blithely assume that
"wake_up_all()" will actually empty the whole wait queue. It's
_usually_ true, but as noted, nested waiting does happen.

Anyway, can you try this patch instead? It should actually cause the
wake_up_all() to always remove all entries, and thus the WARN_ON()
should no longer happen (and I removed the "list_del()" hackery).

            Linus

diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..17beb44e9f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1848,6 +1848,19 @@ alloc_nohuge:	page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
 	return error;
 }
 
+/*
+ * This is like autoremove_wake_function, but it removes the wait queue
+ * entry unconditionally - even if something else had already woken the
+ * target.
+ */
+static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	int ret = default_wake_function(wait, mode, sync, key);
+	list_del_init(&wait->task_list);
+	return ret;
+}
+
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vma->vm_file);
@@ -1883,7 +1896,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		    vmf->pgoff >= shmem_falloc->start &&
 		    vmf->pgoff < shmem_falloc->next) {
 			wait_queue_head_t *shmem_falloc_waitq;
-			DEFINE_WAIT(shmem_fault_wait);
+			DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
 
 			ret = VM_FAULT_NOPAGE;
 			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -2665,6 +2678,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			spin_lock(&inode->i_lock);
 			inode->i_private = NULL;
 			wake_up_all(&shmem_falloc_waitq);
+			WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list));
 			spin_unlock(&inode->i_lock);
 			error = 0;
 			goto out;
Re: bio linked list corruption.
On 5 December 2016 at 18:55, Linus Torvalds wrote:
> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum wrote:
>>
>> The warning shows that it made it past the list_empty_careful() check
>> in finish_wait() but then bugs out on the &wait->task_list
>> dereference.
>>
>> Anything stick out?
>
> I hate that shmem waitqueue garbage. It's really subtle.
>
> I think the problem is that "wake_up_all()" in shmem_fallocate()
> doesn't necessarily wake up everything. It wakes up TASK_NORMAL -
> which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean
> "everything on the list".
>
> I think that what happens is that the waiters somehow move from
> TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that
> wake_up_all() will ignore them, leave them on the list, and now that
> list on stack is no longer empty at the end.
>
> And the way *THAT* can happen is that the task is on some *other*
> waitqueue as well, and that other waitqueue wakes it up. That's not
> impossible, you can certainly have people on wait-queues that still
> take faults.
>
> Or somebody just uses a directed wake_up_process() or something.
>
> Since you apparently can recreate this fairly easily, how about trying
> this stupid patch?
>
> NOTE! This is entirely untested. I may have screwed this up entirely.
> You get the idea, though - just remove the wait queue head from the
> list - the list entries stay around, but nothing points to the stack
> entry (that we're going to free) any more.
>
> And add the warning to see if this actually ever triggers (and because
> I'd like to see the callchain when it does, to see if it's another
> waitqueue somewhere or what..)

[ cut here ]
WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0
Kernel panic - not syncing: panic_on_warn set ...
CPU: 22 PID: 14012 Comm: trinity-c73 Not tainted 4.9.0-rc7+ #220
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x83/0xb2
 panic+0x1e3/0x3ad
 __warn+0x1bf/0x1e0
 warn_slowpath_null+0x2c/0x40
 shmem_fallocate+0x9a7/0xac0
 vfs_fallocate+0x350/0x620
 SyS_madvise+0x432/0x1290
 do_syscall_64+0x1af/0x4d0
 entry_SYSCALL64_slow_path+0x25/0x25

Attached a full log.

Vegard
Re: bio linked list corruption.
On 5 December 2016 at 19:11, Andy Lutomirski wrote: > On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum wrote: >> On 23 November 2016 at 20:58, Dave Jones wrote: >>> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: >>> >>> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 >>> > trace from just before this happened. Does this shed any light ? >>> > >>> > https://codemonkey.org.uk/junk/trace.txt >>> >>> crap, I just noticed the timestamps in the trace come from quite a bit >>> later. I'll tweak the code to do the taint checking/ftrace stop after >>> every syscall, that should narrow the window some more. >> >> FWIW I hit this as well: >> >> BUG: unable to handle kernel paging request at 81ff08b7 > > We really ought to improve this message. If nothing else, it should > say whether it was a read, a write, or an instruction fetch. > >> IP: [] __lock_acquire.isra.32+0xda/0x1a30 >> PGD 461e067 PUD 461f063 >> PMD 1e001e1 > > Too lazy to manually decode this right now, but I don't think it matters. > >> Oops: 0003 [#1] PREEMPT SMP KASAN > > Note this is SMP, but that just means CONFIG_SMP=y. Vegard, how many > CPUs does your kernel think you have? My first crash was running on a 1-CPU guest (not intentionally, but because of a badly configured qemu). I'm honestly surprised it triggered at all with 1 CPU, but I guess it shows that it's not a true concurrency issue at least! > >> RIP: 0010:[] [] >> __lock_acquire.isra.32+0xda/0x1a30 >> RSP: 0018:8801bab8f730 EFLAGS: 00010082 >> RAX: 81ff071f RBX: RCX: > > RAX points to kernel text. Yes, it's somewhere in the middle of iov_iter_init() -- other crashes also had put_prev_entity(), a null pointer, and some garbage values I couldn't identify. 
> >> Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85 >> c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e> >> ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00 > > 2b:* 3e ff 80 98 01 00 00 incl %ds:0x198(%rax) <-- > trapping instruction > > That's very strange. I think this is: > > atomic_inc((atomic_t *)&class->ops); > > but my kernel contains: > > 3cb4: f0 ff 80 98 01 00 00 lock incl 0x198(%rax) > > So your kernel has been smp-alternatived. That 3e comes from > alternatives_smp_unlock. If you're running on SMP with UP > alternatives, things will break. Yes, indeed. It was running on 1 CPU by mistake and still triggered the bug. The crashes started really pouring in once I got my qemu fixed. Just to reassure you, here's another crash which shows it's using the correct instruction on an actual multicore guest: BUG: unable to handle kernel paging request at 0003030001de IP: [] __lock_acquire.isra.32+0xda/0x1a30 PGD 183fd2067 PUD 0 Oops: 0002 [#1] PREEMPT SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) CPU: 23 PID: 9584 Comm: trinity-c104 Not tainted 4.9.0-rc7+ #219 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 task: 880189f68000 task.stack: 88017fe5 RIP: 0010:[] [] __lock_acquire.isra.32+0xda/0x1a30 RSP: 0018:88017fe575e0 EFLAGS: 00010002 RAX: 00030346 RBX: RCX: [...] Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85 c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00 RIP [] __lock_acquire.isra.32+0xda/0x1a30 RSP CR2: 0003030001de ---[ end trace 2846425104eb6141 ]--- Kernel panic - not syncing: Fatal exception [ cut here ] 2b:* f0 ff 80 98 01 00 00 lock incl 0x198(%rax) <-- trapping instruction > What's your kernel command line? Can we have your entire kernel log from > boot? Just in case you still want this, I've attached the boot log for the "true SMP" guest above. 
Vegard 5.txt.gz Description: GNU Zip compressed data
Re: bio linked list corruption.
On Mon, Dec 5, 2016 at 10:11 AM, Andy Lutomirski wrote: > > So your kernel has been smp-alternatived. That 3e comes from > alternatives_smp_unlock. If you're running on SMP with UP > alternatives, things will break. I'm assuming he's just running in a VM with a single CPU. The problem that I pointed out with assuming wake_up_all() actually removes all wait queue entries does not depend on SMP. The race is much more fundamental and long-lived. Linus
Re: bio linked list corruption.
On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum wrote: > On 23 November 2016 at 20:58, Dave Jones wrote: >> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: >> >> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 >> > trace from just before this happened. Does this shed any light ? >> > >> > https://codemonkey.org.uk/junk/trace.txt >> >> crap, I just noticed the timestamps in the trace come from quite a bit >> later. I'll tweak the code to do the taint checking/ftrace stop after >> every syscall, that should narrow the window some more. > > FWIW I hit this as well: > > BUG: unable to handle kernel paging request at 81ff08b7 We really ought to improve this message. If nothing else, it should say whether it was a read, a write, or an instruction fetch. > IP: [] __lock_acquire.isra.32+0xda/0x1a30 > PGD 461e067 PUD 461f063 > PMD 1e001e1 Too lazy to manually decode this right now, but I don't think it matters. > Oops: 0003 [#1] PREEMPT SMP KASAN Note this is SMP, but that just means CONFIG_SMP=y. Vegard, how many CPUs does your kernel think you have? > RIP: 0010:[] [] > __lock_acquire.isra.32+0xda/0x1a30 > RSP: 0018:8801bab8f730 EFLAGS: 00010082 > RAX: 81ff071f RBX: RCX: RAX points to kernel text. > Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85 > c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e> > ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00 2b:* 3e ff 80 98 01 00 00 incl %ds:0x198(%rax) <-- trapping instruction That's very strange. I think this is: atomic_inc((atomic_t *)&class->ops); but my kernel contains: 3cb4: f0 ff 80 98 01 00 00 lock incl 0x198(%rax) So your kernel has been smp-alternatived. That 3e comes from alternatives_smp_unlock. If you're running on SMP with UP alternatives, things will break. What's your kernel command line? Can we have your entire kernel log from boot? Adding Borislav, since he's the guru for this code.
Re: bio linked list corruption.
On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum wrote: > > The warning shows that it made it past the list_empty_careful() check > in finish_wait() but then bugs out on the &wait->task_list > dereference. > > Anything stick out? I hate that shmem waitqueue garbage. It's really subtle. I think the problem is that "wake_up_all()" in shmem_fallocate() doesn't necessarily wake up everything. It wakes up TASK_NORMAL - which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean "everything on the list". I think that what happens is that the waiters somehow move from TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that wake_up_all() will ignore them, leave them on the list, and now that list on stack is no longer empty at the end. And the way *THAT* can happen is that the task is on some *other* waitqueue as well, and that other waitqueue wakes it up. That's not impossible, you can certainly have people on wait-queues that still take faults. Or somebody just uses a directed wake_up_process() or something. Since you apparently can recreate this fairly easily, how about trying this stupid patch? NOTE! This is entirely untested. I may have screwed this up entirely. You get the idea, though - just remove the wait queue head from the list - the list entries stay around, but nothing points to the stack entry (that we're going to free) any more. And add the warning to see if this actually ever triggers (and because I'd like to see the callchain when it does, to see if it's another waitqueue somewhere or what..) Linus

diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..a80148b43476 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2665,6 +2665,8 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		spin_lock(&inode->i_lock);
 		inode->i_private = NULL;
 		wake_up_all(&shmem_falloc_waitq);
+		if (WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list)))
+			list_del(&shmem_falloc_waitq.task_list);
 		spin_unlock(&inode->i_lock);
 		error = 0;
 		goto out;
Re: bio linked list corruption.
On Mon, Dec 05, 2016 at 06:09:29PM +0100, Vegard Nossum wrote: > On 5 December 2016 at 12:10, Vegard Nossum wrote: > > On 5 December 2016 at 00:04, Vegard Nossum wrote: > >> FWIW I hit this as well: > >> > >> BUG: unable to handle kernel paging request at 81ff08b7 > >> IP: [] __lock_acquire.isra.32+0xda/0x1a30 > >> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: GB 4.9.0-rc7+ > >> #217 > > [...] > > > >> I think you can rule out btrfs in any case, probably block layer as > >> well, since it looks like this comes from shmem. > > > > I should rather say that the VM runs on a 9p root filesystem and it > > doesn't use/mount any block devices or disk-based filesystems. > > > > I have all the trinity logs for the crash if that's useful. I tried a > > couple of runs with just the (at the time) in-progress syscalls but it > > didn't turn up anything interesting. Otherwise it seems like a lot of > > data to go through by hand. > > I've hit this another 7 times in the past ~3 hours. > > Three times the address being dereferenced has pointed to > iov_iter_init+0xaf (even across a kernel rebuild), three times it has > pointed to put_prev_entity+0x55, once to 0x80008, and twice to > 0x292. The fact that it would hit even one of those more than once > across runs is pretty suspicious to me, although the ones that point > to iov_iter_init and put_prev_entity point to "random" instructions in > the sense that they are neither entry points nor return addresses. > > shmem_fault() was always on the stack, but it came from different > syscalls: add_key(), newuname(), pipe2(), newstat(), fstat(), > clock_settime(), mount(), etc. > [ cut here ] > The warning shows that it made it past the list_empty_careful() check > in finish_wait() but then bugs out on the &wait->task_list > dereference. I just pushed out the ftrace changes I made to Trinity that might help you gather more clues. Right now it's hardcoded to dump a trace to /boot/trace.txt when it detects the kernel has become tainted. 
Before a trinity run, I run this as root..

#!/bin/sh
cd /sys/kernel/debug/tracing/
echo 1 > buffer_size_kb
echo function >> current_tracer
for i in $(cat /home/davej/blacklist-symbols)
do
  echo $i >> set_ftrace_notrace
done
echo 1 >> tracing_on

blacklist-symbols is the more noisy stuff that pollutes traces. Right now I use these: https://paste.fedoraproject.org/499794/14809582/ You may need to add some more. (I'll get around to making all this scripting go away, and have trinity just set this stuff up itself eventually) Oh, and if you're not running as root, you might need a diff like below so that trinity can stop the trace when it detects tainting.

Dave

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8696ce6bf2f6..2d6c97e871e0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7217,7 +7217,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 	trace_create_file("trace_clock", 0644, d_tracer, tr,
 			  &trace_clock_fops);
 
-	trace_create_file("tracing_on", 0644, d_tracer,
+	trace_create_file("tracing_on", 0666, d_tracer,
 			  tr, &rb_simple_fops);
 
 	create_trace_options_dir(tr);
Re: bio linked list corruption.
On 5 December 2016 at 12:10, Vegard Nossum wrote: > On 5 December 2016 at 00:04, Vegard Nossum wrote: >> FWIW I hit this as well: >> >> BUG: unable to handle kernel paging request at 81ff08b7 >> IP: [] __lock_acquire.isra.32+0xda/0x1a30 >> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: GB 4.9.0-rc7+ #217 > [...] > >> I think you can rule out btrfs in any case, probably block layer as >> well, since it looks like this comes from shmem. > > I should rather say that the VM runs on a 9p root filesystem and it > doesn't use/mount any block devices or disk-based filesystems. > > I have all the trinity logs for the crash if that's useful. I tried a > couple of runs with just the (at the time) in-progress syscalls but it > didn't turn up anything interesting. Otherwise it seems like a lot of > data to go through by hand. I've hit this another 7 times in the past ~3 hours. Three times the address being dereferenced has pointed to iov_iter_init+0xaf (even across a kernel rebuild), three times it has pointed to put_prev_entity+0x55, once to 0x80008, and twice to 0x292. The fact that it would hit even one of those more than once across runs is pretty suspicious to me, although the ones that point to iov_iter_init and put_prev_entity point to "random" instructions in the sense that they are neither entry points nor return addresses. shmem_fault() was always on the stack, but it came from different syscalls: add_key(), newuname(), pipe2(), newstat(), fstat(), clock_settime(), mount(), etc. I also got this warning which is related: [ cut here ] WARNING: CPU: 9 PID: 25045 at lib/list_debug.c:59 __list_del_entry+0x14f/0x1d0 list_del corruption. prev->next should be 88014bdc79e8, but was 88014bfbfc60 Kernel panic - not syncing: panic_on_warn set ... 
CPU: 9 PID: 25045 Comm: trinity-c22 Not tainted 4.9.0-rc7+ #219 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 88014bdc7700 81fb0861 83e74b60 88014bdc77d8 84006c00 847103e0 88014bdc77c8 81515244 41b58ab3 844e21c2 81515061 0054 Call Trace: [] dump_stack+0x83/0xb2 [] panic+0x1e3/0x3ad [] ? percpu_up_read_preempt_enable.constprop.45+0xcb/0xcb [] ? __list_del_entry+0x14f/0x1d0 [] __warn+0x1bf/0x1e0 [] ? __lock_acquire.isra.32+0xc2/0x1a30 [] warn_slowpath_fmt+0xac/0xd0 [] ? __warn+0x1e0/0x1e0 [] ? finish_wait+0xb0/0x180 [] __list_del_entry+0x14f/0x1d0 [] ? finish_wait+0xb0/0x180 [] finish_wait+0xbb/0x180 [] shmem_fault+0x4c7/0x6b0 [] ? shmem_getpage_gfp+0x673/0x1c90 [] ? shmem_getpage_gfp+0x1c90/0x1c90 [] ? wake_atomic_t_function+0x210/0x210 [] __do_fault+0x206/0x410 [] ? do_page_mkwrite+0x320/0x320 [] ? handle_mm_fault+0x1cc/0x2a60 [] handle_mm_fault+0x10f7/0x2a60 [] ? handle_mm_fault+0x132/0x2a60 [] ? thread_group_cputime+0x49f/0x6e0 [] ? __pmd_alloc+0x370/0x370 [] ? thread_group_cputime+0x4bc/0x6e0 [] ? thread_group_cputime_adjusted+0x6d/0xe0 [] ? __do_page_fault+0x220/0x9f0 [] ? find_vma+0x30/0x150 [] __do_page_fault+0x452/0x9f0 [] trace_do_page_fault+0x1e5/0x3a0 [] do_async_page_fault+0x27/0xa0 [] async_page_fault+0x28/0x30 [] ? copy_user_generic_string+0x2c/0x40 [] ? SyS_times+0x93/0x110 [] ? do_sys_times+0x2b0/0x2b0 [] ? do_sys_times+0x2b0/0x2b0 [] do_syscall_64+0x1af/0x4d0 [] entry_SYSCALL64_slow_path+0x25/0x25 [ cut here ] The warning shows that it made it past the list_empty_careful() check in finish_wait() but then bugs out on the &wait->task_list dereference. Anything stick out? Vegard
Re: bio linked list corruption.
On 5 December 2016 at 00:04, Vegard Nossumwrote: > FWIW I hit this as well: > > BUG: unable to handle kernel paging request at 81ff08b7 > IP: [] __lock_acquire.isra.32+0xda/0x1a30 > CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: GB 4.9.0-rc7+ #217 [...] > I think you can rule out btrfs in any case, probably block layer as > well, since it looks like this comes from shmem. I should rather say that the VM runs on a 9p root filesystem and it doesn't use/mount any block devices or disk-based filesystems. I have all the trinity logs for the crash if that's useful. I tried a couple of runs with just the (at the time) in-progress syscalls but it didn't turn up anything interesting. Otherwise it seems like a lot of data to go through by hand. The crashing child seems to have just started, though, if that's relevant: [child56:21744] [0] open(filename="/sys/block/loop2/power/runtime_active_time", flags=0x777b01, mode=666) = -1 (Not a directory) [child56:21744] [1] [32BIT] sched_getattr(pid=1, param=0x7f37ec26c000, size=3415) = -1 (Invalid argument) [child56:21744] [2] [32BIT] access(filename="/proc/2/task/2/net/stat/arp_cache", mode=2000) = -1 (Invalid argument) [child56:21744] [3] getegid() = 0xfffe [child56:21744] [4] swapoff(path="/proc/721/task/721/net/dev_snmp6/tunl0") = -1 (Operation not permitted) [child56:21744] [5] timerfd_create(clockid=0x0, flags=0x0) = 439 [child56:21744] [6] pkey_mprotect(start=0x7f37ee656000, len=0, prot=0x108, key=0x600) = 0 [child56:21744] [7] msync(start=0x7f37ee657000, len=0, flags=0x6) = 0 [child56:21744] [8] flock(fd=168, cmd=0xffc191f30b0c) = -1 (Invalid argument) [child56:21744] [9] add_key(_type=0x437a15, _description=0x7f37ec06c000, _payload=0x7f37ec06c000, plen=0, ringid=0xfff8) The other logfiles end thusly: ==> trinity-child0.log <== [child0:21593] [311] faccessat(dfd=246, filename="/proc/983/task/983/net/protocols", mode=2000) = -1 (Invalid argument) [child0:21593] [312] renameat(olddfd=246, 
oldname="/proc/13/task/13/attr/sockcreate", newdfd=377, newname="/proc/16/task/16/net/stat/rt_cache") = -1 (Permission denied) [child0:21593] [313] [32BIT] readv(fd=289, vec=0x2e1a3d0, vlen=215) = 0 ==> trinity-child100.log <== [child100:21536] [439] setgid(gid=0x2a000200) = -1 (Operation not permitted) [child100:21536] [440] waitid(which=175, upid=21587, infop=0x4, options=3542, ru=0x7f37ec76c000) = -1 (Invalid argument) [child100:21536] [441] getxattr(pathname="/proc/980/task/980/net/ptype", name=0x7f37ee466000, value=0x7f37ec26c000, size=49) = -1 (Operation not supported) ==> trinity-child101.log <== [child101:21537] [55] getcwd(buf=0x7f37ee466000, size=4096) = 39 [child101:21537] [56] [32BIT] munlock(addr=0x7f37ee658000, len=0) = 0 [child101:21537] [57] semctl(semid=0xbd851e2b40e7df, semnum=0x1b1b1b1b1b, cmd=0x20, arg=0xcacacacaca) = -1 (Invalid argument) ==> trinity-child102.log <== [child102:21542] [11] readahead(fd=353, offset=2, count=249) = -1 (Invalid argument) [child102:21542] [12] add_key(_type=0x43793f, _description=0x7f37ec46c000, _payload=0x7f37ee658000, plen=32, ringid=0xfffa) = -1 (Invalid argument) [child102:21542] [13] time(tloc=0x7f37ee466000) = 0x584474e0 ==> trinity-child103.log <== [child103:21543] [45] dup(fildes=183) = 512 [child103:21543] [46] rt_sigpending(set=0x7f37ec86c000, sigsetsize=32) = -1 (Invalid argument) [child103:21543] [47] newstat(filename="/proc/587/task/587/gid_map", statbuf=0x7f37ee466000) = 0 ==> trinity-child104.log <== [child104:21546] [49] getdents(fd=162, dirent=0x0, count=127) = -1 (Not a directory) [child104:21546] [50] [32BIT] clock_adjtime(which_clock=0, utx=0x4) = -1 (Bad address) [child104:21546] [51] setsid() = 0x542a ==> trinity-child105.log <== [child105:21547] [523] epoll_wait(epfd=244, events=0x8, maxevents=246, timeout=-1) = -1 (Invalid argument) [child105:21547] [524] dup2(oldfd=244, newfd=244) = 244 [child105:21547] [525] acct(name=0x7f37ec26c000) = -1 (Operation not permitted) ==> trinity-child106.log 
<== [child106:19910] [136] getegid() = 0xfffe [child106:19910] [137] munmap(addr=0x7f37ee65a000, len=4096) = 0 [child106:19910] [138] clock_nanosleep(which_clock=0x1, flags=0x1, rqtp=0x7f37ec06c000, rmtp=0x7f37ee466000) ==> trinity-child107.log <== [child107:21224] [994] copy_file_range(fd_in=373, off_in=0x2400e210, fd_out=373, off_out=8, len=8, flags=0x0) = -1 (Bad file descriptor) [child107:21224] [995] kcmp(pid1=1, pid2=21453, type=0x5, idx1=0x787878787878, idx2=0xff6060606060) = -1 (Operation not permitted) [child107:21224] [996] [32BIT] readv(fd=365, vec=0x2e27e10, vlen=36) = 0 ==> trinity-child108.log <== [child108:21226] [759] recvfrom(fd=219, ubuf=0x7f37ec26c000, size=8, flags=0x0, addr=0x2e1ed80, addr_len=110) = -1 (Bad file descriptor) [child108:21226] [760] shmat(shmid=-4097, shmaddr=0x7f37ee465000, shmflg=-195) = -1 (Invalid argument) [child108:21226] [761] [32BIT] seccomp(op=0x0, flags=0x0,
Re: bio linked list corruption.
On 23 November 2016 at 20:58, Dave Joneswrote: > On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: > > > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 > > trace from just before this happened. Does this shed any light ? > > > > https://codemonkey.org.uk/junk/trace.txt > > crap, I just noticed the timestamps in the trace come from quite a bit > later. I'll tweak the code to do the taint checking/ftrace stop after > every syscall, that should narrow the window some more. FWIW I hit this as well: BUG: unable to handle kernel paging request at 81ff08b7 IP: [] __lock_acquire.isra.32+0xda/0x1a30 PGD 461e067 PUD 461f063 PMD 1e001e1 Oops: 0003 [#1] PREEMPT SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: GB 4.9.0-rc7+ #217 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 task: 8801ee924080 task.stack: 8801bab88000 RIP: 0010:[] [] __lock_acquire.isra.32+0xda/0x1a30 RSP: 0018:8801bab8f730 EFLAGS: 00010082 RAX: 81ff071f RBX: RCX: RDX: 0003 RSI: RDI: 85ae7d00 RBP: 8801bab8f7b0 R08: 0001 R09: R10: 8801e727fc40 R11: fbfff0b54ced R12: 85ae7d00 R13: 84912920 R14: 8801ee924080 R15: FS: 7f37ee653700() GS:8801f6a0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 81ff08b7 CR3: 0001daa7 CR4: 06f0 DR0: 7f37ee465000 DR1: DR2: DR3: DR6: 0ff0 DR7: 0600 Stack: 8801ee9247d0 0001 8801ee924080 8801f6a201c0 8801f6a201c0 0001 8801 8801 8801e727fc40 8801ee924080 Call Trace: [] lock_acquire+0x141/0x2b0 [] ? finish_wait+0xb0/0x180 [] _raw_spin_lock_irqsave+0x49/0x60 [] ? finish_wait+0xb0/0x180 [] finish_wait+0xb0/0x180 [] shmem_fault+0x4c7/0x6b0 [] ? p9_client_rpc+0x13d/0xf40 [] ? shmem_getpage_gfp+0x1c90/0x1c90 [] ? radix_tree_next_chunk+0x4f7/0x840 [] ? wake_atomic_t_function+0x210/0x210 [] __do_fault+0x206/0x410 [] ? do_page_mkwrite+0x320/0x320 [] handle_mm_fault+0x1cef/0x2a60 [] ? handle_mm_fault+0x132/0x2a60 [] ? __pmd_alloc+0x370/0x370 [] ? 
inode_add_bytes+0x10e/0x160 [] ? memset+0x31/0x40 [] ? find_vma+0x30/0x150 [] __do_page_fault+0x452/0x9f0 [] ? iov_iter_init+0xaf/0x1d0 [] trace_do_page_fault+0x1e5/0x3a0 [] do_async_page_fault+0x27/0xa0 [] async_page_fault+0x28/0x30 [] ? strnlen_user+0x91/0x1a0 [] ? strnlen_user+0x6e/0x1a0 [] strndup_user+0x28/0xb0 [] SyS_add_key+0xc7/0x370 [] ? key_get_type_from_user.constprop.6+0xd0/0xd0 [] ? __context_tracking_exit.part.4+0x3a/0x1e0 [] ? key_get_type_from_user.constprop.6+0xd0/0xd0 [] do_syscall_64+0x1af/0x4d0 [] entry_SYSCALL64_slow_path+0x25/0x25 Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85 c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e> ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00 RIP [] __lock_acquire.isra.32+0xda/0x1a30 I didn't read all the emails in this thread, the crash site looks identical to one of the earlier traces although the caller may be different. I think you can rule out btrfs in any case, probably block layer as well, since it looks like this comes from shmem. Vegard
Re: bio linked list corruption.
On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 > trace from just before this happened. Does this shed any light ? > > https://codemonkey.org.uk/junk/trace.txt crap, I just noticed the timestamps in the trace come from quite a bit later. I'll tweak the code to do the taint checking/ftrace stop after every syscall, that should narrow the window some more. Getting closer.. Dave
Re: bio linked list corruption.
On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote: > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones> >wrote: > >> > >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > >> page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 > >> index:0x100c > >> flags: 0x400c(referenced|uptodate) > >> page dumped because: non-NULL mapping > > > >Hmm. So this seems to be btrfs-specific, right? > > > >I searched for all your "non-NULL mapping" cases, and they all seem to > >have basically the same call trace, with some work thread doing > >writeback and going through btrfs_writepages(). > > > >Sounds like it's a race with either fallocate hole-punching or > >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ > >clearly ran other filesystems too but I am not seeing this backtrace > >for anything else. > > Agreed, I think this is a separate bug, almost certainly btrfs specific. > I'll work with Dave on a better reproducer. Ok, let's try this.. 
Here's the usual dump, from the current tree: [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 [ 317.702235] page:ea001363f500 count:0 mapcount:0 mapping:8804eef8e0e0 index:0x29 [ 317.718068] flags: 0x400c(referenced|uptodate) [ 317.731086] page dumped because: non-NULL mapping [ 317.744118] CPU: 3 PID: 1558 Comm: kworker/u8:8 Not tainted 4.9.0-rc6-think+ #1 [ 317.770404] Workqueue: writeback wb_workfn [ 317.783490] (flush-btrfs-3) [ 317.783581] c9ff3798 [ 317.796651] 813b69bc [ 317.796742] ea001363f500 [ 317.796798] 81c474ee [ 317.796854] c9ff37c0 [ 317.809910] 811b30c4 [ 317.810001] [ 317.810055] ea001363f500 [ 317.810112] 0003 [ 317.823230] c9ff37d0 [ 317.823320] 811b318f [ 317.823376] c9ff3818 [ 317.823431] Call Trace: [ 317.836645] [] dump_stack+0x4f/0x73 [ 317.850063] [] bad_page+0xc4/0x130 [ 317.863426] [] free_pages_check_bad+0x5f/0x70 [ 317.876805] [] free_hot_cold_page+0x305/0x3b0 [ 317.890132] [] free_hot_cold_page_list+0x7e/0x170 [ 317.903410] [] release_pages+0x2e7/0x3a0 [ 317.916600] [] __pagevec_release+0x27/0x40 [ 317.929817] [] extent_write_cache_pages.isra.44.constprop.59+0x355/0x430 [btrfs] [ 317.943073] [] extent_writepages+0x5d/0x90 [btrfs] [ 317.956033] [] ? btrfs_releasepage+0x40/0x40 [btrfs] [ 317.968902] [] btrfs_writepages+0x28/0x30 [btrfs] [ 317.981744] [] do_writepages+0x21/0x30 [ 317.994524] [] __writeback_single_inode+0x7f/0x6e0 [ 318.007333] [] ? _raw_spin_unlock+0x31/0x50 [ 318.020090] [] writeback_sb_inodes+0x31d/0x6c0 [ 318.032844] [] __writeback_inodes_wb+0x92/0xc0 [ 318.045623] [] wb_writeback+0x3f7/0x530 [ 318.058304] [] wb_workfn+0x148/0x620 [ 318.070949] [] ? process_one_work+0x194/0x690 [ 318.083549] [] ? wb_workfn+0x5/0x620 [ 318.096160] [] process_one_work+0x232/0x690 [ 318.108722] [] ? process_one_work+0x194/0x690 [ 318.121225] [] worker_thread+0x4e/0x490 [ 318.133680] [] ? process_one_work+0x690/0x690 [ 318.146126] [] ? 
process_one_work+0x690/0x690 [ 318.158507] [] kthread+0x102/0x120 [ 318.170880] [] ? kthread_park+0x60/0x60 [ 318.183254] [] ret_from_fork+0x22/0x30 This time, I managed to get my ftrace magic working, so we have this trace from just before this happened. Does this shed any light ? https://codemonkey.org.uk/junk/trace.txt (note that some of the spinlock/rcu calls will be missing here. I chose to add them to ftrace's exclude list because they dominate the trace buffer so much, especially in debug builds) Dave
Re: bio linked list corruption.
On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones wrote: BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c flags: 0x400c(referenced|uptodate) page dumped because: non-NULL mapping Hmm. So this seems to be btrfs-specific, right? I searched for all your "non-NULL mapping" cases, and they all seem to have basically the same call trace, with some work thread doing writeback and going through btrfs_writepages(). Sounds like it's a race with either fallocate hole-punching or truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ clearly ran other filesystems too but I am not seeing this backtrace for anything else. Agreed, I think this is a separate bug, almost certainly btrfs specific. I'll work with Dave on a better reproducer. -chris
Re: bio linked list corruption.
On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones wrote: > > BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c > flags: 0x400c(referenced|uptodate) > page dumped because: non-NULL mapping Hmm. So this seems to be btrfs-specific, right? I searched for all your "non-NULL mapping" cases, and they all seem to have basically the same call trace, with some work thread doing writeback and going through btrfs_writepages(). Sounds like it's a race with either fallocate hole-punching or truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ clearly ran other filesystems too but I am not seeing this backtrace for anything else. Linus
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 07:47:51PM -0400, Dave Jones wrote: > On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > > > >-hctx->queued++; > > >-data->hctx = hctx; > > >-data->ctx = ctx; > > >+data->hctx = alloc_data.hctx; > > >+data->ctx = alloc_data.ctx; > > >+data->hctx->queued++; > > > return rq; > > > } > > > > This made it through an entire dbench 2048 run on btrfs. My script has > > it running in a loop, but this is farther than I've gotten before. > > Looking great so far. > > Fixed the splat during boot for me too. > Now the fun part, let's see if it fixed the 'weird shit' that Trinity > was stumbling on. It took a while, but.. bad news. BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c flags: 0x400c(referenced|uptodate) page dumped because: non-NULL mapping CPU: 3 PID: 1586 Comm: kworker/u8:12 Not tainted 4.9.0-rc3-think+ #1 Workqueue: writeback wb_workfn (flush-btrfs-2) c9777828 8130d04c ea0013838e40 819ff654 c9777850 81150e5f ea0013838e40 400c c9777860 81150f1a c97778a8 Call Trace: [] dump_stack+0x4f/0x73 [] bad_page+0xbf/0x120 [] free_pages_check_bad+0x5a/0x70 [] free_hot_cold_page+0x248/0x290 [] free_hot_cold_page_list+0x2b/0x50 [] release_pages+0x2bd/0x350 [] __pagevec_release+0x22/0x30 [] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 [btrfs] [] extent_writepages+0x49/0x60 [btrfs] [] ? btrfs_releasepage+0x40/0x40 [btrfs] [] btrfs_writepages+0x23/0x30 [btrfs] [] do_writepages+0x1c/0x30 [] __writeback_single_inode+0x33/0x180 [] writeback_sb_inodes+0x2a8/0x5b0 [] __writeback_inodes_wb+0x8d/0xc0 [] wb_writeback+0x1e3/0x1f0 [] wb_workfn+0xd2/0x280 [] process_one_work+0x1d5/0x490 [] ? process_one_work+0x175/0x490 [] worker_thread+0x49/0x490 [] ? process_one_work+0x490/0x490 [] kthread+0xee/0x110 [] ? kthread_park+0x60/0x60 [] ret_from_fork+0x22/0x30
Re: bio linked list corruption.
On Thu, Oct 27, 2016 at 04:41:33PM +1100, Dave Chinner wrote: > And that's indicative of a delalloc metadata reservation being > too small and so we're allocating unreserved blocks. > > Different symptoms, same underlying cause, I think. > > I see the latter assert from time to time in my testing, but it's > not common (maybe once a month) and I've never been able to track it > down. However, it doesn't affect production systems unless they hit > ENOSPC hard enough that it causes the critical reserve pool to be > exhausted and so the allocation fails. That's extremely rare - > usually takes several hundred processes all trying to write as hard > as they can concurrently and to all slip through the ENOSPC > detection without the correct metadata reservations and all require > multiple metadata blocks to be allocated during writeback... > > If you've got a way to trigger it quickly and reliably, that would > be helpful... Seems pretty quickly reproducible for me in some shape or form. Run trinity with --enable-fds=testfile and create enough children to create a fair bit of contention (for me -C64 seems a good fit on spinning rust, but if you're running on shiny nvme you might have to pump it up a bit). Dave
Re: bio linked list corruption.
On 10/27/2016 10:34 AM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig wrote: Dave, can you hit the warnings with this? Totally untested... Can we just kill off the unhelpful blk_map_ctx structure, e.g.: Yeah, I found that hard to read too. The difference between blk_map_ctx and blk_mq_alloc_data is minimal, might as well just use the latter everywhere. But please separate that patch from the patch that fixes the ctx list corruption. Agreed. And Jens - can I have at least the corruption fix asap? Due to KS travel, I'm going to do rc3 on Saturday, and I'd like to have the fix in at least a day before that.. Just committed half an hour ago with the acks etc, I'll send it to you shortly. -- Jens Axboe
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig wrote: >> Dave, can you hit the warnings with this? Totally untested... > > Can we just kill off the unhelpful blk_map_ctx structure, e.g.: Yeah, I found that hard to read too. The difference between blk_map_ctx and blk_mq_alloc_data is minimal, might as well just use the latter everywhere. But please separate that patch from the patch that fixes the ctx list corruption. And Jens - can I have at least the corruption fix asap? Due to KS travel, I'm going to do rc3 on Saturday, and I'd like to have the fix in at least a day before that.. Linus
Re: bio linked list corruption.
On 10/26/2016 08:00 PM, Jens Axboe wrote: > On 10/26/2016 05:47 PM, Dave Jones wrote: >> On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: >> >> > >-hctx->queued++; >> > >-data->hctx = hctx; >> > >-data->ctx = ctx; >> > >+data->hctx = alloc_data.hctx; >> > >+data->ctx = alloc_data.ctx; >> > >+data->hctx->queued++; >> > > return rq; >> > > } >> > >> > This made it through an entire dbench 2048 run on btrfs. My script >> has >> > it running in a loop, but this is farther than I've gotten before. >> > Looking great so far. >> >> Fixed the splat during boot for me too. >> Now the fun part, let's see if it fixed the 'weird shit' that Trinity >> was stumbling on. > > Let's let the testing simmer overnight, then I'll turn this into a real > patch tomorrow and get it submitted. > I ran all night on both btrfs and xfs. XFS came out clean, but btrfs hit the WARN_ON below. I hit it a few times with Jens' patch, always the same warning. It's pretty obviously a btrfs bug, we're not cleaning up this list properly during fsync. I tried a v1 of a btrfs fix overnight, but I see where it was incomplete now and will re-run. For the blk-mq bug, I think we got it! Tested-by: always-blaming-jens-from-now-on WARNING: CPU: 5 PID: 16163 at lib/list_debug.c:62 __list_del_entry+0x86/0xd0 list_del corruption. next->prev should be 8801196d3be0, but was 88010fc63308 Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper i2c_piix4 lrw i2c_core gf128mul ablk_helper virtio_net serio_raw button pcspkr floppy cryptd sch_fq_codel autofs4 virtio_blk CPU: 5 PID: 16163 Comm: dbench Not tainted 4.9.0-rc2-00041-g811d54d-dirty #322 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 8801196d3a68 814fde3f 8151c356 8801196d3ac8 8801196d3ac8 8801196d3ab8 810648cf dead0100 003e813bfc4a 8801196d3b98 880122b5c800 Call Trace: [] dump_stack+0x53/0x74 [] ? 
__list_del_entry+0x86/0xd0 [] __warn+0xff/0x120 [] warn_slowpath_fmt+0x49/0x50 [] __list_del_entry+0x86/0xd0 [] btrfs_sync_log+0x75d/0xbd0 [] ? btrfs_log_inode_parent+0x547/0xbb0 [] ? _raw_spin_lock+0x1b/0x40 [] ? __might_sleep+0x53/0xa0 [] ? dput+0x65/0x280 [] ? btrfs_log_dentry_safe+0x77/0x90 [] btrfs_sync_file+0x424/0x490 [] ? SYSC_kill+0xba/0x1d0 [] ? __sb_end_write+0x58/0x80 [] vfs_fsync_range+0x4c/0xb0 [] ? syscall_trace_enter+0x201/0x2e0 [] vfs_fsync+0x1c/0x20 [] do_fsync+0x3d/0x70 [] ? syscall_slow_exit_work+0xfb/0x100 [] SyS_fsync+0x10/0x20 [] do_syscall_64+0x55/0xd0 [] ? prepare_exit_to_usermode+0x37/0x40 [] entry_SYSCALL64_slow_path+0x25/0x25 ---[ end trace c93288442a6424aa ]---
Re: bio linked list corruption.
> Dave, can you hit the warnings with this? Totally untested... Can we just kill off the unhelpful blk_map_ctx structure, e.g.: diff --git a/block/blk-mq.c b/block/blk-mq.c index ddc2eed..d74a74a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1190,21 +1190,15 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx, } } -struct blk_map_ctx { - struct blk_mq_hw_ctx *hctx; - struct blk_mq_ctx *ctx; -}; - static struct request *blk_mq_map_request(struct request_queue *q, struct bio *bio, - struct blk_map_ctx *data) + struct blk_mq_alloc_data *data) { struct blk_mq_hw_ctx *hctx; struct blk_mq_ctx *ctx; struct request *rq; int op = bio_data_dir(bio); int op_flags = 0; - struct blk_mq_alloc_data alloc_data; blk_queue_enter_live(q); ctx = blk_mq_get_ctx(q); @@ -1214,12 +1208,10 @@ static struct request *blk_mq_map_request(struct request_queue *q, op_flags |= REQ_SYNC; trace_block_getrq(q, bio, op); - blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx); - rq = __blk_mq_alloc_request(&alloc_data, op, op_flags); + blk_mq_set_alloc_data(data, q, 0, ctx, hctx); + rq = __blk_mq_alloc_request(data, op, op_flags); - hctx->queued++; - data->hctx = hctx; - data->ctx = ctx; + data->hctx->queued++; return rq; } @@ -1267,7 +1259,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) { const int is_sync = rw_is_sync(bio_op(bio), bio->bi_opf); const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA); - struct blk_map_ctx data; + struct blk_mq_alloc_data data; struct request *rq; unsigned int request_count = 0; struct blk_plug *plug; @@ -1363,7 +1355,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA); struct blk_plug *plug; unsigned int request_count = 0; - struct blk_map_ctx data; + struct blk_mq_alloc_data data; struct request *rq; blk_qc_t cookie;
Re: bio linked list corruption.
On Tue, Oct 25, 2016 at 08:27:52PM -0400, Dave Jones wrote: > DaveC: Do these look like real problems, or is this more "looks like > random memory corruption" ? It's been a while since I did some stress > testing on XFS, so these might not be new.. > > XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: > 2938 > [ cut here ] > kernel BUG at fs/xfs/xfs_message.c:113! > invalid opcode: [#1] PREEMPT SMP > CPU: 1 PID: 6227 Comm: trinity-c9 Not tainted 4.9.0-rc1-think+ #6 > task: 8804f4658040 task.stack: 88050568c000 > RIP: 0010:[] > [] assfail+0x1b/0x20 [xfs] > RSP: :88050568f9e8 EFLAGS: 00010282 > RAX: ffea RBX: 0046 RCX: 0001 > RDX: ffc0 RSI: 000a RDI: a02fe34d > RBP: 88050568f9e8 R08: R09: > R10: 000a R11: f000 R12: 88050568fb44 > R13: 00f3 R14: 8804f292bf88 R15: 000e0046 > FS: 7fe2ddfdfb40() GS:88050a00() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fe2dbabd000 CR3: 0004f461f000 CR4: 001406e0 > Stack: > 88050568fa88 a027ccee fff9 8804f16fd8b0 > 3ffa 0032 8804f292bf40 4976 > 000e0008 04fd 8804 5107 > Call Trace: > [] xfs_bmap_add_extent_hole_delay+0x54e/0x620 [xfs] > [] xfs_bmapi_reserve_delalloc+0x2b4/0x400 [xfs] > [] xfs_file_iomap_begin_delay.isra.12+0x247/0x3c0 [xfs] > [] xfs_file_iomap_begin+0x181/0x270 [xfs] > [] iomap_apply+0x53/0x100 > [] iomap_file_buffered_write+0x6b/0x90 All this iomap code is new, so it's entirely possible that this is a new bug. The assert failure is indicative that the delalloc extent's metadata reservation grew when we expected it to stay the same or shrink. > XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: > fs/xfs/xfs_trans.c, line: 309 ... > [] xfs_trans_mod_sb+0x241/0x280 [xfs] > [] xfs_ag_resv_alloc_extent+0x4f/0xc0 [xfs] > [] xfs_alloc_ag_vextent+0x23d/0x300 [xfs] > [] xfs_alloc_vextent+0x5fb/0x6d0 [xfs] > [] xfs_bmap_btalloc+0x304/0x8e0 [xfs] > [] ? 
xfs_iext_bno_to_ext+0xee/0x170 [xfs] > [] xfs_bmap_alloc+0x2b/0x40 [xfs] > [] xfs_bmapi_write+0x640/0x1210 [xfs] > [] xfs_iomap_write_allocate+0x166/0x350 [xfs] > [] xfs_map_blocks+0x1b0/0x260 [xfs] > [] xfs_do_writepage+0x23b/0x730 [xfs] And that's indicative of a delalloc metadata reservation being too small and so we're allocating unreserved blocks. Different symptoms, same underlying cause, I think. I see the latter assert from time to time in my testing, but it's not common (maybe once a month) and I've never been able to track it down. However, it doesn't affect production systems unless they hit ENOSPC hard enough that it causes the critical reserve pool to be exhausted and so the allocation fails. That's extremely rare - usually takes several hundred processes all trying to write as hard as they can concurrently and to all slip through the ENOSPC detection without the correct metadata reservations and all require multiple metadata blocks to be allocated during writeback... If you've got a way to trigger it quickly and reliably, that would be helpful... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: bio linked list corruption.
On 10/26/2016 05:47 PM, Dave Jones wrote: On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > >- hctx->queued++; > >- data->hctx = hctx; > >- data->ctx = ctx; > >+ data->hctx = alloc_data.hctx; > >+ data->ctx = alloc_data.ctx; > >+ data->hctx->queued++; > > return rq; > > } > > This made it through an entire dbench 2048 run on btrfs. My script has > it running in a loop, but this is farther than I've gotten before. > Looking great so far. Fixed the splat during boot for me too. Now the fun part, let's see if it fixed the 'weird shit' that Trinity was stumbling on. Let's let the testing simmer overnight, then I'll turn this into a real patch tomorrow and get it submitted. -- Jens Axboe
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > >- hctx->queued++; > >- data->hctx = hctx; > >- data->ctx = ctx; > >+ data->hctx = alloc_data.hctx; > >+ data->ctx = alloc_data.ctx; > >+ data->hctx->queued++; > >return rq; > > } > > This made it through an entire dbench 2048 run on btrfs. My script has > it running in a loop, but this is farther than I've gotten before. > Looking great so far. Fixed the splat during boot for me too. Now the fun part, let's see if it fixed the 'weird shit' that Trinity was stumbling on. Dave
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 05:20:01PM -0600, Jens Axboe wrote: On 10/26/2016 05:08 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe wrote: Actually, I think I see what might trigger it. You are on nvme, iirc, and that has a deep queue. Yes. I have long since moved on from slow disks, so all my systems are not just flash, but m.2 nvme ssd's. So at least that could explain why Dave sees it at bootup but I don't. Yep, you'd never sleep during normal boot. The original patch had a problem, though... This one should be better. Too many 'data's, we'll still need to assign ctx/hctx, we should just use the current ones, not the original ones. diff --git a/block/blk-mq.c b/block/blk-mq.c index ddc2eed64771..e56fec187ba6 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1217,9 +1224,9 @@ static struct request *blk_mq_map_request(struct request_queue *q, blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx); rq = __blk_mq_alloc_request(&alloc_data, op, op_flags); - hctx->queued++; - data->hctx = hctx; - data->ctx = ctx; + data->hctx = alloc_data.hctx; + data->ctx = alloc_data.ctx; + data->hctx->queued++; return rq; } This made it through an entire dbench 2048 run on btrfs. My script has it running in a loop, but this is farther than I've gotten before. Looking great so far. -chris
Re: bio linked list corruption.
On 10/26/2016 05:19 PM, Chris Mason wrote: On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: On 10/26/2016 04:58 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io() into two I did that myself too, since Dave sees this during boot. But I'm not getting the warning ;( Dave gets it with ext4, and that's what I have too, so I'm not sure what the required trigger would be. Actually, I think I see what might trigger it. You are on nvme, iirc, and that has a deep queue. Dave, are you testing on a sata drive or something similar with a shallower queue depth? If we end up sleeping for a request, I think we could trigger data->ctx being different. Dave, can you hit the warnings with this? Totally untested... Confirmed, totally untested ;) Don't try this one at home folks (working this out with Jens offlist) Ahem, I did say untested! The one I just sent in reply to Linus should be better. Though that one is completely untested as well... -- Jens Axboe
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: On 10/26/2016 04:58 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io() into two I did that myself too, since Dave sees this during boot. But I'm not getting the warning ;( Dave gets it with ext4, and that's what I have too, so I'm not sure what the required trigger would be. Actually, I think I see what might trigger it. You are on nvme, iirc, and that has a deep queue. Dave, are you testing on a sata drive or something similar with a shallower queue depth? If we end up sleeping for a request, I think we could trigger data->ctx being different. Dave, can you hit the warnings with this? Totally untested... Confirmed, totally untested ;) Don't try this one at home folks (working this out with Jens offlist) BUG: unable to handle kernel paging request at 02411200 IP: [] _raw_spin_lock+0x22/0x40 PGD 12840a067 PUD 128446067 PMD 0 Oops: 0002 [#1] PREEMPT SMP Modules linked in: virtio_blk(+) CPU: 4 PID: 125 Comm: modprobe Not tainted 4.9.0-rc2-00041-g811d54d-dirty #320 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 task: 88013849aac0 task.stack: 8801293d8000 RIP: 0010:[] [] _raw_spin_lock+0x22/0x40 RSP: 0018:8801293db278 EFLAGS: 00010246 RAX: RBX: 02411200 RCX: RDX: 0001 RSI: 88013a5c1048 RDI: RBP: 8801293db288 R08: 0005 R09: 880128449380 R10: R11: 0008 R12: R13: 0001 R14: 0076 R15: 8801293b6a80 FS: 7f1a2a9cdb40() GS:88013fd0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 02411200 CR3: 00013a5d1000 CR4: 000406e0 Stack: 8801293db2d0 880128488000 8801293db348 814debff 00ff8801293db2c8 8801293db338 8801284888c0 8801284888b8 60fec4f9 02411200 880128f810c0 880128f810c0 Call Trace: [] blk_sq_make_request+0x34f/0x580 [] ? 
mempool_alloc_slab+0x15/0x20 [] generic_make_request+0x104/0x200 [] submit_bio+0x65/0x130 [] submit_bh_wbc+0x16e/0x210 [] submit_bh+0x13/0x20 [] block_read_full_page+0x205/0x3d0 [] ? I_BDEV+0x20/0x20 [] ? lru_cache_add+0xe/0x10 [] ? add_to_page_cache_lru+0x92/0xf0 [] ? __page_cache_alloc+0xd1/0xe0 [] blkdev_readpage+0x18/0x20 [] do_read_cache_page+0x1c6/0x380 [] ? blkdev_writepages+0x10/0x10 [] ? alloc_pages_current+0xb2/0x1c0 [] read_cache_page+0x12/0x20 [] read_dev_sector+0x31/0xb0 [] read_lba+0xbd/0x130 [] find_valid_gpt+0xa2/0x580 [] ? find_valid_gpt+0x580/0x580 [] efi_partition+0x67/0x3d0 [] ? vsnprintf+0x2aa/0x470 [] ? snprintf+0x34/0x40 [] ? find_valid_gpt+0x580/0x580 [] check_partition+0x106/0x1e0 [] rescan_partitions+0x8c/0x270 [] __blkdev_get+0x328/0x3f0 [] blkdev_get+0x54/0x320 [] ? unlock_new_inode+0x5a/0x80 [] ? bdget+0xff/0x110 [] device_add_disk+0x3c6/0x450 [] ? ioread8+0x1a/0x40 [] ? vp_get+0x4e/0x70 [] virtblk_probe+0x460/0x708 [virtio_blk] [] ? vp_finalize_features+0x36/0x50 [] virtio_dev_probe+0x132/0x1e0 [] driver_probe_device+0x1a9/0x2d0 [] ? mutex_lock+0x24/0x50 [] __driver_attach+0xbd/0xc0 [] ? driver_probe_device+0x2d0/0x2d0 [] ? driver_probe_device+0x2d0/0x2d0 [] bus_for_each_dev+0x8a/0xb0 [] driver_attach+0x1e/0x20 [] bus_add_driver+0x1b6/0x230 [] driver_register+0x60/0xe0 [] register_virtio_driver+0x20/0x40 [] init+0x57/0x81 [virtio_blk] [] ? 0xa0004000 [] ? 0xa0004000 [] do_one_initcall+0x46/0x150 [] do_init_module+0x6a/0x210 [] ? vfree+0x37/0x90 [] load_module+0x1638/0x1860 [] ? do_free_init+0x30/0x30 [] ? 
kernel_read_file_from_fd+0x54/0x90 [] SYSC_finit_module+0xc2/0xd0 [] SyS_finit_module+0xe/0x10 [] entry_SYSCALL_64_fastpath+0x13/0x94 Code: 89 df e8 a2 52 70 ff eb e6 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 89 fb bf 01 00 00 00 e8 95 53 6e ff 31 c0 ba 01 00 00 00 0f b1 13 85 c0 75 07 48 83 c4 08 5b c9 c3 89 c6 48 89 df e8 RIP [] _raw_spin_lock+0x22/0x40 RSP CR2: 02411200 ---[ end trace e8cb117e64947621 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception -chris
Re: bio linked list corruption.
On 10/26/2016 05:08 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe wrote: Actually, I think I see what might trigger it. You are on nvme, iirc, and that has a deep queue. Yes. I have long since moved on from slow disks, so all my systems are not just flash, but m.2 nvme ssd's. So at least that could explain why Dave sees it at bootup but I don't. Yep, you'd never sleep during normal boot. The original patch had a problem, though... This one should be better. Too many 'data's, we'll still need to assign ctx/hctx, we should just use the current ones, not the original ones.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..e56fec187ba6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1217,9 +1224,9 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
 	rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
-	hctx->queued++;
-	data->hctx = hctx;
-	data->ctx = ctx;
+	data->hctx = alloc_data.hctx;
+	data->ctx = alloc_data.ctx;
+	data->hctx->queued++;
 	return rq;
 }

-- 
Jens Axboe
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 03:07:10PM -0700, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason wrote: Today I turned off every CONFIG_DEBUG_* except for list debugging, and ran dbench 2048: [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0 [ 2759.119652] list_add corruption. prev->next should be next (e8c80308), but was c9ccfb88. (prev=880128522380). [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317 [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 [ 2759.125077] c9000f6fb868 814fe4ff 8151cb5e c9000f6fb8c8 [ 2759.125077] c9000f6fb8c8 c9000f6fb8b8 81064bbf [ 2759.127444] 880128523680 002139968000 880138b7a4a0 880128523540 [ 2759.127444] Call Trace: [ 2759.127444] [] dump_stack+0x53/0x74 [ 2759.127444] [] ? __list_add+0xbe/0xd0 [ 2759.127444] [] __warn+0xff/0x120 [ 2759.127444] [] warn_slowpath_fmt+0x49/0x50 [ 2759.127444] [] __list_add+0xbe/0xd0 [ 2759.127444] [] blk_sq_make_request+0x388/0x580 Ok, that's definitely the same one that Dave started out seeing. The fact that it is that reliable - two different machines, two very different loads (dbench looks nothing like trinity) really makes me think that maybe the problem really is in the block plugging after all. It very much does not smell like random stack corruption. It's simply not random enough. Agreed. I'd feel better if I could trigger this outside of btrfs, even though Dave Chinner hit something very similar on xfs. I'll peel off another test machine for a long XFS run. And I just noticed something: I originally thought that this is the "list_add_tail()" to the plug - which is the only "list_add()" variant in that function.
But that never made sense, because the whole "but was" isn't a stack address, and "next" in "list_add_tail()" is basically fixed, and would have to be the stack. But I now notice that there's actually another "list_add()" variant there, and it's the one from __blk_mq_insert_request() that gets inlined into blk_mq_insert_request(), which then gets inlined into blk_mq_make_request(). And that actually makes some sense just looking at the offsets too: blk_sq_make_request+0x388/0x580 so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely. And there it makes perfect sense that the "next should be" value is *not* on the stack. Chris, if you built with debug info, you can try ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388 to get what line that blk_sq_make_request+0x388 address actually is. I think it's the list_add_tail(&rq->queuelist, &ctx->rq_list); in __blk_mq_insert_req_list() (when it's inlined from blk_sq_make_request(), "at_head" will be false). So it smells like "&ctx->rq_list" might be corrupt. I'm running your current git here, so these line numbers should line up for you: blk_sq_make_request+0x388/0x578: __blk_mq_insert_request at block/blk-mq.c:1049 (inlined by) blk_mq_merge_queue_io at block/blk-mq.c:1175 (inlined by) blk_sq_make_request at block/blk-mq.c:1419 The fsync path in the WARN doesn't have any plugs that I can see, so it's not surprising that we're not in the plugging path. I'm here:

	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
		/*
		 * For a SYNC request, send it to the hardware immediately. For
		 * an ASYNC request, just ensure that we run it later on. The
		 * latter allows for merging opportunities and more efficient
		 * dispatching.
		 */

I'll try the debugging patch you sent in the other email. -chris
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe wrote: > > Actually, I think I see what might trigger it. You are on nvme, iirc, > and that has a deep queue. Yes. I have long since moved on from slow disks, so all my systems are not just flash, but m.2 nvme ssd's. So at least that could explain why Dave sees it at bootup but I don't. Linus
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: > On 10/26/2016 04:58 PM, Linus Torvalds wrote: > > On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds > > wrote: > >> > >> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > >> blk_mq_merge_queue_io() into two > > > > I did that myself too, since Dave sees this during boot. > > > > But I'm not getting the warning ;( > > > > Dave gets it with ext4, and that's what I have too, so I'm not sure > > what the required trigger would be. > > Actually, I think I see what might trigger it. You are on nvme, iirc, > and that has a deep queue. Dave, are you testing on a sata drive or > something similar with a shallower queue depth? If we end up sleeping > for a request, I think we could trigger data->ctx being different. yeah, just regular sata. I've been meaning to put an ssd in that box for a while, but now it sounds like my procrastination may have paid off. > Dave, can you hit the warnings with this? Totally untested... Coming up.. Dave
Re: bio linked list corruption.
On 10/26/2016 04:58 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io() into two I did that myself too, since Dave sees this during boot. But I'm not getting the warning ;( Dave gets it with ext4, and that's what I have too, so I'm not sure what the required trigger would be. Actually, I think I see what might trigger it. You are on nvme, iirc, and that has a deep queue. Dave, are you testing on a sata drive or something similar with a shallower queue depth? If we end up sleeping for a request, I think we could trigger data->ctx being different. Dave, can you hit the warnings with this? Totally untested...

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..80a9c45a9235 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1217,9 +1217,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
 	rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
-	hctx->queued++;
-	data->hctx = hctx;
-	data->ctx = ctx;
+	data->hctx->queued++;
 	return rq;
 }

-- 
Jens Axboe
Re: bio linked list corruption.
On 10/26/2016 05:01 PM, Dave Jones wrote: On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote: > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two, since right now it can trigger both > for the > > blk_mq_bio_to_request(rq, bio); > > path _and_ for the > > if (!blk_mq_attempt_merge(q, ctx, bio)) { > blk_mq_bio_to_request(rq, bio); > goto insert_rq; > > path. If you split it into two: one before that "insert_rq:" label, > and one before the "goto insert_rq" thing, then we could see if it is > just one of the blk_mq_merge_queue_io() cases (or both) that is > broken.. It's the latter of the two. [ 12.302392] WARNING: CPU: 3 PID: 272 at block/blk-mq.c:1191 blk_sq_make_request+0x320/0x4d0 I would expect that - so normal request, since we know merging is on. I just sent out a patch, can you give that a whirl? Keep Linus' debugging patch, and apply it on top. Below as well, for reference.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..80a9c45a9235 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1217,9 +1217,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
 	rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
-	hctx->queued++;
-	data->hctx = hctx;
-	data->ctx = ctx;
+	data->hctx->queued++;
 	return rq;
 }

-- 
Jens Axboe
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote: > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two, since right now it can trigger both > for the > > blk_mq_bio_to_request(rq, bio); > > path _and_ for the > > if (!blk_mq_attempt_merge(q, ctx, bio)) { > blk_mq_bio_to_request(rq, bio); > goto insert_rq; > > path. If you split it into two: one before that "insert_rq:" label, > and one before the "goto insert_rq" thing, then we could see if it is > just one of the blk_mq_merge_queue_io() cases (or both) that is > broken.. It's the latter of the two. [ 12.302392] WARNING: CPU: 3 PID: 272 at block/blk-mq.c:1191 blk_sq_make_request+0x320/0x4d0 Dave
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: > > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two I did that myself too, since Dave sees this during boot. But I'm not getting the warning ;( Dave gets it with ext4, and that's what I have too, so I'm not sure what the required trigger would be. Linus
Re: bio linked list corruption.
On 10/26/2016 04:51 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones wrote: I gave it a shot too for shits & giggles. This falls out during boot. [9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0 Hmm. That's the WARN_ON_ONCE(rq->mq_ctx != ctx); that I added to blk_mq_merge_queue_io(), and I really think that warning is valid, and the fact that it triggers shows that something is wrong with locking. We just did a spin_lock(&ctx->lock); and that lock is *supposed* to protect the __blk_mq_insert_request(), but that uses rq->mq_ctx. So if rq->mq_ctx != ctx, then we're locking the wrong context. Jens - please explain to me why I'm wrong. Or maybe I actually might have found the problem? In which case please send me a patch that fixes it ;) I think you're pretty close, the two should not be different and I don't immediately see how. I'll run some testing here, should be easier with this knowledge. Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io() into two, since right now it can trigger both for the blk_mq_bio_to_request(rq, bio); path _and_ for the if (!blk_mq_attempt_merge(q, ctx, bio)) { blk_mq_bio_to_request(rq, bio); goto insert_rq; And just in case I can't trigger, would be interesting to add a call to blk_rq_dump_flags() as well, in case this is some special request. -- Jens Axboe
Re: bio linked list corruption.
On 10/26/2016 04:40 PM, Dave Jones wrote: On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote: > Could you try the attached patch? It adds a couple of sanity tests: > > - a number of tests to verify that 'rq->queuelist' isn't already on > some queue when it is added to a queue > > - one test to verify that rq->mq_ctx is the same ctx that we have locked. > > I may be completely full of shit, and this patch may be pure garbage > or "obviously will never trigger", but humor me. I gave it a shot too for shits & giggles. This falls out during boot. [9.244030] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null) [9.271391] [ cut here ] [9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0 [9.285613] CPU: 0 PID: 1 Comm: init Not tainted 4.9.0-rc2-think+ #4 Very odd, don't immediately see how that can happen. For testing, can you try and add the below patch? Just curious if that fixes the list corruption. Thing is, I don't see how ->mq_ctx and ctx are different in this path, but I can debug that on the side.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..73b9462aa21f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1165,9 +1165,10 @@ static inline bool hctx_allow_merges(struct blk_mq_hw_ctx *hctx)
 }

 static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
-					 struct blk_mq_ctx *ctx,
					 struct request *rq, struct bio *bio)
 {
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+
 	if (!hctx_allow_merges(hctx) || !bio_mergeable(bio)) {
 		blk_mq_bio_to_request(rq, bio);
 		spin_lock(&ctx->lock);
@@ -1338,7 +1339,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		goto done;
 	}

-	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+	if (!blk_mq_merge_queue_io(data.hctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
 		 * an ASYNC request, just ensure that we run it later on.
@@ -1416,7 +1417,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return cookie;
 	}

-	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+	if (!blk_mq_merge_queue_io(data.hctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
 		 * an ASYNC request, just ensure that we run it later on. The

-- 
Jens Axboe
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones wrote: > > I gave it a shot too for shits & giggles. > This falls out during boot. > > [9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 > blk_sq_make_request+0x465/0x4a0 Hmm. That's the WARN_ON_ONCE(rq->mq_ctx != ctx); that I added to blk_mq_merge_queue_io(), and I really think that warning is valid, and the fact that it triggers shows that something is wrong with locking. We just did a spin_lock(&ctx->lock); and that lock is *supposed* to protect the __blk_mq_insert_request(), but that uses rq->mq_ctx. So if rq->mq_ctx != ctx, then we're locking the wrong context. Jens - please explain to me why I'm wrong. Or maybe I actually might have found the problem? In which case please send me a patch that fixes it ;) Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io() into two, since right now it can trigger both for the blk_mq_bio_to_request(rq, bio); path _and_ for the if (!blk_mq_attempt_merge(q, ctx, bio)) { blk_mq_bio_to_request(rq, bio); goto insert_rq; path. If you split it into two: one before that "insert_rq:" label, and one before the "goto insert_rq" thing, then we could see if it is just one of the blk_mq_merge_queue_io() cases (or both) that is broken.. Linus
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote:
> Could you try the attached patch? It adds a couple of sanity tests:
>
> - a number of tests to verify that 'rq->queuelist' isn't already on
>   some queue when it is added to a queue
>
> - one test to verify that rq->mq_ctx is the same ctx that we have locked.
>
> I may be completely full of shit, and this patch may be pure garbage
> or "obviously will never trigger", but humor me.

I gave it a shot too for shits & giggles. This falls out during boot.

[9.244030] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
[9.271391] [ cut here ]
[9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0
[9.285613] CPU: 0 PID: 1 Comm: init Not tainted 4.9.0-rc2-think+ #4
[9.300106] c9013848 [9.307353] 8130d27c [9.307420] [9.307456] [9.307494] c9013888 [9.314780] 81077a41 [9.314847] 049d00013898 [9.314884] 0001 [9.314922] 0002 [9.322213] e8c06600 [9.322280] e8606600 [9.322317] 8805015c2e80
[9.322355] Call Trace:
[9.329689] [] dump_stack+0x4f/0x73
[9.337208] [] __warn+0xc1/0xe0
[9.344730] [] warn_slowpath_null+0x18/0x20
[9.352282] [] blk_sq_make_request+0x465/0x4a0
[9.359816] [] ? generic_make_request+0xca/0x210
[9.367367] [] generic_make_request+0xd7/0x210
[9.374931] [] submit_bio+0x69/0x120
[9.382488] [] ? submit_bh_wbc+0x16f/0x1e0
[9.390083] [] submit_bh_wbc+0x16f/0x1e0
[9.397673] [] submit_bh+0xe/0x10
[9.405205] [] __ext4_get_inode_loc+0x1ac/0x3d0
[9.412795] [] ext4_iget+0x6b/0xb70
[9.420366] [] ? lookup_slow+0xf8/0x1f0
[9.427947] [] ext4_iget_normal+0x2a/0x30
[9.435541] [] ext4_lookup+0x114/0x230
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 2:52 PM, Chris Mason wrote:
>
> This one is special because CONFIG_VMAP_STACK is not set. Btrfs triggers in
> < 10 minutes.
> I've done 30 minutes each with XFS and Ext4 without luck.

Ok, see the email I wrote that crossed yours - if it's really some list corruption on ctx->rq_list due to some locking problem, I really would expect CONFIG_VMAP_STACK to be entirely irrelevant, except perhaps from a timing standpoint.

> WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> list_add corruption. prev->next should be next (e8d80b08), but was
> 88012b65fb88. (prev=880128c8d500).
> Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw
> gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy
> button pcspkr sch_fq_codel autofs4 virtio_blk
> CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24
> 04/01/2014
> 880104eff868 814fde0f 8151c46e 880104eff8c8
> 880104eff8c8 880104eff8b8 810648cf
> 880128cab2c0 00213fc57c68 8801384e8928 880128cab180
> Call Trace:
> [] dump_stack+0x53/0x74
> [] ? __list_add+0xbe/0xd0
> [] __warn+0xff/0x120
> [] warn_slowpath_fmt+0x49/0x50
> [] __list_add+0xbe/0xd0
> [] blk_sq_make_request+0x388/0x580
> [] generic_make_request+0x104/0x200

Well, it's very consistent, I have to say. So I really don't think this is random corruption.

Could you try the attached patch? It adds a couple of sanity tests:

 - a number of tests to verify that 'rq->queuelist' isn't already on
   some queue when it is added to a queue

 - one test to verify that rq->mq_ctx is the same ctx that we have locked.

I may be completely full of shit, and this patch may be pure garbage or "obviously will never trigger", but humor me.
               Linus

 block/blk-mq.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..4f575de7fdd0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -521,6 +521,8 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head)
 	 */
 	BUG_ON(rq->cmd_flags & REQ_SOFTBARRIER);
 
+	WARN_ON_ONCE(!list_empty(&rq->queuelist));
+
 	spin_lock_irqsave(&q->requeue_lock, flags);
 	if (at_head) {
 		rq->cmd_flags |= REQ_SOFTBARRIER;
@@ -838,6 +840,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 			queued++;
 			break;
 		case BLK_MQ_RQ_QUEUE_BUSY:
+			WARN_ON_ONCE(!list_empty(&rq->queuelist));
 			list_add(&rq->queuelist, &rq_list);
 			__blk_mq_requeue_request(rq);
 			break;
@@ -1034,6 +1037,8 @@ static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 
 	trace_block_rq_insert(hctx->queue, rq);
 
+	WARN_ON_ONCE(!list_empty(&rq->queuelist));
+
 	if (at_head)
 		list_add(&rq->queuelist, &ctx->rq_list);
 	else
@@ -1137,6 +1142,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 			depth = 0;
 		}
 
+		WARN_ON_ONCE(!list_empty(&rq->queuelist));
 		depth++;
 		list_add_tail(&rq->queuelist, &ctx_list);
 	}
@@ -1172,6 +1178,7 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
 		blk_mq_bio_to_request(rq, bio);
 		spin_lock(&ctx->lock);
 insert_rq:
+		WARN_ON_ONCE(rq->mq_ctx != ctx);
 		__blk_mq_insert_request(hctx, rq, false);
 		spin_unlock(&ctx->lock);
 		return false;
@@ -1326,6 +1333,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 			old_rq = same_queue_rq;
 			list_del_init(&same_queue_rq->queuelist);
 		}
+		WARN_ON_ONCE(!list_empty(&rq->queuelist));
 		list_add_tail(&rq->queuelist, &plug->mq_list);
 	} else /* is_sync */
 		old_rq = rq;
@@ -1412,6 +1420,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 			trace_block_plug(q);
 		}
 
+		WARN_ON_ONCE(!list_empty(&rq->queuelist));
 		list_add_tail(&rq->queuelist, &plug->mq_list);
 		return cookie;
 	}
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason wrote:
>
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
>
> [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33
> __list_add+0xbe/0xd0
> [ 2759.119652] list_add corruption. prev->next should be next
> (e8c80308), but was c9ccfb88. (prev=880128522380).
> [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel
> aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw
> pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
> [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted
> 4.9.0-rc1-15246-g4ce9206-dirty #317
> [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.9.0-1.fc24 04/01/2014
> [ 2759.125077] c9000f6fb868 814fe4ff 8151cb5e c9000f6fb8c8
> [ 2759.125077] c9000f6fb8c8 c9000f6fb8b8 81064bbf
> [ 2759.127444] 880128523680 002139968000 880138b7a4a0 880128523540
> [ 2759.127444] Call Trace:
> [ 2759.127444] [] dump_stack+0x53/0x74
> [ 2759.127444] [] ? __list_add+0xbe/0xd0
> [ 2759.127444] [] __warn+0xff/0x120
> [ 2759.127444] [] warn_slowpath_fmt+0x49/0x50
> [ 2759.127444] [] __list_add+0xbe/0xd0
> [ 2759.127444] [] blk_sq_make_request+0x388/0x580

Ok, that's definitely the same one that Dave started out seeing.

The fact that it is that reliable - two different machines, two very different loads (dbench looks nothing like trinity) - really makes me think that maybe the problem really is in the block plugging after all. It very much does not smell like random stack corruption. It's simply not random enough.

And I just noticed something: I originally thought that this is the "list_add_tail()" to the plug - which is the only "list_add()" variant in that function. But that never made sense, because the whole "but was" isn't a stack address, and "next" in "list_add_tail()" is basically fixed, and would have to be the stack.
But I now notice that there's actually another "list_add()" variant there, and it's the one from __blk_mq_insert_request() that gets inlined into blk_mq_insert_request(), which then gets inlined into blk_mq_make_request().

And that actually makes some sense just looking at the offsets too:

    blk_sq_make_request+0x388/0x580

so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely. And there it makes perfect sense that the "next should be" value is *not* on the stack.

Chris, if you built with debug info, you can try

    ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388

to get what line that blk_sq_make_request+0x388 address actually is. I think it's the

    list_add_tail(&rq->queuelist, &ctx->rq_list);

in __blk_mq_insert_req_list() (when it's inlined from blk_sq_make_request(), "at_head" will be false).

So it smells like "&ctx->rq_list" might be corrupt. And that actually seems much more likely than the "plug" list, because while the plug is entirely thread-local (and thus shouldn't have any races), the ctx->rq_list very much is not.

Jens? For example, should we have a

    BUG_ON(ctx != rq->mq_ctx);

in blk_mq_merge_queue_io()? Because it locks ctx->lock, but then __blk_mq_insert_request() will insert things onto the queues of rq->mq_ctx. blk_mq_insert_requests() has similar issues, but there has that BUG_ON(). The locking there really is *very* messy. All the lockers do

    spin_lock(&ctx->lock);
    ...
    spin_unlock(&ctx->lock);

but then __blk_mq_insert_request() and __blk_mq_insert_req_list() don't act on "ctx", but on "ctx = rq->mq_ctx", so if you ever get those wrong, you're completely dead.

Now, I'm not seeing why they'd be wrong, and why they'd be associated with the VMAP_STACK thing, but it could just be an unlucky timing thing.

               Linus
Re: bio linked list corruption.
On 10/26/2016 04:00 PM, Chris Mason wrote:
>
> On 10/26/2016 03:06 PM, Linus Torvalds wrote:
>> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones wrote:
>>>
>>> The stacks show nearly all of them are stuck in sync_inodes_sb
>>
>> That's just wb_wait_for_completion(), and it means that some IO isn't
>> completing.
>>
>> There's also a lot of processes waiting for inode_lock(), and a few
>> waiting for mnt_want_write()
>>
>> Ignoring those, we have
>>
>>> [] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
>>> [] btrfs_sync_fs+0x31/0xc0 [btrfs]
>>> [] sync_filesystem+0x6e/0xa0
>>> [] SyS_syncfs+0x3c/0x70
>>> [] do_syscall_64+0x5c/0x170
>>> [] entry_SYSCALL64_slow_path+0x25/0x25
>>> [] 0x
>>
>> Don't know this one. There's a couple of them. Could there be some
>> ABBA deadlock on the ordered roots waiting?
>
> It's always possible, but we haven't changed anything here.
>
> I've tried a long list of things to reproduce this on my test boxes,
> including days of trinity runs and a kernel module to exercise vmalloc,
> and thread creation.
>
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:

This one is special because CONFIG_VMAP_STACK is not set. Btrfs triggers in < 10 minutes. I've done 30 minutes each with XFS and Ext4 without luck.

This is all in a virtual machine that I can copy on to a bunch of hosts. So I'll get some parallel tests going tonight to narrow it down.

[ cut here ]
WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
list_add corruption. prev->next should be next (e8d80b08), but was 88012b65fb88. (prev=880128c8d500).
Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr sch_fq_codel autofs4 virtio_blk
CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
880104eff868 814fde0f 8151c46e 880104eff8c8
880104eff8c8 880104eff8b8 810648cf
880128cab2c0 00213fc57c68 8801384e8928 880128cab180
Call Trace:
[] dump_stack+0x53/0x74
[] ? __list_add+0xbe/0xd0
[] __warn+0xff/0x120
[] warn_slowpath_fmt+0x49/0x50
[] __list_add+0xbe/0xd0
[] blk_sq_make_request+0x388/0x580
[] generic_make_request+0x104/0x200
[] submit_bio+0x65/0x130
[] ? __percpu_counter_add+0x96/0xd0
[] btrfs_map_bio+0x23c/0x310
[] btrfs_submit_bio_hook+0xd3/0x190
[] submit_one_bio+0x6d/0xa0
[] flush_epd_write_bio+0x4e/0x70
[] extent_writepages+0x5d/0x70
[] ? btrfs_releasepage+0x50/0x50
[] ? wbc_attach_and_unlock_inode+0x6e/0x170
[] btrfs_writepages+0x27/0x30
[] do_writepages+0x20/0x30
[] __filemap_fdatawrite_range+0xb5/0x100
[] filemap_fdatawrite_range+0x13/0x20
[] btrfs_fdatawrite_range+0x2b/0x70
[] btrfs_sync_file+0x88/0x490
[] ? group_send_sig_info+0x42/0x80
[] ? kill_pid_info+0x5d/0x90
[] ? SYSC_kill+0xba/0x1d0
[] ? __sb_end_write+0x58/0x80
[] vfs_fsync_range+0x4c/0xb0
[] ? syscall_trace_enter+0x201/0x2e0
[] vfs_fsync+0x1c/0x20
[] do_fsync+0x3d/0x70
[] ? syscall_slow_exit_work+0xfb/0x100
[] SyS_fsync+0x10/0x20
[] do_syscall_64+0x55/0xd0
[] ? prepare_exit_to_usermode+0x37/0x40
[] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace efe6b17c6dba2a6e ]---
Re: bio linked list corruption.
On 10/26/2016 03:06 PM, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones wrote:
>>
>> The stacks show nearly all of them are stuck in sync_inodes_sb
>
> That's just wb_wait_for_completion(), and it means that some IO isn't
> completing.
>
> There's also a lot of processes waiting for inode_lock(), and a few
> waiting for mnt_want_write()
>
> Ignoring those, we have
>
>> [] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
>> [] btrfs_sync_fs+0x31/0xc0 [btrfs]
>> [] sync_filesystem+0x6e/0xa0
>> [] SyS_syncfs+0x3c/0x70
>> [] do_syscall_64+0x5c/0x170
>> [] entry_SYSCALL64_slow_path+0x25/0x25
>> [] 0x
>
> Don't know this one. There's a couple of them. Could there be some
> ABBA deadlock on the ordered roots waiting?

It's always possible, but we haven't changed anything here.

I've tried a long list of things to reproduce this on my test boxes, including days of trinity runs and a kernel module to exercise vmalloc, and thread creation.

Today I turned off every CONFIG_DEBUG_* except for list debugging, and ran dbench 2048:

[ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0
[ 2759.119652] list_add corruption. prev->next should be next (e8c80308), but was c9ccfb88. (prev=880128522380).
[ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
[ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
[ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[ 2759.125077] c9000f6fb868 814fe4ff 8151cb5e c9000f6fb8c8
[ 2759.125077] c9000f6fb8c8 c9000f6fb8b8 81064bbf
[ 2759.127444] 880128523680 002139968000 880138b7a4a0 880128523540
[ 2759.127444] Call Trace:
[ 2759.127444] [] dump_stack+0x53/0x74
[ 2759.127444] [] ? __list_add+0xbe/0xd0
[ 2759.127444] [] __warn+0xff/0x120
[ 2759.127444] [] warn_slowpath_fmt+0x49/0x50
[ 2759.127444] [] __list_add+0xbe/0xd0
[ 2759.127444] [] blk_sq_make_request+0x388/0x580
[ 2759.127444] [] generic_make_request+0x104/0x200
[ 2759.127444] [] submit_bio+0x65/0x130
[ 2759.127444] [] ? __percpu_counter_add+0x96/0xd0
[ 2759.127444] [] btrfs_map_bio+0x23c/0x310
[ 2759.127444] [] btrfs_submit_bio_hook+0xd3/0x190
[ 2759.127444] [] submit_one_bio+0x6d/0xa0
[ 2759.127444] [] flush_epd_write_bio+0x4e/0x70
[ 2759.127444] [] extent_writepages+0x5d/0x70
[ 2759.127444] [] ? btrfs_releasepage+0x50/0x50
[ 2759.127444] [] ? wbc_attach_and_unlock_inode+0x6e/0x170
[ 2759.127444] [] btrfs_writepages+0x27/0x30
[ 2759.127444] [] do_writepages+0x20/0x30
[ 2759.127444] [] __filemap_fdatawrite_range+0xb5/0x100
[ 2759.127444] [] filemap_fdatawrite_range+0x13/0x20
[ 2759.127444] [] btrfs_fdatawrite_range+0x2b/0x70
[ 2759.127444] [] btrfs_sync_file+0x88/0x490
[ 2759.127444] [] ? group_send_sig_info+0x42/0x80
[ 2759.127444] [] ? kill_pid_info+0x5d/0x90
[ 2759.127444] [] ? SYSC_kill+0xba/0x1d0
[ 2759.127444] [] ? __sb_end_write+0x58/0x80
[ 2759.127444] [] vfs_fsync_range+0x4c/0xb0
[ 2759.127444] [] ? syscall_trace_enter+0x201/0x2e0
[ 2759.127444] [] vfs_fsync+0x1c/0x20
[ 2759.127444] [] do_fsync+0x3d/0x70
[ 2759.127444] [] ? syscall_slow_exit_work+0xfb/0x100
[ 2759.127444] [] SyS_fsync+0x10/0x20
[ 2759.127444] [] do_syscall_64+0x55/0xd0
[ 2759.127444] [] ? prepare_exit_to_usermode+0x37/0x40
[ 2759.127444] [] entry_SYSCALL64_slow_path+0x25/0x25
[ 2759.150635] ---[ end trace 3b5b7e2ef61c3d02 ]---

I put a variant of your suggested patch in place, but my printk never triggered. Now that I've made it happen once, I'll make sure I can do it over and over again. This doesn't have the patches that Andy asked Davej to try out yet, but I'll try them once I have a reliable reproducer.
diff --git a/kernel/fork.c b/kernel/fork.c
index 623259f..de95e19 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -165,7 +165,7 @@ void __weak arch_release_thread_stack(unsigned long *stack)
  * vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
  * flush. Try to minimize the number of calls by caching stacks.
  */
-#define NR_CACHED_STACKS 2
+#define NR_CACHED_STACKS 256
 static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
 #endif

@@ -173,7 +173,9 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 #ifdef CONFIG_VMAP_STACK
 	void *stack;
+	char *p;
 	int i;
+	int j;

 	local_irq_disable();
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
@@ -183,7 +185,15 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 			continue;

 		this_cpu_write(cached_stacks[i], NULL);
+		p = s->addr;
+		for (j = 0; j <
Re: bio linked list corruption.
On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones wrote:
>
> The stacks show nearly all of them are stuck in sync_inodes_sb

That's just wb_wait_for_completion(), and it means that some IO isn't
completing.

There's also a lot of processes waiting for inode_lock(), and a few
waiting for mnt_want_write()

Ignoring those, we have

> [] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
> [] btrfs_sync_fs+0x31/0xc0 [btrfs]
> [] sync_filesystem+0x6e/0xa0
> [] SyS_syncfs+0x3c/0x70
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x

Don't know this one. There's a couple of them. Could there be some
ABBA deadlock on the ordered roots waiting?

> [] call_rwsem_down_write_failed+0x17/0x30
> [] btrfs_fallocate+0xb2/0xfd0 [btrfs]
> [] vfs_fallocate+0x13e/0x220
> [] SyS_fallocate+0x43/0x80
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x

This one is also inode_lock(), and is interesting only because it's
fallocate(), which has shown up so many times before.  But there are
other threads blocked on do_truncate, or btrfs_file_write_iter instead,
or on lseek, so this is not different for any other reason.

> [] wait_on_page_bit+0xaf/0xc0
> [] __filemap_fdatawait_range+0x151/0x170
> [] filemap_fdatawait_keep_errors+0x1c/0x20
> [] sync_inodes_sb+0x273/0x300
> [] sync_filesystem+0x57/0xa0
> [] SyS_syncfs+0x3c/0x70
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x

This is actually waiting on the page. Possibly this is the IO that is
never completing, and keeps the inode lock.

> [] btrfs_start_ordered_extent+0x5b/0xb0 [btrfs]
> [] lock_and_cleanup_extent_if_need+0x22d/0x290 [btrfs]
> [] __btrfs_buffered_write+0x1b8/0x6e0 [btrfs]
> [] btrfs_file_write_iter+0x170/0x550 [btrfs]
> [] do_iter_readv_writev+0xa8/0x100
> [] do_readv_writev+0x172/0x210
> [] vfs_writev+0x3a/0x50
> [] do_pwritev+0xb0/0xd0
> [] SyS_pwritev+0xc/0x10
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25

Hmm.  This is the one that *started* the ordered extents (as opposed
to the ones waiting for it).

I dunno. There might be a lost IO. More likely it's the same corruption
that causes it, it just didn't result in an oops this time.

                  Linus