Re: bio linked list corruption.

2016-12-06 Thread Linus Torvalds
On Tue, Dec 6, 2016 at 12:16 AM, Peter Zijlstra  wrote:
>>
>> Of course, I'm really hoping that this shmem.c use is the _only_ such
>> case.  But I doubt it.
>
> $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
> 28

Hmm. Most of them seem to be ok, because they use "wait_event()",
which will always remove itself from the wait-queue. And they do it
from the place that allocates the wait-queue.

IOW, the mm/shmem.c case really was fairly special, because it just
did "prepare_to_wait()", and then did a finish_wait() - and not in the
thread that allocated it on the stack.

So it's really that "some _other_ thread allocated the waitqueue on
the stack, and now we're doing a wait on it" that is bad.

So the normal pattern seems to be:

 - allocate wq on the stack
 - pass it on to a waker
 - wait for it

and that's ok, because as part of "wait for it" we will also be
cleaning things up.

The reason mm/shmem.c was buggy was that it did

 - allocate wq on the stack
 - pass it on to somebody else to wait for
 - wake it up

and *that* is buggy, because it's the waiter, not the waker, that
normally cleans things up.
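
Roughly, as a sketch (purely illustrative, the helper names here are
made up):

/* ok: the thread that owns the on-stack head is also the one waiting.
 * wait_event() always unlinks its own entry before returning, so the
 * head is empty by the time the stack frame goes away. */
void waiter_owns_head(void)
{
	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

	hand_off_to_waker(&wq);			/* made-up helper */
	wait_event(wq, work_is_done());		/* made-up condition */
}						/* wq provably empty here */

/* buggy: the on-stack head is handed to *other* threads to wait on,
 * and this thread only wakes them.  An autoremove entry whose task was
 * already woken by something else stays linked into wq, which then
 * goes out of scope. */
void waker_owns_head(void)
{
	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

	publish_waitqueue(&wq);		/* made-up: others prepare_to_wait() on it */
	do_the_work();			/* made-up helper */
	wake_up_all(&wq);		/* does not guarantee the list is empty */
}					/* stack reused while entries still point here */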

Is there some good way to find this kind of pattern automatically, I wonder?

 Linus


Re: bio linked list corruption.

2016-12-06 Thread Vegard Nossum
On 5 December 2016 at 22:33, Vegard Nossum  wrote:
> On 5 December 2016 at 21:35, Linus Torvalds
>  wrote:
>> Note for Ingo and Peter: this patch has not been tested at all. But
>> Vegard did test an earlier patch of mine that just verified that yes,
>> the issue really was that wait queue entries remained on the wait
>> queue head just as we were about to return and free it.
>
> The second patch has been running for 1h+ without any problems of any
> kind. I should typically have seen 2 crashes by now. I'll let it run
> overnight to be sure.

Alright, so nearly 12 hours later I don't see either the new warning
or the original crash at all, so feel free to add:

Tested-by: Vegard Nossum .

That said, my 8 VMs had all panicked in some way due to OOMs (which is
new since v4.8), although some got page allocation stalls for >20s and
died because "khugepaged blocked for more than 120 seconds", others
got "Out of memory and no killable processes".


Vegard


Re: bio linked list corruption.

2016-12-06 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
> 28

This debug facility looks sensible. A couple of minor suggestions:

> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -39,6 +39,9 @@ struct wait_bit_queue {
>  struct __wait_queue_head {
>   spinlock_t  lock;
>   struct list_headtask_list;
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> + int onstack;
> +#endif

The structure will pack better in the debug-enabled case if 'onstack' is next
to 'lock', as spinlock_t is 4 bytes on many architectures.
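
Something like this (just a sketch of the suggested field order, not a
patch):

struct __wait_queue_head {
	spinlock_t		lock;
#ifdef CONFIG_DEBUG_WAITQUEUE
	int			onstack;	/* packs into the hole after 'lock' */
#endif
	struct list_head	task_list;
};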

> -#define __WAIT_QUEUE_HEAD_INITIALIZER(name) {				\
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> +#define ___WAIT_QUEUE_ONSTACK(onstack)   .onstack = (onstack),
> +#else
> +#define ___WAIT_QUEUE_ONSTACK(onstack)
> +#endif

Please help readers by showing the internal structure of the definition:

#ifdef CONFIG_DEBUG_WAITQUEUE
# define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
#else
# define ___WAIT_QUEUE_ONSTACK(onstack)
#endif


> +static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait)
> +{
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> + WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function);
> +#endif
> +}

I'd name this debug_waitqueue_check() or such - as the 'prepare' is a bit
misleading (we don't prepare debugging, we do the debug check here).

> +config DEBUG_WAITQUEUE
> + bool "Debug waitqueue"
> + depends on DEBUG_KERNEL

I'd name it DEBUG_SCHED_WAITQUEUE=y and I'd also make it depend on 
CONFIG_DEBUG_SCHED.

LGTM otherwise!

Thanks,

Ingo


Re: bio linked list corruption.

2016-12-06 Thread Peter Zijlstra
On Mon, Dec 05, 2016 at 12:35:52PM -0800, Linus Torvalds wrote:
> Adding the scheduler people to the participants list, and re-attaching
> the patch, because while this patch is internal to the VM code, the
> issue itself is not.
> 
> There might well be other cases where somebody goes "wake_up_all()"
> will wake everybody up, so I can put the wait queue head on the stack,
> and then after I've woken people up I can return".
> 
> Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
> does make sure that everybody is woken up, but the usual autoremove
> wake function only removes the wakeup entry if the process was woken
> up by that particular wakeup. If something else had woken it up, the
> entry remains on the list, and the waker in this case returned with
> the wait head still containing entries.
> 
> Which is deadly when the wait queue head is on the stack.

Yes, very much so.

> So I'm wondering if we should make that "synchronous_wake_function()"
> available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
> that uses it.

We could also do some debug code that tracks the ONSTACK ness and warns
on autoremove. Something like the below, equally untested.

> 
> Of course, I'm really hoping that this shmem.c use is the _only_ such
> case.  But I doubt it.

$ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
28

---


diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2408e8d5c05c..199faaa89847 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -39,6 +39,9 @@ struct wait_bit_queue {
 struct __wait_queue_head {
spinlock_t  lock;
struct list_headtask_list;
+#ifdef CONFIG_DEBUG_WAITQUEUE
+   int onstack;
+#endif
 };
 typedef struct __wait_queue_head wait_queue_head_t;
 
@@ -56,9 +59,18 @@ struct task_struct;
 #define DECLARE_WAITQUEUE(name, tsk)   \
wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)
 
-#define __WAIT_QUEUE_HEAD_INITIALIZER(name) {  \
+#ifdef CONFIG_DEBUG_WAITQUEUE
+#define ___WAIT_QUEUE_ONSTACK(onstack) .onstack = (onstack),
+#else
+#define ___WAIT_QUEUE_ONSTACK(onstack)
+#endif
+
+#define ___WAIT_QUEUE_HEAD_INITIALIZER(name, onstack)  {   \
.lock   = __SPIN_LOCK_UNLOCKED(name.lock),  \
-   .task_list  = { &(name).task_list, &(name).task_list } }
+   .task_list  = { &(name).task_list, &(name).task_list }, \
+   ___WAIT_QUEUE_ONSTACK(onstack) }
+
+#define __WAIT_QUEUE_HEAD_INITIALIZER(name) ___WAIT_QUEUE_HEAD_INITIALIZER(name, 0)
 
 #define DECLARE_WAIT_QUEUE_HEAD(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
@@ -80,11 +92,12 @@ extern void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct
 
 #ifdef CONFIG_LOCKDEP
 # define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
-   ({ init_waitqueue_head(&name); name; })
+   ({ init_waitqueue_head(&name); (name).onstack = 1; name; })
 # define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INIT_ONSTACK(name)
 #else
-# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name)
+# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
+   wait_queue_head_t name = ___WAIT_QUEUE_HEAD_INITIALIZER(name, 1)
 #endif
 
 static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 9453efe9b25a..746d00117d08 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -156,6 +156,13 @@ void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
 }
 EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */
 
+static inline void prepare_debug(wait_queue_head_t *q, wait_queue_t *wait)
+{
+#ifdef CONFIG_DEBUG_WAITQUEUE
+   WARN_ON_ONCE(q->onstack && wait->func == autoremove_wake_function);
+#endif
+}
+
 /*
  * Note: we use "set_current_state()" _after_ the wait-queue add,
  * because we need a memory barrier there on SMP, so that any
@@ -178,6 +185,7 @@ prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
if (list_empty(&wait->task_list))
__add_wait_queue(q, wait);
set_current_state(state);
+   prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait);
@@ -192,6 +200,7 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
if (list_empty(&wait->task_list))
__add_wait_queue_tail(q, wait);
set_current_state(state);
+   prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
@@ -235,6 +244,7 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
}
set_current_state(state);
}
+   prepare_debug(q, wait);
spin_unlock_irqrestore(&q->lock, flags);

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 21:35, Linus Torvalds
 wrote:
> Note for Ingo and Peter: this patch has not been tested at all. But
> Vegard did test an earlier patch of mine that just verified that yes,
> the issue really was that wait queue entries remained on the wait
> queue head just as we were about to return and free it.

The second patch has been running for 1h+ without any problems of any
kind. I should typically have seen 2 crashes by now. I'll let it run
overnight to be sure.


Vegard


Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
Adding the scheduler people to the participants list, and re-attaching
the patch, because while this patch is internal to the VM code, the
issue itself is not.

There might well be other cases where somebody goes "wake_up_all()
will wake everybody up, so I can put the wait queue head on the stack,
and then after I've woken people up I can return".

Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
does make sure that everybody is woken up, but the usual autoremove
wake function only removes the wakeup entry if the process was woken
up by that particular wakeup. If something else had woken it up, the
entry remains on the list, and the waker in this case returned with
the wait head still containing entries.

Which is deadly when the wait queue head is on the stack.

So I'm wondering if we should make that "synchronous_wake_function()"
available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
that uses it.
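
Something like this, I guess (only a sketch, assuming the wake function
also moves next to autoremove_wake_function() in kernel/sched/wait.c):

#define DEFINE_WAIT_SYNC(name)						\
	DEFINE_WAIT_FUNC(name, synchronous_wake_function)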

Of course, I'm really hoping that this shmem.c use is the _only_ such
case.  But I doubt it.

Comments?

Note for Ingo and Peter: this patch has not been tested at all. But
Vegard did test an earlier patch of mine that just verified that yes,
the issue really was that wait queue entries remained on the wait
queue head just as we were about to return and free it.

   Linus


On Mon, Dec 5, 2016 at 12:10 PM, Linus Torvalds
 wrote:
>
> Anyway, can you try this patch instead? It should actually cause the
> wake_up_all() to always remove all entries, and thus the WARN_ON()
> should no longer happen (and I removed the "list_del()" hackery).
>
>Linus
diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..17beb44e9f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1848,6 +1848,19 @@ alloc_nohuge:	page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
return error;
 }
 
+/*
+ * This is like autoremove_wake_function, but it removes the wait queue
+ * entry unconditionally - even if something else had already woken the
+ * target.
+ */
+static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+   int ret = default_wake_function(wait, mode, sync, key);
+   list_del_init(&wait->task_list);
+   return ret;
+}
+
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
struct inode *inode = file_inode(vma->vm_file);
@@ -1883,7 +1896,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
vmf->pgoff >= shmem_falloc->start &&
vmf->pgoff < shmem_falloc->next) {
wait_queue_head_t *shmem_falloc_waitq;
-   DEFINE_WAIT(shmem_fault_wait);
+   DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
 
ret = VM_FAULT_NOPAGE;
if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -2665,6 +2678,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
spin_lock(&inode->i_lock);
inode->i_private = NULL;
wake_up_all(&shmem_falloc_waitq);
+   WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list));
spin_unlock(&inode->i_lock);
error = 0;
goto out;


Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 20:11, Vegard Nossum  wrote:
> On 5 December 2016 at 18:55, Linus Torvalds
>  wrote:
>> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum  
>> wrote:
>> Since you apparently can recreate this fairly easily, how about trying
>> this stupid patch?
>>
>> NOTE! This is entirely untested. I may have screwed this up entirely.
>> You get the idea, though - just remove the wait queue head from the
>> list - the list entries stay around, but nothing points to the stack
>> entry (that we're going to free) any more.
>>
>> And add the warning to see if this actually ever triggers (and because
>> I'd like to see the callchain when it does, to see if it's another
>> waitqueue somewhere or what..)
>
> [ cut here ]
> WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0
> Kernel panic - not syncing: panic_on_warn set ...

So I noticed that panic_on_warn just after sending the email and I've
been waiting for it to trigger again.

The warning has triggered twice more without panic_on_warn set and I
haven't seen any crash yet.


Vegard


Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 11:11 AM, Vegard Nossum  wrote:
>
> [ cut here ]
> WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0

Ok, good. So that's confirmed as the cause of this problem.

And the call chain that I wanted is obviously completely
uninteresting, because it's call cghain on the other side (the page
fault side) that would show the nested wake queue behavior. I was just
being stupid about it.

I wonder if we have any other places where we just blithely assume
that "wake_up_all()" will actually empty the whole wait queue. It's
_usually_ true, but as noted, nested waiting does happen.

Anyway, can you try this patch instead? It should actually cause the
wake_up_all() to always remove all entries, and thus the WARN_ON()
should no longer happen (and I removed the "list_del()" hackery).

   Linus
diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..17beb44e9f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1848,6 +1848,19 @@ alloc_nohuge:	page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
return error;
 }
 
+/*
+ * This is like autoremove_wake_function, but it removes the wait queue
+ * entry unconditionally - even if something else had already woken the
+ * target.
+ */
+static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+   int ret = default_wake_function(wait, mode, sync, key);
+   list_del_init(&wait->task_list);
+   return ret;
+}
+
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
struct inode *inode = file_inode(vma->vm_file);
@@ -1883,7 +1896,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
vmf->pgoff >= shmem_falloc->start &&
vmf->pgoff < shmem_falloc->next) {
wait_queue_head_t *shmem_falloc_waitq;
-   DEFINE_WAIT(shmem_fault_wait);
+   DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
 
ret = VM_FAULT_NOPAGE;
if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -2665,6 +2678,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
spin_lock(&inode->i_lock);
inode->i_private = NULL;
wake_up_all(&shmem_falloc_waitq);
+   WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list));
spin_unlock(&inode->i_lock);
error = 0;
goto out;


Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 18:55, Linus Torvalds
 wrote:
> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum  wrote:
>>
>> The warning shows that it made it past the list_empty_careful() check
>> in finish_wait() but then bugs out on the &wait->task_list
>> dereference.
>>
>> Anything stick out?
>
> I hate that shmem waitqueue garbage. It's really subtle.
>
> I think the problem is that "wake_up_all()" in shmem_fallocate()
> doesn't necessarily wake up everything. It wakes up TASK_NORMAL -
> which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean
> "everything on the list".
>
> I think that what happens is that the waiters somehow move from
> TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that
> wake_up_all() will ignore them, leave them on the list, and now that
> list on stack is no longer empty at the end.
>
> And the way *THAT* can happen is that the task is on some *other*
> waitqueue as well, and that other waitqueue wakes it up. That's not
> impossible, you can certainly have people on wait-queues that still
> take faults.
>
> Or somebody just uses a directed wake_up_process() or something.
>
> Since you apparently can recreate this fairly easily, how about trying
> this stupid patch?
>
> NOTE! This is entirely untested. I may have screwed this up entirely.
> You get the idea, though - just remove the wait queue head from the
> list - the list entries stay around, but nothing points to the stack
> entry (that we're going to free) any more.
>
> And add the warning to see if this actually ever triggers (and because
> I'd like to see the callchain when it does, to see if it's another
> waitqueue somewhere or what..)

[ cut here ]
WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0
Kernel panic - not syncing: panic_on_warn set ...

CPU: 22 PID: 14012 Comm: trinity-c73 Not tainted 4.9.0-rc7+ #220
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
8801e32af970 81fb08c1 83e74b60 8801e32afa48
83ed7600 847103e0 8801e32afa38 81515244
41b58ab3 844e21da 81515061 8151591e
Call Trace:
[] dump_stack+0x83/0xb2
[] panic+0x1e3/0x3ad
[] __warn+0x1bf/0x1e0
[] warn_slowpath_null+0x2c/0x40
[] shmem_fallocate+0x9a7/0xac0
[] vfs_fallocate+0x350/0x620
[] SyS_madvise+0x432/0x1290
[] do_syscall_64+0x1af/0x4d0
[] entry_SYSCALL64_slow_path+0x25/0x25
[ cut here ]

Attached a full log.


Vegard


0.txt.gz
Description: GNU Zip compressed data


Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 19:11, Andy Lutomirski  wrote:
> On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum  wrote:
>> On 23 November 2016 at 20:58, Dave Jones  wrote:
>>> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:
>>>
>>>  > [  317.689216] BUG: Bad page state in process kworker/u8:8  pfn:4d8fd4
>>>  > trace from just before this happened. Does this shed any light ?
>>>  >
>>>  > https://codemonkey.org.uk/junk/trace.txt
>>>
>>> crap, I just noticed the timestamps in the trace come from quite a bit
>>> later.  I'll tweak the code to do the taint checking/ftrace stop after
>>> every syscall, that should narrow the window some more.
>>
>> FWIW I hit this as well:
>>
>> BUG: unable to handle kernel paging request at 81ff08b7
>
> We really ought to improve this message.  If nothing else, it should
> say whether it was a read, a write, or an instruction fetch.
>
>> IP: [] __lock_acquire.isra.32+0xda/0x1a30
>> PGD 461e067 PUD 461f063
>> PMD 1e001e1
>
> Too lazy to manually decode this right now, but I don't think it matters.
>
>> Oops: 0003 [#1] PREEMPT SMP KASAN
>
> Note this is SMP, but that just means CONFIG_SMP=y.  Vegard, how many
> CPUs does your kernel think you have?

My first crash was running on a 1-CPU guest (not intentionally, but
because of a badly configured qemu). I'm honestly surprised it
triggered at all with 1 CPU, but I guess it shows that it's not a true
concurrency issue at least!

>
>> RIP: 0010:[]  []
>> __lock_acquire.isra.32+0xda/0x1a30
>> RSP: 0018:8801bab8f730  EFLAGS: 00010082
>> RAX: 81ff071f RBX:  RCX: 
>
> RAX points to kernel text.

Yes, it's somewhere in the middle of iov_iter_init() -- other crashes
also had put_prev_entity(), a null pointer, and some garbage values I
couldn't identify.

>
>> Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85
>> c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e>
>> ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00
>
>   2b:3e ff 80 98 01 00 00 incl   %ds:*0x198(%rax)<--
> trapping instruction
>
> That's very strange.  I think this is:
>
> atomic_inc((atomic_t *)&class->ops);
>
> but my kernel contains:
>
> 3cb4:   f0 ff 80 98 01 00 00	lock incl 0x198(%rax)
>
> So your kernel has been smp-alternatived.  That 3e comes from
> alternatives_smp_unlock.  If you're running on SMP with UP
> alternatives, things will break.

Yes, indeed. It was running on 1 CPU by mistake and still triggered the bug.

The crashes started really pouring in once I got my qemu fixed. Just
to reassure you, here's another crash which shows it's using the
correct instruction on an actual multicore guest:

BUG: unable to handle kernel paging request at 0003030001de
IP: [] __lock_acquire.isra.32+0xda/0x1a30
PGD 183fd2067 PUD 0

Oops: 0002 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
  (ftrace buffer empty)
CPU: 23 PID: 9584 Comm: trinity-c104 Not tainted 4.9.0-rc7+ #219
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: 880189f68000 task.stack: 88017fe5
RIP: 0010:[]  []
__lock_acquire.isra.32+0xda/0x1a30
RSP: 0018:88017fe575e0  EFLAGS: 00010002
RAX: 00030346 RBX:  RCX: 
[...]
Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85
c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 
ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00
RIP  [] __lock_acquire.isra.32+0xda/0x1a30
RSP 
CR2: 0003030001de
---[ end trace 2846425104eb6141 ]---
Kernel panic - not syncing: Fatal exception
[ cut here ]

  2b:*  f0 ff 80 98 01 00 00	lock incl 0x198(%rax)   <--
trapping instruction

> What's your kernel command line?  Can we have your entire kernel log from 
> boot?

Just in case you still want this, I've attached the boot log for the
"true SMP" guest above.


Vegard


5.txt.gz
Description: GNU Zip compressed data


Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 10:11 AM, Andy Lutomirski  wrote:
>
> So your kernel has been smp-alternatived.  That 3e comes from
> alternatives_smp_unlock.  If you're running on SMP with UP
> alternatives, things will break.

I'm assuming he's just running in a VM with a single CPU.

The problem that I pointed out with assuming wake_up_all() actually
removes all wait queue entries does not depend on SMP. The race is
much more fundamental and long-lived.

   Linus


Re: bio linked list corruption.

2016-12-05 Thread Andy Lutomirski
On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum  wrote:
> On 23 November 2016 at 20:58, Dave Jones  wrote:
>> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:
>>
>>  > [  317.689216] BUG: Bad page state in process kworker/u8:8  pfn:4d8fd4
>>  > trace from just before this happened. Does this shed any light ?
>>  >
>>  > https://codemonkey.org.uk/junk/trace.txt
>>
>> crap, I just noticed the timestamps in the trace come from quite a bit
>> later.  I'll tweak the code to do the taint checking/ftrace stop after
>> every syscall, that should narrow the window some more.
>
> FWIW I hit this as well:
>
> BUG: unable to handle kernel paging request at 81ff08b7

We really ought to improve this message.  If nothing else, it should
say whether it was a read, a write, or an instruction fetch.

> IP: [] __lock_acquire.isra.32+0xda/0x1a30
> PGD 461e067 PUD 461f063
> PMD 1e001e1

Too lazy to manually decode this right now, but I don't think it matters.

> Oops: 0003 [#1] PREEMPT SMP KASAN

Note this is SMP, but that just means CONFIG_SMP=y.  Vegard, how many
CPUs does your kernel think you have?

> RIP: 0010:[]  []
> __lock_acquire.isra.32+0xda/0x1a30
> RSP: 0018:8801bab8f730  EFLAGS: 00010082
> RAX: 81ff071f RBX:  RCX: 

RAX points to kernel text.

> Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85
> c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e>
> ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00

  2b:3e ff 80 98 01 00 00 incl   %ds:*0x198(%rax)<--
trapping instruction

That's very strange.  I think this is:

atomic_inc((atomic_t *)&class->ops);

but my kernel contains:

3cb4:   f0 ff 80 98 01 00 00	lock incl 0x198(%rax)

So your kernel has been smp-alternatived.  That 3e comes from
alternatives_smp_unlock.  If you're running on SMP with UP
alternatives, things will break.
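
(Conceptually, the UP patching does something like the sketch below --
names and signature are made up, the real code is in
arch/x86/kernel/alternative.c and walks the .smp_locks section of
relative offsets:)

/* Replace every recorded LOCK prefix (0xf0) with a DS segment override
 * (0x3e), which is a harmless one-byte prefix on these instructions.
 * That is why the dump shows "3e ff 80 ..." where vmlinux has
 * "f0 ff 80 ...". */
static void smp_unlock_sketch(u8 *text, const s32 *offsets, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		u8 *insn = text + offsets[i];

		if (*insn == 0xf0)	/* LOCK prefix */
			*insn = 0x3e;	/* DS prefix */
	}
}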

What's your kernel command line?  Can we have your entire kernel log from boot?

Adding Borislav, since he's the guru for this code.


Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum  wrote:
>
> The warning shows that it made it past the list_empty_careful() check
> in finish_wait() but then bugs out on the &wait->task_list
> dereference.
>
> Anything stick out?

I hate that shmem waitqueue garbage. It's really subtle.

I think the problem is that "wake_up_all()" in shmem_fallocate()
doesn't necessarily wake up everything. It wakes up TASK_NORMAL -
which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean
"everything on the list".

I think that what happens is that the waiters somehow move from
TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that
wake_up_all() will ignore them, leave them on the list, and now that
list on stack is no longer empty at the end.
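
(For reference, the stock autoremove_wake_function() - roughly as it
reads in kernel/sched/wait.c at this point - only unlinks the entry
when the wakeup it performed actually succeeded, which is what lets an
already-woken task stay on the list:)

int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	int ret = default_wake_function(wait, mode, sync, key);

	if (ret)
		list_del_init(&wait->task_list);
	return ret;
}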

And the way *THAT* can happen is that the task is on some *other*
waitqueue as well, and that other waitqueue wakes it up. That's not
impossible, you can certainly have people on wait-queues that still
take faults.

Or somebody just uses a directed wake_up_process() or something.

Since you apparently can recreate this fairly easily, how about trying
this stupid patch?

NOTE! This is entirely untested. I may have screwed this up entirely.
You get the idea, though - just remove the wait queue head from the
list - the list entries stay around, but nothing points to the stack
entry (that we're going to free) any more.

And add the warning to see if this actually ever triggers (and because
I'd like to see the callchain when it does, to see if it's another
waitqueue somewhere or what..)

  Linus
diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..a80148b43476 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2665,6 +2665,8 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
spin_lock(&inode->i_lock);
inode->i_private = NULL;
wake_up_all(&shmem_falloc_waitq);
+   if (WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list)))
+   	list_del(&shmem_falloc_waitq.task_list);
spin_unlock(&inode->i_lock);
error = 0;
goto out;


Re: bio linked list corruption.

2016-12-05 Thread Dave Jones
On Mon, Dec 05, 2016 at 06:09:29PM +0100, Vegard Nossum wrote:
 > On 5 December 2016 at 12:10, Vegard Nossum  wrote:
 > > On 5 December 2016 at 00:04, Vegard Nossum  wrote:
 > >> FWIW I hit this as well:
 > >>
 > >> BUG: unable to handle kernel paging request at 81ff08b7
 > >> IP: [] __lock_acquire.isra.32+0xda/0x1a30
 > >> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: GB   4.9.0-rc7+ 
 > >> #217
 > > [...]
 > >
 > >> I think you can rule out btrfs in any case, probably block layer as
 > >> well, since it looks like this comes from shmem.
 > >
 > > I should rather say that the VM runs on a 9p root filesystem and it
 > > doesn't use/mount any block devices or disk-based filesystems.
 > >
 > > I have all the trinity logs for the crash if that's useful. I tried a
 > > couple of runs with just the (at the time) in-progress syscalls but it
 > > didn't turn up anything interesting. Otherwise it seems like a lot of
 > > data to go through by hand.
 > 
 > I've hit this another 7 times in the past ~3 hours.
 > 
 > Three times the address being dereferenced has pointed to
 > iov_iter_init+0xaf (even across a kernel rebuild), three times it has
 > pointed to put_prev_entity+0x55, once to 0x80008, and twice to
 > 0x292. The fact that it would hit even one of those more than once
 > across runs is pretty suspicious to me, although the ones that point
 > to iov_iter_init and put_prev_entity point to "random" instructions in
 > the sense that they are neither entry points nor return addresses.
 > 
 > shmem_fault() was always on the stack, but it came from different
 > syscalls: add_key(), newuname(), pipe2(), newstat(), fstat(),
 > clock_settime(), mount(), etc.
> [ cut here ]
 > The warning shows that it made it past the list_empty_careful() check
 > in finish_wait() but then bugs out on the &wait->task_list
 > dereference.

I just pushed out the ftrace changes I made to Trinity that might help
you gather more clues.

Right now it's hardcoded to dump a trace to /boot/trace.txt when it
detects the kernel has become tainted.

Before a trinity run, I run this as root..

#!/bin/sh

cd /sys/kernel/debug/tracing/

echo 1 > buffer_size_kb
echo function >> current_tracer

for i in $(cat /home/davej/blacklist-symbols)
do
  echo $i >> set_ftrace_notrace
done

echo 1 >> tracing_on


blacklist-symbols is the more noisy stuff that pollutes traces.
Right now I use these: https://paste.fedoraproject.org/499794/14809582/
You may need to add some more.

(I'll get around to making all this scripting go away, and have trinity
just set this stuff up itself eventually)

Oh, and if you're not running as root, you might need a diff like below
so that trinity can stop the trace when it detects tainting.

Dave

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8696ce6bf2f6..2d6c97e871e0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7217,7 +7217,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
trace_create_file("trace_clock", 0644, d_tracer, tr,
  &trace_clock_fops);
 
-   trace_create_file("tracing_on", 0644, d_tracer,
+   trace_create_file("tracing_on", 0666, d_tracer,
  tr, &rb_simple_fops);
 
create_trace_options_dir(tr);


Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 12:10, Vegard Nossum  wrote:
> On 5 December 2016 at 00:04, Vegard Nossum  wrote:
>> FWIW I hit this as well:
>>
>> BUG: unable to handle kernel paging request at 81ff08b7
>> IP: [] __lock_acquire.isra.32+0xda/0x1a30
>> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: GB   4.9.0-rc7+ #217
> [...]
>
>> I think you can rule out btrfs in any case, probably block layer as
>> well, since it looks like this comes from shmem.
>
> I should rather say that the VM runs on a 9p root filesystem and it
> doesn't use/mount any block devices or disk-based filesystems.
>
> I have all the trinity logs for the crash if that's useful. I tried a
> couple of runs with just the (at the time) in-progress syscalls but it
> didn't turn up anything interesting. Otherwise it seems like a lot of
> data to go through by hand.

I've hit this another 7 times in the past ~3 hours.

Three times the address being dereferenced has pointed to
iov_iter_init+0xaf (even across a kernel rebuild), three times it has
pointed to put_prev_entity+0x55, once to 0x80008, and twice to
0x292. The fact that it would hit even one of those more than once
across runs is pretty suspicious to me, although the ones that point
to iov_iter_init and put_prev_entity point to "random" instructions in
the sense that they are neither entry points nor return addresses.

shmem_fault() was always on the stack, but it came from different
syscalls: add_key(), newuname(), pipe2(), newstat(), fstat(),
clock_settime(), mount(), etc.

I also got this warning which is related:

[ cut here ]
WARNING: CPU: 9 PID: 25045 at lib/list_debug.c:59 __list_del_entry+0x14f/0x1d0
list_del corruption. prev->next should be 88014bdc79e8, but was
88014bfbfc60
Kernel panic - not syncing: panic_on_warn set ...

CPU: 9 PID: 25045 Comm: trinity-c22 Not tainted 4.9.0-rc7+ #219
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
88014bdc7700 81fb0861 83e74b60 88014bdc77d8
84006c00 847103e0 88014bdc77c8 81515244
41b58ab3 844e21c2 81515061 0054
Call Trace:
[] dump_stack+0x83/0xb2
[] panic+0x1e3/0x3ad
[] ? percpu_up_read_preempt_enable.constprop.45+0xcb/0xcb
[] ? __list_del_entry+0x14f/0x1d0
[] __warn+0x1bf/0x1e0
[] ? __lock_acquire.isra.32+0xc2/0x1a30
[] warn_slowpath_fmt+0xac/0xd0
[] ? __warn+0x1e0/0x1e0
[] ? finish_wait+0xb0/0x180
[] __list_del_entry+0x14f/0x1d0
[] ? finish_wait+0xb0/0x180
[] finish_wait+0xbb/0x180
[] shmem_fault+0x4c7/0x6b0
[] ? shmem_getpage_gfp+0x673/0x1c90
[] ? shmem_getpage_gfp+0x1c90/0x1c90
[] ? wake_atomic_t_function+0x210/0x210
[] __do_fault+0x206/0x410
[] ? do_page_mkwrite+0x320/0x320
[] ? handle_mm_fault+0x1cc/0x2a60
[] handle_mm_fault+0x10f7/0x2a60
[] ? handle_mm_fault+0x132/0x2a60
[] ? thread_group_cputime+0x49f/0x6e0
[] ? __pmd_alloc+0x370/0x370
[] ? thread_group_cputime+0x4bc/0x6e0
[] ? thread_group_cputime_adjusted+0x6d/0xe0
[] ? __do_page_fault+0x220/0x9f0
[] ? find_vma+0x30/0x150
[] __do_page_fault+0x452/0x9f0
[] trace_do_page_fault+0x1e5/0x3a0
[] do_async_page_fault+0x27/0xa0
[] async_page_fault+0x28/0x30
[] ? copy_user_generic_string+0x2c/0x40
[] ? SyS_times+0x93/0x110
[] ? do_sys_times+0x2b0/0x2b0
[] ? do_sys_times+0x2b0/0x2b0
[] do_syscall_64+0x1af/0x4d0
[] entry_SYSCALL64_slow_path+0x25/0x25
[ cut here ]

The warning shows that it made it past the list_empty_careful() check
in finish_wait() but then bugs out on the wait->task_list
dereference.
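
For reference, finish_wait() here boils down to roughly this (paraphrased
from the 4.9-era kernel/sched/wait.c, so a sketch rather than the literal
source):

void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
{
	unsigned long flags;

	__set_current_state(TASK_RUNNING);
	/* lockless peek at the on-stack wait entry */
	if (!list_empty_careful(&wait->task_list)) {
		/*
		 * q and the neighbouring list entries may live on
		 * another task's stack that has since been reused,
		 * so both the lock and the unlink can end up
		 * dereferencing garbage here.
		 */
		spin_lock_irqsave(&q->lock, flags);
		list_del_init(&wait->task_list);
		spin_unlock_irqrestore(&q->lock, flags);
	}
}

i.e. once the careful check passes we still unlink the entry, and that
list_del_init() is what trips the list_del warning above.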

Anything stick out?


Vegard


Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 00:04, Vegard Nossum  wrote:
> FWIW I hit this as well:
>
> BUG: unable to handle kernel paging request at 81ff08b7
> IP: [] __lock_acquire.isra.32+0xda/0x1a30
> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: GB   4.9.0-rc7+ #217
[...]

> I think you can rule out btrfs in any case, probably block layer as
> well, since it looks like this comes from shmem.

I should rather say that the VM runs on a 9p root filesystem and it
doesn't use/mount any block devices or disk-based filesystems.

I have all the trinity logs for the crash if that's useful. I tried a
couple of runs with just the (at the time) in-progress syscalls but it
didn't turn up anything interesting. Otherwise it seems like a lot of
data to go through by hand.

The crashing child seems to have just started, though, if that's relevant:

[child56:21744] [0]
open(filename="/sys/block/loop2/power/runtime_active_time",
flags=0x777b01, mode=666) = -1 (Not a directory)
[child56:21744] [1] [32BIT] sched_getattr(pid=1, param=0x7f37ec26c000,
size=3415) = -1 (Invalid argument)
[child56:21744] [2] [32BIT]
access(filename="/proc/2/task/2/net/stat/arp_cache", mode=2000) = -1
(Invalid argument)
[child56:21744] [3] getegid() = 0xfffe
[child56:21744] [4]
swapoff(path="/proc/721/task/721/net/dev_snmp6/tunl0") = -1 (Operation
not permitted)
[child56:21744] [5] timerfd_create(clockid=0x0, flags=0x0) = 439
[child56:21744] [6] pkey_mprotect(start=0x7f37ee656000, len=0,
prot=0x108, key=0x600) = 0
[child56:21744] [7] msync(start=0x7f37ee657000, len=0, flags=0x6) = 0
[child56:21744] [8] flock(fd=168, cmd=0xffc191f30b0c) = -1
(Invalid argument)
[child56:21744] [9] add_key(_type=0x437a15,
_description=0x7f37ec06c000, _payload=0x7f37ec06c000, plen=0,
ringid=0xfff8)

The other logfiles end thusly:

==> trinity-child0.log <==
[child0:21593] [311] faccessat(dfd=246,
filename="/proc/983/task/983/net/protocols", mode=2000) = -1 (Invalid
argument)
[child0:21593] [312] renameat(olddfd=246,
oldname="/proc/13/task/13/attr/sockcreate", newdfd=377,
newname="/proc/16/task/16/net/stat/rt_cache") = -1 (Permission denied)
[child0:21593] [313] [32BIT] readv(fd=289, vec=0x2e1a3d0, vlen=215) = 0

==> trinity-child100.log <==
[child100:21536] [439] setgid(gid=0x2a000200) = -1 (Operation not permitted)
[child100:21536] [440] waitid(which=175, upid=21587, infop=0x4,
options=3542, ru=0x7f37ec76c000) = -1 (Invalid argument)
[child100:21536] [441]
getxattr(pathname="/proc/980/task/980/net/ptype", name=0x7f37ee466000,
value=0x7f37ec26c000, size=49) = -1 (Operation not supported)

==> trinity-child101.log <==
[child101:21537] [55] getcwd(buf=0x7f37ee466000, size=4096) = 39
[child101:21537] [56] [32BIT] munlock(addr=0x7f37ee658000, len=0) = 0
[child101:21537] [57] semctl(semid=0xbd851e2b40e7df,
semnum=0x1b1b1b1b1b, cmd=0x20, arg=0xcacacacaca) = -1 (Invalid
argument)

==> trinity-child102.log <==
[child102:21542] [11] readahead(fd=353, offset=2, count=249) = -1
(Invalid argument)
[child102:21542] [12] add_key(_type=0x43793f,
_description=0x7f37ec46c000, _payload=0x7f37ee658000, plen=32,
ringid=0xfffa) = -1 (Invalid argument)
[child102:21542] [13] time(tloc=0x7f37ee466000) = 0x584474e0

==> trinity-child103.log <==
[child103:21543] [45] dup(fildes=183) = 512
[child103:21543] [46] rt_sigpending(set=0x7f37ec86c000, sigsetsize=32)
= -1 (Invalid argument)
[child103:21543] [47] newstat(filename="/proc/587/task/587/gid_map",
statbuf=0x7f37ee466000) = 0

==> trinity-child104.log <==
[child104:21546] [49] getdents(fd=162, dirent=0x0, count=127) = -1
(Not a directory)
[child104:21546] [50] [32BIT] clock_adjtime(which_clock=0, utx=0x4) =
-1 (Bad address)
[child104:21546] [51] setsid() = 0x542a

==> trinity-child105.log <==
[child105:21547] [523] epoll_wait(epfd=244, events=0x8, maxevents=246,
timeout=-1) = -1 (Invalid argument)
[child105:21547] [524] dup2(oldfd=244, newfd=244) = 244
[child105:21547] [525] acct(name=0x7f37ec26c000) = -1 (Operation not permitted)

==> trinity-child106.log <==
[child106:19910] [136] getegid() = 0xfffe
[child106:19910] [137] munmap(addr=0x7f37ee65a000, len=4096) = 0
[child106:19910] [138] clock_nanosleep(which_clock=0x1, flags=0x1,
rqtp=0x7f37ec06c000, rmtp=0x7f37ee466000)
==> trinity-child107.log <==
[child107:21224] [994] copy_file_range(fd_in=373, off_in=0x2400e210,
fd_out=373, off_out=8, len=8, flags=0x0) = -1 (Bad file descriptor)
[child107:21224] [995] kcmp(pid1=1, pid2=21453, type=0x5,
idx1=0x787878787878, idx2=0xff6060606060) = -1 (Operation not
permitted)
[child107:21224] [996] [32BIT] readv(fd=365, vec=0x2e27e10, vlen=36) = 0

==> trinity-child108.log <==
[child108:21226] [759] recvfrom(fd=219, ubuf=0x7f37ec26c000, size=8,
flags=0x0, addr=0x2e1ed80, addr_len=110) = -1 (Bad file descriptor)
[child108:21226] [760] shmat(shmid=-4097, shmaddr=0x7f37ee465000,
shmflg=-195) = -1 (Invalid argument)
[child108:21226] [761] [32BIT] seccomp(op=0x0, flags=0x0, uargs=0x0)
==> 

Re: bio linked list corruption.

2016-12-04 Thread Vegard Nossum
On 23 November 2016 at 20:58, Dave Jones  wrote:
> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:
>
>  > [  317.689216] BUG: Bad page state in process kworker/u8:8  pfn:4d8fd4
>  > trace from just before this happened. Does this shed any light ?
>  >
>  > https://codemonkey.org.uk/junk/trace.txt
>
> crap, I just noticed the timestamps in the trace come from quite a bit
> later.  I'll tweak the code to do the taint checking/ftrace stop after
> every syscall, that should narrow the window some more.

FWIW I hit this as well:

BUG: unable to handle kernel paging request at 81ff08b7
IP: [] __lock_acquire.isra.32+0xda/0x1a30
PGD 461e067 PUD 461f063
PMD 1e001e1
Oops: 0003 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
  (ftrace buffer empty)
CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: GB   4.9.0-rc7+ #217
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
task: 8801ee924080 task.stack: 8801bab88000
RIP: 0010:[]  []
__lock_acquire.isra.32+0xda/0x1a30
RSP: 0018:8801bab8f730  EFLAGS: 00010082
RAX: 81ff071f RBX:  RCX: 
RDX: 0003 RSI:  RDI: 85ae7d00
RBP: 8801bab8f7b0 R08: 0001 R09: 
R10: 8801e727fc40 R11: fbfff0b54ced R12: 85ae7d00
R13: 84912920 R14: 8801ee924080 R15: 
FS:  7f37ee653700() GS:8801f6a0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 81ff08b7 CR3: 0001daa7 CR4: 06f0
DR0: 7f37ee465000 DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Stack:
8801ee9247d0  0001 8801ee924080
8801f6a201c0 8801f6a201c0  0001
8801 8801 8801e727fc40 8801ee924080
Call Trace:
[] lock_acquire+0x141/0x2b0
[] ? finish_wait+0xb0/0x180
[] _raw_spin_lock_irqsave+0x49/0x60
[] ? finish_wait+0xb0/0x180
[] finish_wait+0xb0/0x180
[] shmem_fault+0x4c7/0x6b0
[] ? p9_client_rpc+0x13d/0xf40
[] ? shmem_getpage_gfp+0x1c90/0x1c90
[] ? radix_tree_next_chunk+0x4f7/0x840
[] ? wake_atomic_t_function+0x210/0x210
[] __do_fault+0x206/0x410
[] ? do_page_mkwrite+0x320/0x320
[] handle_mm_fault+0x1cef/0x2a60
[] ? handle_mm_fault+0x132/0x2a60
[] ? __pmd_alloc+0x370/0x370
[] ? inode_add_bytes+0x10e/0x160
[] ? memset+0x31/0x40
[] ? find_vma+0x30/0x150
[] __do_page_fault+0x452/0x9f0
[] ? iov_iter_init+0xaf/0x1d0
[] trace_do_page_fault+0x1e5/0x3a0
[] do_async_page_fault+0x27/0xa0
[] async_page_fault+0x28/0x30
[] ? strnlen_user+0x91/0x1a0
[] ? strnlen_user+0x6e/0x1a0
[] strndup_user+0x28/0xb0
[] SyS_add_key+0xc7/0x370
[] ? key_get_type_from_user.constprop.6+0xd0/0xd0
[] ? __context_tracking_exit.part.4+0x3a/0x1e0
[] ? key_get_type_from_user.constprop.6+0xd0/0xd0
[] do_syscall_64+0x1af/0x4d0
[] entry_SYSCALL64_slow_path+0x25/0x25
Code: 89 4d b8 44 89 45 c0 89 4d c8 4c 89 55 d0 e8 ee c3 ff ff 48 85
c0 4c 8b 55 d0 8b 4d c8 44 8b 45 c0 4c 8b 4d b8 0f 84 c6 01 00 00 <3e>
ff 80 98 01 00 00 49 8d be 48 07 00 00 48 ba 00 00 00 00 00
RIP  [] __lock_acquire.isra.32+0xda/0x1a30

I didn't read all the emails in this thread, the crash site looks
identical to one of the earlier traces although the caller may be
different.

I think you can rule out btrfs in any case, probably block layer as
well, since it looks like this comes from shmem.


Vegard


Re: bio linked list corruption.

2016-11-23 Thread Dave Jones
On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:

 > [  317.689216] BUG: Bad page state in process kworker/u8:8  pfn:4d8fd4
 > trace from just before this happened. Does this shed any light ?
 > 
 > https://codemonkey.org.uk/junk/trace.txt

crap, I just noticed the timestamps in the trace come from quite a bit
later.  I'll tweak the code to do the taint checking/ftrace stop after
every syscall, that should narrow the window some more.

Getting closer..

Dave


Re: bio linked list corruption.

2016-11-23 Thread Dave Jones
On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote:
 > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote:
 > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones  
 > >wrote:
 > >>
 > >> BUG: Bad page state in process kworker/u8:12  pfn:4e0e39
 > >> page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 
 > >> index:0x100c
 > >> flags: 0x400c(referenced|uptodate)
 > >> page dumped because: non-NULL mapping
 > >
 > >Hmm. So this seems to be btrfs-specific, right?
 > >
 > >I searched for all your "non-NULL mapping" cases, and they all seem to
 > >have basically the same call trace, with some work thread doing
 > >writeback and going through btrfs_writepages().
 > >
 > >Sounds like it's a race with either fallocate hole-punching or
 > >truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
 > >clearly ran other filesystems too but I am not seeing this backtrace
 > >for anything else.
 > 
 > Agreed, I think this is a separate bug, almost certainly btrfs specific.  
 > I'll work with Dave on a better reproducer.

Ok, let's try this..
Here's the usual dump, from the current tree:


[  317.689216] BUG: Bad page state in process kworker/u8:8  pfn:4d8fd4
[  317.702235] page:ea001363f500 count:0 mapcount:0 
mapping:8804eef8e0e0 index:0x29
[  317.718068] flags: 0x400c(referenced|uptodate)
[  317.731086] page dumped because: non-NULL mapping
[  317.744118] CPU: 3 PID: 1558 Comm: kworker/u8:8 Not tainted 4.9.0-rc6-think+ 
#1 
[  317.770404] Workqueue: writeback wb_workfn
[  317.783490]  (flush-btrfs-3)
[  317.783581]  c9ff3798
[  317.796651]  813b69bc
[  317.796742]  ea001363f500
[  317.796798]  81c474ee
[  317.796854]  c9ff37c0
[  317.809910]  811b30c4
[  317.810001]  
[  317.810055]  ea001363f500
[  317.810112]  0003
[  317.823230]  c9ff37d0
[  317.823320]  811b318f
[  317.823376]  c9ff3818
[  317.823431] Call Trace:
[  317.836645]  [] dump_stack+0x4f/0x73
[  317.850063]  [] bad_page+0xc4/0x130
[  317.863426]  [] free_pages_check_bad+0x5f/0x70
[  317.876805]  [] free_hot_cold_page+0x305/0x3b0
[  317.890132]  [] free_hot_cold_page_list+0x7e/0x170
[  317.903410]  [] release_pages+0x2e7/0x3a0
[  317.916600]  [] __pagevec_release+0x27/0x40
[  317.929817]  [] 
extent_write_cache_pages.isra.44.constprop.59+0x355/0x430 [btrfs]
[  317.943073]  [] extent_writepages+0x5d/0x90 [btrfs]
[  317.956033]  [] ? btrfs_releasepage+0x40/0x40 [btrfs]
[  317.968902]  [] btrfs_writepages+0x28/0x30 [btrfs]
[  317.981744]  [] do_writepages+0x21/0x30
[  317.994524]  [] __writeback_single_inode+0x7f/0x6e0
[  318.007333]  [] ? _raw_spin_unlock+0x31/0x50
[  318.020090]  [] writeback_sb_inodes+0x31d/0x6c0
[  318.032844]  [] __writeback_inodes_wb+0x92/0xc0
[  318.045623]  [] wb_writeback+0x3f7/0x530
[  318.058304]  [] wb_workfn+0x148/0x620
[  318.070949]  [] ? process_one_work+0x194/0x690
[  318.083549]  [] ? wb_workfn+0x5/0x620
[  318.096160]  [] process_one_work+0x232/0x690
[  318.108722]  [] ? process_one_work+0x194/0x690
[  318.121225]  [] worker_thread+0x4e/0x490
[  318.133680]  [] ? process_one_work+0x690/0x690
[  318.146126]  [] ? process_one_work+0x690/0x690
[  318.158507]  [] kthread+0x102/0x120
[  318.170880]  [] ? kthread_park+0x60/0x60
[  318.183254]  [] ret_from_fork+0x22/0x30


This time, I managed to get my ftrace magic working, so we have this
trace from just before this happened. Does this shed any light ?

https://codemonkey.org.uk/junk/trace.txt

(note that some of the spinlock/rcu calls will be missing here. I chose
to add them to ftrace's exclude list because they dominate the trace
buffer so much, especially in debug builds)

Dave



Re: bio linked list corruption.

2016-10-31 Thread Chris Mason

On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote:

On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones  wrote:


BUG: Bad page state in process kworker/u8:12  pfn:4e0e39
page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c
flags: 0x400c(referenced|uptodate)
page dumped because: non-NULL mapping


Hmm. So this seems to be btrfs-specific, right?

I searched for all your "non-NULL mapping" cases, and they all seem to
have basically the same call trace, with some work thread doing
writeback and going through btrfs_writepages().

Sounds like it's a race with either fallocate hole-punching or
truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
clearly ran other filesystems too but I am not seeing this backtrace
for anything else.


Agreed, I think this is a separate bug, almost certainly btrfs specific.  
I'll work with Dave on a better reproducer.


-chris


Re: bio linked list corruption.

2016-10-31 Thread Linus Torvalds
On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones  wrote:
>
> BUG: Bad page state in process kworker/u8:12  pfn:4e0e39
> page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c
> flags: 0x400c(referenced|uptodate)
> page dumped because: non-NULL mapping

Hmm. So this seems to be btrfs-specific, right?

I searched for all your "non-NULL mapping" cases, and they all seem to
have basically the same call trace, with some work thread doing
writeback and going through btrfs_writepages().

Sounds like it's a race with either fallocate hole-punching or
truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
clearly ran other filesystems too but I am not seeing this backtrace
for anything else.

 Linus


Re: bio linked list corruption.

2016-10-31 Thread Dave Jones
On Wed, Oct 26, 2016 at 07:47:51PM -0400, Dave Jones wrote:
 > On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote:
 >  
 >  > >-hctx->queued++;
 >  > >-data->hctx = hctx;
 >  > >-data->ctx = ctx;
 >  > >+data->hctx = alloc_data.hctx;
 >  > >+data->ctx = alloc_data.ctx;
 >  > >+data->hctx->queued++;
 >  > > return rq;
 >  > > }
 >  > 
 >  > This made it through an entire dbench 2048 run on btrfs.  My script has 
 >  > it running in a loop, but this is farther than I've gotten before.  
 >  > Looking great so far.
 > 
 > Fixed the splat during boot for me too.
 > Now the fun part, let's see if it fixed the 'weird shit' that Trinity
 > was stumbling on.

It took a while, but.. bad news.


BUG: Bad page state in process kworker/u8:12  pfn:4e0e39
page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c
flags: 0x400c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 3 PID: 1586 Comm: kworker/u8:12 Not tainted 4.9.0-rc3-think+ #1 
Workqueue: writeback wb_workfn
  (flush-btrfs-2)
 c9777828
 8130d04c
 ea0013838e40
 819ff654
 c9777850
 81150e5f
 
 ea0013838e40
 400c
 c9777860
 81150f1a
 c97778a8
Call Trace:
 [] dump_stack+0x4f/0x73
 [] bad_page+0xbf/0x120
 [] free_pages_check_bad+0x5a/0x70
 [] free_hot_cold_page+0x248/0x290
 [] free_hot_cold_page_list+0x2b/0x50
 [] release_pages+0x2bd/0x350
 [] __pagevec_release+0x22/0x30
 [] extent_write_cache_pages.isra.48.constprop.63+0x32e/0x400 
[btrfs]
 [] extent_writepages+0x49/0x60 [btrfs]
 [] ? btrfs_releasepage+0x40/0x40 [btrfs]
 [] btrfs_writepages+0x23/0x30 [btrfs]
 [] do_writepages+0x1c/0x30
 [] __writeback_single_inode+0x33/0x180
 [] writeback_sb_inodes+0x2a8/0x5b0
 [] __writeback_inodes_wb+0x8d/0xc0
 [] wb_writeback+0x1e3/0x1f0
 [] wb_workfn+0xd2/0x280
 [] process_one_work+0x1d5/0x490
 [] ? process_one_work+0x175/0x490
 [] worker_thread+0x49/0x490
 [] ? process_one_work+0x490/0x490
 [] kthread+0xee/0x110
 [] ? kthread_park+0x60/0x60
 [] ret_from_fork+0x22/0x30





Re: bio linked list corruption.

2016-10-27 Thread Dave Jones
On Thu, Oct 27, 2016 at 04:41:33PM +1100, Dave Chinner wrote:
 
 > And that's indicative of a delalloc metadata reservation being
 > too small and so we're allocating unreserved blocks.
 > 
 > Different symptoms, same underlying cause, I think.
 > 
 > I see the latter assert from time to time in my testing, but it's
 > not common (maybe once a month) and I've never been able to track it
 > down.  However, it doesn't affect production systems unless they hit
 > ENOSPC hard enough that it causes the critical reserve pool to be
 > exhausted and so the allocation fails. That's extremely rare -
 > usually takes several hundred processes all trying to write as hard
 > as they can concurrently and to all slip through the ENOSPC
 > detection without the correct metadata reservations and all require
 > multiple metadata blocks to be allocated during writeback...
 > 
 > If you've got a way to trigger it quickly and reliably, that would
 > be helpful...

Seems pretty quickly reproducible for me in some shape or form.
Run trinity with --enable-fds=testfile and create enough children
to create a fair bit of contention (for me -C64 seems a good fit on
spinning rust, but if you're running on shiny nvme you might have to
pump it up a bit).
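
Concretely, something along these lines (the mount point is made up, the
flags are the ones mentioned above):

# run from a directory on the filesystem you want to beat up
cd /mnt/test
trinity --enable-fds=testfile -C64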

Dave



Re: bio linked list corruption.

2016-10-27 Thread Jens Axboe

On 10/27/2016 10:34 AM, Linus Torvalds wrote:

On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig  wrote:

Dave, can you hit the warnings with this? Totally untested...


Can we just kill off the unhelpful blk_map_ctx structure, e.g.:


Yeah, I found that hard to read too. The difference between
blk_map_ctx and blk_mq_alloc_data is minimal, might as well just use
the latter everywhere.

But please separate that patch from the patch that fixes the ctx list
corruption.


Agree


And Jens - can I have at least the corruption fix asap? Due to KS
travel, I'm going to do rc3 on Saturday, and I'd like to have the fix
in at least a day before that..


Just committed half an hour ago with the acks etc, I'll send it to you
shortly.

--
Jens Axboe



Re: bio linked list corruption.

2016-10-27 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig  wrote:
>> Dave, can you hit the warnings with this? Totally untested...
>
> Can we just kill off the unhelpful blk_map_ctx structure, e.g.:

Yeah, I found that hard to read too. The difference between
blk_map_ctx and blk_mq_alloc_data is minimal, might as well just use
the latter everywhere.

But please separate that patch from the patch that fixes the ctx list
corruption.

And Jens - can I have at least the corruption fix asap? Due to KS
travel, I'm going to do rc3 on Saturday, and I'd like to have the fix
in at least a day before that..

Linus


Re: bio linked list corruption.

2016-10-27 Thread Chris Mason


On 10/26/2016 08:00 PM, Jens Axboe wrote:
> On 10/26/2016 05:47 PM, Dave Jones wrote:
>> On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote:
>>
>>  > >-hctx->queued++;
>>  > >-data->hctx = hctx;
>>  > >-data->ctx = ctx;
>>  > >+data->hctx = alloc_data.hctx;
>>  > >+data->ctx = alloc_data.ctx;
>>  > >+data->hctx->queued++;
>>  > > return rq;
>>  > > }
>>  >
>>  > This made it through an entire dbench 2048 run on btrfs.  My script
>> has
>>  > it running in a loop, but this is farther than I've gotten before.
>>  > Looking great so far.
>>
>> Fixed the splat during boot for me too.
>> Now the fun part, let's see if it fixed the 'weird shit' that Trinity
>> was stumbling on.
> 
> Let's let the testing simmer overnight, then I'll turn this into a real
> patch tomorrow and get it submitted.
> 

I ran all night on both btrfs and xfs.  XFS came out clean, but btrfs hit
the WARN_ON below.  I hit it a few times with Jens' patch, always the
same warning.  It's pretty obviously a btrfs bug, we're not cleaning up
this list properly during fsync.  I tried a v1 of a btrfs fix overnight,
but I see where it was incomplete now and will re-run.

For the blk-mq bug, I think we got it!

Tested-by: always-blaming-jens-from-now-on 

WARNING: CPU: 5 PID: 16163 at lib/list_debug.c:62 __list_del_entry+0x86/0xd0
list_del corruption. next->prev should be 8801196d3be0, but was 
88010fc63308
Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper i2c_piix4 
lrw i2c_core gf128mul ablk_helper virtio_net serio_raw button pcspkr floppy 
cryptd sch_fq_codel autofs4 virtio_blk
CPU: 5 PID: 16163 Comm: dbench Not tainted 4.9.0-rc2-00041-g811d54d-dirty #322
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 
04/01/2014
 8801196d3a68 814fde3f 8151c356 8801196d3ac8
 8801196d3ac8  8801196d3ab8 810648cf
 dead0100 003e813bfc4a 8801196d3b98 880122b5c800
Call Trace:
 [] dump_stack+0x53/0x74
 [] ? __list_del_entry+0x86/0xd0
 [] __warn+0xff/0x120
 [] warn_slowpath_fmt+0x49/0x50
 [] __list_del_entry+0x86/0xd0
 [] btrfs_sync_log+0x75d/0xbd0
 [] ? btrfs_log_inode_parent+0x547/0xbb0
 [] ? _raw_spin_lock+0x1b/0x40
 [] ? __might_sleep+0x53/0xa0
 [] ? dput+0x65/0x280
 [] ? btrfs_log_dentry_safe+0x77/0x90
 [] btrfs_sync_file+0x424/0x490
 [] ? SYSC_kill+0xba/0x1d0
 [] ? __sb_end_write+0x58/0x80
 [] vfs_fsync_range+0x4c/0xb0
 [] ? syscall_trace_enter+0x201/0x2e0
 [] vfs_fsync+0x1c/0x20
 [] do_fsync+0x3d/0x70
 [] ? syscall_slow_exit_work+0xfb/0x100
 [] SyS_fsync+0x10/0x20
 [] do_syscall_64+0x55/0xd0
 [] ? prepare_exit_to_usermode+0x37/0x40
 [] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace c93288442a6424aa ]---



Re: bio linked list corruption.

2016-10-27 Thread Christoph Hellwig
> Dave, can you hit the warnings with this? Totally untested...

Can we just kill off the unhelpful blk_map_ctx structure, e.g.:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed..d74a74a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1190,21 +1190,15 @@ static inline bool blk_mq_merge_queue_io(struct 
blk_mq_hw_ctx *hctx,
}
 }
 
-struct blk_map_ctx {
-   struct blk_mq_hw_ctx *hctx;
-   struct blk_mq_ctx *ctx;
-};
-
 static struct request *blk_mq_map_request(struct request_queue *q,
  struct bio *bio,
- struct blk_map_ctx *data)
+ struct blk_mq_alloc_data *data)
 {
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
struct request *rq;
int op = bio_data_dir(bio);
int op_flags = 0;
-   struct blk_mq_alloc_data alloc_data;
 
blk_queue_enter_live(q);
ctx = blk_mq_get_ctx(q);
@@ -1214,12 +1208,10 @@ static struct request *blk_mq_map_request(struct 
request_queue *q,
op_flags |= REQ_SYNC;
 
trace_block_getrq(q, bio, op);
-   blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
-   rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
+   blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+   rq = __blk_mq_alloc_request(data, op, op_flags);
 
-   hctx->queued++;
-   data->hctx = hctx;
-   data->ctx = ctx;
+   data->hctx->queued++;
return rq;
 }
 
@@ -1267,7 +1259,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
 {
const int is_sync = rw_is_sync(bio_op(bio), bio->bi_opf);
const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
-   struct blk_map_ctx data;
+   struct blk_mq_alloc_data data;
struct request *rq;
unsigned int request_count = 0;
struct blk_plug *plug;
@@ -1363,7 +1355,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue 
*q, struct bio *bio)
const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
struct blk_plug *plug;
unsigned int request_count = 0;
-   struct blk_map_ctx data;
+   struct blk_mq_alloc_data data;
struct request *rq;
blk_qc_t cookie;
 


Re: bio linked list corruption.

2016-10-26 Thread Dave Chinner
On Tue, Oct 25, 2016 at 08:27:52PM -0400, Dave Jones wrote:
> DaveC: Do these look like real problems, or is this more "looks like
> random memory corruption" ?  It's been a while since I did some stress
> testing on XFS, so these might not be new..
> 
> XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 
> 2938
> [ cut here ]
> kernel BUG at fs/xfs/xfs_message.c:113!
> invalid opcode:  [#1] PREEMPT SMP
> CPU: 1 PID: 6227 Comm: trinity-c9 Not tainted 4.9.0-rc1-think+ #6 
> task: 8804f4658040 task.stack: 88050568c000
> RIP: 0010:[] 
>   [] assfail+0x1b/0x20 [xfs]
> RSP: :88050568f9e8  EFLAGS: 00010282
> RAX: ffea RBX: 0046 RCX: 0001
> RDX: ffc0 RSI: 000a RDI: a02fe34d
> RBP: 88050568f9e8 R08:  R09: 
> R10: 000a R11: f000 R12: 88050568fb44
> R13: 00f3 R14: 8804f292bf88 R15: 000e0046
> FS:  7fe2ddfdfb40() GS:88050a00() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fe2dbabd000 CR3: 0004f461f000 CR4: 001406e0
> Stack:
>  88050568fa88  a027ccee  fff9  8804f16fd8b0
>  3ffa  0032  8804f292bf40  4976
>  000e0008  04fd  8804  5107
> Call Trace:
>  [] xfs_bmap_add_extent_hole_delay+0x54e/0x620 [xfs]
>  [] xfs_bmapi_reserve_delalloc+0x2b4/0x400 [xfs]
>  [] xfs_file_iomap_begin_delay.isra.12+0x247/0x3c0 [xfs]
>  [] xfs_file_iomap_begin+0x181/0x270 [xfs]
>  [] iomap_apply+0x53/0x100
>  [] iomap_file_buffered_write+0x6b/0x90

All this iomap code is new, so it's entirely possible that this is a
new bug. The assert failure is indicative that the delalloc extent's
metadata reservation grew when we expected it to stay the same or
shrink.

> XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: 
> fs/xfs/xfs_trans.c, line: 309
...
>  [] xfs_trans_mod_sb+0x241/0x280 [xfs]
>  [] xfs_ag_resv_alloc_extent+0x4f/0xc0 [xfs]
>  [] xfs_alloc_ag_vextent+0x23d/0x300 [xfs]
>  [] xfs_alloc_vextent+0x5fb/0x6d0 [xfs]
>  [] xfs_bmap_btalloc+0x304/0x8e0 [xfs]
>  [] ? xfs_iext_bno_to_ext+0xee/0x170 [xfs]
>  [] xfs_bmap_alloc+0x2b/0x40 [xfs]
>  [] xfs_bmapi_write+0x640/0x1210 [xfs]
>  [] xfs_iomap_write_allocate+0x166/0x350 [xfs]
>  [] xfs_map_blocks+0x1b0/0x260 [xfs]
>  [] xfs_do_writepage+0x23b/0x730 [xfs]

And that's indicative of a delalloc metadata reservation being
too small and so we're allocating unreserved blocks.

Different symptoms, same underlying cause, I think.
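
As a rough model of the bookkeeping rule both assertions are guarding (a
toy sketch with made-up names, not the actual XFS code): a transaction
must never consume more blocks than it reserved, and merging delalloc
extents must not end up with a larger worst-case metadata reservation
than the pieces it replaced.

#include <assert.h>

/* Toy model only -- illustrative names, not XFS internals. */
struct toy_trans { unsigned int blk_res, blk_res_used; };

static void toy_use_blocks(struct toy_trans *tp, unsigned int nblocks)
{
    tp->blk_res_used += nblocks;
    /* the second report: "tp->t_blk_res_used <= tp->t_blk_res" */
    assert(tp->blk_res_used <= tp->blk_res);
}

static void toy_merge_delalloc(unsigned int oldlen, unsigned int newlen)
{
    /* the first report ("oldlen > newlen"): the merged reservation
     * must not grow past what the pieces already had */
    assert(newlen <= oldlen);
}

int main(void)
{
    struct toy_trans tp = { 8, 0 };

    toy_use_blocks(&tp, 4);      /* ok: 4 <= 8 */
    toy_merge_delalloc(6, 4);    /* ok: the reservation shrank */
    return 0;
}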

I see the latter assert from time to time in my testing, but it's
not common (maybe once a month) and I've never been able to track it
down.  However, it doesn't affect production systems unless they hit
ENOSPC hard enough that it causes the critical reserve pool to be
exhausted and so the allocation fails. That's extremely rare -
usually takes several hundred processes all trying to write as hard
as they can concurrently and to all slip through the ENOSPC
detection without the correct metadata reservations and all require
multiple metadata blocks to be allocated during writeback...

If you've got a way to trigger it quickly and reliably, that would
be helpful...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe

On 10/26/2016 05:47 PM, Dave Jones wrote:

On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote:

 > >- hctx->queued++;
 > >- data->hctx = hctx;
 > >- data->ctx = ctx;
 > >+ data->hctx = alloc_data.hctx;
 > >+ data->ctx = alloc_data.ctx;
 > >+ data->hctx->queued++;
 > >  return rq;
 > > }
 >
 > This made it through an entire dbench 2048 run on btrfs.  My script has
 > it running in a loop, but this is farther than I've gotten before.
 > Looking great so far.

Fixed the splat during boot for me too.
Now the fun part, let's see if it fixed the 'weird shit' that Trinity
was stumbling on.


Let's let the testing simmer overnight, then I'll turn this into a real
patch tomorrow and get it submitted.

--
Jens Axboe



Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote:
 
 > >-   hctx->queued++;
 > >-   data->hctx = hctx;
 > >-   data->ctx = ctx;
 > >+   data->hctx = alloc_data.hctx;
 > >+   data->ctx = alloc_data.ctx;
 > >+   data->hctx->queued++;
 > >return rq;
 > > }
 > 
 > This made it through an entire dbench 2048 run on btrfs.  My script has 
 > it running in a loop, but this is farther than I've gotten before.  
 > Looking great so far.

Fixed the splat during boot for me too.
Now the fun part, let's see if it fixed the 'weird shit' that Trinity
was stumbling on.

Dave
 


Re: bio linked list corruption.

2016-10-26 Thread Chris Mason

On Wed, Oct 26, 2016 at 05:20:01PM -0600, Jens Axboe wrote:

On 10/26/2016 05:08 PM, Linus Torvalds wrote:

On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe  wrote:


Actually, I think I see what might trigger it. You are on nvme, iirc,
and that has a deep queue.


Yes. I have long since moved on from slow disks, so all my systems are
not just flash, but m.2 nvme ssd's.

So at least that could explain why Dave sees it at bootup but I don't.


Yep, you'd never sleep during normal boot. The original patch had a
problem, though... This one should be better. Too many 'data's, we'll
still need to assign ctx/hctx, we should just use the current ones, not
the original ones.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..e56fec187ba6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1217,9 +1224,9 @@ static struct request *blk_mq_map_request(struct 
request_queue *q,

blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);

-   hctx->queued++;
-   data->hctx = hctx;
-   data->ctx = ctx;
+   data->hctx = alloc_data.hctx;
+   data->ctx = alloc_data.ctx;
+   data->hctx->queued++;
return rq;
}


This made it through an entire dbench 2048 run on btrfs.  My script has 
it running in a loop, but this is farther than I've gotten before.  
Looking great so far.


-chris



Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe

On 10/26/2016 05:19 PM, Chris Mason wrote:

On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote:

On 10/26/2016 04:58 PM, Linus Torvalds wrote:

On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
 wrote:


Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
blk_mq_merge_queue_io() into two


I did that myself too, since Dave sees this during boot.

But I'm not getting the warning ;(

Dave gets it with ext4, and that's what I have too, so I'm not sure
what the required trigger would be.


Actually, I think I see what might trigger it. You are on nvme, iirc,
and that has a deep queue. Dave, are you testing on a sata drive or
something similar with a shallower queue depth? If we end up sleeping
for a request, I think we could trigger data->ctx being different.

Dave, can you hit the warnings with this? Totally untested...


Confirmed, totally untested ;)  Don't try this one at home folks
(working this out with Jens offlist)


Ahem, I did say untested! The one I just sent in reply to Linus should
be better. Though that one is completely untested as well...

--
Jens Axboe



Re: bio linked list corruption.

2016-10-26 Thread Chris Mason

On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote:

On 10/26/2016 04:58 PM, Linus Torvalds wrote:

On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
 wrote:


Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
blk_mq_merge_queue_io() into two


I did that myself too, since Dave sees this during boot.

But I'm not getting the warning ;(

Dave gets it with ext4, and thats' what I have too, so I'm not sure
what the required trigger would be.


Actually, I think I see what might trigger it. You are on nvme, iirc,
and that has a deep queue. Dave, are you testing on a sata drive or
something similar with a shallower queue depth? If we end up sleeping
for a request, I think we could trigger data->ctx being different.

Dave, can you hit the warnings with this? Totally untested...


Confirmed, totally untested ;)  Don't try this one at home folks 
(working this out with Jens offlist)


BUG: unable to handle kernel paging request at 02411200
IP: [] _raw_spin_lock+0x22/0x40
PGD 12840a067
PUD 128446067
PMD 0

Oops: 0002 [#1] PREEMPT SMP
Modules linked in: virtio_blk(+)
CPU: 4 PID: 125 Comm: modprobe Not tainted 
4.9.0-rc2-00041-g811d54d-dirty #320
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 
04/01/2014

task: 88013849aac0 task.stack: 8801293d8000
RIP: 0010:[]  [] 
_raw_spin_lock+0x22/0x40

RSP: 0018:8801293db278  EFLAGS: 00010246
RAX:  RBX: 02411200 RCX: 
RDX: 0001 RSI: 88013a5c1048 RDI: 
RBP: 8801293db288 R08: 0005 R09: 880128449380
R10:  R11: 0008 R12: 
R13: 0001 R14: 0076 R15: 8801293b6a80
FS:  7f1a2a9cdb40() GS:88013fd0() 
knlGS:

CS:  0010 DS:  ES:  CR0: 80050033
CR2: 02411200 CR3: 00013a5d1000 CR4: 000406e0
Stack:
8801293db2d0 880128488000 8801293db348 814debff
00ff8801293db2c8 8801293db338 8801284888c0 8801284888b8
60fec4f9 02411200 880128f810c0 880128f810c0
Call Trace:
[] blk_sq_make_request+0x34f/0x580
[] ? mempool_alloc_slab+0x15/0x20
[] generic_make_request+0x104/0x200
[] submit_bio+0x65/0x130
[] submit_bh_wbc+0x16e/0x210
[] submit_bh+0x13/0x20
[] block_read_full_page+0x205/0x3d0
[] ? I_BDEV+0x20/0x20
[] ? lru_cache_add+0xe/0x10
[] ? add_to_page_cache_lru+0x92/0xf0
[] ? __page_cache_alloc+0xd1/0xe0
[] blkdev_readpage+0x18/0x20
[] do_read_cache_page+0x1c6/0x380
[] ? blkdev_writepages+0x10/0x10
[] ? alloc_pages_current+0xb2/0x1c0
[] read_cache_page+0x12/0x20
[] read_dev_sector+0x31/0xb0
[] read_lba+0xbd/0x130
[] find_valid_gpt+0xa2/0x580
[] ? find_valid_gpt+0x580/0x580
[] efi_partition+0x67/0x3d0
[] ? vsnprintf+0x2aa/0x470
[] ? snprintf+0x34/0x40
[] ? find_valid_gpt+0x580/0x580
[] check_partition+0x106/0x1e0
[] rescan_partitions+0x8c/0x270
[] __blkdev_get+0x328/0x3f0
[] blkdev_get+0x54/0x320
[] ? unlock_new_inode+0x5a/0x80
[] ? bdget+0xff/0x110
[] device_add_disk+0x3c6/0x450
[] ? ioread8+0x1a/0x40
[] ? vp_get+0x4e/0x70
[] virtblk_probe+0x460/0x708 [virtio_blk]
[] ? vp_finalize_features+0x36/0x50
[] virtio_dev_probe+0x132/0x1e0
[] driver_probe_device+0x1a9/0x2d0
[] ? mutex_lock+0x24/0x50
[] __driver_attach+0xbd/0xc0
[] ? driver_probe_device+0x2d0/0x2d0
[] ? driver_probe_device+0x2d0/0x2d0
[] bus_for_each_dev+0x8a/0xb0
[] driver_attach+0x1e/0x20
[] bus_add_driver+0x1b6/0x230
[] driver_register+0x60/0xe0
[] register_virtio_driver+0x20/0x40
[] init+0x57/0x81 [virtio_blk]
[] ? 0xa0004000
[] ? 0xa0004000
[] do_one_initcall+0x46/0x150
[] do_init_module+0x6a/0x210
[] ? vfree+0x37/0x90
[] load_module+0x1638/0x1860
[] ? do_free_init+0x30/0x30
[] ? kernel_read_file_from_fd+0x54/0x90
[] SYSC_finit_module+0xc2/0xd0
[] SyS_finit_module+0xe/0x10
[] entry_SYSCALL_64_fastpath+0x13/0x94
Code: 89 df e8 a2 52 70 ff eb e6 55 48 89 e5 53 48 83 ec 08 66 66 66 66 
90 48 89 fb bf 01 00 00 00 e8 95 53 6e ff 31 c0 ba 01 00 00 00  0f 
b1 13 85 c0 75 07 48 83 c4 08 5b c9 c3 89 c6 48 89 df e8

RIP  [] _raw_spin_lock+0x22/0x40
RSP 
CR2: 02411200
---[ end trace e8cb117e64947621 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception

-chris


Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe

On 10/26/2016 05:08 PM, Linus Torvalds wrote:

On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe  wrote:


Actually, I think I see what might trigger it. You are on nvme, iirc,
and that has a deep queue.


Yes. I have long since moved on from slow disks, so all my systems are
not just flash, but m.2 nvme ssd's.

So at least that could explain why Dave sees it at bootup but I don't.


Yep, you'd never sleep during normal boot. The original patch had a
problem, though... This one should be better. Too many 'data's, we'll
still need to assign ctx/hctx, we should just use the current ones, not
the original ones.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..e56fec187ba6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1217,9 +1224,9 @@ static struct request *blk_mq_map_request(struct 
request_queue *q,

blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);

-   hctx->queued++;
-   data->hctx = hctx;
-   data->ctx = ctx;
+   data->hctx = alloc_data.hctx;
+   data->ctx = alloc_data.ctx;
+   data->hctx->queued++;
return rq;
 }


--
Jens Axboe



Re: bio linked list corruption.

2016-10-26 Thread Chris Mason

On Wed, Oct 26, 2016 at 03:07:10PM -0700, Linus Torvalds wrote:

On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason  wrote:


Today I turned off every CONFIG_DEBUG_* except for list debugging, and
ran dbench 2048:

[ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 
__list_add+0xbe/0xd0
[ 2759.119652] list_add corruption. prev->next should be next 
(e8c80308), but was c9ccfb88. (prev=880128522380).
[ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 
virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button 
ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
[ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 
4.9.0-rc1-15246-g4ce9206-dirty #317
[ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.9.0-1.fc24 04/01/2014
[ 2759.125077]  c9000f6fb868 814fe4ff 8151cb5e 
c9000f6fb8c8
[ 2759.125077]  c9000f6fb8c8  c9000f6fb8b8 
81064bbf
[ 2759.127444]  880128523680 002139968000 880138b7a4a0 
880128523540
[ 2759.127444] Call Trace:
[ 2759.127444]  [] dump_stack+0x53/0x74
[ 2759.127444]  [] ? __list_add+0xbe/0xd0
[ 2759.127444]  [] __warn+0xff/0x120
[ 2759.127444]  [] warn_slowpath_fmt+0x49/0x50
[ 2759.127444]  [] __list_add+0xbe/0xd0
[ 2759.127444]  [] blk_sq_make_request+0x388/0x580


Ok, that's definitely the same one that Dave started out seeing.

The fact that it is that reliable - two different machines, two very
different loads (dbench looks nothing like trinity) really makes me
think that maybe the problem really is in the block plugging after
all.

It very much does not smell like random stack corruption. It's simply
not random enough.


Agreed.  I'd feel better if I could trigger this outside of btrfs, even 
though Dave Chinner hit something very similar on xfs.  I'll peel off 
another test machine for a long XFS run.




And I just noticed something: I originally thought that this is the
"list_add_tail()" to the plug - which is the only "list_add()" variant
in that function.

But that never made sense, because the whole "but was" isn't a stack
address, and "next" in "list_add_tail()" is basically fixed, and would
have to be the stack.

But I now notice that there's actually another "list_add()" variant
there, and it's the one from __blk_mq_insert_request() that gets
inlined into blk_mq_insert_request(), which then gets inlined into
blk_mq_make_request().

And that actually makes some sense just looking at the offsets too:

blk_sq_make_request+0x388/0x580

so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely.

And there it makes perfect sense that the "next should be" value is
*not* on the stack.

Chris, if you built with debug info, you can try

   ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388

to get what line that blk_sq_make_request+0x388 address actually is. I
think it's the

   list_add_tail(&rq->queuelist, &ctx->rq_list);

in __blk_mq_insert_req_list() (when it's inlined from
blk_sq_make_request(), "at_head" will be false).

So it smells like "&ctx->rq_list" might be corrupt.
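
For reference, the check that prints that "prev->next should be next (X),
but was Y" message is essentially the following (a simplified user-space
rendition of the CONFIG_DEBUG_LIST test in lib/list_debug.c, not the exact
kernel code): it fires when somebody has already re-pointed prev->next,
i.e. the list was modified without the serialization the caller thought
it had.

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* Roughly what CONFIG_DEBUG_LIST verifies before splicing new_entry in
 * between prev and next. */
static int list_add_valid(struct list_head *new_entry,
                          struct list_head *prev, struct list_head *next)
{
    (void)new_entry;
    if (prev->next != next) {
        /* someone else already linked a different node after prev */
        fprintf(stderr, "list_add corruption. prev->next should be "
                "next (%p), but was %p. (prev=%p)\n",
                (void *)next, (void *)prev->next, (void *)prev);
        return 0;
    }
    return 1;
}

int main(void)
{
    struct list_head head = { &head, &head };
    struct list_head stray, new_entry;

    head.next = &stray;   /* simulate a racing, unlocked writer */
    /* list_add_tail(&new_entry, &head) ends up as
     * __list_add(&new_entry, head.prev, &head): */
    list_add_valid(&new_entry, head.prev, &head);
    return 0;
}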


I'm running your current git here, so these line numbers should line up 
for you:


blk_sq_make_request+0x388/0x578:
__blk_mq_insert_request at block/blk-mq.c:1049
(inlined by) blk_mq_merge_queue_io at block/blk-mq.c:1175
 (inlined by) blk_sq_make_request at block/blk-mq.c:1419

The fsync path in the WARN doesn't have any plugs that I can see, so it's 
not surprising that we're not in the plugging path.  I'm here:


if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
/*
 * For a SYNC request, send it to the hardware immediately. For
 * an ASYNC request, just ensure that we run it later on. The
 * latter allows for merging opportunities and more efficient
 * dispatching.
 */

I'll try the debugging patch you sent in the other email.

-chris


Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe  wrote:
>
> Actually, I think I see what might trigger it. You are on nvme, iirc,
> and that has a deep queue.

Yes. I have long since moved on from slow disks, so all my systems are
not just flash, but m.2 nvme ssd's.

So at least that could explain why Dave sees it at bootup but I don't.

Linus


Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote:
 > On 10/26/2016 04:58 PM, Linus Torvalds wrote:
 > > On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
 > >  wrote:
 > >>
 > >> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
 > >> blk_mq_merge_queue_io() into two
 > >
 > > I did that myself too, since Dave sees this during boot.
 > >
 > > But I'm not getting the warning ;(
 > >
 > > Dave gets it with ext4, and thats' what I have too, so I'm not sure
 > > what the required trigger would be.
 > 
 > Actually, I think I see what might trigger it. You are on nvme, iirc,
 > and that has a deep queue. Dave, are you testing on a sata drive or
 > something similar with a shallower queue depth? If we end up sleeping
 > for a request, I think we could trigger data->ctx being different.

yeah, just regular sata. I've been meaning to put an ssd in that box for
a while, but now it sounds like my procrastination may have paid off.

 > Dave, can you hit the warnings with this? Totally untested...
 
Coming up..
 
Dave



Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe

On 10/26/2016 04:58 PM, Linus Torvalds wrote:

On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
 wrote:


Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
blk_mq_merge_queue_io() into two


I did that myself too, since Dave sees this during boot.

But I'm not getting the warning ;(

Dave gets it with ext4, and that's what I have too, so I'm not sure
what the required trigger would be.


Actually, I think I see what might trigger it. You are on nvme, iirc,
and that has a deep queue. Dave, are you testing on a sata drive or
something similar with a shallower queue depth? If we end up sleeping
for a request, I think we could trigger data->ctx being different.
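
A minimal user-space sketch of that sequence (the names below are
illustrative stand-ins, not the real blk-mq code): the submitter samples
the per-CPU ctx, the allocation sleeps, the task wakes up on another CPU,
and the request comes back tagged with the new CPU's ctx while the caller
keeps using the stale one.

#include <stdio.h>

struct ctx { int cpu; };
struct request { struct ctx *mq_ctx; };

static struct ctx per_cpu_ctx[2] = { { 0 }, { 1 } };
static int current_cpu;

/* Stand-in for the blocking allocation: while the task sleeps it can be
 * migrated, and the request it finally gets is tagged with the ctx of
 * whatever CPU it woke up on. */
static struct request *alloc_request_may_sleep(struct request *rq)
{
    current_cpu = 1;                          /* migrated while sleeping */
    rq->mq_ctx = &per_cpu_ctx[current_cpu];
    return rq;
}

int main(void)
{
    struct request rq;
    struct ctx *ctx = &per_cpu_ctx[current_cpu];  /* sampled before sleeping */
    struct request *r = alloc_request_may_sleep(&rq);

    /* The caller still holds the pre-sleep ctx, the request carries the
     * post-sleep one -- the mismatch a WARN_ON_ONCE(rq->mq_ctx != ctx)
     * check would trip on. */
    printf("caller ctx cpu=%d, rq->mq_ctx cpu=%d\n", ctx->cpu, r->mq_ctx->cpu);
    return 0;
}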

Dave, can you hit the warnings with this? Totally untested...


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..80a9c45a9235 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1217,9 +1217,7 @@ static struct request *blk_mq_map_request(struct 
request_queue *q,

blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);

-   hctx->queued++;
-   data->hctx = hctx;
-   data->ctx = ctx;
+   data->hctx->queued++;
return rq;
 }


--
Jens Axboe



Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe

On 10/26/2016 05:01 PM, Dave Jones wrote:

On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote:
 > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
 > blk_mq_merge_queue_io() into two, since right now it can trigger both
 > for the
 >
 > blk_mq_bio_to_request(rq, bio);
 >
 > path _and_ for the
 >
 > if (!blk_mq_attempt_merge(q, ctx, bio)) {
 > blk_mq_bio_to_request(rq, bio);
 > goto insert_rq;
 >
 > path. If you split it into two: one before that "insert_rq:" label,
 > and one before the "goto insert_rq" thing, then we could see if it is
 > just one of the blk_mq_merge_queue_io() cases (or both) that is
 > broken..

It's the latter of the two.

[   12.302392] WARNING: CPU: 3 PID: 272 at block/blk-mq.c:1191
blk_sq_make_request+0x320/0x4d0


I would expect that - so normal request, since we know merging is on. I
just sent out a patch, can you give that a whirl?

Keep Linus' debugging patch, and apply it on top. Below as well, for
reference.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..80a9c45a9235 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1217,9 +1217,7 @@ static struct request *blk_mq_map_request(struct 
request_queue *q,

blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);

-   hctx->queued++;
-   data->hctx = hctx;
-   data->ctx = ctx;
+   data->hctx->queued++;
return rq;
 }


--
Jens Axboe



Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote:
 > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
 > blk_mq_merge_queue_io() into two, since right now it can trigger both
 > for the
 > 
 > blk_mq_bio_to_request(rq, bio);
 > 
 > path _and_ for the
 > 
 > if (!blk_mq_attempt_merge(q, ctx, bio)) {
 > blk_mq_bio_to_request(rq, bio);
 > goto insert_rq;
 > 
 > path. If you split it into two: one before that "insert_rq:" label,
 > and one before the "goto insert_rq" thing, then we could see if it is
 > just one of the blk_mq_merge_queue_io() cases (or both) that is
 > broken..

It's the latter of the two.

[   12.302392] WARNING: CPU: 3 PID: 272 at block/blk-mq.c:1191
blk_sq_make_request+0x320/0x4d0

Dave




Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
 wrote:
>
> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
> blk_mq_merge_queue_io() into two

I did that myself too, since Dave sees this during boot.

But I'm not getting the warning ;(

Dave gets it with ext4, and that's what I have too, so I'm not sure
what the required trigger would be.

  Linus


Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe

On 10/26/2016 04:51 PM, Linus Torvalds wrote:

On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones  wrote:


I gave it a shot too for shits & giggles.
This falls out during boot.

[9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 
blk_sq_make_request+0x465/0x4a0


Hmm. That's the

WARN_ON_ONCE(rq->mq_ctx != ctx);

that I added to blk_mq_merge_queue_io(), and I really think that
warning is valid, and the fact that it triggers shows that something
is wrong with locking.

We just did a

spin_lock(&ctx->lock);

and that lock is *supposed* to protect the __blk_mq_insert_request(),
but that uses rq->mq_ctx.

So if rq->mq_ctx != ctx, then we're locking the wrong context.

Jens - please explain to me why I'm wrong.

Or maybe I actually might have found the problem? In which case please
send me a patch that fixes it ;)


I think you're pretty close, the two should not be different and I don't
immediately see how. I'll run some testing here, should be easier with
this knowledge.


Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
blk_mq_merge_queue_io() into two, since right now it can trigger both
for the

blk_mq_bio_to_request(rq, bio);

path _and_ for the

if (!blk_mq_attempt_merge(q, ctx, bio)) {
blk_mq_bio_to_request(rq, bio);
goto insert_rq;


And just in case I can't trigger, would be interesting to add a call to
blk_rq_dump_flags() as well, in case this is some special request.


--
Jens Axboe



Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe

On 10/26/2016 04:40 PM, Dave Jones wrote:

On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote:

 > Could you try the attached patch? It adds a couple of sanity tests:
 >
 >  - a number of tests to verify that 'rq->queuelist' isn't already on
 > some queue when it is added to a queue
 >
 >  - one test to verify that rq->mq_ctx is the same ctx that we have locked.
 >
 > I may be completely full of shit, and this patch may be pure garbage
 > or "obviously will never trigger", but humor me.

I gave it a shot too for shits & giggles.
This falls out during boot.

[9.244030] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: 
(null)
[9.271391] [ cut here ]
[9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 
blk_sq_make_request+0x465/0x4a0
[9.285613] CPU: 0 PID: 1 Comm: init Not tainted 4.9.0-rc2-think+ #4


Very odd, don't immediately see how that can happen. For testing, can
you try and add the below patch? Just curious if that fixes the list
corruption. Thing is, I don't see how ->mq_ctx and ctx are different in
this path, but I can debug that on the side.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..73b9462aa21f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1165,9 +1165,10 @@ static inline bool hctx_allow_merges(struct 
blk_mq_hw_ctx *hctx)

 }

 static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
-struct blk_mq_ctx *ctx,
 struct request *rq, struct bio *bio)
 {
+   struct blk_mq_ctx *ctx = rq->mq_ctx;
+
if (!hctx_allow_merges(hctx) || !bio_mergeable(bio)) {
blk_mq_bio_to_request(rq, bio);
spin_lock(&ctx->lock);
@@ -1338,7 +1339,7 @@ static blk_qc_t blk_mq_make_request(struct 
request_queue *q, struct bio *bio)

goto done;
}

-   if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+   if (!blk_mq_merge_queue_io(data.hctx, rq, bio)) {
/*
 * For a SYNC request, send it to the hardware immediately. For
 * an ASYNC request, just ensure that we run it later on. The
@@ -1416,7 +1417,7 @@ static blk_qc_t blk_sq_make_request(struct 
request_queue *q, struct bio *bio)

return cookie;
}

-   if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+   if (!blk_mq_merge_queue_io(data.hctx, rq, bio)) {
/*
 * For a SYNC request, send it to the hardware immediately. For
 * an ASYNC request, just ensure that we run it later on. The

--
Jens Axboe



Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones  wrote:
>
> I gave it a shot too for shits & giggles.
> This falls out during boot.
>
> [9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 
> blk_sq_make_request+0x465/0x4a0

Hmm. That's the

WARN_ON_ONCE(rq->mq_ctx != ctx);

that I added to blk_mq_merge_queue_io(), and I really think that
warning is valid, and the fact that it triggers shows that something
is wrong with locking.

We just did a

spin_lock(&ctx->lock);

and that lock is *supposed* to protect the __blk_mq_insert_request(),
but that uses rq->mq_ctx.

So if rq->mq_ctx != ctx, then we're locking the wrong context.
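
To make the hazard concrete, here is a small user-space model
(illustrative names, not the kernel code) of what locking the wrong
context means: the insert serializes against threads locking context A,
but it modifies context B's list, so a CPU that correctly holds B's lock
can link into the same list at the same time -- two unsynchronized
writers, which is exactly the kind of thing the list debugging warning
reports.

#include <pthread.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* Illustrative stand-in for blk_mq_ctx: each software context has its
 * own lock and its own pending-request list. */
struct soft_ctx {
    pthread_mutex_t lock;
    struct list_head rq_list;
};

/* What the insert path effectively does when ctx != rq->mq_ctx: it takes
 * one context's lock but links the request into the other context's list. */
static void insert_under_wrong_lock(struct soft_ctx *locked_ctx,
                                    struct soft_ctx *rq_ctx,
                                    struct list_head *rq)
{
    pthread_mutex_lock(&locked_ctx->lock);
    rq->next = &rq_ctx->rq_list;            /* open-coded list_add_tail() */
    rq->prev = rq_ctx->rq_list.prev;
    rq_ctx->rq_list.prev->next = rq;
    rq_ctx->rq_list.prev = rq;
    pthread_mutex_unlock(&locked_ctx->lock);
}

int main(void)
{
    struct soft_ctx a, b;
    struct list_head rq;

    pthread_mutex_init(&a.lock, NULL);
    pthread_mutex_init(&b.lock, NULL);
    a.rq_list.next = a.rq_list.prev = &a.rq_list;
    b.rq_list.next = b.rq_list.prev = &b.rq_list;

    /* Locks a, but modifies b's list; anything correctly holding b's
     * lock can race with this and corrupt b's rq_list. */
    insert_under_wrong_lock(&a, &b, &rq);
    printf("b.rq_list.next == &rq: %d\n", b.rq_list.next == &rq);
    return 0;
}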

Jens - please explain to me why I'm wrong.

Or maybe I actually might have found the problem? In which case please
send me a patch that fixes it ;)

Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
blk_mq_merge_queue_io() into two, since right now it can trigger both
for the

blk_mq_bio_to_request(rq, bio);

path _and_ for the

if (!blk_mq_attempt_merge(q, ctx, bio)) {
blk_mq_bio_to_request(rq, bio);
goto insert_rq;

path. If you split it into two: one before that "insert_rq:" label,
and one before the "goto insert_rq" thing, then we could see if it is
just one of the blk_mq_merge_queue_io() cases (or both) that is
broken..

  Linus


Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote:

 > Could you try the attached patch? It adds a couple of sanity tests:
 > 
 >  - a number of tests to verify that 'rq->queuelist' isn't already on
 > some queue when it is added to a queue
 > 
 >  - one test to verify that rq->mq_ctx is the same ctx that we have locked.
 > 
 > I may be completely full of shit, and this patch may be pure garbage
 > or "obviously will never trigger", but humor me.
 
I gave it a shot too for shits & giggles.
This falls out during boot.

[9.244030] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: 
(null)
[9.271391] [ cut here ]
[9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 
blk_sq_make_request+0x465/0x4a0
[9.285613] CPU: 0 PID: 1 Comm: init Not tainted 4.9.0-rc2-think+ #4
[9.300106]  c9013848
[9.307353]  8130d27c
[9.307420]  
[9.307456]  
[9.307494]  c9013888
[9.314780]  81077a41
[9.314847]  049d00013898
[9.314884]  0001
[9.314922]  0002
[9.322213]  e8c06600
[9.322280]  e8606600
[9.322317]  8805015c2e80
[9.322355] Call Trace:
[9.329689]  [] dump_stack+0x4f/0x73
[9.337208]  [] __warn+0xc1/0xe0
[9.344730]  [] warn_slowpath_null+0x18/0x20
[9.352282]  [] blk_sq_make_request+0x465/0x4a0
[9.359816]  [] ? generic_make_request+0xca/0x210
[9.367367]  [] generic_make_request+0xd7/0x210
[9.374931]  [] submit_bio+0x69/0x120
[9.382488]  [] ? submit_bh_wbc+0x16f/0x1e0
[9.390083]  [] submit_bh_wbc+0x16f/0x1e0
[9.397673]  [] submit_bh+0xe/0x10
[9.405205]  [] __ext4_get_inode_loc+0x1ac/0x3d0
[9.412795]  [] ext4_iget+0x6b/0xb70
[9.420366]  [] ? lookup_slow+0xf8/0x1f0
[9.427947]  [] ext4_iget_normal+0x2a/0x30
[9.435541]  [] ext4_lookup+0x114/0x230



Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 2:52 PM, Chris Mason  wrote:
>
> This one is special because CONFIG_VMAP_STACK is not set.  Btrfs triggers in 
> < 10 minutes.
> I've done 30 minutes each with XFS and Ext4 without luck.

Ok, see the email I wrote that crossed yours - if it's really some
list corruption on ctx->rq_list due to some locking problem, I really
would expect CONFIG_VMAP_STACK to be entirely irrelevant, except
perhaps from a timing standpoint.

> WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> list_add corruption. prev->next should be next (e8d80b08), but was 
> 88012b65fb88. (prev=880128c8d500).
> Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw 
> gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy 
> button pcspkr sch_fq_codel autofs4 virtio_blk
> CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 
> 04/01/2014
>  880104eff868 814fde0f 8151c46e 880104eff8c8
>  880104eff8c8  880104eff8b8 810648cf
>  880128cab2c0 00213fc57c68 8801384e8928 880128cab180
> Call Trace:
>  [] dump_stack+0x53/0x74
>  [] ? __list_add+0xbe/0xd0
>  [] __warn+0xff/0x120
>  [] warn_slowpath_fmt+0x49/0x50
>  [] __list_add+0xbe/0xd0
>  [] blk_sq_make_request+0x388/0x580
>  [] generic_make_request+0x104/0x200

Well, it's very consistent, I have to say. So I really don't think
this is random corruption.

Could you try the attached patch? It adds a couple of sanity tests:

 - a number of tests to verify that 'rq->queuelist' isn't already on
some queue when it is added to a queue

 - one test to verify that rq->mq_ctx is the same ctx that we have locked.

I may be completely full of shit, and this patch may be pure garbage
or "obviously will never trigger", but humor me.

  Linus
 block/blk-mq.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..4f575de7fdd0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -521,6 +521,8 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head)
          */
         BUG_ON(rq->cmd_flags & REQ_SOFTBARRIER);
 
+        WARN_ON_ONCE(!list_empty(&rq->queuelist));
+
         spin_lock_irqsave(&q->requeue_lock, flags);
         if (at_head) {
                 rq->cmd_flags |= REQ_SOFTBARRIER;
@@ -838,6 +840,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
                         queued++;
                         break;
                 case BLK_MQ_RQ_QUEUE_BUSY:
+                        WARN_ON_ONCE(!list_empty(&rq->queuelist));
                         list_add(&rq->queuelist, &rq_list);
                         __blk_mq_requeue_request(rq);
                         break;
@@ -1034,6 +1037,8 @@ static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 
         trace_block_rq_insert(hctx->queue, rq);
 
+        WARN_ON_ONCE(!list_empty(&rq->queuelist));
+
         if (at_head)
                 list_add(&rq->queuelist, &ctx->rq_list);
         else
@@ -1137,6 +1142,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
                         depth = 0;
                 }
 
+                WARN_ON_ONCE(!list_empty(&rq->queuelist));
                 depth++;
                 list_add_tail(&rq->queuelist, &ctx_list);
         }
@@ -1172,6 +1178,7 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
                 blk_mq_bio_to_request(rq, bio);
                 spin_lock(&ctx->lock);
 insert_rq:
+                WARN_ON_ONCE(rq->mq_ctx != ctx);
                 __blk_mq_insert_request(hctx, rq, false);
                 spin_unlock(&ctx->lock);
                 return false;
@@ -1326,6 +1333,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
                         old_rq = same_queue_rq;
                         list_del_init(&same_queue_rq->queuelist);
                 }
+                WARN_ON_ONCE(!list_empty(&rq->queuelist));
                 list_add_tail(&rq->queuelist, &plug->mq_list);
         } else /* is_sync */
                 old_rq = rq;
@@ -1412,6 +1420,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
                         trace_block_plug(q);
                 }
 
+                WARN_ON_ONCE(!list_empty(&rq->queuelist));
                 list_add_tail(&rq->queuelist, &plug->mq_list);
                 return cookie;
         }


Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason  wrote:
>
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
>
> [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 
> __list_add+0xbe/0xd0
> [ 2759.119652] list_add corruption. prev->next should be next 
> (e8c80308), but was c9ccfb88. (prev=880128522380).
> [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel 
> aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw 
> pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
> [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 
> 4.9.0-rc1-15246-g4ce9206-dirty #317
> [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.9.0-1.fc24 04/01/2014
> [ 2759.125077]  c9000f6fb868 814fe4ff 8151cb5e 
> c9000f6fb8c8
> [ 2759.125077]  c9000f6fb8c8  c9000f6fb8b8 
> 81064bbf
> [ 2759.127444]  880128523680 002139968000 880138b7a4a0 
> 880128523540
> [ 2759.127444] Call Trace:
> [ 2759.127444]  [] dump_stack+0x53/0x74
> [ 2759.127444]  [] ? __list_add+0xbe/0xd0
> [ 2759.127444]  [] __warn+0xff/0x120
> [ 2759.127444]  [] warn_slowpath_fmt+0x49/0x50
> [ 2759.127444]  [] __list_add+0xbe/0xd0
> [ 2759.127444]  [] blk_sq_make_request+0x388/0x580

Ok, that's definitely the same one that Dave started out seeing.

The fact that it is that reliable - two different machines, two very
different loads (dbench looks nothing like trinity) really makes me
think that maybe the problem really is in the block plugging after
all.

It very much does not smell like random stack corruption. It's simply
not random enough.

And I just noticed something: I originally thought that this is the
"list_add_tail()" to the plug - which is the only "list_add()" variant
in that function.

But that never made sense, because the whole "but was" isn't a stack
address, and "next" in "list_add_tail()" is basically fixed, and would
have to be the stack.

But I now notice that there's actually another "list_add()" variant
there, and it's the one from __blk_mq_insert_request() that gets
inlined into blk_mq_insert_request(), which then gets inlined into
blk_mq_make_request().

And that actually makes some sense just looking at the offsets too:

 blk_sq_make_request+0x388/0x580

so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely.

And there it makes perfect sense that the "next should be" value is
*not* on the stack.

Chris, if you built with debug info, you can try

./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388

to get what line that blk_sq_make_request+0x388 address actually is. I
think it's the

list_add_tail(&rq->queuelist, &ctx->rq_list);

in __blk_mq_insert_req_list() (when it's inlined from
blk_sq_make_request(), "at_head" will be false).
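
(For reference, the output of faddr2line looks roughly like the following; the
format is what addr2line produces, but the file and line numbers here are
invented purely for illustration:

  $ ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388/0x580
  blk_sq_make_request+0x388/0x580:
  __blk_mq_insert_req_list at block/blk-mq.c:1049
   (inlined by) blk_sq_make_request at block/blk-mq.c:1427

Inlined call chains show up as "(inlined by)" lines, which is what makes it
useful here.)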

So it smells like "&ctx->rq_list" might be corrupt.

And that actually seems much more likely than the "plug" list, because
while the plug is entirely thread-local (and thus shouldn't have any
races), the ctx->rq_list very much is not.

Jens?

For example, should we have a

BUG_ON(ctx != rq->mq_ctx);

in blk_mq_merge_queue_io()? Because it locks ctx->lock, but then
__blk_mq_insert_request() will insert things onto the queues of
rq->mq_ctx.

blk_mq_insert_requests() has similar issues, but there has that BUG_ON().

The locking there really is *very* messy. All the lockers do

spin_lock(&ctx->lock);
...
spin_unlock(&ctx->lock);

but then __blk_mq_insert_request() and __blk_mq_insert_req_list don't
act on "ctx", but on "ctx = rq->mq_ctx", so if you ever get those
wrong, you're completely dead.
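
As a minimal sketch of the kind of cross-check being discussed, assuming the
4.9-era __blk_mq_insert_req_list() shown in the patch elsewhere in this thread
(illustrative only, not a tested change): make the insert helper itself insist
that the lock being held is the lock of the ctx whose rq_list it touches.

static void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
                                     struct request *rq, bool at_head)
{
        struct blk_mq_ctx *ctx = rq->mq_ctx;

        trace_block_rq_insert(hctx->queue, rq);

        /*
         * The list being modified is ctx->rq_list, so with lockdep enabled
         * this fires if the caller locked some other ctx by mistake.
         */
        lockdep_assert_held(&ctx->lock);

        if (at_head)
                list_add(&rq->queuelist, &ctx->rq_list);
        else
                list_add_tail(&rq->queuelist, &ctx->rq_list);
}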

Now, I'm not seeing why they'd be wrong, and why they'd be associated
with the VMAP_STACK thing, but it could just be an unlucky timing
thing.

   Linus


Re: bio linked list corruption.

2016-10-26 Thread Chris Mason


On 10/26/2016 04:00 PM, Chris Mason wrote:
> 
> 
> On 10/26/2016 03:06 PM, Linus Torvalds wrote:
>> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones  wrote:
>>>
>>> The stacks show nearly all of them are stuck in sync_inodes_sb
>>
>> That's just wb_wait_for_completion(), and it means that some IO isn't
>> completing.
>>
>> There's also a lot of processes waiting for inode_lock(), and a few
>> waiting for mnt_want_write()
>>
>> Ignoring those, we have
>>
>>> [] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
>>> [] btrfs_sync_fs+0x31/0xc0 [btrfs]
>>> [] sync_filesystem+0x6e/0xa0
>>> [] SyS_syncfs+0x3c/0x70
>>> [] do_syscall_64+0x5c/0x170
>>> [] entry_SYSCALL64_slow_path+0x25/0x25
>>> [] 0x
>>
>> Don't know this one. There's a couple of them. Could there be some
>> ABBA deadlock on the ordered roots waiting?
> 
> It's always possible, but we haven't changed anything here.
> 
> I've tried a long list of things to reproduce this on my test boxes,
> including days of trinity runs and a kernel module to exercise vmalloc,
> and thread creation.
> 
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
> 

This one is special because CONFIG_VMAP_STACK is not set.  Btrfs triggers in < 
10 minutes.
I've done 30 minutes each with XFS and Ext4 without luck.

This is all in a virtual machine that I can copy on to a bunch of hosts.  So 
I'll get some
parallel tests going tonight to narrow it down.

[ cut here ]
WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
list_add corruption. prev->next should be next (e8d80b08), but was 
88012b65fb88. (prev=880128c8d500).
Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul 
ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr 
sch_fq_codel autofs4 virtio_blk
CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 
04/01/2014
 880104eff868 814fde0f 8151c46e 880104eff8c8
 880104eff8c8  880104eff8b8 810648cf
 880128cab2c0 00213fc57c68 8801384e8928 880128cab180
Call Trace:
 [] dump_stack+0x53/0x74
 [] ? __list_add+0xbe/0xd0
 [] __warn+0xff/0x120
 [] warn_slowpath_fmt+0x49/0x50
 [] __list_add+0xbe/0xd0
 [] blk_sq_make_request+0x388/0x580
 [] generic_make_request+0x104/0x200
 [] submit_bio+0x65/0x130
 [] ? __percpu_counter_add+0x96/0xd0
 [] btrfs_map_bio+0x23c/0x310
 [] btrfs_submit_bio_hook+0xd3/0x190
 [] submit_one_bio+0x6d/0xa0
 [] flush_epd_write_bio+0x4e/0x70
 [] extent_writepages+0x5d/0x70
 [] ? btrfs_releasepage+0x50/0x50
 [] ? wbc_attach_and_unlock_inode+0x6e/0x170
 [] btrfs_writepages+0x27/0x30
 [] do_writepages+0x20/0x30
 [] __filemap_fdatawrite_range+0xb5/0x100
 [] filemap_fdatawrite_range+0x13/0x20
 [] btrfs_fdatawrite_range+0x2b/0x70
 [] btrfs_sync_file+0x88/0x490
 [] ? group_send_sig_info+0x42/0x80
 [] ? kill_pid_info+0x5d/0x90
 [] ? SYSC_kill+0xba/0x1d0
 [] ? __sb_end_write+0x58/0x80
 [] vfs_fsync_range+0x4c/0xb0
 [] ? syscall_trace_enter+0x201/0x2e0
 [] vfs_fsync+0x1c/0x20
 [] do_fsync+0x3d/0x70
 [] ? syscall_slow_exit_work+0xfb/0x100
 [] SyS_fsync+0x10/0x20
 [] do_syscall_64+0x55/0xd0
 [] ? prepare_exit_to_usermode+0x37/0x40
 [] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace efe6b17c6dba2a6e ]---


Re: bio linked list corruption.

2016-10-26 Thread Chris Mason


On 10/26/2016 03:06 PM, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones  wrote:
>>
>> The stacks show nearly all of them are stuck in sync_inodes_sb
> 
> That's just wb_wait_for_completion(), and it means that some IO isn't
> completing.
> 
> There's also a lot of processes waiting for inode_lock(), and a few
> waiting for mnt_want_write()
> 
> Ignoring those, we have
> 
>> [] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
>> [] btrfs_sync_fs+0x31/0xc0 [btrfs]
>> [] sync_filesystem+0x6e/0xa0
>> [] SyS_syncfs+0x3c/0x70
>> [] do_syscall_64+0x5c/0x170
>> [] entry_SYSCALL64_slow_path+0x25/0x25
>> [] 0x
> 
> Don't know this one. There's a couple of them. Could there be some
> ABBA deadlock on the ordered roots waiting?

It's always possible, but we haven't changed anything here.

I've tried a long list of things to reproduce this on my test boxes,
including days of trinity runs and a kernel module to exercise vmalloc,
and thread creation.

Today I turned off every CONFIG_DEBUG_* except for list debugging, and
ran dbench 2048:

[ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 
__list_add+0xbe/0xd0
[ 2759.119652] list_add corruption. prev->next should be next 
(e8c80308), but was c9ccfb88. (prev=880128522380).
[ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 
virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button 
ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
[ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 
4.9.0-rc1-15246-g4ce9206-dirty #317
[ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.9.0-1.fc24 04/01/2014
[ 2759.125077]  c9000f6fb868 814fe4ff 8151cb5e 
c9000f6fb8c8
[ 2759.125077]  c9000f6fb8c8  c9000f6fb8b8 
81064bbf
[ 2759.127444]  880128523680 002139968000 880138b7a4a0 
880128523540
[ 2759.127444] Call Trace:
[ 2759.127444]  [] dump_stack+0x53/0x74
[ 2759.127444]  [] ? __list_add+0xbe/0xd0
[ 2759.127444]  [] __warn+0xff/0x120
[ 2759.127444]  [] warn_slowpath_fmt+0x49/0x50
[ 2759.127444]  [] __list_add+0xbe/0xd0
[ 2759.127444]  [] blk_sq_make_request+0x388/0x580
[ 2759.127444]  [] generic_make_request+0x104/0x200
[ 2759.127444]  [] submit_bio+0x65/0x130
[ 2759.127444]  [] ? __percpu_counter_add+0x96/0xd0
[ 2759.127444]  [] btrfs_map_bio+0x23c/0x310
[ 2759.127444]  [] btrfs_submit_bio_hook+0xd3/0x190
[ 2759.127444]  [] submit_one_bio+0x6d/0xa0
[ 2759.127444]  [] flush_epd_write_bio+0x4e/0x70
[ 2759.127444]  [] extent_writepages+0x5d/0x70
[ 2759.127444]  [] ? btrfs_releasepage+0x50/0x50
[ 2759.127444]  [] ? wbc_attach_and_unlock_inode+0x6e/0x170
[ 2759.127444]  [] btrfs_writepages+0x27/0x30
[ 2759.127444]  [] do_writepages+0x20/0x30
[ 2759.127444]  [] __filemap_fdatawrite_range+0xb5/0x100
[ 2759.127444]  [] filemap_fdatawrite_range+0x13/0x20
[ 2759.127444]  [] btrfs_fdatawrite_range+0x2b/0x70
[ 2759.127444]  [] btrfs_sync_file+0x88/0x490
[ 2759.127444]  [] ? group_send_sig_info+0x42/0x80
[ 2759.127444]  [] ? kill_pid_info+0x5d/0x90
[ 2759.127444]  [] ? SYSC_kill+0xba/0x1d0
[ 2759.127444]  [] ? __sb_end_write+0x58/0x80
[ 2759.127444]  [] vfs_fsync_range+0x4c/0xb0
[ 2759.127444]  [] ? syscall_trace_enter+0x201/0x2e0
[ 2759.127444]  [] vfs_fsync+0x1c/0x20
[ 2759.127444]  [] do_fsync+0x3d/0x70
[ 2759.127444]  [] ? syscall_slow_exit_work+0xfb/0x100
[ 2759.127444]  [] SyS_fsync+0x10/0x20
[ 2759.127444]  [] do_syscall_64+0x55/0xd0
[ 2759.127444]  [] ? prepare_exit_to_usermode+0x37/0x40
[ 2759.127444]  [] entry_SYSCALL64_slow_path+0x25/0x25
[ 2759.150635] ---[ end trace 3b5b7e2ef61c3d02 ]---

I put a variant of your suggested patch in place, but my printk never
triggered.  Now that I've made it happen once, I'll make sure I can do it
over and over again.  This doesn't have the patches that Andy asked Davej to
try out yet, but I'll try them once I have a reliable reproducer.

diff --git a/kernel/fork.c b/kernel/fork.c
index 623259f..de95e19 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -165,7 +165,7 @@ void __weak arch_release_thread_stack(unsigned long *stack)
  * vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
  * flush.  Try to minimize the number of calls by caching stacks.
  */
-#define NR_CACHED_STACKS 2
+#define NR_CACHED_STACKS 256
 static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
 #endif
 
@@ -173,7 +173,9 @@ static unsigned long *alloc_thread_stack_node(struct 
task_struct *tsk, int node)
 {
 #ifdef CONFIG_VMAP_STACK
void *stack;
+   char *p;
int i;
+   int j;
 
local_irq_disable();
for (i = 0; i < NR_CACHED_STACKS; i++) {
@@ -183,7 +185,15 @@ static unsigned long *alloc_thread_stack_node(struct 
task_struct *tsk, int node)
continue;
this_cpu_write(cached_stacks[i], NULL);
 
+   p = s->addr;
+   for (j = 0; j < 
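
The hunk above is truncated in the archive, so the actual check is lost. As a
rough illustration of the general idea only (poison a cached stack when it goes
back into the cache, then verify the poison when the stack is handed out again,
so that any late writer into a dead task's stack gets caught), the loop could
look something like this. The 0xfe poison value and the printk are made up
here, this is not Chris's actual patch, and the matching memset() would have to
be added to free_thread_stack():

                p = s->addr;
                for (j = 0; j < THREAD_SIZE; j++) {
                        /* 0xfe: hypothetical poison written when the stack was cached */
                        if (p[j] != (char)0xfe) {
                                printk(KERN_ERR "cached stack %p dirtied at offset %d\n",
                                       s->addr, j);
                                break;
                        }
                }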

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones  wrote:
>
> The stacks show nearly all of them are stuck in sync_inodes_sb

That's just wb_wait_for_completion(), and it means that some IO isn't
completing.

There's also a lot of processes waiting for inode_lock(), and a few
waiting for mnt_want_write()

Ignoring those, we have

> [] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
> [] btrfs_sync_fs+0x31/0xc0 [btrfs]
> [] sync_filesystem+0x6e/0xa0
> [] SyS_syncfs+0x3c/0x70
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x

Don't know this one. There's a couple of them. Could there be some
ABBA deadlock on the ordered roots waiting?

> [] call_rwsem_down_write_failed+0x17/0x30
> [] btrfs_fallocate+0xb2/0xfd0 [btrfs]
> [] vfs_fallocate+0x13e/0x220
> [] SyS_fallocate+0x43/0x80
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x

This one is also inode_lock(), and is interesting only because it's
fallocate(), which has shown up so many times before..

But there are other threads blocked on do_truncate, or
btrfs_file_write_iter instead, or on lseek, so this is not different
for any other reason.

> [] wait_on_page_bit+0xaf/0xc0
> [] __filemap_fdatawait_range+0x151/0x170
> [] filemap_fdatawait_keep_errors+0x1c/0x20
> [] sync_inodes_sb+0x273/0x300
> [] sync_filesystem+0x57/0xa0
> [] SyS_syncfs+0x3c/0x70
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x

This is actually waiting on the page. Possibly this is the IO that is
never completing, and keeps the inode lock.

> [] btrfs_start_ordered_extent+0x5b/0xb0 [btrfs]
> [] lock_and_cleanup_extent_if_need+0x22d/0x290 [btrfs]
> [] __btrfs_buffered_write+0x1b8/0x6e0 [btrfs]
> [] btrfs_file_write_iter+0x170/0x550 [btrfs]
> [] do_iter_readv_writev+0xa8/0x100
> [] do_readv_writev+0x172/0x210
> [] vfs_writev+0x3a/0x50
> [] do_pwritev+0xb0/0xd0
> [] SyS_pwritev+0xc/0x10
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25

Hmm. This is the one that *started* the ordered extents (as opposed to
the ones waiting for it)

I dunno. There might be a lost IO. More likely it's the same
corruption that causes it, it just didn't result in an oops this time.

 Linus

