Re: bio linked list corruption.

2016-12-06 Thread Linus Torvalds
On Tue, Dec 6, 2016 at 12:16 AM, Peter Zijlstra wrote: >> >> Of course, I'm really hoping that this shmem.c use is the _only_ such >> case. But I doubt it. > > $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l > 28 Hmm. Most of them seem to be ok, because they use

Re: bio linked list corruption.

2016-12-06 Thread Vegard Nossum
On 5 December 2016 at 22:33, Vegard Nossum wrote: > On 5 December 2016 at 21:35, Linus Torvalds > wrote: >> Note for Ingo and Peter: this patch has not been tested at all. But >> Vegard did test an earlier patch of mine that just verified

Re: bio linked list corruption.

2016-12-06 Thread Ingo Molnar
* Peter Zijlstra wrote: > $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l > 28 This debug facility looks sensible. A couple of minor suggestions: > --- a/include/linux/wait.h > +++ b/include/linux/wait.h > @@ -39,6 +39,9 @@ struct wait_bit_queue { > struct

Re: bio linked list corruption.

2016-12-06 Thread Peter Zijlstra
On Mon, Dec 05, 2016 at 12:35:52PM -0800, Linus Torvalds wrote: > Adding the scheduler people to the participants list, and re-attaching > the patch, because while this patch is internal to the VM code, the > issue itself is not. > > There might well be other cases where somebody goes

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 21:35, Linus Torvalds wrote: > Note for Ingo and Peter: this patch has not been tested at all. But > Vegard did test an earlier patch of mine that just verified that yes, > the issue really was that wait queue entries remained on the wait >

Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
Adding the scheduler people to the participants list, and re-attaching the patch, because while this patch is internal to the VM code, the issue itself is not. There might well be other cases where somebody goes "wake_up_all()" will wake everybody up, so I can put the wait queue head on the
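A rough userspace sketch of the hazard described in this message, not kernel code: it assumes a wake callback that behaves like the kernel's autoremove_wake_function(), which unlinks an entry only when the wakeup actually woke something. The names `struct waiter`, `wake_one()` and `wake_all()` are hypothetical stand-ins for illustration. The point is that a "wake everybody" call does not by itself guarantee an empty list, so a list head living on the waker's stack can still be referenced after that stack frame is gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

static void list_init(struct list_node *n)
{
    n->next = n;
    n->prev = n;
}

static bool list_empty(const struct list_node *head)
{
    return head->next == head;
}

static void list_add_tail(struct list_node *entry, struct list_node *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

static void list_del_init(struct list_node *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    list_init(entry);
}

/* Hypothetical waiter: 'sleeping' models whether the task is really asleep. */
struct waiter {
    struct list_node entry;
    bool sleeping;
};

/* Unlink the waiter only if the wakeup actually did something. */
static bool wake_one(struct waiter *w)
{
    bool woke = w->sleeping;

    w->sleeping = false;
    if (woke)
        list_del_init(&w->entry);
    return woke;
}

/* Walk the whole queue; entries that were not asleep stay linked. */
static void wake_all(struct list_node *head)
{
    struct list_node *pos = head->next;

    while (pos != head) {
        struct list_node *next = pos->next;
        struct waiter *w =
            (struct waiter *)((char *)pos - offsetof(struct waiter, entry));

        wake_one(w);
        pos = next;
    }
}
```

In this model a waiter that was already running when `wake_all()` ran is still linked afterwards and has to unlink itself later, which is only safe while the list head is still alive.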

Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 11:11 AM, Vegard Nossum wrote: > > [ cut here ] > WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0 Ok, good. So that's confirmed as the cause of this problem. And the call chain that I wanted is

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 20:11, Vegard Nossum wrote: > On 5 December 2016 at 18:55, Linus Torvalds > wrote: >> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum >> wrote: >> Since you apparently can recreate this fairly

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 18:55, Linus Torvalds wrote: > On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum wrote: >> >> The warning shows that it made it past the list_empty_careful() check >> in finish_wait() but then bugs out on the &wait->task_list >>

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 19:11, Andy Lutomirski wrote: > On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum wrote: >> On 23 November 2016 at 20:58, Dave Jones wrote: >>> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: >>>

Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 10:11 AM, Andy Lutomirski wrote: > > So your kernel has been smp-alternatived. That 3e comes from > alternatives_smp_unlock. If you're running on SMP with UP > alternatives, things will break. I'm assuming he's just running in a VM with a single CPU.

Re: bio linked list corruption.

2016-12-05 Thread Andy Lutomirski
On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum wrote: > On 23 November 2016 at 20:58, Dave Jones wrote: >> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: >> >> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4

Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum wrote: > > The warning shows that it made it past the list_empty_careful() check > in finish_wait() but then bugs out on the &wait->task_list > dereference. > > Anything stick out? I hate that shmem waitqueue garbage. It's really

Re: bio linked list corruption.

2016-12-05 Thread Dave Jones
On Mon, Dec 05, 2016 at 06:09:29PM +0100, Vegard Nossum wrote: > On 5 December 2016 at 12:10, Vegard Nossum wrote: > > On 5 December 2016 at 00:04, Vegard Nossum wrote: > >> FWIW I hit this as well: > >> > >> BUG: unable to handle kernel

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 12:10, Vegard Nossum wrote: > On 5 December 2016 at 00:04, Vegard Nossum wrote: >> FWIW I hit this as well: >> >> BUG: unable to handle kernel paging request at 81ff08b7 >> IP: [] __lock_acquire.isra.32+0xda/0x1a30

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 00:04, Vegard Nossum wrote: > FWIW I hit this as well: > > BUG: unable to handle kernel paging request at 81ff08b7 > IP: [] __lock_acquire.isra.32+0xda/0x1a30 > CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: GB 4.9.0-rc7+ #217

Re: bio linked list corruption.

2016-12-04 Thread Vegard Nossum
On 23 November 2016 at 20:58, Dave Jones wrote: > On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: > > > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 > > trace from just before this happened. Does this shed any light ? > > > >

Re: bio linked list corruption.

2016-11-23 Thread Dave Jones
On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 > trace from just before this happened. Does this shed any light ? > > https://codemonkey.org.uk/junk/trace.txt crap, I just noticed the timestamps in the

Re: bio linked list corruption.

2016-11-23 Thread Dave Jones
On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote: > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones > >wrote: > >> > >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > >>

Re: bio linked list corruption.

2016-10-31 Thread Chris Mason
On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones wrote: BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c flags:

Re: bio linked list corruption.

2016-10-31 Thread Linus Torvalds
On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones wrote: > > BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c > flags: 0x400c(referenced|uptodate) > page dumped because:

Re: bio linked list corruption.

2016-10-31 Thread Dave Jones
On Wed, Oct 26, 2016 at 07:47:51PM -0400, Dave Jones wrote: > On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > > > >-hctx->queued++; > > >-data->hctx = hctx; > > >-data->ctx = ctx; > > >+data->hctx = alloc_data.hctx; > > >+

Re: bio linked list corruption.

2016-10-27 Thread Dave Jones
On Thu, Oct 27, 2016 at 04:41:33PM +1100, Dave Chinner wrote: > And that's indicative of a delalloc metadata reservation > being too small and so we're allocating unreserved blocks. > > Different symptoms, same underlying cause, I think. > > I see the latter assert from time to

Re: bio linked list corruption.

2016-10-27 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig wrote: >> Dave, can you hit the warnings with this? Totally untested... > > Can we just kill off the unhelpful blk_map_ctx structure, e.g.: Yeah, I found that hard to read too. The difference between blk_map_ctx and

Re: bio linked list corruption.

2016-10-27 Thread Chris Mason
On 10/26/2016 08:00 PM, Jens Axboe wrote: > On 10/26/2016 05:47 PM, Dave Jones wrote: >> On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: >> >> > >-hctx->queued++; >> > >-data->hctx = hctx; >> > >-data->ctx = ctx; >> > >+data->hctx = alloc_data.hctx; >> > >+

Re: bio linked list corruption.

2016-10-27 Thread Christoph Hellwig
> Dave, can you hit the warnings with this? Totally untested... Can we just kill off the unhelpful blk_map_ctx structure, e.g.: diff --git a/block/blk-mq.c b/block/blk-mq.c index ddc2eed..d74a74a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1190,21 +1190,15 @@ static inline bool

Re: bio linked list corruption.

2016-10-26 Thread Dave Chinner
On Tue, Oct 25, 2016 at 08:27:52PM -0400, Dave Jones wrote: > DaveC: Do these look like real problems, or is this more "looks like > random memory corruption" ? It's been a while since I did some stress > testing on XFS, so these might not be new.. > > XFS: Assertion failed: oldlen > newlen,

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 05:47 PM, Dave Jones wrote: On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > >- hctx->queued++; > >- data->hctx = hctx; > >- data->ctx = ctx; > >+ data->hctx = alloc_data.hctx; > >+ data->ctx = alloc_data.ctx; > >+ data->hctx->queued++;

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > >- hctx->queued++; > >- data->hctx = hctx; > >- data->ctx = ctx; > >+ data->hctx = alloc_data.hctx; > >+ data->ctx = alloc_data.ctx; > >+ data->hctx->queued++; > >return rq; > > } > > This made it through

Re: bio linked list corruption.

2016-10-26 Thread Chris Mason
On Wed, Oct 26, 2016 at 05:20:01PM -0600, Jens Axboe wrote: On 10/26/2016 05:08 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe wrote: Actually, I think I see what might trigger it. You are on nvme, iirc, and that has a deep queue. Yes. I have long since

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 05:19 PM, Chris Mason wrote: On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: On 10/26/2016 04:58 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: Dave: it might be a good idea to split that

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 05:08 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe wrote: Actually, I think I see what might trigger it. You are on nvme, iirc, and that has a deep queue. Yes. I have long since moved on from slow disks, so all my systems are not just

Re: bio linked list corruption.

2016-10-26 Thread Chris Mason
On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: On 10/26/2016 04:58 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io() into two I

Re: bio linked list corruption.

2016-10-26 Thread Chris Mason
On Wed, Oct 26, 2016 at 03:07:10PM -0700, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason wrote: Today I turned off every CONFIG_DEBUG_* except for list debugging, and ran dbench 2048: [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe wrote: > > Actually, I think I see what might trigger it. You are on nvme, iirc, > and that has a deep queue. Yes. I have long since moved on from slow disks, so all my systems are not just flash, but m.2 nvme ssd's. So at least that

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: > On 10/26/2016 04:58 PM, Linus Torvalds wrote: > > On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds > > wrote: > >> > >> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > >>

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 05:01 PM, Dave Jones wrote: On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote: > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two, since right now it can trigger both > for the > >

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 04:58 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io() into two I did that myself too, since Dave sees this during boot. But

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote: > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two, since right now it can trigger both > for the > > blk_mq_bio_to_request(rq, bio); > > path _and_ for the >

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: > > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two I did that myself too, since Dave sees this during boot. But I'm not getting the warning ;( Dave gets it

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 04:51 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones wrote: I gave it a shot too for shits & giggles. This falls out during boot. [9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0 Hmm.

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 04:40 PM, Dave Jones wrote: On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote: > Could you try the attached patch? It adds a couple of sanity tests: > > - a number of tests to verify that 'rq->queuelist' isn't already on > some queue when it is added to a queue

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones wrote: > > I gave it a shot too for shits & giggles. > This falls out during boot. > > [9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 > blk_sq_make_request+0x465/0x4a0 Hmm. That's the WARN_ON_ONCE(rq->mq_ctx

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote: > Could you try the attached patch? It adds a couple of sanity tests: > > - a number of tests to verify that 'rq->queuelist' isn't already on > some queue when it is added to a queue > > - one test to verify that rq->mq_ctx

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 2:52 PM, Chris Mason wrote: > > This one is special because CONFIG_VMAP_STACK is not set. Btrfs triggers in > < 10 minutes. > I've done 30 minutes each with XFS and Ext4 without luck. Ok, see the email I wrote that crossed yours - if it's really some list

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason wrote: > > Today I turned off every CONFIG_DEBUG_* except for list debugging, and > ran dbench 2048: > > [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 > __list_add+0xbe/0xd0 > [ 2759.119652] list_add corruption.

Re: bio linked list corruption.

2016-10-26 Thread Chris Mason
On 10/26/2016 04:00 PM, Chris Mason wrote: > > > On 10/26/2016 03:06 PM, Linus Torvalds wrote: >> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones wrote: >>> >>> The stacks show nearly all of them are stuck in sync_inodes_sb >> >> That's just wb_wait_for_completion(), and

Re: bio linked list corruption.

2016-10-26 Thread Chris Mason
On 10/26/2016 03:06 PM, Linus Torvalds wrote: > On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones wrote: >> >> The stacks show nearly all of them are stuck in sync_inodes_sb > > That's just wb_wait_for_completion(), and it means that some IO isn't > completing. > > There's

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones wrote: > > The stacks show nearly all of them are stuck in sync_inodes_sb That's just wb_wait_for_completion(), and it means that some IO isn't completing. There's also a lot of processes waiting for inode_lock(), and a few

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 09:48:39AM -0700, Linus Torvalds wrote: > I know you already had this in some email, but I lost it. I think you > narrowed it down to a specific set of system calls that seems to > trigger this best. fallocate and xattrs or something? So I was about to give that a shot

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 09:48:39AM -0700, Linus Torvalds wrote: > On Wed, Oct 26, 2016 at 9:30 AM, Dave Jones wrote: > > > > I gave this a go last thing last night. It crashed within 5 minutes, > > but it was one we've already seen (the bad page map trace) with

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 9:30 AM, Dave Jones wrote: > > I gave this a go last thing last night. It crashed within 5 minutes, > but it was one we've already seen (the bad page map trace) with nothing > additional that looked interesting. Did the bad page map trace have any

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Tue, Oct 25, 2016 at 06:39:03PM -0700, Linus Torvalds wrote: > On Tue, Oct 25, 2016 at 6:33 PM, Linus Torvalds > wrote: > > > > Completely untested. Maybe there's some reason we can't write to the > > whole thing like that? > > That hack boots and seems

Re: bio linked list corruption.

2016-10-25 Thread Linus Torvalds
On Tue, Oct 25, 2016 at 6:33 PM, Linus Torvalds wrote: > > Completely untested. Maybe there's some reason we can't write to the > whole thing like that? That hack boots and seems to work for me, but doesn't show anything. Dave, mind just trying that oneliner?

Re: bio linked list corruption.

2016-10-25 Thread Linus Torvalds
On Tue, Oct 25, 2016 at 5:27 PM, Dave Jones wrote: > > DaveC: Do these look like real problems, or is this more "looks like > random memory corruption" ? It's been a while since I did some stress > testing on XFS, so these might not be new.. Andy, do you think we could

Re: bio linked list corruption.

2016-10-24 Thread Andy Lutomirski
On Oct 24, 2016 5:00 PM, "Linus Torvalds" wrote: > > On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski wrote: > > > Now the fallocate thread catches up and *exits*. Dave's test makes a > > new thread that reuses the stack (the vmap area or the

Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski wrote: > > Here's my theory: I think you're looking at the right code but the > wrong stack. shmem_fault_wait is fine, but shmem_fault_waitq looks > really dicey. Hmm. > Consider: > > fallocate calls wake_up_all(), which

Re: bio linked list corruption.

2016-10-24 Thread Andy Lutomirski
On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds wrote: > On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski wrote: >>> >>> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> >> This is an unhandled kernel page fault. The string "Oops"

Re: bio linked list corruption.

2016-10-24 Thread Chris Mason
On 10/24/2016 05:50 PM, Linus Torvalds wrote: On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds wrote: The vmalloc/vfree code itself is a bit scary. In particular, we have a rather insane model of TLB flushing. We leave the virtual area on a lazy purge-list, and

Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds wrote: > > The vmalloc/vfree code itself is a bit scary. In particular, we have a > rather insane model of TLB flushing. We leave the virtual area on a > lazy purge-list, and we delay flushing the TLB and actually

Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds wrote: > > So this is all some really subtle code, but I'm not seeing that it > would be wrong. Ahh... Except maybe.. The vmalloc/vfree code itself is a bit scary. In particular, we have a rather insane model of TLB

Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski wrote: >> >> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC > > This is an unhandled kernel page fault. The string "Oops" is so helpful :-/ I think there was a line above it that DaveJ just didn't include. > >>

Re: bio linked list corruption.

2016-10-24 Thread Andy Lutomirski
On Sun, Oct 23, 2016 at 9:40 PM, Dave Jones wrote: > On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote: > > > > > > On 10/22/2016 11:20 AM, Dave Jones wrote: > > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > > > > > > > It could be

Re: bio linked list corruption.

2016-10-24 Thread Chris Mason
On 10/24/2016 12:40 AM, Dave Jones wrote: On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote: > > > On 10/22/2016 11:20 AM, Dave Jones wrote: > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > > > > > It could be worth trying this, too: > > > > > > > >

Re: bio linked list corruption.

2016-10-23 Thread Dave Jones
On Sun, Oct 23, 2016 at 05:32:21PM -0400, Chris Mason wrote: > > > On 10/22/2016 11:20 AM, Dave Jones wrote: > > On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > > > > > It could be worth trying this, too: > > > > > > > > > >

Re: bio linked list corruption.

2016-10-23 Thread Chris Mason
On 10/22/2016 11:20 AM, Dave Jones wrote: On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > It could be worth trying this, too: > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8 > > > > It occurred to me that the

Re: bio linked list corruption.

2016-10-22 Thread Dave Jones
On Fri, Oct 21, 2016 at 04:02:45PM -0400, Dave Jones wrote: > > It could be worth trying this, too: > > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=174531fef4e8 > > > > It occurred to me that the current code is a little bit fragile. >

Re: bio linked list corruption.

2016-10-21 Thread Dave Jones
On Fri, Oct 21, 2016 at 04:41:09PM -0400, Josef Bacik wrote: > >> > > >> > btrfs inspect inode 130654 mntpoint > >> > >> Interesting, they all return > >> > >> ERROR: ino paths ioctl: No such file or directory > >> > >> So these files got deleted perhaps ? > >> > > Yeah, they must

Re: bio linked list corruption.

2016-10-21 Thread Chris Mason
On 10/21/2016 04:23 PM, Dave Jones wrote: On Fri, Oct 21, 2016 at 04:17:48PM -0400, Chris Mason wrote: > > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513 > > BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum 3563910319

Re: bio linked list corruption.

2016-10-21 Thread Josef Bacik
On 10/21/2016 04:38 PM, Chris Mason wrote: On 10/21/2016 04:23 PM, Dave Jones wrote: On Fri, Oct 21, 2016 at 04:17:48PM -0400, Chris Mason wrote: > > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 expected csum 3008371513 > > BTRFS warning (device sda3): csum

Re: bio linked list corruption.

2016-10-21 Thread Dave Jones
On Fri, Oct 21, 2016 at 04:17:48PM -0400, Chris Mason wrote: > > BTRFS warning (device sda3): csum failed ino 130654 off 0 csum 2566472073 > > expected csum 3008371513 > > BTRFS warning (device sda3): csum failed ino 131057 off 4096 csum > > 3563910319 expected csum 738595262 > > BTRFS

Re: bio linked list corruption.

2016-10-21 Thread Chris Mason
On 10/21/2016 04:02 PM, Dave Jones wrote: On Thu, Oct 20, 2016 at 04:23:32PM -0700, Andy Lutomirski wrote: > On Thu, Oct 20, 2016 at 4:03 PM, Dave Jones wrote: > > On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote: > > > On Thu, Oct 20, 2016 at 3:50

Re: bio linked list corruption.

2016-10-21 Thread Dave Jones
On Thu, Oct 20, 2016 at 04:23:32PM -0700, Andy Lutomirski wrote: > On Thu, Oct 20, 2016 at 4:03 PM, Dave Jones wrote: > > On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote: > > > On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones > >

Re: bio linked list corruption.

2016-10-20 Thread Andy Lutomirski
On Thu, Oct 20, 2016 at 4:03 PM, Dave Jones wrote: > On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote: > > On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones > wrote: > > > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski

Re: bio linked list corruption.

2016-10-20 Thread Dave Jones
On Thu, Oct 20, 2016 at 04:01:12PM -0700, Andy Lutomirski wrote: > On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones wrote: > > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote: > > > > > One possible debugging approach would be to change: > > > > > >

Re: bio linked list corruption.

2016-10-20 Thread Andy Lutomirski
On Thu, Oct 20, 2016 at 3:50 PM, Dave Jones wrote: > On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote: > > > One possible debugging approach would be to change: > > > > #define NR_CACHED_STACKS 2 > > > > to > > > > #define NR_CACHED_STACKS 0 > > >

Re: bio linked list corruption.

2016-10-20 Thread Dave Jones
On Tue, Oct 18, 2016 at 06:05:57PM -0700, Andy Lutomirski wrote: > One possible debugging approach would be to change: > > #define NR_CACHED_STACKS 2 > > to > > #define NR_CACHED_STACKS 0 > > in kernel/fork.c and to set CONFIG_DEBUG_PAGEALLOC=y. The latter will > force an

Re: bio linked list corruption.

2016-10-20 Thread Dave Jones
On Tue, Oct 18, 2016 at 05:28:44PM -0700, Linus Torvalds wrote: > On Tue, Oct 18, 2016 at 5:10 PM, Linus Torvalds > wrote: > > > > Adding Andy to the cc, because this *might* be triggered by the > > vmalloc stack code itself. Maybe the re-use of stacks showing

Re: bio linked list corruption.

2016-10-20 Thread Thomas Gleixner
On Thu, 20 Oct 2016, Ingo Molnar wrote: > * Linus Torvalds wrote: > > So I don't think it's related. Yours looks like some subtle timer base > > race. It smells like a locking problem with timers. I'm not seeing > > what it might be, but it *might* have been fixed

Re: bio linked list corruption.

2016-10-20 Thread Ingo Molnar
* Linus Torvalds wrote: > On Wed, Oct 19, 2016 at 10:09 AM, Philipp Hahn wrote: > > > > Nearly a month ago I reported also a "list_add corruption", but with 4.1.6: > > > > > > That server

Re: bio linked list corruption.

2016-10-19 Thread Linus Torvalds
On Wed, Oct 19, 2016 at 10:09 AM, Philipp Hahn wrote: > > Nearly a month ago I reported also a "list_add corruption", but with 4.1.6: > > > That server runs Samba4, which also is a heavy user of xattr. That one looks very

Re: bio linked list corruption.

2016-10-19 Thread Philipp Hahn
Hello, Am 19.10.2016 um 01:42 schrieb Chris Mason: > On Tue, Oct 18, 2016 at 04:39:22PM -0700, Linus Torvalds wrote: >> On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason wrote: >>> >>> Jens, not sure if you saw the whole thread. This has triggered bad page >>> state errors, and also

Re: bio linked list corruption.

2016-10-18 Thread Andy Lutomirski
On 10/18/2016 05:10 PM, Linus Torvalds wrote: On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason wrote: Seems to be the whole thing: Ahh. On lkml, so I do have it in my mailbox, but Dave changed the subject line when he tested on ext4 rather than btrfs.. Anyway, the corrupted

Re: bio linked list corruption.

2016-10-18 Thread Linus Torvalds
On Tue, Oct 18, 2016 at 5:10 PM, Linus Torvalds wrote: > > Adding Andy to the cc, because this *might* be triggered by the > vmalloc stack code itself. Maybe the re-use of stacks showing some > problem? Maybe Chris (who can't see the problem) doesn't have >

Re: bio linked list corruption.

2016-10-18 Thread Chris Mason
On Tue, Oct 18, 2016 at 05:10:56PM -0700, Linus Torvalds wrote: On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason wrote: Seems to be the whole thing: Ahh. On lkml, so I do have it in my mailbox, but Dave changed the subject line when he tested on ext4 rather than btrfs.. Anyway,

Re: bio linked list corruption.

2016-10-18 Thread Linus Torvalds
of stacks showing some problem? Maybe Chris (who can't see the problem) doesn't have CONFIG_VMAP_STACK enabled? Andy - this is on lkml, under Dave Chinner: [regression, 4.9-rc1] blk-mq: list corruption in request queue Dave Jones: btrfs bio linked list corruption. Re: bio linked list corru

Re: bio linked list corruption.

2016-10-18 Thread Chris Mason
On Tue, Oct 18, 2016 at 04:39:22PM -0700, Linus Torvalds wrote: On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason wrote: Jens, not sure if you saw the whole thread. This has triggered bad page state errors, and also corrupted a btrfs list. It hurts me to say, but it might not

Re: bio linked list corruption.

2016-10-18 Thread Linus Torvalds
On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason wrote: > > Jens, not sure if you saw the whole thread. This has triggered bad page > state errors, and also corrupted a btrfs list. It hurts me to say, but it > might not actually be your fault. Where is that thread, and what is the

Re: bio linked list corruption.

2016-10-18 Thread Jens Axboe
On 10/18/2016 05:31 PM, Chris Mason wrote: On Tue, Oct 18, 2016 at 05:12:41PM -0600, Jens Axboe wrote: On 10/18/2016 04:42 PM, Dave Jones wrote: So Chris had me do a run on ext4 just for giggles. It took a while, but eventually this fell out... WARNING: CPU: 3 PID: 21324 at

Re: bio linked list corruption.

2016-10-18 Thread Chris Mason
On Tue, Oct 18, 2016 at 05:12:41PM -0600, Jens Axboe wrote: On 10/18/2016 04:42 PM, Dave Jones wrote: So Chris had me do a run on ext4 just for giggles. It took a while, but eventually this fell out... WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0 list_add corruption.

Re: bio linked list corruption.

2016-10-18 Thread Jens Axboe
On 10/18/2016 04:42 PM, Dave Jones wrote: On Tue, Oct 11, 2016 at 10:45:07AM -0400, Dave Jones wrote: > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0 > list_add corruption. prev->next should be next (e8806648), but was c967fcd8.

Re: bio linked list corruption.

2016-10-18 Thread Dave Jones
On Tue, Oct 11, 2016 at 10:45:07AM -0400, Dave Jones wrote: > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0 > list_add corruption. prev->next should be next (e8806648), but was > c967fcd8. (prev=880503878b80). > CPU: 1 PID: 3673 Comm: trinity-c0
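For readers unfamiliar with the warning quoted above, here is a small userspace sketch of the kind of consistency check that produces it. It is an illustration written for this thread, not the kernel's lib/list_debug.c: the names `checked_list_add()` and `checked_list_add_tail()` are hypothetical, and the messages only imitate the wording seen in the report. The idea is that before linking a new entry between `prev` and `next`, the two neighbours must still point at each other; if they do not, something else has scribbled on the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace model of a circular doubly linked list node. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/*
 * Insert 'entry' between 'prev' and 'next', but first verify that the
 * neighbours are still linked to each other. On a mismatch, report it
 * and leave the list untouched.
 */
static bool checked_list_add(struct list_node *entry,
                             struct list_node *prev,
                             struct list_node *next)
{
    if (next->prev != prev) {
        fprintf(stderr,
                "list_add corruption. next->prev should be prev (%p), but was %p. (next=%p).\n",
                (void *)prev, (void *)next->prev, (void *)next);
        return false;
    }
    if (prev->next != next) {
        fprintf(stderr,
                "list_add corruption. prev->next should be next (%p), but was %p. (prev=%p).\n",
                (void *)next, (void *)prev->next, (void *)prev);
        return false;
    }

    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
    return true;
}

/* Append 'entry' at the tail of the list anchored at 'head'. */
static bool checked_list_add_tail(struct list_node *entry,
                                  struct list_node *head)
{
    return checked_list_add(entry, head->prev, head);
}
```

A check like this only detects the damage at the next list operation, which is why the reported backtraces point at whoever touched the list next rather than at whatever corrupted it.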

Re: btrfs bio linked list corruption.

2016-10-17 Thread Chris Mason
On Sat, Oct 15, 2016 at 08:42:40PM -0400, Dave Jones wrote: On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote: > > > > .. and of course the first thing that happens is a completely different > > > > btrfs trace.. > > > > > > > > > > > > WARNING: CPU: 1 PID: 21706 at

Re: btrfs bio linked list corruption.

2016-10-15 Thread Dave Jones
On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote: > > > > .. and of course the first thing that happens is a completely > > different > > > > btrfs trace.. > > > > > > > > > > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 > > start_transaction+0x40a/0x440

Re: btrfs bio linked list corruption.

2016-10-13 Thread Dave Jones
On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote: > > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 > > start_transaction+0x40a/0x440 [btrfs] > > > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14 > > > > c900019076a8 b731ff3c

Re: btrfs bio linked list corruption.

2016-10-13 Thread Chris Mason
On 10/13/2016 02:16 PM, Dave Jones wrote: On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote: > On 10/12/2016 10:40 AM, Dave Jones wrote: > > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote: > > > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote: > > > > >

Re: btrfs bio linked list corruption.

2016-10-13 Thread Dave Jones
On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote: > On 10/12/2016 10:40 AM, Dave Jones wrote: > > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote: > > > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote: > > > > > > > > > > > > On 10/11/2016 10:45 AM,

Re: btrfs bio linked list corruption.

2016-10-12 Thread Dave Jones
On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote: > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote: > > > > > > On 10/11/2016 10:45 AM, Dave Jones wrote: > > > This is from Linus' current tree, with Al's iovec fixups on top. > > > > > > [ cut here

Re: btrfs bio linked list corruption.

2016-10-12 Thread Chris Mason
On 10/12/2016 10:40 AM, Dave Jones wrote: On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote: > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote: > > > > > > On 10/11/2016 10:45 AM, Dave Jones wrote: > > > This is from Linus' current tree, with Al's iovec fixups on

Re: btrfs bio linked list corruption.

2016-10-12 Thread Dave Jones
On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote: > > > On 10/11/2016 10:45 AM, Dave Jones wrote: > > This is from Linus' current tree, with Al's iovec fixups on top. > > > > [ cut here ] > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33
