Re: btrfs bio linked list corruption.

2016-10-17 Thread Chris Mason

On Sat, Oct 15, 2016 at 08:42:40PM -0400, Dave Jones wrote:
> On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:
>
> > >  > > .. and of course the first thing that happens is a completely different
> > >  > > btrfs trace..
> > >  > >
> > >  > >
> > >  > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > >  > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > >  > >  c900019076a8 b731ff3c
> > >  > >  c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
> > >  > >  0801 880501cfa2a8 008a 008a
> > >  >
> > >  > This isn't even IO.  Uuug.  We're going to need a fast enough test
> > >  > that we can bisect.
> > >
> > > Progress...
> > > I've found that this combination of syscalls..
> > >
> > > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
> > >
> > > hits one of these two bugs in a few minutes runtime.
> > >
> > > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> > > Mix them together though, and something goes awry.
> > >
> > Hasn't triggered here yet.  I'll leave it running though.
>
> The hits keep coming..
>
> BUG: Bad page state in process kworker/u8:12  pfn:4988fa
> page:ea0012623e80 count:0 mapcount:0 mapping:8804450456e0 index:0x9

Hmpf, I've had this running since Friday without failing.  Can you send
me your .config please?


-chris


Re: btrfs bio linked list corruption.

2016-10-15 Thread Dave Jones
On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:

 > >  > > .. and of course the first thing that happens is a completely different
 > >  > > btrfs trace..
 > >  > >
 > >  > >
 > >  > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
 > >  > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
 > >  > >  c900019076a8 b731ff3c  
 > >  > >  c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
 > >  > >  0801 880501cfa2a8 008a 008a
 > >  >
 > >  > This isn't even IO.  Uuug.  We're going to need a fast enough test
 > >  > that we can bisect.
 > >
 > > Progress...
 > > I've found that this combination of syscalls..
 > >
 > > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
 > >
 > > hits one of these two bugs in a few minutes runtime.
 > >
 > > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
 > > Mix them together though, and something goes awry.
 > >
 > Hasn't triggered here yet.  I'll leave it running though.

The hits keep coming..

BUG: Bad page state in process kworker/u8:12  pfn:4988fa
page:ea0012623e80 count:0 mapcount:0 mapping:8804450456e0 index:0x9

flags: 0x400c(referenced|uptodate)
page dumped because: non-NULL mapping
CPU: 2 PID: 1388 Comm: kworker/u8:12 Not tainted 4.8.0-think+ #18
Workqueue: writeback wb_workfn (flush-btrfs-1)
 c9aef7e8 81320e7c ea0012623e80 819fe6ec
 c9aef810 81159b3f  ea0012623e80
 400c c9aef820 81159bfa c9aef868

Call Trace:
 [] dump_stack+0x4f/0x73
 [] bad_page+0xbf/0x120
 [] free_pages_check_bad+0x5a/0x70
 [] free_hot_cold_page+0x20b/0x270
 [] free_hot_cold_page_list+0x2b/0x50
 [] release_pages+0x2d2/0x380
 [] __pagevec_release+0x22/0x30
 [] extent_write_cache_pages.isra.48.constprop.63+0x350/0x430 [btrfs]
 [] ? debug_smp_processor_id+0x17/0x20
 [] ? get_lock_stats+0x19/0x50
 [] extent_writepages+0x58/0x80 [btrfs]
 [] ? btrfs_releasepage+0x40/0x40 [btrfs]
 [] btrfs_writepages+0x23/0x30 [btrfs]
 [] do_writepages+0x1c/0x30
 [] __writeback_single_inode+0x33/0x180
 [] writeback_sb_inodes+0x2cb/0x5d0
 [] __writeback_inodes_wb+0x8d/0xc0
 [] wb_writeback+0x203/0x210
 [] wb_workfn+0xe7/0x2a0
 [] ? __lock_acquire.isra.32+0x1cf/0x8c0
 [] process_one_work+0x1da/0x4b0
 [] ? process_one_work+0x17a/0x4b0
 [] worker_thread+0x49/0x490
 [] ? process_one_work+0x4b0/0x4b0
 [] ? process_one_work+0x4b0/0x4b0
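
For readers decoding the report above: "non-NULL mapping" means the page reached the allocator's free path while its mapping pointer was still set, i.e. something still considered the page part of a file's page cache. What follows is a minimal userspace sketch of that kind of sanity check, not the kernel's code. `struct fake_page`, `bad_free_reason()` and the field layout are hypothetical names chosen for illustration, and the order of the checks is an assumption.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the fields printed in the report. */
struct fake_page {
	int count;            /* reference count, expected to be 0 at free time */
	int mapcount;         /* page-table mappings, as printed in the report */
	const void *mapping;  /* owning address space, expected NULL at free time */
};

/*
 * Return a short reason string if the page is not in a state that is safe
 * to free, or NULL if it looks clean. The check order is an assumption.
 */
static const char *bad_free_reason(const struct fake_page *page)
{
	if (page->mapcount != 0)
		return "nonzero mapcount";
	if (page->mapping != NULL)
		return "non-NULL mapping";
	if (page->count != 0)
		return "nonzero refcount";
	return NULL;
}
```

Fed the values from the report (count 0, mapcount 0, a mapping pointer that is still set), the sketch returns the same reason string the kernel printed.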



Re: btrfs bio linked list corruption.

2016-10-13 Thread Dave Jones
On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:

 > >  > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
 > >  > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
 > >  > >  c900019076a8 b731ff3c  
 > >  > >  c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
 > >  > >  0801 880501cfa2a8 008a 008a
 > >  >
 > >  > This isn't even IO.  Uuug.  We're going to need a fast enough test
 > >  > that we can bisect.
 > >
 > > Progress...
 > > I've found that this combination of syscalls..
 > >
 > > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
 > >
 > > hits one of these two bugs in a few minutes runtime.
 > >
 > > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
 > > Mix them together though, and something goes awry.
 > >
 > 
 > Hasn't triggered here yet.  I'll leave it running though.

With that combo of params I triggered it 3-4 times in a row within minutes..
Then as soon as I posted, it stopped being so easy to repro.

There's some other variable I haven't figured out yet (maybe the random way
that files get opened in fds/testfiles.c), but it does seem to point at the
xattr changes.

I'll poke at it some more tomorrow.

Dave



Re: btrfs bio linked list corruption.

2016-10-13 Thread Chris Mason

On 10/13/2016 02:16 PM, Dave Jones wrote:

> On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote:
> > On 10/12/2016 10:40 AM, Dave Jones wrote:
> > > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> > >  > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> > >  >  >
> > >  >  >
> > >  >  > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > >  >  > > This is from Linus' current tree, with Al's iovec fixups on top.
> > >  >  > >
> > >  >  > > [ cut here ]
> > >  >  > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > >  >  > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
> > >  >  > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > >  >  > >  c9d87458 8d32007c c9d874a8
> > >  >  > >  c9d87498 8d07a6c1 00210246 88050388e880
> > >  >
> > >  > I hit this again overnight, it's the same trace, the only difference
> > >  > being slightly different addresses in the list pointers:
> > >  >
> > >  > [42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00).
> > >  >
> > >  > I'm actually a little surprised that ->next was the same across two
> > >  > reboots on two different kernel builds.  That might be a sign this is
> > >  > more repeatable than I'd thought, even if it does take hours of runtime
> > >  > right now to trigger it.  I'll try and narrow the scope of what trinity
> > >  > is doing to see if I can make it happen faster.
> > >
> > > .. and of course the first thing that happens is a completely different
> > > btrfs trace..
> > >
> > >
> > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > >  c900019076a8 b731ff3c
> > >  c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
> > >  0801 880501cfa2a8 008a 008a
> >
> > This isn't even IO.  Uuug.  We're going to need a fast enough test
> > that we can bisect.
>
> Progress...
> I've found that this combination of syscalls..
>
> ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
>
> hits one of these two bugs in a few minutes runtime.
>
> Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> Mix them together though, and something goes awry.



Hasn't triggered here yet.  I'll leave it running though.

-chris


Re: btrfs bio linked list corruption.

2016-10-13 Thread Dave Jones
On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote:
 > On 10/12/2016 10:40 AM, Dave Jones wrote:
 > > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
 > >  > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
 > >  >  >
 > >  >  >
 > >  >  > On 10/11/2016 10:45 AM, Dave Jones wrote:
 > >  >  > > This is from Linus' current tree, with Al's iovec fixups on top.
 > >  >  > >
 > >  >  > > [ cut here ]
 > >  >  > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
 > >  >  > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
 > >  >  > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
 > >  >  > >  c9d87458 8d32007c c9d874a8
 > >  >  > >  c9d87498 8d07a6c1 00210246 88050388e880
 > >  >
 > >  > I hit this again overnight, it's the same trace, the only difference
 > >  > being slightly different addresses in the list pointers:
 > >  >
 > >  > [42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00).
 > >  >
 > >  > I'm actually a little surprised that ->next was the same across two
 > >  > reboots on two different kernel builds.  That might be a sign this is
 > >  > more repeatable than I'd thought, even if it does take hours of runtime
 > >  > right now to trigger it.  I'll try and narrow the scope of what trinity
 > >  > is doing to see if I can make it happen faster.
 > >
 > > .. and of course the first thing that happens is a completely different
 > > btrfs trace..
 > >
 > >
 > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
 > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
 > >  c900019076a8 b731ff3c  
 > >  c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
 > >  0801 880501cfa2a8 008a 008a
 > 
 > This isn't even IO.  Uuug.  We're going to need a fast enough test 
 > that we can bisect.

Progress...
I've found that this combination of syscalls..

./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2

hits one of these two bugs in a few minutes runtime.

Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
Mix them together though, and something goes awry.

Dave



Re: btrfs bio linked list corruption.

2016-10-12 Thread Dave Jones
On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
 > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
 >  > 
 >  > 
 >  > On 10/11/2016 10:45 AM, Dave Jones wrote:
 >  > > This is from Linus' current tree, with Al's iovec fixups on top.
 >  > > 
 >  > > [ cut here ]
 >  > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
 >  > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
 >  > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 
 >  > >  c9d87458 8d32007c c9d874a8 
 >  > >  c9d87498 8d07a6c1 00210246 88050388e880
 > 
 > I hit this again overnight, it's the same trace, the only difference
 > being slightly different addresses in the list pointers:
 > 
 > [42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00).
 > 
 > I'm actually a little surprised that ->next was the same across two
 > reboots on two different kernel builds.  That might be a sign this is
 > more repeatable than I'd thought, even if it does take hours of runtime
 > right now to trigger it.  I'll try and narrow the scope of what trinity
 > is doing to see if I can make it happen faster.

.. and of course the first thing that happens is a completely different
btrfs trace..


WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14 
 c900019076a8 b731ff3c  
 c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
 0801 880501cfa2a8 008a 008a

Call Trace:
 [] dump_stack+0x4f/0x73
 [] __warn+0xc1/0xe0
 [] warn_slowpath_null+0x18/0x20
 [] start_transaction+0x40a/0x440 [btrfs]
 [] ? btrfs_alloc_path+0x15/0x20 [btrfs]
 [] btrfs_join_transaction+0x12/0x20 [btrfs]
 [] cow_file_range_inline+0xef/0x830 [btrfs]
 [] cow_file_range.isra.64+0x365/0x480 [btrfs]
 [] ? _raw_spin_unlock+0x2c/0x50
 [] ? release_extent_buffer+0x9f/0x110 [btrfs]
 [] run_delalloc_nocow+0x409/0xbd0 [btrfs]
 [] ? get_lock_stats+0x19/0x50
 [] run_delalloc_range+0x38a/0x3e0 [btrfs]
 [] writepage_delalloc.isra.47+0x10a/0x190 [btrfs]
 [] __extent_writepage+0xd8/0x2c0 [btrfs]
 [] extent_write_cache_pages.isra.44.constprop.63+0x2ce/0x430 [btrfs]
 [] ? debug_smp_processor_id+0x17/0x20
 [] ? get_lock_stats+0x19/0x50
 [] extent_writepages+0x58/0x80 [btrfs]
 [] ? btrfs_releasepage+0x40/0x40 [btrfs]
 [] btrfs_writepages+0x23/0x30 [btrfs]
 [] do_writepages+0x1c/0x30
 [] __filemap_fdatawrite_range+0xc1/0x100
 [] filemap_fdatawrite_range+0xe/0x10
 [] btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
 [] btrfs_wait_ordered_range+0x40/0x100 [btrfs]
 [] btrfs_sync_file+0x285/0x390 [btrfs]
 [] vfs_fsync_range+0x46/0xa0
 [] do_fsync+0x38/0x60
 [] SyS_fsync+0xb/0x10
 [] do_syscall_64+0x5c/0x170
 [] entry_SYSCALL64_slow_path+0x25/0x25


Re: btrfs bio linked list corruption.

2016-10-12 Thread Chris Mason

On 10/12/2016 10:40 AM, Dave Jones wrote:

> On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> >  >
> >  >
> >  > On 10/11/2016 10:45 AM, Dave Jones wrote:
> >  > > This is from Linus' current tree, with Al's iovec fixups on top.
> >  > >
> >  > > [ cut here ]
> >  > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> >  > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
> >  > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> >  > >  c9d87458 8d32007c c9d874a8
> >  > >  c9d87498 8d07a6c1 00210246 88050388e880
> >
> > I hit this again overnight, it's the same trace, the only difference
> > being slightly different addresses in the list pointers:
> >
> > [42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00).
> >
> > I'm actually a little surprised that ->next was the same across two
> > reboots on two different kernel builds.  That might be a sign this is
> > more repeatable than I'd thought, even if it does take hours of runtime
> > right now to trigger it.  I'll try and narrow the scope of what trinity
> > is doing to see if I can make it happen faster.
>
> .. and of course the first thing that happens is a completely different
> btrfs trace..
>
>
> WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
>  c900019076a8 b731ff3c
>  c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
>  0801 880501cfa2a8 008a 008a


This isn't even IO.  Uuug.  We're going to need a fast enough test 
that we can bisect.


-chris


Re: btrfs bio linked list corruption.

2016-10-12 Thread Dave Jones
On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
 > 
 > 
 > On 10/11/2016 10:45 AM, Dave Jones wrote:
 > > This is from Linus' current tree, with Al's iovec fixups on top.
 > > 
 > > [ cut here ]
 > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
 > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
 > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 
 > >  c9d87458 8d32007c c9d874a8 
 > >  c9d87498 8d07a6c1 00210246 88050388e880

I hit this again overnight, it's the same trace, the only difference
being slightly different addresses in the list pointers:

[42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00).

I'm actually a little surprised that ->next was the same across two
reboots on two different kernel builds.  That might be a sign this is
more repeatable than I'd thought, even if it does take hours of runtime
right now to trigger it.  I'll try and narrow the scope of what trinity
is doing to see if I can make it happen faster.

Dave



Re: btrfs bio linked list corruption.

2016-10-11 Thread Chris Mason


On 10/11/2016 10:45 AM, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.
> 
> [ cut here ]
> WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
> CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 
>  c9d87458 8d32007c c9d874a8 
>  c9d87498 8d07a6c1 00210246 88050388e880
>  880503878b80 e8806648 e8c06600 880502808008
> Call Trace:
> [] dump_stack+0x4f/0x73
> [] __warn+0xc1/0xe0
> [] warn_slowpath_fmt+0x5a/0x80
> [] __list_add+0x89/0xb0
> [] blk_sq_make_request+0x2f8/0x350

        /*
         * A task plug currently exists. Since this is completely lockless,
         * utilize that to temporarily store requests until the task is
         * either done or scheduled away.
         */
        plug = current->plug;
        if (plug) {
                blk_mq_bio_to_request(rq, bio);
                if (!request_count)
                        trace_block_plug(q);

                blk_mq_put_ctx(data.ctx);

                if (request_count >= BLK_MAX_REQUEST_COUNT) {
                        blk_flush_plug_list(plug, false);
                        trace_block_plug(q);
                }

                list_add_tail(&rq->queuelist, &plug->mq_list);
                ^^

Dave, is this where we're crashing?  This seems strange.

-chris


Re: btrfs bio linked list corruption.

2016-10-11 Thread Dave Jones
On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
 > 
 > 
 > On 10/11/2016 10:45 AM, Dave Jones wrote:
 > > This is from Linus' current tree, with Al's iovec fixups on top.
 > > 
 > > [ cut here ]
 > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
 > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
 > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 
 > >  c9d87458 8d32007c c9d874a8 
 > >  c9d87498 8d07a6c1 00210246 88050388e880
 > >  880503878b80 e8806648 e8c06600 880502808008
 > > Call Trace:
 > > [] dump_stack+0x4f/0x73
 > > [] __warn+0xc1/0xe0
 > > [] warn_slowpath_fmt+0x5a/0x80
 > > [] __list_add+0x89/0xb0
 > > [] blk_sq_make_request+0x2f8/0x350
 > 
 >         /*
 >          * A task plug currently exists. Since this is completely lockless,
 >          * utilize that to temporarily store requests until the task is
 >          * either done or scheduled away.
 >          */
 >         plug = current->plug;
 >         if (plug) {
 >                 blk_mq_bio_to_request(rq, bio);
 >                 if (!request_count)
 >                         trace_block_plug(q);
 >
 >                 blk_mq_put_ctx(data.ctx);
 >
 >                 if (request_count >= BLK_MAX_REQUEST_COUNT) {
 >                         blk_flush_plug_list(plug, false);
 >                         trace_block_plug(q);
 >                 }
 >
 >                 list_add_tail(&rq->queuelist, &plug->mq_list);
 >                 ^^
 > 
 > Dave, is this where we're crashing?  This seems strange.

According to objdump -S ..


8130a1b7:   48 8b 70 50             mov    0x50(%rax),%rsi
        list_add_tail(&rq->queuelist, &ctx->rq_list);
8130a1bb:   48 8d 50 48             lea    0x48(%rax),%rdx
8130a1bf:   48 89 45 a8             mov    %rax,-0x58(%rbp)
8130a1c3:   e8 38 44 03 00          callq  8133e600 <__list_add>
        blk_mq_hctx_mark_pending(hctx, ctx);
8130a1c8:   48 8b 45 a8             mov    -0x58(%rbp),%rax
8130a1cc:   4c 89 ff                mov    %r15,%rdi

That looks like the list_add_tail from __blk_mq_insert_req_list

Dave
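
[Editorial sketch] The disassembly above is consistent with list_add_tail(new, head) expanding to __list_add(new, head->prev, head). Under the x86-64 System V calling convention the second and third arguments travel in %rsi and %rdx, so `mov 0x50(%rax),%rsi` would be loading head->prev and `lea 0x48(%rax),%rdx` would be taking the address of the head itself. The offset arithmetic can be checked with a small sketch. `struct demo_ctx` and its 0x48 bytes of padding are hypothetical, chosen only to match the offsets in the listing, and the sketch assumes 64-bit pointers; it says nothing about the real layout of the structure %rax points to.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Hypothetical container: 0x48 bytes of unrelated fields followed by
 * the list head, so the head sits where the lea instruction points.
 */
struct demo_ctx {
	unsigned char other_fields[0x48];
	struct list_head rq_list;
};

static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");

/* Offset of the list head inside the container (the lea operand). */
static size_t head_offset(void)
{
	return offsetof(struct demo_ctx, rq_list);
}

/* Offset of head->prev inside the container (the mov operand). */
static size_t head_prev_offset(void)
{
	return offsetof(struct demo_ctx, rq_list) +
	       offsetof(struct list_head, prev);
}
```

With these assumptions `head_offset()` is 0x48 and `head_prev_offset()` is 0x50, which lines up with the two operands in the listing: the tail pointer is one pointer-width past the start of the list head.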


Re: btrfs bio linked list corruption.

2016-10-11 Thread Dave Jones
On Tue, Oct 11, 2016 at 11:20:41AM -0400, Chris Mason wrote:
 > 
 > 
 > On 10/11/2016 11:19 AM, Dave Jones wrote:
 > > On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
 > >  > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
 > >  > > This is from Linus' current tree, with Al's iovec fixups on top.
 > >  >
 > >  > Those iovec fixups are in the current tree...
 > >
 > > ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.
 > >
 > >  > TBH, I don't see anything
 > >  > in splice-related stuff that could come anywhere near that (short of
 > >  > some general memory corruption having random effects of that sort).
 > >  >
 > >  > Could you try to bisect that sucker, or is it too hard to reproduce?
 > >
 > > Only hit it the once overnight so far. Will see if I can find a better way to
 > > reproduce today.
 > 
 > This call trace is reading metadata so we can finish the truncate.  I'd 
 > say adding more memory pressure would make it happen more often.

That story checks out. There were a bunch of oom's in the log before this.

Dave
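
[Editorial sketch] One generic way to add memory pressure alongside a reproducer is a helper that allocates memory and writes to every page so the allocation is actually backed. This is only an illustration of the idea, not something used in the thread: `hog_alloc`, `hog_free` and `CHUNK_BYTES` are hypothetical names, and how much to allocate is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_BYTES (1024 * 1024)	/* 1 MiB per chunk */

/*
 * Allocate up to 'count' chunks and write to every byte so the pages
 * are really populated. Pointers are stored in 'chunks'; the return
 * value is the number of chunks obtained before malloc() failed.
 */
static size_t hog_alloc(void **chunks, size_t count)
{
	size_t i;

	assert(chunks != NULL);

	for (i = 0; i < count; i++) {
		chunks[i] = malloc(CHUNK_BYTES);
		if (chunks[i] == NULL)
			break;
		memset(chunks[i], 0xa5, CHUNK_BYTES);
	}
	return i;
}

/* Release the first 'got' chunks returned by hog_alloc(). */
static void hog_free(void **chunks, size_t got)
{
	size_t i;

	for (i = 0; i < got; i++)
		free(chunks[i]);
}
```

A caller would hold the chunks for as long as the reproducer runs and then call `hog_free`. Sizing the allocation close to available RAM is what produces pressure, and it can also invoke the OOM killer, so this is best tried on a scratch machine.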



Re: btrfs bio linked list corruption.

2016-10-11 Thread Chris Mason



On 10/11/2016 11:19 AM, Dave Jones wrote:

> On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
>  > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
>  > > This is from Linus' current tree, with Al's iovec fixups on top.
>  >
>  > Those iovec fixups are in the current tree...
>
> ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.
>
>  > TBH, I don't see anything
>  > in splice-related stuff that could come anywhere near that (short of
>  > some general memory corruption having random effects of that sort).
>  >
>  > Could you try to bisect that sucker, or is it too hard to reproduce?
>
> Only hit it the once overnight so far. Will see if I can find a better way to
> reproduce today.


This call trace is reading metadata so we can finish the truncate.  I'd 
say adding more memory pressure would make it happen more often.


I'll try to trigger.

-chris



Re: btrfs bio linked list corruption.

2016-10-11 Thread Dave Jones
On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
 > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
 > > This is from Linus' current tree, with Al's iovec fixups on top.
 > 
 > Those iovec fixups are in the current tree...

ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.

 > TBH, I don't see anything
 > in splice-related stuff that could come anywhere near that (short of
 > some general memory corruption having random effects of that sort).
 > 
 > Could you try to bisect that sucker, or is it too hard to reproduce?

Only hit it the once overnight so far. Will see if I can find a better way to
reproduce today.

Dave



Re: btrfs bio linked list corruption.

2016-10-11 Thread Al Viro
On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.

Those iovec fixups are in the current tree...  TBH, I don't see anything
in splice-related stuff that could come anywhere near that (short of
some general memory corruption having random effects of that sort).

Could you try to bisect that sucker, or is it too hard to reproduce?

