Re: Btrfs resize seems to deadlock

2018-10-22 Thread Liu Bo
On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  wrote:
>
> On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> >
> > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson  
> > wrote:
> > >
> > > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > resize max $MOUNT", the command runs for a few minutes and then hangs
> > > forcing the system to be reset. I am not sure what the state of the
> > > filesystem really is at this point. Btrfs usage does report the
> > > correct size for after resizing. Details below:
> > >
> >
> > Thanks for the report, the stack is helpful, but this needs some
> > deeper debugging. May I ask you to post "btrfs inspect-internal
> > dump-tree -t 1 /dev/your_btrfs_disk"?
>
> I believe it's actually easy to understand from the trace alone and
> it's kind of a bad luck scenario.
> I made this fix a few hours ago:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
>
> But haven't done full testing yet and might have missed something.
> Bo, can you take a look and let me know what you think?
>

The patch looks OK to me, it should fix the problem.

Since load_free_space_cache() is touching the root tree with
path->commit_root and skip_locking, I was wondering if we could just
do the same for the free space cache inode's iget().
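
A minimal sketch of that commit-root + skip-locking search pattern, for
reference (illustration only, under the assumption that the lookup is of the
free space cache's inode item in the tree root; the helper name below is
hypothetical, not existing btrfs code):

/* Assumes the usual btrfs internal headers (ctree.h etc.). */
static int lookup_free_space_inode_item_nolock(struct btrfs_root *tree_root,
					       u64 ino)
{
	struct btrfs_path *path;
	struct btrfs_key key;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	/* Search the last committed version of the tree... */
	path->search_commit_root = 1;
	/* ...so no extent buffer locks are taken, avoiding the deadlock. */
	path->skip_locking = 1;

	key.objectid = ino;
	key.type = BTRFS_INODE_ITEM_KEY;
	key.offset = 0;

	/* NULL trans and cow == 0: a read-only search, no COW of the path. */
	ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);

	btrfs_free_path(path);
	return ret;
}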

thanks,
liubo

> Thanks.
>
> >
> > So I'd like to know the height of your tree "1", which refers to the
> > root tree in btrfs.
> >
> > thanks,
> > liubo
> >
> > > $ sudo btrfs filesystem usage $MOUNT
> > > Overall:
> > > Device size:  90.96TiB
> > > Device allocated: 72.62TiB
> > > Device unallocated:   18.33TiB
> > > Device missing:  0.00B
> > > Used: 72.62TiB
> > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > Data ratio:   1.00
> > > Metadata ratio:   2.00
> > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > >
> > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > $MOUNT    72.46TiB
> > >
> > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > $MOUNT   172.00GiB
> > >
> > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > > $MOUNT    80.00MiB
> > >
> > > Unallocated:
> > > $MOUNT    18.33TiB
> > >
> > > $ uname -a
> > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > btrfs-transacti D0  2501  2 0x8000
> > > Call Trace:
> > >  ? __schedule+0x253/0x860
> > >  schedule+0x28/0x80
> > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  transaction_kthread+0x155/0x170 [btrfs]
> > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > >  kthread+0x112/0x130
> > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > >  ret_from_fork+0x35/0x40
> > > btrfs   D0  2504   2502 0x0002
> > > Call Trace:
> > >  ? __schedule+0x253/0x860
> > >  schedule+0x28/0x80
> > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > >  ? inode_insert5+0x119/0x190
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_iget+0x113/0x690 [btrfs]
> > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  find_free_extent+0x799/0x1010 [btrfs]
> > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> > >  cache_save_setup+0xe4/0x3a0 [btrfs]
> > >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> > >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> > >  ? btrfs_release_path+0x13/0x80 [btrfs]
> > >  ? btrfs_update_device+0x8d/0x1c0 [btrfs]
> > >  btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs]
> > >  btrfs_ioctl+0xa25/0x2cf0 [btrfs]
> > >  ? tty_write+0x1fc/0x330
> > >  ? do_vfs_ioctl+0xa4/0x620
> > >  do_vfs_ioctl+0xa4/0x620
> > >  ksys_ioctl+0x60/0x90
> > >  ? ksys_write+0x4f/0xb0
> > >  __x64_sys_ioctl+0x16/0x20
> > >  do_syscall_64+0x5b/0x160
> > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > RIP: 0033:0x7fcdc0d78c57
> > > Code: Bad RIP value.
> > > RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX: 

Re: Filesystem corruption?

2018-10-22 Thread Qu Wenruo


On 2018/10/23 4:02 AM, Gervais, Francois wrote:
> Hi,
> 
> I think I lost power on my btrfs disk and it looks like it is now in an 
> unfunctional state.

What does the word "unfunctional" mean?

Unable to mount? Or what else?

> 
> Any idea how I could debug that issue?
> 
> Here is what I have:
> 
> kernel 4.4.0-119-generic

The kernel is somewhat old now.

> btrfs-progs v4.4

The progs is definitely too old.

It's highly recommended to use the latest btrfs-progs for its better
"btrfs check" code.

> 
> 
> 
> sudo btrfs check /dev/sdd
> Checking filesystem on /dev/sdd
> UUID: 9a14b7a1-672c-44da-b49a-1f6566db3e44
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs

So no errors reported from any of these essential trees.
Unless there is some bug in btrfs-progs 4.4, your fs should be mostly OK.

> checking quota groups
> Ignoring qgroup relation key 310
[snip]
> Ignoring qgroup relation key 71776119061217590

Just a lot of qgroup relation key problems.
Not a big problem, especially considering you're using an older kernel
without proper qgroup fixes.

Just in case, please run "btrfs check" with the latest btrfs-progs (v4.17.1)
to see if it reports any extra errors.

Despite that, if the fs can be mounted RW, mounting it and then executing
"btrfs quota disable " should disable quota and solve the problem.

Thanks,
Qu

> found 29301522460 bytes used err is 0
> total csum bytes: 27525424
> total tree bytes: 541573120
> total fs tree bytes: 494223360
> total extent tree bytes: 16908288
> btree space waste bytes: 85047903
> file data blocks allocated: 273892241408
>  referenced 44667650048
> extent buffer leak: start 29360128 len 16384
> extent buffer leak: start 740524032 len 16384
> extent buffer leak: start 446840832 len 16384
> extent buffer leak: start 142819328 len 16384
> extent buffer leak: start 143179776 len 16384
> extent buffer leak: start 184107008 len 16384
> extent buffer leak: start 190513152 len 16384
> extent buffer leak: start 190939136 len 16384
> extent buffer leak: start 239943680 len 16384
> extent buffer leak: start 29392896 len 16384
> extent buffer leak: start 295223296 len 16384
> extent buffer leak: start 30556160 len 16384
> extent buffer leak: start 29376512 len 16384
> extent buffer leak: start 29409280 len 16384
> extent buffer leak: start 29491200 len 16384
> extent buffer leak: start 29556736 len 16384
> extent buffer leak: start 29720576 len 16384
> extent buffer leak: start 29884416 len 16384
> extent buffer leak: start 30097408 len 16384
> extent buffer leak: start 30179328 len 16384
> extent buffer leak: start 30228480 len 16384
> extent buffer leak: start 30277632 len 16384
> extent buffer leak: start 30343168 len 16384
> extent buffer leak: start 30392320 len 16384
> extent buffer leak: start 30457856 len 16384
> extent buffer leak: start 30507008 len 16384
> extent buffer leak: start 30572544 len 16384
> extent buffer leak: start 30621696 len 16384
> extent buffer leak: start 30670848 len 16384
> extent buffer leak: start 3072 len 16384
> extent buffer leak: start 30769152 len 16384
> extent buffer leak: start 30801920 len 16384
> extent buffer leak: start 30867456 len 16384
> extent buffer leak: start 30916608 len 16384
> extent buffer leak: start 102498304 len 16384
> extent buffer leak: start 204488704 len 16384
> extent buffer leak: start 237912064 len 16384
> extent buffer leak: start 328499200 len 16384
> extent buffer leak: start 684539904 len 16384
> extent buffer leak: start 849362944 len 16384
> 





Re: [PATCH v3] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 8:18 PM Josef Bacik  wrote:
>
> On Mon, Oct 22, 2018 at 08:10:37PM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > When we are writing out a free space cache, during the transaction commit
> > phase, we can end up in a deadlock which results in a stack trace like the
> > following:
> >
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >
> > At cache_save_setup() we need to update the inode item of a block group's
> > cache which is located in the tree root (fs_info->tree_root), which means
> > that it may result in COWing a leaf from that tree. If that happens we
> > need to find a free metadata extent and while looking for one, if we find
> > a block group which was not cached yet we attempt to load its cache by
> > calling cache_block_group(). However this function will try to load the
> > inode of the free space cache, which requires finding the matching inode
> > item in the tree root - if that inode item is located in the same leaf as
> > the inode item of the space cache we are updating at cache_save_setup(),
> > we end up in a deadlock, since we try to obtain a read lock on the same
> > extent buffer that we previously write locked.
> >
> > So fix this by skipping the loading of free space caches of any block
> > groups that are not yet cached (rare cases) if we are COWing an extent
> > buffer from the root tree and space caching is enabled (-o space_cache
> > mount option). This is a rare case and its downside is failure to
> > find a free extent (return -ENOSPC) when all the already cached block
> > groups have no free extents.
> >
> > Reported-by: Andrew Nelson 
> > Link: 
> > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> > Tested-by: Andrew Nelson 
> > Signed-off-by: Filipe Manana 
>
> Great, thanks,
>
> Reviewed-by: Josef Bacik 

So this makes many fstests occasionally fail with an aborted transaction
due to ENOSPC.
It's late and I haven't verified yet, but I suppose this is because we
always skip loading the cache regardless of whether we are COWing an
existing leaf or allocating a new one (growing the tree).
Needs to be fixed.

>
> Josef


Re: [PATCH v9 0/6] Btrfs: implement swap file support

2018-10-22 Thread Omar Sandoval
On Fri, Oct 19, 2018 at 05:43:18PM +0200, David Sterba wrote:
> On Thu, Sep 27, 2018 at 11:17:32AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > This series implements swap file support for Btrfs.
> > 
> > Changes from v8 [1]:
> > 
> > - Fixed a bug in btrfs_swap_activate() which would cause us to miss some
> >   file extents if they were merged into one extent map entry.
> > - Fixed build for !CONFIG_SWAP.
> > - Changed all error messages to KERN_WARN.
> > - Unindented long error messages.
> > 
> > I've Cc'd Jon and Al on patch 3 this time, so hopefully we can get an
> > ack for that one, too.
> > 
> > Thanks!
> > 
> > 1: https://www.spinics.net/lists/linux-btrfs/msg82267.html
> > 
> > Omar Sandoval (6):
> >   mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
> >   mm: export add_swap_extent()
> >   vfs: update swap_{,de}activate documentation
> >   Btrfs: prevent ioctls from interfering with a swap file
> >   Btrfs: rename get_chunk_map() and make it non-static
> >   Btrfs: support swap files
> 
> Patches 1 and 2 are now going through Andrew's tree, the btrfs part will be
> delayed and not merged into 4.20. This is a bit unfortunate, I was busy
> with the non-feature patches and other things, sorry.

That's perfectly fine with me, thanks, Dave!


Filesystem corruption?

2018-10-22 Thread Gervais, Francois
Hi,

I think I lost power on my btrfs disk and it looks like it is now in an 
unfunctional state.

Any idea how I could debug that issue?

Here is what I have:

kernel 4.4.0-119-generic
btrfs-progs v4.4



sudo btrfs check /dev/sdd
Checking filesystem on /dev/sdd
UUID: 9a14b7a1-672c-44da-b49a-1f6566db3e44
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
checking quota groups
Ignoring qgroup relation key 310
Ignoring qgroup relation key 311
Ignoring qgroup relation key 313
Ignoring qgroup relation key 321
Ignoring qgroup relation key 326
Ignoring qgroup relation key 346
Ignoring qgroup relation key 354
Ignoring qgroup relation key 355
Ignoring qgroup relation key 356
Ignoring qgroup relation key 367
Ignoring qgroup relation key 370
Ignoring qgroup relation key 371
Ignoring qgroup relation key 373
Ignoring qgroup relation key 71213169107796323
Ignoring qgroup relation key 71213169107796323
Ignoring qgroup relation key 71494644084506935
Ignoring qgroup relation key 71494644084506935
Ignoring qgroup relation key 71494644084506937
Ignoring qgroup relation key 71494644084506937
Ignoring qgroup relation key 71494644084506945
Ignoring qgroup relation key 71494644084506945
Ignoring qgroup relation key 71494644084506950
Ignoring qgroup relation key 71494644084506950
Ignoring qgroup relation key 71494644084506970
Ignoring qgroup relation key 71494644084506970
Ignoring qgroup relation key 71494644084506978
Ignoring qgroup relation key 71494644084506978
Ignoring qgroup relation key 71494644084506978
Ignoring qgroup relation key 71494644084506980
Ignoring qgroup relation key 71494644084506980
Ignoring qgroup relation key 71494644084506991
Ignoring qgroup relation key 71494644084506991
Ignoring qgroup relation key 71494644084506994
Ignoring qgroup relation key 71494644084506994
Ignoring qgroup relation key 71494644084506995
Ignoring qgroup relation key 71494644084506995
Ignoring qgroup relation key 71494644084506997
Ignoring qgroup relation key 71494644084506997
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
Ignoring qgroup relation key 71776119061217590
found 29301522460 bytes used err is 0
total csum bytes: 27525424
total tree bytes: 541573120
total fs tree bytes: 494223360
total extent tree bytes: 16908288
btree space waste bytes: 85047903
file data blocks allocated: 273892241408
 referenced 44667650048
extent buffer leak: start 29360128 len 16384
extent buffer leak: start 740524032 len 16384
extent buffer leak: start 446840832 len 16384
extent buffer leak: start 142819328 len 16384
extent buffer leak: start 143179776 len 16384
extent buffer leak: start 184107008 len 16384
extent buffer leak: start 190513152 len 16384
extent buffer leak: start 190939136 len 16384
extent buffer leak: start 239943680 len 16384
extent buffer leak: start 29392896 len 16384
extent buffer leak: start 295223296 len 16384
extent buffer leak: start 30556160 len 16384
extent buffer leak: start 29376512 len 16384
extent buffer leak: start 29409280 len 16384
extent buffer leak: start 29491200 len 16384
extent buffer leak: start 29556736 len 16384
extent buffer leak: start 29720576 len 16384
extent buffer leak: start 29884416 len 16384
extent buffer leak: start 30097408 len 16384
extent buffer leak: start 30179328 len 16384
extent buffer leak: start 30228480 len 16384
extent buffer leak: start 30277632 len 16384
extent buffer leak: start 30343168 len 16384
extent buffer leak: start 30392320 len 16384
extent buffer leak: start 30457856 len 16384
extent buffer leak: start 30507008 len 16384
extent buffer leak: start 30572544 len 16384
extent buffer leak: start 30621696 len 16384
extent buffer leak: start 30670848 len 16384
extent buffer leak: start 3072 len 16384
extent buffer leak: start 30769152 len 16384
extent buffer leak: start 30801920 len 16384
extent buffer leak: start 30867456 len 16384
extent buffer leak: start 30916608 len 16384
extent buffer leak: start 102498304 len 16384
extent buffer leak: start 204488704 len 16384
extent buffer leak: start 237912064 len 16384
extent buffer leak: start 328499200 len 16384
extent buffer leak: start 684539904 len 16384
extent buffer leak: start 849362944 len 16384


Re: [PATCH v3] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Josef Bacik
On Mon, Oct 22, 2018 at 08:10:37PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
> 
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> 
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
> 
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are COWing an extent
> buffer from the root tree and space caching is enabled (-o space_cache
> mount option). This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
> 
> Reported-by: Andrew Nelson 
> Link: 
> https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Tested-by: Andrew Nelson 
> Signed-off-by: Filipe Manana 

Great, thanks,

Reviewed-by: Josef Bacik 

Josef


[PATCH v3] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread fdmanana
From: Filipe Manana 

When we are writing out a free space cache, during the transaction commit
phase, we can end up in a deadlock which results in a stack trace like the
following:

 schedule+0x28/0x80
 btrfs_tree_read_lock+0x8e/0x120 [btrfs]
 ? finish_wait+0x80/0x80
 btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
 btrfs_search_slot+0xf6/0x9f0 [btrfs]
 ? evict_refill_and_join+0xd0/0xd0 [btrfs]
 ? inode_insert5+0x119/0x190
 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
 ? kmem_cache_alloc+0x166/0x1d0
 btrfs_iget+0x113/0x690 [btrfs]
 __lookup_free_space_inode+0xd8/0x150 [btrfs]
 lookup_free_space_inode+0x5b/0xb0 [btrfs]
 load_free_space_cache+0x7c/0x170 [btrfs]
 ? cache_block_group+0x72/0x3b0 [btrfs]
 cache_block_group+0x1b3/0x3b0 [btrfs]
 ? finish_wait+0x80/0x80
 find_free_extent+0x799/0x1010 [btrfs]
 btrfs_reserve_extent+0x9b/0x180 [btrfs]
 btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
 __btrfs_cow_block+0x11d/0x500 [btrfs]
 btrfs_cow_block+0xdc/0x180 [btrfs]
 btrfs_search_slot+0x3bd/0x9f0 [btrfs]
 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
 ? kmem_cache_alloc+0x166/0x1d0
 btrfs_update_inode_item+0x46/0x100 [btrfs]
 cache_save_setup+0xe4/0x3a0 [btrfs]
 btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
 btrfs_commit_transaction+0xcb/0x8b0 [btrfs]

At cache_save_setup() we need to update the inode item of a block group's
cache which is located in the tree root (fs_info->tree_root), which means
that it may result in COWing a leaf from that tree. If that happens we
need to find a free metadata extent and while looking for one, if we find
a block group which was not cached yet we attempt to load its cache by
calling cache_block_group(). However this function will try to load the
inode of the free space cache, which requires finding the matching inode
item in the tree root - if that inode item is located in the same leaf as
the inode item of the space cache we are updating at cache_save_setup(),
we end up in a deadlock, since we try to obtain a read lock on the same
extent buffer that we previously write locked.

So fix this by skipping the loading of free space caches of any block
groups that are not yet cached (rare cases) if we are COWing an extent
buffer from the root tree and space caching is enabled (-o space_cache
mount option). This is a rare case and its downside is failure to
find a free extent (return -ENOSPC) when all the already cached block
groups have no free extents.

Reported-by: Andrew Nelson 
Link: 
https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
Tested-by: Andrew Nelson 
Signed-off-by: Filipe Manana 
---

V2: Made the solution more generic, since the problem could happen in any
path COWing an extent buffer from the root tree.

Applies on top of a previous patch titled:

 "Btrfs: fix deadlock when writing out free space caches"

V3: Made it more simple by avoiding the atomic from V2 and pass the root
to find_free_extent().

 fs/btrfs/extent-tree.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 577878324799..e5fd086799ab 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7218,12 +7218,13 @@ btrfs_release_block_group(struct btrfs_block_group_cache *cache,
  * the free space extent currently.
  */
 static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
+   struct btrfs_root *root,
u64 ram_bytes, u64 num_bytes, u64 empty_size,
u64 hint_byte, struct btrfs_key *ins,
u64 flags, int delalloc)
 {
int ret = 0;
-   struct btrfs_root *root = fs_info->extent_root;
+   struct btrfs_root *extent_root = fs_info->extent_root;
struct btrfs_free_cluster *last_ptr = NULL;
struct btrfs_block_group_cache *block_group = NULL;
u64 search_start = 0;
@@ -7366,7 +7367,20 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 
 have_block_group:
cached = block_group_cache_done(block_group);
-   if (unlikely(!cached)) {
+   /*
+* If we are COWing a leaf/node from the root tree, we can not
+* start caching of a block group because we could deadlock on
+* an extent buffer of the root tree.
+* Because if we are COWing a leaf from the root tree, we are
+* holding a write lock on the respective extent buffer, and
+* loading the space cache of a block group requires searching
+* for its inode item in the root tree, which can be located
+* in the same leaf that we previously write locked, in which
+* case we will deadlock.
+*/
+   if (unlikely(!cached) &&
+   (root != 

Re: [PATCH v2] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 7:56 PM Josef Bacik  wrote:
>
> On Mon, Oct 22, 2018 at 07:48:30PM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > When we are writing out a free space cache, during the transaction commit
> > phase, we can end up in a deadlock which results in a stack trace like the
> > following:
> >
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >
> > At cache_save_setup() we need to update the inode item of a block group's
> > cache which is located in the tree root (fs_info->tree_root), which means
> > that it may result in COWing a leaf from that tree. If that happens we
> > need to find a free metadata extent and while looking for one, if we find
> > a block group which was not cached yet we attempt to load its cache by
> > calling cache_block_group(). However this function will try to load the
> > inode of the free space cache, which requires finding the matching inode
> > item in the tree root - if that inode item is located in the same leaf as
> > the inode item of the space cache we are updating at cache_save_setup(),
> > we end up in a deadlock, since we try to obtain a read lock on the same
> > extent buffer that we previously write locked.
> >
> > So fix this by skipping the loading of free space caches of any block
> > groups that are not yet cached (rare cases) if we are COWing an extent
> > buffer from the root tree and space caching is enabled (-o space_cache
> > mount option). This is a rare case and its downside is failure to
> > find a free extent (return -ENOSPC) when all the already cached block
> > groups have no free extents.
> >
> > Reported-by: Andrew Nelson 
> > Link: 
> > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> > Tested-by: Andrew Nelson 
> > Signed-off-by: Filipe Manana 
> > ---
> >
> > V2: Made the solution more generic, since the problem could happen in any
> > path COWing an extent buffer from the root tree.
> >
> > Applies on top of a previous patch titled:
> >
> >  "Btrfs: fix deadlock when writing out free space caches"
> >
> >  fs/btrfs/ctree.c   |  4 
> >  fs/btrfs/ctree.h   |  3 +++
> >  fs/btrfs/disk-io.c |  2 ++
> >  fs/btrfs/extent-tree.c | 15 ++-
> >  4 files changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 089b46c4d97f..646aafda55a3 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -1065,10 +1065,14 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
> >   root == fs_info->chunk_root ||
> >   root == fs_info->dev_root)
> >   trans->can_flush_pending_bgs = false;
> > + else if (root == fs_info->tree_root)
> > + atomic_inc(&fs_info->tree_root_cows);
> >
> >   cow = btrfs_alloc_tree_block(trans, root, parent_start,
> >   root->root_key.objectid, &disk_key, level,
> >   search_start, empty_size);
> > + if (root == fs_info->tree_root)
> > + atomic_dec(&fs_info->tree_root_cows);
>
> Do we need this though?  Our root should be the root we're cow'ing the block
> for, and it should be passed all the way down to find_free_extent properly, so
> we really should be able to just do if (root == fs_info->tree_root) and not 
> add
> all this stuff.

Oops, I missed that we could actually pass the root down to find_free_extent().
That's why I made the atomic thing.

Sending v3, thanks.

>
> Not to mention this will race with anybody else doing stuff, so another
> thread that isn't actually touching the tree_root would skip caching a
> block group when it's completely ok for that thread to do it.  Thanks,
>
> Josef


Re: [PATCH v2] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Josef Bacik
On Mon, Oct 22, 2018 at 07:48:30PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
> 
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> 
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
> 
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are COWing an extent
> buffer from the root tree and space caching is enabled (-o space_cache
> mount option). This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
> 
> Reported-by: Andrew Nelson 
> Link: 
> https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Tested-by: Andrew Nelson 
> Signed-off-by: Filipe Manana 
> ---
> 
> V2: Made the solution more generic, since the problem could happen in any
> path COWing an extent buffer from the root tree.
> 
> Applies on top of a previous patch titled:
> 
>  "Btrfs: fix deadlock when writing out free space caches"
> 
>  fs/btrfs/ctree.c   |  4 
>  fs/btrfs/ctree.h   |  3 +++
>  fs/btrfs/disk-io.c |  2 ++
>  fs/btrfs/extent-tree.c | 15 ++-
>  4 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 089b46c4d97f..646aafda55a3 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1065,10 +1065,14 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
>   root == fs_info->chunk_root ||
>   root == fs_info->dev_root)
>   trans->can_flush_pending_bgs = false;
> + else if (root == fs_info->tree_root)
> + atomic_inc(&fs_info->tree_root_cows);
>  
>   cow = btrfs_alloc_tree_block(trans, root, parent_start,
> root->root_key.objectid, &disk_key, level,
>   search_start, empty_size);
> + if (root == fs_info->tree_root)
> + atomic_dec(&fs_info->tree_root_cows);

Do we need this though?  Our root should be the root we're cow'ing the block
for, and it should be passed all the way down to find_free_extent properly, so
we really should be able to just do if (root == fs_info->tree_root) and not add
all this stuff.

Not to mention this will race with anybody else doing stuff, so another
thread that isn't actually touching the tree_root would skip caching a block
group when it's completely ok for that thread to do it.  Thanks,

Josef


Re: [PATCH] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 7:07 PM Josef Bacik  wrote:
>
> On Mon, Oct 22, 2018 at 10:09:46AM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > When we are writing out a free space cache, during the transaction commit
> > phase, we can end up in a deadlock which results in a stack trace like the
> > following:
> >
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >
> > At cache_save_setup() we need to update the inode item of a block group's
> > cache which is located in the tree root (fs_info->tree_root), which means
> > that it may result in COWing a leaf from that tree. If that happens we
> > need to find a free metadata extent and while looking for one, if we find
> > a block group which was not cached yet we attempt to load its cache by
> > calling cache_block_group(). However this function will try to load the
> > inode of the free space cache, which requires finding the matching inode
> > item in the tree root - if that inode item is located in the same leaf as
> > the inode item of the space cache we are updating at cache_save_setup(),
> > we end up in a deadlock, since we try to obtain a read lock on the same
> > extent buffer that we previously write locked.
> >
> > So fix this by skipping the loading of free space caches of any block
> > groups that are not yet cached (rare cases) if we are updating the inode
> > of a free space cache. This is a rare case and its downside is failure to
> > find a free extent (return -ENOSPC) when all the already cached block
> > groups have no free extents.
> >
>
> Actually isn't this a problem for anything that tries to allocate an extent
> while in the tree_root?  Like we snapshot or make a subvolume or anything?

Indeed. Initially I considered making it more generic (like the recent
fix for deadlock when cowing from extent/chunk/device tree) but I
totally forgot about the other cases like you mentioned.

>  We
> should just disallow if root == tree_root.  But even then we only need to do
> this if we're using SPACE_CACHE, using the ye-olde caching or the free space
> tree are both ok.  Let's just limit it to those cases.  Thanks,

Yep, makes all sense.

Thanks! V2 sent out.

>
> Josef


[PATCH v2] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread fdmanana
From: Filipe Manana 

When we are writing out a free space cache, during the transaction commit
phase, we can end up in a deadlock which results in a stack trace like the
following:

 schedule+0x28/0x80
 btrfs_tree_read_lock+0x8e/0x120 [btrfs]
 ? finish_wait+0x80/0x80
 btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
 btrfs_search_slot+0xf6/0x9f0 [btrfs]
 ? evict_refill_and_join+0xd0/0xd0 [btrfs]
 ? inode_insert5+0x119/0x190
 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
 ? kmem_cache_alloc+0x166/0x1d0
 btrfs_iget+0x113/0x690 [btrfs]
 __lookup_free_space_inode+0xd8/0x150 [btrfs]
 lookup_free_space_inode+0x5b/0xb0 [btrfs]
 load_free_space_cache+0x7c/0x170 [btrfs]
 ? cache_block_group+0x72/0x3b0 [btrfs]
 cache_block_group+0x1b3/0x3b0 [btrfs]
 ? finish_wait+0x80/0x80
 find_free_extent+0x799/0x1010 [btrfs]
 btrfs_reserve_extent+0x9b/0x180 [btrfs]
 btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
 __btrfs_cow_block+0x11d/0x500 [btrfs]
 btrfs_cow_block+0xdc/0x180 [btrfs]
 btrfs_search_slot+0x3bd/0x9f0 [btrfs]
 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
 ? kmem_cache_alloc+0x166/0x1d0
 btrfs_update_inode_item+0x46/0x100 [btrfs]
 cache_save_setup+0xe4/0x3a0 [btrfs]
 btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
 btrfs_commit_transaction+0xcb/0x8b0 [btrfs]

At cache_save_setup() we need to update the inode item of a block group's
cache which is located in the tree root (fs_info->tree_root), which means
that it may result in COWing a leaf from that tree. If that happens we
need to find a free metadata extent and while looking for one, if we find
a block group which was not cached yet we attempt to load its cache by
calling cache_block_group(). However this function will try to load the
inode of the free space cache, which requires finding the matching inode
item in the tree root - if that inode item is located in the same leaf as
the inode item of the space cache we are updating at cache_save_setup(),
we end up in a deadlock, since we try to obtain a read lock on the same
extent buffer that we previously write locked.

So fix this by skipping the loading of free space caches of any block
groups that are not yet cached (rare cases) if we are COWing an extent
buffer from the root tree and space caching is enabled (-o space_cache
mount option). This is a rare case and its downside is failure to
find a free extent (return -ENOSPC) when all the already cached block
groups have no free extents.

Reported-by: Andrew Nelson 
Link: 
https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
Tested-by: Andrew Nelson 
Signed-off-by: Filipe Manana 
---

V2: Made the solution more generic, since the problem could happen in any
path COWing an extent buffer from the root tree.

Applies on top of a previous patch titled:

 "Btrfs: fix deadlock when writing out free space caches"

 fs/btrfs/ctree.c   |  4 
 fs/btrfs/ctree.h   |  3 +++
 fs/btrfs/disk-io.c |  2 ++
 fs/btrfs/extent-tree.c | 15 ++-
 4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 089b46c4d97f..646aafda55a3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1065,10 +1065,14 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
root == fs_info->chunk_root ||
root == fs_info->dev_root)
trans->can_flush_pending_bgs = false;
+   else if (root == fs_info->tree_root)
+   atomic_inc(&fs_info->tree_root_cows);
 
cow = btrfs_alloc_tree_block(trans, root, parent_start,
root->root_key.objectid, &disk_key, level,
search_start, empty_size);
+   if (root == fs_info->tree_root)
+   atomic_dec(&fs_info->tree_root_cows);
trans->can_flush_pending_bgs = true;
if (IS_ERR(cow))
return PTR_ERR(cow);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..1b73433c69e2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
u32 sectorsize;
u32 stripesize;
 
+   /* Number of tasks currently COWing a leaf/node from the tree root. */
+   atomic_t tree_root_cows;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
spinlock_t ref_verify_lock;
struct rb_root block_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 05dc3c17cb62..08c15bf69fb5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2782,6 +2782,8 @@ int open_ctree(struct super_block *sb,
fs_info->sectorsize = 4096;
fs_info->stripesize = 4096;
 
+   atomic_set(&fs_info->tree_root_cows, 0);
+
ret = btrfs_alloc_stripe_hash_table(fs_info);
if (ret) {
err = ret;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 577878324799..14f35e020050 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7366,7 +7366,20 

Re: [PATCH] Btrfs: fix use-after-free when dumping free space

2018-10-22 Thread David Sterba
On Mon, Oct 22, 2018 at 10:43:06AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 

...

> Reported-by: Nikolay Borisov 
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.


Re: [PATCH] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Josef Bacik
On Mon, Oct 22, 2018 at 10:09:46AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
> 
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> 
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
> 
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
> 

Actually isn't this a problem for anything that tries to allocate an extent
while in the tree_root?  Like when we snapshot or make a subvolume or anything?
We should just disallow it if root == tree_root.  But even then we only need to
do this if we're using SPACE_CACHE; the ye-olde caching and the free space
tree are both ok.  Let's just limit it to those cases.  Thanks,

Josef
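
A rough sketch of the restriction suggested above, at the point in
find_free_extent() where caching of an uncached block group would be started
(illustration only, assuming the target root is passed down to
find_free_extent() as the later v3 patch does; the actual fix is in the
follow-up v2/v3 patches elsewhere in this thread):

have_block_group:
		cached = block_group_cache_done(block_group);
		if (unlikely(!cached)) {
			/*
			 * Sketch, not the committed change: never start
			 * loading a v1 space cache from here when allocating
			 * on behalf of the root tree, since reading the
			 * cache's inode item could try to read lock a root
			 * tree leaf we already write locked.
			 */
			if (root == fs_info->tree_root &&
			    btrfs_test_opt(fs_info, SPACE_CACHE))
				goto loop;	/* just try the next block group */

			have_caching_bg = true;
			ret = cache_block_group(block_group, 0);
			BUG_ON(ret < 0);
			ret = 0;
		}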


Re: [PATCH] Btrfs: fix use-after-free when dumping free space

2018-10-22 Thread Josef Bacik
On Mon, Oct 22, 2018 at 10:43:06AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> We were iterating a block group's free space cache rbtree without locking
> first the lock that protects it (the free_space_ctl->free_space_offset
> rbtree is protected by the free_space_ctl->tree_lock spinlock).
> 
> KASAN reported an use-after-free problem when iterating such a rbtree due
> to a concurrent rbtree delete:
> 
> [ 9520.359168] 
> ==
> [ 9520.359656] BUG: KASAN: use-after-free in rb_next+0x13/0x90
> [ 9520.359949] Read of size 8 at addr 8800b7ada500 by task 
> btrfs-transacti/1721
> [ 9520.360357]
> [ 9520.360530] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G 
> L4.19.0-rc8-nbor #555
> [ 9520.360990] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1ubuntu1 04/01/2014
> [ 9520.362682] Call Trace:
> [ 9520.362887]  dump_stack+0xa4/0xf5
> [ 9520.363146]  print_address_description+0x78/0x280
> [ 9520.363412]  kasan_report+0x263/0x390
> [ 9520.363650]  ? rb_next+0x13/0x90
> [ 9520.363873]  __asan_load8+0x54/0x90
> [ 9520.364102]  rb_next+0x13/0x90
> [ 9520.364380]  btrfs_dump_free_space+0x146/0x160 [btrfs]
> [ 9520.364697]  dump_space_info+0x2cd/0x310 [btrfs]
> [ 9520.364997]  btrfs_reserve_extent+0x1ee/0x1f0 [btrfs]
> [ 9520.365310]  __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs]
> [ 9520.365646]  ? btrfs_update_time+0x180/0x180 [btrfs]
> [ 9520.365923]  ? _raw_spin_unlock+0x27/0x40
> [ 9520.366204]  ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs]
> [ 9520.366549]  btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs]
> [ 9520.366880]  cache_save_setup+0x42e/0x580 [btrfs]
> [ 9520.367220]  ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs]
> [ 9520.367518]  ? lock_downgrade+0x2f0/0x2f0
> [ 9520.367799]  ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs]
> [ 9520.368104]  ? kasan_check_read+0x11/0x20
> [ 9520.368349]  ? do_raw_spin_unlock+0xa8/0x140
> [ 9520.368638]  btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs]
> [ 9520.368978]  ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs]
> [ 9520.369282]  ? do_raw_spin_unlock+0xa8/0x140
> [ 9520.369534]  ? _raw_spin_unlock+0x27/0x40
> [ 9520.369811]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
> [ 9520.370137]  commit_cowonly_roots+0x4b9/0x610 [btrfs]
> [ 9520.370560]  ? commit_fs_roots+0x350/0x350 [btrfs]
> [ 9520.370926]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
> [ 9520.371285]  btrfs_commit_transaction+0x5e5/0x10e0 [btrfs]
> [ 9520.371612]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
> [ 9520.371943]  ? start_transaction+0x168/0x6c0 [btrfs]
> [ 9520.372257]  transaction_kthread+0x21c/0x240 [btrfs]
> [ 9520.372537]  kthread+0x1d2/0x1f0
> [ 9520.372793]  ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
> [ 9520.373090]  ? kthread_park+0xb0/0xb0
> [ 9520.373329]  ret_from_fork+0x3a/0x50
> [ 9520.373567]
> [ 9520.373738] Allocated by task 1804:
> [ 9520.373974]  kasan_kmalloc+0xff/0x180
> [ 9520.374208]  kasan_slab_alloc+0x11/0x20
> [ 9520.374447]  kmem_cache_alloc+0xfc/0x2d0
> [ 9520.374731]  __btrfs_add_free_space+0x40/0x580 [btrfs]
> [ 9520.375044]  unpin_extent_range+0x4f7/0x7a0 [btrfs]
> [ 9520.375383]  btrfs_finish_extent_commit+0x15f/0x4d0 [btrfs]
> [ 9520.375707]  btrfs_commit_transaction+0xb06/0x10e0 [btrfs]
> [ 9520.376027]  btrfs_alloc_data_chunk_ondemand+0x237/0x5c0 [btrfs]
> [ 9520.376365]  btrfs_check_data_free_space+0x81/0xd0 [btrfs]
> [ 9520.376689]  btrfs_delalloc_reserve_space+0x25/0x80 [btrfs]
> [ 9520.377018]  btrfs_direct_IO+0x42e/0x6d0 [btrfs]
> [ 9520.377284]  generic_file_direct_write+0x11e/0x220
> [ 9520.377587]  btrfs_file_write_iter+0x472/0xac0 [btrfs]
> [ 9520.377875]  aio_write+0x25c/0x360
> [ 9520.378106]  io_submit_one+0xaa0/0xdc0
> [ 9520.378343]  __se_sys_io_submit+0xfa/0x2f0
> [ 9520.378589]  __x64_sys_io_submit+0x43/0x50
> [ 9520.378840]  do_syscall_64+0x7d/0x240
> [ 9520.379081]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 9520.379387]
> [ 9520.379557] Freed by task 1802:
> [ 9520.379782]  __kasan_slab_free+0x173/0x260
> [ 9520.380028]  kasan_slab_free+0xe/0x10
> [ 9520.380262]  kmem_cache_free+0xc1/0x2c0
> [ 9520.380544]  btrfs_find_space_for_alloc+0x4cd/0x4e0 [btrfs]
> [ 9520.380866]  find_free_extent+0xa99/0x17e0 [btrfs]
> [ 9520.381166]  btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
> [ 9520.381474]  btrfs_get_blocks_direct+0x60b/0xbd0 [btrfs]
> [ 9520.381761]  __blockdev_direct_IO+0x10ee/0x58a1
> [ 9520.382059]  btrfs_direct_IO+0x25a/0x6d0 [btrfs]
> [ 9520.382321]  generic_file_direct_write+0x11e/0x220
> [ 9520.382623]  btrfs_file_write_iter+0x472/0xac0 [btrfs]
> [ 9520.382904]  aio_write+0x25c/0x360
> [ 9520.383172]  io_submit_one+0xaa0/0xdc0
> [ 9520.383416]  __se_sys_io_submit+0xfa/0x2f0
> [ 9520.383678]  __x64_sys_io_submit+0x43/0x50
> [ 9520.383927]  do_syscall_64+0x7d/0x240
> [ 9520.384165]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 9520.384439]
> [ 9520.384610] The buggy address belongs 
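
(The quoted report and the diff are cut off above. The core of the fix is
simply to hold the free_space_ctl->tree_lock spinlock around the rbtree walk;
a minimal sketch of that idea, assuming the loop is the one in
btrfs_dump_free_space() and that "bytes" is that function's argument, not
necessarily the exact committed change:)

	struct btrfs_fs_info *fs_info = block_group->fs_info;
	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
	struct btrfs_free_space *info;
	struct rb_node *n;
	int count = 0;

	/*
	 * free_space_offset is protected by tree_lock; holding it prevents a
	 * concurrent btrfs_find_space_for_alloc() from freeing entries while
	 * we iterate them (the use-after-free KASAN reported above).
	 */
	spin_lock(&ctl->tree_lock);
	for (n = rb_first(&ctl->free_space_offset); n; n = rb_next(n)) {
		info = rb_entry(n, struct btrfs_free_space, offset_index);
		if (info->bytes >= bytes && !block_group->ro)
			count++;
		btrfs_crit(fs_info, "entry offset %llu, bytes %llu, bitmap %s",
			   info->offset, info->bytes,
			   info->bitmap ? "yes" : "no");
	}
	spin_unlock(&ctl->tree_lock);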

Re: [PATCH] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 10:10 AM  wrote:
>
> From: Filipe Manana 
>
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
>
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
>
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
>
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
>
> Reported-by: Andrew Nelson 
> Link: 
> https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Signed-off-by: Filipe Manana 

Tested-by: Andrew Nelson 

> ---
>  fs/btrfs/ctree.h   |  3 +++
>  fs/btrfs/disk-io.c |  2 ++
>  fs/btrfs/extent-tree.c | 22 +-
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2cddfe7806a4..d23ee26eb17d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
> u32 sectorsize;
> u32 stripesize;
>
> +   /* The task currently updating a free space cache inode item. */
> +   struct task_struct *space_cache_updater;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> spinlock_t ref_verify_lock;
> struct rb_root block_tree;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 05dc3c17cb62..aa5e9a91e560 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2782,6 +2782,8 @@ int open_ctree(struct super_block *sb,
> fs_info->sectorsize = 4096;
> fs_info->stripesize = 4096;
>
> +   fs_info->space_cache_updater = NULL;
> +
> ret = btrfs_alloc_stripe_hash_table(fs_info);
> if (ret) {
> err = ret;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 577878324799..e93040449771 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3364,7 +3364,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
>  * time.
>  */
> BTRFS_I(inode)->generation = 0;
> +   fs_info->space_cache_updater = current;
> ret = btrfs_update_inode(trans, root, inode);
> +   fs_info->space_cache_updater = NULL;
> if (ret) {
> /*
>  * So theoretically we could recover from this, simply set the
> @@ -7366,7 +7368,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>
>  have_block_group:
> cached = block_group_cache_done(block_group);
> -   if (unlikely(!cached)) {
> +   /*
> +* If we are updating the inode of a free space cache, we can
> +* not start the caching of any block group because we could
> +* deadlock on an extent buffer of the root tree.
> +* At 

Re: [PATCH] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 10:10 AM  wrote:
>
> From: Filipe Manana 
>
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
>
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
>
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
>
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
>
> Reported-by: Andrew Nelson 
> Link: 
> https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Signed-off-by: Filipe Manana 

Andrew Nelson 


> ---
>  fs/btrfs/ctree.h   |  3 +++
>  fs/btrfs/disk-io.c |  2 ++
>  fs/btrfs/extent-tree.c | 22 +-
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2cddfe7806a4..d23ee26eb17d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
> u32 sectorsize;
> u32 stripesize;
>
> +   /* The task currently updating a free space cache inode item. */
> +   struct task_struct *space_cache_updater;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> spinlock_t ref_verify_lock;
> struct rb_root block_tree;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 05dc3c17cb62..aa5e9a91e560 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2782,6 +2782,8 @@ int open_ctree(struct super_block *sb,
> fs_info->sectorsize = 4096;
> fs_info->stripesize = 4096;
>
> +   fs_info->space_cache_updater = NULL;
> +
> ret = btrfs_alloc_stripe_hash_table(fs_info);
> if (ret) {
> err = ret;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 577878324799..e93040449771 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3364,7 +3364,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
>  * time.
>  */
> BTRFS_I(inode)->generation = 0;
> +   fs_info->space_cache_updater = current;
> ret = btrfs_update_inode(trans, root, inode);
> +   fs_info->space_cache_updater = NULL;
> if (ret) {
> /*
>  * So theoretically we could recover from this, simply set the
> @@ -7366,7 +7368,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>
>  have_block_group:
> cached = block_group_cache_done(block_group);
> -   if (unlikely(!cached)) {
> +   /*
> +* If we are updating the inode of a free space cache, we can
> +* not start the caching of any block group because we could
> +* deadlock on an extent buffer of the root tree.
> +* At 

[PATCH] Btrfs: fix use-after-free when dumping free space

2018-10-22 Thread fdmanana
From: Filipe Manana 

We were iterating a block group's free space cache rbtree without first
taking the lock that protects it (the free_space_ctl->free_space_offset
rbtree is protected by the free_space_ctl->tree_lock spinlock).
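
The change boils down to taking free_space_ctl->tree_lock around the rbtree
walk; a minimal sketch of that pattern, assuming the surrounding code in
btrfs_dump_free_space() looks as it did in mainline at the time (this is not
the verbatim diff):

    /*
     * Sketch only: hold ctl->tree_lock for the whole walk so that a
     * concurrent btrfs_find_space_for_alloc() cannot free an entry
     * that rb_next() is about to dereference.
     */
    struct btrfs_fs_info *fs_info = block_group->fs_info;
    struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
    struct btrfs_free_space *info;
    struct rb_node *n;
    int count = 0;

    spin_lock(&ctl->tree_lock);
    for (n = rb_first(&ctl->free_space_offset); n; n = rb_next(n)) {
            info = rb_entry(n, struct btrfs_free_space, offset_index);
            if (info->bytes >= bytes && !block_group->ro)
                    count++;
            btrfs_crit(fs_info,
                       "entry offset %llu, bytes %llu, bitmap %s",
                       info->offset, info->bytes,
                       info->bitmap ? "yes" : "no");
    }
    spin_unlock(&ctl->tree_lock);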

KASAN reported a use-after-free problem when iterating such an rbtree due
to a concurrent rbtree delete:

[ 9520.359168] 
==
[ 9520.359656] BUG: KASAN: use-after-free in rb_next+0x13/0x90
[ 9520.359949] Read of size 8 at addr 8800b7ada500 by task 
btrfs-transacti/1721
[ 9520.360357]
[ 9520.360530] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G L  
  4.19.0-rc8-nbor #555
[ 9520.360990] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[ 9520.362682] Call Trace:
[ 9520.362887]  dump_stack+0xa4/0xf5
[ 9520.363146]  print_address_description+0x78/0x280
[ 9520.363412]  kasan_report+0x263/0x390
[ 9520.363650]  ? rb_next+0x13/0x90
[ 9520.363873]  __asan_load8+0x54/0x90
[ 9520.364102]  rb_next+0x13/0x90
[ 9520.364380]  btrfs_dump_free_space+0x146/0x160 [btrfs]
[ 9520.364697]  dump_space_info+0x2cd/0x310 [btrfs]
[ 9520.364997]  btrfs_reserve_extent+0x1ee/0x1f0 [btrfs]
[ 9520.365310]  __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs]
[ 9520.365646]  ? btrfs_update_time+0x180/0x180 [btrfs]
[ 9520.365923]  ? _raw_spin_unlock+0x27/0x40
[ 9520.366204]  ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs]
[ 9520.366549]  btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs]
[ 9520.366880]  cache_save_setup+0x42e/0x580 [btrfs]
[ 9520.367220]  ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs]
[ 9520.367518]  ? lock_downgrade+0x2f0/0x2f0
[ 9520.367799]  ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs]
[ 9520.368104]  ? kasan_check_read+0x11/0x20
[ 9520.368349]  ? do_raw_spin_unlock+0xa8/0x140
[ 9520.368638]  btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs]
[ 9520.368978]  ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs]
[ 9520.369282]  ? do_raw_spin_unlock+0xa8/0x140
[ 9520.369534]  ? _raw_spin_unlock+0x27/0x40
[ 9520.369811]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.370137]  commit_cowonly_roots+0x4b9/0x610 [btrfs]
[ 9520.370560]  ? commit_fs_roots+0x350/0x350 [btrfs]
[ 9520.370926]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.371285]  btrfs_commit_transaction+0x5e5/0x10e0 [btrfs]
[ 9520.371612]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
[ 9520.371943]  ? start_transaction+0x168/0x6c0 [btrfs]
[ 9520.372257]  transaction_kthread+0x21c/0x240 [btrfs]
[ 9520.372537]  kthread+0x1d2/0x1f0
[ 9520.372793]  ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
[ 9520.373090]  ? kthread_park+0xb0/0xb0
[ 9520.373329]  ret_from_fork+0x3a/0x50
[ 9520.373567]
[ 9520.373738] Allocated by task 1804:
[ 9520.373974]  kasan_kmalloc+0xff/0x180
[ 9520.374208]  kasan_slab_alloc+0x11/0x20
[ 9520.374447]  kmem_cache_alloc+0xfc/0x2d0
[ 9520.374731]  __btrfs_add_free_space+0x40/0x580 [btrfs]
[ 9520.375044]  unpin_extent_range+0x4f7/0x7a0 [btrfs]
[ 9520.375383]  btrfs_finish_extent_commit+0x15f/0x4d0 [btrfs]
[ 9520.375707]  btrfs_commit_transaction+0xb06/0x10e0 [btrfs]
[ 9520.376027]  btrfs_alloc_data_chunk_ondemand+0x237/0x5c0 [btrfs]
[ 9520.376365]  btrfs_check_data_free_space+0x81/0xd0 [btrfs]
[ 9520.376689]  btrfs_delalloc_reserve_space+0x25/0x80 [btrfs]
[ 9520.377018]  btrfs_direct_IO+0x42e/0x6d0 [btrfs]
[ 9520.377284]  generic_file_direct_write+0x11e/0x220
[ 9520.377587]  btrfs_file_write_iter+0x472/0xac0 [btrfs]
[ 9520.377875]  aio_write+0x25c/0x360
[ 9520.378106]  io_submit_one+0xaa0/0xdc0
[ 9520.378343]  __se_sys_io_submit+0xfa/0x2f0
[ 9520.378589]  __x64_sys_io_submit+0x43/0x50
[ 9520.378840]  do_syscall_64+0x7d/0x240
[ 9520.379081]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 9520.379387]
[ 9520.379557] Freed by task 1802:
[ 9520.379782]  __kasan_slab_free+0x173/0x260
[ 9520.380028]  kasan_slab_free+0xe/0x10
[ 9520.380262]  kmem_cache_free+0xc1/0x2c0
[ 9520.380544]  btrfs_find_space_for_alloc+0x4cd/0x4e0 [btrfs]
[ 9520.380866]  find_free_extent+0xa99/0x17e0 [btrfs]
[ 9520.381166]  btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
[ 9520.381474]  btrfs_get_blocks_direct+0x60b/0xbd0 [btrfs]
[ 9520.381761]  __blockdev_direct_IO+0x10ee/0x58a1
[ 9520.382059]  btrfs_direct_IO+0x25a/0x6d0 [btrfs]
[ 9520.382321]  generic_file_direct_write+0x11e/0x220
[ 9520.382623]  btrfs_file_write_iter+0x472/0xac0 [btrfs]
[ 9520.382904]  aio_write+0x25c/0x360
[ 9520.383172]  io_submit_one+0xaa0/0xdc0
[ 9520.383416]  __se_sys_io_submit+0xfa/0x2f0
[ 9520.383678]  __x64_sys_io_submit+0x43/0x50
[ 9520.383927]  do_syscall_64+0x7d/0x240
[ 9520.384165]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 9520.384439]
[ 9520.384610] The buggy address belongs to the object at 8800b7ada500
which belongs to the cache btrfs_free_space of size 72
[ 9520.385175] The buggy address is located 0 bytes inside of
72-byte region [8800b7ada500, 8800b7ada548)
[ 9520.385691] The buggy 

Re: Btrfs resize seems to deadlock

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 10:06 AM Andrew Nelson
 wrote:
>
> OK, an update: After unmounting and running btrfs check, the drive
> reverted to reporting the old size. Not sure if this was due to
> unmounting / mounting or doing btrfs check. Btrfs check should have
> been running in readonly mode.

It reverted to the old size because the transaction used for the
resize operation never got committed due to the deadlock, not because
of 'btrfs check'.

>  Since it looked like something was
> wrong with the resize process, I patched my kernel with the posted
> patch. This time the resize operation finished successfully.

Great, thanks for testing!

> On Sun, Oct 21, 2018 at 1:56 AM Filipe Manana  wrote:
> >
> > On Sun, Oct 21, 2018 at 6:05 AM Andrew Nelson  
> > wrote:
> > >
> > > Also, is the drive in a safe state to use? Is there anything I should
> > > run on the drive to check consistency?
> >
> > It should be in a safe state. You can verify it by running "btrfs check
> > /dev/" (it's a readonly operation).
> >
> > If you are able to patch and build a kernel, you can also try the
> > patch. I left it running tests overnight and haven't got any
> > regressions.
> >
> > Thanks.
> >
> > > On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson
> > >  wrote:
> > > >
> > > > I have run the "btrfs inspect-internal dump-tree -t 1" command, but
> > > > the output is ~55mb. Is there something in particular you are looking
> > > > for in this?
> > > > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  
> > > > wrote:
> > > > >
> > > > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> > > > > >
> > > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson 
> > > > > >  wrote:
> > > > > > >
> > > > > > > I am having an issue with btrfs resize in Fedora 28. I am 
> > > > > > > attempting
> > > > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > > > > > resize max $MOUNT", the command runs for a few minutes and then 
> > > > > > > hangs
> > > > > > > forcing the system to be reset. I am not sure what the state of 
> > > > > > > the
> > > > > > > filesystem really is at this point. Btrfs usage does report the
> > > > > > > correct size for after resizing. Details below:
> > > > > > >
> > > > > >
> > > > > > Thanks for the report, the stack is helpful, but this needs a few
> > > > > > deeper debugging, may I ask you to post "btrfs inspect-internal
> > > > > > dump-tree -t 1 /dev/your_btrfs_disk"?
> > > > >
> > > > > I believe it's actually easy to understand from the trace alone and
> > > > > it's kind of a bad luck scenario.
> > > > > I made this fix a few hours ago:
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
> > > > >
> > > > > But haven't done full testing yet and might have missed something.
> > > > > Bo, can you take a look and let me know what you think?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > >
> > > > > > So I'd like to know what's the height of your tree "1" which refers 
> > > > > > to
> > > > > > root tree in btrfs.
> > > > > >
> > > > > > thanks,
> > > > > > liubo
> > > > > >
> > > > > > > $ sudo btrfs filesystem usage $MOUNT
> > > > > > > Overall:
> > > > > > > Device size:  90.96TiB
> > > > > > > Device allocated: 72.62TiB
> > > > > > > Device unallocated:   18.33TiB
> > > > > > > Device missing:  0.00B
> > > > > > > Used: 72.62TiB
> > > > > > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > > > > > Data ratio:   1.00
> > > > > > > Metadata ratio:   2.00
> > > > > > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > > > > > >
> > > > > > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > > > > > $MOUNT72.46TiB
> > > > > > >
> > > > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > > > > > $MOUNT   172.00GiB
> > > > > > >
> > > > > > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > > > > > >$MOUNT80.00MiB
> > > > > > >
> > > > > > > Unallocated:
> > > > > > > $MOUNT18.33TiB
> > > > > > >
> > > > > > > $ uname -a
> > > > > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon 
> > > > > > > Oct 15
> > > > > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > > > > > >
> > > > > > > btrfs-transacti D0  2501  2 0x8000
> > > > > > > Call Trace:
> > > > > > >  ? __schedule+0x253/0x860
> > > > > > >  schedule+0x28/0x80
> > > > > > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > > > > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > > > > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > > > > > >  ? finish_wait+0x80/0x80
> > > > > > >  transaction_kthread+0x155/0x170 [btrfs]
> > > > > > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > > > > > >  kthread+0x112/0x130
> > > > > > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > > >  ret_from_fork+0x35/0x40
> > > > > > 

[PATCH] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread fdmanana
From: Filipe Manana 

When we are writing out a free space cache, during the transaction commit
phase, we can end up in a deadlock which results in a stack trace like the
following:

 schedule+0x28/0x80
 btrfs_tree_read_lock+0x8e/0x120 [btrfs]
 ? finish_wait+0x80/0x80
 btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
 btrfs_search_slot+0xf6/0x9f0 [btrfs]
 ? evict_refill_and_join+0xd0/0xd0 [btrfs]
 ? inode_insert5+0x119/0x190
 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
 ? kmem_cache_alloc+0x166/0x1d0
 btrfs_iget+0x113/0x690 [btrfs]
 __lookup_free_space_inode+0xd8/0x150 [btrfs]
 lookup_free_space_inode+0x5b/0xb0 [btrfs]
 load_free_space_cache+0x7c/0x170 [btrfs]
 ? cache_block_group+0x72/0x3b0 [btrfs]
 cache_block_group+0x1b3/0x3b0 [btrfs]
 ? finish_wait+0x80/0x80
 find_free_extent+0x799/0x1010 [btrfs]
 btrfs_reserve_extent+0x9b/0x180 [btrfs]
 btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
 __btrfs_cow_block+0x11d/0x500 [btrfs]
 btrfs_cow_block+0xdc/0x180 [btrfs]
 btrfs_search_slot+0x3bd/0x9f0 [btrfs]
 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
 ? kmem_cache_alloc+0x166/0x1d0
 btrfs_update_inode_item+0x46/0x100 [btrfs]
 cache_save_setup+0xe4/0x3a0 [btrfs]
 btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
 btrfs_commit_transaction+0xcb/0x8b0 [btrfs]

At cache_save_setup() we need to update the inode item of a block group's
cache which is located in the tree root (fs_info->tree_root), which means
that it may result in COWing a leaf from that tree. If that happens we
need to find a free metadata extent and while looking for one, if we find
a block group which was not cached yet we attempt to load its cache by
calling cache_block_group(). However this function will try to load the
inode of the free space cache, which requires finding the matching inode
item in the tree root - if that inode item is located in the same leaf as
the inode item of the space cache we are updating at cache_save_setup(),
we end up in a deadlock, since we try to obtain a read lock on the same
extent buffer that we previously write locked.

So fix this by skipping the loading of free space caches of any block
groups that are not yet cached (rare cases) if we are updating the inode
of a free space cache. This is a rare case and its downside is failure to
find a free extent (return -ENOSPC) when all the already cached block
groups have no free extents.

Reported-by: Andrew Nelson 
Link: 
https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
Signed-off-by: Filipe Manana 
---
 fs/btrfs/ctree.h   |  3 +++
 fs/btrfs/disk-io.c |  2 ++
 fs/btrfs/extent-tree.c | 22 +-
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..d23ee26eb17d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
u32 sectorsize;
u32 stripesize;
 
+   /* The task currently updating a free space cache inode item. */
+   struct task_struct *space_cache_updater;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
spinlock_t ref_verify_lock;
struct rb_root block_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 05dc3c17cb62..aa5e9a91e560 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2782,6 +2782,8 @@ int open_ctree(struct super_block *sb,
fs_info->sectorsize = 4096;
fs_info->stripesize = 4096;
 
+   fs_info->space_cache_updater = NULL;
+
ret = btrfs_alloc_stripe_hash_table(fs_info);
if (ret) {
err = ret;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 577878324799..e93040449771 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3364,7 +3364,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
 * time.
 */
BTRFS_I(inode)->generation = 0;
+   fs_info->space_cache_updater = current;
ret = btrfs_update_inode(trans, root, inode);
+   fs_info->space_cache_updater = NULL;
if (ret) {
/*
 * So theoretically we could recover from this, simply set the
@@ -7366,7 +7368,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 
 have_block_group:
cached = block_group_cache_done(block_group);
-   if (unlikely(!cached)) {
+   /*
+* If we are updating the inode of a free space cache, we can
+* not start the caching of any block group because we could
+* deadlock on an extent buffer of the root tree.
+* At cache_save_setup() we update the inode item of a free
+* space cache, so we may need to COW a leaf of the root tree,
+* which implies finding a free metadata extent. So if when
+* searching for such an extent we find a block group that was
+   
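
(The hunk is cut off above by the archive; judging from the commit message
and the new space_cache_updater field, the resulting guard presumably ends
up looking something like the following hypothetical reconstruction, not the
verbatim patch:)

    /* Hypothetical reconstruction of the truncated check. */
    if (unlikely(!cached &&
                 fs_info->space_cache_updater != current)) {
            have_caching_bg = true;
            ret = cache_block_group(block_group, 0);
            BUG_ON(ret < 0);
            ret = 0;
    }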

Re: Btrfs resize seems to deadlock

2018-10-22 Thread Andrew Nelson
OK, an update: After unmounting and running btrfs check, the drive
reverted to reporting the old size. Not sure if this was due to
unmounting / mounting or doing btrfs check. Btrfs check should have
been running in readonly mode. Since it looked like something was
wrong with the resize process, I patched my kernel with the posted
patch. This time the resize operation finished successfully.
On Sun, Oct 21, 2018 at 1:56 AM Filipe Manana  wrote:
>
> On Sun, Oct 21, 2018 at 6:05 AM Andrew Nelson  
> wrote:
> >
> > Also, is the drive in a safe state to use? Is there anything I should
> > run on the drive to check consistency?
>
> It should be in a safe state. You can verify it by running "btrfs check
> /dev/" (it's a readonly operation).
>
> If you are able to patch and build a kernel, you can also try the
> patch. I left it running tests overnight and haven't got any
> regressions.
>
> Thanks.
>
> > On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson
> >  wrote:
> > >
> > > I have run the "btrfs inspect-internal dump-tree -t 1" command, but
> > > the output is ~55mb. Is there something in particular you are looking
> > > for in this?
> > > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  wrote:
> > > >
> > > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> > > > >
> > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson 
> > > > >  wrote:
> > > > > >
> > > > > > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > > > > resize max $MOUNT", the command runs for a few minutes and then 
> > > > > > hangs
> > > > > > forcing the system to be reset. I am not sure what the state of the
> > > > > > filesystem really is at this point. Btrfs usage does report the
> > > > > > correct size for after resizing. Details below:
> > > > > >
> > > > >
> > > > > Thanks for the report, the stack is helpful, but this needs a few
> > > > > deeper debugging, may I ask you to post "btrfs inspect-internal
> > > > > dump-tree -t 1 /dev/your_btrfs_disk"?
> > > >
> > > > I believe it's actually easy to understand from the trace alone and
> > > > it's kind of a bad luck scenario.
> > > > I made this fix a few hours ago:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
> > > >
> > > > But haven't done full testing yet and might have missed something.
> > > > Bo, can you take a look and let me know what you think?
> > > >
> > > > Thanks.
> > > >
> > > > >
> > > > > So I'd like to know what's the height of your tree "1" which refers to
> > > > > root tree in btrfs.
> > > > >
> > > > > thanks,
> > > > > liubo
> > > > >
> > > > > > $ sudo btrfs filesystem usage $MOUNT
> > > > > > Overall:
> > > > > > Device size:  90.96TiB
> > > > > > Device allocated: 72.62TiB
> > > > > > Device unallocated:   18.33TiB
> > > > > > Device missing:  0.00B
> > > > > > Used: 72.62TiB
> > > > > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > > > > Data ratio:   1.00
> > > > > > Metadata ratio:   2.00
> > > > > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > > > > >
> > > > > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > > > > $MOUNT72.46TiB
> > > > > >
> > > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > > > > $MOUNT   172.00GiB
> > > > > >
> > > > > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > > > > >$MOUNT80.00MiB
> > > > > >
> > > > > > Unallocated:
> > > > > > $MOUNT18.33TiB
> > > > > >
> > > > > > $ uname -a
> > > > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 
> > > > > > 15
> > > > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > > > > >
> > > > > > btrfs-transacti D0  2501  2 0x8000
> > > > > > Call Trace:
> > > > > >  ? __schedule+0x253/0x860
> > > > > >  schedule+0x28/0x80
> > > > > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > > > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > > > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > > > > >  ? finish_wait+0x80/0x80
> > > > > >  transaction_kthread+0x155/0x170 [btrfs]
> > > > > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > > > > >  kthread+0x112/0x130
> > > > > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > >  ret_from_fork+0x35/0x40
> > > > > > btrfs   D0  2504   2502 0x0002
> > > > > > Call Trace:
> > > > > >  ? __schedule+0x253/0x860
> > > > > >  schedule+0x28/0x80
> > > > > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > > > > >  ? finish_wait+0x80/0x80
> > > > > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > > > > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > > > > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > > > > >  ? inode_insert5+0x119/0x190
> > > > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > > > >  ? 

Re: How to recover from btrfs scrub errors? (uncorrectable errors, checksum error at logical)

2018-10-22 Thread Qu Wenruo


On 2018/10/22 2:29 PM, Otto Kekäläinen wrote:
> I never got a reply to this thread, 

I replied to you but got no reply:

https://lore.kernel.org/linux-btrfs/eba5de6f-535a-0f5d-e415-9cd622d71...@gmx.com/

And your steps are just what I suggested.

Thanks,
Qu

> but I am now replying to myself in
> case somebody has the same issue and is reading the archive:
> 
> The problem went away after:
> - deleted all snapshots as they seemed to slow down btrfs I/O so much
> that simple commands like rm and rsync were unusable
> - replaced the disk that had the corrupted file (just in case -
> smartctl did not indicate any disk failures) with btrfs replace
> - rsynced files from another location to this filesystem so that the
> corrupted files got overwritten
> 
> Now btrfs scrub does not find any corruption anymore and the
> filesystem I/O speed is usable, though still slower than it used to be.
> 
> On Mon, Oct 15, 2018 at 10:50, Otto Kekäläinen (o...@seravo.fi) wrote:
>>
>> Hello!
>>
>> I am trying to figure out how to recover from errors detected by btrfs scrub.
>>
>> Scrub status reports:
>>
>> scrub status for 4f4479d5-648a-45b9-bcbf-978c766aeb41
>> scrub started at Mon Oct 15 10:02:28 2018, running for 00:35:39
>> total bytes scrubbed: 791.15GiB with 18 errors
>> error details: csum=18
>> corrected errors: 0, uncorrectable errors: 18, unverified errors: 0
>>
>> Kernel log contains lines like
>>
>>   BTRFS warning (device dm-8): checksum error at logical 7351706472448 on dev
>>   /dev/mapper/disk6tb, sector 61412648, root 12725, inode 152358265,
>> offset 483328:
>>   path resolving failed with ret=-2
>>
>> I've tried so far:
>> - deleting the files (when path is visible)
>> - overwriting the files with new data
>> - changed disk (with btrfs replace)
>>
>> The checksum errors however persist.
>> How do I get rid of them?
>>
>>
>> The files are logs and other non-vital information. I am fine with
>> deleting the corrupted files. It is OK if recovery loses a few
>> gigabytes of data, but not the entire filesystem.
>>
>> Setup is a multi-disk btrfs filesystem, data single, metadata RAID-1
>> Mounted with:
>>
>> /dev/mapper/wdc3td on /data type btrfs
>> (rw,noatime,compress=lzo,space_cache,subvolid=5,subvol=/)
>>
>> I've read lots of online sources on the topic, but none of them explain
>> how to recover from the current state:
>>
>> https://btrfs.wiki.kernel.org/index.php/Btrfsck
>> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
>> https://wiki.archlinux.org/index.php/Identify_damaged_files#Find_damaged_files
> 
> 
> 





Re: How to recover from btrfs scrub errors? (uncorrectable errors, checksum error at logical)

2018-10-22 Thread Otto Kekäläinen
I never got a reply to this thread, but I am now replying to myself in
case somebody has the same issue and is reading the archive:

The problem went away after:
- deleted all snapshots as they seemed to slow down btrfs I/O so much
that simple commands like rm and rsync were unusable
- replaced the disk that had the corrupted file (just in case -
smartctl did not indicate any disk failures) with btrfs replace
- rsynced files from another location to this filesystem so that the
corrupted files got overwritten
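
For anyone repeating this, the rough command sequence was along these lines
(snapshot names, device names and the /data mount point are placeholders for
your own setup; btrfs replace runs against the mounted filesystem):

  # list and delete the old snapshots
  $ sudo btrfs subvolume list /data
  $ sudo btrfs subvolume delete /data/snapshots/NAME

  # replace the suspect disk while the filesystem stays mounted
  $ sudo btrfs replace start /dev/OLD /dev/NEW /data
  $ sudo btrfs replace status /data

  # overwrite the corrupted files from a known-good copy, then re-scrub
  $ rsync -a /path/to/good/copy/ /data/
  $ sudo btrfs scrub start -B /data
  $ sudo btrfs scrub status /data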

Now btrfs scrub does not find any corruption anymore and the
filesystem I/O speed is usable, though still slower than it used to be.

On Mon, Oct 15, 2018 at 10:50, Otto Kekäläinen (o...@seravo.fi) wrote:
>
> Hello!
>
> I am trying to figure out how to recover from errors detected by btrfs scrub.
>
> Scrub status reports:
>
> scrub status for 4f4479d5-648a-45b9-bcbf-978c766aeb41
> scrub started at Mon Oct 15 10:02:28 2018, running for 00:35:39
> total bytes scrubbed: 791.15GiB with 18 errors
> error details: csum=18
> corrected errors: 0, uncorrectable errors: 18, unverified errors: 0
>
> Kernel log contains lines like
>
>   BTRFS warning (device dm-8): checksum error at logical 7351706472448 on dev
>   /dev/mapper/disk6tb, sector 61412648, root 12725, inode 152358265,
> offset 483328:
>   path resolving failed with ret=-2
>
> I've tried so far:
> - deleting the files (when path is visible)
> - overwriting the files with new data
> - changed disk (with btrfs replace)
>
> The checksum errors however persist.
> How do I get rid of them?
>
>
> The files are logs and other non-vital information. I am fine with
> deleting the corrupted files. It is OK if recovery loses a few
> gigabytes of data, but not the entire filesystem.
>
> Setup is a multi-disk btrfs filesystem, data single, metadata RAID-1
> Mounted with:
>
> /dev/mapper/wdc3td on /data type btrfs
> (rw,noatime,compress=lzo,space_cache,subvolid=5,subvol=/)
>
> I've read lots of online sources on the topic, but none of them explain
> how to recover from the current state:
>
> https://btrfs.wiki.kernel.org/index.php/Btrfsck
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> https://wiki.archlinux.org/index.php/Identify_damaged_files#Find_damaged_files
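
As for the "path resolving failed with ret=-2" warnings quoted above: -2 is
-ENOENT, which usually means the file has already been deleted or lives in a
different subvolume (root 12725 here). The affected path can sometimes still
be resolved from the numbers in the warning, for example (using the /data
mount point from this setup; point inode-resolve at the subvolume that owns
the inode, and both commands may themselves return ENOENT if the file is
already gone):

  $ sudo btrfs inspect-internal logical-resolve 7351706472448 /data
  $ sudo btrfs inspect-internal inode-resolve 152358265 /data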



-- 
Otto Kekäläinen
CEO
Seravo
+358 44 566 2204

Follow me at @ottokekalainen