Re: [PATCH] Btrfs: remove no longer used stuff for tracking pending ordered extents

2018-10-26 Thread Liu Bo
On Fri, Oct 26, 2018 at 9:16 AM  wrote:
>
> From: Filipe Manana 
>
> Tracking pending ordered extents per transaction was introduced in commit
> 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current
> transaction V3") and later updated in commit 161c3549b45a ("Btrfs: change
> how we wait for pending ordered extents").
>
> However now that on fsync we always wait for ordered extents to complete
> before logging, done in commit 5636cf7d6dc8 ("btrfs: remove the logged
> extents infrastructure"), we no longer need the stuff to track for pending
> ordered extents, which was not completely removed in the mentioned commit.
> So remove the remaining of the pending ordered extents infrastructure.
>

Reviewed-by: Liu Bo 

thanks,
liubo

> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/ordered-data.c | 30 --
>  fs/btrfs/ordered-data.h |  2 --
>  fs/btrfs/transaction.c  | 11 ---
>  fs/btrfs/transaction.h  |  2 --
>  4 files changed, 45 deletions(-)
>
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 0c4ef208b8b9..6fde2b2741ef 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -460,7 +460,6 @@ void btrfs_remove_ordered_extent(struct inode *inode,
> struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
> struct btrfs_root *root = btrfs_inode->root;
> struct rb_node *node;
> -   bool dec_pending_ordered = false;
>
> /* This is paired with btrfs_add_ordered_extent. */
> spin_lock(&btrfs_inode->lock);
> @@ -477,37 +476,8 @@ void btrfs_remove_ordered_extent(struct inode *inode,
> if (tree->last == node)
> tree->last = NULL;
> set_bit(BTRFS_ORDERED_COMPLETE, &entry->flags);
> -   if (test_and_clear_bit(BTRFS_ORDERED_PENDING, &entry->flags))
> -   dec_pending_ordered = true;
> spin_unlock_irq(&tree->lock);
>
> -   /*
> -* The current running transaction is waiting on us, we need to let it
> -* know that we're complete and wake it up.
> -*/
> -   if (dec_pending_ordered) {
> -   struct btrfs_transaction *trans;
> -
> -   /*
> -* The checks for trans are just a formality, it should be 
> set,
> -* but if it isn't we don't want to deref/assert under the 
> spin
> -* lock, so be nice and check if trans is set, but ASSERT() so
> -* if it isn't set a developer will notice.
> -*/
> -   spin_lock(_info->trans_lock);
> -   trans = fs_info->running_transaction;
> -   if (trans)
> -   refcount_inc(>use_count);
> -   spin_unlock(_info->trans_lock);
> -
> -   ASSERT(trans);
> -   if (trans) {
> -   if (atomic_dec_and_test(>pending_ordered))
> -   wake_up(>pending_wait);
> -   btrfs_put_transaction(trans);
> -   }
> -   }
> -
> spin_lock(&root->ordered_extent_lock);
> list_del_init(&entry->root_extent_list);
> root->nr_ordered_extents--;
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 02d813aaa261..b10e6765d88f 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -56,8 +56,6 @@ struct btrfs_ordered_sum {
>* the isize. */
>  #define BTRFS_ORDERED_TRUNCATED 8 /* Set when we have to truncate an extent */
>
> -#define BTRFS_ORDERED_PENDING 9 /* We are waiting for this ordered extent to
> - * complete in the current transaction. */
>  #define BTRFS_ORDERED_REGULAR 10 /* Regular IO for COW */
>
>  struct btrfs_ordered_extent {
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 3b84f5015029..2fe6c2b1d94b 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -232,14 +232,12 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
> extwriter_counter_init(cur_trans, type);
> init_waitqueue_head(&cur_trans->writer_wait);
> init_waitqueue_head(&cur_trans->commit_wait);
> -   init_waitqueue_head(&cur_trans->pending_wait);
> cur_trans->state = TRANS_STATE_RUNNING;
> /*
>  * One for this trans handle, one so it will live on until we
>  * commit the transaction.
>  */
> refcount_set(&cur_trans->use_count, 2);
> -   atomic_set(&cur_trans->pending_ordered, 0);
> cur_trans->flags = 0;
> cur_trans->

Re: Btrfs resize seems to deadlock

2018-10-22 Thread Liu Bo
On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  wrote:
>
> On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> >
> > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson  
> > wrote:
> > >
> > > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > resize max $MOUNT", the command runs for a few minutes and then hangs
> > > forcing the system to be reset. I am not sure what the state of the
> > > filesystem really is at this point. Btrfs usage does report the
> > > correct size after resizing. Details below:
> > >
> >
> > Thanks for the report, the stack is helpful, but this needs some
> > deeper debugging; may I ask you to post "btrfs inspect-internal
> > dump-tree -t 1 /dev/your_btrfs_disk"?
>
> I believe it's actually easy to understand from the trace alone and
> it's kind of a bad luck scenario.
> I made this fix a few hours ago:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
>
> But haven't done full testing yet and might have missed something.
> Bo, can you take a look and let me know what you think?
>

The patch looks OK to me, it should fix the problem.

Since load_free_space_cache() is touching the root tree with
path->commit_root and skip_locking, I was wondering if we could just
provide the same approach for the free space cache inode's iget().
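
Roughly what I have in mind (just a sketch; the iget() plumbing is
hypothetical, the two path flags are the ones load_free_space_cache()
already sets):

	struct btrfs_path *path = btrfs_alloc_path();

	if (!path)
		return -ENOMEM;
	path->search_commit_root = 1;	/* read from the last committed root */
	path->skip_locking = 1;		/* no tree locks, so no deadlock via COW */
	/* ... then look up the free space inode item with this path ... */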

thanks,
liubo

> Thanks.
>
> >
> > So I'd like to know the height of your tree "1", which refers to the
> > root tree in btrfs.
> >
> > thanks,
> > liubo
> >
> > > $ sudo btrfs filesystem usage $MOUNT
> > > Overall:
> > > Device size:  90.96TiB
> > > Device allocated: 72.62TiB
> > > Device unallocated:   18.33TiB
> > > Device missing:  0.00B
> > > Used: 72.62TiB
> > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > Data ratio:   1.00
> > > Metadata ratio:   2.00
> > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > >
> > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > $MOUNT72.46TiB
> > >
> > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > $MOUNT   172.00GiB
> > >
> > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > >$MOUNT80.00MiB
> > >
> > > Unallocated:
> > > $MOUNT18.33TiB
> > >
> > > $ uname -a
> > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > btrfs-transacti D0  2501  2 0x8000
> > > Call Trace:
> > >  ? __schedule+0x253/0x860
> > >  schedule+0x28/0x80
> > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  transaction_kthread+0x155/0x170 [btrfs]
> > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > >  kthread+0x112/0x130
> > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > >  ret_from_fork+0x35/0x40
> > > btrfs   D0  2504   2502 0x0002
> > > Call Trace:
> > >  ? __schedule+0x253/0x860
> > >  schedule+0x28/0x80
> > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > >  ? inode_insert5+0x119/0x190
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_iget+0x113/0x690 [btrfs]
> > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  find_free_extent+0x799/0x1010 [btrfs]
> > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > >  btrfs_looku

Re: Btrfs resize seems to deadlock

2018-10-20 Thread Liu Bo
On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson  wrote:
>
> I am having an issue with btrfs resize in Fedora 28. I am attempting
> to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> resize max $MOUNT", the command runs for a few minutes and then hangs
> forcing the system to be reset. I am not sure what the state of the
> filesystem really is at this point. Btrfs usage does report the
> correct size after resizing. Details below:
>

Thanks for the report, the stack is helpful, but this needs some
deeper debugging; may I ask you to post "btrfs inspect-internal
dump-tree -t 1 /dev/your_btrfs_disk"?

So I'd like to know the height of your tree "1", which refers to the
root tree in btrfs.

thanks,
liubo

> $ sudo btrfs filesystem usage $MOUNT
> Overall:
> Device size:  90.96TiB
> Device allocated: 72.62TiB
> Device unallocated:   18.33TiB
> Device missing:  0.00B
> Used: 72.62TiB
> Free (estimated): 18.34TiB  (min: 9.17TiB)
> Data ratio:   1.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 24.11MiB)
>
> Data,single: Size:72.46TiB, Used:72.45TiB
> $MOUNT72.46TiB
>
> Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> $MOUNT   172.00GiB
>
> System,DUP: Size:40.00MiB, Used:7.53MiB
>$MOUNT80.00MiB
>
> Unallocated:
> $MOUNT18.33TiB
>
> $ uname -a
> Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> btrfs-transacti D0  2501  2 0x8000
> Call Trace:
>  ? __schedule+0x253/0x860
>  schedule+0x28/0x80
>  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  ? join_transaction+0x22/0x3e0 [btrfs]
>  ? finish_wait+0x80/0x80
>  transaction_kthread+0x155/0x170 [btrfs]
>  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
>  kthread+0x112/0x130
>  ? kthread_create_worker_on_cpu+0x70/0x70
>  ret_from_fork+0x35/0x40
> btrfs   D0  2504   2502 0x0002
> Call Trace:
>  ? __schedule+0x253/0x860
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
>  ? btrfs_release_path+0x13/0x80 [btrfs]
>  ? btrfs_update_device+0x8d/0x1c0 [btrfs]
>  btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs]
>  btrfs_ioctl+0xa25/0x2cf0 [btrfs]
>  ? tty_write+0x1fc/0x330
>  ? do_vfs_ioctl+0xa4/0x620
>  do_vfs_ioctl+0xa4/0x620
>  ksys_ioctl+0x60/0x90
>  ? ksys_write+0x4f/0xb0
>  __x64_sys_ioctl+0x16/0x20
>  do_syscall_64+0x5b/0x160
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fcdc0d78c57
> Code: Bad RIP value.
> RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 7ffdd1ee888a RCX: 7fcdc0d78c57
> RDX: 7ffdd1ee6da0 RSI: 50009403 RDI: 0003
> RBP: 7ffdd1ee6da0 R08:  R09: 7ffdd1ee67e0
> R10:  R11: 0246 R12: 7ffdd1ee888e
> R13: 0003 R14:  R15: 
> kworker/u48:1   D0  2505  2 0x8000
> Workqueue: btrfs-freespace-write btrfs_freespace_write_helper [btrfs]
> Call Trace:
>  ? __schedule+0x253/0x860
>  schedule+0x28/0x80
>  btrfs_tree_lock+0x130/0x210 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_search_slot+0x775/0x9f0 [btrfs]
>  btrfs_mark_extent_written+0xb0/0xb20 [btrfs]
>  ? join_transaction+0x22/0x3e0 [btrfs]
>  btrfs_finish_ordered_io+0x2e0/0x7e0 [btrfs]
>  ? syscall_return_via_sysret+0x13/0x83
>  ? __switch_to_asm+0x34/0x70
>  normal_work_helper+0xd3/0x2f0 [btrfs]
>  process_one_work+0x1a1/0x350
>  worker_thread+0x30/0x380
>  ? pwq_unbound_release_workfn+0xd0/0xd0
>  kthread+0x112/0x130
>  ? kthread_create_worker_on_cpu+0x70/0x70
>  ret_from_fork+0x35/0x40


Re: [PATCH v2] Btrfs: fix null pointer dereference on compressed write path error

2018-10-12 Thread Liu Bo
On Fri, Oct 12, 2018 at 4:38 PM  wrote:
>
> From: Filipe Manana 
>
> At inode.c:compress_file_range(), under the "free_pages_out" label, we can
> end up dereferencing the "pages" pointer when it has a NULL value. This
> case happens when "start" has a value of 0 and we fail to allocate memory
> for the "pages" pointer. When that happens we jump to the "cont" label and
> then enter the "if (start == 0)" branch where we immediately call the
> cow_file_range_inline() function. If that function returns 0 (success
> creating an inline extent) or an error (like -ENOMEM for example) we jump
> to the "free_pages_out" label and then access "pages[i]" leading to a NULL
> pointer dereference, since "nr_pages" has a value greater than zero at
> that point.
>
> Fix this by setting "nr_pages" to 0 when we fail to allocate memory for
> the "pages" pointer.
>

Looks good.

Reviewed-by: Liu Bo 

thanks,
liubo

> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119
> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
> Signed-off-by: Filipe Manana 
> ---
>
> V2: Updated changelog.
>
>  fs/btrfs/inode.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 66c6c4103d2f..d6b61b1facdd 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -503,6 +503,7 @@ static noinline void compress_file_range(struct inode *inode,
> pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> if (!pages) {
> /* just bail out to the uncompressed code */
> +   nr_pages = 0;
> goto cont;
> }
>
> --
> 2.11.0
>


Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-09-20 Thread Liu Bo
On Thu, Sep 20, 2018 at 06:49:59PM +0200, David Sterba wrote:
> On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > Several structs in btrfs are using rb_first() in a while loop, it'd be
> > more efficient to do this with rb_first_cached() which has the O(1)
> > complexity.
> > 
> > This patch set updates five structs which may have a large rb tree in
> > practice
> > 
> > Liu Bo (5):
> >   Btrfs: href_root: use rb_first_cached
> >   Btrfs: href->ref_tree: use rb_first_cached
> >   Btrfs: delayed_items: use rb_first_cached
> >   Btrfs: extent_map: use rb_first_cached
> >   Btrfs: preftree: use rb_first_cached
> 
> For the record, patches have been merged to misc-next. I've changed the
> subject titles to
> 
> Btrfs: delayed-refs: use rb_first_cached for href_root
> Btrfs: delayed-refs: use rb_first_cached for ref_tree
> Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root
> Btrfs: extent_map: use rb_first_cached
> Btrfs: preftree: use rb_first_cached
> 
> added Holger's tested-by and my reviewed-by.

Thanks so much for the efforts.

thanks,
-liubo


Re: btrfs panic problem

2018-09-19 Thread Liu Bo
On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang  wrote:
> Hi All,
>
> My OS (4.1.12) panics in kmem_cache_alloc, which is called by
> btrfs_get_or_create_delayed_node.
>
> I found that the freelist of the slub is wrong.
>
> crash> struct kmem_cache_cpu 887e7d7a24b0
>
> struct kmem_cache_cpu {
>   freelist = 0x2026,   <<< the value is id of one inode
>   tid = 29567861,
>   page = 0xea0132168d00,
>   partial = 0x0
> }
>
> And I found there are two different btrfs inodes pointing to the same
> delayed_node. It means that the same slub object is used twice.
>
> I think this slub object was freed twice, and then the next pointer of
> this object pointed to itself, so we got the same object twice.
>
> When this object is used again, that breaks the freelist.
>
> The following code can make the delayed node be freed twice, but I
> haven't found out what the exact process is.
>
> Process A (btrfs_evict_inode)               Process B (btrfs_get_delayed_node)
>
> call btrfs_remove_delayed_node              call btrfs_get_delayed_node
>
>                                             node = ACCESS_ONCE(btrfs_inode->delayed_node);
>
> BTRFS_I(inode)->delayed_node = NULL;
> btrfs_release_delayed_node(delayed_node);
>
>                                             if (node) {
>                                                     atomic_inc(&node->refs);
>                                                     return node;
>                                             }
>
>                                             ..
>
>                                             btrfs_release_delayed_node(delayed_node);
>

By looking at the race, it seems the following commit has addressed it.

btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4
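
A rough sketch of the idea in that commit (not the exact upstream diff,
which also converts ->refs to a refcount_t): only hand out the node if
we can prove its refcount is still non-zero, so a node whose last
reference is being dropped can never be returned again.

	static struct btrfs_delayed_node *
	get_delayed_node_sketch(struct btrfs_inode *btrfs_inode)
	{
		struct btrfs_delayed_node *node;

		node = READ_ONCE(btrfs_inode->delayed_node);
		if (node && refcount_inc_not_zero(&node->refs))
			return node;	/* we now own a reference */

		return NULL;	/* node is gone or about to be freed */
	}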

thanks,
liubo


>
> 1313 void btrfs_remove_delayed_node(struct inode *inode)
> 1314 {
> 1315 struct btrfs_delayed_node *delayed_node;
> 1316
> 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
> 1318 if (!delayed_node)
> 1319 return;
> 1320
> 1321 BTRFS_I(inode)->delayed_node = NULL;
> 1322 btrfs_release_delayed_node(delayed_node);
> 1323 }
>
>
>   87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode
> *inode)
>   88 {
>   89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>   90 struct btrfs_root *root = btrfs_inode->root;
>   91 u64 ino = btrfs_ino(inode);
>   92 struct btrfs_delayed_node *node;
>   93
>   94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>   95 if (node) {
>   96 atomic_inc(&node->refs);
>   97 return node;
>   98 }
>
>
> Thanks,
>
> Sunny
>
>
> PS:
>
> 
>
> panic informations
>
> PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
>  #0 [88130404f940] machine_kexec at 8105ec10
>  #1 [88130404f9b0] crash_kexec at 811145b8
>  #2 [88130404fa80] oops_end at 8101a868
>  #3 [88130404fab0] no_context at 8106ea91
>  #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
>  #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
>  #6 [88130404fb60] __do_page_fault at 8106f328
>  #7 [88130404fbd0] do_page_fault at 8106f637
>  #8 [88130404fc10] page_fault at 816f6308
> [exception RIP: kmem_cache_alloc+121]
> RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
> RAX:   RBX:   RCX: 01c32b76
> RDX: 01c32b75  RSI:   RDI: 000224b0
> RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
> R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
> R13: 2026  R14: 887e3e230a00  R15: a01abf49
> ORIG_RAX:   CS: 0010  SS: 0018
>  #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49
> [btrfs]
> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
> [btrfs]
> #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
> #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
> #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
> #14 [88130404fe50] touch_atime at 812286d3
> #15 [88130404fe90] iterate_dir at 81221929
> #16 [88130404fee0] sys_getdents64 at 81221a19
> #17 [88130404ff50] system_call_fastpath at 816f2594
> RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
> RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
> RDX: 1000  RSI: 00c83da14000  RDI: 0011
> RBP:    R8:    R9: 
> R10:   R11: 0246  R12: 00c7
> R13: 02174e74  R14: 0555  R15: 0038
> ORIG_RAX: 00d9  CS: 0033  SS: 002b
>
>
> We also find the list double add informations, including n_list and p_list:
>
> [8642921.110568] [ cut here ]
> [8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33
> __list_add+0xbe/0xd0()
> [8642921.263780] list_add 

[PATCH v2] Btrfs: remove wait_ordered_range in btrfs_evict_inode

2018-09-14 Thread Liu Bo
When we delete an inode,

btrfs_evict_inode() {
    truncate_inode_pages_final()
        truncate_inode_pages_range()
            lock_page()
            truncate_cleanup_page()
                btrfs_invalidatepage()
                    wait_on_page_writeback
                    btrfs_lookup_ordered_range()
                cancel_dirty_page()
            unlock_page()
    ...
    btrfs_wait_ordered_range()
    ...

As the VFS has called ->invalidatepage() to get all ordered extents done
(if there are any) and truncated all page cache pages (no dirty pages
to write back after this step), wait_ordered_range() is just a noop.

Reviewed-by: David Sterba 
Signed-off-by: Liu Bo 
---
v2: More details in the description.

 fs/btrfs/inode.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ff1d2ed2dc94..d3febc3a6bc0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5390,9 +5390,6 @@ void btrfs_evict_inode(struct inode *inode)
 
if (is_bad_inode(inode))
goto no_delete;
-   /* do we really want it for ->i_nlink > 0 and zero btrfs_root_refs? */
-   if (!special_file(inode->i_mode))
-   btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
btrfs_free_io_failure_record(BTRFS_I(inode), 0, (u64)-1);
 
-- 
1.8.3.1



Re: [PATCH] Btrfs: remove wait_ordered_range in btrfs_evict_inode

2018-09-14 Thread Liu Bo
On Fri, Sep 14, 2018 at 03:14:33PM +0200, David Sterba wrote:
> On Wed, Sep 12, 2018 at 06:06:22AM +0800, Liu Bo wrote:
> > As VFS has called ->invalidatepage() to get all ordered extents done
> > and truncated all page cache pages, wait_ordered_range() is just a
> > noop.
> 
> Agreed, though looking up the exact points when there are no pages to be
> waited for took me some time. More references and hints in the changelog
> would be good. I'll add the patch to misc-next so it can be tested, but
> please send me an update if you have idea how to rephrase the changelog.
> Thanks.
> 
> Reviewed-by: David Sterba 

OK, I'll add the call stack; hopefully that makes it clearer.

thanks,
-liubo


Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-09-14 Thread Liu Bo
On Fri, Sep 14, 2018 at 05:11:02PM +0200, David Sterba wrote:
> On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote:
> > On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> > > On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > > > Several structs in btrfs are using rb_first() in a while loop, it'd be
> > > > more efficient to do this with rb_first_cached() which has the O(1)
> > > > complexity.
> > > 
> > > We'd like to see some numbers, though just by reading the code I think
> > > there's going to be a noticeable improvement for some workloads.
> > > 
> > > There's a common pattern:
> > > 
> > > while(node = rb_first) {
> > >   entry = rb_entry(node)
> > >   next = rb_next(node)
> > >   rb_erase(node)
> > >   cleanup(entry)
> > > }
> > > 
> > > rb_first needs to traverse the tree up to logN depth, rb_erase can
> > > completely reshuffle the tree. With the caching we'll skip the traversal
> > > in rb_first. That's a cached memory access vs looped pointer
> > > dereference trade-off that IMHO has a clear winner.
> > > 
> > > As the pattern in many cases traverses the whole tree and removes all
> > > entries, ideally we could get rid of the rebalancing too. The entries
> > > would be cleaned up in postorder way, ie. reset the data and kfree in
> > > the end. This requires that order of the freeing does not matter, which
> > > might not apply to some trees.
> > 
> > The order of freeing does not matter, but we need the tree to return
> > the correct next node to free, IOW, we have to maintain the rbtree
> > until the last two nodes.
> > 
> > > 
> > > I did not find suitable rbtree functions or helpers for that,
> > > rbtree_postorder_for_each_entry_safe does not allow rb_erase and
> > > rb_erase itself forces the rebalancing internally. But I think we can
> > > implement that.
> > > 
> > > We could possibly use rb_next_postorder that makes sure that all
> > > children are traversed before the parent so this is at least something
> > > that could considerd.
> > > 
> > 
> > Qu was inquiring about the same thing; I haven't got time to dig into
> > it further.
> > 
> > The question we need to answer is whether we care about how fast the
> > destruction can be, as some are in critical paths while some are
> > not.
> 
> Yeah, I take __btrfs_return_cluster_to_free_space as an example where
> the more efficient tree destruction would shorten the time where a lock
> is held.
> 
> In other cases it might complicate the code too much, though the
> performance is not critical, eg. at umount time. Although, it'd be good
> to optimize that too now that we know how.
> 
> And in the remaining cases it's not possible to avoid rb_erase. I had a
> closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq.
> Based on that I'd like to add this series to for-next, the first node
> caching is clear enough and safe. We can do the other optimizations
> later.

This needs more fine-grained debugging to see where the cost lies.

About the perf numbers for rb_first_cached, I measured the time spent in
the extent map loop in btrfs_evict_inode():

evict_inode_truncate_pages()
    while (!RB_EMPTY_ROOT(&map_tree->map)) {
        rb_first();
        rb_erase();
    }


for an em tree containing 10,000 extent maps,
- with rb_first_cached, the loop takes 4880454 ns,
- with rb_first, the loop takes 4584148 ns,

it looks like the difference is not that much, ~6%.  After all it's about
scalability: the bigger the rb tree gets, the more rb_first_cached() wins.
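
The loop I measured corresponds roughly to this with the cached variant
(a sketch, not the exact btrfs code): rb_root_cached keeps a pointer to
the leftmost node, so rb_first_cached() is O(1) and only
rb_erase_cached() pays for the rebalancing.

	static void drain_em_tree(struct rb_root_cached *root)
	{
		struct rb_node *node;

		while ((node = rb_first_cached(root)) != NULL) {
			struct extent_map *em;

			em = rb_entry(node, struct extent_map, rb_node);
			/* rebalances and updates the cached leftmost node */
			rb_erase_cached(node, root);
			free_extent_map(em);
		}
	}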

thanks,
-liubo


[PATCH v2] Btrfs: remove level==0 check in balance_level

2018-09-13 Thread Liu Bo
btrfs_search_slot()
    if (level != 0)
        setup_nodes_for_search()
            balance_level()

It is just impossible to have level=0 in balance_level.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Liu Bo 
---
v2: add assertion for level just in case someone breaks the rule in the
future.

 fs/btrfs/ctree.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 8b31caa60b0a..ada44c786f2e 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1778,8 +1778,7 @@ static noinline int balance_level(struct 
btrfs_trans_handle *trans,
int orig_slot = path->slots[level];
u64 orig_ptr;
 
-   if (level == 0)
-   return 0;
+   ASSERT(level > 0);
 
mid = path->nodes[level];
 
-- 
1.8.3.1



[PATCH v2] Btrfs: assert page dirty bit

2018-09-13 Thread Liu Bo
Just in case someone breaks the rule that pages are dirty as long
as the eb is dirty.

Signed-off-by: Liu Bo 
---
v2: fix typo of CONFIG_BTRFS_DEBUG

 fs/btrfs/extent_io.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fb2bf50134a1..f88231171009 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5184,6 +5184,11 @@ bool set_extent_buffer_dirty(struct extent_buffer *eb)
set_page_dirty(eb->pages[i]);
}
 
+#ifdef CONFIG_BTRFS_DEBUG
+   for (i = 0; i < num_pages; i++)
+   ASSERT(PageDirty(eb->pages[i]));
+#endif
+
return was_dirty;
 }
 
-- 
1.8.3.1



[PATCH v2] Btrfs: skip set_page_dirty if eb is dirty

2018-09-13 Thread Liu Bo
As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
are dirty, so no need to set pages dirty again.

Ftrace showed that the loop took 10us on my dev box, so removing this
can save us at least 10us if eb is already dirty.

Signed-off-by: Liu Bo 
---
v2: Update the patch's description.

 fs/btrfs/extent_io.c | 11 +++
 fs/btrfs/extent_io.h |  2 +-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 628f1aef34b0..fb2bf50134a1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5165,11 +5165,11 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
WARN_ON(atomic_read(&eb->refs) == 0);
 }
 
-int set_extent_buffer_dirty(struct extent_buffer *eb)
+bool set_extent_buffer_dirty(struct extent_buffer *eb)
 {
int i;
int num_pages;
-   int was_dirty = 0;
+   bool was_dirty;
 
check_buffer_tree_ref(eb);
 
@@ -5179,8 +5179,11 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
WARN_ON(atomic_read(&eb->refs) == 0);
WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
 
-   for (i = 0; i < num_pages; i++)
-   set_page_dirty(eb->pages[i]);
+   if (!was_dirty) {
+   for (i = 0; i < num_pages; i++)
+   set_page_dirty(eb->pages[i]);
+   }
+
return was_dirty;
 }
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b4d03e677e1d..f2ab42d57f02 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -479,7 +479,7 @@ void extent_buffer_bitmap_set(struct extent_buffer *eb, 
unsigned long start,
 void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
unsigned long pos, unsigned long len);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
-int set_extent_buffer_dirty(struct extent_buffer *eb);
+bool set_extent_buffer_dirty(struct extent_buffer *eb);
 void set_extent_buffer_uptodate(struct extent_buffer *eb);
 void clear_extent_buffer_uptodate(struct extent_buffer *eb);
 int extent_buffer_under_io(struct extent_buffer *eb);
-- 
1.8.3.1



Re: [PATCH] Btrfs: skip set_page_dirty if eb is dirty

2018-09-13 Thread Liu Bo
On Thu, Sep 13, 2018 at 01:29:59PM +0200, David Sterba wrote:
> On Wed, Sep 12, 2018 at 12:27:46PM -0700, Liu Bo wrote:
> > On Wed, Sep 12, 2018 at 09:37:20AM +0300, Nikolay Borisov wrote:
> > > 
> > > 
> > > On 12.09.2018 01:06, Liu Bo wrote:
> > > > As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
> > > > are dirty, so no need to set pages dirty again.
> > > > 
> > > > Signed-off-by: Liu Bo 
> > > 
> > > Does make it any performance difference, numbers?
> > >
> > 
> > To be honest, the performance difference would be trivial in a normal
> > big test round.  But I just looked into the difference from my ftrace,
> > removing the loop can reduce the time spent by 10us in my box.
> 
> 10us was for the case where the pages were dirty already and the for
> cycle was then skipped?
> 
> set_page_dirty is not lightweight, calls down to various functions and
> holds locks. I can't tell if this is still fast enough on your machine
> so that it really takes 10us.
>

Sorry for not making it clear: on my box the 10us was spent in the 'for'
loop, which iterates over 4 pages and makes 4 calls to set_page_dirty().

thanks,
-liubo

> The conditional dirtying is definitely worth it though,
> 
> Reviewed-by: David Sterba 


Re: [PATCH] Btrfs: remove level==0 check in balance_level

2018-09-12 Thread Liu Bo
On Wed, Sep 12, 2018 at 02:52:38PM +0200, David Sterba wrote:
> On Wed, Sep 12, 2018 at 06:06:23AM +0800, Liu Bo wrote:
> > btrfs_search_slot()
> >if (level != 0)
> >   setup_nodes_for_search()
> >   balance_level()
> > 
> > It is just impossible to have level=0 in balance_level.
> 
> While this is true, what do you think about adding ASSERT(level > 0) ?
> This is to catch accidentally passing level 0.

Sounds good, will update it.

thanks,
-liubo


Re: [PATCH] Btrfs: assert page dirty bit

2018-09-12 Thread Liu Bo
On Wed, Sep 12, 2018 at 09:38:49AM +0300, Nikolay Borisov wrote:
> 
> 
> On 12.09.2018 01:06, Liu Bo wrote:
> > Just in case that someone breaks the rule that pages are dirty as long
> > as eb is dirty.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/extent_io.c | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index fb2bf50134a1..99895f196ecb 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -5184,6 +5184,11 @@ bool set_extent_buffer_dirty(struct extent_buffer 
> > *eb)
> > set_page_dirty(eb->pages[i]);
> > }
> >  
> > +#ifdef BTRFS_DEBUG
> 
> And this will never be compiled since the actual ifdef name is
> "CONFIG_BTRFS_DEBUG"
>

Oops, will fix it.

thanks,
-liubo

> > +   for (i = 0; i < num_pages; i++)
> > +   ASSERT(PageDirty(eb->pages[i]));
> > +#endif
> > +
> > return was_dirty;
> >  }
> >  
> > 


Re: [PATCH] Btrfs: skip set_page_dirty if eb is dirty

2018-09-12 Thread Liu Bo
On Wed, Sep 12, 2018 at 09:37:20AM +0300, Nikolay Borisov wrote:
> 
> 
> On 12.09.2018 01:06, Liu Bo wrote:
> > As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
> > are dirty, so no need to set pages dirty again.
> > 
> > Signed-off-by: Liu Bo 
> 
> Does make it any performance difference, numbers?
>

To be honest, the performance difference would be trivial in a normal
big test round.  But I just looked into the difference with ftrace;
removing the loop can reduce the time spent by 10us on my box.

> Reviewed-by: Nikolay Borisov 

Thanks a lot for reviewing.

thanks,
-liubo
> > ---
> >  fs/btrfs/extent_io.c | 11 +++
> >  fs/btrfs/extent_io.h |  2 +-
> >  2 files changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 628f1aef34b0..fb2bf50134a1 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -5165,11 +5165,11 @@ void clear_extent_buffer_dirty(struct extent_buffer 
> > *eb)
> > WARN_ON(atomic_read(&eb->refs) == 0);
> >  }
> >  
> > -int set_extent_buffer_dirty(struct extent_buffer *eb)
> > +bool set_extent_buffer_dirty(struct extent_buffer *eb)
> >  {
> > int i;
> > int num_pages;
> > -   int was_dirty = 0;
> > +   bool was_dirty;
> >  
> > check_buffer_tree_ref(eb);
> >  
> > @@ -5179,8 +5179,11 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
> > WARN_ON(atomic_read(&eb->refs) == 0);
> > WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
> >  
> > -   for (i = 0; i < num_pages; i++)
> > -   set_page_dirty(eb->pages[i]);
> > +   if (!was_dirty) {
> > +   for (i = 0; i < num_pages; i++)
> > +   set_page_dirty(eb->pages[i]);
> > +   }
> > +
> > return was_dirty;
> >  }
> >  
> > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> > index b4d03e677e1d..f2ab42d57f02 100644
> > --- a/fs/btrfs/extent_io.h
> > +++ b/fs/btrfs/extent_io.h
> > @@ -479,7 +479,7 @@ void extent_buffer_bitmap_set(struct extent_buffer *eb, 
> > unsigned long start,
> >  void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long 
> > start,
> > unsigned long pos, unsigned long len);
> >  void clear_extent_buffer_dirty(struct extent_buffer *eb);
> > -int set_extent_buffer_dirty(struct extent_buffer *eb);
> > +bool set_extent_buffer_dirty(struct extent_buffer *eb);
> >  void set_extent_buffer_uptodate(struct extent_buffer *eb);
> >  void clear_extent_buffer_uptodate(struct extent_buffer *eb);
> >  int extent_buffer_under_io(struct extent_buffer *eb);
> > 


[PATCH] Btrfs: assert page dirty bit

2018-09-11 Thread Liu Bo
Just in case someone breaks the rule that pages are dirty as long
as the eb is dirty.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fb2bf50134a1..99895f196ecb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5184,6 +5184,11 @@ bool set_extent_buffer_dirty(struct extent_buffer *eb)
set_page_dirty(eb->pages[i]);
}
 
+#ifdef BTRFS_DEBUG
+   for (i = 0; i < num_pages; i++)
+   ASSERT(PageDirty(eb->pages[i]));
+#endif
+
return was_dirty;
 }
 
-- 
1.8.3.1



[PATCH] Btrfs: remove level==0 check in balance_level

2018-09-11 Thread Liu Bo
btrfs_search_slot()
    if (level != 0)
        setup_nodes_for_search()
            balance_level()

It is just impossible to have level=0 in balance_level.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 8b31caa60b0a..858085490e23 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1778,9 +1778,6 @@ static noinline int balance_level(struct 
btrfs_trans_handle *trans,
int orig_slot = path->slots[level];
u64 orig_ptr;
 
-   if (level == 0)
-   return 0;
-
mid = path->nodes[level];
 
WARN_ON(path->locks[level] != BTRFS_WRITE_LOCK &&
-- 
1.8.3.1



[PATCH] Btrfs: unify error handling of btrfs_lookup_dir_item

2018-09-11 Thread Liu Bo
Unify the error handling part with IS_ERR_OR_NULL.

No functional changes.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 21 +
 fs/btrfs/send.c  |  8 ++--
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fd64d7ac76f9..99ab0203b701 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3925,12 +3925,8 @@ static int __btrfs_unlink_inode(struct 
btrfs_trans_handle *trans,
path->leave_spinning = 1;
di = btrfs_lookup_dir_item(trans, root, path, dir_ino,
name, name_len, -1);
-   if (IS_ERR(di)) {
-   ret = PTR_ERR(di);
-   goto err;
-   }
-   if (!di) {
-   ret = -ENOENT;
+   if (IS_ERR_OR_NULL(di)) {
+   ret = di ? PTR_ERR(di) : -ENOENT;
goto err;
}
ret = btrfs_delete_one_dir_name(trans, root, path, di);
@@ -4088,10 +4084,7 @@ static int btrfs_unlink_subvol(struct btrfs_trans_handle 
*trans,
di = btrfs_lookup_dir_item(trans, root, path, dir_ino,
   name, name_len, -1);
if (IS_ERR_OR_NULL(di)) {
-   if (!di)
-   ret = -ENOENT;
-   else
-   ret = PTR_ERR(di);
+   ret = di ? PTR_ERR(di) : -ENOENT;
goto out;
}
 
@@ -5481,12 +5474,8 @@ static int btrfs_inode_by_name(struct inode *dir, struct 
dentry *dentry,
 
di = btrfs_lookup_dir_item(NULL, root, path, btrfs_ino(BTRFS_I(dir)),
name, namelen, 0);
-   if (!di) {
-   ret = -ENOENT;
-   goto out;
-   }
-   if (IS_ERR(di)) {
-   ret = PTR_ERR(di);
+   if (IS_ERR_OR_NULL(di)) {
+   ret = di ? PTR_ERR(di) : -ENOENT;
goto out;
}
 
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index ba8950bfd9c7..b8c83778f0f7 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1693,12 +1693,8 @@ static int lookup_dir_item_inode(struct btrfs_root *root,
 
di = btrfs_lookup_dir_item(NULL, root, path,
dir, name, name_len, 0);
-   if (!di) {
-   ret = -ENOENT;
-   goto out;
-   }
-   if (IS_ERR(di)) {
-   ret = PTR_ERR(di);
+   if (IS_ERR_OR_NULL(di)) {
+   ret = di ? PTR_ERR(di) : -ENOENT;
goto out;
}
btrfs_dir_item_key_to_cpu(path->nodes[0], di, &key);
-- 
1.8.3.1



[PATCH] Btrfs: remove wait_ordered_range in btrfs_evict_inode

2018-09-11 Thread Liu Bo
As VFS has called ->invalidatepage() to get all ordered extents done
and truncated all page cache pages, wait_ordered_range() is just a
noop.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ff1d2ed2dc94..d3febc3a6bc0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5390,9 +5390,6 @@ void btrfs_evict_inode(struct inode *inode)
 
if (is_bad_inode(inode))
goto no_delete;
-   /* do we really want it for ->i_nlink > 0 and zero btrfs_root_refs? */
-   if (!special_file(inode->i_mode))
-   btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
btrfs_free_io_failure_record(BTRFS_I(inode), 0, (u64)-1);
 
-- 
1.8.3.1



[PATCH] Btrfs: remove unused code in __btrfs_unlink_inode

2018-09-11 Thread Liu Bo
The code gets @leaf and @key in order to do some sanity checks on the
key's fields, but since we don't check them any more, it's fine to
remove it.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d3febc3a6bc0..fd64d7ac76f9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3911,9 +3911,7 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle 
*trans,
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_path *path;
int ret = 0;
-   struct extent_buffer *leaf;
struct btrfs_dir_item *di;
-   struct btrfs_key key;
u64 index;
u64 ino = btrfs_ino(inode);
u64 dir_ino = btrfs_ino(dir);
@@ -3935,8 +3933,6 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle 
*trans,
ret = -ENOENT;
goto err;
}
-   leaf = path->nodes[0];
-   btrfs_dir_item_key_to_cpu(leaf, di, &key);
ret = btrfs_delete_one_dir_name(trans, root, path, di);
if (ret)
goto err;
-- 
1.8.3.1



[PATCH] Btrfs: do not wait after queue async work for delayed refs

2018-09-11 Thread Liu Bo
If metadata space is hungry, how fast flush_space() can run determines
the latency we spend in reserve_metadata_space().

flush_space()
    case FLUSH_DELAYED_ITEMS:
        ...
        btrfs_end_transaction()
    case ALLOC_CHUNK:
        ...
        btrfs_end_transaction()

btrfs_end_transaction()
    btrfs_async_run_delayed_refs()
        // queue a work to process delayed refs
        ...
        if (wait)
            wait_for_completion()

Although processing delayed refs can add to pinned bytes, pinned bytes
can only be used after committing the transaction, so waiting for the
async work in flush_space() doesn't help.

In fact we don't have to wait for the asynchronous delayed ref
processing, as delayed refs do not block subsequent operations (unless
we're committing the transaction and need to make sure the filesystem is
consistent).

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h   |  2 +-
 fs/btrfs/extent-tree.c | 18 ++
 fs/btrfs/inode.c   |  2 +-
 fs/btrfs/transaction.c |  3 +--
 4 files changed, 5 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 00e506de70ba..4c3a733ee4cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2620,7 +2620,7 @@ void btrfs_dec_block_group_reservations(struct 
btrfs_fs_info *fs_info,
 int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
   unsigned long count);
 int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
-unsigned long count, u64 transid, int wait);
+unsigned long count, u64 transid);
 int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 
len);
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 bytenr,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b6767f9031b5..cac9a9d04d0c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2852,14 +2852,11 @@ static void delayed_ref_async_start(struct btrfs_work 
*work)
if (ret && !async->error)
async->error = ret;
 done:
-   if (async->sync)
-   complete(&async->wait);
-   else
-   kfree(async);
+   kfree(async);
 }
 
 int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
-unsigned long count, u64 transid, int wait)
+unsigned long count, u64 transid)
 {
struct async_delayed_refs *async;
int ret;
@@ -2872,23 +2869,12 @@ int btrfs_async_run_delayed_refs(struct btrfs_fs_info 
*fs_info,
async->count = count;
async->error = 0;
async->transid = transid;
-   if (wait)
-   async->sync = 1;
-   else
-   async->sync = 0;
-   init_completion(&async->wait);

btrfs_init_work(&async->work, btrfs_extent_refs_helper,
delayed_ref_async_start, NULL, NULL);

btrfs_queue_work(fs_info->extent_workers, &async->work);

-   if (wait) {
-   wait_for_completion(&async->wait);
-   ret = async->error;
-   kfree(async);
-   return ret;
-   }
return 0;
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 99ab0203b701..fd4af54f0d3b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4736,7 +4736,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
if (btrfs_should_throttle_delayed_refs(trans, fs_info))
btrfs_async_run_delayed_refs(fs_info,
trans->delayed_ref_updates * 2,
-   trans->transid, 0);
+   trans->transid);
if (be_nice) {
if (truncate_space_check(trans, root,
 extent_num_bytes)) {
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 772963a61072..a816c0999fb2 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -889,8 +889,7 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
 
kmem_cache_free(btrfs_trans_handle_cachep, trans);
if (must_run_delayed_refs) {
-   btrfs_async_run_delayed_refs(info, cur, transid,
-must_run_delayed_refs == 1);
+   btrfs_async_run_delayed_refs(info, cur, transid);
}
return err;
 }
-- 
1.8.3.1



[PATCH] Btrfs: skip set_page_dirty if eb is dirty

2018-09-11 Thread Liu Bo
As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
are dirty, so no need to set pages dirty again.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 11 +++
 fs/btrfs/extent_io.h |  2 +-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 628f1aef34b0..fb2bf50134a1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5165,11 +5165,11 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
WARN_ON(atomic_read(&eb->refs) == 0);
 }
 
-int set_extent_buffer_dirty(struct extent_buffer *eb)
+bool set_extent_buffer_dirty(struct extent_buffer *eb)
 {
int i;
int num_pages;
-   int was_dirty = 0;
+   bool was_dirty;
 
check_buffer_tree_ref(eb);
 
@@ -5179,8 +5179,11 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
WARN_ON(atomic_read(&eb->refs) == 0);
WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
 
-   for (i = 0; i < num_pages; i++)
-   set_page_dirty(eb->pages[i]);
+   if (!was_dirty) {
+   for (i = 0; i < num_pages; i++)
+   set_page_dirty(eb->pages[i]);
+   }
+
return was_dirty;
 }
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b4d03e677e1d..f2ab42d57f02 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -479,7 +479,7 @@ void extent_buffer_bitmap_set(struct extent_buffer *eb, 
unsigned long start,
 void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
unsigned long pos, unsigned long len);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
-int set_extent_buffer_dirty(struct extent_buffer *eb);
+bool set_extent_buffer_dirty(struct extent_buffer *eb);
 void set_extent_buffer_uptodate(struct extent_buffer *eb);
 void clear_extent_buffer_uptodate(struct extent_buffer *eb);
 int extent_buffer_under_io(struct extent_buffer *eb);
-- 
1.8.3.1



[PATCH] Btrfs: skip setting path to blocking mode if balance is not needed

2018-09-11 Thread Liu Bo
balance_level() may return early in some cases, but these checks don't
have to be done with a blocking write lock.

This gathers these checks into a helper; the benefit is to avoid
switching spinning locks to blocking locks (in these particular
cases), which slows down btrfs overall.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.c | 41 ++---
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 858085490e23..ba267a069ca1 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1758,6 +1758,29 @@ static void root_sub_used(struct btrfs_root *root, u32 
size)
return eb;
 }
 
+static bool need_balance_level(struct btrfs_fs_info *fs_info,
+ struct btrfs_trans_handle *trans,
+ struct btrfs_path *path, int level)
+{
+   struct extent_buffer *mid;
+
+   mid = path->nodes[level];
+
+   WARN_ON(path->locks[level] != BTRFS_WRITE_LOCK &&
+   path->locks[level] != BTRFS_WRITE_LOCK_BLOCKING);
+   WARN_ON(btrfs_header_generation(mid) != trans->transid);
+
+   /* If mid is the root node. */
+   if (level < BTRFS_MAX_LEVEL - 1 && path->nodes[level + 1] == NULL)
+   if (btrfs_header_nritems(mid) != 1)
+   return false;
+
+   if (btrfs_header_nritems(mid) > BTRFS_NODEPTRS_PER_BLOCK(fs_info) / 4)
+   return false;
+
+   return true;
+}
+
 /*
  * node level balancing, used to make sure nodes are in proper order for
  * item deletion.  We balance from the top down, so we have to make sure
@@ -1780,10 +1803,6 @@ static noinline int balance_level(struct 
btrfs_trans_handle *trans,
 
mid = path->nodes[level];
 
-   WARN_ON(path->locks[level] != BTRFS_WRITE_LOCK &&
-   path->locks[level] != BTRFS_WRITE_LOCK_BLOCKING);
-   WARN_ON(btrfs_header_generation(mid) != trans->transid);
-
orig_ptr = btrfs_node_blockptr(mid, orig_slot);
 
if (level < BTRFS_MAX_LEVEL - 1) {
@@ -1798,9 +1817,6 @@ static noinline int balance_level(struct 
btrfs_trans_handle *trans,
if (!parent) {
struct extent_buffer *child;
 
-   if (btrfs_header_nritems(mid) != 1)
-   return 0;
-
/* promote the child to a root */
child = read_node_slot(fs_info, mid, 0);
if (IS_ERR(child)) {
@@ -1838,9 +1854,6 @@ static noinline int balance_level(struct 
btrfs_trans_handle *trans,
free_extent_buffer_stale(mid);
return 0;
}
-   if (btrfs_header_nritems(mid) >
-   BTRFS_NODEPTRS_PER_BLOCK(fs_info) / 4)
-   return 0;
 
left = read_node_slot(fs_info, parent, pslot - 1);
if (IS_ERR(left))
@@ -2460,14 +2473,20 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path 
*path, int level)
goto again;
}
 
+   /* Skip setting path to blocking if balance is not needed. */
+   if (!need_balance_level(fs_info, trans, p, level)) {
+   ret = 0;
+   goto done;
+   }
+
btrfs_set_path_blocking(p);
reada_for_balance(fs_info, p, level);
sret = balance_level(trans, root, p, level);
-
if (sret) {
ret = sret;
goto done;
}
+
b = p->nodes[level];
if (!b) {
btrfs_release_path(p);
-- 
1.8.3.1



Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-09-11 Thread Liu Bo
On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > Several structs in btrfs are using rb_first() in a while loop, it'd be
> > more efficient to do this with rb_first_cached() which has the O(1)
> > complexity.
> 
> We'd like to see some numbers, though just by reading the code I think
> there's going to be a noticeable improvement for some workloads.
> 
> There's a common pattern:
> 
> while(node = rb_first) {
>   entry = rb_entry(node)
>   next = rb_next(node)
>   rb_erase(node)
>   cleanup(entry)
> }
> 
> rb_first needs to traverse the tree up to logN depth, rb_erase can
> completely reshuffle the tree. With the caching we'll skip the traversal
> in rb_first. That's a cached memory access vs looped pointer
> dereference trade-off that IMHO has a clear winner.
> 
> As the pattern in many cases traverses the whole tree and removes all
> entries, ideally we could get rid of the rebalancing too. The entries
> would be cleaned up in postorder way, ie. reset the data and kfree in
> the end. This requires that order of the freeing does not matter, which
> might not apply to some trees.

The order of freeing does not matter, but we need the tree to return
the correct next node to free, IOW, we have to maintain the rbtree
until the last two nodes.

> 
> I did not find suitable rbtree functions or helpers for that,
> rbtree_postorder_for_each_entry_safe does not allow rb_erase and
> rb_erase itself forces the rebalancing internally. But I think we can
> implement that.
> 
> We could possibly use rb_next_postorder that makes sure that all
> children are traversed before the parent so this is at least something
> that could considerd.
> 

Qu was inquiring about the same thing; I haven't got time to dig into it
further.

The question we need to answer is whether we care about how fast the
destruction can be, as some are in critical paths while some are
not.
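
For the postorder idea, a minimal sketch (assuming the order of freeing
really does not matter and nothing can look the tree up while we tear it
down): children are visited before their parent, so no rb_erase() and no
rebalancing is needed at all.

	static void free_all_postorder(struct rb_root *root)
	{
		struct extent_map *em, *next;

		/* visits children before parents, caches next before freeing */
		rbtree_postorder_for_each_entry_safe(em, next, root, rb_node)
			free_extent_map(em);

		*root = RB_ROOT;	/* everything is freed, reset the root */
	}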

thanks,
-liubo

> In cases where the order of rbnode processing must be preserved, ie. the
> rb_erase must be there, the cached rb tree is all we can do.


[PATCH v2] Btrfs: use correct args for kcalloc in btrfsic_read_block

2018-09-07 Thread Liu Bo
kcalloc is defined as
kcalloc(size_t n, size_t size, gfp_t flags)

Although this won't cause problems in practice, btrfsic_read_block()
has swapped n and size.

This updates btrfsic_read_block() to use kcalloc correctly.

Reviewed-by: Omar Sandoval 
Signed-off-by: Liu Bo 
---
v2: use size_t to be consistent with kcalloc()'s prototype.

 fs/btrfs/check-integrity.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 833cf3c35b4d..2e43fba44035 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1594,6 +1594,7 @@ static int btrfsic_read_block(struct btrfsic_state *state,
 {
unsigned int num_pages;
unsigned int i;
+   size_t size;
u64 dev_bytenr;
int ret;
 
@@ -1608,9 +1609,8 @@ static int btrfsic_read_block(struct btrfsic_state *state,
 
num_pages = (block_ctx->len + (u64)PAGE_SIZE - 1) >>
PAGE_SHIFT;
-   block_ctx->mem_to_free = kcalloc(sizeof(*block_ctx->datav) +
-   sizeof(*block_ctx->pagev),
-num_pages, GFP_NOFS);
+   size = sizeof(*block_ctx->datav) + sizeof(*block_ctx->pagev);
+   block_ctx->mem_to_free = kcalloc(num_pages, size, GFP_NOFS);
if (!block_ctx->mem_to_free)
return -ENOMEM;
block_ctx->datav = block_ctx->mem_to_free;
-- 
1.8.3.1



Re: [PATCH] Btrfs: use correct args for kcalloc in btrfsic_read_block

2018-09-07 Thread Liu Bo
On Fri, Sep 7, 2018 at 11:21 AM, Omar Sandoval  wrote:
> On Sat, Sep 08, 2018 at 01:59:50AM +0800, Liu Bo wrote:
>> kcalloc is defined as
>> kcalloc(size_t n, size_t size, gfp_t flags)
>>
>> Although this won't cause problems in practice, btrfsic_read_block()
>> has switched n with size.
>>
>> This updates btrfsic_read_block() to be correct in using kcalloc.
>>
>> Signed-off-by: Liu Bo 
>> ---
>>  fs/btrfs/check-integrity.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
>> index 833cf3c35b4d..651b75939ce9 100644
>> --- a/fs/btrfs/check-integrity.c
>> +++ b/fs/btrfs/check-integrity.c
>> @@ -1594,6 +1594,7 @@ static int btrfsic_read_block(struct btrfsic_state 
>> *state,
>>  {
>>   unsigned int num_pages;
>>   unsigned int i;
>> + u64 size;
>
> size_t?
>

oops, will update it.

thanks,
liubo

> Besides that, Reviewed-by: Omar Sandoval 
>
>>   u64 dev_bytenr;
>>   int ret;
>>
>> @@ -1608,9 +1609,8 @@ static int btrfsic_read_block(struct btrfsic_state 
>> *state,
>>
>>   num_pages = (block_ctx->len + (u64)PAGE_SIZE - 1) >>
>>   PAGE_SHIFT;
>> - block_ctx->mem_to_free = kcalloc(sizeof(*block_ctx->datav) +
>> - sizeof(*block_ctx->pagev),
>> -  num_pages, GFP_NOFS);
>> + size = sizeof(*block_ctx->datav) + sizeof(*block_ctx->pagev);
>> + block_ctx->mem_to_free = kcalloc(num_pages, size, GFP_NOFS);
>>   if (!block_ctx->mem_to_free)
>>   return -ENOMEM;
>>   block_ctx->datav = block_ctx->mem_to_free;
>> --
>> 1.8.3.1
>>


[PATCH] Btrfs: use correct args for kcalloc in btrfsic_read_block

2018-09-07 Thread Liu Bo
kcalloc is defined as
kcalloc(size_t n, size_t size, gfp_t flags)

Although this won't cause problems in practice, btrfsic_read_block()
has swapped n and size.

This updates btrfsic_read_block() to use kcalloc correctly.

Signed-off-by: Liu Bo 
---
 fs/btrfs/check-integrity.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 833cf3c35b4d..651b75939ce9 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1594,6 +1594,7 @@ static int btrfsic_read_block(struct btrfsic_state *state,
 {
unsigned int num_pages;
unsigned int i;
+   u64 size;
u64 dev_bytenr;
int ret;
 
@@ -1608,9 +1609,8 @@ static int btrfsic_read_block(struct btrfsic_state *state,
 
num_pages = (block_ctx->len + (u64)PAGE_SIZE - 1) >>
PAGE_SHIFT;
-   block_ctx->mem_to_free = kcalloc(sizeof(*block_ctx->datav) +
-   sizeof(*block_ctx->pagev),
-num_pages, GFP_NOFS);
+   size = sizeof(*block_ctx->datav) + sizeof(*block_ctx->pagev);
+   block_ctx->mem_to_free = kcalloc(num_pages, size, GFP_NOFS);
if (!block_ctx->mem_to_free)
return -ENOMEM;
block_ctx->datav = block_ctx->mem_to_free;
-- 
1.8.3.1



Re: [PATCH] Btrfs: remove redundant btrfs_trans_release_metadata"

2018-09-06 Thread Liu Bo
On Thu, Sep 6, 2018 at 11:50 AM, Nikolay Borisov  wrote:
>
>
> On  6.09.2018 09:47, Liu Bo wrote:
>> On Wed, Sep 5, 2018 at 10:45 PM, Liu Bo  wrote:
>>> Somehow this ends up with crash in btrfs/124, I'm trying to figure out
>>> what went wrong.
>>>
>>
>> It revealed the problem addressed in Josef's patch[1], so with it,
>> this patch works just fine.
>
> What exactly was the crash ?
>

assertion failed: list_empty(&block_group->bg_list), file:
fs/btrfs/extent-tree.c,

kernel BUG at fs/btrfs/ctree.h:3427!
...
close_ctree+0x142/0x310 [btrfs]

thanks,
liubo



>>
>> [1] btrfs: make sure we create all new bgs
>>
>> thanks,
>> liubo
>>
>>>
>>> On Tue, Sep 4, 2018 at 6:14 PM, Liu Bo  wrote:
>>>> __btrfs_end_transaction() has done the metadata release twice,
>>>> probably because it used to process delayed refs in between, but now
>>>> that we don't process delayed refs any more, the 2nd release is always
>>>> a noop.
>>>>
>>>> Signed-off-by: Liu Bo 
>>>> ---
>>>>  fs/btrfs/transaction.c | 6 --
>>>>  1 file changed, 6 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>> index bb1b9f526e98..94b036a74d11 100644
>>>> --- a/fs/btrfs/transaction.c
>>>> +++ b/fs/btrfs/transaction.c
>>>> @@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct 
>>>> btrfs_trans_handle *trans,
>>>> return 0;
>>>> }
>>>>
>>>> -   btrfs_trans_release_metadata(trans);
>>>> -   trans->block_rsv = NULL;
>>>> -
>>>> -   if (!list_empty(&trans->new_bgs))
>>>> -   btrfs_create_pending_block_groups(trans);
>>>> -
>>>> trans->delayed_ref_updates = 0;
>>>> if (!trans->sync) {
>>>> must_run_delayed_refs =
>>>> --
>>>> 1.8.3.1
>>>>
>>


Re: [PATCH] Btrfs: remove redundant btrfs_trans_release_metadata"

2018-09-06 Thread Liu Bo
On Wed, Sep 5, 2018 at 10:45 PM, Liu Bo  wrote:
> Somehow this ends up with crash in btrfs/124, I'm trying to figure out
> what went wrong.
>

It revealed the problem addressed in Josef's patch[1], so with it,
this patch works just fine.

[1] btrfs: make sure we create all new bgs

thanks,
liubo

>
> On Tue, Sep 4, 2018 at 6:14 PM, Liu Bo  wrote:
>> __btrfs_end_transaction() has done the metadata release twice,
>> probably because it used to process delayed refs in between, but now
>> that we don't process delayed refs any more, the 2nd release is always
>> a noop.
>>
>> Signed-off-by: Liu Bo 
>> ---
>>  fs/btrfs/transaction.c | 6 --
>>  1 file changed, 6 deletions(-)
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index bb1b9f526e98..94b036a74d11 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct 
>> btrfs_trans_handle *trans,
>> return 0;
>> }
>>
>> -   btrfs_trans_release_metadata(trans);
>> -   trans->block_rsv = NULL;
>> -
>> -   if (!list_empty(>new_bgs))
>> -   btrfs_create_pending_block_groups(trans);
>> -
>> trans->delayed_ref_updates = 0;
>> if (!trans->sync) {
>> must_run_delayed_refs =
>> --
>> 1.8.3.1
>>


Re: [PATCH 22/35] btrfs: make sure we create all new bgs

2018-09-06 Thread Liu Bo
On Fri, Aug 31, 2018 at 7:03 AM, Josef Bacik  wrote:
> On Fri, Aug 31, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
>>
>>
>> On 30.08.2018 20:42, Josef Bacik wrote:
>> > We can actually allocate new chunks while we're creating our bg's, so
>> > instead of doing list_for_each_safe, just do while (!list_empty()) so we
>> > make sure to catch any new bg's that get added to the list.
>>
> How can this occur? Please elaborate and put an example call stack in the
> commit log.
>>
>
> Eh?  We're modifying the extent tree and chunk tree, which can cause bg's to 
> be
> allocated, it's just common sense.
>

This explains a bit.

  => btrfs_make_block_group
  => __btrfs_alloc_chunk
  => do_chunk_alloc
  => find_free_extent
  => btrfs_reserve_extent
  => btrfs_alloc_tree_block
  => __btrfs_cow_block
  => btrfs_cow_block
  => btrfs_search_slot
  => btrfs_update_device
  => btrfs_finish_chunk_alloc
  => btrfs_create_pending_block_groups
 ...
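
To illustrate why the iteration style matters (a hedged sketch, not the
actual patch; create_one_bg() is a made-up placeholder, while trans->new_bgs
and bg_list follow the code in question): list_for_each_entry_safe() caches
the next pointer before the loop body runs, so a block group appended to
trans->new_bgs while the last entry is being processed can be missed,
whereas re-checking list_empty() every iteration picks it up.

    struct btrfs_block_group_cache *bg, *next;

    /* old style: 'next' may already point at the list head even though
     * the body just appended more block groups behind it */
    list_for_each_entry_safe(bg, next, &trans->new_bgs, bg_list) {
            create_one_bg(trans, bg);       /* may append to trans->new_bgs */
            list_del_init(&bg->bg_list);
    }

    /* new style: re-check the list each time, so late additions are seen */
    while (!list_empty(&trans->new_bgs)) {
            bg = list_first_entry(&trans->new_bgs,
                                  struct btrfs_block_group_cache, bg_list);
            create_one_bg(trans, bg);       /* may append to trans->new_bgs */
            list_del_init(&bg->bg_list);
    }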


Reviewed-by: Liu Bo 

thanks,
liubo


Re: [PATCH] Btrfs: remove redundant btrfs_trans_release_metadata

2018-09-05 Thread Liu Bo
Somehow this ends up with a crash in btrfs/124; I'm trying to figure out
what went wrong.

thanks,
liubo


On Tue, Sep 4, 2018 at 6:14 PM, Liu Bo  wrote:
> __btrfs_end_transaction() has done the metadata release twice,
> probably because it used to process delayed refs in between, but now
> that we don't process delayed refs any more, the 2nd release is always
> a noop.
>
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/transaction.c | 6 --
>  1 file changed, 6 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index bb1b9f526e98..94b036a74d11 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
> return 0;
> }
>
> -   btrfs_trans_release_metadata(trans);
> -   trans->block_rsv = NULL;
> -
> -   if (!list_empty(>new_bgs))
> -   btrfs_create_pending_block_groups(trans);
> -
> trans->delayed_ref_updates = 0;
> if (!trans->sync) {
> must_run_delayed_refs =
> --
> 1.8.3.1
>


[PATCH v2] Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes

2018-09-04 Thread Liu Bo
Here we're not releasing any space, but transferring bytes from
->bytes_may_use to ->bytes_reserved.
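
In other words (just a sketch of the accounting, not new code): the bytes
only move from one space_info counter to another, so emitting a "release"
tracepoint here is misleading.

    space_info->bytes_may_use  -= ram_bytes;   /* no longer a tentative reservation */
    space_info->bytes_reserved += num_bytes;   /* now firmly reserved */
    cache->reserved            += num_bytes;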

Signed-off-by: Liu Bo 
---
v2: Add missing commit log.

 fs/btrfs/extent-tree.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 41a02cbb5a4a..76ee5ebef2b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6401,10 +6401,6 @@ static int btrfs_add_reserved_bytes(struct 
btrfs_block_group_cache *cache,
} else {
cache->reserved += num_bytes;
space_info->bytes_reserved += num_bytes;
-
-   trace_btrfs_space_reservation(cache->fs_info,
-   "space_info", space_info->flags,
-   ram_bytes, 0);
space_info->bytes_may_use -= ram_bytes;
if (delalloc)
cache->delalloc_bytes += num_bytes;
-- 
1.8.3.1



Re: [PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap

2018-09-04 Thread Liu Bo
On Thu, Aug 30, 2018 at 10:42 AM, Josef Bacik  wrote:
> From: Josef Bacik 
>
> We need to clear the max_extent_size when we clear bits from a bitmap
> since it could have been from the range that contains the
> max_extent_size.
>

Looks OK.
Reviewed-by: Liu Bo 

thanks,
liubo

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/free-space-cache.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 53521027dd78..7faca05e61ea 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1683,6 +1683,8 @@ static inline void __bitmap_clear_bits(struct 
> btrfs_free_space_ctl *ctl,
> bitmap_clear(info->bitmap, start, count);
>
> info->bytes -= bytes;
> +   if (info->max_extent_size > ctl->unit)
> +   info->max_extent_size = 0;
>  }
>
>  static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
> --
> 2.14.3
>


Re: [PATCH 08/35] btrfs: release metadata before running delayed refs

2018-09-04 Thread Liu Bo
On Thu, Aug 30, 2018 at 10:41 AM, Josef Bacik  wrote:
> We want to release the unused reservation we have since it refills the
> delayed refs reserve, which will make everything go smoother when
> running the delayed refs if we're short on our reservation.
>

Looks good.
Reviewed-by: Liu Bo 

thanks,
liubo

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 99741254e27e..ebb0c0405598 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
> return ret;
> }
>
> +   btrfs_trans_release_metadata(trans);
> +   trans->block_rsv = NULL;
> +
> /* make a pass through all the delayed refs we have so far
>  * any runnings procs may add more while we are here
>  */
> @@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
> return ret;
> }
>
> -   btrfs_trans_release_metadata(trans);
> -   trans->block_rsv = NULL;
> -
> cur_trans = trans->transaction;
>
> /*
> --
> 2.14.3
>


[PATCH] Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes

2018-09-04 Thread Liu Bo
Signed-off-by: Liu Bo 
---
 fs/btrfs/extent-tree.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 41a02cbb5a4a..76ee5ebef2b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6401,10 +6401,6 @@ static int btrfs_add_reserved_bytes(struct 
btrfs_block_group_cache *cache,
} else {
cache->reserved += num_bytes;
space_info->bytes_reserved += num_bytes;
-
-   trace_btrfs_space_reservation(cache->fs_info,
-   "space_info", space_info->flags,
-   ram_bytes, 0);
space_info->bytes_may_use -= ram_bytes;
if (delalloc)
cache->delalloc_bytes += num_bytes;
-- 
1.8.3.1



[PATCH] Btrfs: remove redundant btrfs_trans_release_metadata

2018-09-04 Thread Liu Bo
__btrfs_end_transaction() has done the metadata release twice,
probably because it used to process delayed refs in between, but now
that we don't process delayed refs any more, the 2nd release is always
a noop.

Signed-off-by: Liu Bo 
---
 fs/btrfs/transaction.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index bb1b9f526e98..94b036a74d11 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
return 0;
}
 
-   btrfs_trans_release_metadata(trans);
-   trans->block_rsv = NULL;
-
-   if (!list_empty(>new_bgs))
-   btrfs_create_pending_block_groups(trans);
-
trans->delayed_ref_updates = 0;
if (!trans->sync) {
must_run_delayed_refs =
-- 
1.8.3.1



Re: [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper

2018-09-04 Thread Liu Bo
On Thu, Aug 30, 2018 at 10:41 AM, Josef Bacik  wrote:
> From: Josef Bacik 
>
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
>



Reviewed-by: Liu Bo 

thanks,
liubo
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 67 
> +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6799950fa057..4c9fd35bca07 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
> *trans,
> return ret ? ret : 1;
>  }
>
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> +   struct btrfs_delayed_ref_head *head)
> +{
> +   struct btrfs_fs_info *fs_info = trans->fs_info;
> +   struct btrfs_delayed_ref_root *delayed_refs =
> +   >transaction->delayed_refs;
> +
> +   if (head->total_ref_mod < 0) {
> +   struct btrfs_space_info *space_info;
> +   u64 flags;
> +
> +   if (head->is_data)
> +   flags = BTRFS_BLOCK_GROUP_DATA;
> +   else if (head->is_system)
> +   flags = BTRFS_BLOCK_GROUP_SYSTEM;
> +   else
> +   flags = BTRFS_BLOCK_GROUP_METADATA;
> +   space_info = __find_space_info(fs_info, flags);
> +   ASSERT(space_info);
> +   percpu_counter_add_batch(_info->total_bytes_pinned,
> +  -head->num_bytes,
> +  BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> +   if (head->is_data) {
> +   spin_lock(_refs->lock);
> +   delayed_refs->pending_csums -= head->num_bytes;
> +   spin_unlock(_refs->lock);
> +   }
> +   }
> +
> +   /* Also free its reserved qgroup space */
> +   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> + head->qgroup_reserved);
> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
> struct btrfs_delayed_ref_head *head)
>  {
> @@ -2496,31 +2531,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
> spin_unlock(_refs->lock);
> spin_unlock(>lock);
>
> -   trace_run_delayed_ref_head(fs_info, head, 0);
> -
> -   if (head->total_ref_mod < 0) {
> -   struct btrfs_space_info *space_info;
> -   u64 flags;
> -
> -   if (head->is_data)
> -   flags = BTRFS_BLOCK_GROUP_DATA;
> -   else if (head->is_system)
> -   flags = BTRFS_BLOCK_GROUP_SYSTEM;
> -   else
> -   flags = BTRFS_BLOCK_GROUP_METADATA;
> -   space_info = __find_space_info(fs_info, flags);
> -   ASSERT(space_info);
> -   percpu_counter_add_batch(_info->total_bytes_pinned,
> -  -head->num_bytes,
> -  BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> -   if (head->is_data) {
> -   spin_lock(_refs->lock);
> -   delayed_refs->pending_csums -= head->num_bytes;
> -   spin_unlock(_refs->lock);
> -   }
> -   }
> -
> if (head->must_insert_reserved) {
> btrfs_pin_extent(fs_info, head->bytenr,
>  head->num_bytes, 1);
> @@ -2530,9 +2540,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
> }
> }
>
> -   /* Also free its reserved qgroup space */
> -   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> - head->qgroup_reserved);
> +   cleanup_ref_head_accounting(trans, head);
> +
> +   trace_run_delayed_ref_head(fs_info, head, 0);
> btrfs_delayed_ref_unlock(head);
> btrfs_put_delayed_ref_head(head);
> return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
> if (head->must_insert_reserved)
> ret = 1;
>
> +   cleanup_ref_head_accounting(trans, head);
> mutex_unlock(>mutex);
> btrfs_put_delayed_ref_head(head);
> return ret;
> --
> 2.14.3
>


[PATCH] Btrfs: fix alignment in declaration and prototype of btrfs_get_extent

2018-08-24 Thread Liu Bo
This fixes btrfs_get_extent to be consistent with our existing
declaration style.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h | 4 ++--
 fs/btrfs/inode.c | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1aeed3c0e949..00e506de70ba 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3177,8 +3177,8 @@ int btrfs_merge_bio_hook(struct page *page, unsigned long 
offset,
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 struct btrfs_root *root, int *was_new);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
-   struct page *page, size_t pg_offset,
-   u64 start, u64 end, int create);
+   struct page *page, size_t pg_offset,
+   u64 start, u64 end, int create);
 int btrfs_update_inode(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  struct inode *inode);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1e0af0cc56fc..31d43355f052 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6787,9 +6787,9 @@ static noinline int uncompress_inline(struct btrfs_path 
*path,
  * This also copies inline extents directly into the page.
  */
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
-   struct page *page,
-   size_t pg_offset, u64 start, u64 len,
-   int create)
+   struct page *page,
+   size_t pg_offset, u64 start, u64 len,
+   int create)
 {
struct btrfs_fs_info *fs_info = inode->root->fs_info;
int ret;
-- 
1.8.3.1



[PATCH v3] Btrfs: set leave_spinning in btrfs_get_extent

2018-08-24 Thread Liu Bo
Unless it's going to read inline extents from a btree leaf into a page,
btrfs_get_extent() won't sleep while holding the path lock.

This sets leave_spinning up front and switches the path to blocking mode
right before reading an inline extent, if that's the case.  The benefit is
that a path in spinning mode typically has less impact on waiters (i.e. is
faster for them) than one in blocking mode.

Also fixes the misalignment of the prototype, which is too trivial for
a single patch.


Signed-off-by: Liu Bo 
---
v3: separate alignment fix out of this patch.

 fs/btrfs/inode.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 10e86f7b5398..1e0af0cc56fc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6843,6 +6843,12 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
 */
path->reada = READA_FORWARD;
 
+   /*
+* Unless we're going to uncompress inline extent, no sleep would
+* happen.
+*/
+   path->leave_spinning = 1;
+
ret = btrfs_lookup_file_extent(NULL, root, path, objectid, start, 0);
if (ret < 0) {
err = ret;
@@ -6945,6 +6951,8 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
em->orig_block_len = em->len;
em->orig_start = em->start;
ptr = btrfs_file_extent_inline_start(item) + extent_offset;
+
+   btrfs_set_path_blocking(path);
if (!PageUptodate(page)) {
if (btrfs_file_extent_compression(leaf, item) !=
BTRFS_COMPRESS_NONE) {
-- 
1.8.3.1



Re: [PATCH v2] Btrfs: set leave_spinning in btrfs_get_extent

2018-08-24 Thread Liu Bo
On Fri, Aug 24, 2018 at 04:59:49PM +0200, David Sterba wrote:
> On Thu, Aug 23, 2018 at 07:41:05AM +0800, Liu Bo wrote:
> > Unless it's going to read inline extents from btree leaf to page,
> > btrfs_get_extent won't sleep during the period of holding path lock.
> > 
> > This sets leave_spinning at first and sets path to blocking mode right
> > before reading inline extent if that's the case.  The benefit is that a
> > path in spinning mode typically has less impact (faster) on waiters
> > rather than that in blocking mode.
> > 
> > Also fixes the misalignment of the prototype, which is too trivial for
> > a single patch.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> > v2: send out the correct patch.
> > 
> >  fs/btrfs/ctree.h |  4 ++--
> >  fs/btrfs/inode.c | 14 +++---
> >  2 files changed, 13 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 1aeed3c0e949..00e506de70ba 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -3177,8 +3177,8 @@ int btrfs_merge_bio_hook(struct page *page, unsigned 
> > long offset,
> >  struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
> >  struct btrfs_root *root, int *was_new);
> >  struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> > -   struct page *page, size_t pg_offset,
> > -   u64 start, u64 end, int create);
> > +   struct page *page, size_t pg_offset,
> > +   u64 start, u64 end, int create);
> 
> Please don't add unrelated changes.
> 
> >  int btrfs_update_inode(struct btrfs_trans_handle *trans,
> >   struct btrfs_root *root,
> >   struct inode *inode);
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 10e86f7b5398..31d43355f052 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -6787,9 +6787,9 @@ static noinline int uncompress_inline(struct 
> > btrfs_path *path,
> >   * This also copies inline extents directly into the page.
> >   */
> >  struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> > -   struct page *page,
> > -   size_t pg_offset, u64 start, u64 len,
> > -   int create)
> > +   struct page *page,
> > +   size_t pg_offset, u64 start, u64 len,
> > +   int create)
> 
> Ok, this looks ugly and the reformatting makes sense, both declaration
> and prototype, but please send it as a separate patch.

Sure.

Note that there're still some format issues introduced by the old
patch set of "inode -> btrfs_inode".

thanks,
-liubo


[PATCH v2] Btrfs: set leave_spinning in btrfs_get_extent

2018-08-22 Thread Liu Bo
Unless it's going to read inline extents from a btree leaf into a page,
btrfs_get_extent() won't sleep while holding the path lock.

This sets leave_spinning up front and switches the path to blocking mode
right before reading an inline extent, if that's the case.  The benefit is
that a path in spinning mode typically has less impact on waiters (i.e. is
faster for them) than one in blocking mode.

Also fixes the misalignment of the prototype, which is too trivial for
a single patch.

Signed-off-by: Liu Bo 
---
v2: send out the correct patch.

 fs/btrfs/ctree.h |  4 ++--
 fs/btrfs/inode.c | 14 +++---
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1aeed3c0e949..00e506de70ba 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3177,8 +3177,8 @@ int btrfs_merge_bio_hook(struct page *page, unsigned long 
offset,
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 struct btrfs_root *root, int *was_new);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
-   struct page *page, size_t pg_offset,
-   u64 start, u64 end, int create);
+   struct page *page, size_t pg_offset,
+   u64 start, u64 end, int create);
 int btrfs_update_inode(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  struct inode *inode);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 10e86f7b5398..31d43355f052 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6787,9 +6787,9 @@ static noinline int uncompress_inline(struct btrfs_path 
*path,
  * This also copies inline extents directly into the page.
  */
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
-   struct page *page,
-   size_t pg_offset, u64 start, u64 len,
-   int create)
+   struct page *page,
+   size_t pg_offset, u64 start, u64 len,
+   int create)
 {
struct btrfs_fs_info *fs_info = inode->root->fs_info;
int ret;
@@ -6843,6 +6843,12 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
 */
path->reada = READA_FORWARD;
 
+   /*
+* Unless we're going to uncompress inline extent, no sleep would
+* happen.
+*/
+   path->leave_spinning = 1;
+
ret = btrfs_lookup_file_extent(NULL, root, path, objectid, start, 0);
if (ret < 0) {
err = ret;
@@ -6945,6 +6951,8 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
em->orig_block_len = em->len;
em->orig_start = em->start;
ptr = btrfs_file_extent_inline_start(item) + extent_offset;
+
+   btrfs_set_path_blocking(path);
if (!PageUptodate(page)) {
if (btrfs_file_extent_compression(leaf, item) !=
BTRFS_COMPRESS_NONE) {
-- 
1.8.3.1



[PATCH] Btrfs: btrfs_get_extent: put btrfs_free_path ahead

2018-08-22 Thread Liu Bo
trace_btrfs_get_extent() has nothing to do with the path, so place
btrfs_free_path() ahead of it so that the path is released earlier,
including in the error case.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 31d43355f052..6a0380f3f318 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7000,10 +7000,10 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
err = btrfs_add_extent_mapping(fs_info, em_tree, , start, len);
write_unlock(_tree->lock);
 out:
+   btrfs_free_path(path);
 
trace_btrfs_get_extent(root, inode, em);
 
-   btrfs_free_path(path);
if (err) {
free_extent_map(em);
return ERR_PTR(err);
-- 
1.8.3.1



[PATCH] Btrfs: set leave_spinning in btrfs_get_extent

2018-08-22 Thread Liu Bo
Unless it's going to read inline extents from a btree leaf into a page,
btrfs_get_extent() won't sleep while holding the path lock.

This sets leave_spinning up front and switches the path to blocking mode
right before reading an inline extent, if that's the case.  The benefit is
that a path in spinning mode typically has less impact on waiters (i.e. is
faster for them) than one in blocking mode.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 10e86f7b5398..4219be9c21f9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6843,6 +6843,12 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
 */
path->reada = READA_FORWARD;
 
+   /*
+* Unless we're going to uncompress inline extent, no sleep would
+* happen.
+*/
+   path->leave_spinning = 1;
+
ret = btrfs_lookup_file_extent(NULL, root, path, objectid, start, 0);
if (ret < 0) {
err = ret;
@@ -6945,6 +6951,8 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
em->orig_block_len = em->len;
em->orig_start = em->start;
ptr = btrfs_file_extent_inline_start(item) + extent_offset;
+
+   btrfs_set_path_blocking(path);
if (!PageUptodate(page)) {
if (btrfs_file_extent_compression(leaf, item) !=
BTRFS_COMPRESS_NONE) {
@@ -6992,10 +7000,10 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
err = btrfs_add_extent_mapping(fs_info, em_tree, , start, len);
write_unlock(_tree->lock);
 out:
+   btrfs_free_path(path);
 
trace_btrfs_get_extent(root, inode, em);
 
-   btrfs_free_path(path);
if (err) {
free_extent_map(em);
return ERR_PTR(err);
-- 
1.8.3.1



Re: [PATCH 0/2] address lock contention of btree root

2018-08-22 Thread Liu Bo
On Tue, Aug 21, 2018 at 02:48:47PM -0400, Chris Mason wrote:
> On 21 Aug 2018, at 14:15, Liu Bo wrote:
> 
> > On Tue, Aug 21, 2018 at 01:54:11PM -0400, Chris Mason wrote:
> > > On 16 Aug 2018, at 17:07, Liu Bo wrote:
> > > 
> > > > The lock contention on btree nodes (esp. root node) is apparently a
> > > > bottleneck when there're multiple readers and writers concurrently
> > > > trying to access them.  Unfortunately this is by design and it's not
> > > > easy to fix it unless with some complex changes, however, there is
> > > > still some room.
> > > > 
> > > > With a stable workload based on fsmark which has 16 threads creating
> > > > 1,600K files, we could see that a good amount of overhead comes from
> > > > switching path between spinning mode and blocking mode in
> > > > btrfs_search_slot().
> > > > 
> > > > Patch 1 provides more details about the overhead and test
> > > > results from
> > > > fsmark and dbench.
> > > > Patch 2 kills leave_spinning due to the behaviour change from
> > > > patch 1.
> > > 
> > > This is really interesting, do you have numbers about how often we
> > > are able
> > > to stay spinning?
> > > 
> > 
> > I didn't gather how long we stay spinning,
> 
> I'm less worried about length of time spinning than I am how often we're
> able to call btrfs_search_slot() without ever blocking.  If one caller ends
> up going into blocking mode, it can cascade into all of the other callers,
> which can slow things down in low-to-medium contention workloads.
> 
> The flip side is that maybe the adaptive spinning in the mutexes is good
> enough now and we can just deleting the spinning completely.
>

Hmm, looks like the current mutex with adaptive spinning doesn't offer a
read/write version, meaning we're not able to simply drop the rwlock.

thanks,
-liubo

> > but I was tracking
> > 
> > (1) how long a read lock or write lock can wait until it gets the
> > lock, with vanilla kernel, for read lock, it could be up to 10ms while
> > for write lock, it could be up to 200ms.
> 
> Nice, please add the stats to the patch descriptions, both before and after.
> I'd love to see a histogram like you can get out of bcc's argdist.py.
> 
> -chris


[PATCH 4/5] Btrfs: extent_map: use rb_first_cached

2018-08-22 Thread Liu Bo
rb_first_cached() trades an extra pointer "leftmost" for doing the
same job as rb_first() but in O(1).

As evict_inode_truncate_pages() removes all extent mappings by always
looking for the first rb entry, it's helpful to use rb_first_cached()
instead.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_map.c | 27 +++
 fs/btrfs/extent_map.h |  2 +-
 fs/btrfs/inode.c  |  4 ++--
 fs/btrfs/tests/extent-map-tests.c |  4 ++--
 fs/btrfs/volumes.c|  5 +++--
 5 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 6648d55e5339..819420eab91d 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -34,7 +34,7 @@ void __cold extent_map_exit(void)
  */
 void extent_map_tree_init(struct extent_map_tree *tree)
 {
-   tree->map = RB_ROOT;
+   tree->map = RB_ROOT_CACHED;
INIT_LIST_HEAD(>modified_extents);
rwlock_init(>lock);
 }
@@ -90,24 +90,27 @@ static u64 range_end(u64 start, u64 len)
return start + len;
 }
 
-static int tree_insert(struct rb_root *root, struct extent_map *em)
+static int tree_insert(struct rb_root_cached *root, struct extent_map *em)
 {
-   struct rb_node **p = >rb_node;
+   struct rb_node **p = >rb_root.rb_node;
struct rb_node *parent = NULL;
struct extent_map *entry = NULL;
struct rb_node *orig_parent = NULL;
u64 end = range_end(em->start, em->len);
+   bool leftmost = true;
 
while (*p) {
parent = *p;
entry = rb_entry(parent, struct extent_map, rb_node);
 
-   if (em->start < entry->start)
+   if (em->start < entry->start) {
p = &(*p)->rb_left;
-   else if (em->start >= extent_map_end(entry))
+   } else if (em->start >= extent_map_end(entry)) {
p = &(*p)->rb_right;
-   else
+   leftmost = false;
+   } else {
return -EEXIST;
+   }
}
 
orig_parent = parent;
@@ -130,7 +133,7 @@ static int tree_insert(struct rb_root *root, struct 
extent_map *em)
return -EEXIST;
 
rb_link_node(>rb_node, orig_parent, p);
-   rb_insert_color(>rb_node, root);
+   rb_insert_color_cached(>rb_node, root, leftmost);
return 0;
 }
 
@@ -242,7 +245,7 @@ static void try_merge_map(struct extent_map_tree *tree, 
struct extent_map *em)
em->mod_start = merge->mod_start;
em->generation = max(em->generation, merge->generation);
 
-   rb_erase(>rb_node, >map);
+   rb_erase_cached(>rb_node, >map);
RB_CLEAR_NODE(>rb_node);
free_extent_map(merge);
}
@@ -254,7 +257,7 @@ static void try_merge_map(struct extent_map_tree *tree, 
struct extent_map *em)
if (rb && mergable_maps(em, merge)) {
em->len += merge->len;
em->block_len += merge->block_len;
-   rb_erase(>rb_node, >map);
+   rb_erase_cached(>rb_node, >map);
RB_CLEAR_NODE(>rb_node);
em->mod_len = (merge->mod_start + merge->mod_len) - 
em->mod_start;
em->generation = max(em->generation, merge->generation);
@@ -367,7 +370,7 @@ int add_extent_mapping(struct extent_map_tree *tree,
struct rb_node *next = NULL;
u64 end = range_end(start, len);
 
-   rb_node = __tree_search(>map, start, , );
+   rb_node = __tree_search(>map.rb_root, start, , );
if (!rb_node) {
if (prev)
rb_node = prev;
@@ -433,7 +436,7 @@ int remove_extent_mapping(struct extent_map_tree *tree, 
struct extent_map *em)
int ret = 0;
 
WARN_ON(test_bit(EXTENT_FLAG_PINNED, >flags));
-   rb_erase(>rb_node, >map);
+   rb_erase_cached(>rb_node, >map);
if (!test_bit(EXTENT_FLAG_LOGGING, >flags))
list_del_init(>list);
RB_CLEAR_NODE(>rb_node);
@@ -449,7 +452,7 @@ void replace_extent_mapping(struct extent_map_tree *tree,
ASSERT(extent_map_in_tree(cur));
if (!test_bit(EXTENT_FLAG_LOGGING, >flags))
list_del_init(>list);
-   rb_replace_node(>rb_node, >rb_node, >map);
+   rb_replace_node_cached(>rb_node, >rb_node, >map);
RB_CLEAR_NODE(>rb_node);
 
setup_extent_mapping(tree, new, modified);
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 25d985e7532a..6afe786bf6d0 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -49,7 +49,7 @@ struct extent_map {
 }

[PATCH 2/5] Btrfs: href->ref_tree: use rb_first_cached

2018-08-22 Thread Liu Bo
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

Functions manipulating href->ref_tree need to get the first entry, so this
converts href->ref_tree to use rb_first_cached().

Signed-off-by: Liu Bo 
---
 fs/btrfs/backref.c |  2 +-
 fs/btrfs/delayed-ref.c | 24 ++--
 fs/btrfs/delayed-ref.h |  2 +-
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/extent-tree.c | 13 +++--
 5 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index ae750b1574a2..bc1ea6e9ea4f 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -769,7 +769,7 @@ static int add_delayed_refs(const struct btrfs_fs_info 
*fs_info,
btrfs_disk_key_to_cpu(_op_key, _op->key);
 
spin_lock(>lock);
-   for (n = rb_first(>ref_tree); n; n = rb_next(n)) {
+   for (n = rb_first_cached(>ref_tree); n; n = rb_next(n)) {
node = rb_entry(n, struct btrfs_delayed_ref_node,
ref_node);
if (node->seq > seq)
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 4ef4a30ccc3d..5f4d46e0f2c0 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -133,13 +133,14 @@ static struct btrfs_delayed_ref_head *htree_insert(struct 
rb_root_cached *root,
return NULL;
 }
 
-static struct btrfs_delayed_ref_node* tree_insert(struct rb_root *root,
+static struct btrfs_delayed_ref_node *tree_insert(struct rb_root_cached *root,
struct btrfs_delayed_ref_node *ins)
 {
-   struct rb_node **p = >rb_node;
+   struct rb_node **p = >rb_root.rb_node;
struct rb_node *node = >ref_node;
struct rb_node *parent_node = NULL;
struct btrfs_delayed_ref_node *entry;
+   bool leftmost = true;
 
while (*p) {
int comp;
@@ -148,16 +149,18 @@ static struct btrfs_delayed_ref_node* tree_insert(struct 
rb_root *root,
entry = rb_entry(parent_node, struct btrfs_delayed_ref_node,
 ref_node);
comp = comp_refs(ins, entry, true);
-   if (comp < 0)
+   if (comp < 0) {
p = &(*p)->rb_left;
-   else if (comp > 0)
+   } else if (comp > 0) {
p = &(*p)->rb_right;
-   else
+   leftmost = false;
+   } else {
return entry;
+   }
}
 
rb_link_node(node, parent_node, p);
-   rb_insert_color(node, root);
+   rb_insert_color_cached(node, root, leftmost);
return NULL;
 }
 
@@ -231,7 +234,7 @@ static inline void drop_delayed_ref(struct 
btrfs_trans_handle *trans,
struct btrfs_delayed_ref_node *ref)
 {
lockdep_assert_held(>lock);
-   rb_erase(>ref_node, >ref_tree);
+   rb_erase_cached(>ref_node, >ref_tree);
RB_CLEAR_NODE(>ref_node);
if (!list_empty(>add_list))
list_del(>add_list);
@@ -300,7 +303,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle 
*trans,
 
lockdep_assert_held(>lock);
 
-   if (RB_EMPTY_ROOT(>ref_tree))
+   if (RB_EMPTY_ROOT(>ref_tree.rb_root))
return;
 
/* We don't have too many refs to merge for data. */
@@ -318,7 +321,8 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle 
*trans,
spin_unlock(_info->tree_mod_seq_lock);
 
 again:
-   for (node = rb_first(>ref_tree); node; node = rb_next(node)) {
+   for (node = rb_first_cached(>ref_tree); node;
+node = rb_next(node)) {
ref = rb_entry(node, struct btrfs_delayed_ref_node, ref_node);
if (seq && ref->seq >= seq)
continue;
@@ -573,7 +577,7 @@ static void init_delayed_ref_head(struct 
btrfs_delayed_ref_head *head_ref,
head_ref->must_insert_reserved = must_insert_reserved;
head_ref->is_data = is_data;
head_ref->is_system = is_system;
-   head_ref->ref_tree = RB_ROOT;
+   head_ref->ref_tree = RB_ROOT_CACHED;
INIT_LIST_HEAD(_ref->ref_add_list);
RB_CLEAR_NODE(_ref->href_node);
head_ref->processing = 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 88438b6cee45..c3e3486a126c 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -79,7 +79,7 @@ struct btrfs_delayed_ref_head {
struct mutex mutex;
 
spinlock_t lock;
-   struct rb_root ref_tree;
+   struct rb_root_cached ref_tree;
/* accumulate add BTRFS_ADD_DELAYED_REF nodes to this ref_add_list. */
struct list_head ref_add_list;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dee7bcc13ffa..95a3d33a0454 10

[PATCH 0/5] rb_first to rb_first_cached conversion

2018-08-22 Thread Liu Bo
Several structs in btrfs are using rb_first() in a while loop; it'd be
more efficient to do this with rb_first_cached(), which has O(1)
complexity.

This patch set updates five structs which may have a large rb tree in
practice.
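
For readers unfamiliar with the API, a minimal, self-contained sketch of
what the conversions in this series look like (struct item and the helpers
are made-up names, not btrfs code): rb_root_cached keeps a pointer to the
leftmost node, so insert/erase must maintain it, and in exchange
rb_first_cached() is O(1).

    #include <linux/types.h>
    #include <linux/rbtree.h>

    struct item {
            u64 start;
            struct rb_node node;
    };

    static struct rb_root_cached tree = RB_ROOT_CACHED;

    static void item_insert(struct item *ins)
    {
            struct rb_node **p = &tree.rb_root.rb_node;
            struct rb_node *parent = NULL;
            bool leftmost = true;

            while (*p) {
                    struct item *cur = rb_entry(*p, struct item, node);

                    parent = *p;
                    if (ins->start < cur->start) {
                            p = &(*p)->rb_left;
                    } else {
                            p = &(*p)->rb_right;
                            leftmost = false;   /* went right at least once */
                    }
            }
            rb_link_node(&ins->node, parent, p);
            /* tell the rbtree whether this is the new leftmost node */
            rb_insert_color_cached(&ins->node, &tree, leftmost);
    }

    /* drain in order, always taking the first entry in O(1) */
    static void item_drain_all(void (*fn)(struct item *))
    {
            struct rb_node *n;

            while ((n = rb_first_cached(&tree))) {
                    struct item *it = rb_entry(n, struct item, node);

                    rb_erase_cached(n, &tree);
                    fn(it);
            }
    }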

Liu Bo (5):
  Btrfs: href_root: use rb_first_cached
  Btrfs: href->ref_tree: use rb_first_cached
  Btrfs: delayed_items: use rb_first_cached
  Btrfs: extent_map: use rb_first_cached
  Btrfs: preftree: use rb_first_cached

 fs/btrfs/backref.c| 34 +-
 fs/btrfs/delayed-inode.c  | 29 +--
 fs/btrfs/delayed-inode.h  |  4 ++--
 fs/btrfs/delayed-ref.c| 50 +++
 fs/btrfs/delayed-ref.h|  4 ++--
 fs/btrfs/disk-io.c|  8 +++
 fs/btrfs/extent-tree.c| 19 ---
 fs/btrfs/extent_map.c | 27 +++--
 fs/btrfs/extent_map.h |  2 +-
 fs/btrfs/inode.c  |  4 ++--
 fs/btrfs/tests/extent-map-tests.c |  4 ++--
 fs/btrfs/transaction.c|  5 ++--
 fs/btrfs/volumes.c|  5 ++--
 13 files changed, 107 insertions(+), 88 deletions(-)

-- 
1.8.3.1



[PATCH 1/5] Btrfs: href_root: use rb_first_cached

2018-08-22 Thread Liu Bo
rb_first_cached() trades an extra pointer "leftmost" for doing the
same job as rb_first() but in O(1).

Functions manipulating href_root need to get the first entry, so this
converts href_root to use rb_first_cached().

Signed-off-by: Liu Bo 
---
 fs/btrfs/delayed-ref.c | 26 +++---
 fs/btrfs/delayed-ref.h |  2 +-
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/extent-tree.c |  6 +++---
 fs/btrfs/transaction.c |  5 +++--
 5 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 62ff545ba1f7..4ef4a30ccc3d 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -101,14 +101,15 @@ static int comp_refs(struct btrfs_delayed_ref_node *ref1,
 }
 
 /* insert a new ref to head ref rbtree */
-static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root,
+static struct btrfs_delayed_ref_head *htree_insert(struct rb_root_cached *root,
   struct rb_node *node)
 {
-   struct rb_node **p = >rb_node;
+   struct rb_node **p = >rb_root.rb_node;
struct rb_node *parent_node = NULL;
struct btrfs_delayed_ref_head *entry;
struct btrfs_delayed_ref_head *ins;
u64 bytenr;
+   bool leftmost = true;
 
ins = rb_entry(node, struct btrfs_delayed_ref_head, href_node);
bytenr = ins->bytenr;
@@ -117,16 +118,18 @@ static struct btrfs_delayed_ref_head *htree_insert(struct 
rb_root *root,
entry = rb_entry(parent_node, struct btrfs_delayed_ref_head,
 href_node);
 
-   if (bytenr < entry->bytenr)
+   if (bytenr < entry->bytenr) {
p = &(*p)->rb_left;
-   else if (bytenr > entry->bytenr)
+   } else if (bytenr > entry->bytenr) {
p = &(*p)->rb_right;
-   else
+   leftmost = false;
+   } else {
return entry;
+   }
}
 
rb_link_node(node, parent_node, p);
-   rb_insert_color(node, root);
+   rb_insert_color_cached(node, root, leftmost);
return NULL;
 }
 
@@ -165,9 +168,10 @@ static struct btrfs_delayed_ref_node* tree_insert(struct 
rb_root *root,
  * match is found.
  */
 static struct btrfs_delayed_ref_head *
-find_ref_head(struct rb_root *root, u64 bytenr,
+find_ref_head(struct btrfs_delayed_ref_root *dr, u64 bytenr,
  int return_bigger)
 {
+   struct rb_root *root = >href_root.rb_root;
struct rb_node *n;
struct btrfs_delayed_ref_head *entry;
 
@@ -187,7 +191,7 @@ static struct btrfs_delayed_ref_node* tree_insert(struct 
rb_root *root,
if (bytenr > entry->bytenr) {
n = rb_next(>href_node);
if (!n)
-   n = rb_first(root);
+   n = rb_first_cached(>href_root);
entry = rb_entry(n, struct btrfs_delayed_ref_head,
 href_node);
return entry;
@@ -357,12 +361,12 @@ struct btrfs_delayed_ref_head *
 
 again:
start = delayed_refs->run_delayed_start;
-   head = find_ref_head(_refs->href_root, start, 1);
+   head = find_ref_head(delayed_refs, start, 1);
if (!head && !loop) {
delayed_refs->run_delayed_start = 0;
start = 0;
loop = true;
-   head = find_ref_head(_refs->href_root, start, 1);
+   head = find_ref_head(delayed_refs, start, 1);
if (!head)
return NULL;
} else if (!head && loop) {
@@ -903,7 +907,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info 
*fs_info,
 struct btrfs_delayed_ref_head *
 btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs, u64 
bytenr)
 {
-   return find_ref_head(_refs->href_root, bytenr, 0);
+   return find_ref_head(delayed_refs, bytenr, 0);
 }
 
 void __cold btrfs_delayed_ref_exit(void)
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index d9f2a4ebd5db..88438b6cee45 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -148,7 +148,7 @@ struct btrfs_delayed_data_ref {
 
 struct btrfs_delayed_ref_root {
/* head ref rbtree */
-   struct rb_root href_root;
+   struct rb_root_cached href_root;
 
/* dirty extent records */
struct rb_root dirty_extent_root;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5124c15705ce..dee7bcc13ffa 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4203,7 +4203,7 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
return ret;
}
 
-   while ((node = rb_first(_refs->href_root)) != NULL) {
+   while 

[PATCH 3/5] Btrfs: delayed_items: use rb_first_cached

2018-08-22 Thread Liu Bo
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

Functions manipulating delayed_item need to get the first entry, so this
converts it to use rb_first_cached().

Signed-off-by: Liu Bo 
---
 fs/btrfs/delayed-inode.c | 29 -
 fs/btrfs/delayed-inode.h |  4 ++--
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index db9f45082c61..72569849828e 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -42,8 +42,8 @@ static inline void btrfs_init_delayed_node(
delayed_node->root = root;
delayed_node->inode_id = inode_id;
refcount_set(_node->refs, 0);
-   delayed_node->ins_root = RB_ROOT;
-   delayed_node->del_root = RB_ROOT;
+   delayed_node->ins_root = RB_ROOT_CACHED;
+   delayed_node->del_root = RB_ROOT_CACHED;
mutex_init(_node->mutex);
INIT_LIST_HEAD(_node->n_list);
INIT_LIST_HEAD(_node->p_list);
@@ -390,7 +390,7 @@ static struct btrfs_delayed_item 
*__btrfs_lookup_delayed_insertion_item(
struct btrfs_delayed_node *delayed_node,
struct btrfs_key *key)
 {
-   return __btrfs_lookup_delayed_item(_node->ins_root, key,
+   return __btrfs_lookup_delayed_item(_node->ins_root.rb_root, key,
   NULL, NULL);
 }
 
@@ -400,9 +400,10 @@ static int __btrfs_add_delayed_item(struct 
btrfs_delayed_node *delayed_node,
 {
struct rb_node **p, *node;
struct rb_node *parent_node = NULL;
-   struct rb_root *root;
+   struct rb_root_cached *root;
struct btrfs_delayed_item *item;
int cmp;
+   bool leftmost = true;
 
if (action == BTRFS_DELAYED_INSERTION_ITEM)
root = _node->ins_root;
@@ -410,7 +411,7 @@ static int __btrfs_add_delayed_item(struct 
btrfs_delayed_node *delayed_node,
root = _node->del_root;
else
BUG();
-   p = >rb_node;
+   p = >rb_root.rb_node;
node = >rb_node;
 
while (*p) {
@@ -419,16 +420,18 @@ static int __btrfs_add_delayed_item(struct 
btrfs_delayed_node *delayed_node,
 rb_node);
 
cmp = btrfs_comp_cpu_keys(>key, >key);
-   if (cmp < 0)
+   if (cmp < 0) {
p = &(*p)->rb_right;
-   else if (cmp > 0)
+   leftmost = false;
+   } else if (cmp > 0) {
p = &(*p)->rb_left;
-   else
+   } else {
return -EEXIST;
+   }
}
 
rb_link_node(node, parent_node, p);
-   rb_insert_color(node, root);
+   rb_insert_color_cached(node, root, leftmost);
ins->delayed_node = delayed_node;
ins->ins_or_del = action;
 
@@ -468,7 +471,7 @@ static void finish_one_item(struct btrfs_delayed_root 
*delayed_root)
 
 static void __btrfs_remove_delayed_item(struct btrfs_delayed_item 
*delayed_item)
 {
-   struct rb_root *root;
+   struct rb_root_cached *root;
struct btrfs_delayed_root *delayed_root;
 
delayed_root = delayed_item->delayed_node->root->fs_info->delayed_root;
@@ -482,7 +485,7 @@ static void __btrfs_remove_delayed_item(struct 
btrfs_delayed_item *delayed_item)
else
root = _item->delayed_node->del_root;
 
-   rb_erase(_item->rb_node, root);
+   rb_erase_cached(_item->rb_node, root);
delayed_item->delayed_node->count--;
 
finish_one_item(delayed_root);
@@ -503,7 +506,7 @@ static struct btrfs_delayed_item 
*__btrfs_first_delayed_insertion_item(
struct rb_node *p;
struct btrfs_delayed_item *item = NULL;
 
-   p = rb_first(_node->ins_root);
+   p = rb_first_cached(_node->ins_root);
if (p)
item = rb_entry(p, struct btrfs_delayed_item, rb_node);
 
@@ -516,7 +519,7 @@ static struct btrfs_delayed_item 
*__btrfs_first_delayed_deletion_item(
struct rb_node *p;
struct btrfs_delayed_item *item = NULL;
 
-   p = rb_first(_node->del_root);
+   p = rb_first_cached(_node->del_root);
if (p)
item = rb_entry(p, struct btrfs_delayed_item, rb_node);
 
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index 33536cd681d4..74ae226ffaf0 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -50,8 +50,8 @@ struct btrfs_delayed_node {
 * is waiting to be dealt with by the async worker.
 */
struct list_head p_list;
-   struct rb_root ins_root;
-   struct rb_root del_root;
+   struct rb_root_cached ins_root;
+   struct rb_root_cached del_root;
struct mutex mutex;
struct btrfs_inode_item inode_item;
refcount_t refs;
-- 
1.8.3.1



[PATCH 5/5] Btrfs: preftree: use rb_first_cached

2018-08-22 Thread Liu Bo
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

While resolving indirect refs and missing refs, it always looks for the
first rb entry in a while loop, so it's helpful to use rb_first_cached()
instead.

Signed-off-by: Liu Bo 
---
 fs/btrfs/backref.c | 32 +---
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index bc1ea6e9ea4f..49f2188e728b 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -112,11 +112,11 @@ static int find_extent_in_eb(const struct extent_buffer 
*eb,
 }
 
 struct preftree {
-   struct rb_root root;
+   struct rb_root_cached root;
unsigned int count;
 };
 
-#define PREFTREE_INIT  { .root = RB_ROOT, .count = 0 }
+#define PREFTREE_INIT  { .root = RB_ROOT_CACHED, .count = 0 }
 
 struct preftrees {
struct preftree direct;/* BTRFS_SHARED_[DATA|BLOCK]_REF_KEY */
@@ -225,14 +225,15 @@ static void prelim_ref_insert(const struct btrfs_fs_info 
*fs_info,
  struct prelim_ref *newref,
  struct share_check *sc)
 {
-   struct rb_root *root;
+   struct rb_root_cached *root;
struct rb_node **p;
struct rb_node *parent = NULL;
struct prelim_ref *ref;
int result;
+   bool leftmost = true;
 
root = >root;
-   p = >rb_node;
+   p = >rb_root.rb_node;
 
while (*p) {
parent = *p;
@@ -242,6 +243,7 @@ static void prelim_ref_insert(const struct btrfs_fs_info 
*fs_info,
p = &(*p)->rb_left;
} else if (result > 0) {
p = &(*p)->rb_right;
+   leftmost = false;
} else {
/* Identical refs, merge them and free @newref */
struct extent_inode_elem *eie = ref->inode_list;
@@ -272,7 +274,7 @@ static void prelim_ref_insert(const struct btrfs_fs_info 
*fs_info,
preftree->count++;
trace_btrfs_prelim_ref_insert(fs_info, newref, NULL, preftree->count);
rb_link_node(>rbnode, parent, p);
-   rb_insert_color(>rbnode, root);
+   rb_insert_color_cached(>rbnode, root, leftmost);
 }
 
 /*
@@ -283,11 +285,11 @@ static void prelim_release(struct preftree *preftree)
 {
struct prelim_ref *ref, *next_ref;
 
-   rbtree_postorder_for_each_entry_safe(ref, next_ref, >root,
-rbnode)
+   rbtree_postorder_for_each_entry_safe(ref, next_ref,
+>root.rb_root, rbnode)
free_pref(ref);
 
-   preftree->root = RB_ROOT;
+   preftree->root = RB_ROOT_CACHED;
preftree->count = 0;
 }
 
@@ -627,7 +629,7 @@ static int resolve_indirect_refs(struct btrfs_fs_info 
*fs_info,
 * freeing the entire indirect tree when we're done.  In some test
 * cases, the tree can grow quite large (~200k objects).
 */
-   while ((rnode = rb_first(>indirect.root))) {
+   while ((rnode = rb_first_cached(>indirect.root))) {
struct prelim_ref *ref;
 
ref = rb_entry(rnode, struct prelim_ref, rbnode);
@@ -637,7 +639,7 @@ static int resolve_indirect_refs(struct btrfs_fs_info 
*fs_info,
goto out;
}
 
-   rb_erase(>rbnode, >indirect.root);
+   rb_erase_cached(>rbnode, >indirect.root);
preftrees->indirect.count--;
 
if (ref->count == 0) {
@@ -717,9 +719,9 @@ static int add_missing_keys(struct btrfs_fs_info *fs_info,
struct preftree *tree = >indirect_missing_keys;
struct rb_node *node;
 
-   while ((node = rb_first(>root))) {
+   while ((node = rb_first_cached(>root))) {
ref = rb_entry(node, struct prelim_ref, rbnode);
-   rb_erase(node, >root);
+   rb_erase_cached(node, >root);
 
BUG_ON(ref->parent);/* should not be a direct ref */
BUG_ON(ref->key_for_search.type);
@@ -1229,14 +1231,14 @@ static int find_parent_nodes(struct btrfs_trans_handle 
*trans,
if (ret)
goto out;
 
-   WARN_ON(!RB_EMPTY_ROOT(_missing_keys.root));
+   WARN_ON(!RB_EMPTY_ROOT(_missing_keys.root.rb_root));
 
ret = resolve_indirect_refs(fs_info, path, time_seq, ,
extent_item_pos, total_refs, sc, 
ignore_offset);
if (ret)
goto out;
 
-   WARN_ON(!RB_EMPTY_ROOT());
+   WARN_ON(!RB_EMPTY_ROOT(_root));
 
/*
 * This walks the tree of merged and resolved refs. Tree blocks are
@@ -1245,7 +1247,7 @@ static int find_parent_nodes(struct btrfs_trans_handle 
*trans,
 *
 * We release the entire tree in one

[PATCH] Btrfs: use next_state in find_first_extent_bit

2018-08-22 Thread Liu Bo
As next_state() is already defined to get the next state, update
find_first_extent_bit() to use the helper.

No functional changes.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 628f1aef34b0..44397fa909c2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1424,20 +1424,15 @@ int find_first_extent_bit(struct extent_io_tree *tree, 
u64 start,
  struct extent_state **cached_state)
 {
struct extent_state *state;
-   struct rb_node *n;
int ret = 1;
 
spin_lock(>lock);
if (cached_state && *cached_state) {
state = *cached_state;
if (state->end == start - 1 && extent_state_in_tree(state)) {
-   n = rb_next(>rb_node);
-   while (n) {
-   state = rb_entry(n, struct extent_state,
-rb_node);
+   while ((state = next_state(state)) != NULL) {
if (state->state & bits)
goto got_it;
-   n = rb_next(n);
}
free_extent_state(*cached_state);
*cached_state = NULL;
-- 
1.8.3.1



Re: [PATCH v2] Btrfs: kill btrfs_clear_path_blocking

2018-08-22 Thread Liu Bo
On Wed, Aug 22, 2018 at 03:13:28PM +0200, David Sterba wrote:
> On Wed, Aug 22, 2018 at 05:54:37AM +0800, Liu Bo wrote:
> > w/o patch:
> > real2m27.307s
> > user0m12.839s
> > sys 13m42.831s
> > 
> > w/ patch:
> > real1m2.273s
> > user0m15.802s
> > sys 8m16.495s
> 
> This looks good. The change does not have any potential correctness
> problems right? I'm asking if its more or less safe to add it to
> for-next now. I have pending cleanups in the extent buffer locking, so I
> can finish it and post so the changes are done at once.

I ran xfstests at least 10 times; without patch 2, I didn't see
any problems related to patch 1.

thanks,
-liubo


Re: [PATCH 0/2] address lock contention of btree root

2018-08-21 Thread Liu Bo
On Tue, Aug 21, 2018 at 02:48:47PM -0400, Chris Mason wrote:
> On 21 Aug 2018, at 14:15, Liu Bo wrote:
> 
> > On Tue, Aug 21, 2018 at 01:54:11PM -0400, Chris Mason wrote:
> > > On 16 Aug 2018, at 17:07, Liu Bo wrote:
> > > 
> > > > The lock contention on btree nodes (esp. root node) is apparently a
> > > > bottleneck when there're multiple readers and writers concurrently
> > > > trying to access them.  Unfortunately this is by design and it's not
> > > > easy to fix it unless with some complex changes, however, there is
> > > > still some room.
> > > > 
> > > > With a stable workload based on fsmark which has 16 threads creating
> > > > 1,600K files, we could see that a good amount of overhead comes from
> > > > switching path between spinning mode and blocking mode in
> > > > btrfs_search_slot().
> > > > 
> > > > Patch 1 provides more details about the overhead and test
> > > > results from
> > > > fsmark and dbench.
> > > > Patch 2 kills leave_spinning due to the behaviour change from
> > > > patch 1.
> > > 
> > > This is really interesting, do you have numbers about how often we
> > > are able
> > > to stay spinning?
> > > 
> > 
> > I didn't gather how long we stay spinning,
> 
> I'm less worried about length of time spinning than I am how often we're
> able to call btrfs_search_slot() without ever blocking.  If one caller ends
> up going into blocking mode, it can cascade into all of the other callers,
> which can slow things down in low-to-medium contention workloads.

Right, these blocking cases could be COW, split_node, split_leaf, etc.

> 
> The flip side is that maybe the adaptive spinning in the mutexes is good
> enough now and we can just deleting the spinning completely.
>

Looks like so.

> > but I was tracking
> > 
> > (1) how long a read lock or write lock can wait until it gets the
> > lock, with vanilla kernel, for read lock, it could be up to 10ms while
> > for write lock, it could be up to 200ms.
> 
> Nice, please add the stats to the patch descriptions, both before and after.
> I'd love to see a histogram like you can get out of bcc's argdist.py.

Just gave another round of testing with the bcc tool funclatency: for the
write lock, the worst latency could be up to >500ms, but it could be that
kprobe itself adds some cost to the latency.

thanks,
-liubo


Re: [PATCH 2/2] Btrfs: kill leave_spinning

2018-08-21 Thread Liu Bo
Please kindly ignore this patch as it did introduce a soft lockup when
btrfs_search_slot() returns a spinning path and its callers go to
sleep before releasing the path.

thanks,
liubo


On Thu, Aug 16, 2018 at 2:07 PM, Liu Bo  wrote:
> As btrfs_clear_path_blocking() turns out to be a major source of lock
> contention, we've killed it, and without it btrfs_search_slot() and
> btrfs_search_old_slot() are not able to return a path in spinning
> mode, so let's kill leave_spinning, too.
>
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/backref.c|  3 ---
>  fs/btrfs/ctree.c  | 16 +++-
>  fs/btrfs/ctree.h  |  1 -
>  fs/btrfs/delayed-inode.c  |  4 
>  fs/btrfs/dir-item.c   |  1 -
>  fs/btrfs/export.c |  1 -
>  fs/btrfs/extent-tree.c|  7 ---
>  fs/btrfs/extent_io.c  |  1 -
>  fs/btrfs/file-item.c  |  4 
>  fs/btrfs/free-space-tree.c|  2 --
>  fs/btrfs/inode-item.c |  6 --
>  fs/btrfs/inode.c  |  8 
>  fs/btrfs/ioctl.c  |  3 ---
>  fs/btrfs/qgroup.c |  2 --
>  fs/btrfs/super.c  |  2 --
>  fs/btrfs/tests/qgroup-tests.c |  4 
>  16 files changed, 3 insertions(+), 62 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index ae750b1574a2..70c629b90710 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1613,13 +1613,11 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, 
> struct btrfs_path *path,
> s64 bytes_left = ((s64)size) - 1;
> struct extent_buffer *eb = eb_in;
> struct btrfs_key found_key;
> -   int leave_spinning = path->leave_spinning;
> struct btrfs_inode_ref *iref;
>
> if (bytes_left >= 0)
> dest[bytes_left] = '\0';
>
> -   path->leave_spinning = 1;
> while (1) {
> bytes_left -= name_len;
> if (bytes_left >= 0)
> @@ -1665,7 +1663,6 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, 
> struct btrfs_path *path,
> }
>
> btrfs_release_path(path);
> -   path->leave_spinning = leave_spinning;
>
> if (ret)
> return ERR_PTR(ret);
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 8b31caa60b0a..d2df7cfbec06 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2875,14 +2875,10 @@ int btrfs_search_slot(struct btrfs_trans_handle 
> *trans, struct btrfs_root *root,
> }
> ret = 1;
>  done:
> -   /*
> -* we don't really know what they plan on doing with the path
> -* from here on, so for now just mark it as blocking
> -*/
> -   if (!p->leave_spinning)
> -   btrfs_set_path_blocking(p);
> if (ret < 0 && !p->skip_release_on_error)
> btrfs_release_path(p);
> +
> +   /* path is supposed to be in blocking mode from now on. */
> return ret;
>  }
>
> @@ -2987,11 +2983,10 @@ int btrfs_search_old_slot(struct btrfs_root *root, 
> const struct btrfs_key *key,
> }
> ret = 1;
>  done:
> -   if (!p->leave_spinning)
> -   btrfs_set_path_blocking(p);
> if (ret < 0)
> btrfs_release_path(p);
>
> +   /* path is supposed to be in blocking mode from now on.*/
> return ret;
>  }
>
> @@ -5628,7 +5623,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
> btrfs_path *path,
> struct btrfs_key key;
> u32 nritems;
> int ret;
> -   int old_spinning = path->leave_spinning;
> int next_rw_lock = 0;
>
> nritems = btrfs_header_nritems(path->nodes[0]);
> @@ -5643,7 +5637,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
> btrfs_path *path,
> btrfs_release_path(path);
>
> path->keep_locks = 1;
> -   path->leave_spinning = 1;
>
> if (time_seq)
> ret = btrfs_search_old_slot(root, , path, time_seq);
> @@ -5780,9 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
> btrfs_path *path,
> ret = 0;
>  done:
> unlock_up(path, 0, 1, 0, NULL);
> -   path->leave_spinning = old_spinning;
> -   if (!old_spinning)
> -   btrfs_set_path_blocking(path);
>
> return ret;
>  }
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1aeed3c0e949..e8bddf21b7f7 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -339,7 +339,6 @@ struct btrfs_path {
> unsigned int search_for_split:1;
> unsigned int keep_locks

Re: [PATCH 0/2] address lock contention of btree root

2018-08-21 Thread Liu Bo
On Tue, Aug 21, 2018 at 08:35:30PM +0200, Holger Hoffstätte wrote:
> On 08/21/18 20:15, Liu Bo wrote:
> > I just realize that patch 2 can result in softlockup as
> > btrfs_search_slot() may return a path with all nodes being in spinning
> > lock, and if the callers want to sleep, we're in trouble.  I've
> > removed patch 2 and am re-running the test (xfstests, fsmark and
> > dbench).
> 
> You mean like this, when trying to balance? :)
> Got it only once so far, subsequent attempts worked. Otherwise everything
> seems fine.

Exactly, it didn't occur every time though, but thanks a lot for
testing it through.

thanks,
-liubo


> 
> -h
> 
> kernel: BTRFS info (device sdc1): relocating block group 4128424067072 flags 
> data
> kernel: BTRFS info (device sdc1): found 1706 extents
> kernel: INFO: rcu_sched self-detected stall on CPU
> kernel: ^I3-: (17999 ticks this GP) idle=f5e/1/4611686018427387906 
> softirq=269430/269430 fqs=5999
> kernel: ^I (t=18000 jiffies g=232869 c=232868 q=4365)
> kernel: NMI backtrace for cpu 3
> kernel: CPU: 3 PID: 4287 Comm: kworker/u8:0 Not tainted 4.18.3 #1
> kernel: Hardware name: System manufacturer System Product Name/P8Z68-V LX, 
> BIOS 4105 07/01/2013
> kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> kernel: Call Trace:
> kernel:  
> kernel:  dump_stack+0x46/0x60
> kernel:  nmi_cpu_backtrace.cold.0+0x13/0x57
> kernel:  ? lapic_can_unplug_cpu.cold.5+0x34/0x34
> kernel:  nmi_trigger_cpumask_backtrace+0x8f/0x91
> kernel:  rcu_dump_cpu_stacks+0x87/0xb2
> kernel:  rcu_check_callbacks.cold.59+0x2ac/0x430
> kernel:  ? tick_sched_handle.isra.6+0x40/0x40
> kernel:  update_process_times+0x28/0x60
> kernel:  tick_sched_handle.isra.6+0x35/0x40
> kernel:  tick_sched_timer+0x3b/0x80
> kernel:  __hrtimer_run_queues+0xfe/0x270
> kernel:  hrtimer_interrupt+0xf4/0x210
> kernel:  smp_apic_timer_interrupt+0x56/0x110
> kernel:  apic_timer_interrupt+0xf/0x20
> kernel:  
> kernel: RIP: 0010:queued_write_lock_slowpath+0x4a/0x80
> kernel: Code: ff 00 00 00 f0 0f b1 13 85 c0 74 32 f0 81 03 00 01 00 00 ba ff 
> 00 00 00 b9 00 01 00 00 8b 03 3d 00 01 00 00 74 0b f3 90 8b 03 <3d> 00 01 00 
> 00 75 f5 89 c8 f0 0f b1 13 3d 00 01 00 00 75 df c6 43
> kernel: RSP: 0018:c940fc40 EFLAGS: 0206 ORIG_RAX: ff13
> kernel: RAX: 0300 RBX: 88071abf86f0 RCX: 0100
> kernel: RDX: 00ff RSI: 8800 RDI: 88071abf86f0
> kernel: RBP:  R08: 88071abf8690 R09: 880724ebeb58
> kernel: R10: 880724ebeb80 R11:  R12: 0001
> kernel: R13: 8803d0db5f54 R14: 1600 R15: 0006
> kernel:  btrfs_try_tree_write_lock+0x23/0x60 [btrfs]
> kernel:  btrfs_search_slot+0x2df/0x970 [btrfs]
> kernel:  btrfs_mark_extent_written+0xb0/0xac0 [btrfs]
> kernel:  ? kmem_cache_alloc+0x1a5/0x1b0
> kernel:  btrfs_finish_ordered_io+0x2e2/0x7a0 [btrfs]
> kernel:  normal_work_helper+0xad/0x2c0 [btrfs]
> kernel:  process_one_work+0x1e3/0x390
> kernel:  worker_thread+0x2d/0x3c0
> kernel:  ? process_one_work+0x390/0x390
> kernel:  kthread+0x111/0x130
> kernel:  ? kthread_flush_work_fn+0x10/0x10
> kernel:  ret_from_fork+0x1f/0x30


[PATCH v2] Btrfs: kill btrfs_clear_path_blocking

2018-08-21 Thread Liu Bo
Btrfs's btree locking has two modes, spinning mode and blocking mode.
While searching the btree, locking is always acquired in spinning mode
and then converted to blocking mode if necessary; in some hot paths we
may switch the locking back to spinning mode by btrfs_clear_path_blocking().

When acquiring locks, both readers and writers need to wait for blocking
readers and writers to complete before doing read_lock()/write_lock().

The problem is that btrfs_clear_path_blocking() needs to switch the nodes
in the path to blocking mode first (by btrfs_set_path_blocking) to make
lockdep happy before doing its actual job of clearing the blocking state.

Switching from spinning mode to blocking mode consists of

step 1) bumping up the blocking readers/writers counter and
step 2) read_unlock()/write_unlock(),

which causes a serious ping-pong effect if there is a great amount of
concurrent readers/writers, as waiters are woken up only to go back to
sleep immediately.
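
For illustration, here is a rough sketch of what the spinning-to-blocking
transition boils down to for a write lock (hand-written for this changelog,
not the exact code in fs/btrfs/locking.c; the field names are approximations):

	static void set_lock_blocking_write_sketch(struct extent_buffer *eb)
	{
		/* step 1: announce a blocking writer */
		atomic_inc(&eb->blocking_writers);
		/*
		 * step 2: drop the spinlock; tasks spinning on
		 * write_lock(&eb->lock) now get the lock, see
		 * blocking_writers != 0, drop it again and go to
		 * sleep -- that is the ping-pong.
		 */
		write_unlock(&eb->lock);
	}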

1) Killing this kind of ping-pong results in a big improvement in my 1600k
files creation script,

MNT=/mnt/btrfs
mkfs.btrfs -f /dev/sdf
mount /dev/sdf $MNT
time fsmark  -D  1  -S0  -n  10  -s  0  -L  1 -l /tmp/fs_log.txt \
-d  $MNT/0  -d  $MNT/1 \
-d  $MNT/2  -d  $MNT/3 \
-d  $MNT/4  -d  $MNT/5 \
-d  $MNT/6  -d  $MNT/7 \
-d  $MNT/8  -d  $MNT/9 \
-d  $MNT/10  -d  $MNT/11 \
-d  $MNT/12  -d  $MNT/13 \
-d  $MNT/14  -d  $MNT/15

w/o patch:
real    2m27.307s
user    0m12.839s
sys     13m42.831s

w/ patch:
real    1m2.273s
user    0m15.802s
sys     8m16.495s

1.1) latency histogram from funclatency[1]

Overall, with the patch there are ~50% fewer write lock acquisitions, and
the worst-case write lock latency also drops to ~100ms from >500ms.


w/o patch:

Function = btrfs_tree_lock
     msecs           : count
         0 -> 1      : 2385222
         2 -> 3      : 37147
         4 -> 7      : 20452
         8 -> 15     : 13131
        16 -> 31     : 3877
        32 -> 63     : 3900
        64 -> 127    : 2612
       128 -> 255    : 974
       256 -> 511    : 165
       512 -> 1023   : 13

Function = btrfs_tree_read_lock
     msecs           : count
         0 -> 1      : 6743860
         2 -> 3      : 2146
         4 -> 7      : 190
         8 -> 15     : 38
        16 -> 31     : 4


w/ patch:

Function = btrfs_tree_lock
     msecs           : count
         0 -> 1      : 1318454
         2 -> 3      : 6800
         4 -> 7      : 3664
         8 -> 15     : 2145
        16 -> 31     : 809
        32 -> 63     : 219
        64 -> 127    : 10

Function = btrfs_tree_read_lock
     msecs           : count
         0 -> 1      : 6854317
         2 -> 3      : 2383
         4 -> 7      : 601
         8 -> 15     : 92

2) dbench also proves the improvement,
dbench -t 120 -D /mnt/btrfs 16

w/o patch:
Throughput 158.363 MB/sec

w/ patch:
Throughput 449.52 MB/sec

3) xfstests didn't show any additional failures.

One thing to note is that callers may set path->leave_spinning to have
all nodes in the path stay in spinning mode, which means such callers
are prepared to not sleep before releasing the path; but it causes no
problems if they get back a path in blocking mode instead, since
choosing not to sleep is always safe.
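
For reference, the caller pattern this note refers to looks roughly like
the following (an illustrative sketch only, assuming trans, root and key
are already set up; it is not a specific call site):

	struct btrfs_path *path;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	/* ask btrfs_search_slot() to keep the locked nodes spinning */
	path->leave_spinning = 1;
	ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
	/* ... short, non-sleeping work on the leaf ... */
	btrfs_free_path(path);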

[1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py

Signed-off-by: Liu Bo 
---

v2

Re: [PATCH 0/2] address lock contention of btree root

2018-08-21 Thread Liu Bo
On Tue, Aug 21, 2018 at 01:54:11PM -0400, Chris Mason wrote:
> On 16 Aug 2018, at 17:07, Liu Bo wrote:
> 
> > The lock contention on btree nodes (esp. root node) is apparently a
> > bottleneck when there're multiple readers and writers concurrently
> > trying to access them.  Unfortunately this is by design and it's not
> > easy to fix it unless with some complex changes, however, there is
> > still some room.
> > 
> > With a stable workload based on fsmark which has 16 threads creating
> > 1,600K files, we could see that a good amount of overhead comes from
> > switching path between spinning mode and blocking mode in
> > btrfs_search_slot().
> > 
> > Patch 1 provides more details about the overhead and test results from
> > fsmark and dbench.
> > Patch 2 kills leave_spinning due to the behaviour change from patch 1.
> 
> This is really interesting, do you have numbers about how often we are able
> to stay spinning?
>

I didn't gather how long we stay spinning, but I was tracking

(1) how long a read lock or write lock waits until it gets the
lock; with the vanilla kernel it could be up to 10ms for a read lock
and up to 200ms for a write lock.

(2) how many switches occur while readers/writers wait for blocking locks;
for both readers and writers it can be up to 50 times.

But I guess it highly depends on the test environment; I'm using a 24
core VM and the test dispatches 16 threads to do the job.

> IOW, do we end up blocking every time?

With leave_spinning set, we don't; this part works fine.

I just realized that patch 2 can result in a softlockup, as
btrfs_search_slot() may return a path with all nodes held in spinning
locks, and if the callers want to sleep, we're in trouble.  I've
removed patch 2 and am re-running the tests (xfstests, fsmark and
dbench).

thanks,
-liubo


Re: [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking

2018-08-21 Thread Liu Bo
On Mon, Aug 20, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
> 
> 
> On 20.08.2018 09:07, Liu Bo wrote:
> > On Fri, Aug 17, 2018 at 10:24:58AM +0300, Nikolay Borisov wrote:
> >>
> >>
> >> On 17.08.2018 00:07, Liu Bo wrote:
> >>> Btrfs's btree locking has two modes, spinning mode and blocking mode,
> >>> while searching btree, locking is always acquired in spinning mode and
> >>> then converted to blocking mode if necessary, and in some hot paths we may
> >>> switch the locking back to spinning mode by btrfs_clear_path_blocking().
> >>>
> >>> When acquiring locks, both of reader and writer need to wait for blocking
> >>> readers and writers to complete before doing read_lock()/write_lock().
> >>>
> >>> The problem is that btrfs_clear_path_blocking() needs to switch nodes
> >>> in the path to blocking mode at first (by btrfs_set_path_blocking) to
> >>> make lockdep happy before doing its actual clearing blocking job.
> >>>
> >>> When switching to blocking mode from spinning mode, it consists of
> >>>
> >>> step 1) bumping up blocking readers counter and
> >>> step 2) read_unlock()/write_unlock(),
> >>>
> >>> this has caused serious ping-pong effect if there're a great amount of
> >>> concurrent readers/writers, as waiters will be woken up and go to
> >>> sleep immediately.
> >>
> >> I think this ping-pong needs to be explained in a bit more detail.
> >>
> >> Looking at the code it seems the issue happens when we have a path
> >> locked in spinning mode (via btrfs_tree_lock) - in this case we have
> >> blocking_readers/writes == 0 and write_lock acquired. Then we could
> >> potentially have multiple requests to lock the tree (in this case the
> >> root) and since the current holder is spinning then btrfs_tree_lock
> >> callers will block on the write_lock. Subsequently when the holder
> >> switches to blocking he calls
> >> btrfs_set_path_blocking->btrfs_set_lock_blocking_rw which of course
> >> increments the blocking reader/writes and calls the unlock at which
> >> point a "thundering herd" problem occurs on the lock. Am I correct?
> >>
> > 
> > That's correct, but the "thundering herd' problem is not really the
> > problem that this patch is trying to address, see more in below.
> > 
> >> Furthermore I think the problem occurs not in btrfs_clear_path_blocking
> >> but rather in the initial call to btrfs_set_path_blocking.
> >>
> >> The way I see it the following sequence of operations occur:
> >>
> >> 1. Call btrfs_set_path_blocking: increments
> >> blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
> >> the lock types for the nodes in the path are
> >> BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING
> >>
> >> 2. Some time later we call btrfs_clear_path_blocking
> >> which also calls btrfs_set_path_blocking, however this time this
> >> function doesn't do anything because the lock types in path struct are
> >> already blocking and none of the 2 conditions inside
> >> btrfs_set_lock_blocking_rw will match hence this code won't be executed.
> >>
> >> Did I miss something ?
> >>
> > 
> > Thanks for sharing your thoughts.
> > 
> > We have to set path blocking as the subsequent code needs to sleep,
> > it's the cost we have to pay by design, I'm OK with it.
> > 
> > The problem is that btrfs_clear_path_blocking always tries to set path
> > to blocking before clearing path blocking as it needs to ensure
> > spin_locks on each level are acquired in right order.  And if all the
> > nodes in the path are already in spinning mode (say that
> > should_cow_block() decides to not COW), setting path blocking doesn't
> > really make sense and causes others waiting on the lock to do a
> > wake-up and a go-to-sleep.
> 
> Ok for the case of btrfs_clear_path_blocking in btrfs_search_path I
> agree this could happen but it can't in any of the other hunks you
> touched since they all call btrfs_clear_path_blocking AFTER
> btrfs_set_path_blocking was called. So the problem can't stem from them
> since they take the perf hit at the time btrfs_set_path_blocking is
> called. IMO this is important information for context and needs to be
> included in the changelog. In short the only problematic code is the one
> in btrfs_search_slot.
> 
> Another less intrusive idea is to touch only btrfs

Re: [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking

2018-08-20 Thread Liu Bo
On Fri, Aug 17, 2018 at 10:24:58AM +0300, Nikolay Borisov wrote:
> 
> 
> On 17.08.2018 00:07, Liu Bo wrote:
> > Btrfs's btree locking has two modes, spinning mode and blocking mode,
> > while searching btree, locking is always acquired in spinning mode and
> > then converted to blocking mode if necessary, and in some hot paths we may
> > switch the locking back to spinning mode by btrfs_clear_path_blocking().
> > 
> > When acquiring locks, both of reader and writer need to wait for blocking
> > readers and writers to complete before doing read_lock()/write_lock().
> > 
> > The problem is that btrfs_clear_path_blocking() needs to switch nodes
> > in the path to blocking mode at first (by btrfs_set_path_blocking) to
> > make lockdep happy before doing its actual clearing blocking job.
> > 
> > When switching to blocking mode from spinning mode, it consists of
> > 
> > step 1) bumping up blocking readers counter and
> > step 2) read_unlock()/write_unlock(),
> > 
> > this has caused serious ping-pong effect if there're a great amount of
> > concurrent readers/writers, as waiters will be woken up and go to
> > sleep immediately.
> 
> I think this ping-pong needs to be explained in a bit more detail.
> 
> Looking at the code it seems the issue happens when we have a path
> locked in spinning mode (via btrfs_tree_lock) - in this case we have
> blocking_readers/writes == 0 and write_lock acquired. Then we could
> potentially have multiple requests to lock the tree (in this case the
> root) and since the current holder is spinning then btrfs_tree_lock
> callers will block on the write_lock. Subsequently when the holder
> switches to blocking he calls
> btrfs_set_path_blocking->btrfs_set_lock_blocking_rw which of course
> increments the blocking reader/writes and calls the unlock at which
> point a "thundering herd" problem occurs on the lock. Am I correct?
>

That's correct, but the "thundering herd' problem is not really the
problem that this patch is trying to address, see more in below.

> Furthermore I think the problem occurs not in btrfs_clear_path_blocking
> but rather in the initial call to btrfs_set_path_blocking.
> 
> The way I see it the following sequence of operations occur:
> 
> 1. Call btrfs_set_path_blocking: increments
> blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
> the lock types for the nodes in the path are
> BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING
> 
> 2. Some time later we call btrfs_clear_path_blocking
> which also calls btrfs_set_path_blocking, however this time this
> function doesn't do anything because the lock types in path struct are
> already blocking and none of the 2 conditions inside
> btrfs_set_lock_blocking_rw will match hence this code won't be executed.
> 
> Did I miss something ?
> 

Thanks for sharing your thoughts.

We have to set the path blocking because the subsequent code needs to
sleep; that's the cost we pay by design, and I'm OK with it.

The problem is that btrfs_clear_path_blocking() always tries to set the
path to blocking before clearing the blocking state, as it needs to
ensure the spin_locks on each level are acquired in the right order.
If all the nodes in the path are already in spinning mode (say that
should_cow_block() decides not to COW), setting the path blocking
doesn't really make sense and only causes others waiting on the lock to
wake up and immediately go back to sleep.

My trace showed that there could be up to 50 such switches for a single
lock waiter, and the difference in completion time also shows how big
the impact is.
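
To make the wasted round-trip concrete, here is a rough outline of what
btrfs_clear_path_blocking() ends up doing when every lock in the path is
still spinning (hand-written sketch, lockdep details omitted):

	btrfs_set_path_blocking(p);
		/* spin -> blocking on every level: bump the blocking
		 * counter and unlock; waiters grab the spinlock, see the
		 * counter and go to sleep */
	for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--)
		if (p->nodes[i] && p->locks[i])
			btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
				/* blocking -> spin: retake the spinlock,
				 * drop the counter and wake the sleepers
				 * up again */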

thanks,
-liubo
> > 
> > 1) Killing this kind of ping-pong results in a big improvement in my 1600k
> > files creation script,
> > 
> > MNT=/mnt/btrfs
> > mkfs.btrfs -f /dev/sdf
> > mount /dev/def $MNT
> > time fsmark  -D  1  -S0  -n  10  -s  0  -L  1 -l /tmp/fs_log.txt \
> > -d  $MNT/0  -d  $MNT/1 \
> > -d  $MNT/2  -d  $MNT/3 \
> > -d  $MNT/4  -d  $MNT/5 \
> > -d  $MNT/6  -d  $MNT/7 \
> > -d  $MNT/8  -d  $MNT/9 \
> > -d  $MNT/10  -d  $MNT/11 \
> > -d  $MNT/12  -d  $MNT/13 \
> > -d  $MNT/14  -d  $MNT/15
> > 
> > w/o patch:
> > real2m27.307s
> > user0m12.839s
> > sys 13m42.831s
> > 
> > w/ patch:
> > real1m2.273s
> > user0m15.802s
> > sys 8m16.495s
> > 
> > 2) dbench also proves the improvement,
> > dbench -t 120 -D /mnt/btrfs 16
> > 
> > w/o patch:
> > Throughput 158.363 MB/sec
> > 
> > w/ patch:
> > Throughput 449.52 MB/sec
> > 
> > 3) xfstests didn't show any additional failures.

[PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking

2018-08-16 Thread Liu Bo
Btrfs's btree locking has two modes, spinning mode and blocking mode.
While searching the btree, locking is always acquired in spinning mode
and then converted to blocking mode if necessary; in some hot paths we
may switch the locking back to spinning mode by btrfs_clear_path_blocking().

When acquiring locks, both readers and writers need to wait for blocking
readers and writers to complete before doing read_lock()/write_lock().

The problem is that btrfs_clear_path_blocking() needs to switch the nodes
in the path to blocking mode first (by btrfs_set_path_blocking) to make
lockdep happy before doing its actual job of clearing the blocking state.

Switching from spinning mode to blocking mode consists of

step 1) bumping up the blocking readers/writers counter and
step 2) read_unlock()/write_unlock(),

which causes a serious ping-pong effect if there is a great amount of
concurrent readers/writers, as waiters are woken up only to go back to
sleep immediately.

1) Killing this kind of ping-pong results in a big improvement in my 1600k
files creation script,

MNT=/mnt/btrfs
mkfs.btrfs -f /dev/sdf
mount /dev/sdf $MNT
time fsmark  -D  1  -S0  -n  10  -s  0  -L  1 -l /tmp/fs_log.txt \
-d  $MNT/0  -d  $MNT/1 \
-d  $MNT/2  -d  $MNT/3 \
-d  $MNT/4  -d  $MNT/5 \
-d  $MNT/6  -d  $MNT/7 \
-d  $MNT/8  -d  $MNT/9 \
-d  $MNT/10  -d  $MNT/11 \
-d  $MNT/12  -d  $MNT/13 \
-d  $MNT/14  -d  $MNT/15

w/o patch:
real    2m27.307s
user    0m12.839s
sys     13m42.831s

w/ patch:
real    1m2.273s
user    0m15.802s
sys     8m16.495s

2) dbench also proves the improvement,
dbench -t 120 -D /mnt/btrfs 16

w/o patch:
Throughput 158.363 MB/sec

w/ patch:
Throughput 449.52 MB/sec

3) xfstests didn't show any additional failures.

One thing to note is that callers may set leave_spinning to have all
nodes in the path stay in spinning mode, which means such callers are
prepared to not sleep before releasing the path; but it causes no
problems if they get back a path in blocking mode instead, since
choosing not to sleep is always safe.  IOW, we can just get rid of
leave_spinning.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.c | 57 
 fs/btrfs/ctree.h |  2 --
 fs/btrfs/delayed-inode.c |  3 ---
 3 files changed, 4 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d436fb4c002e..8b31caa60b0a 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
}
 }
 
-/*
- * reset all the locked nodes in the patch to spinning locks.
- *
- * held is used to keep lockdep happy, when lockdep is enabled
- * we set held to a blocking lock before we go around and
- * retake all the spinlocks in the path.  You can safely use NULL
- * for held
- */
-noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
-   struct extent_buffer *held, int held_rw)
-{
-   int i;
-
-   if (held) {
-   btrfs_set_lock_blocking_rw(held, held_rw);
-   if (held_rw == BTRFS_WRITE_LOCK)
-   held_rw = BTRFS_WRITE_LOCK_BLOCKING;
-   else if (held_rw == BTRFS_READ_LOCK)
-   held_rw = BTRFS_READ_LOCK_BLOCKING;
-   }
-   btrfs_set_path_blocking(p);
-
-   for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
-   if (p->nodes[i] && p->locks[i]) {
-   btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
-   if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
-   p->locks[i] = BTRFS_WRITE_LOCK;
-   else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
-   p->locks[i] = BTRFS_READ_LOCK;
-   }
-   }
-
-   if (held)
-   btrfs_clear_lock_blocking_rw(held, held_rw);
-}
-
 /* this also releases the path */
 void btrfs_free_path(struct btrfs_path *p)
 {
@@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
}
}
 
-   btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
btrfs_tree_read_unlock_blocking(eb);
free_extent_buffer(eb);
 
@@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path 
*path, int level)
btrfs_set_path_blocking(p);
reada_for_balance(fs_info, p, level);
sret = split_node(trans, root, p, level);
-   btrfs_clear_path_blocking(p, NULL, 0);
 
BUG_ON(sret > 0);
if (sret) {
@@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path 
*path, int level)
btrfs_set_path_blocking(p);
reada_for_balance(fs_info, p, level);
sret = balance_level(trans, root, p, level);
-   btrfs_clear_path_blocking(p, NULL, 0);
 
if (sret) {
 

[PATCH 2/2] Btrfs: kill leave_spinning

2018-08-16 Thread Liu Bo
As btrfs_clear_path_blocking() turned out to be a major source of lock
contention, we've killed it; without it, btrfs_search_slot() and
btrfs_search_old_slot() are no longer able to return a path in spinning
mode, so let's kill leave_spinning, too.

Signed-off-by: Liu Bo 
---
 fs/btrfs/backref.c|  3 ---
 fs/btrfs/ctree.c  | 16 +++-
 fs/btrfs/ctree.h  |  1 -
 fs/btrfs/delayed-inode.c  |  4 
 fs/btrfs/dir-item.c   |  1 -
 fs/btrfs/export.c |  1 -
 fs/btrfs/extent-tree.c|  7 ---
 fs/btrfs/extent_io.c  |  1 -
 fs/btrfs/file-item.c  |  4 
 fs/btrfs/free-space-tree.c|  2 --
 fs/btrfs/inode-item.c |  6 --
 fs/btrfs/inode.c  |  8 
 fs/btrfs/ioctl.c  |  3 ---
 fs/btrfs/qgroup.c |  2 --
 fs/btrfs/super.c  |  2 --
 fs/btrfs/tests/qgroup-tests.c |  4 
 16 files changed, 3 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index ae750b1574a2..70c629b90710 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1613,13 +1613,11 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, 
struct btrfs_path *path,
s64 bytes_left = ((s64)size) - 1;
struct extent_buffer *eb = eb_in;
struct btrfs_key found_key;
-   int leave_spinning = path->leave_spinning;
struct btrfs_inode_ref *iref;
 
if (bytes_left >= 0)
dest[bytes_left] = '\0';
 
-   path->leave_spinning = 1;
while (1) {
bytes_left -= name_len;
if (bytes_left >= 0)
@@ -1665,7 +1663,6 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, 
struct btrfs_path *path,
}
 
btrfs_release_path(path);
-   path->leave_spinning = leave_spinning;
 
if (ret)
return ERR_PTR(ret);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 8b31caa60b0a..d2df7cfbec06 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2875,14 +2875,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
}
ret = 1;
 done:
-   /*
-* we don't really know what they plan on doing with the path
-* from here on, so for now just mark it as blocking
-*/
-   if (!p->leave_spinning)
-   btrfs_set_path_blocking(p);
if (ret < 0 && !p->skip_release_on_error)
btrfs_release_path(p);
+
+   /* path is supposed to be in blocking mode from now on. */
return ret;
 }
 
@@ -2987,11 +2983,10 @@ int btrfs_search_old_slot(struct btrfs_root *root, 
const struct btrfs_key *key,
}
ret = 1;
 done:
-   if (!p->leave_spinning)
-   btrfs_set_path_blocking(p);
if (ret < 0)
btrfs_release_path(p);
 
+   /* path is supposed to be in blocking mode from now on.*/
return ret;
 }
 
@@ -5628,7 +5623,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
btrfs_path *path,
struct btrfs_key key;
u32 nritems;
int ret;
-   int old_spinning = path->leave_spinning;
int next_rw_lock = 0;
 
nritems = btrfs_header_nritems(path->nodes[0]);
@@ -5643,7 +5637,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
btrfs_path *path,
btrfs_release_path(path);
 
path->keep_locks = 1;
-   path->leave_spinning = 1;
 
if (time_seq)
ret = btrfs_search_old_slot(root, &key, path, time_seq);
@@ -5780,9 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
btrfs_path *path,
ret = 0;
 done:
unlock_up(path, 0, 1, 0, NULL);
-   path->leave_spinning = old_spinning;
-   if (!old_spinning)
-   btrfs_set_path_blocking(path);
 
return ret;
 }
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1aeed3c0e949..e8bddf21b7f7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -339,7 +339,6 @@ struct btrfs_path {
unsigned int search_for_split:1;
unsigned int keep_locks:1;
unsigned int skip_locking:1;
-   unsigned int leave_spinning:1;
unsigned int search_commit_root:1;
unsigned int need_commit_sem:1;
unsigned int skip_release_on_error:1;
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index db9f45082c61..e6fbcdbc313e 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1138,7 +1138,6 @@ static int __btrfs_run_delayed_items(struct 
btrfs_trans_handle *trans, int nr)
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
-   path->leave_spinning = 1;
 
block_rsv = trans->block_rsv;
trans->block_rsv = &fs_info->delayed_block_rsv;
@@ -1203,7 +1202,6 @@ int btrfs_commit_inode_delayed_items(struct 
btrfs_trans_handle *trans,
btrfs_release_delayed_node(delayed_node);
 

[PATCH 0/2] address lock contention of btree root

2018-08-16 Thread Liu Bo
The lock contention on btree nodes (especially the root node) is
apparently a bottleneck when multiple readers and writers are
concurrently trying to access them.  Unfortunately this is by design and
not easy to fix without some complex changes; however, there is still
some room for improvement.

With a stable workload based on fsmark which has 16 threads creating
1,600K files, we could see that a good amount of overhead comes from
switching path between spinning mode and blocking mode in
btrfs_search_slot().

Patch 1 provides more details about the overhead and test results from
fsmark and dbench.
Patch 2 kills leave_spinning due to the behaviour change from patch 1.

Liu Bo (2):
  Btrfs: kill btrfs_clear_path_blocking
  Btrfs: kill leave_spinning

 fs/btrfs/backref.c|  3 --
 fs/btrfs/ctree.c  | 73 +--
 fs/btrfs/ctree.h  |  3 --
 fs/btrfs/delayed-inode.c  |  7 -
 fs/btrfs/dir-item.c   |  1 -
 fs/btrfs/export.c |  1 -
 fs/btrfs/extent-tree.c|  7 -
 fs/btrfs/extent_io.c  |  1 -
 fs/btrfs/file-item.c  |  4 ---
 fs/btrfs/free-space-tree.c|  2 --
 fs/btrfs/inode-item.c |  6 
 fs/btrfs/inode.c  |  8 -
 fs/btrfs/ioctl.c  |  3 --
 fs/btrfs/qgroup.c |  2 --
 fs/btrfs/super.c  |  2 --
 fs/btrfs/tests/qgroup-tests.c |  4 ---
 16 files changed, 7 insertions(+), 120 deletions(-)

-- 
1.8.3.1



[PATCH] Btrfs: remove always true if branch in btrfs_get_extent

2018-08-16 Thread Liu Bo
@path is always NULL when it comes to the if branch.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8b135a46835f..4b79916472fb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6825,18 +6825,15 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
em->len = (u64)-1;
em->block_len = (u64)-1;
 
+   path = btrfs_alloc_path();
if (!path) {
-   path = btrfs_alloc_path();
-   if (!path) {
-   err = -ENOMEM;
-   goto out;
-   }
-   /*
-* Chances are we'll be called again, so go ahead and do
-* readahead
-*/
-   path->reada = READA_FORWARD;
+   err = -ENOMEM;
+   goto out;
}
+   /*
+* Chances are we'll be called again, so go ahead and do readahead.
+*/
+   path->reada = READA_FORWARD;
 
ret = btrfs_lookup_file_extent(NULL, root, path, objectid, start, 0);
if (ret < 0) {
-- 
1.8.3.1



Re: [PATCH] Btrfs: remove duplicated btrfs_release_path in btrfs_unlink_subvol

2018-08-15 Thread Liu Bo
On Wed, Aug 15, 2018 at 4:48 AM, David Sterba  wrote:
> On Wed, Aug 15, 2018 at 10:52:56AM +0800, Liu Bo wrote:
>> On Tue, Aug 14, 2018 at 12:46:00PM +0200, David Sterba wrote:
>> > On Tue, Aug 14, 2018 at 10:47:09AM +0800, Liu Bo wrote:
>> > > The btrfs_release_path() is just useless as path is only used in error 
>> > > handling.
>> >
>> > Where is it duplicated? And I don't think it's useless, while the
>> > changelog does not explain why and it's not obvious from the context. If
>> > the path is locked, then releasing it right after it's not needed makes
>> > sense.  There are several potentially heavyweight operations between the
>> > release and final free.
>>
>> I see, the diff context is a little bit misleading, the logic in
>> btrfs_unlink_subvol() is like,
>
> Well, you based you patch on old code, the duplicate btrfs_release_path
> has been removed in 5b7d687ad5913a56b6a8788435d7a53990b4176d and was
> in the devel branches since like 2 weeks.
>
> The patch removes the first release while it would could have been the
> other one to save some cycles when the branch with
> btrfs_search_dir_index_item is not taken.

Ah I see, sorry for the noise.

thanks,
liubo


Re: [PATCH] Btrfs: remove duplicated btrfs_release_path in btrfs_unlink_subvol

2018-08-14 Thread Liu Bo
On Tue, Aug 14, 2018 at 12:46:00PM +0200, David Sterba wrote:
> On Tue, Aug 14, 2018 at 10:47:09AM +0800, Liu Bo wrote:
> > The btrfs_release_path() is just useless as path is only used in error 
> > handling.
> 
> Where is it duplicated? And I don't think it's useless, while the
> changelog does not explain why and it's not obvious from the context. If
> the path is locked, then releasing it right after it's not needed makes
> sense.  There are several potentially heavyweight operations between the
> release and final free.

I see, the diff context is a little bit misleading, the logic in
btrfs_unlink_subvol() is like,

{
...
btrfs_delete_one_dir_name(path);
btrfs_release_path(path);

ret = btrfs_del_root_ref(); <<<< path is not used here.
if (ret < 0) {
btrfs_search_dir_index_item(path);
btrfs_release_path(path);
}
btrfs_release_path(path);   <<<< path has been released anyway.
...
}

thanks,
-liubo


[PATCH] Btrfs: remove duplicated btrfs_release_path in btrfs_unlink_subvol

2018-08-13 Thread Liu Bo
The btrfs_release_path() is just useless as path is only used in error handling.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eba61bcb9bb3..5680e9c462a3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4148,7 +4148,6 @@ static int btrfs_unlink_subvol(struct btrfs_trans_handle 
*trans,
btrfs_release_path(path);
index = key.offset;
}
-   btrfs_release_path(path);
 
ret = btrfs_delete_delayed_dir_index(trans, fs_info, BTRFS_I(dir), 
index);
if (ret) {
-- 
1.8.3.1



[PATCH] Btrfs: do not pass write_lock_level when processing leaf

2018-08-13 Thread Liu Bo
As we're going to return, it doesn't make sense to get a new
write_lock_level from unlock_up.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 41eb47488e75..f032b48094b4 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2950,7 +2950,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
}
if (!p->search_for_split)
unlock_up(p, level, lowest_unlock,
- min_write_lock_level, 
_lock_level);
+ min_write_lock_level, NULL);
goto done;
}
}
-- 
1.8.3.1



Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Liu Bo
On Wed, Jun 6, 2018 at 9:44 PM, Chris Mason  wrote:
> On 6 Jun 2018, at 9:38, Liu Bo wrote:
>
>> On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason  wrote:
>>>
>>>
>>>
>>> On 5 Jun 2018, at 16:03, Andrew Morton wrote:
>>>
>>>> (switched to email.  Please respond via emailed reply-to-all, not via
>>>> the
>>>> bugzilla web interface).
>>>>
>>>> On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org
>>>> wrote:
>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>>>>
>>>>> Bug ID: 199931
>>>>>Summary: systemd/rtorrent file data corruption when using
>>>>> echo
>>>>> 3 >/proc/sys/vm/drop_caches
>>>>
>>>>
>>>>
>>>> A long tale of woe here.  Chris, do you think the pagecache corruption
>>>> is a general thing, or is it possible that btrfs is contributing?
>>>>
>>>> Also, that 4.4 oom-killer regression sounds very serious.
>>>
>>>
>>>
>>> This week I found a bug in btrfs file write with how we handle stable
>>> pages.
>>> Basically it works like this:
>>>
>>> write(fd, some bytes less than a page)
>>> write(fd, some bytes into the same page)
>>> btrfs prefaults the userland page
>>> lock_and_cleanup_extent_if_need()   <- stable pages
>>> wait for writeback()
>>> clear_page_dirty_for_io()
>>>
>>> At this point we have a page that was dirty and is now clean.  That's
>>> normally fine, unless our prefaulted page isn't in ram anymore.
>>>
>>> iov_iter_copy_from_user_atomic() <--- uh oh
>>>
>>> If the copy_from_user fails, we drop all our locks and retry.  But along
>>> the
>>> way, we completely lost the dirty bit on the page.  If the page is
>>> dropped
>>> by drop_caches, the writes are lost.  We'll just read back the stale
>>> contents of that page during the retry loop.  This won't result in crc
>>> errors because the bytes we lost were never crc'd.
>>>
>>
>> So we're going to carefully redirty the page under the page lock, right?
>
>
> I don't think we actually need to clean it.  We have the page locked,
> writeback won't start until we unlock.
>

My concern is that the buggy path is similar to the compression path,
where we also do the clear_page_dirty_for_io plus redirty trick to keep
faults from wandering in and changing pages underneath us, but it seems
that here we're fine even if pages get changed in between.
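
(For context, the compression-path trick I mean looks roughly like this in
compress_file_range(); a simplified sketch from memory, not the exact code:)

	/* take the dirty bits away so faults can't change the pages
	 * while they are being compressed */
	extent_range_clear_dirty_for_io(inode, start, end);
	redirty = 1;

	/* ... compress the range ... */

	/* if we bail out to uncompressed writeback, put the dirty
	 * bits back so nothing is lost */
	if (redirty)
		extent_range_redirty_for_io(inode, start, end);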

>>
>>> It could result in zeros in the file because we're basically reading a
>>> hole,
>>> and those zeros could move around in the page depending on which part of
>>> the
>>> page was dirty when the writes were lost.
>>>
>>
>> I got a question, while re-reading this page, wouldn't it read
>> old/stale on-disk data?
>
>
> If it was never written we should be treating it like a hole, but I'll
> double check.
>

Okay, I think this would also happen in the overwrite case, where
stale data lies on disk.

thanks,
liubo


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Liu Bo
On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason  wrote:
>
>
> On 5 Jun 2018, at 16:03, Andrew Morton wrote:
>
>> (switched to email.  Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org
>> wrote:
>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>>
>>> Bug ID: 199931
>>>Summary: systemd/rtorrent file data corruption when using echo
>>> 3 >/proc/sys/vm/drop_caches
>>
>>
>> A long tale of woe here.  Chris, do you think the pagecache corruption
>> is a general thing, or is it possible that btrfs is contributing?
>>
>> Also, that 4.4 oom-killer regression sounds very serious.
>
>
> This week I found a bug in btrfs file write with how we handle stable pages.
> Basically it works like this:
>
> write(fd, some bytes less than a page)
> write(fd, some bytes into the same page)
> btrfs prefaults the userland page
> lock_and_cleanup_extent_if_need()   <- stable pages
> wait for writeback()
> clear_page_dirty_for_io()
>
> At this point we have a page that was dirty and is now clean.  That's
> normally fine, unless our prefaulted page isn't in ram anymore.
>
> iov_iter_copy_from_user_atomic() <--- uh oh
>
> If the copy_from_user fails, we drop all our locks and retry.  But along the
> way, we completely lost the dirty bit on the page.  If the page is dropped
> by drop_caches, the writes are lost.  We'll just read back the stale
> contents of that page during the retry loop.  This won't result in crc
> errors because the bytes we lost were never crc'd.
>

So we're going to carefully redirty the page under the page lock, right?

> It could result in zeros in the file because we're basically reading a hole,
> and those zeros could move around in the page depending on which part of the
> page was dirty when the writes were lost.
>

I've got a question: while re-reading this page, wouldn't it read
old/stale on-disk data?

thanks,
liubo

> I spent a morning trying to trigger this with drop_caches and couldn't make
> it happen, even with schedule_timeout()s inserted and other tricks.  But I
> was able to get corruptions if I manually invalidated pages in the critical
> section.
>
> I'm working on a patch, and I'll check and see if any of the other recent
> fixes Dave integrated may have a less exotic explanation.
>
> -chris
>


[PATCH v3] Btrfs: remove unused check of skip_locking

2018-05-29 Thread Liu Bo
The check is superfluous since all callers who set search_commit_root
also have skip_locking set.

ASSERT() is put in place to ensure skip_locking is set by callers.
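
The pattern the assertion documents is the usual one in those callers
(illustrative snippet, not a specific call site):

	path->search_commit_root = 1;
	path->skip_locking = 1;		/* commit roots are read only */
	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);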

Reviewed-by: Qu Wenruo 
Signed-off-by: Liu Bo 
---
v3: Add assertion and comments to ensure skip_locking is not forgot by callers.
v2: Rebase.

 fs/btrfs/ctree.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a2bcfdf85726..9c4f6b919d51 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2623,8 +2623,11 @@ static struct extent_buffer 
*btrfs_search_slot_get_root(struct btrfs_root *root,
level = btrfs_header_level(b);
if (p->need_commit_sem)
up_read(&fs_info->commit_root_sem);
-   if (!p->skip_locking)
-   btrfs_tree_read_lock(b);
+   /*
+* Ensure that all callers have set skip_locking when
+* p->search_commit_root = 1.
+*/
+   ASSERT(p->skip_locking == 1);
 
goto out;
}
-- 
1.8.3.1



Re: [PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-29 Thread Liu Bo
On Mon, May 28, 2018 at 06:17:18PM +0200, David Sterba wrote:
> On Mon, May 28, 2018 at 04:40:28PM +0200, David Sterba wrote:
> > On Fri, May 18, 2018 at 11:00:18AM +0800, Liu Bo wrote:
> > > Here are a collection of patches I did for btrfs_search_slot().
> > > 
> > > v2: more explicit commit log for each patch.
> > > 
> > > Liu Bo (6):
> > >   Btrfs: remove superfluous free_extent_buffer
> > >   Btrfs: use more straightforward extent_buffer_uptodate
> > >   Btrfs: move get root of btrfs_search_slot to a helper
> > >   Btrfs: remove unused check of skip_locking
> > >   Btrfs: grab write lock directly if write_lock_level is the max level
> > >   Btrfs: remove always true check in unlock_up
> > 
> > 1,2,4,5,6 merged. Please update patch 3 and add the ASSERT. Thanks.
> 
> Sorry, it's the 4/6 that's not merged and I'm awaiting an updated
> versino. Thanks.

Thank you, David, I'll update it.

thanks,
-liubo


Re: [PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-23 Thread Liu Bo
On Wed, May 23, 2018 at 02:34:40PM +0200, David Sterba wrote:
> On Wed, May 23, 2018 at 10:16:55AM +0800, Su Yue wrote:
> > >>> [   47.692084] kernel BUG at fs/btrfs/locking.c:286!
> > >>
> > >> I saw the crash too but did not investigate the root cause. So I'll
> > >> remove the branch from for-next until it's fixed. Thanks for the report.
> > > 
> > > I think the problem stems from Qu's patch, which sets search_commit_root
> > > =1 but doesn't set skip_locking, as a result we don't lock the tree when
> > > we obtain a reference to the root node, yet later when traversing the
> > > tree due to skip_locking not being set we try to lock it, and this
> > > causes btrfs_assert_tree_locked to triggers. Can you test whether the
> > > following diff solves the issues:
> > > 
> > > 
> > > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > > index bc19a7d11c98..23fadb640c59 100644
> > > --- a/fs/btrfs/qgroup.c
> > > +++ b/fs/btrfs/qgroup.c
> > > @@ -2702,6 +2702,7 @@ static void btrfs_qgroup_rescan_worker(struct
> > > btrfs_work *work)
> > >  * should be recorded by qgroup
> > >  */
> > > path->search_commit_root = 1;
> > > +   path->skip_locking = 1;
> > > 
> > > err = 0;
> > > while (!err && !btrfs_fs_closing(fs_info)) {
> > > 
> > > 
> > > If it does, this only means we need to make skip_locking = 1 being
> > > conditional on search_commit_root being set and this situation should be
> > > handled in btrfs_search_slot.
> > 
> > After patching the change, btrfs/139 passes without BUG_ON.
> 
> Confirmed, I've added the fixup to for-next, Liu Bo's branch is there to
> and the crash did not show up.

Thanks a lot for the testing and fixes!

thanks,
-liubo


Re: [PATCH] Btrfs: fix the corruption by reading stale btree blocks

2018-05-18 Thread Liu Bo
On Wed, May 16, 2018 at 04:12:21PM +0200, David Sterba wrote:
> On Wed, May 16, 2018 at 11:00:14AM +0300, Nikolay Borisov wrote:
> > 
> > 
> > On 15.05.2018 20:37, Liu Bo wrote:
> > > If a btree block, aka. extent buffer, is not available in the extent
> > > buffer cache, it'll be read out from the disk instead, i.e.
> > > 
> > > btrfs_search_slot()
> > >   read_block_for_search()  # hold parent and its lock, go to read child
> > > btrfs_release_path()
> > > read_tree_block()  # read child
> > > 
> > > Unfortunately, the parent lock got released before reading child, so
> > > commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
> > > used 0 as parent transid to read the child block.  It forces
> > > read_tree_block() not to check if parent transid is different with the
> > > generation id of the child that it reads out from disk.
> > > 
> > > A simple PoC is included in btrfs/124,
> > > 
> > > 0. A two-disk raid1 btrfs,
> > > 
> > > 1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
> > > 
> > > 2. Mount this filesystem and put it in use, after a while, device tree's
> > >root got COW but block A hasn't been allocated/overwritten yet.
> > > 
> > > 3. Umount it and reload the btrfs module to remove both disks from the
> > >global @fs_devices list.
> > > 
> > > 4. mount -odegraded dev1 and write some data, so now block A is allocated
> > >to be a leaf in checksum tree.  Note that only dev1 has the latest
> > >metadata of this filesystem.
> > > 
> > > 5. Umount it and mount it again normally (with both disks), since raid1
> > >can pick up one disk by the writer task's pid, if btrfs_search_slot()
> > >needs to read block A, dev2 which does NOT have the latest metadata
> > >might be read for block A, then we got a stale block A.
> > > 
> > > 6. As parent transid is not checked, block A is marked as uptodate and
> > >put into the extent buffer cache, so the future search won't bother
> > >    to read disk again, which means it'll make changes on this stale
> > >one and make it dirty and flush it onto disk.
> > > 
> > > To avoid the problem, parent transid needs to be passed to
> > > read_tree_block().
> > > 
> > > In order to get a valid parent transid, we need to hold the parent's
> > > lock until finish reading child.
> > > 
> > > Signed-off-by: Liu Bo <bo@linux.alibaba.com>
> > 
> > Doesn't this warrant a stable tag and
> > Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
> 
> The patch will not apply to < 4.16 as it depends on the addition of
> _key check from 581c1760415c4, but yes we need it for stable.

True, we can easily fix the conflicts by removing @first_key, can't
we?

The problem does exist before that commit as well.

thanks,
-liubo


[PATCH v2 4/6] Btrfs: remove unused check of skip_locking

2018-05-17 Thread Liu Bo
The check is superfluous since all callers who set search_commit_root
also have skip_locking set.

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d12fc0474e21..8d3b09038f37 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2623,8 +2623,6 @@ static struct extent_buffer 
*btrfs_search_slot_get_root(struct btrfs_root *root,
level = btrfs_header_level(b);
if (p->need_commit_sem)
up_read(&fs_info->commit_root_sem);
-   if (!p->skip_locking)
-   btrfs_tree_read_lock(b);
 
goto out;
}
-- 
1.8.3.1



[PATCH v2 2/6] Btrfs: use more straightforward extent_buffer_uptodate

2018-05-17 Thread Liu Bo
If parent_transid "0" is passed to btrfs_buffer_uptodate(),
btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but
extent_buffer_uptodate() is preferred since we don't have to look into
verify_parent_transid().
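
Roughly, this is because btrfs_buffer_uptodate() is structured like the
following (a simplified sketch, not a verbatim copy of the function; the
io_tree argument stands for the btree inode's io tree):

	int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
				  int atomic)
	{
		int ret = extent_buffer_uptodate(buf);

		if (!ret)
			return ret;
		/* verify_parent_transid() returns 0 right away when
		 * parent_transid == 0, so passing 0 makes this call
		 * equivalent to extent_buffer_uptodate() alone */
		if (verify_parent_transid(io_tree, buf, parent_transid, atomic))
			ret = 0;
		return ret;
	}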

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 9fa3d77c98d4..a96d308c51b8 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2445,7 +2445,7 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path 
*path, int level)
 * and give up so that our caller doesn't loop forever
 * on our EAGAINs.
 */
-   if (!btrfs_buffer_uptodate(tmp, 0, 0))
+   if (!extent_buffer_uptodate(tmp))
ret = -EIO;
free_extent_buffer(tmp);
} else {
-- 
1.8.3.1



[PATCH v2 1/6] Btrfs: remove superfluous free_extent_buffer

2018-05-17 Thread Liu Bo
read_block_for_search() can be simplified as,

tmp = find_extent_buffer();
if (tmp)
   return;

free_extent_buffer();
read_tree_block();

Apparently, @tmp must be NULL at this point, so the free_extent_buffer()
call is not needed.

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index b3f6f300e492..9fa3d77c98d4 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2432,7 +2432,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path 
*path, int level)
btrfs_unlock_up_safe(p, level + 1);
btrfs_set_path_blocking(p);
 
-   free_extent_buffer(tmp);
if (p->reada != READA_NONE)
reada_for_search(fs_info, p, level, slot, key->objectid);
 
-- 
1.8.3.1



[PATCH v2 3/6] Btrfs: move get root of btrfs_search_slot to a helper

2018-05-17 Thread Liu Bo
It's good to have a helper instead of having all get-root details
open-coded.

The new helper locks (if necessary) and sets root node of the path.

Also invert the checks to make the code flow easier to read.

There is no functional change in this commit.

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 115 +--
 1 file changed, 70 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a96d308c51b8..d12fc0474e21 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2598,6 +2598,75 @@ int btrfs_find_item(struct btrfs_root *fs_root, struct 
btrfs_path *path,
return 0;
 }
 
+static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root 
*root,
+   struct btrfs_path *p,
+   int write_lock_level)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct extent_buffer *b;
+   int root_lock;
+   int level = 0;
+
+   /*
+* we try very hard to do read locks on the root
+*/
+   root_lock = BTRFS_READ_LOCK;
+
+   if (p->search_commit_root) {
+   /*
+* the commit roots are read only so we always do read locks
+*/
+   if (p->need_commit_sem)
+   down_read(&fs_info->commit_root_sem);
+   b = root->commit_root;
+   extent_buffer_get(b);
+   level = btrfs_header_level(b);
+   if (p->need_commit_sem)
+   up_read(&fs_info->commit_root_sem);
+   if (!p->skip_locking)
+   btrfs_tree_read_lock(b);
+
+   goto out;
+   }
+
+   if (p->skip_locking) {
+   b = btrfs_root_node(root);
+   level = btrfs_header_level(b);
+   goto out;
+   }
+
+   /*
+* we don't know the level of the root node until we actually
+* have it read locked
+*/
+   b = btrfs_read_lock_root_node(root);
+   level = btrfs_header_level(b);
+   if (level > write_lock_level)
+   goto out;
+
+   /*
+* whoops, must trade for write lock
+*/
+   btrfs_tree_read_unlock(b);
+   free_extent_buffer(b);
+   b = btrfs_lock_root_node(root);
+   root_lock = BTRFS_WRITE_LOCK;
+   /*
+* the level might have changed, check again
+*/
+   level = btrfs_header_level(b);
+
+out:
+   p->nodes[level] = b;
+   if (!p->skip_locking)
+   p->locks[level] = root_lock;
+   /*
+* Callers are responsible for drop b's refs.
+*/
+   return b;
+}
+
+
 /*
  * btrfs_search_slot - look for a key in a tree and perform necessary
  * modifications to preserve tree invariants.
@@ -2634,7 +2703,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
int err;
int level;
int lowest_unlock = 1;
-   int root_lock;
/* everything at write_lock_level or lower must be write locked */
int write_lock_level = 0;
u8 lowest_level = 0;
@@ -2672,50 +2740,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
 
 again:
prev_cmp = -1;
-   /*
-* we try very hard to do read locks on the root
-*/
-   root_lock = BTRFS_READ_LOCK;
-   level = 0;
-   if (p->search_commit_root) {
-   /*
-* the commit roots are read only
-* so we always do read locks
-*/
-   if (p->need_commit_sem)
-   down_read(&fs_info->commit_root_sem);
-   b = root->commit_root;
-   extent_buffer_get(b);
-   level = btrfs_header_level(b);
-   if (p->need_commit_sem)
-   up_read(&fs_info->commit_root_sem);
-   if (!p->skip_locking)
-   btrfs_tree_read_lock(b);
-   } else {
-   if (p->skip_locking) {
-   b = btrfs_root_node(root);
-   level = btrfs_header_level(b);
-   } else {
-   /* we don't know the level of the root node
-* until we actually have it read locked
-*/
-   b = btrfs_read_lock_root_node(root);
-   level = btrfs_header_level(b);
-   if (level <= write_lock_level) {
-   /* whoops, must trade for write lock */
-   btrfs_tree_read_unlock(b);
-   free_extent_buffer(b);
-   b = btrfs_lock_root_node(root);
-   root_lock = BTRFS_WRITE_LOCK;
-
-   

[PATCH v2 5/6] Btrfs: grab write lock directly if write_lock_level is the max level

2018-05-17 Thread Liu Bo
Typically, when acquiring root node's lock, btrfs tries its best to get
read lock and trade for write lock if @write_lock_level implies to do so.

In case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level
is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's
write lock directly.

In this particular case, the dance of acquiring read lock and then trading
for write lock can be saved.
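
For context, the existing logic in btrfs_search_slot() that produces this
case is essentially:

	/* paraphrased from btrfs_search_slot() */
	if (cow && (p->keep_locks || p->lowest_level))
		write_lock_level = BTRFS_MAX_LEVEL;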

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 29 -
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 8d3b09038f37..e619f7e01794 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2633,20 +2633,23 @@ static struct extent_buffer 
*btrfs_search_slot_get_root(struct btrfs_root *root,
goto out;
}
 
-   /*
-* we don't know the level of the root node until we actually
-* have it read locked
-*/
-   b = btrfs_read_lock_root_node(root);
-   level = btrfs_header_level(b);
-   if (level > write_lock_level)
-   goto out;
+   if (write_lock_level < BTRFS_MAX_LEVEL) {
+   /*
+* we don't know the level of the root node until we actually
+* have it read locked
+*/
+   b = btrfs_read_lock_root_node(root);
+   level = btrfs_header_level(b);
+   if (level > write_lock_level)
+   goto out;
+
+   /*
+* whoops, must trade for write lock
+*/
+   btrfs_tree_read_unlock(b);
+   free_extent_buffer(b);
+   }
 
-   /*
-* whoops, must trade for write lock
-*/
-   btrfs_tree_read_unlock(b);
-   free_extent_buffer(b);
b = btrfs_lock_root_node(root);
root_lock = BTRFS_WRITE_LOCK;
/*
-- 
1.8.3.1



[PATCH v2 6/6] Btrfs: remove always true check in unlock_up

2018-05-17 Thread Liu Bo
As unlock_up() is written as

for () {
   if (!path->locks[i])
   break;
   ...
   if (... && path->locks[i]) {
   }
}

Apparently, @path->locks[i] is always true at this 'if'.

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index e619f7e01794..65dbebb8cc59 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2330,7 +2330,7 @@ static noinline void unlock_up(struct btrfs_path *path, 
int level,
no_skips = 1;
 
t = path->nodes[i];
-   if (i >= lowest_unlock && i > skip_level && path->locks[i]) {
+   if (i >= lowest_unlock && i > skip_level) {
btrfs_tree_unlock_rw(t, path->locks[i]);
path->locks[i] = 0;
if (write_lock_level &&
-- 
1.8.3.1



[PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-17 Thread Liu Bo
Here are a collection of patches I did for btrfs_search_slot().

v2: more explicit commit log for each patch.

Liu Bo (6):
  Btrfs: remove superfluous free_extent_buffer
  Btrfs: use more straightforward extent_buffer_uptodate
  Btrfs: move get root of btrfs_search_slot to a helper
  Btrfs: remove unused check of skip_locking
  Btrfs: grab write lock directly if write_lock_level is the max level
  Btrfs: remove always true check in unlock_up

 fs/btrfs/ctree.c | 121 +--
 1 file changed, 73 insertions(+), 48 deletions(-)

-- 
1.8.3.1



[PATCH v2] Btrfs: add parent_transid parameter to veirfy_level_key

2018-05-17 Thread Liu Bo
As verify_level_key() is checked after verify_parent_transid(), i.e.

if (verify_parent_transid())
   ret = -EIO;
else if (verify_level_key())
   ret = -EUCLEAN;

if parent_transid is 0, verify_parent_transid() skips verifying
parent_transid and considers the eb as valid, and if verify_level_key()
then reports something wrong, we're not going to know whether it's caused
by corrupted metadata or by a non-checked eb (e.g. a stale eb).

@parent_transid is able to tell whether the eb's generation has been
verified by the caller.

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
v2: - more explicit commit log.
- adjust the position shown in error msg.

 fs/btrfs/disk-io.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60caa68c3618..ad865176a3ca 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -416,7 +416,7 @@ static int btrfs_check_super_csum(struct btrfs_fs_info 
*fs_info,
 
 static int verify_level_key(struct btrfs_fs_info *fs_info,
struct extent_buffer *eb, int level,
-   struct btrfs_key *first_key)
+   struct btrfs_key *first_key, u64 parent_transid)
 {
int found_level;
struct btrfs_key found_key;
@@ -454,10 +454,11 @@ static int verify_level_key(struct btrfs_fs_info *fs_info,
if (ret) {
WARN_ON(1);
btrfs_err(fs_info,
-"tree first key mismatch detected, bytenr=%llu key expected=(%llu, %u, %llu) 
has=(%llu, %u, %llu)",
- eb->start, first_key->objectid, first_key->type,
- first_key->offset, found_key.objectid,
- found_key.type, found_key.offset);
+"tree first key mismatch detected, bytenr=%llu parent_transid=%llu key 
expected=(%llu, %u, %llu) has=(%llu, %u, %llu)",
+ eb->start, parent_transid, first_key->objectid,
+ first_key->type, first_key->offset,
+ found_key.objectid, found_key.type,
+ found_key.offset);
}
 #endif
return ret;
@@ -493,7 +494,7 @@ static int btree_read_extent_buffer_pages(struct 
btrfs_fs_info *fs_info,
   parent_transid, 0))
ret = -EIO;
else if (verify_level_key(fs_info, eb, level,
- first_key))
+ first_key, parent_transid))
ret = -EUCLEAN;
else
break;
-- 
1.8.3.1



Re: [PATCH 4/6] Btrfs: remove unused check of skip_locking

2018-05-17 Thread Liu Bo
On Wed, May 16, 2018 at 3:03 PM, Nikolay Borisov <nbori...@suse.com> wrote:
>
>
> On 15.05.2018 20:52, Liu Bo wrote:
>> The check is superfluous since all of callers who set search_for_commit
>> also have skip_locking set.
>
> This is false. For example btrfs_qgroup_rescan_worker sets
> search_commit_root = 1 but doesn't set skip locking. So either your
> assumption is wrong or the code in btrfs_qgroup_rescan_worker has missed
> that (which seems more likely).
>

I'm a bit confused as I didn't see btrfs_qgroup_rescan_worker() set
path->search_commit_root.

thanks,
liubo

> I think this change necessitates either:
>
> a) ASSERT(p->skip_locking) inside the if (p->search_commit_root) branch
>
> or
>
> b) Unconditionally setting ->skip_locking if we have set
> search_commit_root and removing opencoded set of skip_locking alongside
> search_commit_root.
>
> I think b) is the better option since it provides "fire and forget"
> semantics when you want to search the commit root, since it's only read
> only.
>
>>
>> Signed-off-by: Liu Bo <bo@linux.alibaba.com>
>> ---
>>  fs/btrfs/ctree.c | 2 --
>>  1 file changed, 2 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> index 399839df7a8f..cf34eca41d4e 100644
>> --- a/fs/btrfs/ctree.c
>> +++ b/fs/btrfs/ctree.c
>> @@ -2623,8 +2623,6 @@ static struct extent_buffer 
>> *btrfs_search_slot_get_root(struct btrfs_root *root,
>>   level = btrfs_header_level(b);
>>   if (p->need_commit_sem)
>>   up_read(_info->commit_root_sem);
>> - if (!p->skip_locking)
>> - btrfs_tree_read_lock(b);
>>
>>   goto out;
>>   }
>>


Re: [PATCH 2/6] Btrfs: use more straightforward extent_buffer_uptodate

2018-05-17 Thread Liu Bo
On Wed, May 16, 2018 at 2:43 PM, Nikolay Borisov <nbori...@suse.com> wrote:
>
>
> On 15.05.2018 20:52, Liu Bo wrote:
>> In read_block_for_search(), it's straightforward to use
>> extent_buffer_uptodate() instead since 0 is passed as parent transid to
>
> "instead of the more heavyweight btrfs_buffer_update"
>

I don't think it's about being heavyweight; they're actually equivalent in this case.

I just want to reduce the burden of reading this code, as
verify_parent_transid() really has some corner cases to think about.
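For reference, a rough sketch of why the two calls are equivalent when 0 is
passed as the parent transid (paraphrased, not the verbatim kernel source;
check_parent_transid() below is a stand-in name for the real verification
helper):

    /* roughly what btrfs_buffer_uptodate(eb, 0, 0) boils down to */
    static int buffer_uptodate_sketch(struct extent_buffer *eb, u64 parent_transid)
    {
            /* same first check that extent_buffer_uptodate() does */
            if (!extent_buffer_uptodate(eb))
                    return 0;

            /*
             * The transid verification bails out right away when
             * parent_transid is 0, so nothing else is checked and the
             * result is identical to extent_buffer_uptodate() alone.
             */
            if (!parent_transid)
                    return 1;

            return check_parent_transid(eb, parent_transid);
    }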

>> btrfs_buffer_uptodate(), which means the check for parent transid is not
>> needed.
>>
>> Signed-off-by: Liu Bo <bo@linux.alibaba.com>
>
> Codewise LGTM:
>
> Reviewed-by: Nikolay Borisov <nbori...@suse.com>

Thanks for taking a look at this.

thanks,
liubo

>> ---
>>  fs/btrfs/ctree.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> index 9fa3d77c98d4..a96d308c51b8 100644
>> --- a/fs/btrfs/ctree.c
>> +++ b/fs/btrfs/ctree.c
>> @@ -2445,7 +2445,7 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>>* and give up so that our caller doesn't loop forever
>>* on our EAGAINs.
>>*/
>> - if (!btrfs_buffer_uptodate(tmp, 0, 0))
>> + if (!extent_buffer_uptodate(tmp))
>>   ret = -EIO;
>>   free_extent_buffer(tmp);
>>   } else {
>>


[PATCH 2/6] Btrfs: use more straightforward extent_buffer_uptodate

2018-05-15 Thread Liu Bo
In read_block_for_search(), it's more straightforward to use
extent_buffer_uptodate() directly, since 0 is passed as the parent transid
to btrfs_buffer_uptodate(), which means the parent transid check is skipped
anyway.

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 9fa3d77c98d4..a96d308c51b8 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2445,7 +2445,7 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
 * and give up so that our caller doesn't loop forever
 * on our EAGAINs.
 */
-   if (!btrfs_buffer_uptodate(tmp, 0, 0))
+   if (!extent_buffer_uptodate(tmp))
ret = -EIO;
free_extent_buffer(tmp);
} else {
-- 
1.8.3.1



[PATCH 6/6] Btrfs: remove always true check in unlock_up

2018-05-15 Thread Liu Bo
@path->locks[i] is always true at this point, since the loop above breaks
out as soon as it finds a level that holds no lock.
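For context, a paraphrased sketch of the surrounding loop in unlock_up()
(not the exact source), showing why the extra test can never be false here:

    for (i = level; i < BTRFS_MAX_LEVEL; i++) {
            if (!path->nodes[i])
                    break;
            if (!path->locks[i])
                    break;          /* an unlocked level ends the walk... */

            /* ... skip_level bookkeeping elided ... */

            /* ...so any level reaching this point still holds its lock */
            if (i >= lowest_unlock && i > skip_level) {
                    btrfs_tree_unlock_rw(path->nodes[i], path->locks[i]);
                    path->locks[i] = 0;
            }
    }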

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index f7c3f581f647..16d28a4ec54f 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2330,7 +2330,7 @@ static noinline void unlock_up(struct btrfs_path *path, int level,
no_skips = 1;
 
t = path->nodes[i];
-   if (i >= lowest_unlock && i > skip_level && path->locks[i]) {
+   if (i >= lowest_unlock && i > skip_level) {
btrfs_tree_unlock_rw(t, path->locks[i]);
path->locks[i] = 0;
if (write_lock_level &&
-- 
1.8.3.1



[PATCH 4/6] Btrfs: remove unused check of skip_locking

2018-05-15 Thread Liu Bo
The check is superfluous since all callers that set search_commit_root
also have skip_locking set.
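For illustration, the caller pattern the commit message refers to typically
looks like this (schematic, not a specific call site):

    struct btrfs_path *path = btrfs_alloc_path();

    /* read-only lookup against the commit root */
    path->search_commit_root = 1;
    path->skip_locking = 1;     /* commit roots are never modified, no locks needed */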

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 399839df7a8f..cf34eca41d4e 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2623,8 +2623,6 @@ static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root *root,
level = btrfs_header_level(b);
if (p->need_commit_sem)
up_read(&fs_info->commit_root_sem);
-   if (!p->skip_locking)
-   btrfs_tree_read_lock(b);
 
goto out;
}
-- 
1.8.3.1



[PATCH 3/6] Btrfs: move get root of btrfs_search_slot to a helper

2018-05-15 Thread Liu Bo
It's good to have a helper instead of having all get-root details
open-coded.

The new helper locks (if necessary) and sets root node of the path.

There is no functional change in this commit.

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 112 +--
 1 file changed, 67 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a96d308c51b8..399839df7a8f 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2598,6 +2598,72 @@ int btrfs_find_item(struct btrfs_root *fs_root, struct btrfs_path *path,
return 0;
 }
 
+static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root *root,
+   struct btrfs_path *p,
+   int write_lock_level)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct extent_buffer *b;
+   int root_lock;
+   int level = 0;
+
+   /*
+* we try very hard to do read locks on the root
+*/
+   root_lock = BTRFS_READ_LOCK;
+
+   if (p->search_commit_root) {
+   /*
+* the commit roots are read only so we always do read locks
+*/
+   if (p->need_commit_sem)
+   down_read(&fs_info->commit_root_sem);
+   b = root->commit_root;
+   extent_buffer_get(b);
+   level = btrfs_header_level(b);
+   if (p->need_commit_sem)
+   up_read(&fs_info->commit_root_sem);
+   if (!p->skip_locking)
+   btrfs_tree_read_lock(b);
+
+   goto out;
+   }
+
+   if (p->skip_locking) {
+   b = btrfs_root_node(root);
+   level = btrfs_header_level(b);
+   goto out;
+   }
+
+   /*
+* we don't know the level of the root node until we actually
+* have it read locked
+*/
+   b = btrfs_read_lock_root_node(root);
+   level = btrfs_header_level(b);
+   if (level > write_lock_level)
+   goto out;
+
+   /*
+* whoops, must trade for write lock
+*/
+   btrfs_tree_read_unlock(b);
+   free_extent_buffer(b);
+   b = btrfs_lock_root_node(root);
+   root_lock = BTRFS_WRITE_LOCK;
+   /*
+* the level might have changed, check again
+*/
+   level = btrfs_header_level(b);
+
+out:
+   p->nodes[level] = b;
+   if (!p->skip_locking)
+   p->locks[level] = root_lock;
+   return b;
+}
+
+
 /*
  * btrfs_search_slot - look for a key in a tree and perform necessary
  * modifications to preserve tree invariants.
@@ -2634,7 +2700,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
int err;
int level;
int lowest_unlock = 1;
-   int root_lock;
/* everything at write_lock_level or lower must be write locked */
int write_lock_level = 0;
u8 lowest_level = 0;
@@ -2672,50 +2737,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 
 again:
prev_cmp = -1;
-   /*
-* we try very hard to do read locks on the root
-*/
-   root_lock = BTRFS_READ_LOCK;
-   level = 0;
-   if (p->search_commit_root) {
-   /*
-* the commit roots are read only
-* so we always do read locks
-*/
-   if (p->need_commit_sem)
-   down_read(&fs_info->commit_root_sem);
-   b = root->commit_root;
-   extent_buffer_get(b);
-   level = btrfs_header_level(b);
-   if (p->need_commit_sem)
-   up_read(_info->commit_root_sem);
-   if (!p->skip_locking)
-   btrfs_tree_read_lock(b);
-   } else {
-   if (p->skip_locking) {
-   b = btrfs_root_node(root);
-   level = btrfs_header_level(b);
-   } else {
-   /* we don't know the level of the root node
-* until we actually have it read locked
-*/
-   b = btrfs_read_lock_root_node(root);
-   level = btrfs_header_level(b);
-   if (level <= write_lock_level) {
-   /* whoops, must trade for write lock */
-   btrfs_tree_read_unlock(b);
-   free_extent_buffer(b);
-   b = btrfs_lock_root_node(root);
-   root_lock = BTRFS_WRITE_LOCK;
-
-   /* the level might have changed, check again */
-   level = btrfs_header_level(b);
-

[PATCH 1/6] Btrfs: remove superfluous free_extent_buffer

2018-05-15 Thread Liu Bo
@tmp must be NULL at this point, so the free_extent_buffer() call is not needed at all.
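For reference, the call was harmless but redundant: free_extent_buffer()
tolerates a NULL argument (a paraphrased sketch of its early return, assuming
the usual pattern, not the exact source):

    void free_extent_buffer(struct extent_buffer *eb)
    {
            if (!eb)
                    return;     /* freeing NULL is a no-op */
            /* ... reference counting and actual release ... */
    }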

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index b3f6f300e492..9fa3d77c98d4 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2432,7 +2432,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
btrfs_unlock_up_safe(p, level + 1);
btrfs_set_path_blocking(p);
 
-   free_extent_buffer(tmp);
if (p->reada != READA_NONE)
reada_for_search(fs_info, p, level, slot, key->objectid);
 
-- 
1.8.3.1



[PATCH 0/6] btrfs_search_slot cleanups

2018-05-15 Thread Liu Bo
These are the cleanups I made for btrfs_search_slot() and its callees.

Liu Bo (6):
  Btrfs: remove superfluous free_extent_buffer
  Btrfs: use more straightforward extent_buffer_uptodate
  Btrfs: move get root of btrfs_search_slot to a helper
  Btrfs: remove unused check of skip_locking
  Btrfs: grab write lock directly if write_lock_level is the max level
  Btrfs: remove always true check in unlock_up

 fs/btrfs/ctree.c | 118 +--
 1 file changed, 70 insertions(+), 48 deletions(-)

-- 
1.8.3.1



[PATCH 5/6] Btrfs: grab write lock directly if write_lock_level is the max level

2018-05-15 Thread Liu Bo
In the case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level
is the max level, so we should grab the write lock on the root node from the
very beginning.
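For reference, the condition the commit message refers to sits near the top
of btrfs_search_slot(), roughly like this (paraphrased):

    int write_lock_level = 0;

    /*
     * With keep_locks or a non-zero lowest_level while cowing, every
     * level must end up write locked, so start at the maximum.
     */
    if (cow && (p->keep_locks || p->lowest_level))
            write_lock_level = BTRFS_MAX_LEVEL;

Since no root can have a level above BTRFS_MAX_LEVEL, the read-lock-then-trade
dance can only ever end in taking the write lock anyway, so taking it directly
avoids a pointless read lock/unlock cycle.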

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 29 -
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index cf34eca41d4e..f7c3f581f647 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2633,20 +2633,23 @@ static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root *root,
goto out;
}
 
-   /*
-* we don't know the level of the root node until we actually
-* have it read locked
-*/
-   b = btrfs_read_lock_root_node(root);
-   level = btrfs_header_level(b);
-   if (level > write_lock_level)
-   goto out;
+   if (write_lock_level < BTRFS_MAX_LEVEL) {
+   /*
+* we don't know the level of the root node until we actually
+* have it read locked
+*/
+   b = btrfs_read_lock_root_node(root);
+   level = btrfs_header_level(b);
+   if (level > write_lock_level)
+   goto out;
+
+   /*
+* whoops, must trade for write lock
+*/
+   btrfs_tree_read_unlock(b);
+   free_extent_buffer(b);
+   }
 
-   /*
-* whoops, must trade for write lock
-*/
-   btrfs_tree_read_unlock(b);
-   free_extent_buffer(b);
b = btrfs_lock_root_node(root);
root_lock = BTRFS_WRITE_LOCK;
/*
-- 
1.8.3.1



[PATCH] Btrfs: fix the corruption by reading stale btree blocks

2018-05-15 Thread Liu Bo
If a btree block, aka an extent buffer, is not available in the extent
buffer cache, it'll be read from disk instead, i.e.

btrfs_search_slot()
  read_block_for_search()  # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block()  # read child

Unfortunately, the parent lock is released before reading the child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") used
0 as the parent transid to read the child block.  That forces
read_tree_block() not to check whether the parent transid differs from the
generation id of the child block it reads from disk.

A simple PoC is included in btrfs/124,

0. A two-disk raid1 btrfs,

1. Right after mkfs.btrfs, block A is allocated to be the device tree's root.

2. Mount this filesystem and put it in use; after a while, the device tree's
   root gets COWed, but block A hasn't been reallocated or overwritten yet.

3. Umount it and reload the btrfs module to remove both disks from the
   global @fs_devices list.

4. mount -odegraded dev1 and write some data, so now block A is allocated
   to be a leaf in the checksum tree.  Note that only dev1 has the latest
   metadata of this filesystem.

5. Umount it and mount it again normally (with both disks).  Since raid1
   picks a mirror based on the task's pid, if btrfs_search_slot() needs to
   read block A, dev2, which does NOT have the latest metadata, might be
   read for block A, and then we get a stale block A.

6. As the parent transid is not checked, block A is marked as uptodate and
   put into the extent buffer cache, so future searches won't bother to
   read the disk again, which means changes will be made on top of this
   stale copy, dirtying it and flushing it onto disk.

To avoid the problem, parent transid needs to be passed to
read_tree_block().

In order to get a valid parent transid, we need to hold the parent's
lock until we finish reading the child.
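For context, a rough sketch of the check that a non-zero parent transid
re-enables when the child is read from disk (paraphrased from
verify_parent_transid(), not the exact source):

    /* inside the transid verification, roughly */
    if (!parent_transid ||
        btrfs_header_generation(eb) == parent_transid)
            return 0;   /* 0 means "don't check", so a stale copy slips through */

    /*
     * Otherwise the block is reported as a parent transid failure and
     * the read is retried from the other mirror.
     */
    ret = 1;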

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 3fd44835b386..b3f6f300e492 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2436,10 +2436,8 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
if (p->reada != READA_NONE)
reada_for_search(fs_info, p, level, slot, key->objectid);
 
-   btrfs_release_path(p);
-
ret = -EAGAIN;
-   tmp = read_tree_block(fs_info, blocknr, 0, parent_level - 1,
+   tmp = read_tree_block(fs_info, blocknr, gen, parent_level - 1,
  &first_key);
if (!IS_ERR(tmp)) {
/*
@@ -2454,6 +2452,8 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
} else {
ret = PTR_ERR(tmp);
}
+
+   btrfs_release_path(p);
return ret;
 }
 
-- 
1.8.3.1



[PATCH] Btrfs: add parent_transid parameter to verify_level_key

2018-05-15 Thread Liu Bo
@parent_transid tells us whether the eb's generation has already been
verified by the caller.

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
 fs/btrfs/disk-io.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60caa68c3618..b5d55b0ec19b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -416,7 +416,7 @@ static int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
 
 static int verify_level_key(struct btrfs_fs_info *fs_info,
struct extent_buffer *eb, int level,
-   struct btrfs_key *first_key)
+   struct btrfs_key *first_key, u64 parent_transid)
 {
int found_level;
struct btrfs_key found_key;
@@ -454,10 +454,10 @@ static int verify_level_key(struct btrfs_fs_info *fs_info,
if (ret) {
WARN_ON(1);
btrfs_err(fs_info,
-"tree first key mismatch detected, bytenr=%llu key expected=(%llu, %u, %llu) 
has=(%llu, %u, %llu)",
+"tree first key mismatch detected, bytenr=%llu key expected=(%llu, %u, %llu) 
has=(%llu, %u, %llu) parent_transid %llu",
  eb->start, first_key->objectid, first_key->type,
  first_key->offset, found_key.objectid,
- found_key.type, found_key.offset);
+ found_key.type, found_key.offset, parent_transid);
}
 #endif
return ret;
@@ -493,7 +493,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
   parent_transid, 0))
ret = -EIO;
else if (verify_level_key(fs_info, eb, level,
- first_key))
+ first_key, parent_transid))
ret = -EUCLEAN;
else
break;
-- 
1.8.3.1



Re: verify key failure

2018-05-15 Thread Liu Bo
On Tue, May 15, 2018 at 9:29 AM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>
>
> On 2018-05-14 22:35, Liu Bo wrote:
>> Hi,
>>
>> I got another warning of verify_level_key by running btrfs/124 in a loop, 
>> I'm testing against 4.17-rc3.
>>
>> Not sure if it's false positive.
>>
>> [101414.336691] WARNING: CPU: 3 PID: 30194 at fs/btrfs/disk-io.c:455 
>> btree_read_extent_buffer_pages+0x183/0x220 [btrfs]
>> [101414.340372] Modules linked in: btrfs(O) xor zstd_decompress 
>> zstd_compress xxhash zlib_inflate lzo_compress lzo_decompress zlib_deflate 
>> raid6_pq dm_flakey [last unloaded: xor]
>> [101414.345713] CPU: 3 PID: 30194 Comm: btrfs Tainted: GW  O  
>> 4.17.0-rc3-liubo+ #35
>> [101414.348501] RIP: 0010:btree_read_extent_buffer_pages+0x183/0x220 [btrfs]
>> ...
>> [101414.369713] Call Trace:
>> [101414.370477]  read_tree_block+0x3d/0x60 [btrfs]
>> [101414.371946]  read_block_for_search.isra.11+0x19c/0x350 [btrfs]
>> [101414.373915]  btrfs_search_slot+0x4a0/0xa60 [btrfs]
>> [101414.375489]  ? trace_hardirqs_on_caller+0x12/0x1c0
>> [101414.377080]  ? btrfs_lookup_ordered_extent+0x8b/0xd0 [btrfs]
>> [101414.379007]  btrfs_lookup_csum+0x42/0x130 [btrfs]
>> [101414.380456]  __btrfs_lookup_bio_sums+0x2fb/0x6a0 [btrfs]
>> [101414.381554]  btrfs_submit_bio_hook+0xbb/0x180 [btrfs]
>> [101414.382598]  submit_one_bio+0x57/0x80 [btrfs]
>> [101414.383509]  submit_extent_page+0xd5/0x1f0 [btrfs]
>> [101414.384507]  __do_readpage+0x2a6/0x770 [btrfs]
>> [101414.385449]  ? btrfs_create_repair_bio+0x100/0x100 [btrfs]
>> [101414.386576]  ? btrfs_direct_IO+0x3a0/0x3a0 [btrfs]
>> [101414.387569]  ? btrfs_direct_IO+0x3a0/0x3a0 [btrfs]
>> [101414.388562]  __extent_readpages+0x2e2/0x330 [btrfs]
>> [101414.389584]  extent_readpages+0x10e/0x1a0 [btrfs]
>> [101414.390565]  __do_page_cache_readahead+0x283/0x340
>> [101414.391550]  ? ondemand_readahead+0x207/0x460
>> [101414.392451]  ondemand_readahead+0x207/0x460
>> [101414.393353]  relocate_file_extent_cluster+0x364/0x4c0 [btrfs]
>> [101414.394546]  relocate_block_group+0x5d4/0x6e0 [btrfs]
>> ...
>> [101414.432616] BTRFS error (device sdb): tree first key mismatch detected, bytenr=30523392 key expected=(18446744073709551606, 128, 1120665600) has=(1, 204, 22020096)
>
> The expected key is completely fine, while the found one obviously
> belongs to extent tree.
>
> Maybe that's the bug which I'm always chasing.
>

The following patch is already in 4.17-rc3,

btrfs: Fix wrong first_key parameter in replace_path

> Can you reproduce it again with btrfs_print_tree() added to provide more
> info?
>

Not sure if I'll have time to work on this one, but I'll let you know
if I get it again.

My test box is nothing special, just a plain kvm VM.

thanks,
liubo


Re: verify key failure

2018-05-15 Thread Liu Bo
On Tue, May 15, 2018 at 12:10 AM, Chris Mason <c...@fb.com> wrote:
>
>
> On 14 May 2018, at 10:35, Liu Bo wrote:
>
>> Hi,
>>
>> I got another warning of verify_level_key by running btrfs/124 in a loop,
>> I'm testing against 4.17-rc3.
>>
>> Not sure if it's false positive.
>
>
> How long does this take to trigger?
>


btrfs/124 takes ~24s on my box, and it took 10 ~ 15 runs to hit.


thanks,
liubo

> -chris
>


verify key failure

2018-05-14 Thread Liu Bo
Hi,

I got another warning from verify_level_key by running btrfs/124 in a loop;
I'm testing against 4.17-rc3.

Not sure if it's a false positive.

[101414.336691] WARNING: CPU: 3 PID: 30194 at fs/btrfs/disk-io.c:455 
btree_read_extent_buffer_pages+0x183/0x220 [btrfs]
[101414.340372] Modules linked in: btrfs(O) xor zstd_decompress zstd_compress 
xxhash zlib_inflate lzo_compress lzo_decompress zlib_deflate raid6_pq dm_flakey 
[last unloaded: xor]
[101414.345713] CPU: 3 PID: 30194 Comm: btrfs Tainted: GW  O  
4.17.0-rc3-liubo+ #35
[101414.348501] RIP: 0010:btree_read_extent_buffer_pages+0x183/0x220 [btrfs]
...
[101414.369713] Call Trace:
[101414.370477]  read_tree_block+0x3d/0x60 [btrfs]
[101414.371946]  read_block_for_search.isra.11+0x19c/0x350 [btrfs]
[101414.373915]  btrfs_search_slot+0x4a0/0xa60 [btrfs]
[101414.375489]  ? trace_hardirqs_on_caller+0x12/0x1c0
[101414.377080]  ? btrfs_lookup_ordered_extent+0x8b/0xd0 [btrfs]
[101414.379007]  btrfs_lookup_csum+0x42/0x130 [btrfs]
[101414.380456]  __btrfs_lookup_bio_sums+0x2fb/0x6a0 [btrfs]
[101414.381554]  btrfs_submit_bio_hook+0xbb/0x180 [btrfs]
[101414.382598]  submit_one_bio+0x57/0x80 [btrfs]
[101414.383509]  submit_extent_page+0xd5/0x1f0 [btrfs]
[101414.384507]  __do_readpage+0x2a6/0x770 [btrfs]
[101414.385449]  ? btrfs_create_repair_bio+0x100/0x100 [btrfs]
[101414.386576]  ? btrfs_direct_IO+0x3a0/0x3a0 [btrfs]
[101414.387569]  ? btrfs_direct_IO+0x3a0/0x3a0 [btrfs]
[101414.388562]  __extent_readpages+0x2e2/0x330 [btrfs]
[101414.389584]  extent_readpages+0x10e/0x1a0 [btrfs]
[101414.390565]  __do_page_cache_readahead+0x283/0x340
[101414.391550]  ? ondemand_readahead+0x207/0x460
[101414.392451]  ondemand_readahead+0x207/0x460
[101414.393353]  relocate_file_extent_cluster+0x364/0x4c0 [btrfs]
[101414.394546]  relocate_block_group+0x5d4/0x6e0 [btrfs]
...
[101414.432616] BTRFS error (device sdb): tree first key mismatch detected, bytenr=30523392 key expected=(18446744073709551606, 128, 1120665600) has=(1, 204, 22020096)

thanks,
-liubo


Re: Problem with current->journal_info for BTRFS, Null pointer dereference in start_transaction

2018-05-07 Thread Liu Bo
On Mon, May 7, 2018 at 2:49 PM, robbieko  wrote:
> Hi,
>
> When the send process requires a memory allocation, the shrinker may be
> triggered due to insufficient memory.
> Then evict_inode gets called when an inode is dropped, and this function may
> need to start a transaction.
> However, journal_info already points to BTRFS_SEND_TRANS_STUB, so it
> passes the if condition,
> and the following use yields an illegal memory access.
>
>  495 if (current->journal_info) {
>  496 WARN_ON(type & TRANS_EXTWRITERS);
>  497 h = current->journal_info;
>  498 refcount_inc(&h->use_count);
>  499 WARN_ON(refcount_read(&h->use_count) > 2);
>  500 h->orig_rsv = h->block_rsv;
>  501 h->block_rsv = NULL;
>  502 goto got_it;
>  503 }
>
> Direct IO has a similar problem: journal_info will store btrfs_dio_data,
> which will lead to an illegal memory access.
>
> Anyone have the best solution?
>

btrfs_evict_inode() only starts a transaction before doing the truncate,
so we can save the trans_handle and restore it afterwards.

The same approach can be applied to direct IO, as we used to do.
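A minimal sketch of the save/restore idea (illustrative only, not an actual
fix; the function name is made up):

    static void evict_inode_sketch(struct inode *inode)
    {
            void *saved = current->journal_info;

            /*
             * Drop the send/dio marker so that start_transaction() doesn't
             * mistake it for a running transaction handle.
             */
            current->journal_info = NULL;

            /* ... truncate work that may call btrfs_start_transaction() ... */

            current->journal_info = saved;
    }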

thanks,
liubo
> CallTrace looks like this:
> 018-04-30T04:28:00+08:00 Office kernel: [62182.567827] BUG: unable to handle
> kernel NULL pointer dereference at 0021
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.576596] IP:
> [] start_transaction+0x64/0x450 [btrfs]
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.584208] PGD 8fea4b067 PUD
> a33bea067 PMD 0
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.589179] Oops:  [#1] SMP
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.734282] CPU: 3 PID: 12681
> Comm: btrfs Tainted: P C O 3.10.102 #15266
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.742554] Hardware name:
> Synology Inc. RS3617xs Series/Type2 - Board Product Name1, BIOS M.012
> 2016/06/04
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.753451] task:
> 880a2babc040 ti: 880013e8 task.ti: 880013e8
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.761819] RIP:
> 0010:[] [] start_transaction+0x64/0x450
> [btrfs]
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.772148] RSP:
> 0018:880013e834d0 EFLAGS: 00010246
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.778085] RAX:
> 880a2babc040 RBX: 880b7e8488a0 RCX: 
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.786063] RDX:
> 88101c1bc000 RSI:  RDI: 
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.794034] RBP:
> 0801 R08: 0001 R09: 
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.802012] R10:
> 0100 R11: 0002 R12: 881018148000
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.809983] R13:
> 0001 R14: 88101c1bc188 R15: 881018148000
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.817961] FS:
> 7f3db36038c0() GS:88107fc6() knlGS:
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.827001] CS: 0010 DS: 
> ES:  CR0: 80050033
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.833424] CR2:
> 0021 CR3: 000633403000 CR4: 003407e0
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.841394] DR0:
>  DR1:  DR2: 
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.849373] DR3:
>  DR6: fffe0ff0 DR7: 0400
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.857351] Stack:
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.859585] 0002
> 881018148000 880b7e8488a0 0002
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.867869] 880933256540
> 880013e83550 88101c1bc188 881018148000
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.876161] a087a838
> 0007  88101c1bc000
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.884450] Call Trace:
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.887198] []
> ? btrfs_evict_inode+0x3d8/0x580 [btrfs]
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.894799] []
> ? evict+0xa2/0x1a0
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.900352] []
> ? shrink_dentry_list+0x308/0x3d0
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.907263] []
> ? prune_dcache_sb+0x133/0x160
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.913881] []
> ? prune_super+0xcf/0x1a0
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.920013] []
> ? shrink_slab+0x11f/0x1d0
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.926242] []
> ? do_try_to_free_pages+0x452/0x560
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.97] []
> ? throttle_direct_reclaim+0x74/0x240
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.940634] []
> ? try_to_free_pages+0xae/0xc0
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.947252] []
> ? __alloc_pages_nodemask+0x53b/0x9f0
> 2018-04-30T04:28:00+08:00 Office kernel: [62182.954542] 

Re: [PATCH v2] Btrfs: set keep_lock when necessary in btrfs_defrag_leaves

2018-04-26 Thread Liu Bo
On Fri, Apr 27, 2018 at 2:06 AM, David Sterba <dste...@suse.cz> wrote:
> On Thu, Apr 26, 2018 at 09:27:43AM +0800, Liu Bo wrote:
>> path->keep_locks may force locking the tree root and higher-level nodes,
>> and make lock contention worse, thus it needs to be avoided as much as
>> possible.
>>
>> In btrfs_defrag_leaves(), path->keep_locks is set but @path immediately
>> gets released, which is not necessary at all.
>
> So what is the effect of this? The locks are held throughout
> btrfs_search_forward and released after that. Thi
>
>> Signed-off-by: Liu Bo <bo@linux.alibaba.com>
>> ---
>> v2: update commit log with more details.
>>
>>  fs/btrfs/tree-defrag.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/tree-defrag.c b/fs/btrfs/tree-defrag.c
>> index 3c0987ab587d..c12747904d4c 100644
>> --- a/fs/btrfs/tree-defrag.c
>> +++ b/fs/btrfs/tree-defrag.c
>> @@ -65,8 +65,6 @@ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
>>   memcpy(&key, &root->defrag_progress, sizeof(key));
>>   }
>>
>> - path->keep_locks = 1;
>> -
>>   ret = btrfs_search_forward(root, &key, path, BTRFS_OLDEST_GENERATION);
>
> What does btrfs_search_forward do as the first statement:
>
> 5115 int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
> 5116  struct btrfs_path *path,
> 5117  u64 min_trans)
> 5118 {
>  declarations
> 5128
> 5129 path->keep_locks = 1;
>
> So even if removed from above, there will be no change. The value of
> keep_locks is preserved after btrfs_path_release.
>

FYI, btrfs_search_forward() doesn't need the keep_locks semantics, as all
of its callers only access path->nodes[0], so I'm planning to remove the
keep_locks setting inside it, too.

thanks,
liubo

>>   if (ret < 0)
>>   goto out;
>> @@ -81,6 +79,7 @@ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
>>* a deadlock (attempting to write lock an already write locked leaf).
>>*/
>>   path->lowest_level = 1;
>> + path->keep_locks = 1;
>>   wret = btrfs_search_slot(trans, root, &key, path, 0, 1);
>>
>>   if (wret < 0) {
>> --
>> 1.8.3.1
>>


Re: [PATCH v2] Btrfs: set keep_lock when necessary in btrfs_defrag_leaves

2018-04-26 Thread Liu Bo
On Fri, Apr 27, 2018 at 2:06 AM, David Sterba <dste...@suse.cz> wrote:
> On Thu, Apr 26, 2018 at 09:27:43AM +0800, Liu Bo wrote:
>> path->keep_locks may force locking the tree root and higher-level nodes,
>> and make lock contention worse, thus it needs to be avoided as much as
>> possible.
>>
>> In btrfs_defrag_leaves(), path->keep_locks is set but @path immediately
>> gets released, which is not necessary at all.
>
> So what is the effect of this? The locks are held throughout
> btrfs_search_forward and released after that. Thi
>
>> Signed-off-by: Liu Bo <bo@linux.alibaba.com>
>> ---
>> v2: update commit log with more details.
>>
>>  fs/btrfs/tree-defrag.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/tree-defrag.c b/fs/btrfs/tree-defrag.c
>> index 3c0987ab587d..c12747904d4c 100644
>> --- a/fs/btrfs/tree-defrag.c
>> +++ b/fs/btrfs/tree-defrag.c
>> @@ -65,8 +65,6 @@ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
>>   memcpy(&key, &root->defrag_progress, sizeof(key));
>>   }
>>
>> - path->keep_locks = 1;
>> -
>>   ret = btrfs_search_forward(root, &key, path, BTRFS_OLDEST_GENERATION);
>
> What does btrfs_search_forward do as the first statement:
>
> 5115 int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
> 5116  struct btrfs_path *path,
> 5117  u64 min_trans)
> 5118 {
>  declarations
> 5128
> 5129 path->keep_locks = 1;
>
> So even if removed from above, there will be no change. The value of
> keep_locks is preserved after btrfs_path_release.
>

It sets it back at the end:

out:
path->keep_locks = keep_locks;


thanks,
liubo
>>   if (ret < 0)
>>   goto out;
>> @@ -81,6 +79,7 @@ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
>>* a deadlock (attempting to write lock an already write locked leaf).
>>*/
>>   path->lowest_level = 1;
>> + path->keep_locks = 1;
>>   wret = btrfs_search_slot(trans, root, &key, path, 0, 1);
>>
>>   if (wret < 0) {
>> --
>> 1.8.3.1
>>


[PATCH v2] Btrfs: set keep_lock when necessary in btrfs_defrag_leaves

2018-04-25 Thread Liu Bo
path->keep_locks may force locking the tree root and higher-level nodes,
and make lock contention worse, thus it needs to be avoided as much as
possible.

In btrfs_defrag_leaves(), path->keep_locks is set but @path immediately
gets released, which is not necessary at all.

Signed-off-by: Liu Bo <bo@linux.alibaba.com>
---
v2: update commit log with more details.

 fs/btrfs/tree-defrag.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-defrag.c b/fs/btrfs/tree-defrag.c
index 3c0987ab587d..c12747904d4c 100644
--- a/fs/btrfs/tree-defrag.c
+++ b/fs/btrfs/tree-defrag.c
@@ -65,8 +65,6 @@ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
memcpy(&key, &root->defrag_progress, sizeof(key));
}
 
-   path->keep_locks = 1;
-
ret = btrfs_search_forward(root, &key, path, BTRFS_OLDEST_GENERATION);
if (ret < 0)
goto out;
@@ -81,6 +79,7 @@ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
 * a deadlock (attempting to write lock an already write locked leaf).
 */
path->lowest_level = 1;
+   path->keep_locks = 1;
wret = btrfs_search_slot(trans, root, &key, path, 0, 1);
 
if (wret < 0) {
-- 
1.8.3.1


