Re: [PATCH 2/8] btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref

2018-12-07 Thread Nikolay Borisov



On 6.12.18 г. 8:58 ч., Qu Wenruo wrote:
> The process_func is never a function hook used anywhere else.
> 
> Open code it to make later delayed ref refactor easier, so we can
> refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
> patches.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 33 ++---
>  1 file changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ea2c3d5220f0..ea68d288d761 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3220,10 +3220,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
> *trans,
>   int i;
>   int level;
>   int ret = 0;
> - int (*process_func)(struct btrfs_trans_handle *,
> - struct btrfs_root *,
> - u64, u64, u64, u64, u64, u64, bool);
> -
>  
>   if (btrfs_is_testing(fs_info))
>   return 0;
> @@ -3235,11 +3231,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
> *trans,
>   if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state) && level == 0)
>   return 0;
>  
> - if (inc)
> - process_func = btrfs_inc_extent_ref;
> - else
> - process_func = btrfs_free_extent;
> -
>   if (full_backref)
>   parent = buf->start;
>   else
> @@ -3261,17 +3252,29 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
> *trans,
>  
>   num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
>   key.offset -= btrfs_file_extent_offset(buf, fi);
> - ret = process_func(trans, root, bytenr, num_bytes,
> -parent, ref_root, key.objectid,
> -key.offset, for_reloc);
> + if (inc)
> + ret = btrfs_inc_extent_ref(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + key.objectid, key.offset,
> + for_reloc);
> + else
> + ret = btrfs_free_extent(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + key.objectid, key.offset,
> + for_reloc);
>   if (ret)
>   goto fail;
>   } else {
>   bytenr = btrfs_node_blockptr(buf, i);
>   num_bytes = fs_info->nodesize;
> - ret = process_func(trans, root, bytenr, num_bytes,
> -parent, ref_root, level - 1, 0,
> -for_reloc);
> + if (inc)
> + ret = btrfs_inc_extent_ref(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + level - 1, 0, for_reloc);
> + else
> + ret = btrfs_free_extent(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + level - 1, 0, for_reloc);
>   if (ret)
>   goto fail;
>   }
> 


Re: [PATCH 05/10] btrfs: introduce delayed_refs_rsv

2018-12-07 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv.  This works most
> of the time, except when it doesn't.  We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction.  Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we often
> get wrong.
> 
> So instead give them their own block_rsv.  This way we can always know
> exactly how much outstanding space we need for delayed refs.  This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system.  Instead of doing math to see if its a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
> 
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums.  We could have a
> separate rsv for the block group updates, but the csum deletion stuff is
> still handled via the delayed_refs so that will stay there.


I see one difference in the way that the space is managed. Essentially,
for the delayed refs rsv you only ever increase ->size, and ->reserved is
touched only when you have to refill. This is the opposite of the way
other metadata space is managed, i.e. via use_block_rsv, which subtracts
from ->reserved every time a block has to be CoW'ed. Why this
difference?
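
To illustrate what I mean, in simplified pseudo-C (not the actual helpers
from this patch, just the two accounting models side by side):

	/* classic model, e.g. what use_block_rsv() does when CoW'ing a block: */
	spin_lock(&rsv->lock);
	rsv->reserved -= fs_info->nodesize;	/* consume the reservation directly */
	spin_unlock(&rsv->lock);

	/* delayed refs rsv model: queueing a ref head only grows ->size ... */
	spin_lock(&delayed_refs_rsv->lock);
	delayed_refs_rsv->size += btrfs_calc_trans_metadata_size(fs_info, 1);
	delayed_refs_rsv->full = 0;
	spin_unlock(&delayed_refs_rsv->lock);
	/* ... while ->reserved only catches up later via a refill/migrate */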


> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   |  14 +++-
>  fs/btrfs/delayed-ref.c |  43 --
>  fs/btrfs/disk-io.c |   4 +
>  fs/btrfs/extent-tree.c | 212 
> +
>  fs/btrfs/transaction.c |  37 -
>  5 files changed, 284 insertions(+), 26 deletions(-)
> 



> +/**
> + * btrfs_migrate_to_delayed_refs_rsv - transfer bytes to our delayed refs 
> rsv.
> + * @fs_info - the fs info for our fs.
> + * @src - the source block rsv to transfer from.
> + * @num_bytes - the number of bytes to transfer.
> + *
> + * This transfers up to the num_bytes amount from the src rsv to the
> + * delayed_refs_rsv.  Any extra bytes are returned to the space info.
> + */
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +struct btrfs_block_rsv *src,
> +u64 num_bytes)

This function is currently used only during transaction start, it seems
to be rather specific to the delayed refs so I'd suggest making it
private to transaction.c

> +{
> + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> + u64 to_free = 0;
> +
> + spin_lock(&src->lock);
> + src->reserved -= num_bytes;
> + src->size -= num_bytes;
> + spin_unlock(&src->lock);
> +
> + spin_lock(&delayed_refs_rsv->lock);
> + if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) {
> + u64 delta = delayed_refs_rsv->size -
> + delayed_refs_rsv->reserved;
> + if (num_bytes > delta) {
> + to_free = num_bytes - delta;
> + num_bytes = delta;
> + }
> + } else {
> + to_free = num_bytes;
> + num_bytes = 0;
> + }
> +
> + if (num_bytes)
> + delayed_refs_rsv->reserved += num_bytes;
> + if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size)
> + delayed_refs_rsv->full = 1;
> + spin_unlock(&delayed_refs_rsv->lock);
> +
> + if (num_bytes)
> + trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +   0, num_bytes, 1);
> + if (to_free)
> + space_info_add_old_bytes(fs_info, delayed_refs_rsv->space_info,
> +  to_free);
> +}
> +
> +/**
> + * btrfs_delayed_refs_rsv_refill - refill based on our delayed refs usage.
> + * @fs_info - the fs_info for our fs.
> + * @flush - control how we can flush for this reservation.
> + *
> + * This will refill the delayed block_rsv up to 1 items size worth of space 
> and
> + * will return -ENOSPC if we can't make the reservation.
> + */
> +int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
> +   enum btrfs_reserve_flush_enum flush)
> +{
> + struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
> + u64 limit = btrfs_calc_trans_metadata_size(fs_info, 1);
> + u64 num_bytes = 0;
> + int ret = -ENOSPC;
> +
> + spin_lock(&block_rsv->lock);
> + if (block_rsv->reserved < block_rsv->size) {
> + num_bytes = block_rsv->size - block_rsv->reserved;
> + num_bytes = min(num_bytes, limit);
> + }
> + 

Re: [PATCH 04/10] btrfs: only track ref_heads in delayed_ref_updates

2018-12-07 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter what the number of items on it.  Fix the accounting to only be
> adjusted when we add/remove a ref head.

David,

I think the changelog also warrants a forward-looking sentence stating
that the number is also going to be used to calculate the required number
of bytes in the delayed refs rsv. Something along the lines of:

In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.

> 
> Reviewed-by: Nikolay Borisov 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index b3e4c9fcb664..48725fa757a3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct 
> btrfs_trans_handle *trans,
>   ref->in_tree = 0;
>   btrfs_put_delayed_ref(ref);
>   atomic_dec(_refs->num_entries);
> - if (trans->delayed_ref_updates)
> - trans->delayed_ref_updates--;
>  }
>  
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
> *trans,
>   if (ref->action == BTRFS_ADD_DELAYED_REF)
>   list_add_tail(>add_list, >ref_add_list);
>   atomic_inc(>num_entries);
> - trans->delayed_ref_updates++;
>   spin_unlock(>lock);
>   return ret;
>  }
> 


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-07 Thread Nikolay Borisov



On 7.12.18 г. 9:09 ч., Nikolay Borisov wrote:
> 
> 
> On 6.12.18 г. 19:54 ч., David Sterba wrote:
>> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>>
>>>
>>> On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
>>>> Now with the delayed_refs_rsv we can now know exactly how much pending
>>>> delayed refs space we need.  This means we can drastically simplify
>>>
>>> IMO it will be helpful if there is a sentence here referring back to
>>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>>> But I guess this is something David can also do.
>>
>> I'll update the changelog, but I'm not sure what exactly you want to see
>> there, please post the replacement text. Thanks.
> 
> With the introduction of the delayed_refs_rsv infrastructure, namely
> btrfs_update_delayed_refs_rsv, we now know exactly how much pending
> delayed refs space is required.

To put things into context as to why I deem this change beneficial -
migrating a reservation from the transaction to the delayed refs rsv
modifies both ->size and ->reserved, so they stay equal. Calling
btrfs_update_delayed_refs_rsv, on the other hand, only increases ->size
and never decrements ->reserved. Also, we never call
btrfs_block_rsv_migrate/use_block_rsv on the delayed refs block rsv, so
managing the ->reserved value for the delayed refs rsv is different than
for the rest of the block rsvs.


> 
>>
>>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>>> the delayed refs rsv.  If our total size is beyond that amount then we
>>>> know it's time to commit the transaction and stop any more delayed refs
>>>> from being generated.
>>>>
>>>> Signed-off-by: Josef Bacik 
>>
> 


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread Nikolay Borisov



On 6.12.18 г. 19:54 ч., David Sterba wrote:
> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
>>> Now with the delayed_refs_rsv we can now know exactly how much pending
>>> delayed refs space we need.  This means we can drastically simplify
>>
>> IMO it will be helpful if there is a sentence here referring back to
>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>> But I guess this is something David can also do.
> 
> I'll update the changelog, but I'm not sure what exactly you want to see
> there, please post the replacement text. Thanks.

With the introduction of the delayed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv, we now know exactly how much pending
delayed refs space is required.

> 
>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>> the delayed refs rsv.  If our total size is beyond that amount then we
>>> know it's time to commit the transaction and stop any more delayed refs
>>> from being generated.
>>>
>>> Signed-off-by: Josef Bacik 
> 


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Now with the delayed_refs_rsv we can now know exactly how much pending
> delayed refs space we need.  This means we can drastically simplify

IMO it will be helpful if there is a sentence here referring back to
btrfs_update_delayed_refs_rsv to put your first sentence into context.
But I guess this is something David can also do.

> btrfs_check_space_for_delayed_refs by simply checking how much space we
> have reserved for the global rsv (which acts as a spill over buffer) and
> the delayed refs rsv.  If our total size is beyond that amount then we
> know it's time to commit the transaction and stop any more delayed refs
> from being generated.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/ctree.h   |  2 +-
>  fs/btrfs/extent-tree.c | 48 ++--
>  fs/btrfs/inode.c   |  4 ++--
>  fs/btrfs/transaction.c |  2 +-
>  4 files changed, 22 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2eba398c722b..30da075c042e 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2631,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct 
> btrfs_fs_info *fs_info,
>  }
>  
>  int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
>  void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
>const u64 start);
>  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5a2d0b061f57..07ef1b8087f7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2839,40 +2839,28 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info 
> *fs_info, u64 csum_bytes)
>   return num_csums;
>  }
>  
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans)
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info)
>  {
> - struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_block_rsv *global_rsv;
> - u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
> - u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
> - unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
> - u64 num_bytes, num_dirty_bgs_bytes;
> - int ret = 0;
> + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> + struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> + bool ret = false;
> + u64 reserved;
>  
> - num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
> - num_heads = heads_to_leaves(fs_info, num_heads);
> - if (num_heads > 1)
> - num_bytes += (num_heads - 1) * fs_info->nodesize;
> - num_bytes <<= 1;
> - num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
> - fs_info->nodesize;
> - num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
> -  num_dirty_bgs);
> - global_rsv = _info->global_block_rsv;
> + spin_lock(&global_rsv->lock);
> + reserved = global_rsv->reserved;
> + spin_unlock(&global_rsv->lock);
>  
>   /*
> -  * If we can't allocate any more chunks lets make sure we have _lots_ of
> -  * wiggle room since running delayed refs can create more delayed refs.
> +  * Since the global reserve is just kind of magic we don't really want
> +  * to rely on it to save our bacon, so if our size is more than the
> +  * delayed_refs_rsv and the global rsv then it's time to think about
> +  * bailing.
>*/
> - if (global_rsv->space_info->full) {
> - num_dirty_bgs_bytes <<= 1;
> - num_bytes <<= 1;
> - }
> -
> - spin_lock(_rsv->lock);
> - if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
> - ret = 1;
> - spin_unlock(_rsv->lock);
> + spin_lock(&delayed_refs_rsv->lock);
> + reserved += delayed_refs_rsv->reserved;
> + if (delayed_refs_rsv->size >= reserved)
> + ret = true;
> + spin_unlock(&delayed_refs_rsv->lock);
>   return ret;
>  }
>  
> @@ -2891,7 +2879,7 @@ int btrfs_should_throttle_delayed_refs(struct 
> btrfs_trans_handle *trans)
>   if (val >= NSEC_PER_SEC / 2)
>   return 2;
>  
> - return 

Re: [PATCH 09/10] btrfs: don't run delayed refs in the end transaction logic

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Over the years we have built up a lot of infrastructure to keep delayed
> refs in check, mostly by running them at btrfs_end_transaction() time.
> We have a lot of different maths we do to figure out how much, if we
> should do it inline or async, etc.  This existed because we had no
> feedback mechanism to force the flushing of delayed refs when they
> became a problem.  However with the enospc flushing infrastructure in
> place for flushing delayed refs when they put too much pressure on the
> enospc system we have this problem solved.  Rip out all of this code as
> it is no longer needed.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 38 --
>  1 file changed, 38 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 2d8401bf8df9..01f39401619a 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -798,22 +798,12 @@ static int should_end_transaction(struct 
> btrfs_trans_handle *trans)
>  int btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
>  {
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - int updates;
> - int err;
>  
>   smp_mb();
>   if (cur_trans->state >= TRANS_STATE_BLOCKED ||
>   cur_trans->delayed_refs.flushing)
>   return 1;
>  
> - updates = trans->delayed_ref_updates;
> - trans->delayed_ref_updates = 0;
> - if (updates) {
> - err = btrfs_run_delayed_refs(trans, updates * 2);
> - if (err) /* Error code will also eval true */
> - return err;
> - }
> -
>   return should_end_transaction(trans);
>  }
>  
> @@ -843,11 +833,8 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>  {
>   struct btrfs_fs_info *info = trans->fs_info;
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - u64 transid = trans->transid;
> - unsigned long cur = trans->delayed_ref_updates;
>   int lock = (trans->type != TRANS_JOIN_NOLOCK);
>   int err = 0;
> - int must_run_delayed_refs = 0;
>  
>   if (refcount_read(>use_count) > 1) {
>   refcount_dec(>use_count);
> @@ -858,27 +845,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   btrfs_trans_release_metadata(trans);
>   trans->block_rsv = NULL;
>  
> - if (!list_empty(>new_bgs))
> - btrfs_create_pending_block_groups(trans);

Is this being deleted because the delayed_refs_rsv also accounts for
new block groups?

> -
> - trans->delayed_ref_updates = 0;
> - if (!trans->sync) {
> - must_run_delayed_refs =
> - btrfs_should_throttle_delayed_refs(trans);
> - cur = max_t(unsigned long, cur, 32);
> -
> - /*
> -  * don't make the caller wait if they are from a NOLOCK
> -  * or ATTACH transaction, it will deadlock with commit
> -  */
> - if (must_run_delayed_refs == 1 &&
> - (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH)))
> - must_run_delayed_refs = 2;
> - }
> -
> - btrfs_trans_release_metadata(trans);
> - trans->block_rsv = NULL;

Why remove those 2 lines as well ?

> -
>   if (!list_empty(>new_bgs))
>   btrfs_create_pending_block_groups(trans);
>  
> @@ -923,10 +889,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   }
>  
>   kmem_cache_free(btrfs_trans_handle_cachep, trans);
> - if (must_run_delayed_refs) {
> - btrfs_async_run_delayed_refs(info, cur, transid,
> -  must_run_delayed_refs == 1);
> - }
>   return err;
>  }
>  
> 


Re: [PATCH 06/10] btrfs: update may_commit_transaction to use the delayed refs rsv

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Any space used in the delayed_refs_rsv will be freed up by a transaction
> commit, so instead of just counting the pinned space we also need to
> account for any space in the delayed_refs_rsv when deciding if it will
> make a difference to commit the transaction to satisfy our space
> reservation.  If we have enough bytes to satisfy our reservation ticket
> then we are good to go, otherwise subtract out what space we would gain
> back by committing the transaction and compare that against the pinned
> space to make our decision.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

However, look below for one suggestion: 

> ---
>  fs/btrfs/extent-tree.c | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index aa0a638d0263..63ff9d832867 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4843,8 +4843,10 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>  {
>   struct reserve_ticket *ticket = NULL;
>   struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
> + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
>   struct btrfs_trans_handle *trans;
> - u64 bytes;
> + u64 bytes_needed;
> + u64 reclaim_bytes = 0;
>  
>   trans = (struct btrfs_trans_handle *)current->journal_info;
>   if (trans)
> @@ -4857,15 +4859,15 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   else if (!list_empty(_info->tickets))
>   ticket = list_first_entry(_info->tickets,
> struct reserve_ticket, list);
> - bytes = (ticket) ? ticket->bytes : 0;
> + bytes_needed = (ticket) ? ticket->bytes : 0;
>   spin_unlock(_info->lock);
>  
> - if (!bytes)
> + if (!bytes_needed)
>   return 0;
>  
>   /* See if there is enough pinned space to make this reservation */
>   if (__percpu_counter_compare(_info->total_bytes_pinned,
> -bytes,
> +bytes_needed,
>  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
>   goto commit;
>  
> @@ -4877,14 +4879,18 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   return -ENOSPC;

If we remove this check:

	if (space_info != delayed_rsv->space_info)
		return -ENOSPC;

can't we move the reclaim_bytes calculation above the
__percpu_counter_compare call and be left with just a single invocation of
__percpu_counter_compare? The diff should look something along the lines of:

@@ -4828,19 +4827,6 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
-   /* See if there is enough pinned space to make this reservation */
-   if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
-   goto commit;
-
-   /*
-* See if there is some space in the delayed insertion reservation for
-* this reservation.
-*/
-   if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
-
	spin_lock(&delayed_rsv->lock);
if (delayed_rsv->size > bytes)
bytes = 0;
@@ -4850,9 +4836,8 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 
	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
   bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
+  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
return -ENOSPC;
-   }
 
 commit:
trans = btrfs_join_transaction(fs_info->extent_root);


>  
>   spin_lock(&delayed_rsv->lock);
> - if (delayed_rsv->size > bytes)
> - bytes = 0;
> - else
> - bytes -= delayed_rsv->size;
> + reclaim_bytes += delayed_rsv->reserved;
>   spin_unlock(&delayed_rsv->lock);
>  
> + spin_lock(&delayed_refs_rsv->lock);
> + reclaim_bytes += delayed_refs_rsv->reserved;
> + spin_unlock(&delayed_refs_rsv->lock);
> + if (reclaim_bytes >= bytes_needed)
> + goto commit;
> + bytes_needed -= reclaim_bytes;
> +
>   if (__percpu_counter_compare(&space_info->total_bytes_pinned,
> -bytes,
> +bytes_needed,
>  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
>   return -ENOSPC;
>   }
> 


Re: [PATCH 02/10] btrfs: add cleanup_ref_head_accounting helper

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
> 
> Reviewed-by: Omar Sandoval 
> Reviewed-by: Liu Bo 
> Signed-off-by: Josef Bacik 

Doesn't this also need a stable tag? Furthermore, doesn't the missing
code dealing with total_bytes_pinned in check_ref_cleanup mean that
every time the last reference for a block was freed we were leaking
bytes in total_bytes_pinned? Shouldn't this eventually have led to
total_bytes_pinned dominating the usage in a space_info?

Codewise lgtm:

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 67 
> +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c36b3a42f2bb..e3ed3507018d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
> *trans,
>   return ret ? ret : 1;
>  }
>  
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> + struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_delayed_ref_root *delayed_refs =
> + >transaction->delayed_refs;
> +
> + if (head->total_ref_mod < 0) {
> + struct btrfs_space_info *space_info;
> + u64 flags;
> +
> + if (head->is_data)
> + flags = BTRFS_BLOCK_GROUP_DATA;
> + else if (head->is_system)
> + flags = BTRFS_BLOCK_GROUP_SYSTEM;
> + else
> + flags = BTRFS_BLOCK_GROUP_METADATA;
> + space_info = __find_space_info(fs_info, flags);
> + ASSERT(space_info);
> + percpu_counter_add_batch(_info->total_bytes_pinned,
> +-head->num_bytes,
> +BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> + if (head->is_data) {
> + spin_lock(_refs->lock);
> + delayed_refs->pending_csums -= head->num_bytes;
> + spin_unlock(_refs->lock);
> + }
> + }
> +
> + /* Also free its reserved qgroup space */
> + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> +   head->qgroup_reserved);
> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>   struct btrfs_delayed_ref_head *head)
>  {
> @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
>  
> - trace_run_delayed_ref_head(fs_info, head, 0);
> -
> - if (head->total_ref_mod < 0) {
> - struct btrfs_space_info *space_info;
> - u64 flags;
> -
> - if (head->is_data)
> - flags = BTRFS_BLOCK_GROUP_DATA;
> - else if (head->is_system)
> - flags = BTRFS_BLOCK_GROUP_SYSTEM;
> - else
> - flags = BTRFS_BLOCK_GROUP_METADATA;
> - space_info = __find_space_info(fs_info, flags);
> - ASSERT(space_info);
> - percpu_counter_add_batch(_info->total_bytes_pinned,
> --head->num_bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> - if (head->is_data) {
> - spin_lock(_refs->lock);
> - delayed_refs->pending_csums -= head->num_bytes;
> - spin_unlock(_refs->lock);
> - }
> - }
> -
>   if (head->must_insert_reserved) {
>   btrfs_pin_extent(fs_info, head->bytenr,
>head->num_bytes, 1);
> @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   }
>   }
>  
> - /* Also free its reserved qgroup space */
> - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> -   head->qgroup_reserved);
> + cleanup_ref_head_accounting(trans, head);
> +
> + trace_run_delayed_ref_head(fs_info, head, 0);
>   btrfs_delayed_ref_unlock(head);
>   btrfs_put_delayed_ref_head(head);
>   return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (head->must_insert_reserved)
>   ret = 1;
>  
> + cleanup_ref_head_accounting(trans, head);
>   mutex_unlock(>mutex);
>   btrfs_put_delayed_ref_head(head);
>   return ret;
> 


Re: [PATCH 01/10] btrfs: add btrfs_delete_ref_head helper

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and cleanup the calling functions.
> 
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 22 +++---
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..b3e4c9fcb664 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   return head;
>  }
>  
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head)
> +{
> + lockdep_assert_held(_refs->lock);
> + lockdep_assert_held(>lock);
> +
> + rb_erase_cached(>href_node, _refs->href_root);
> + RB_CLEAR_NODE(>href_node);
> + atomic_dec(_refs->num_entries);
> + delayed_refs->num_heads--;
> + if (head->processing == 0)
> + delayed_refs->num_heads_ready--;
> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 8e20c5cb5404..d2af974f68a1 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
> btrfs_delayed_ref_head *head)
>  {
>   mutex_unlock(>mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head);
>  
>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   struct btrfs_delayed_ref_root *delayed_refs);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d242a1174e50..c36b3a42f2bb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(_refs->lock);
>   return 1;
>   }
> - delayed_refs->num_heads--;
> - rb_erase_cached(>href_node, _refs->href_root);
> - RB_CLEAR_NODE(>href_node);
> + btrfs_delete_ref_head(delayed_refs, head);
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
> - atomic_dec(_refs->num_entries);
>  
>   trace_run_delayed_ref_head(fs_info, head, 0);
>  
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!mutex_trylock(>mutex))
>   goto out;
>  
> - /*
> -  * at this point we have a head with no other entries.  Go
> -  * ahead and process it.
> -  */
> - rb_erase_cached(>href_node, _refs->href_root);
> - RB_CLEAR_NODE(>href_node);
> - atomic_dec(_refs->num_entries);
> -
> - /*
> -  * we don't take a ref on the node because we're removing it from the
> -  * tree, so we just steal the ref the tree was holding.
> -  */
> - delayed_refs->num_heads--;
> - if (head->processing == 0)
> - delayed_refs->num_heads_ready--;
> + btrfs_delete_ref_head(delayed_refs, head);
>   head->processing = 0;
> +
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
>  
> 


Re: [PATCH 1/3] btrfs: use offset_in_page instead of open-coding it

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 16:23 ч., Johannes Thumshirn wrote:
> Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
> offset into a page.
> 
> So replace them with the offset_in_page() macro instead of open-coding it,
> as long as they're not used as an alignment check.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 
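
As a side note for context, offset_in_page() is defined in include/linux/mm.h
(going from memory, so double check on your tree) roughly as:

	#define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)

so the replacements below are a 1:1 substitution for the open-coded
'& (PAGE_SIZE - 1)' pattern.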

> ---
>  fs/btrfs/check-integrity.c | 12 +--
>  fs/btrfs/compression.c |  2 +-
>  fs/btrfs/extent_io.c   | 53 
> +-
>  fs/btrfs/file.c|  4 ++--
>  fs/btrfs/inode.c   |  7 +++---
>  fs/btrfs/send.c|  2 +-
>  fs/btrfs/volumes.c |  2 +-
>  7 files changed, 38 insertions(+), 44 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index 781cae168d2a..d319c3020c09 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -1202,24 +1202,24 @@ static void btrfsic_read_from_block_data(
>   void *dstv, u32 offset, size_t len)
>  {
>   size_t cur;
> - size_t offset_in_page;
> + size_t pgoff;
>   char *kaddr;
>   char *dst = (char *)dstv;
> - size_t start_offset = block_ctx->start & ((u64)PAGE_SIZE - 1);
> + size_t start_offset = offset_in_page(block_ctx->start);
>   unsigned long i = (start_offset + offset) >> PAGE_SHIFT;
>  
>   WARN_ON(offset + len > block_ctx->len);
> - offset_in_page = (start_offset + offset) & (PAGE_SIZE - 1);
> + pgoff = offset_in_page(start_offset + offset);
>  
>   while (len > 0) {
> - cur = min(len, ((size_t)PAGE_SIZE - offset_in_page));
> + cur = min(len, ((size_t)PAGE_SIZE - pgoff));
>   BUG_ON(i >= DIV_ROUND_UP(block_ctx->len, PAGE_SIZE));
>   kaddr = block_ctx->datav[i];
> - memcpy(dst, kaddr + offset_in_page, cur);
> + memcpy(dst, kaddr + pgoff, cur);
>  
>   dst += cur;
>   len -= cur;
> - offset_in_page = 0;
> + pgoff = 0;
>   i++;
>   }
>  }
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index dba59ae914b8..2ab5591449f2 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -477,7 +477,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  
>   if (page->index == end_index) {
>   char *userpage;
> - size_t zero_offset = isize & (PAGE_SIZE - 1);
> + size_t zero_offset = offset_in_page(isize);
>  
>   if (zero_offset) {
>   int zeros;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index b2769e92b556..e365c5272e6b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2585,7 +2585,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>   unsigned off;
>  
>   /* Zero out the end if this page straddles i_size */
> - off = i_size & (PAGE_SIZE-1);
> + off = offset_in_page(i_size);
>   if (page->index == end_index && off)
>   zero_user_segment(page, off, PAGE_SIZE);
>   SetPageUptodate(page);
> @@ -2888,7 +2888,7 @@ static int __do_readpage(struct extent_io_tree *tree,
>  
>   if (page->index == last_byte >> PAGE_SHIFT) {
>   char *userpage;
> - size_t zero_offset = last_byte & (PAGE_SIZE - 1);
> + size_t zero_offset = offset_in_page(last_byte);
>  
>   if (zero_offset) {
>   iosize = PAGE_SIZE - zero_offset;
> @@ -3432,7 +3432,7 @@ static int __extent_writepage(struct page *page, struct 
> writeback_control *wbc,
>  
>   ClearPageError(page);
>  
> - pg_offset = i_size & (PAGE_SIZE - 1);
> + pg_offset = offset_in_page(i_size);
>   if (page->index > end_index ||
>  (page->index == end_index && !pg_offset)) {
>   page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
> @@ -5307,7 +5307,7 @@ void read_extent_buffer(const struct extent_buffer *eb, 
> void *dstv,
>   struct page *page;
>   char *kaddr;
>   char *dst = (char *)dstv;
> - size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
> + size_t start_offset = offset_in_page(eb->start);
>   unsigned long i = (start_offset + start) >> PAGE_SHIFT;
>  
>   if (start + len > eb->len) {
> @@ -5317,7 +5317,7 @@ void read_extent

Re: [PATCH 2/3] btrfs: use PAGE_ALIGNED instead of open-coding it

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 16:23 ч., Johannes Thumshirn wrote:
> When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
> alignment and thus should use the PAGE_ALIGNED() macro instead of
> open-coding it.
> 
> Convert all open-coded occurrences of PAGE_ALIGNED().
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 
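
For context, PAGE_ALIGNED() is (if I remember its definition in
include/linux/mm.h correctly) just:

	#define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)

i.e. a boolean alignment check, which is exactly the other use of the
open-coded pattern that the previous patch deliberately left alone.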

> ---
>  fs/btrfs/check-integrity.c | 8 
>  fs/btrfs/compression.c | 2 +-
>  fs/btrfs/inode.c   | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index d319c3020c09..84e9729badaa 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -1601,7 +1601,7 @@ static int btrfsic_read_block(struct btrfsic_state 
> *state,
>   BUG_ON(block_ctx->datav);
>   BUG_ON(block_ctx->pagev);
>   BUG_ON(block_ctx->mem_to_free);
> - if (block_ctx->dev_bytenr & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(block_ctx->dev_bytenr)) {
>   pr_info("btrfsic: read_block() with unaligned bytenr %llu\n",
>  block_ctx->dev_bytenr);
>   return -1;
> @@ -1778,7 +1778,7 @@ static void btrfsic_process_written_block(struct 
> btrfsic_dev_state *dev_state,
>   return;
>   }
>   is_metadata = 1;
> - BUG_ON(BTRFS_SUPER_INFO_SIZE & (PAGE_SIZE - 1));
> + BUG_ON(!PAGE_ALIGNED(BTRFS_SUPER_INFO_SIZE));
>   processed_len = BTRFS_SUPER_INFO_SIZE;
>   if (state->print_mask &
>   BTRFSIC_PRINT_MASK_TREE_BEFORE_SB_WRITE) {
> @@ -2892,12 +2892,12 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
>   struct list_head *dev_head = _devices->devices;
>   struct btrfs_device *device;
>  
> - if (fs_info->nodesize & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(fs_info->nodesize)) {
>   pr_info("btrfsic: cannot handle nodesize %d not being a 
> multiple of PAGE_SIZE %ld!\n",
>  fs_info->nodesize, PAGE_SIZE);
>   return -1;
>   }
> - if (fs_info->sectorsize & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(fs_info->sectorsize)) {
>   pr_info("btrfsic: cannot handle sectorsize %d not being a 
> multiple of PAGE_SIZE %ld!\n",
>  fs_info->sectorsize, PAGE_SIZE);
>   return -1;
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 2ab5591449f2..d5381f39a9e8 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -301,7 +301,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
> *inode, u64 start,
>   blk_status_t ret;
>   int skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>  
> - WARN_ON(start & ((u64)PAGE_SIZE - 1));
> + WARN_ON(!PAGE_ALIGNED(start));
>   cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
>   if (!cb)
>   return BLK_STS_RESOURCE;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index bc0564c384de..5c52e91b01e8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2027,7 +2027,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 
> start, u64 end,
> unsigned int extra_bits,
> struct extent_state **cached_state, int dedupe)
>  {
> - WARN_ON((end & (PAGE_SIZE - 1)) == 0);
> + WARN_ON(PAGE_ALIGNED(end));
>   return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
>  extra_bits, cached_state);
>  }
> 


Re: [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Since we will be using it in another part of the code, declare it
> non-static and give it a better name.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/ctree.h |  7 +--
>  fs/btrfs/inode.c | 14 +-
>  2 files changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 038d64ecebe5..5144d28216b0 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3170,8 +3170,11 @@ struct inode *btrfs_iget_path(struct super_block *s, 
> struct btrfs_key *location,
>  struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
>struct btrfs_root *root, int *was_new);
>  struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> - struct page *page, size_t pg_offset,
> - u64 start, u64 end, int create);
> + struct page *page, size_t pg_offset,
> + u64 start, u64 end, int create);
> +void btrfs_update_ordered_extent(struct inode *inode,
> + const u64 offset, const u64 bytes,
> + const bool uptodate);
>  int btrfs_update_inode(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> struct inode *inode);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9ea4c6f0352f..96e9fe9e4150 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -97,10 +97,6 @@ static struct extent_map *create_io_em(struct inode 
> *inode, u64 start, u64 len,
>  u64 ram_bytes, int compress_type,
>  int type);
>  
> -static void __endio_write_update_ordered(struct inode *inode,
> -  const u64 offset, const u64 bytes,
> -  const bool uptodate);
> -
>  /*
>   * Cleanup all submitted ordered extents in specified range to handle errors
>   * from the fill_dellaloc() callback.
> @@ -130,7 +126,7 @@ static inline void btrfs_cleanup_ordered_extents(struct 
> inode *inode,
>   ClearPagePrivate2(page);
>   put_page(page);
>   }
> - return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
> + return btrfs_update_ordered_extent(inode, offset + PAGE_SIZE,
>   bytes - PAGE_SIZE, false);
>  }
>  
> @@ -8059,7 +8055,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
>   bio_put(bio);
>  }
>  
> -static void __endio_write_update_ordered(struct inode *inode,
> +void btrfs_update_ordered_extent(struct inode *inode,
>const u64 offset, const u64 bytes,
>const bool uptodate)

Since you are exporting the function I'd suggest using the occasion to
introduce proper kernel-doc. The primary help will be if you document
the context under which the function is called - when writes are
finished and it's used to update the respective portion of the ordered extent.
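
Something along the lines of the below (only a sketch, adjust the wording
as you see fit):

	/**
	 * btrfs_update_ordered_extent - finish IO on a range of an ordered extent
	 * @inode:    inode the ordered extent(s) belong to
	 * @offset:   logical file offset at which the IO started
	 * @bytes:    number of bytes the IO covered
	 * @uptodate: whether the IO completed successfully
	 *
	 * Called when a write (writeback or direct IO) has finished for the given
	 * range so that the corresponding portion of the ordered extent(s) can be
	 * marked as completed.
	 */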

>  {
> @@ -8112,7 +8108,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
>   struct btrfs_dio_private *dip = bio->bi_private;
>   struct bio *dio_bio = dip->dio_bio;
>  
> - __endio_write_update_ordered(dip->inode, dip->logical_offset,
> + btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
>dip->bytes, !bio->bi_status);
>  
>   kfree(dip);
> @@ -8432,7 +8428,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
> struct inode *inode,
>   bio = NULL;
>   } else {
>   if (write)
> - __endio_write_update_ordered(inode,
> + btrfs_update_ordered_extent(inode,
>   file_offset,
>   dio_bio->bi_iter.bi_size,
>   false);
> @@ -8572,7 +8568,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter)
>*/
>   if (dio_data.unsubmitted_oe_range_start <
>   dio_data.unsubmitted_oe_range_end)
> - __endio_write_update_ordered(inode,
> + btrfs_update_ordered_extent(inode,
>   dio_data.unsubmitted_oe_range_start,
>   dio_data.unsubmitted_oe_range_end -
>   dio_data.unsubmitted_oe_range_start,
> 


Re: [PATCH 03/10] btrfs: dax: read zeros from holes

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/dax.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index d614bf73bf8e..5a297674adec 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct 
> iov_iter *to)

nit: I think it's better if you rename the iterator variable to "iter".

>  
>  BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
>  
> -ret = em_dax_rw(inode, em, pos, len, to);
> + if (em->block_start == EXTENT_MAP_HOLE) {
> + u64 zero_len = min(em->len - (em->start - pos), len);

Shouldn't this be em->len - (pos - em->start), since that gives the
remaining length of the extent? Isn't pos guaranteed to be >= em->start?
E.g. with em->start == 0, em->len == 16k and pos == 4k the remaining
length is 12k == em->len - (pos - em->start), whereas (em->start - pos)
underflows as an unsigned value.

> + ret = iov_iter_zero(zero_len, to);
> + } else {
> + ret = em_dax_rw(inode, em, pos, len, to);
> + }
>  if (ret < 0)
>  goto out;
>  pos += ret;
> 


Re: [PATCH 02/10] btrfs: basic dax read

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/Makefile |  1 +
>  fs/btrfs/ctree.h  |  5 
>  fs/btrfs/dax.c| 68 
> +++
>  fs/btrfs/file.c   | 13 ++-
>  4 files changed, 86 insertions(+), 1 deletion(-)
>  create mode 100644 fs/btrfs/dax.c
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index ca693dd554e9..1fa77b875ae9 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
> root-tree.o dir-item.o \
>  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  uuid-tree.o props.o free-space-tree.o tree-checker.o
>  
> +btrfs-$(CONFIG_FS_DAX) += dax.o
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5cc470fa6a40..038d64ecebe5 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3685,6 +3685,11 @@ int btrfs_reada_wait(void *handle);
>  void btrfs_reada_detach(void *handle);
>  int btree_readahead_hook(struct extent_buffer *eb, int err);
>  
> +#ifdef CONFIG_FS_DAX
> +/* dax.c */
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> +#endif /* CONFIG_FS_DAX */
> +
>  static inline int is_fstree(u64 rootid)
>  {
>   if (rootid == BTRFS_FS_TREE_OBJECTID ||
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> new file mode 100644
> index ..d614bf73bf8e
> --- /dev/null
> +++ b/fs/btrfs/dax.c
> @@ -0,0 +1,68 @@
> +#include 
> +#include 
> +#include "ctree.h"
> +#include "btrfs_inode.h"
> +
> +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
> + u64 len, struct iov_iter *iter)
> +{
> +struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
> +ssize_t map_len;
> +pgoff_t blk_pg;
> +void *kaddr;
> +sector_t blk_start;
> +unsigned offset = pos & (PAGE_SIZE - 1);

offset = offset_in_page(pos)

> +
> +len = min(len + offset, em->len - (pos - em->start));
> +len = ALIGN(len, PAGE_SIZE);

len = PAGE_ALIGN(len);

> +blk_start = (get_start_sect(em->bdev) << 9) + (em->block_start + 
> (pos - em->start));
> +blk_pg = blk_start - offset;
> +map_len = dax_direct_access(dax_dev, PHYS_PFN(blk_pg),
> +PHYS_PFN(len), &kaddr, NULL);
> +map_len = PFN_PHYS(map_len);
> +kaddr += offset;
> +map_len -= offset;
> +if (map_len > len)
> +map_len = len;

map_len = min(map_len, len);

> +if (iov_iter_rw(iter) == WRITE)
> +return dax_copy_from_iter(dax_dev, blk_pg, kaddr, map_len, 
> iter);
> +else
> +return dax_copy_to_iter(dax_dev, blk_pg, kaddr, map_len, 
> iter);

Have you looked at the implementation of dax_iomap_actor? It has pretty
similar code. In case either of the copy helpers returns 0 it sets ret
to -EFAULT - should the same be done in btrfs_file_dax_read? IMO it will
be good if you can follow dax_iomap_actor's logic as much as possible
since that code has been used for quite some time and is deemed robust.
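
I.e. something like this in the read loop (untested, just to illustrate
the idea):

	ret = em_dax_rw(inode, em, pos, len, to);
	if (ret < 0)
		goto out;
	if (ret == 0) {
		/* nothing was copied, avoid looping forever */
		ret = -EFAULT;
		goto out;
	}
	pos += ret;
	done += ret;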

> +}
> +
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +size_t ret = 0, done = 0, count = iov_iter_count(to);
> +struct extent_map *em;
> +u64 pos = iocb->ki_pos;
> +u64 end = pos + count;
> +struct inode *inode = file_inode(iocb->ki_filp);
> +
> +if (!count)
> +return 0;
> +
> +end = i_size_read(inode) < end ? i_size_read(inode) : end;

end = min(i_size_read(inode), end)

> +
> +while (pos < end) {
> +u64 len = end - pos;
> +
> +em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, len, 0);
> +if (IS_ERR(em)) {
> +if (!ret)
> +ret = PTR_ERR(em);
> +goto out;
> +}
> +
> +BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);

I think this can never trigger, because EXTENT_FLAG_FS_MAPPING is set
for extents that map chunks and those are housed in the chunk tree at
fs_info->mapping_tree. Since the write callback is only ever called for
file inodes I'd say this BUG_ON can be eliminated. Did you manage to
trigger it during development?


> +
> +ret = em_dax_rw(inode, em, pos, len, to);
> +if (ret < 0)
> +goto out;
> +pos += ret;
> +done += ret;
> +}
> +
> +out:
> +iocb->ki_pos += done;
> +return done ? done : ret;
> +}
> +
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 58e93bce3036..ef6ed93f44d1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ 

Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Also, set the inode->i_flags to S_DAX
> 
> Signed-off-by: Goldwyn Rodrigues 

Reviewed-by: Nikolay Borisov 

One question below though .

> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 -
>  fs/btrfs/super.c | 15 +++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
> btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE  (1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY  (1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY   (1 << 28)
> +#define BTRFS_MOUNT_DAX  (1 << 29)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL(30)
>  #define BTRFS_DEFAULT_MAX_INLINE (2048)
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 802a628e9f7d..e9146c157816 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode 
> *inode)
>   if (binode->flags & BTRFS_INODE_DIRSYNC)
>   new_fl |= S_DIRSYNC;
>  
> + if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && 
> S_ISREG(inode->i_mode))
> + new_fl |= S_DAX;
> +
>   set_mask_bits(&inode->i_flags,
> -   S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> +   S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | 
> S_DAX,
> new_fl);
>  }
>  
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 645fc81e2a94..035263b61cf5 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -326,6 +326,7 @@ enum {
>   Opt_treelog, Opt_notreelog,
>   Opt_usebackuproot,
>   Opt_user_subvol_rm_allowed,
> + Opt_dax,
>  
>   /* Deprecated options */
>   Opt_alloc_start,
> @@ -393,6 +394,7 @@ static const match_table_t tokens = {
>   {Opt_notreelog, "notreelog"},
>   {Opt_usebackuproot, "usebackuproot"},
>   {Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
> + {Opt_dax, "dax"},
>  
>   /* Deprecated options */
>   {Opt_alloc_start, "alloc_start=%s"},
> @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
> *options,
>   case Opt_user_subvol_rm_allowed:
>   btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
>   break;
> +#ifdef CONFIG_FS_DAX
> + case Opt_dax:
> + if (btrfs_super_num_devices(info->super_copy) > 1) {
> + btrfs_info(info,
> +"dax not supported for multi-device 
> btrfs partition\n");

What prevents supporting dax for multiple devices so long as all devices
are dax?



> 


[PATCH v2] btrfs: Remove unnecessary code from __btrfs_rebalance

2018-12-05 Thread Nikolay Borisov
The first step of the rebalance process is to ensure there is 1MiB free
on each device. This number seems rather small. And in fact when talking
to the original authors their opinions were:

"man that's a little bonkers"
"i don't think we even need that code anymore"
"I think it was there to make sure we had room for the blank 1M at the
beginning. I bet it goes all the way back to v0"
"we just don't need any of that tho, i say we just delete it"

Clearly, this piece of code has lost its original intent throughout
the years. It doesn't really bring any real practical benefits to the
relocation process. Additionally, this patch makes the balance process
more lightweight by removing a pair of shrink/grow operations which
are rather expensive for heavily populated filesystems. This is mainly
because shrinking requires relocating block groups, which involves heavy
use of the btree.

Signed-off-by: Nikolay Borisov 
Suggested-by: Josef Bacik 
Reviewed-by: Josef Bacik 
---
Changes since v1: 
 * Improved changelog by adding information about reduced runtimes and
   explaining where they would come from.

 I did measurements of btrfs balance with and without the patch with
 funclatency from the bcc tools but didn't observe large differences;
 this was, however, on a lightly populated filesystem.

 fs/btrfs/volumes.c | 53 --
 1 file changed, 53 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d49baad64fe6..19cc31de1e84 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3699,17 +3699,11 @@ static int __btrfs_balance(struct btrfs_fs_info 
*fs_info)
 {
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
struct btrfs_root *chunk_root = fs_info->chunk_root;
-   struct btrfs_root *dev_root = fs_info->dev_root;
-   struct list_head *devices;
-   struct btrfs_device *device;
-   u64 old_size;
-   u64 size_to_free;
u64 chunk_type;
struct btrfs_chunk *chunk;
struct btrfs_path *path = NULL;
struct btrfs_key key;
struct btrfs_key found_key;
-   struct btrfs_trans_handle *trans;
struct extent_buffer *leaf;
int slot;
int ret;
@@ -3724,53 +3718,6 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
u32 count_sys = 0;
int chunk_reserved = 0;
 
-   /* step one make some room on all the devices */
-   devices = &fs_info->fs_devices->devices;
-   list_for_each_entry(device, devices, dev_list) {
-   old_size = btrfs_device_get_total_bytes(device);
-   size_to_free = div_factor(old_size, 1);
-   size_to_free = min_t(u64, size_to_free, SZ_1M);
-   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) ||
-   btrfs_device_get_total_bytes(device) -
-   btrfs_device_get_bytes_used(device) > size_to_free ||
-   test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
-   continue;
-
-   ret = btrfs_shrink_device(device, old_size - size_to_free);
-   if (ret == -ENOSPC)
-   break;
-   if (ret) {
-   /* btrfs_shrink_device never returns ret > 0 */
-   WARN_ON(ret > 0);
-   goto error;
-   }
-
-   trans = btrfs_start_transaction(dev_root, 0);
-   if (IS_ERR(trans)) {
-   ret = PTR_ERR(trans);
-   btrfs_info_in_rcu(fs_info,
-"resize: unable to start transaction after shrinking device %s 
(error %d), old size %llu, new size %llu",
- rcu_str_deref(device->name), ret,
- old_size, old_size - size_to_free);
-   goto error;
-   }
-
-   ret = btrfs_grow_device(trans, device, old_size);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   /* btrfs_grow_device never returns ret > 0 */
-   WARN_ON(ret > 0);
-   btrfs_info_in_rcu(fs_info,
-"resize: unable to grow device after shrinking device %s 
(error %d), old size %llu, new size %llu",
- rcu_str_deref(device->name), ret,
- old_size, old_size - size_to_free);
-   goto error;
-   }
-
-   btrfs_end_transaction(trans);
-   }
-
-   /* step two, relocate all the chunks */
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
-- 
2.17.1



Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 22:14 ч., Wilson, Ellis wrote:
> On 12/4/18 8:07 AM, Nikolay Borisov wrote:
>> On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
>>> With 14TB drives available today, it doesn't take more than a handful of
>>> drives to result in a filesystem that takes around a minute to mount.
>>> As a result of this, I suspect this will become an increasing problem
>>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>>> not a contributor so I have no room to do so -- just shedding some light
>>> on a problem that may deserve attention as filesystem sizes continue to
>>> grow.
>> Would it be possible to provide perf traces of the longer-running mount
>> time? Everyone seems to be fixated on reading block groups (which is
>> likely to be the culprit) but before pointing finger I'd like concrete
>> evidence pointed at the offender.
> 
> I am glad to collect such traces -- please advise with commands that 
> would achieve that.  If you just mean block traces, I can do that, but I 
> suspect you mean something more BTRFS-specific.

A command that would be good is :

perf record --all-kernel -g mount /dev/vdc /media/scratch/

of course replace the device/mount path appropriately. This will result in a
perf.data file which contains stacktraces of the hottest paths executed
during the invocation of mount. If you could send this file to the mailing
list or upload it somewhere for interested people (me and perhaps Qu) to
inspect, that would be appreciated.

If the file turns out to be way too big you can use

perf report --stdio

to create a text output and send that instead.

> 
> Best,
> 
> ellis
> 


Re: [PATCHv3] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-12-04 Thread Nikolay Borisov



On 21.11.18 г. 17:10 ч., Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the
> following call trace:
>   btrfs   D0  5760   5324 0x
>   Call Trace:
>? __schedule+0x243/0x800
>schedule+0x33/0x90
>btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>? wait_woken+0xa0/0xa0
>btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>btrfs_relocate_chunk+0x49/0x100 [btrfs]
>btrfs_balance+0xbeb/0x1740 [btrfs]
>btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>btrfs_ioctl+0x1691/0x3110 [btrfs]
>? lockdep_hardirqs_on+0xed/0x180
>? __handle_mm_fault+0x8e7/0xfb0
>? _raw_spin_unlock+0x24/0x30
>? __handle_mm_fault+0x8e7/0xfb0
>? do_vfs_ioctl+0xa5/0x6e0
>? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>do_vfs_ioctl+0xa5/0x6e0
>? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>ksys_ioctl+0x3a/0x70
>__x64_sys_ioctl+0x16/0x20
>do_syscall_64+0x60/0x1b0
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This happens because during page writeback it's valid for
> writepage_delalloc to instantiate a delalloc range which doesn't
> belong to the page currently being written back.
> 
> The reason this case is valid is due to find_lock_delalloc_range
> returning any available range after the passed delalloc_start and
> ignoring whether the page under writeback is within that range.
> In turn ordered extents (OE) are always created for the returned range
> from find_lock_delalloc_range. If, however, a failure occurs while OE
> are being created then the clean up code in btrfs_cleanup_ordered_extents
> will be called.
> 
> Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
> the case of such 'foreign' range being processed and instead it always
> assumes that the range OE are created for belongs to the page. This
> leads to the first page of such a foreign range not being cleaned up, since
> it's deliberately skipped by the current cleanup code.
> 
> Fix this by correctly checking whether the current page belongs to the
> range being instantiated and, if so, adjust the range parameters
> passed for cleaning up. If it doesn't, then just clean the whole OE
> range directly.
> 
> Signed-off-by: Nikolay Borisov 
> Reviewed-by: Josef Bacik 
> ---
> V3:
>  * Re-worded the commit for easier comprehension
>  * Added RB tag from Josef
> 
> V2:
>  * Fix compilation failure due to missing parentheses
>  * Fixed the "Fixes" tag.
>  fs/btrfs/inode.c | 29 -
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 

Ping,

Also this patch needs:

Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") and it needs to be applied to the stable releases 4.14





Re: [PATCH 2/2] btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 17:11 ч., David Sterba wrote:
> The scrub context is allocated with GFP_KERNEL and called from
> btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
> regarding reclaim that could try to flush filesystem data in order to
> get the memory. And the device_list_mutex is held during superblock
> commit, so this would cause a lockup.
> 
> Move the allocation and initialization before any changes that require
> the mutex.
> 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/scrub.c | 30 ++
>  1 file changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ffcab263e057..051d14c9f013 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3834,13 +3834,18 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, 
> u64 devid, u64 start,
>   return -EINVAL;
>   }
>  
> + /* Allocate outside of device_list_mutex */
> + sctx = scrub_setup_ctx(fs_info, is_dev_replace);
> + if (IS_ERR(sctx))
> + return PTR_ERR(sctx);
>  
>   mutex_lock(_info->fs_devices->device_list_mutex);
>   dev = btrfs_find_device(fs_info, devid, NULL, NULL);
>   if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state) &&
>!is_dev_replace)) {
>   mutex_unlock(_info->fs_devices->device_list_mutex);
> - return -ENODEV;
> + ret = -ENODEV;
> + goto out_free_ctx;
>   }
>  
>   if (!is_dev_replace && !readonly &&
> @@ -3848,7 +3853,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   mutex_unlock(_info->fs_devices->device_list_mutex);
>   btrfs_err_in_rcu(fs_info, "scrub: device %s is not writable",
>   rcu_str_deref(dev->name));
> - return -EROFS;
> + ret = -EROFS;
> + goto out_free_ctx;
>   }
>  
>   mutex_lock(_info->scrub_lock);
> @@ -3856,7 +3862,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   test_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state)) {
>   mutex_unlock(_info->scrub_lock);
>   mutex_unlock(_info->fs_devices->device_list_mutex);
> - return -EIO;
> + ret = -EIO;
> + goto out_free_ctx;
>   }
>  
>   btrfs_dev_replace_read_lock(_info->dev_replace);
> @@ -3866,7 +3873,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   btrfs_dev_replace_read_unlock(_info->dev_replace);
>   mutex_unlock(_info->scrub_lock);
>   mutex_unlock(_info->fs_devices->device_list_mutex);
> - return -EINPROGRESS;
> + ret = -EINPROGRESS;
> + goto out_free_ctx;
>   }
>   btrfs_dev_replace_read_unlock(_info->dev_replace);
>  
> @@ -3874,16 +3882,9 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   if (ret) {
>   mutex_unlock(_info->scrub_lock);
>   mutex_unlock(_info->fs_devices->device_list_mutex);
> - return ret;
> + goto out_free_ctx;

Don't we suffer the same issue when calling scrub_workers_get since in
it we do btrfs_alloc_workqueue which also calls kzalloc with GFP_KERNEL?


>   }
>  
> - sctx = scrub_setup_ctx(fs_info, is_dev_replace);
> - if (IS_ERR(sctx)) {
> - mutex_unlock(_info->scrub_lock);
> - mutex_unlock(_info->fs_devices->device_list_mutex);
> - scrub_workers_put(fs_info);
> - return PTR_ERR(sctx);
> - }
>   sctx->readonly = readonly;
>   dev->scrub_ctx = sctx;
>   mutex_unlock(_info->fs_devices->device_list_mutex);
> @@ -3936,6 +3937,11 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>  
>   scrub_put_ctx(sctx);
>  
> + return ret;
> +
> +out_free_ctx:
> + scrub_free_ctx(sctx);
> +
>   return ret;
>  }
>  
> 


Re: [PATCH 1/2] btrfs: scrub: pass fs_info to scrub_setup_ctx

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 17:11 ч., David Sterba wrote:
> We can pass fs_info directly as this is the only member of btrfs_device
> that's being used inside scrub_setup_ctx.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/scrub.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index bbd1b36f4918..ffcab263e057 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -578,12 +578,11 @@ static void scrub_put_ctx(struct scrub_ctx *sctx)
>   scrub_free_ctx(sctx);
>  }
>  
> -static noinline_for_stack
> -struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int 
> is_dev_replace)
> +static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
> + struct btrfs_fs_info *fs_info, int is_dev_replace)
>  {
>   struct scrub_ctx *sctx;
>   int i;
> - struct btrfs_fs_info *fs_info = dev->fs_info;
>  
>   sctx = kzalloc(sizeof(*sctx), GFP_KERNEL);
>   if (!sctx)
> @@ -592,7 +591,7 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device 
> *dev, int is_dev_replace)
>   sctx->is_dev_replace = is_dev_replace;
>   sctx->pages_per_rd_bio = SCRUB_PAGES_PER_RD_BIO;
>   sctx->curr = -1;
> - sctx->fs_info = dev->fs_info;
> + sctx->fs_info = fs_info;
>   for (i = 0; i < SCRUB_BIOS_PER_SCTX; ++i) {
>   struct scrub_bio *sbio;
>  
> @@ -3878,7 +3877,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   return ret;
>   }
>  
> - sctx = scrub_setup_ctx(dev, is_dev_replace);
> + sctx = scrub_setup_ctx(fs_info, is_dev_replace);
>   if (IS_ERR(sctx)) {
>   mutex_unlock(_info->scrub_lock);
>   mutex_unlock(_info->fs_devices->device_list_mutex);
> 


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Nikolay Borisov



On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasing problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

Would it be possible to provide perf traces of the longer-running mount
time? Everyone seems to be fixated on reading block groups (which is
likely to be the culprit) but before pointing fingers I'd like concrete
evidence pointing at the offender.

> 
> Best,
> 
> ellis
> 


Re: [PATCH 7/9] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 14:22 ч., Adam Borowski wrote:
> On Tue, Dec 04, 2018 at 01:17:04PM +0100, David Sterba wrote:
>> On Fri, Nov 16, 2018 at 03:54:24PM +0800, Qu Wenruo wrote:
>>> The only location is the following code:
>>>
>>> int level = path->lowest_level + 1;
>>> BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>>> while(level < BTRFS_MAX_LEVEL) {
>>> slot = path->slots[level] + 1;
>>> ...
>>> }
>>> path->slots[level] = slot;
>>>
>>> Again, it's the stupid compiler that needs some hint about the fact that
>>> we will always enter the while loop at least once, thus @slot should
>>> always be initialized.
>>
>> Harsh words for the compiler, and I say not deserved. The same code
>> pasted into the kernel and built with the same version does not report the
>> warning, so it's apparently a missing annotation of BUG_ON in
>> btrfs-progs that does not give the right hint.
> 
> It'd be nice if the C language provided a kind of while loop that executes
> at least once...

But it does, it's called:

do {

} while()
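A minimal, self-contained illustration (plain userspace C, not the btrfs code
itself) of why the body of a do/while runs at least once, so a variable that
is only assigned inside it is still guaranteed to be initialized afterwards:

#include <stdio.h>

int main(void)
{
	int level = 1;	/* analogous to lowest_level + 1 */
	int slot;	/* deliberately has no initializer */

	do {
		slot = level;	/* the body executes at least once */
		level++;
	} while (level < 8);

	/* slot is always set by the time we get here */
	printf("slot = %d\n", slot);
	return 0;
}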

> 


Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-12-04 Thread Nikolay Borisov



On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> we could think we're done flushing iputs in the data space reservation
> path when we could have a throttler doing an iput.  There's no real
> reason to serialize the delayed iput flushing, so instead of taking the
> cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> replace it with an atomic counter and a waitqueue.  This removes the
> short (or long depending on how big the inode is) window where we think
> there are no more pending iputs when there really are some.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   |  4 +++-
>  fs/btrfs/disk-io.c |  5 ++---
>  fs/btrfs/extent-tree.c | 13 -
>  fs/btrfs/inode.c   | 21 +
>  4 files changed, 34 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index dc56a4d940c3..20af5d6d81f1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -915,7 +915,8 @@ struct btrfs_fs_info {
>  
>   spinlock_t delayed_iput_lock;
>   struct list_head delayed_iputs;
> - struct mutex cleaner_delayed_iput_mutex;
> + atomic_t nr_delayed_iputs;
> + wait_queue_head_t delayed_iputs_wait;
>  

Have you considered whether the same could be achieved with a completion
rather than an open-coded waitqueue? I tried prototyping it and it could
be done, but it becomes messy regarding when the completion should be
initialised, i.e. only when one is not already pending in
btrfs_add_delayed_iput. A rough sketch of what I mean is below.
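(Purely hypothetical sketch - the field and helper names are made up, and the
initialisation ordering is exactly the awkward part: the completion would have
to start out in the "completed" state at mount time and be re-armed when the
first iput is queued, which races with complete_all().)

/* in struct btrfs_fs_info, instead of the counter + waitqueue pair:
 *	atomic_t nr_delayed_iputs;
 *	struct completion delayed_iputs_done;
 */

static void delayed_iput_queued(struct btrfs_fs_info *fs_info)
{
	/* the first pending iput re-arms the completion */
	if (atomic_inc_return(&fs_info->nr_delayed_iputs) == 1)
		reinit_completion(&fs_info->delayed_iputs_done);
}

static void delayed_iput_finished(struct btrfs_fs_info *fs_info)
{
	if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
		complete_all(&fs_info->delayed_iputs_done);
}

static int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
{
	/* wait_for_completion_killable() returns non-zero if we were killed */
	if (wait_for_completion_killable(&fs_info->delayed_iputs_done))
		return -EINTR;
	return 0;
}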




> @@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>* bunch of pinned space, so make sure we run the iputs before
>* we do our pinned bytes check below.
>*/
> - mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
>   btrfs_run_delayed_iputs(fs_info);
> - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> + btrfs_wait_on_delayed_iputs(fs_info);

Waiting on delayed iputs here is pointless since they are run
synchronously from this context.

>  
>   ret = may_commit_transaction(fs_info, space_info);
>   break;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0b9f3e482cea..958e30c7c744 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3260,6 +3260,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
>   if (atomic_add_unless(>i_count, -1, 1))
>   return;
>  
> + atomic_inc(_info->nr_delayed_iputs);
>   spin_lock(_info->delayed_iput_lock);
>   ASSERT(list_empty(>delayed_iput));
>   list_add_tail(>delayed_iput, _info->delayed_iputs);
> @@ -3280,11 +3281,31 @@ void btrfs_run_delayed_iputs(struct btrfs_fs_info 
> *fs_info)
>   list_del_init(>delayed_iput);
>   spin_unlock(_info->delayed_iput_lock);
>   iput(>vfs_inode);
> + if (atomic_dec_and_test(_info->nr_delayed_iputs))
> + wake_up(_info->delayed_iputs_wait);
>   spin_lock(_info->delayed_iput_lock);
>   }
>   spin_unlock(_info->delayed_iput_lock);
>  }
>  
> +/**
> + * btrfs_wait_on_delayed_iputs - wait on the delayed iputs to be done running
> + * @fs_info - the fs_info for this fs
> + * @return - EINTR if we were killed, 0 if nothing's pending
> + *
> + * This will wait on any delayed iputs that are currently running with 
> KILLABLE
> + * set.  Once they are all done running we will return, unless we are killed 
> in
> + * which case we return EINTR.
> + */
> +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
> +{
> + int ret = wait_event_killable(fs_info->delayed_iputs_wait,
> + atomic_read(_info->nr_delayed_iputs) == 0);
> + if (ret)
> + return -EINTR;
> + return 0;
> +}
> +
>  /*
>   * This creates an orphan entry for the given inode in case something goes 
> wrong
>   * in the middle of an unlink.
> 


Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-12-04 Thread Nikolay Borisov



On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> The cleaner thread usually takes care of delayed iputs, with the
> exception of the btrfs_end_transaction_throttle path.  The cleaner
> thread only gets woken up every 30 seconds, so instead wake it up to do
> its work so that we can free up that space as quickly as possible.

This description misses any rationale whatsoever about why the cleaner
needs to be woken up more often than every 30 seconds (and IMO this is
the most important question that needs answering).

Also, have you done any measurements of the number of processed delayed
inodes with this change? Given the behavior you desire, why not just run
the delayed iputs via schedule_work on the global workqueue and be done
with it? I'm sure the latency will be better than the current 30-second
one :) A rough sketch of what I have in mind is below.
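(Untested sketch only - the work item member and its initialisation are
hypothetical, purely to illustrate the idea:)

/* hypothetical member in struct btrfs_fs_info:
 *	struct work_struct delayed_iput_work;
 */

static void btrfs_delayed_iput_worker(struct work_struct *work)
{
	struct btrfs_fs_info *fs_info =
		container_of(work, struct btrfs_fs_info, delayed_iput_work);

	btrfs_run_delayed_iputs(fs_info);
}

/* at mount time:
 *	INIT_WORK(&fs_info->delayed_iput_work, btrfs_delayed_iput_worker);
 *
 * and in btrfs_add_delayed_iput(), after queueing the inode:
 *	schedule_work(&fs_info->delayed_iput_work);
 */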

> 
> Reviewed-by: Filipe Manana 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   | 3 +++
>  fs/btrfs/disk-io.c | 3 +++
>  fs/btrfs/inode.c   | 2 ++
>  3 files changed, 8 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index c8ddbacb6748..dc56a4d940c3 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -769,6 +769,9 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info 
> *fs_info, void *ptr);
>   */
>  #define BTRFS_FS_BALANCE_RUNNING 18
>  
> +/* Indicate that the cleaner thread is awake and doing something. */
> +#define BTRFS_FS_CLEANER_RUNNING 19
> +
>  struct btrfs_fs_info {
>   u8 fsid[BTRFS_FSID_SIZE];
>   u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index c5918ff8241b..f40f6fdc1019 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1669,6 +1669,8 @@ static int cleaner_kthread(void *arg)
>   while (1) {
>   again = 0;
>  
> + set_bit(BTRFS_FS_CLEANER_RUNNING, _info->flags);
> +
>   /* Make the cleaner go to sleep early. */
>   if (btrfs_need_cleaner_sleep(fs_info))
>   goto sleep;
> @@ -1715,6 +1717,7 @@ static int cleaner_kthread(void *arg)
>*/
>   btrfs_delete_unused_bgs(fs_info);
>  sleep:
> + clear_bit(BTRFS_FS_CLEANER_RUNNING, _info->flags);
>   if (kthread_should_park())
>   kthread_parkme();
>   if (kthread_should_stop())
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8ac7abe2ae9b..0b9f3e482cea 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3264,6 +3264,8 @@ void btrfs_add_delayed_iput(struct inode *inode)
>   ASSERT(list_empty(>delayed_iput));
>   list_add_tail(>delayed_iput, _info->delayed_iputs);
>   spin_unlock(_info->delayed_iput_lock);
> + if (!test_bit(BTRFS_FS_CLEANER_RUNNING, _info->flags))
> + wake_up_process(fs_info->cleaner_kthread);
>  }
>  
>  void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
> 


Re: [PATCH 1/3] btrfs: run delayed iputs before committing

2018-12-04 Thread Nikolay Borisov



On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> Delayed iputs mean we can have final iputs of deleted inodes in the
> queue, which could potentially generate a lot of pinned space that could
> be freed.  So before we decide to commit the transaction for ENOSPC
> reasons, run the delayed iputs so that any potential space is freed up.
> If there are any and we freed enough, we can then commit the transaction
> and potentially be able to make our reservation.
> 
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 8dfddfd3f315..0127d272cd2a 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4953,6 +4953,15 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>   ret = 0;
>   break;
>   case COMMIT_TRANS:
> + /*
> +  * If we have pending delayed iputs then we could free up a
> +  * bunch of pinned space, so make sure we run the iputs before
> +  * we do our pinned bytes check below.
> +  */
> + mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> + btrfs_run_delayed_iputs(fs_info);
> + mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> +
>   ret = may_commit_transaction(fs_info, space_info);
>   break;
>   default:
> 


Re: [PATCH 2/5] btrfs: Refactor btrfs_can_relocate

2018-12-03 Thread Nikolay Borisov



On 3.12.18 г. 19:25 ч., David Sterba wrote:
> On Sat, Nov 17, 2018 at 09:29:27AM +0800, Anand Jain wrote:
>>> -   ret = find_free_dev_extent(trans, device, min_free,
>>> -  _offset, NULL);
>>> -   if (!ret)
>>> +   if (!find_free_dev_extent(trans, device, min_free,
>>> +  _offset, NULL))
>>
>>   This can return -ENOMEM;
>>
>>> @@ -2856,8 +2856,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info 
>>> *fs_info, u64 chunk_offset)
>>>  */
>>> lockdep_assert_held(_info->delete_unused_bgs_mutex);
>>>   
>>> -   ret = btrfs_can_relocate(fs_info, chunk_offset);
>>> -   if (ret)
>>> +   if (!btrfs_can_relocate(fs_info, chunk_offset))
>>> return -ENOSPC;
>>
>>   And ends up converting -ENOMEM to -ENOSPC.
>>
>>   Its better to propagate the accurate error.
> 
> Right, converting to bool is obscuring the reason why the functions
> fail. Making the code simpler at this cost does not look like a good
> idea to me. I'll remove the patch from misc-next for now.

The patch itself does not make the code more obscure than it currently is,
because even if ENOMEM is returned it's still converted to ENOSPC in
btrfs_relocate_chunk.
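(For reference, what propagating the original error in btrfs_relocate_chunk
would look like, roughly - i.e. keeping the int return value:)

	ret = btrfs_can_relocate(fs_info, chunk_offset);
	if (ret)
		return ret;	/* keep -ENOMEM instead of turning it into -ENOSPC */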
> 


Re: [RFC PATCH] btrfs: Remove __extent_readpages

2018-12-03 Thread Nikolay Borisov



On 3.12.18 г. 12:25 ч., Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
> 
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16.
> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an 
> example)
> then the contiguous page read code will be called twice.
> 
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures) so they are ready to be submitted directly to __do_contiguous_readpages.
> Also simplify the name of the function to contiguous_readpages. 
> 
> Signed-off-by: Nikolay Borisov 
> ---
> 
> So this patch looks like a very nice cleanup, however when doing performance 
> measurements with fio I was shocked to see that it actually is detrimental to 
> performance. Here are the results: 
> 
> The command line used for fio: 
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
>  --numjobs=1 --size=1G --runtime=600  --group_reporting --loop 10
> 
> This was tested on a vm with 4g of ram so the size of the test is smaller 
> than 
> the memory, so pages should have been nicely readahead. 
> 
> PATCHED: 
> 
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s] 
> /media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec  3 
> 09:57:17 2018
>   read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
> clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
>  lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
> clat percentiles (nsec):
>  |  1.00th=[   1848],  5.00th=[   1896], 10.00th=[   1912],
>  | 20.00th=[   1960], 30.00th=[   2024], 40.00th=[   2160],
>  | 50.00th=[   2384], 60.00th=[   2576], 70.00th=[   2800],
>  | 80.00th=[   3120], 90.00th=[   3824], 95.00th=[   4768],
>  | 99.00th=[   7968], 99.50th=[  14912], 99.90th=[  50944],
>  | 99.95th=[ 667648], 99.99th=[5931008]
>bw (  KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, 
> stdev=162333.72, samples=50
>iops: min=  692, max=136135, avg=102478.08, stdev=40583.47, 
> samples=50
>   lat (usec)   : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
>   lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%
>   cpu  : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
>   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>  latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), 
> io=10.0GiB (10.7GB), run=33444-33444msec
> 
> 
> UNPATCHED:
> 
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s] 
> /media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec  3 
> 10:07:38 2018
>   read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
> clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
>  lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
> clat percentiles (nsec):
>  |  1.00th=[   1576],  5.00th=[   1608], 10.00th=[   1640],
>  | 20.00th=[   1672], 30.00th=[   1720], 40.00th=[   1832],
>  | 50.00th=[   2096], 60.00th=[   2288], 70.00th=[   2480],
>  | 80.00th=[   2736], 90.00th=[   3248], 95.00th=[   3952],
>  | 99.00th=[   6368], 99.50th=[  12736], 99.90th=[  43776],
>  | 99.95th=[ 798720], 99.99th=[5341184]
>bw (  KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, 
> stdev=177637.57, samples=45
>iops: min= 8536, max=151552, avg=116434.33, stdev=44409.46, 
> samples=45
>   lat (usec)   : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
>   lat (usec)   : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
>   cpu  : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9

[RFC PATCH] btrfs: Remove __extent_readpages

2018-12-03 Thread Nikolay Borisov
When extent_readpages is called from the generic readahead code it first
builds a batch of 16 pages (which might or might not be consecutive,
depending on whether add_to_page_cache_lru failed) and submits them to
__extent_readpages. The latter ensures that the range of pages (in the
batch of 16) that is passed to __do_contiguous_readpages is consecutive.

If add_to_page_cache_lru doesn't fail then __extent_readpages will call
__do_contiguous_readpages only once with the whole batch of 16.
Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an 
example)
then the contiguous page read code will be called twice.

All of this can be simplified by exploiting the fact that all pages passed
to extent_readpages are consecutive, thus when the batch is built in
that function it is already consecutive (barring add_to_page_cache_lru
failures) so they are ready to be submitted directly to __do_contiguous_readpages.
Also simplify the name of the function to contiguous_readpages. 

Signed-off-by: Nikolay Borisov 
---

So this patch looks like a very nice cleanup, however when doing performance 
measurements with fio I was shocked to see that it actually is detrimental to 
performance. Here are the results: 

The command line used for fio: 
fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
 --numjobs=1 --size=1G --runtime=600  --group_reporting --loop 10

This was tested on a vm with 4g of ram so the size of the test is smaller than 
the memory, so pages should have been nicely readahead. 

PATCHED: 

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s] 
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec  3 
09:57:17 2018
  read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
 lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
clat percentiles (nsec):
 |  1.00th=[   1848],  5.00th=[   1896], 10.00th=[   1912],
 | 20.00th=[   1960], 30.00th=[   2024], 40.00th=[   2160],
 | 50.00th=[   2384], 60.00th=[   2576], 70.00th=[   2800],
 | 80.00th=[   3120], 90.00th=[   3824], 95.00th=[   4768],
 | 99.00th=[   7968], 99.50th=[  14912], 99.90th=[  50944],
 | 99.95th=[ 667648], 99.99th=[5931008]
   bw (  KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, 
stdev=162333.72, samples=50
   iops: min=  692, max=136135, avg=102478.08, stdev=40583.47, 
samples=50
  lat (usec)   : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
  lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%
  cpu  : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), io=10.0GiB 
(10.7GB), run=33444-33444msec


UNPATCHED:

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s] 
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec  3 
10:07:38 2018
  read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
 lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
clat percentiles (nsec):
 |  1.00th=[   1576],  5.00th=[   1608], 10.00th=[   1640],
 | 20.00th=[   1672], 30.00th=[   1720], 40.00th=[   1832],
 | 50.00th=[   2096], 60.00th=[   2288], 70.00th=[   2480],
 | 80.00th=[   2736], 90.00th=[   3248], 95.00th=[   3952],
 | 99.00th=[   6368], 99.50th=[  12736], 99.90th=[  43776],
 | 99.95th=[ 798720], 99.99th=[5341184]
   bw (  KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, 
stdev=177637.57, samples=45
   iops: min= 8536, max=151552, avg=116434.33, stdev=44409.46, 
samples=45
  lat (usec)   : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
  lat (usec)   : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
  cpu  : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=10.0GiB 
(10.7GB), run=29008-29008

Re: btrfs development - question about crypto api integration

2018-11-30 Thread Nikolay Borisov



On 30.11.18 г. 17:22 ч., Chris Mason wrote:
> On 29 Nov 2018, at 12:37, Nikolay Borisov wrote:
> 
>> On 29.11.18 г. 18:43 ч., Jean Fobe wrote:
>>> Hi all,
>>> I've been studying LZ4 and other compression algorithms on the
>>> kernel, and seen other projects such as zram and ubifs using the
>>> crypto api. Is there a technical reason for not using the crypto api
>>> for compression (and possibly for encryption) in btrfs?
>>> I did not find any design/technical implementation choices in
>>> btrfs development in the developer's FAQ on the wiki. If I completely
>>> missed it, could someone point me in the right direction?
>>> Lastly, if there is no technical reason for this, would it be
>>> something interesting to have implemented?
>>
>> I personally think it would be better if btrfs exploited the generic
>> framework. And in fact when you look at zstd, btrfs does use the
>> generic, low-level ZSTD routines but not the crypto library wrappers.
>> If I were you I'd try to convert zstd (since it's the most recently
>> added algorithm) to using the crypto layer to see if there are any
>> lurking problems.
> 
> Back when I first added the zlib support, the zlib API was both easier 
> to use and a better fit for our async worker threads.  That doesn't mean 
> we shouldn't switch, it's just how we got to step one ;)

And what about zstd? Why is it also using the low-level API and not the
crypto wrappers?

> 
> -chris
> 


Re: btrfs development - question about crypto api integration

2018-11-29 Thread Nikolay Borisov



On 29.11.18 г. 18:43 ч., Jean Fobe wrote:
> Hi all,
> I've been studying LZ4 and other compression algorithms on the
> kernel, and seen other projects such as zram and ubifs using the
> crypto api. Is there a technical reason for not using the crypto api
> for compression (and possibly for encryption) in btrfs?
> I did not find any design/technical implementation choices in
> btrfs development in the developer's FAQ on the wiki. If I completely
> missed it, could someone point me in the right direction?
> Lastly, if there is no technical reason for this, would it be
> something interesting to have implemented?

I personally think it would be better if btrfs exploited the generic
framework. And in fact when you look at zstd, btrfs does use the
generic, low-level ZSTD routines but not the crypto library wrappers. If
I were you I'd try to convert zstd (since it's the most recently added
algorithm) to using the crypto layer to see if there are any lurking
problems. Roughly, the crypto API side of that would look like the
sketch below.
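(Minimal sketch, error handling mostly elided; I believe this synchronous
crypto_comp interface is what zram uses, and "zstd" is registered by
crypto/zstd.c:)

#include <linux/crypto.h>
#include <linux/err.h>

static int zstd_compress_via_crypto_api(const u8 *src, unsigned int slen,
					u8 *dst, unsigned int *dlen)
{
	struct crypto_comp *tfm;
	int ret;

	tfm = crypto_alloc_comp("zstd", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	/* *dlen is in/out: capacity of dst on entry, compressed size on exit */
	ret = crypto_comp_compress(tfm, src, slen, dst, dlen);

	crypto_free_comp(tfm);
	return ret;
}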

> 
> Best regards
> 


[PATCH] btrfs: Refactor main loop in extent_readpages

2018-11-29 Thread Nikolay Borisov
extent_readpages processes all pages in the readlist in batches of 16.
This is implemented by a single for loop, but thanks to an if condition
the loop does 2 things based on whether we've filled the batch or not.
Additionally, due to the structure of the code there is an additional
check which deals with partial batches.

Streamline all of this by explicitly using two loops. The outer one is
used to process all pages while the inner one just fills in the batch
of 16 (currently). Due to this new structure the code guarantees that
all pages are processed in the loop, hence the code to deal with any
leftovers is eliminated.

This also enables the compiler to inline __extent_readpages:

./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for

add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160)
Function                                     old     new   delta
extent_readpages                             476    1136    +660
__extent_readpages                           820       -    -820
Total: Before=44315, After=44155, chg -0.36%

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/extent_io.c | 37 -
 1 file changed, 16 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8332c5f4b1c3..233f835dd6f8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4093,43 +4093,38 @@ int extent_writepages(struct address_space *mapping,
 int extent_readpages(struct address_space *mapping, struct list_head *pages,
 unsigned nr_pages)
 {
+#define BTRFS_PAGES_BATCH 16
+
struct bio *bio = NULL;
-   unsigned page_idx;
unsigned long bio_flags = 0;
-   struct page *pagepool[16];
-   struct page *page;
+   struct page *pagepool[BTRFS_PAGES_BATCH];
struct extent_map *em_cached = NULL;
	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
int nr = 0;
u64 prev_em_start = (u64)-1;
 
-   for (page_idx = 0; page_idx < nr_pages; page_idx++) {
-   page = lru_to_page(pages);
+   while (!list_empty(pages)) {
+   for (nr = 0; nr < BTRFS_PAGES_BATCH && !list_empty(pages);) {
+   struct page *page = lru_to_page(pages);
 
-   prefetchw(&page->flags);
-   list_del(&page->lru);
-   if (add_to_page_cache_lru(page, mapping,
-   page->index,
-   readahead_gfp_mask(mapping))) {
-   put_page(page);
-   continue;
+   prefetchw(&page->flags);
+   list_del(&page->lru);
+   if (add_to_page_cache_lru(page, mapping, page->index,
+   readahead_gfp_mask(mapping))) {
+   put_page(page);
+   continue;
+   }
+
+   pagepool[nr++] = page;
}
 
-   pagepool[nr++] = page;
-   if (nr < ARRAY_SIZE(pagepool))
-   continue;
		__extent_readpages(tree, pagepool, nr, &em_cached, &bio,
-			&bio_flags, &prev_em_start);
-   nr = 0;
+			   &bio_flags, &prev_em_start);
}
-   if (nr)
-   __extent_readpages(tree, pagepool, nr, &em_cached, &bio,
-   &bio_flags, &prev_em_start);
 
if (em_cached)
free_extent_map(em_cached);
 
-   BUG_ON(!list_empty(pages));
if (bio)
return submit_one_bio(bio, 0, bio_flags);
return 0;
-- 
2.17.1



Re: [PATCH v2 1/3] btrfs: remove always true if branch in find_delalloc_range

2018-11-28 Thread Nikolay Borisov



On 29.11.18 г. 5:33 ч., Lu Fengqi wrote:
> The @found is always false when it comes to the if branch. Besides, the
> bool type is more suitable for @found. Change the return value of the
> function and its caller to bool as well.
> 
> Signed-off-by: Lu Fengqi 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent_io.c | 31 +++
>  fs/btrfs/extent_io.h |  2 +-
>  fs/btrfs/tests/extent-io-tests.c |  2 +-
>  3 files changed, 17 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index b2769e92b556..4b6b87e63b4a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1452,16 +1452,16 @@ int find_first_extent_bit(struct extent_io_tree 
> *tree, u64 start,
>   * find a contiguous range of bytes in the file marked as delalloc, not
>   * more than 'max_bytes'.  start and end are used to return the range,
>   *
> - * 1 is returned if we find something, 0 if nothing was in the tree
> + * true is returned if we find something, false if nothing was in the tree
>   */
> -static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
> +static noinline bool find_delalloc_range(struct extent_io_tree *tree,
>   u64 *start, u64 *end, u64 max_bytes,
>   struct extent_state **cached_state)
>  {
>   struct rb_node *node;
>   struct extent_state *state;
>   u64 cur_start = *start;
> - u64 found = 0;
> + bool found = false;
>   u64 total_bytes = 0;
>  
>   spin_lock(>lock);
> @@ -1472,8 +1472,7 @@ static noinline u64 find_delalloc_range(struct 
> extent_io_tree *tree,
>*/
>   node = tree_search(tree, cur_start);
>   if (!node) {
> - if (!found)
> - *end = (u64)-1;
> + *end = (u64)-1;
>   goto out;
>   }
>  
> @@ -1493,7 +1492,7 @@ static noinline u64 find_delalloc_range(struct 
> extent_io_tree *tree,
>   *cached_state = state;
>   refcount_inc(>refs);
>   }
> - found++;
> + found = true;
>   *end = state->end;
>   cur_start = state->end + 1;
>   node = rb_next(node);
> @@ -1551,13 +1550,13 @@ static noinline int lock_delalloc_pages(struct inode 
> *inode,
>  }
>  
>  /*
> - * find a contiguous range of bytes in the file marked as delalloc, not
> - * more than 'max_bytes'.  start and end are used to return the range,
> + * find and lock a contiguous range of bytes in the file marked as delalloc,
> + * not more than 'max_bytes'.  start and end are used to return the range,
>   *
> - * 1 is returned if we find something, 0 if nothing was in the tree
> + * true is returned if we find something, false if nothing was in the tree
>   */
>  EXPORT_FOR_TESTS
> -noinline_for_stack u64 find_lock_delalloc_range(struct inode *inode,
> +noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>   struct extent_io_tree *tree,
>   struct page *locked_page, u64 *start,
>   u64 *end)
> @@ -1565,7 +1564,7 @@ noinline_for_stack u64 find_lock_delalloc_range(struct 
> inode *inode,
>   u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>   u64 delalloc_start;
>   u64 delalloc_end;
> - u64 found;
> + bool found;
>   struct extent_state *cached_state = NULL;
>   int ret;
>   int loops = 0;
> @@ -1580,7 +1579,7 @@ noinline_for_stack u64 find_lock_delalloc_range(struct 
> inode *inode,
>   *start = delalloc_start;
>   *end = delalloc_end;
>   free_extent_state(cached_state);
> - return 0;
> + return false;
>   }
>  
>   /*
> @@ -1612,7 +1611,7 @@ noinline_for_stack u64 find_lock_delalloc_range(struct 
> inode *inode,
>   loops = 1;
>   goto again;
>   } else {
> - found = 0;
> + found = false;
>   goto out_failed;
>   }
>   }
> @@ -3195,7 +3194,7 @@ static noinline_for_stack int writepage_delalloc(struct 
> inode *inode,
>  {
>   struct extent_io_tree *tree = _I(inode)->io_tree;
>   u64 page_end = delalloc_start + PAGE_SIZE - 1;
> - u64 nr_delalloc;
> + bool found;
>   u64 delalloc_to_write = 0;
>   u64 delalloc_end = 0;
>   int ret;
> @@ -3203,11 +3202,11 @@ static noinline_for_stack int 
> writepage_delalloc(struct inode *inode,
>  

Re: [PATCH] btrfs: Refactor btrfs_merge_bio_hook

2018-11-28 Thread Nikolay Borisov



On 28.11.18 г. 18:46 ч., David Sterba wrote:
> On Tue, Nov 27, 2018 at 08:57:58PM +0200, Nikolay Borisov wrote:
>> This function really checks whether adding more data to the bio will
>> straddle a stripe/chunk. So first let's give it a more appropriate
>> name - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was
>> never used, so just remove it. Thirdly, pages are submitted to either
>> btree or data inodes so it's guaranteed that tree->ops is set, so replace
>> the check with an ASSERT. Finally, document the parameters of the
>> function. No functional changes.
>>
>> Signed-off-by: Nikolay Borisov 
> 
> Reviewed-by: David Sterba 
> 
>> -submit = btrfs_merge_bio_hook(page, 0, PAGE_SIZE, bio, 
>> 0);
>> +submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>> +  0);
> 
>> -submit = btrfs_merge_bio_hook(page, 0, PAGE_SIZE,
>> -comp_bio, 0);
>> +submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE,
>> +  comp_bio, 0);
> 
> 
>> -if (tree->ops && btrfs_merge_bio_hook(page, offset, page_size,
>> -  bio, bio_flags))
>> +ASSERT(tree->ops);
>> +if (btrfs_bio_fits_in_stripe(page, page_size, bio, bio_flags))
>>  can_merge = false;
> 
> Got me curious if we could get rid of the size parameter, it's 2x
> PAGE_SIZE and could be something else in one case but it's not obvious
> if it really happens.
> 
> Another thing I noticed is lack of proper error handling in all callers,
> as its' 0, 1, and negative errno. The error would be interpreted as true
> ie. add page to bio and continue.

Actually, if anything other than 0 is returned then the current bio is
submitted (I presume you refer to the code in compression.c). As a
matter of fact I think btrfs_bio_fits_in_stripe could even be turned
into a function returning a bool value.

The only time this function could return an error is if the mapping
logic goes haywire, which could happen at 'if (offset < stripe_offset) {',
or if we don't find a chunk for the given offset, which is unlikely.

> 


Re: [PATCH] Fix typos

2018-11-28 Thread Nikolay Borisov



On 28.11.18 г. 15:24 ч., Brendan Hide wrote:
> 
> 
> On 11/28/18 1:23 PM, Nikolay Borisov wrote:
>>
>>
>> On 28.11.18 г. 13:05 ч., Andrea Gelmini wrote:
>>> Signed-off-by: Andrea Gelmini 
>>> ---
> 
> 
> 
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index bab2f1983c07..babbd75d91d2 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -104,7 +104,7 @@ static void __endio_write_update_ordered(struct
>>> inode *inode,
>>>     /*
>>>    * Cleanup all submitted ordered extents in specified range to
>>> handle errors
>>> - * from the fill_dellaloc() callback.
>>> + * from the fill_delalloc() callback.
>>
>> This is a pure whitespace fix which is generally frowned upon. What you
>> can do though, is replace 'fill_delalloc callback' with
>> 'btrfs_run_delalloc_range' since the callback is gone already.
>>
>>>    *
>>>    * NOTE: caller must ensure that when an error happens, it can not
>>> call
>>>    * extent_clear_unlock_delalloc() to clear both the bits
>>> EXTENT_DO_ACCOUNTING
>>> @@ -1831,7 +1831,7 @@ void btrfs_clear_delalloc_extent(struct inode
>>> *vfs_inode,
>>
>> 
>>
>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>> index 410c7e007ba8..d7b6c2b09a0c 100644
>>> --- a/fs/btrfs/ioctl.c
>>> +++ b/fs/btrfs/ioctl.c
>>> @@ -892,7 +892,7 @@ static int create_snapshot(struct btrfs_root
>>> *root, struct inode *dir,
>>>    *  7. If we were asked to remove a directory and victim isn't one
>>> - ENOTDIR.
>>>    *  8. If we were asked to remove a non-directory and victim isn't
>>> one - EISDIR.
>>>    *  9. We can't remove a root or mountpoint.
>>> - * 10. We don't allow removal of NFS sillyrenamed files; it's
>>> handled by
>>> + * 10. We don't allow removal of NFS silly renamed files; it's
>>> handled by
>>>    * nfs_async_unlink().
>>>    */
>>>   @@ -3522,7 +3522,7 @@ static int btrfs_extent_same_range(struct
>>> inode *src, u64 loff, u64 olen,
>>>  false);
>>>   /*
>>>    * If one of the inodes has dirty pages in the respective range or
>>> - * ordered extents, we need to flush dellaloc and wait for all
>>> ordered
>>> + * ordered extents, we need to flush delalloc and wait for all
>>> ordered
>>
>> Just whitespace fix, drop it.
>>
>> 
>>
> 
> If the spelling is changed, surely that is not a whitespace fix?

My bad, I missed it the first time.

> 


Re: [PATCH] Fix typos

2018-11-28 Thread Nikolay Borisov



On 28.11.18 г. 13:05 ч., Andrea Gelmini wrote:
> Signed-off-by: Andrea Gelmini 
> ---
> 
> Stupid fixes. Made on 4.20-rc4, and ported on linux-next (next-20181128).
> 
> 
>  fs/btrfs/backref.c |  4 ++--
>  fs/btrfs/check-integrity.c |  2 +-
>  fs/btrfs/compression.c |  4 ++--
>  fs/btrfs/ctree.c   |  4 ++--
>  fs/btrfs/dev-replace.c |  2 +-
>  fs/btrfs/disk-io.c |  4 ++--
>  fs/btrfs/extent-tree.c | 28 ++--
>  fs/btrfs/extent_io.c   |  4 ++--
>  fs/btrfs/extent_io.h   |  2 +-
>  fs/btrfs/extent_map.c  |  2 +-
>  fs/btrfs/file.c|  6 +++---
>  fs/btrfs/free-space-tree.c |  2 +-
>  fs/btrfs/inode.c   | 10 +-
>  fs/btrfs/ioctl.c   |  4 ++--
>  fs/btrfs/lzo.c |  2 +-
>  fs/btrfs/qgroup.c  | 14 +++---
>  fs/btrfs/qgroup.h  |  4 ++--
>  fs/btrfs/raid56.c  |  2 +-
>  fs/btrfs/ref-verify.c  |  6 +++---
>  fs/btrfs/relocation.c  |  2 +-
>  fs/btrfs/scrub.c   |  2 +-
>  fs/btrfs/send.c|  4 ++--
>  fs/btrfs/super.c   |  8 
>  fs/btrfs/transaction.c |  4 ++--
>  fs/btrfs/tree-checker.c|  6 +++---
>  fs/btrfs/tree-log.c|  4 ++--
>  fs/btrfs/volumes.c |  8 
>  27 files changed, 72 insertions(+), 72 deletions(-)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 4a15f87dbbb4..78556447e1d5 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -591,7 +591,7 @@ unode_aux_to_inode_list(struct ulist_node *node)
>  }
>  
>  /*



> @@ -9807,7 +9807,7 @@ void btrfs_dec_block_group_ro(struct 
> btrfs_block_group_cache *cache)
>  }
>  
>  /*
> - * checks to see if its even possible to relocate this block group.
> + * checks to see if it's even possible to relocate this block group.
>   *
>   * @return - false if not enough space can be found for relocation, true
>   * otherwise
> @@ -9872,7 +9872,7 @@ bool btrfs_can_relocate(struct btrfs_fs_info *fs_info, 
> u64 bytenr)
>* ok we don't have enough space, but maybe we have free space on our
>* devices to allocate new chunks for relocation, so loop through our
>* alloc devices and guess if we have enough space.  if this block
> -  * group is going to be restriped, run checks against the target
> +  * group is going to be restripped, run checks against the target

Drop this hunk; here we mean to restripe the block group, not re-strip
it, so "restriped" is the correct past tense of the verb.

>* profile instead of the current one.
>*/
>   can_reloc = false;
> @@ -10424,7 +10424,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
> *info)
>* check for two cases, either we are full, and therefore
>* don't need to bother with the caching work since we won't
>* find any space, or we are empty, and we can just add all
> -  * the space in and be done with it.  This saves us _alot_ of
> +  * the space in and be done with it.  This saves us _a_lot_ of

This should be _a lot_

>* time, particularly in the full case.
>*/
>   if (found_key.offset == btrfs_block_group_used(>item)) {
> @@ -10700,7 +10700,7 @@ int btrfs_remove_block_group(struct 
> btrfs_trans_handle *trans,
>  
>   mutex_lock(>transaction->cache_write_mutex);
>   /*
> -  * make sure our free spache cache IO is done before remove the
> +  * make sure our free space cache IO is done before remove the

I think there should be a 'we' before 'remove'

>* free space inode
>*/
>   spin_lock(>transaction->dirty_bgs_lock);
> @@ -11217,7 +11217,7 @@ static int btrfs_trim_free_extents(struct 
> btrfs_device *device,
>   if (!blk_queue_discard(bdev_get_queue(device->bdev)))
>   return 0;
>  
> - /* Not writeable = nothing to do. */
> + /* Not writable = nothing to do. */

This comment is redundant, so it could be removed altogether.

>   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state))
>   return 0;
>  
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index aef3c9866ff0..1493f0c102ec 100644



> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
> index e5089087eaa6..13e71efcfe21 100644
> --- a/fs/btrfs/free-space-tree.c
> +++ b/fs/btrfs/free-space-tree.c
> @@ -550,7 +550,7 @@ static void free_space_set_bits(struct 
> btrfs_block_group_cache *block_group,
>  
>  /*
>   * We can't use btrfs_next_item() in modify_free_space_bitmap() because
> - * btrfs_next_leaf() doesn't get the path for writing. We can forgo the fancy
> + * btrfs_next_leaf() doesn't get the path for writing. We can forgot the 
> fancy

Drop this hunk, the meaning is we can eschew the fancy tree walking.

>   * tree walking in btrfs_next_leaf() anyways because we know exactly what 
> we're
>   * looking for.
>   */
> diff --git a/fs/btrfs/inode.c 

Re: [PATCH v2 3/3] btrfs: document extent mapping assumptions in checksum

2018-11-28 Thread Nikolay Borisov



On 28.11.18 г. 10:54 ч., Johannes Thumshirn wrote:
> Document why map_private_extent_buffer() cannot return '1' (i.e. the map
> spans two pages) for the csum_tree_block() case.
> 
> The current algorithm for detecting a page boundary crossing in
> map_private_extent_buffer() will return a '1' *IFF* the extent buffer's
> offset in the page + the offset passed in by csum_tree_block() and the
> minimal length passed in by csum_tree_block() - 1 is bigger than
> PAGE_SIZE.
> 
> We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
> and the current extent buffer allocator always guarantees page aligned
> extents, so the above condition can't be true.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 


> 
> ---
> Changes to v1:
> * Changed wording of the commit message according to Noah's suggestion
> ---
>  fs/btrfs/disk-io.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 4bc270ef29b4..14d355d0cb7a 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -279,6 +279,12 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
>  
>   len = buf->len - offset;
>   while (len > 0) {
> + /*
> +  * Note: we don't need to check for the err == 1 case here, as
> +  * with the given combination of 'start = BTRFS_CSUM_SIZE (32)'
> +  * and 'min_len = 32' and the currently implemented mapping
> +  * algorithm we cannot cross a page boundary.
> +  */
>   err = map_private_extent_buffer(buf, offset, 32,
>   , _start, _len);
>   if (err)
> 


Re: [RFC PATCH 03/17] btrfs: priority alloc: introduce compute_block_group_priority/usage

2018-11-28 Thread Nikolay Borisov



On 28.11.18 г. 5:11 ч., Su Yue wrote:
> Introduce compute_block_group_usage() and compute_block_group_usage().
> And call the latter in btrfs_make_block_group() and
> btrfs_read_block_groups().
> 
> compute_priority_level use ilog2(free) to compute priority level.
> 
> Signed-off-by: Su Yue 
> ---
>  fs/btrfs/extent-tree.c | 60 ++
>  1 file changed, 60 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d242a1174e50..0f4c5b1e0bcc 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10091,6 +10091,7 @@ static int check_chunk_block_group_mappings(struct 
> btrfs_fs_info *fs_info)
>   return ret;
>  }
>  
> +static long compute_block_group_priority(struct btrfs_block_group_cache *bg);

That is ugly, just put the function above the first place where it is
going to be used and don't introduce forward declarations for static
functions.

>  int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  {
>   struct btrfs_path *path;
> @@ -10224,6 +10225,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
> *info)
>  
>   link_block_group(cache);
>  
> + cache->priority = compute_block_group_priority(cache);
> +
>   set_avail_alloc_bits(info, cache->flags);
>   if (btrfs_chunk_readonly(info, cache->key.objectid)) {
>   inc_block_group_ro(cache, 1);
> @@ -10373,6 +10376,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
> *trans, u64 bytes_used,
>  
>   link_block_group(cache);
>  
> + cache->priority = compute_block_group_priority(cache);
> +
>   list_add_tail(>bg_list, >new_bgs);
>  
>   set_avail_alloc_bits(fs_info, type);
> @@ -11190,3 +11195,58 @@ void btrfs_mark_bg_unused(struct 
> btrfs_block_group_cache *bg)
>   }
>   spin_unlock(_info->unused_bgs_lock);
>  }
> +
> +enum btrfs_priority_shift {
> + PRIORITY_USAGE_SHIFT = 0
> +};
> +
> +static inline u8
> +compute_block_group_usage(struct btrfs_block_group_cache *cache)
> +{
> + u64 num_bytes;
> + u8 usage;
> +
> + num_bytes = cache->reserved + cache->bytes_super +
> + btrfs_block_group_used(>item);
> +
> + usage = div_u64(num_bytes, div_factor_fine(cache->key.offset, 1));

Mention somewhere (either as a function description or in the patch
description) that you use the % used.

> +
> + return usage;
> +}
> +
> +static long compute_block_group_priority(struct btrfs_block_group_cache *bg)
> +{
> + u8 usage;
> + long priority = 0;
> +
> + if (btrfs_test_opt(bg->fs_info, PRIORITY_USAGE)) {
> + usage = compute_block_group_usage(bg);
> + priority |= usage << PRIORITY_USAGE_SHIFT;
> + }

Why is priority a signed type and not unsigned? I assume priority can
never be negative. I briefly looked at the other patches and most of the
time the argument passed is indeed an unsigned value.

> +
> + return priority;
> +}
> +
> +static int compute_priority_level(struct btrfs_fs_info *fs_info,
> +   long priority)
> +{
> + int level = 0;
> +
> + if (btrfs_test_opt(fs_info, PRIORITY_USAGE)) {
> + u8 free;
> +
> + WARN_ON(priority < 0);

I think this WARN_ON is redundant provided that the high-level
interfaces are sane and don't allow negative values to trickle down.

> + free = 100 - (priority >> PRIORITY_USAGE_SHIFT);
> +
> + if (free == 0)
> + level = 0;
> + else if (free == 100)
> + level = ilog2(free) + 1;
> + else
> + level = ilog2(free);
> + /* log2(1) == 0, leave level 0 for unused block_groups */
> + level = ilog2(100) + 1 - level;
> + }
> +
> + return level;
> +}
> 
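For my own understanding, here is how the level mapping in the quoted code
works out (worked examples; assuming PRIORITY_USAGE_SHIFT == 0 and noting
that ilog2(100) == 6, so the last assignment is effectively level = 7 - level):

/*
 * free == 100 (unused group) -> ilog2(100) + 1 = 7 -> 7 - 7 = 0
 * free ==  50                -> ilog2(50)      = 5 -> 7 - 5 = 2
 * free ==   1                -> ilog2(1)       = 0 -> 7 - 0 = 7
 * free ==   0 (full group)   -> special-cased    0 -> 7 - 0 = 7
 */

In other words, the fuller the block group, the higher the level it lands in.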


Re: [RFC PATCH 01/17] btrfs: priority alloc: prepare of priority aware allocator

2018-11-28 Thread Nikolay Borisov



On 28.11.18 г. 5:11 ч., Su Yue wrote:
> To implement priority aware allocator, this patch:
> Introduces struct btrfs_priority_tree which contains block groups
> in same level.
> Adds member priority to struct btrfs_block_group_cache and pointer
> points to the priority tree it's located.
> 
> Adds member priority_trees to struct btrfs_space_info to represents
> priority trees in different raid types.
> 
> Signed-off-by: Su Yue 
> ---
>  fs/btrfs/ctree.h | 24 
>  1 file changed, 24 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index e62824cae00a..5c4651d8a524 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -437,6 +437,8 @@ struct btrfs_space_info {
>   struct rw_semaphore groups_sem;
>   /* for block groups in our same type */
>   struct list_head block_groups[BTRFS_NR_RAID_TYPES];
> + /* for priority trees in our same type */
> + struct rb_root priority_trees[BTRFS_NR_RAID_TYPES];
>   wait_queue_head_t wait;
>  
>   struct kobject kobj;
> @@ -558,6 +560,21 @@ struct btrfs_full_stripe_locks_tree {
>   struct mutex lock;
>  };
>  
> +/*
> + * Tree to record all block_groups in same priority level.
> + * Only used in priority aware allocator.
> + */
> +struct btrfs_priority_tree {
> + /* protected by groups_sem */
> + struct rb_root block_groups;
> + struct rw_semaphore groups_sem;
> +
> + /* for different level priority trees in same index*/
> + struct rb_node node;
> +
> + int level;

Do you ever expect the level to be a negative number? If not then use
u8/u32 depending on the range of levels you expect.

> +};
> +
>  struct btrfs_block_group_cache {
>   struct btrfs_key key;
>   struct btrfs_block_group_item item;
> @@ -571,6 +588,8 @@ struct btrfs_block_group_cache {
>   u64 flags;
>   u64 cache_generation;
>  
> + /* It's used only when priority aware allocator is enabled. */
> + long priority;

What's the range of priorities you are expecting? Wouldn't a u8 be
sufficient? That gives us 256 priorities.

>   /*
>* If the free space extent count exceeds this number, convert the block
>* group to bitmaps.
> @@ -616,6 +635,9 @@ struct btrfs_block_group_cache {
>   /* for block groups in the same raid type */
>   struct list_head list;
>  
> + /* for block groups in the same priority level */
> + struct rb_node node;
> +
>   /* usage count */
>   atomic_t count;
>  
> @@ -670,6 +692,8 @@ struct btrfs_block_group_cache {
>  
>   /* Record locked full stripes for RAID5/6 block group */
>   struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
> +
> + struct btrfs_priority_tree *priority_tree;
>  };
>  
>  /* delayed seq elem */
> 


Re: [PATCH 5/8] btrfs: don't enospc all tickets on flush failure

2018-11-28 Thread Nikolay Borisov



On 27.11.18 г. 21:46 ч., Josef Bacik wrote:
> On Mon, Nov 26, 2018 at 02:25:52PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 21.11.18 г. 21:03 ч., Josef Bacik wrote:
>>> With the introduction of the per-inode block_rsv it became possible to
>>> have really really large reservation requests made because of data
>>> fragmentation.  Since the ticket stuff assumed that we'd always have
>>> relatively small reservation requests it just killed all tickets if we
>>> were unable to satisfy the current request.  However this is generally
>>> not the case anymore.  So fix this logic to instead see if we had a
>>> ticket that we were able to give some reservation to, and if we were
>>> continue the flushing loop again.  Likewise we make the tickets use the
>>> space_info_add_old_bytes() method of returning what reservation they did
>>> receive in hopes that it could satisfy reservations down the line.
>>
>>
>> The logic of the patch can be summarised as follows:
>>
>> If no progress is made for a ticket, then start failing all tickets until
>> the first one that has had progress made on its reservation (inclusive). In
>> this case this first ticket will be failed, but at least its space will
>> be reused via space_info_add_old_bytes.
>>
>> Frankly this seems really arbitrary.
> 
> It's not though.  The tickets are in order of who requested the reservation.
> Because we will backfill reservations for things like hugely fragmented files 
> or
> large amounts of delayed refs we can have spikes where we're trying to reserve
> 100mb's of metadata space.  We may fill 50mb of that before we run out of 
> space.
> Well so we can't satisfy that reservation, but the small 100k reservations 
> that
> are waiting to be serviced can be satisfied and they can run.  The alternative
> is you get ENOSPC and then you can turn around and touch a file no problem
> because it's a small reservation and there was room for it.  This patch 
> enables
> better behavior for the user.  Thanks,

Well, this information needs to be in the changelog since it describes the
situation where this patch is useful.

> 
> Josef
> 
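
A reduced sketch of the behaviour described above (illustrative only; the
real code lives in fs/btrfs/extent-tree.c, and the orig_bytes field is an
assumption introduced for this series, not a long-standing member):

static bool fail_tickets_until_progress(struct btrfs_fs_info *fs_info,
					struct btrfs_space_info *space_info)
{
	struct reserve_ticket *ticket;
	bool progress = false;

	while (!list_empty(&space_info->tickets)) {
		ticket = list_first_entry(&space_info->tickets,
					  struct reserve_ticket, list);
		/* a partially filled ticket means flushing made progress */
		progress = ticket->bytes < ticket->orig_bytes;

		list_del_init(&ticket->list);
		ticket->error = -ENOSPC;
		wake_up(&ticket->wait);

		if (progress) {
			/* hand the partial reservation back for smaller waiters */
			space_info_add_old_bytes(fs_info, space_info,
					ticket->orig_bytes - ticket->bytes);
			break;
		}
	}
	return progress;
}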


Re: [RFC PATCH] btrfs: drop file privileges in btrfs_clone_files

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 9:46 ч., Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 09:44:59AM +0200, Nikolay Borisov wrote:
>>
>>
>> On 28.11.18 г. 5:07 ч., Lu Fengqi wrote:
>>> The generic/513 test shows that cloning into a file does not strip security
>>> privileges (suid, capabilities) the way a regular write would.
>>>
>>> Signed-off-by: Lu Fengqi 
>>> ---
>>> The xfs and ocfs2 call generic_remap_file_range_prep to drop file
>>> privileges, I'm not sure whether btrfs should do the same thing.
>>
>> Why do you think btrfs shouldn't do the same thing? Looking at
>> remap_file_range_prep it seems that btrfs is missing a ton of checks
>> that are useful, e.g. immutable files/aligned offsets etc.
> 
> Any chance we could move btrfs over to use remap_file_range_prep so that
> all file systems share the exact same checks?

I'm not very familiar with the, Filipe is more familiar so adding to CC.
But IMO we should do that provided there are no blockers.

Filipe, what do you think, is it feasible?

> 


Re: [RFC PATCH] btrfs: drop file privileges in btrfs_clone_files

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 5:07 ч., Lu Fengqi wrote:
> The generic/513 test shows that cloning into a file does not strip security
> privileges (suid, capabilities) the way a regular write would.
> 
> Signed-off-by: Lu Fengqi 
> ---
> The xfs and ocfs2 call generic_remap_file_range_prep to drop file
> privileges, I'm not sure whether btrfs should do the same thing.

Why do you think btrfs shouldn't do the same thing? Looking at
remap_file_range_prep it seems that btrfs is missing a ton of checks
that are useful, e.g. immutable files/aligned offsets etc.


> 
> Any suggestion?
> 
>  fs/btrfs/ioctl.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 410c7e007ba8..bc33c480603b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4312,6 +4312,10 @@ static noinline int btrfs_clone_files(struct file 
> *file, struct file *file_src,
>   goto out_unlock;
>   }
>  
> + ret = file_remove_privs(file);
> + if (ret)
> + goto out_unlock;
> +
>   if (destoff > inode->i_size) {
>   ret = btrfs_cont_expand(inode, inode->i_size, destoff);
>   if (ret)
> 
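
For reference, a small userspace sketch of what generic/513 exercises (the
file paths and mount point are made up; this is not the fstests code).
Without a file_remove_privs() call like the one added above, the suid bit
survives the clone:

#include <fcntl.h>
#include <linux/fs.h>		/* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int src = open("/mnt/btrfs/src", O_RDWR | O_CREAT, 0644);
	int dst = open("/mnt/btrfs/dst", O_RDWR | O_CREAT, 0644);
	struct stat st;

	if (src < 0 || dst < 0)
		return 1;
	/* give the source some data and make the destination setuid */
	if (pwrite(src, "x", 1, 0) != 1)
		return 1;
	if (fchmod(dst, 04755) < 0)
		return 1;

	/* whole-file reflink; on btrfs this goes through btrfs_clone_files() */
	if (ioctl(dst, FICLONE, src) < 0) {
		perror("FICLONE");
		return 1;
	}

	if (fstat(dst, &st) < 0)
		return 1;
	printf("suid %s after clone\n",
	       (st.st_mode & S_ISUID) ? "still set" : "cleared");
	return 0;
}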


Re: [PATCH] btrfs: adjust order of unlocks in do_trimming()

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 5:21 ч., Su Yue wrote:
> In function do_trimming(), block_group->lock should be unlocked first.
> 
> Fixes: 7fe1e6415026 ("Btrfs: rewrite btrfs_trim_block_group()")
> Signed-off-by: Su Yue 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/free-space-cache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 4ba0aedc878b..3731fa92df56 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -3149,8 +3149,8 @@ static int do_trimming(struct btrfs_block_group_cache 
> *block_group,
>   space_info->bytes_readonly += reserved_bytes;
>   block_group->reserved -= reserved_bytes;
>   space_info->bytes_reserved -= reserved_bytes;
> - spin_unlock(&space_info->lock);
>   spin_unlock(_group->lock);
> + spin_unlock(&space_info->lock);
>   }
>  
>   return ret;
> 
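
The rule being applied, in isolation (purely illustrative): locks taken in
nested order are released in the reverse, LIFO order.

	spin_lock(&space_info->lock);		/* outer lock, taken first */
	spin_lock(&block_group->lock);		/* inner lock, taken second */
	/* ... adjust the counters under both locks ... */
	spin_unlock(&block_group->lock);	/* inner lock released first */
	spin_unlock(&space_info->lock);		/* outer lock released last */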


Re: [PATCH 3/3] btrfs: remove redundant nowait check for buffered_write

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 5:23 ч., Lu Fengqi wrote:
> The generic_write_checks will check the combination of IOCB_NOWAIT and
> !IOCB_DIRECT.

True, however btrfs will return EOPNOTSUPP whereas the generic code returns
EINVAL. I guess this is not a big deal and it's likely the generic code is
correct, so:

Reviewed-by: Nikolay Borisov 

> 
> Signed-off-by: Lu Fengqi 
> ---
>  fs/btrfs/file.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 3835bb8c146d..190db9a685a2 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1889,10 +1889,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> *iocb,
>   loff_t oldsize;
>   int clean_page = 0;
>  
> - if (!(iocb->ki_flags & IOCB_DIRECT) &&
> - (iocb->ki_flags & IOCB_NOWAIT))
> - return -EOPNOTSUPP;
> -
>   if (!inode_trylock(inode)) {
>   if (iocb->ki_flags & IOCB_NOWAIT)
>   return -EAGAIN;
> 
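
For comparison, the check the generic path already performs looks roughly
like this (paraphrased from generic_write_checks() of that time, not a
verbatim copy), returning EINVAL where the removed btrfs code returned
EOPNOTSUPP:

	/* buffered (non-direct) writes cannot honour RWF_NOWAIT */
	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
		return -EINVAL;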


Re: [PATCH 1/3] btrfs: remove always true if branch in find_delalloc_range

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 5:21 ч., Lu Fengqi wrote:
> The @found is always false when it comes to the if branch. Besides, the
> bool type is more suitable for @found.

Well, if you are changing the type of the found variable, it also makes sense
to change the return type of the function to bool as well.

> 
> Signed-off-by: Lu Fengqi 
> ---
>  fs/btrfs/extent_io.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 582b4b1c41e0..b4ee3399be96 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1461,7 +1461,7 @@ static noinline u64 find_delalloc_range(struct 
> extent_io_tree *tree,
>   struct rb_node *node;
>   struct extent_state *state;
>   u64 cur_start = *start;
> - u64 found = 0;
> + bool found = false;
>   u64 total_bytes = 0;
>  
>   spin_lock(&tree->lock);
> @@ -1472,8 +1472,7 @@ static noinline u64 find_delalloc_range(struct 
> extent_io_tree *tree,
>*/
>   node = tree_search(tree, cur_start);
>   if (!node) {
> - if (!found)
> - *end = (u64)-1;
> + *end = (u64)-1;
>   goto out;
>   }
>  
> @@ -1493,7 +1492,7 @@ static noinline u64 find_delalloc_range(struct 
> extent_io_tree *tree,
>   *cached_state = state;
>   refcount_inc(&state->refs);
>   }
> - found++;
> + found = true;
>   *end = state->end;
>   cur_start = state->end + 1;
>   node = rb_next(node);
> 


Re: [PATCH 2/3] btrfs: cleanup the useless DEFINE_WAIT in cleanup_transaction

2018-11-27 Thread Nikolay Borisov



On 28.11.18 г. 5:22 ч., Lu Fengqi wrote:
> This DEFINE_WAIT has been useless ever since it was introduced in commit
> f094ac32aba3 ("Btrfs: fix NULL pointer after aborting a transaction").
> 
> Signed-off-by: Lu Fengqi 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/transaction.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index f92c0a88c4ad..67e84939b758 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1840,7 +1840,6 @@ static void cleanup_transaction(struct 
> btrfs_trans_handle *trans, int err)
>  {
>   struct btrfs_fs_info *fs_info = trans->fs_info;
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - DEFINE_WAIT(wait);
>  
>   WARN_ON(refcount_read(>use_count) > 1);
>  
> 


Re: [PATCH 3/3] btrfs: document extent mapping assumptions in checksum

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 21:08 ч., Noah Massey wrote:
> On Tue, Nov 27, 2018 at 11:43 AM Nikolay Borisov  wrote:
>>
>> On 27.11.18 г. 18:00 ч., Johannes Thumshirn wrote:
>>> Document why map_private_extent_buffer() cannot return '1' (i.e. the map
>>> spans two pages) for the csum_tree_block() case.
>>>
>>> The current algorithm for detecting a page boundary crossing in
>>> map_private_extent_buffer() will return a '1' *IFF* the product of the
>>
>> I think the word product must be replaced with 'sum', since product
>> implies multiplication :)
>>
> 
> doesn't 'sum' imply addition? How about 'output'?

It does, and the code indeed sums the values rather than multiplying them,
hence my suggestion.

> 
>>> extent buffer's offset in the page + the offset passed in by
>>> csum_tree_block() and the minimal length passed in by csum_tree_block() - 1
>>> are bigger than PAGE_SIZE.
>>>
>>> We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
>>> and the current extent buffer allocator always guarantees page aligned
>>> extends, so the above condition can't be true.
>>>
>>> Signed-off-by: Johannes Thumshirn 
>>
>> With that wording changed:
>>
>> Reviewed-by: Nikolay Borisov 
>>
>>> ---
>>>  fs/btrfs/disk-io.c | 6 ++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 4bc270ef29b4..14d355d0cb7a 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -279,6 +279,12 @@ static int csum_tree_block(struct btrfs_fs_info 
>>> *fs_info,
>>>
>>>   len = buf->len - offset;
>>>   while (len > 0) {
>>> + /*
>>> +  * Note: we don't need to check for the err == 1 case here, as
>>> +  * with the given combination of 'start = BTRFS_CSUM_SIZE 
>>> (32)'
>>> +  * and 'min_len = 32' and the currently implemented mapping
>>> +  * algorithm we cannot cross a page boundary.
>>> +  */
>>>   err = map_private_extent_buffer(buf, offset, 32,
>>>   &kaddr, &map_start, &map_len);
>>>   if (err)
>>>
> 
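
A worked instance of the condition being discussed, using the values from
csum_tree_block() and assuming 4 KiB pages and the page-aligned extent
buffers the allocator guarantees:

	offset_in_page(eb->start) + start + min_len - 1
		= 0 + 32 + 32 - 1
		= 63  <  PAGE_SIZE (4096)

Both the first and the last byte of the requested mapping therefore land in
the same page, so map_private_extent_buffer() can never return 1 for this
caller.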


[PATCH] btrfs: Refactor btrfs_merge_bio_hook

2018-11-27 Thread Nikolay Borisov
This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropriate
name - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was
never used, so just remove it. Thirdly, pages are submitted to either
btree or data inodes, so it's guaranteed that tree->ops is set; replace
the check with an ASSERT. Finally, document the parameters of the
function. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/compression.c |  7 ---
 fs/btrfs/ctree.h   |  5 ++---
 fs/btrfs/extent_io.c   |  4 ++--
 fs/btrfs/inode.c   | 19 ---
 4 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 34d50bc5c10d..dba59ae914b8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -332,7 +332,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
*inode, u64 start,
page = compressed_pages[pg_index];
page->mapping = inode->i_mapping;
if (bio->bi_iter.bi_size)
-   submit = btrfs_merge_bio_hook(page, 0, PAGE_SIZE, bio, 
0);
+   submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
+ 0);
 
page->mapping = NULL;
if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
@@ -610,8 +611,8 @@ blk_status_t btrfs_submit_compressed_read(struct inode 
*inode, struct bio *bio,
page->index = em_start >> PAGE_SHIFT;
 
if (comp_bio->bi_iter.bi_size)
-   submit = btrfs_merge_bio_hook(page, 0, PAGE_SIZE,
-   comp_bio, 0);
+   submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE,
+ comp_bio, 0);
 
page->mapping = NULL;
if (submit || bio_add_page(comp_bio, page, PAGE_SIZE, 0) <
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a98507fa9192..791112a82775 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3191,9 +3191,8 @@ void btrfs_merge_delalloc_extent(struct inode *inode, 
struct extent_state *new,
 struct extent_state *other);
 void btrfs_split_delalloc_extent(struct inode *inode,
 struct extent_state *orig, u64 split);
-int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
-size_t size, struct bio *bio,
-unsigned long bio_flags);
+int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
+unsigned long bio_flags);
 void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 
end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 582b4b1c41e0..19f4b8fd654f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2759,8 +2759,8 @@ static int submit_extent_page(unsigned int opf, struct 
extent_io_tree *tree,
else
contig = bio_end_sector(bio) == sector;
 
-   if (tree->ops && btrfs_merge_bio_hook(page, offset, page_size,
- bio, bio_flags))
+   ASSERT(tree->ops);
+   if (btrfs_bio_fits_in_stripe(page, page_size, bio, bio_flags))
can_merge = false;
 
if (prev_bio_flags != bio_flags || !contig || !can_merge ||
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index be7d43c42779..11c533db2db7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1881,16 +1881,21 @@ void btrfs_clear_delalloc_extent(struct inode 
*vfs_inode,
 }
 
 /*
- * Merge bio hook, this must check the chunk tree to make sure we don't create
- * bios that span stripes or chunks
+ * btrfs_bio_fits_in_stripe - Checks whether the size of the given bio will fit
+ * in a chunk's stripe. This function ensures that bios do not span a
+ * stripe/chunk
  *
- * return 1 if page cannot be merged to bio
- * return 0 if page can be merged to bio
+ * @page - The page we are about to add to the bio
+ * @size - size we want to add to the bio
+ * @bio - bio we want to ensure is smaller than a stripe
+ * @bio_flags - flags of the bio
+ *
+ * return 1 if page cannot be added to the bio
+ * return 0 if page can be added to the bio
  * return error otherwise
  */
-int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
-size_t size, struct bio *bio,
-unsigned long bio_flags)
+int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
+unsigned long bio_flags)
 {
struct inode *inode = page->mapping-&g
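
The function body is cut off in the archive above; conceptually the check
looks like the sketch below (paraphrased, the exact btrfs_map_block() usage
and the BTRFS_MAP_WRITE op are assumptions):

static int bio_fits_in_stripe_sketch(struct btrfs_fs_info *fs_info,
				     struct bio *bio, size_t size)
{
	u64 logical = (u64)bio->bi_iter.bi_sector << 9;	/* sectors to bytes */
	u64 length = bio->bi_iter.bi_size;
	u64 map_length = length;
	int ret;

	/* ask the chunk mapping how far the current stripe extends */
	ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical,
			      &map_length, NULL, 0);
	if (ret < 0)
		return ret;

	/* 1 means adding 'size' more bytes would straddle the stripe */
	return map_length < length + size;
}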

Re: [PATCH 3/3] btrfs: document extent mapping assumptions in checksum

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 18:00 ч., Johannes Thumshirn wrote:
> Document why map_private_extent_buffer() cannot return '1' (i.e. the map
> spans two pages) for the csum_tree_block() case.
> 
> The current algorithm for detecting a page boundary crossing in
> map_private_extent_buffer() will return a '1' *IFF* the product of the

I think the word product must be replaced with 'sum', since product
implies multiplication :)

> extent buffer's offset in the page + the offset passed in by
> csum_tree_block() and the minimal length passed in by csum_tree_block() - 1
> are bigger than PAGE_SIZE.
> 
> We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
> and the current extent buffer allocator always guarantees page aligned
> extends, so the above condition can't be true.
> 
> Signed-off-by: Johannes Thumshirn 

With that wording changed:

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/disk-io.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 4bc270ef29b4..14d355d0cb7a 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -279,6 +279,12 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
>  
>   len = buf->len - offset;
>   while (len > 0) {
> + /*
> +  * Note: we don't need to check for the err == 1 case here, as
> +  * with the given combination of 'start = BTRFS_CSUM_SIZE (32)'
> +  * and 'min_len = 32' and the currently implemented mapping
> +  * algorithm we cannot cross a page boundary.
> +  */
>   err = map_private_extent_buffer(buf, offset, 32,
>   &kaddr, &map_start, &map_len);
>   if (err)
> 


Re: [PATCH 2/3] btrfs: use offset_in_page for start_offset in map_private_extent_buffer()

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 18:00 ч., Johannes Thumshirn wrote:
> In map_private_extent_buffer() use offset_in_page() to initialize
> 'start_offset' instead of open-coding it.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent_io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 7aafdec49dc3..85cd3975c680 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5383,7 +5383,7 @@ int map_private_extent_buffer(const struct 
> extent_buffer *eb,
>   size_t offset;
>   char *kaddr;
>   struct page *p;
> - size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
> + size_t start_offset = offset_in_page(eb->start);
>   unsigned long i = (start_offset + start) >> PAGE_SHIFT;
>   unsigned long end_i = (start_offset + start + min_len - 1) >>
>   PAGE_SHIFT;
> 


Re: [PATCH 1/3] btrfs: don't initialize 'offset' in map_private_extent_buffer()

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 18:00 ч., Johannes Thumshirn wrote:
> In map_private_extent_buffer() the 'offset' variable is initialized to a
> page aligned version of the 'start' parameter.
> 
> But later on it is overwritten with either the offset from the extent
> buffer's start or 0.
> 
> So get rid of the initial initialization.
> 
> Signed-off-by: Johannes Thumshirn 

You know, the fastest/cleanest code is the code which is deleted, so:

Reviewed-by: Nikolay Borisov 


> ---
>  fs/btrfs/extent_io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 582b4b1c41e0..7aafdec49dc3 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5380,7 +5380,7 @@ int map_private_extent_buffer(const struct 
> extent_buffer *eb,
> char **map, unsigned long *map_start,
> unsigned long *map_len)
>  {
> - size_t offset = start & (PAGE_SIZE - 1);
> + size_t offset;
>   char *kaddr;
>   struct page *p;
>   size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
> 


Re: [PATCH] Proposal for more detail in scrub doc

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 16:14 ч., Andrea Gelmini wrote:
> Wise words from Qu:
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg82557.html
> ---
>  Documentation/btrfs-scrub.asciidoc | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/btrfs-scrub.asciidoc 
> b/Documentation/btrfs-scrub.asciidoc
> index 4c49269..1fc085c 100644
> --- a/Documentation/btrfs-scrub.asciidoc
> +++ b/Documentation/btrfs-scrub.asciidoc
> @@ -16,7 +16,8 @@ and metadata blocks from all devices and verify checksums. 
> Automatically repair
>  corrupted blocks if there's a correct copy available.
>  
>  NOTE: Scrub is not a filesystem checker (fsck) and does not verify nor repair
> -structural damage in the filesystem.
> +structural damage in the filesystem. It only checks csum of data and tree 
> blocks,
> +it doesn't ensure the content of tree blocks are OK.

I would rephrase this as:

"It only ensures that the checksum of given data/metadata blocks match
but doesn't guarantee that the contents themselves are consistent"

It sounds a bit more formal which I think is more appropriate for a man
page.


>  
>  The user is supposed to run it manually or via a periodic system service. The
>  recommended period is a month but could be less. The estimated device 
> bandwidth
> 


Re: [PATCH] Fix typos in docs (second try...)

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 15:56 ч., Andrea Gelmini wrote:
> Signed-off-by: Andrea Gelmini 

Reviewed-by: Nikolay Borisov 

> ---
>  Documentation/DocConventions  | 4 ++--
>  Documentation/ReleaseChecklist| 4 ++--
>  Documentation/btrfs-man5.asciidoc | 8 
>  Documentation/btrfs-property.asciidoc | 2 +-
>  Documentation/btrfs-qgroup.asciidoc   | 4 ++--
>  Documentation/mkfs.btrfs.asciidoc | 2 +-
>  6 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/DocConventions b/Documentation/DocConventions
> index e84ed7a..969209c 100644
> --- a/Documentation/DocConventions
> +++ b/Documentation/DocConventions
> @@ -23,7 +23,7 @@ Quotation in subcommands:
>  - command reference: bold *btrfs fi show*
>  - section references: italics 'EXAMPLES'
>  
> -- argument name in option desciption: caps in angle brackets 
> +- argument name in option description: caps in angle brackets 
>- reference in help text: caps NAME
>  also possible: caps italics 'NAME'
>  
> @@ -34,6 +34,6 @@ Quotation in subcommands:
>- optional parameter with argument: [-p ]
>  
>  
> -Refrences:
> +References:
>  - full asciidoc syntax: http://asciidoc.org/userguide.html
>  - cheatsheet: http://powerman.name/doc/asciidoc
> diff --git a/Documentation/ReleaseChecklist b/Documentation/ReleaseChecklist
> index d8bf50c..ebe251d 100644
> --- a/Documentation/ReleaseChecklist
> +++ b/Documentation/ReleaseChecklist
> @@ -4,7 +4,7 @@ Release checklist
>  Last code touches:
>  
>  * make the code ready, collect patches queued for the release
> -* look to mailinglist for any relevant last-minute fixes
> +* look to mailing list for any relevant last-minute fixes
>  
>  Pre-checks:
>  
> @@ -31,7 +31,7 @@ Release:
>  
>  Post-release:
>  
> -* write and send announcement mail to mailinglist
> +* write and send announcement mail to mailing list
>  * update wiki://Main_page#News
>  * update wiki://Changelog#btrfs-progs
>  * update title on IRC
> diff --git a/Documentation/btrfs-man5.asciidoc 
> b/Documentation/btrfs-man5.asciidoc
> index 448710a..4a269e2 100644
> --- a/Documentation/btrfs-man5.asciidoc
> +++ b/Documentation/btrfs-man5.asciidoc
> @@ -156,7 +156,7 @@ under 'nodatacow' are also set the NOCOW file attribute 
> (see `chattr`(1)).
>  NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
>  +
>  Updates in-place improve performance for workloads that do frequent 
> overwrites,
> -at the cost of potential partial writes, in case the write is interruted
> +at the cost of potential partial writes, in case the write is interrupted
>  (system crash, device failure).
>  
>  *datasum*::
> @@ -171,7 +171,7 @@ corresponding file attribute (see `chattr`(1)).
>  NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
>  +
>  There is a slight performance gain when checksums are turned off, the
> -correspoinding metadata blocks holding the checksums do not need to updated.
> +corresponding metadata blocks holding the checksums do not need to be 
> updated.
>  The cost of checksumming of the blocks in memory is much lower than the IO,
>  modern CPUs feature hardware support of the checksumming algorithm.
>  
> @@ -185,7 +185,7 @@ missing, for example if a stripe member is completely 
> missing from RAID0.
>  Since 4.14, the constraint checks have been improved and are verified on the
>  chunk level, not an the device level. This allows degraded mounts of
>  filesystems with mixed RAID profiles for data and metadata, even if the
> -device number constraints would not be satisfied for some of the prifles.
> +device number constraints would not be satisfied for some of the profiles.
>  +
>  Example: metadata -- raid1, data -- single, devices -- /dev/sda, /dev/sdb
>  +
> @@ -649,7 +649,7 @@ inherent limit of btrfs is 2^64^ (16 EiB) but the linux 
> VFS limit is 2^63^ (8 Ei
>  
>  maximum number of subvolumes::
>  2^64^ but depends on the available metadata space, the space consumed by all
> -subvolume metadata includes bookeeping of the shared extents can be large 
> (MiB,
> +subvolume metadata includes bookkeeping of the shared extents can be large 
> (MiB,
>  GiB)
>  
>  maximum number of hardlinks of a file in a directory::
> diff --git a/Documentation/btrfs-property.asciidoc 
> b/Documentation/btrfs-property.asciidoc
> index b562717..4bad88b 100644
> --- a/Documentation/btrfs-property.asciidoc
> +++ b/Documentation/btrfs-property.asciidoc
> @@ -34,7 +34,7 @@ specify what type of object you meant. This is only needed 
> when a
>  property could be set for more then one object type.
>  +
>  Possible types are 's[ubvol]', 'f[ilesystem]', 'i[

Re: [PATCH] Fix typos in docs

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 15:37 ч., Andrea Gelmini wrote:
> Signed-off-by: Andrea Gelmini 
> ---
>  Documentation/DocConventions  | 4 ++--
>  Documentation/ReleaseChecklist| 4 ++--
>  Documentation/btrfs-man5.asciidoc | 8 
>  Documentation/btrfs-property.asciidoc | 2 +-
>  Documentation/btrfs-qgroup.asciidoc   | 4 ++--
>  Documentation/mkfs.btrfs.asciidoc | 2 +-
>  6 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/DocConventions b/Documentation/DocConventions
> index e84ed7a..969209c 100644
> --- a/Documentation/DocConventions
> +++ b/Documentation/DocConventions
> @@ -23,7 +23,7 @@ Quotation in subcommands:
>  - command reference: bold *btrfs fi show*
>  - section references: italics 'EXAMPLES'
>  
> -- argument name in option desciption: caps in angle brackets 
> +- argument name in option description: caps in angle brackets 
>- reference in help text: caps NAME
>  also possible: caps italics 'NAME'
>  
> @@ -34,6 +34,6 @@ Quotation in subcommands:
>- optional parameter with argument: [-p ]
>  
>  
> -Refrences:
> +References:
>  - full asciidoc syntax: http://asciidoc.org/userguide.html
>  - cheatsheet: http://powerman.name/doc/asciidoc
> diff --git a/Documentation/ReleaseChecklist b/Documentation/ReleaseChecklist
> index d8bf50c..7fdbddf 100644
> --- a/Documentation/ReleaseChecklist
> +++ b/Documentation/ReleaseChecklist
> @@ -4,7 +4,7 @@ Release checklist
>  Last code touches:
>  
>  * make the code ready, collect patches queued for the release
> -* look to mailinglist for any relevant last-minute fixes
> +* look to mailin glist for any relevant last-minute fixes

Wrong position of space, should be mailing list
>  
>  Pre-checks:
>  
> @@ -31,7 +31,7 @@ Release:
>  
>  Post-release:
>  
> -* write and send announcement mail to mailinglist
> +* write and send announcement mail to mailing list
>  * update wiki://Main_page#News
>  * update wiki://Changelog#btrfs-progs
>  * update title on IRC
> diff --git a/Documentation/btrfs-man5.asciidoc 
> b/Documentation/btrfs-man5.asciidoc
> index 448710a..c358cef 100644
> --- a/Documentation/btrfs-man5.asciidoc
> +++ b/Documentation/btrfs-man5.asciidoc
> @@ -156,7 +156,7 @@ under 'nodatacow' are also set the NOCOW file attribute 
> (see `chattr`(1)).
>  NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
>  +
>  Updates in-place improve performance for workloads that do frequent 
> overwrites,
> -at the cost of potential partial writes, in case the write is interruted
> +at the cost of potential partial writes, in case the write is interrupted
>  (system crash, device failure).
>  
>  *datasum*::
> @@ -171,7 +171,7 @@ corresponding file attribute (see `chattr`(1)).
>  NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
>  +
>  There is a slight performance gain when checksums are turned off, the
> -correspoinding metadata blocks holding the checksums do not need to updated.
> +corresponding metadata blocks holding the checksums do not need to updated.

You've missed one grammatical error - the end of the sentence should
say "do not need to be updated".

>  The cost of checksumming of the blocks in memory is much lower than the IO,
>  modern CPUs feature hardware support of the checksumming algorithm.
>  




Re: [PATCH v2 1/5] btrfs-progs: image: Refactor fixup_devices() to fixup_chunks_and_devices()

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 10:50 ч., Qu Wenruo wrote:
> 
> 
> On 2018/11/27 下午4:46, Nikolay Borisov wrote:
>>
>>
>> On 27.11.18 г. 10:38 ч., Qu Wenruo wrote:
>>> Current fixup_devices() will only remove DEV_ITEMs and reset DEV_ITEM
>>> size.
>>> Later we need to do more fixup work, so change the name to
>>> fixup_chunks_and_devices() and refactor the original device size fixup
>>> to fixup_device_size().
>>>
>>> Signed-off-by: Qu Wenruo 
>>
>> Reviewed-by: Nikolay Borisov 
>>
>> However, one minor nit below.
>>
>>> ---
>>>  image/main.c | 52 
>>>  1 file changed, 36 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/image/main.c b/image/main.c
>>> index c680ab19de6c..bbfcf8f19298 100644
>>> --- a/image/main.c
>>> +++ b/image/main.c
>>> @@ -2084,28 +2084,19 @@ static void remap_overlapping_chunks(struct 
>>> mdrestore_struct *mdres)
>>> }
>>>  }
>>>  
>>> -static int fixup_devices(struct btrfs_fs_info *fs_info,
>>> -struct mdrestore_struct *mdres, off_t dev_size)
>>> +static int fixup_device_size(struct btrfs_trans_handle *trans,
>>> +struct mdrestore_struct *mdres,
>>> +off_t dev_size)
>>>  {
>>> -   struct btrfs_trans_handle *trans;
>>> +   struct btrfs_fs_info *fs_info = trans->fs_info;
>>> struct btrfs_dev_item *dev_item;
>>> struct btrfs_path path;
>>> -   struct extent_buffer *leaf;
>>> struct btrfs_root *root = fs_info->chunk_root;
>>> struct btrfs_key key;
>>> +   struct extent_buffer *leaf;
>>
>> nit: Unnecessary change
> 
> Doesn't it look better when all btrfs_ prefix get batched together? :)

Didn't even realize this was the intended effect. IMO it doesn't make much
difference; what does, though, is probably reverse christmas tree ordering, i.e.

longer variable names
come before shorter
ones

But I guess this is a matter of taste, no need to resend.

> 
> Thanks,
> Qu
> 
>>
>>> u64 devid, cur_devid;
>>> int ret;
>>>  
>>> -   if (btrfs_super_log_root(fs_info->super_copy)) {
>>> -   warning(
>>> -   "log tree detected, its generation will not match superblock");
>>> -   }
>>> -   trans = btrfs_start_transaction(fs_info->tree_root, 1);
>>> -   if (IS_ERR(trans)) {
>>> -   error("cannot starting transaction %ld", PTR_ERR(trans));
>>> -   return PTR_ERR(trans);
>>> -   }
>>> -
>>> dev_item = &fs_info->super_copy->dev_item;
>>>  
>>> devid = btrfs_stack_device_id(dev_item);
>>> @@ -2123,7 +2114,7 @@ again:
>>> ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
>>> if (ret < 0) {
>>> error("search failed: %d", ret);
>>> -   exit(1);
>>> +   return ret;
>>> }
>>>  
>>> while (1) {
>>> @@ -2170,12 +2161,41 @@ again:
>>> }
>>>  
>>> btrfs_release_path(&path);
>>> +   return 0;
>>> +}
>>> +
>>> +static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
>>> +struct mdrestore_struct *mdres, off_t dev_size)
>>> +{
>>> +   struct btrfs_trans_handle *trans;
>>> +   int ret;
>>> +
>>> +   if (btrfs_super_log_root(fs_info->super_copy)) {
>>> +   warning(
>>> +   "log tree detected, its generation will not match superblock");
>>> +   }
>>> +   trans = btrfs_start_transaction(fs_info->tree_root, 1);
>>> +   if (IS_ERR(trans)) {
>>> +   error("cannot starting transaction %ld", PTR_ERR(trans));
>>> +   return PTR_ERR(trans);
>>> +   }
>>> +
>>> +   ret = fixup_device_size(trans, mdres, dev_size);
>>> +   if (ret < 0)
>>> +   goto error;
>>> +
>>> ret = btrfs_commit_transaction(trans, fs_info->tree_root);
>>> if (ret) {
>>> error("unable to commit transaction: %d", ret);
>>> return ret;
>>> }
>>> return 0;
>>> +error:
>>> +   error(
>>> +"failed to fix chunks and devices mapping, the fs may not be mountable: 
>>> %s",
>>> +   strerror(-ret));
>>> +   btrfs_abort_transaction(trans, ret);
>>> +   return ret;
>>>  }
>>>  
>>>  static int restore_metadump(const char *input, FILE *out, int old_restore,
>>> @@ -2282,7 +2302,7 @@ static int restore_metadump(const char *input, FILE 
>>> *out, int old_restore,
>>> return 1;
>>> }
>>>  
>>> -   ret = fixup_devices(info, , st.st_size);
>>> +   ret = fixup_chunks_and_devices(info, , st.st_size);
>>> close_ctree(info->chunk_root);
>>> if (ret)
>>> goto out;
>>>
> 


Re: [PATCH v2 5/5] btrfs-progs: misc-tests/021: Do extra btrfs check before mounting

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 10:38 ч., Qu Wenruo wrote:
> Test case misc/021 tests whether we can mount a single disk btrfs
> image recovered from a multi disk fs.
> 
> The problem is, the current kernel has extra checks for block groups, chunks
> and dev extents.
> This means any image that can't pass btrfs check on its chunk tree will not
> mount.
> 
> So do extra btrfs check before mount, this will also help us to locate
> the problem in btrfs-image easier.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  tests/misc-tests/021-image-multi-devices/test.sh | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/tests/misc-tests/021-image-multi-devices/test.sh 
> b/tests/misc-tests/021-image-multi-devices/test.sh
> index 5430847f4e2f..26beae6e4b85 100755
> --- a/tests/misc-tests/021-image-multi-devices/test.sh
> +++ b/tests/misc-tests/021-image-multi-devices/test.sh
> @@ -37,6 +37,9 @@ run_check $SUDO_HELPER wipefs -a "$loop2"
>  
>  run_check $SUDO_HELPER "$TOP/btrfs-image" -r "$IMAGE" "$loop1"
>  
> +# Run check to make sure there is nothing wrong for the recovered image
> +run_check "$TOP/btrfs" check "$loop1"
> +
>  run_check $SUDO_HELPER mount "$loop1" "$TEST_MNT"
>  new_md5=$(run_check_stdout md5sum "$TEST_MNT/foobar" | cut -d ' ' -f 1)
>  run_check $SUDO_HELPER umount "$TEST_MNT"
> 


Re: [PATCH v2 1/5] btrfs-progs: image: Refactor fixup_devices() to fixup_chunks_and_devices()

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 10:38 ч., Qu Wenruo wrote:
> Current fixup_devices() will only remove DEV_ITEMs and reset DEV_ITEM
> size.
> Later we need to do more fixup work, so change the name to
> fixup_chunks_and_devices() and refactor the original device size fixup
> to fixup_device_size().
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

However, one minor nit below.

> ---
>  image/main.c | 52 
>  1 file changed, 36 insertions(+), 16 deletions(-)
> 
> diff --git a/image/main.c b/image/main.c
> index c680ab19de6c..bbfcf8f19298 100644
> --- a/image/main.c
> +++ b/image/main.c
> @@ -2084,28 +2084,19 @@ static void remap_overlapping_chunks(struct 
> mdrestore_struct *mdres)
>   }
>  }
>  
> -static int fixup_devices(struct btrfs_fs_info *fs_info,
> -  struct mdrestore_struct *mdres, off_t dev_size)
> +static int fixup_device_size(struct btrfs_trans_handle *trans,
> +  struct mdrestore_struct *mdres,
> +  off_t dev_size)
>  {
> - struct btrfs_trans_handle *trans;
> + struct btrfs_fs_info *fs_info = trans->fs_info;
>   struct btrfs_dev_item *dev_item;
>   struct btrfs_path path;
> - struct extent_buffer *leaf;
>   struct btrfs_root *root = fs_info->chunk_root;
>   struct btrfs_key key;
> + struct extent_buffer *leaf;

nit: Unnecessary change

>   u64 devid, cur_devid;
>   int ret;
>  
> - if (btrfs_super_log_root(fs_info->super_copy)) {
> - warning(
> - "log tree detected, its generation will not match superblock");
> - }
> - trans = btrfs_start_transaction(fs_info->tree_root, 1);
> - if (IS_ERR(trans)) {
> - error("cannot starting transaction %ld", PTR_ERR(trans));
> - return PTR_ERR(trans);
> - }
> -
>   dev_item = &fs_info->super_copy->dev_item;
>  
>   devid = btrfs_stack_device_id(dev_item);
> @@ -2123,7 +2114,7 @@ again:
>   ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
>   if (ret < 0) {
>   error("search failed: %d", ret);
> - exit(1);
> + return ret;
>   }
>  
>   while (1) {
> @@ -2170,12 +2161,41 @@ again:
>   }
>  
>   btrfs_release_path(&path);
> + return 0;
> +}
> +
> +static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
> +  struct mdrestore_struct *mdres, off_t dev_size)
> +{
> + struct btrfs_trans_handle *trans;
> + int ret;
> +
> + if (btrfs_super_log_root(fs_info->super_copy)) {
> + warning(
> + "log tree detected, its generation will not match superblock");
> + }
> + trans = btrfs_start_transaction(fs_info->tree_root, 1);
> + if (IS_ERR(trans)) {
> + error("cannot starting transaction %ld", PTR_ERR(trans));
> + return PTR_ERR(trans);
> + }
> +
> + ret = fixup_device_size(trans, mdres, dev_size);
> + if (ret < 0)
> + goto error;
> +
>   ret = btrfs_commit_transaction(trans, fs_info->tree_root);
>   if (ret) {
>   error("unable to commit transaction: %d", ret);
>   return ret;
>   }
>   return 0;
> +error:
> + error(
> +"failed to fix chunks and devices mapping, the fs may not be mountable: %s",
> + strerror(-ret));
> + btrfs_abort_transaction(trans, ret);
> + return ret;
>  }
>  
>  static int restore_metadump(const char *input, FILE *out, int old_restore,
> @@ -2282,7 +2302,7 @@ static int restore_metadump(const char *input, FILE 
> *out, int old_restore,
>   return 1;
>   }
>  
> - ret = fixup_devices(info, , st.st_size);
> + ret = fixup_chunks_and_devices(info, , st.st_size);
>   close_ctree(info->chunk_root);
>   if (ret)
>   goto out;
> 


Re: [PATCH v2 3/5] btrfs-progs: volumes: Refactor btrfs_alloc_dev_extent() into two functions

2018-11-27 Thread Nikolay Borisov



On 27.11.18 г. 10:38 ч., Qu Wenruo wrote:
> We have btrfs_alloc_dev_extent() accepting a @convert flag to toggle
> special handling for convert.
> 
> However, that @convert flag only determines whether we call
> find_free_dev_extent(), and we may later need to insert dev extents
> without searching the dev tree.
> 
> So refactor btrfs_alloc_dev_extent() into 2 functions,
> btrfs_alloc_dev_extent(), which will try to find free dev extent, and
> btrfs_insert_dev_extent(), which will just insert a dev extent.
> 
> For implementation, btrfs_alloc_dev_extent() will call
> btrfs_insert_dev_extent() anyway, so no duplicated code.
> 
> This removes the need of @convert parameter, and make
> btrfs_insert_dev_extent() public for later usage.
> 
> Signed-off-by: Qu Wenruo 

This looks much better:

Reviewed-by: Nikolay Borisov 

> ---
>  volumes.c | 48 ++--
>  volumes.h |  3 +++
>  2 files changed, 33 insertions(+), 18 deletions(-)
> 
> diff --git a/volumes.c b/volumes.c
> index 30090ce5f8e8..0dd082cd1718 100644
> --- a/volumes.c
> +++ b/volumes.c
> @@ -530,10 +530,12 @@ static int find_free_dev_extent(struct btrfs_device 
> *device, u64 num_bytes,
>   return find_free_dev_extent_start(device, num_bytes, 0, start, len);
>  }
>  
> -static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
> -   struct btrfs_device *device,
> -   u64 chunk_offset, u64 num_bytes, u64 *start,
> -   int convert)
> +/*
> + * Insert one device extent into the fs.
> + */
> +int btrfs_insert_dev_extent(struct btrfs_trans_handle *trans,
> + struct btrfs_device *device,
> + u64 chunk_offset, u64 num_bytes, u64 start)
>  {
>   int ret;
>   struct btrfs_path *path;
> @@ -546,18 +548,8 @@ static int btrfs_alloc_dev_extent(struct 
> btrfs_trans_handle *trans,
>   if (!path)
>   return -ENOMEM;
>  
> - /*
> -  * For convert case, just skip search free dev_extent, as caller
> -  * is responsible to make sure it's free.
> -  */
> - if (!convert) {
> - ret = find_free_dev_extent(device, num_bytes, start, NULL);
> - if (ret)
> - goto err;
> - }
> -
>   key.objectid = device->devid;
> - key.offset = *start;
> + key.offset = start;
>   key.type = BTRFS_DEV_EXTENT_KEY;
>   ret = btrfs_insert_empty_item(trans, root, path, &key,
> sizeof(*extent));
> @@ -583,6 +575,22 @@ err:
>   return ret;
>  }
>  
> +/*
> + * Allocate one free dev extent and insert it into the fs.
> + */
> +static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
> +   struct btrfs_device *device,
> +   u64 chunk_offset, u64 num_bytes, u64 *start)
> +{
> + int ret;
> +
> + ret = find_free_dev_extent(device, num_bytes, start, NULL);
> + if (ret)
> + return ret;
> + return btrfs_insert_dev_extent(trans, device, chunk_offset, num_bytes,
> + *start);
> +}
> +
>  static int find_next_chunk(struct btrfs_fs_info *fs_info, u64 *offset)
>  {
>   struct btrfs_root *root = fs_info->chunk_root;
> @@ -1107,7 +1115,7 @@ again:
>   list_move_tail(>dev_list, dev_list);
>  
>   ret = btrfs_alloc_dev_extent(trans, device, key.offset,
> -  calc_size, _offset, 0);
> +  calc_size, _offset);
>   if (ret < 0)
>   goto out_chunk_map;
>  
> @@ -1241,8 +1249,12 @@ int btrfs_alloc_data_chunk(struct btrfs_trans_handle 
> *trans,
>   while (index < num_stripes) {
>   struct btrfs_stripe *stripe;
>  
> - ret = btrfs_alloc_dev_extent(trans, device, key.offset,
> -  calc_size, _offset, convert);
> + if (convert)
> + ret = btrfs_insert_dev_extent(trans, device, key.offset,
> + calc_size, dev_offset);
> + else
> + ret = btrfs_alloc_dev_extent(trans, device, key.offset,
> + calc_size, _offset);
>   BUG_ON(ret);
>  
>   device->bytes_used += calc_size;
> diff --git a/volumes.h b/volumes.h
> index b4ea93f0bec3..44284ee75adb 100644
> --- a/volumes.h
> +++ b/volumes.h
> @@ -268,6 +268,9 @@ int btrfs_open_devices(struct btrfs_fs_devices 
> *fs_devices,
>

Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-11-27 Thread Nikolay Borisov



On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> The throttle path doesn't take cleaner_delayed_iput_mutex, which means

Which one is the throttle path? btrfs_end_transaction_throttle is only
called during snapshot drop and relocation? What scenario triggered your
observation and prompted this fix?

> we could think we're done flushing iputs in the data space reservation
> path when we could have a throttler doing an iput.  There's no real
> reason to serialize the delayed iput flushing, so instead of taking the
> cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> replace it with an atomic counter and a waitqueue.  This removes the
> short (or long depending on how big the inode is) window where we think
> there are no more pending iputs when there really are some.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   |  4 +++-
>  fs/btrfs/disk-io.c |  5 ++---
>  fs/btrfs/extent-tree.c |  9 +
>  fs/btrfs/inode.c   | 21 +
>  4 files changed, 31 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 709de7471d86..a835fe7076eb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -912,7 +912,8 @@ struct btrfs_fs_info {
>  
>   spinlock_t delayed_iput_lock;
>   struct list_head delayed_iputs;
> - struct mutex cleaner_delayed_iput_mutex;
> + atomic_t nr_delayed_iputs;
> + wait_queue_head_t delayed_iputs_wait;
>  
>   /* this protects tree_mod_seq_list */
>   spinlock_t tree_mod_seq_lock;
> @@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
>  int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
>  void btrfs_add_delayed_iput(struct inode *inode);
>  void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
> +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
>  int btrfs_prealloc_file_range(struct inode *inode, int mode,
> u64 start, u64 num_bytes, u64 min_size,
> loff_t actual_len, u64 *alloc_hint);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index c5918ff8241b..3f81dfaefa32 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg)
>   goto sleep;
>   }
>  
> - mutex_lock(_info->cleaner_delayed_iput_mutex);
>   btrfs_run_delayed_iputs(fs_info);
> - mutex_unlock(_info->cleaner_delayed_iput_mutex);
>  
>   again = btrfs_clean_one_deleted_snapshot(root);
>   mutex_unlock(_info->cleaner_mutex);
> @@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb,
>   mutex_init(_info->delete_unused_bgs_mutex);
>   mutex_init(_info->reloc_mutex);
>   mutex_init(_info->delalloc_root_mutex);
> - mutex_init(_info->cleaner_delayed_iput_mutex);
>   seqlock_init(_info->profiles_lock);
>  
>   INIT_LIST_HEAD(_info->dirty_cowonly_roots);
> @@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb,
>   atomic_set(_info->defrag_running, 0);
>   atomic_set(_info->qgroup_op_seq, 0);
>   atomic_set(_info->reada_works_cnt, 0);
> + atomic_set(_info->nr_delayed_iputs, 0);
>   atomic64_set(_info->tree_mod_seq, 0);
>   fs_info->sb = sb;
>   fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
> @@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb,
>   init_waitqueue_head(_info->transaction_wait);
>   init_waitqueue_head(_info->transaction_blocked_wait);
>   init_waitqueue_head(_info->async_submit_wait);
> + init_waitqueue_head(_info->delayed_iputs_wait);
>  
>   INIT_LIST_HEAD(_info->pinned_chunks);
>  
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3a90dc1d6b31..36f43876be22 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
> *inode, u64 bytes)
>* operations. Wait for it to finish so that
>* more space is released.
>*/
> - 
> mutex_lock(_info->cleaner_delayed_iput_mutex);
> - 
> mutex_unlock(_info->cleaner_delayed_iput_mutex);
> + ret = btrfs_wait_on_delayed_iputs(fs_info);
> + if (ret)
> + return ret;
>   goto again;
>   } else {
>   btrfs_end_transaction(trans);
> @@ -4838,9 +4839,9 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>* pinned space, so make sure we run the iputs before we do our pinned
>* bytes check below.
>*/
> - mutex_lock(_info->cleaner_delayed_iput_mutex);
>   btrfs_run_delayed_iputs(fs_info);
> - 
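
The core of the replacement pattern, reduced to three fragments (simplified;
the exact wait primitive and error handling in the real patch may differ):

	/* when an iput is queued */
	atomic_inc(&fs_info->nr_delayed_iputs);

	/* after a queued iput has been processed */
	if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
		wake_up(&fs_info->delayed_iputs_wait);

	/* flushers that used to cycle cleaner_delayed_iput_mutex now do */
	wait_event(fs_info->delayed_iputs_wait,
		   atomic_read(&fs_info->nr_delayed_iputs) == 0);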

Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-27 Thread Nikolay Borisov



On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> The cleaner thread usually takes care of delayed iputs, with the
> exception of the btrfs_end_transaction_throttle path.  The cleaner
> thread only gets woken up every 30 seconds, so instead wake it up to do
> its work so that we can free up that space as quickly as possible.

Have you done any measurements of how this affects the overall system? I
suspect this introduces a lot of noise, since now we are going to be
doing a thread wakeup on every iput. Does this give a chance to have
nice, large batches of iputs that the cost of the wakeup can be amortized
across?

> 
> Reviewed-by: Filipe Manana 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/inode.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 3da9ac463344..3c42d8887183 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3264,6 +3264,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
>   ASSERT(list_empty(>delayed_iput));
>   list_add_tail(>delayed_iput, _info->delayed_iputs);
>   spin_unlock(_info->delayed_iput_lock);
> + wake_up_process(fs_info->cleaner_kthread);
>  }
>  
>  void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
> 
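
One way to address the wakeup-noise concern (illustrative only, not what the
series does): only kick the cleaner when the list goes from empty to
non-empty, so a burst of iputs costs a single wakeup and the cleaner drains
the whole batch in one pass.

void btrfs_add_delayed_iput(struct inode *inode)
{
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
	struct btrfs_inode *binode = BTRFS_I(inode);
	bool was_empty;

	if (atomic_add_unless(&inode->i_count, -1, 1))
		return;

	spin_lock(&fs_info->delayed_iput_lock);
	was_empty = list_empty(&fs_info->delayed_iputs);
	ASSERT(list_empty(&binode->delayed_iput));
	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
	spin_unlock(&fs_info->delayed_iput_lock);

	if (was_empty)
		wake_up_process(fs_info->cleaner_kthread);
}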


Re: [PATCH 4/5] btrfs-progs: image: Rebuild dev extents using chunk tree

2018-11-26 Thread Nikolay Borisov



On 27.11.18 г. 9:30 ч., Qu Wenruo wrote:
> 
> 
> On 2018/11/27 下午3:28, Nikolay Borisov wrote:
>>
>>
>> On 27.11.18 г. 4:33 ч., Qu Wenruo wrote:
>>> With existing dev extents cleaned up, now we can rebuild dev extents
>>> using the correct chunk tree.
>>>
>>> Since the new dev extents are all rebuilt from scratch, even if we're restoring
>>> an image from a multi-device fs to a single disk, we won't have any problem
>>> reported by btrfs check.
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  image/main.c | 34 ++
>>>  volumes.c| 10 +-
>>>  volumes.h|  4 
>>>  3 files changed, 43 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/image/main.c b/image/main.c
>>> index 707568f22e01..626eb933d5cc 100644
>>> --- a/image/main.c
>>> +++ b/image/main.c
>>> @@ -2265,12 +2265,46 @@ out:
>>>  static int fixup_dev_extents(struct btrfs_trans_handle *trans,
>>>  struct btrfs_fs_info *fs_info)
>>>  {
>>> +   struct btrfs_mapping_tree *map_tree = _info->mapping_tree;
>>> +   struct btrfs_device *dev;
>>> +   struct cache_extent *ce;
>>> +   struct map_lookup *map;
>>> +   u64 devid = btrfs_stack_device_id(_info->super_copy->dev_item);
>>> +   int i;
>>> int ret;
>>>  
>>> ret = remove_all_dev_extents(trans, fs_info);
>>> if (ret < 0)
>>> error("failed to remove all existing dev extents: %s",
>>> strerror(-ret));
>>> +
>>> +   dev = btrfs_find_device(fs_info, devid, NULL, NULL);
>>> +   if (!dev) {
>>> +   error("faild to find devid %llu", devid);
>>> +   return -ENODEV;
>>> +   }
>>> +
>>> +   /* Rebuild all dev extents using chunk maps */
>>> +   for (ce = search_cache_extent(_tree->cache_tree, 0); ce;
>>> +ce = next_cache_extent(ce)) {
>>> +   u64 stripe_len;
>>> +
>>> +   map = container_of(ce, struct map_lookup, ce);
>>> +   stripe_len = calc_stripe_length(map->type, ce->size,
>>> +   map->num_stripes);
>>> +   for (i = 0; i < map->num_stripes; i++) {
>>> +   ret = btrfs_alloc_dev_extent(trans, dev, ce->start,
>>> +   stripe_len, >stripes[i].physical, 1);
>>> +   if (ret < 0) {
>>> +   error(
>>> +   "failed to insert dev extent %llu %llu: %s",
>>> +   devid, map->stripes[i].physical,
>>> +   strerror(-ret));
>>> +   goto out;
>>> +   }
>>> +   }
>>> +   }
>>> +out:
>>> return ret;
>>>  }
>>>  
>>> diff --git a/volumes.c b/volumes.c
>>> index 30090ce5f8e8..73c9204fa7d1 100644
>>> --- a/volumes.c
>>> +++ b/volumes.c
>>> @@ -530,10 +530,10 @@ static int find_free_dev_extent(struct btrfs_device 
>>> *device, u64 num_bytes,
>>> return find_free_dev_extent_start(device, num_bytes, 0, start, len);
>>>  }
>>>  
>>> -static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
>>> - struct btrfs_device *device,
>>> - u64 chunk_offset, u64 num_bytes, u64 *start,
>>> - int convert)
>>> +int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
>>> +  struct btrfs_device *device,
>>> +  u64 chunk_offset, u64 num_bytes, u64 *start,
>>> +  int insert_asis)
>>
>> Make that parameter a bool. Also why do you rename it ?
> 
> Since it's no longer only used by convert.
> 
> The best naming may be two function, one called
> btrfs_insert_device_extent(), and then btrfs_alloc_device_extent().
> 
> As for convert and this use case, we are not allocating, but just
> inserting one.
> 
> What about above naming change?

Indeed, two functions seem better: btrfs_alloc_device_extent will be a
wrapper of btrfs_insert_device_extent + the added find_free_dev_extent,
whereas btrfs_insert_device_extent will be the function doing the actual
insert.

> 
> Thanks,
> Qu
> 
>>
>>>  {
>>&

Re: [PATCH 5/5] btrfs-progs: misc-tests/021: Do extra btrfs check before mounting

2018-11-26 Thread Nikolay Borisov



On 27.11.18 г. 4:33 ч., Qu Wenruo wrote:
> Test case misc/021 is testing if we could mount a single disk btrfs
> image recovered from multi disk fs.
> 
> The problem is, current kernel has extra check for block group, chunk
> and dev extent.
> This means any image can't pass btrfs check for chunk tree will not
> mount.
> 
> So do extra btrfs check before mount, this will also help us to locate
> the problem in btrfs-image easier.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  tests/misc-tests/021-image-multi-devices/test.sh | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/tests/misc-tests/021-image-multi-devices/test.sh 
> b/tests/misc-tests/021-image-multi-devices/test.sh
> index 5430847f4e2f..26beae6e4b85 100755
> --- a/tests/misc-tests/021-image-multi-devices/test.sh
> +++ b/tests/misc-tests/021-image-multi-devices/test.sh
> @@ -37,6 +37,9 @@ run_check $SUDO_HELPER wipefs -a "$loop2"
>  
>  run_check $SUDO_HELPER "$TOP/btrfs-image" -r "$IMAGE" "$loop1"
>  
> +# Run check to make sure there is nothing wrong for the recovered image
> +run_check "$TOP/btrfs" check "$loop1"

I think this needs to be run_check $SUDO_HELPER "$TOP/btrfs" check "$loop1"
> +
>  run_check $SUDO_HELPER mount "$loop1" "$TEST_MNT"
>  new_md5=$(run_check_stdout md5sum "$TEST_MNT/foobar" | cut -d ' ' -f 1)
>  run_check $SUDO_HELPER umount "$TEST_MNT"
> 


Re: [PATCH 4/5] btrfs-progs: image: Rebuild dev extents using chunk tree

2018-11-26 Thread Nikolay Borisov



On 27.11.18 г. 4:33 ч., Qu Wenruo wrote:
> With existing dev extents cleaned up, now we can rebuild dev extents
> using the correct chunk tree.
> 
> Since the new dev extents are all rebuilt from scratch, even if we're restoring
> an image from a multi-device fs to a single disk, we won't have any problem
> reported by btrfs check.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  image/main.c | 34 ++
>  volumes.c| 10 +-
>  volumes.h|  4 
>  3 files changed, 43 insertions(+), 5 deletions(-)
> 
> diff --git a/image/main.c b/image/main.c
> index 707568f22e01..626eb933d5cc 100644
> --- a/image/main.c
> +++ b/image/main.c
> @@ -2265,12 +2265,46 @@ out:
>  static int fixup_dev_extents(struct btrfs_trans_handle *trans,
>struct btrfs_fs_info *fs_info)
>  {
> + struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
> + struct btrfs_device *dev;
> + struct cache_extent *ce;
> + struct map_lookup *map;
> + u64 devid = btrfs_stack_device_id(&fs_info->super_copy->dev_item);
> + int i;
>   int ret;
>  
>   ret = remove_all_dev_extents(trans, fs_info);
>   if (ret < 0)
>   error("failed to remove all existing dev extents: %s",
>   strerror(-ret));
> +
> + dev = btrfs_find_device(fs_info, devid, NULL, NULL);
> + if (!dev) {
> + error("faild to find devid %llu", devid);
> + return -ENODEV;
> + }
> +
> + /* Rebuild all dev extents using chunk maps */
> +  for (ce = search_cache_extent(&map_tree->cache_tree, 0); ce;
> +  ce = next_cache_extent(ce)) {
> + u64 stripe_len;
> +
> + map = container_of(ce, struct map_lookup, ce);
> + stripe_len = calc_stripe_length(map->type, ce->size,
> + map->num_stripes);
> + for (i = 0; i < map->num_stripes; i++) {
> + ret = btrfs_alloc_dev_extent(trans, dev, ce->start,
> + stripe_len, &map->stripes[i].physical, 1);
> + if (ret < 0) {
> + error(
> + "failed to insert dev extent %llu %llu: %s",
> + devid, map->stripes[i].physical,
> + strerror(-ret));
> + goto out;
> + }
> + }
> + }
> +out:
>   return ret;
>  }
>  
> diff --git a/volumes.c b/volumes.c
> index 30090ce5f8e8..73c9204fa7d1 100644
> --- a/volumes.c
> +++ b/volumes.c
> @@ -530,10 +530,10 @@ static int find_free_dev_extent(struct btrfs_device 
> *device, u64 num_bytes,
>   return find_free_dev_extent_start(device, num_bytes, 0, start, len);
>  }
>  
> -static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
> -   struct btrfs_device *device,
> -   u64 chunk_offset, u64 num_bytes, u64 *start,
> -   int convert)
> +int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
> +struct btrfs_device *device,
> +u64 chunk_offset, u64 num_bytes, u64 *start,
> +int insert_asis)

Make that parameter a bool. Also, why do you rename it?

>  {
>   int ret;
>   struct btrfs_path *path;
> @@ -550,7 +550,7 @@ static int btrfs_alloc_dev_extent(struct 
> btrfs_trans_handle *trans,
>* For convert case, just skip search free dev_extent, as caller
>* is responsible to make sure it's free.
>*/
> - if (!convert) {
> + if (!insert_asis) {
>   ret = find_free_dev_extent(device, num_bytes, start, NULL);
>   if (ret)
>   goto err;
> diff --git a/volumes.h b/volumes.h
> index b4ea93f0bec3..5ca2779ebd45 100644
> --- a/volumes.h
> +++ b/volumes.h
> @@ -271,6 +271,10 @@ void btrfs_close_all_devices(void);
>  int btrfs_add_device(struct btrfs_trans_handle *trans,
>struct btrfs_fs_info *fs_info,
>struct btrfs_device *device);
> +int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
> +struct btrfs_device *device,
> +u64 chunk_offset, u64 num_bytes, u64 *start,
> +int insert_asis);
>  int btrfs_update_device(struct btrfs_trans_handle *trans,
>   struct btrfs_device *device);
>  int btrfs_scan_one_device(int fd, const char *path,
> 
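
Rough numbers for the dev extent sizes the rebuild loop above re-creates
(assuming the calc_stripe_length() semantics in btrfs-progs; the divisors
are illustrative):

	1 GiB chunk, SINGLE            -> 1 dev extent of 1 GiB
	1 GiB chunk, RAID1, 2 stripes  -> 2 dev extents of 1 GiB each
	1 GiB chunk, RAID0, 2 stripes  -> 2 dev extents of 512 MiB each

After btrfs-image -r has squashed everything onto one device the chunks are
all SINGLE, so each chunk maps to exactly one dev extent whose length equals
the chunk length, which is what keeps btrfs check happy.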


Re: [PATCH 3/5] btrfs-progs: image: Remove all existing dev extents for later rebuild

2018-11-26 Thread Nikolay Borisov



On 27.11.18 г. 4:33 ч., Qu Wenruo wrote:
> This patch will remove all existing dev extents for later rebuild.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  image/main.c | 68 
>  1 file changed, 68 insertions(+)
> 
> diff --git a/image/main.c b/image/main.c
> index 9060f6b1b665..707568f22e01 100644
> --- a/image/main.c
> +++ b/image/main.c
> @@ -2210,6 +2210,70 @@ static void fixup_block_groups(struct 
> btrfs_trans_handle *trans,
>   }
>  }
>  
> +static int remove_all_dev_extents(struct btrfs_trans_handle *trans,
> +   struct btrfs_fs_info *fs_info)

remove fs_info arg.


> +
> +static int fixup_dev_extents(struct btrfs_trans_handle *trans,
> +  struct btrfs_fs_info *fs_info)

Ditto

> +{
> + int ret;
> +
> + ret = remove_all_dev_extents(trans, fs_info);
> + if (ret < 0)
> + error("failed to remove all existing dev extents: %s",
> + strerror(-ret));
> + return ret;
> +}
> +
>  static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
>struct mdrestore_struct *mdres, off_t dev_size)
>  {
> @@ -2227,6 +2291,10 @@ static int fixup_chunks_and_devices(struct 
> btrfs_fs_info *fs_info,
>   }
>  
>   fixup_block_groups(trans, fs_info);
> + ret = fixup_dev_extents(trans, fs_info);
> + if (ret < 0)
> + goto error;
> +
>   ret = fixup_device_size(trans, fs_info, mdres, dev_size);
>   if (ret < 0)
>   goto error;
> 
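
The body of remove_all_dev_extents() is elided in the quote above; a minimal
btrfs-progs style sketch of the idea (the iteration details are assumptions,
not the code that was merged): walk the device tree and delete every
DEV_EXTENT item so the chunk-tree based rebuild starts from a clean slate.

static int remove_all_dev_extents_sketch(struct btrfs_trans_handle *trans)
{
	struct btrfs_root *root = trans->fs_info->dev_root;
	struct btrfs_path path;
	struct btrfs_key key;
	int ret;

	/* dev extents are keyed (devid, DEV_EXTENT, physical), devids start at 1 */
	key.objectid = 1;
	key.type = BTRFS_DEV_EXTENT_KEY;
	key.offset = 0;

	btrfs_init_path(&path);
	ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
	if (ret < 0)
		goto out;

	while (1) {
		struct extent_buffer *leaf = path.nodes[0];
		int slot = path.slots[0];

		if (slot >= btrfs_header_nritems(leaf)) {
			ret = btrfs_next_leaf(root, &path);
			if (ret)	/* > 0 means no more leaves */
				break;
			continue;
		}
		btrfs_item_key_to_cpu(leaf, &key, slot);
		if (key.type != BTRFS_DEV_EXTENT_KEY)
			break;
		ret = btrfs_del_item(trans, root, &path);
		if (ret < 0)
			break;
		/* the next item slides into this slot, so don't advance it */
	}
	if (ret > 0)
		ret = 0;
out:
	btrfs_release_path(&path);
	return ret;
}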


Re: [PATCH 2/5] btrfs-progs: image: Fix block group item flags when restoring multi-device image to single device

2018-11-26 Thread Nikolay Borisov



On 27.11.18 г. 4:33 ч., Qu Wenruo wrote:
> Since btrfs-image is just restoring tree blocks without really check if
> that tree block contents makes sense, for multi-device image, block
> group items will keep that incorrect block group flags.
> 
> For example, for a metadata RAID1 data RAID0 btrfs recovered to a single
> disk, its chunk tree will look like:
> 
>   item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096)
>   length 8388608 owner 2 stripe_len 65536 type SYSTEM
>   item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704)
>   length 1073741824 owner 2 stripe_len 65536 type METADATA
>   item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 1104150528)
>   length 1073741824 owner 2 stripe_len 65536 type DATA
> 
> All chunks have correct type (SINGLE).
> 
> While its block group items will look like:
> 
>   item 1 key (22020096 BLOCK_GROUP_ITEM 8388608)
>   block group used 16384 chunk_objectid 256 flags SYSTEM|RAID1
>   item 3 key (30408704 BLOCK_GROUP_ITEM 1073741824)
>   block group used 114688 chunk_objectid 256 flags METADATA|RAID1
>   item 11 key (1104150528 BLOCK_GROUP_ITEM 1073741824)
>   block group used 1572864 chunk_objectid 256 flags DATA|RAID0
> 
> All block group items still have the wrong profiles.
> 
> And btrfs check (lowmem mode for better output) will report error for such 
> image:
> 
>   ERROR: chunk[22020096 30408704) related block group item flags mismatch, 
> wanted: 2, have: 18
>   ERROR: chunk[30408704 1104150528) related block group item flags mismatch, 
> wanted: 4, have: 20
>   ERROR: chunk[1104150528 2177892352) related block group item flags 
> mismatch, wanted: 1, have: 9
> 
> This patch will do an extra repair for block group items to fix the
> profile of them.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  image/main.c | 47 +++
>  1 file changed, 47 insertions(+)
> 
> diff --git a/image/main.c b/image/main.c
> index 36b5c95ea146..9060f6b1b665 100644
> --- a/image/main.c
> +++ b/image/main.c
> @@ -2164,6 +2164,52 @@ again:
>   return 0;
>  }
>  
> +static void fixup_block_groups(struct btrfs_trans_handle *trans,
> +   struct btrfs_fs_info *fs_info)

You are not even using the trans handle in this function; why pass it?
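
A sketch of what I'd expect instead, assuming the final version really
does not need the handle:

static void fixup_block_groups(struct btrfs_fs_info *fs_info);

with the caller in fixup_chunks_and_devices() then just doing
fixup_block_groups(fs_info);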

> +{
> + struct btrfs_block_group_cache *bg;
> + struct btrfs_mapping_tree *map_tree = _info->mapping_tree;
> + struct cache_extent *ce;
> + struct map_lookup *map;
> + u64 extra_flags;
> +
> + for (ce = search_cache_extent(_tree->cache_tree, 0); ce;
> +  ce = next_cache_extent(ce)) {
> + map = container_of(ce, struct map_lookup, ce);
> +
> + bg = btrfs_lookup_block_group(fs_info, ce->start);
> + if (!bg) {
> + warning(
> + "can't find block group %llu, result fs may not be mountable",
> + ce->start);
> + continue;
> + }
> + extra_flags = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
> +
> + if (bg->flags == map->type)
> + continue;
> +
> + /* Update the block group item and mark the bg dirty */
> + bg->flags = map->type;
> + btrfs_set_block_group_flags(>item, bg->flags);
> + set_extent_bits(_info->block_group_cache, ce->start,
> + ce->start + ce->size - 1, BLOCK_GROUP_DIRTY);
> +
> + /*
> +  * Chunk and bg flags can be different, changing bg flags
> +  * without update avail_data/meta_alloc_bits will lead to
> +  * ENOSPC.
> +  * So here we set avail_*_alloc_bits to match chunk types.
> +  */
> + if (map->type & BTRFS_BLOCK_GROUP_DATA)
> + fs_info->avail_data_alloc_bits = extra_flags;
> + if (map->type & BTRFS_BLOCK_GROUP_METADATA)
> + fs_info->avail_metadata_alloc_bits = extra_flags;
> + if (map->type & BTRFS_BLOCK_GROUP_SYSTEM)
> + fs_info->avail_system_alloc_bits = extra_flags;
> + }
> +}
> +
>  static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
>struct mdrestore_struct *mdres, off_t dev_size)
>  {
> @@ -2180,6 +2226,7 @@ static int fixup_chunks_and_devices(struct 
> btrfs_fs_info *fs_info,
>   return PTR_ERR(trans);
>   }
>  
> + fixup_block_groups(trans, fs_info);
>   ret = fixup_device_size(trans, fs_info, mdres, dev_size);
>   if (ret < 0)
>   goto error;
> 


Re: [PATCH 1/5] btrfs-progs: image: Refactor fixup_devices() to fixup_chunks_and_devices()

2018-11-26 Thread Nikolay Borisov



On 27.11.18 г. 4:33 ч., Qu Wenruo wrote:
> Current fixup_devices() will only remove DEV_ITEMs and reset DEV_ITEM
> size.
> Later we need to do more fixup works, so change the name to
> fixup_chunks_and_devices() and refactor the original device size fixup
> to fixup_device_size().
> 
> Signed-off-by: Qu Wenruo 
> ---
>  image/main.c | 52 
>  1 file changed, 36 insertions(+), 16 deletions(-)
> 
> diff --git a/image/main.c b/image/main.c
> index c680ab19de6c..36b5c95ea146 100644
> --- a/image/main.c
> +++ b/image/main.c
> @@ -2084,28 +2084,19 @@ static void remap_overlapping_chunks(struct 
> mdrestore_struct *mdres)
>   }
>  }
>  
> -static int fixup_devices(struct btrfs_fs_info *fs_info,
> -  struct mdrestore_struct *mdres, off_t dev_size)
> +static int fixup_device_size(struct btrfs_trans_handle *trans,
> +  struct btrfs_fs_info *fs_info,

trans already has a handle to the fs_info, so you can drop it from the
param list.
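
I.e. something like this sketch, assuming the progs'
btrfs_trans_handle carries fs_info the same way the kernel one does:

static int fixup_device_size(struct btrfs_trans_handle *trans,
                             struct mdrestore_struct *mdres,
                             off_t dev_size)
{
        struct btrfs_fs_info *fs_info = trans->fs_info;

        /* ... rest of the function as in the patch ... */
}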

> +  struct mdrestore_struct *mdres,
> +  off_t dev_size)
>  {
> - struct btrfs_trans_handle *trans;
>   struct btrfs_dev_item *dev_item;
>   struct btrfs_path path;
> - struct extent_buffer *leaf;
>   struct btrfs_root *root = fs_info->chunk_root;
>   struct btrfs_key key;
> + struct extent_buffer *leaf;
>   u64 devid, cur_devid;
>   int ret;
>  
> - if (btrfs_super_log_root(fs_info->super_copy)) {
> - warning(
> - "log tree detected, its generation will not match superblock");
> - }
> - trans = btrfs_start_transaction(fs_info->tree_root, 1);
> - if (IS_ERR(trans)) {
> - error("cannot starting transaction %ld", PTR_ERR(trans));
> - return PTR_ERR(trans);
> - }
> -
>   dev_item = _info->super_copy->dev_item;
>  
>   devid = btrfs_stack_device_id(dev_item);
> @@ -2123,7 +2114,7 @@ again:
>   ret = btrfs_search_slot(trans, root, , , -1, 1);
>   if (ret < 0) {
>   error("search failed: %d", ret);
> - exit(1);
> + return ret;
>   }
>  
>   while (1) {
> @@ -2170,12 +2161,41 @@ again:
>   }
>  
>   btrfs_release_path();
> + return 0;
> +}
> +
> +static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
> +  struct mdrestore_struct *mdres, off_t dev_size)
> +{
> + struct btrfs_trans_handle *trans;
> + int ret;
> +
> + if (btrfs_super_log_root(fs_info->super_copy)) {
> + warning(
> + "log tree detected, its generation will not match superblock");
> + }
> + trans = btrfs_start_transaction(fs_info->tree_root, 1);
> + if (IS_ERR(trans)) {
> + error("cannot starting transaction %ld", PTR_ERR(trans));
> + return PTR_ERR(trans);
> + }
> +
> + ret = fixup_device_size(trans, fs_info, mdres, dev_size);
> + if (ret < 0)
> + goto error;
> +
>   ret = btrfs_commit_transaction(trans, fs_info->tree_root);
>   if (ret) {
>   error("unable to commit transaction: %d", ret);
>   return ret;
>   }
>   return 0;
> +error:
> + error(
> +"failed to fix chunks and devices mapping, the fs may not be mountable: %s",
> + strerror(-ret));
> + btrfs_abort_transaction(trans, ret);
> + return ret;
>  }
>  
>  static int restore_metadump(const char *input, FILE *out, int old_restore,
> @@ -2282,7 +2302,7 @@ static int restore_metadump(const char *input, FILE 
> *out, int old_restore,
>   return 1;
>   }
>  
> - ret = fixup_devices(info, , st.st_size);
> + ret = fixup_chunks_and_devices(info, , st.st_size);
>   close_ctree(info->chunk_root);
>   if (ret)
>   goto out;
> 


Re: [PATCH 1/3] btrfs: run delayed iputs before committing

2018-11-26 Thread Nikolay Borisov



On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> Delayed iputs means we can have final iputs of deleted inodes in the
> queue, which could potentially generate a lot of pinned space that could
> be free'd.  So before we decide to commit the transaction for ENOPSC
> reasons, run the delayed iputs so that any potential space is free'd up.
> If there is and we freed enough we can then commit the transaction and
> potentially be able to make our reservation.
> 
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 
> ---
>  fs/btrfs/extent-tree.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 90423b6749b7..3a90dc1d6b31 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4833,6 +4833,15 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>   if (!bytes)
>   return 0;
>  
> + /*
> +  * If we have pending delayed iputs then we could free up a bunch of
> +  * pinned space, so make sure we run the iputs before we do our pinned
> +  * bytes check below.
> +  */
> + mutex_lock(_info->cleaner_delayed_iput_mutex);
> + btrfs_run_delayed_iputs(fs_info);
> + mutex_unlock(_info->cleaner_delayed_iput_mutex);

IMHO this code is better suited to the COMMIT_TRANS case in
flush_space(). Let's leave may_commit_transaction() to just decide
whether it should commit or not.
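
Roughly what I have in mind, just a sketch against the current
flush_space() switch (names as used in this patch):

        case COMMIT_TRANS:
                /*
                 * Run the delayed iputs here so that
                 * may_commit_transaction() only has to decide whether
                 * committing makes sense.
                 */
                mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
                btrfs_run_delayed_iputs(fs_info);
                mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
                ret = may_commit_transaction(fs_info, space_info);
                break;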

> +
>   trans = btrfs_join_transaction(fs_info->extent_root);
>   if (IS_ERR(trans))
>   return PTR_ERR(trans);
> 


Re: [PATCH 7/8] btrfs: be more explicit about allowed flush states

2018-11-26 Thread Nikolay Borisov



On 26.11.18 г. 14:41 ч., Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 21:03 ч., Josef Bacik wrote:
>> For FLUSH_LIMIT flushers we really can only allocate chunks and flush
>> delayed inode items, everything else is problematic.  I added a bunch of
>> new states and it lead to weirdness in the FLUSH_LIMIT case because I
>> forgot about how it worked.  So instead explicitly declare the states
>> that are ok for flushing with FLUSH_LIMIT and use that for our state
>> machine.  Then as we add new things that are safe we can just add them
>> to this list.
> 
> 
> Code-wise it's ok but the changelog needs rewording. At the very least
> explain the weirdness. Also in the last sentence the word 'thing' is
> better substituted with "flush states".

Case in point, you yourself mention that you have forgotten how the
FLUSH_LIMIT case works. That's why we need good changelogs so that those
details can be quickly worked out from reading the changelog.


> 
>>
>> Signed-off-by: Josef Bacik 
>> ---
>>  fs/btrfs/extent-tree.c | 21 ++---
>>  1 file changed, 10 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 0e9ba77e5316..e31980d451c2 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -5112,12 +5112,18 @@ void btrfs_init_async_reclaim_work(struct 
>> work_struct *work)
>>  INIT_WORK(work, btrfs_async_reclaim_metadata_space);
>>  }
>>  
>> +static const enum btrfs_flush_state priority_flush_states[] = {
>> +FLUSH_DELAYED_ITEMS_NR,
>> +FLUSH_DELAYED_ITEMS,
>> +ALLOC_CHUNK,
>> +};
>> +
>>  static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
>>  struct btrfs_space_info *space_info,
>>  struct reserve_ticket *ticket)
>>  {
>>  u64 to_reclaim;
>> -int flush_state = FLUSH_DELAYED_ITEMS_NR;
>> +int flush_state = 0;
>>  
>>  spin_lock(_info->lock);
>>  to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
>> @@ -5129,7 +5135,8 @@ static void priority_reclaim_metadata_space(struct 
>> btrfs_fs_info *fs_info,
>>  spin_unlock(_info->lock);
>>  
>>  do {
>> -flush_space(fs_info, space_info, to_reclaim, flush_state);
>> +flush_space(fs_info, space_info, to_reclaim,
>> +priority_flush_states[flush_state]);
>>  flush_state++;
>>  spin_lock(_info->lock);
>>  if (ticket->bytes == 0) {
>> @@ -5137,15 +5144,7 @@ static void priority_reclaim_metadata_space(struct 
>> btrfs_fs_info *fs_info,
>>  return;
>>  }
>>  spin_unlock(_info->lock);
>> -
>> -/*
>> - * Priority flushers can't wait on delalloc without
>> - * deadlocking.
>> - */
>> -if (flush_state == FLUSH_DELALLOC ||
>> -flush_state == FLUSH_DELALLOC_WAIT)
>> -flush_state = ALLOC_CHUNK;
>> -} while (flush_state < COMMIT_TRANS);
>> +} while (flush_state < ARRAY_SIZE(priority_flush_states));
>>  }
>>  
>>  static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
>>
> 


Re: [PATCH 7/8] btrfs: be more explicit about allowed flush states

2018-11-26 Thread Nikolay Borisov



On 21.11.18 г. 21:03 ч., Josef Bacik wrote:
> For FLUSH_LIMIT flushers we really can only allocate chunks and flush
> delayed inode items, everything else is problematic.  I added a bunch of
> new states and it lead to weirdness in the FLUSH_LIMIT case because I
> forgot about how it worked.  So instead explicitly declare the states
> that are ok for flushing with FLUSH_LIMIT and use that for our state
> machine.  Then as we add new things that are safe we can just add them
> to this list.


Code-wise it's ok but the changelog needs rewording. At the very least
explain the weirdness. Also in the last sentence the word 'thing' is
better substituted with "flush states".

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0e9ba77e5316..e31980d451c2 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5112,12 +5112,18 @@ void btrfs_init_async_reclaim_work(struct work_struct 
> *work)
>   INIT_WORK(work, btrfs_async_reclaim_metadata_space);
>  }
>  
> +static const enum btrfs_flush_state priority_flush_states[] = {
> + FLUSH_DELAYED_ITEMS_NR,
> + FLUSH_DELAYED_ITEMS,
> + ALLOC_CHUNK,
> +};
> +
>  static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
>   struct btrfs_space_info *space_info,
>   struct reserve_ticket *ticket)
>  {
>   u64 to_reclaim;
> - int flush_state = FLUSH_DELAYED_ITEMS_NR;
> + int flush_state = 0;
>  
>   spin_lock(_info->lock);
>   to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
> @@ -5129,7 +5135,8 @@ static void priority_reclaim_metadata_space(struct 
> btrfs_fs_info *fs_info,
>   spin_unlock(_info->lock);
>  
>   do {
> - flush_space(fs_info, space_info, to_reclaim, flush_state);
> + flush_space(fs_info, space_info, to_reclaim,
> + priority_flush_states[flush_state]);
>   flush_state++;
>   spin_lock(_info->lock);
>   if (ticket->bytes == 0) {
> @@ -5137,15 +5144,7 @@ static void priority_reclaim_metadata_space(struct 
> btrfs_fs_info *fs_info,
>   return;
>   }
>   spin_unlock(_info->lock);
> -
> - /*
> -  * Priority flushers can't wait on delalloc without
> -  * deadlocking.
> -  */
> - if (flush_state == FLUSH_DELALLOC ||
> - flush_state == FLUSH_DELALLOC_WAIT)
> - flush_state = ALLOC_CHUNK;
> - } while (flush_state < COMMIT_TRANS);
> + } while (flush_state < ARRAY_SIZE(priority_flush_states));
>  }
>  
>  static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
> 


Re: [PATCH 5/8] btrfs: don't enospc all tickets on flush failure

2018-11-26 Thread Nikolay Borisov



On 21.11.18 г. 21:03 ч., Josef Bacik wrote:
> With the introduction of the per-inode block_rsv it became possible to
> have really really large reservation requests made because of data
> fragmentation.  Since the ticket stuff assumed that we'd always have
> relatively small reservation requests it just killed all tickets if we
> were unable to satisfy the current request.  However this is generally
> not the case anymore.  So fix this logic to instead see if we had a
> ticket that we were able to give some reservation to, and if we were
> continue the flushing loop again.  Likewise we make the tickets use the
> space_info_add_old_bytes() method of returning what reservation they did
> receive in hopes that it could satisfy reservations down the line.


The logic of the patch can be summarised as follows:

If no progress is made for a ticket, then start failing all tickets up
to and including the first one that has had progress made on its
reservation. That first ticket will still be failed, but at least its
space will be reused via space_info_add_old_bytes.

Frankly this seems really arbitrary.

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 45 +
>  1 file changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e6bb6ce23c84..983d086fa768 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4791,6 +4791,7 @@ static void shrink_delalloc(struct btrfs_fs_info 
> *fs_info, u64 to_reclaim,
>  }
>  
>  struct reserve_ticket {
> + u64 orig_bytes;
>   u64 bytes;
>   int error;
>   struct list_head list;
> @@ -5012,7 +5013,7 @@ static inline int need_do_async_reclaim(struct 
> btrfs_fs_info *fs_info,
>   !test_bit(BTRFS_FS_STATE_REMOUNTING, _info->fs_state));
>  }
>  
> -static void wake_all_tickets(struct list_head *head)
> +static bool wake_all_tickets(struct list_head *head)
>  {
>   struct reserve_ticket *ticket;
>  
> @@ -5021,7 +5022,10 @@ static void wake_all_tickets(struct list_head *head)
>   list_del_init(>list);
>   ticket->error = -ENOSPC;
>   wake_up(>wait);
> + if (ticket->bytes != ticket->orig_bytes)
> + return true;
>   }
> + return false;
>  }
>  
>  /*
> @@ -5089,8 +5093,12 @@ static void btrfs_async_reclaim_metadata_space(struct 
> work_struct *work)
>   if (flush_state > COMMIT_TRANS) {
>   commit_cycles++;
>   if (commit_cycles > 2) {
> - wake_all_tickets(_info->tickets);
> - space_info->flush = 0;
> + if (wake_all_tickets(_info->tickets)) {
> + flush_state = FLUSH_DELAYED_ITEMS_NR;
> + commit_cycles--;
> + } else {
> + space_info->flush = 0;
> + }
>   } else {
>   flush_state = FLUSH_DELAYED_ITEMS_NR;
>   }
> @@ -5142,10 +5150,11 @@ static void priority_reclaim_metadata_space(struct 
> btrfs_fs_info *fs_info,
>  
>  static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
>  struct btrfs_space_info *space_info,
> -struct reserve_ticket *ticket, u64 orig_bytes)
> +struct reserve_ticket *ticket)
>  
>  {
>   DEFINE_WAIT(wait);
> + u64 reclaim_bytes = 0;
>   int ret = 0;
>  
>   spin_lock(_info->lock);
> @@ -5166,14 +5175,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info 
> *fs_info,
>   ret = ticket->error;
>   if (!list_empty(>list))
>   list_del_init(>list);
> - if (ticket->bytes && ticket->bytes < orig_bytes) {
> - u64 num_bytes = orig_bytes - ticket->bytes;
> - update_bytes_may_use(space_info, -num_bytes);
> - trace_btrfs_space_reservation(fs_info, "space_info",
> -   space_info->flags, num_bytes, 0);
> - }
> + if (ticket->bytes && ticket->bytes < ticket->orig_bytes)
> + reclaim_bytes = ticket->orig_bytes - ticket->bytes;
>   spin_unlock(_info->lock);
>  
> + if (reclaim_bytes)
> + space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
>   return ret;
>  }
>  
> @@ -5199,6 +5206,7 @@ static int __reserve_metadata_bytes(struct 
> btrfs_fs_info *fs_info,
>  {
>   struct reserve_ticket ticket;
>   u64 used;
> + u64 reclaim_bytes = 0;
>   int ret = 0;
>  
>   ASSERT(orig_bytes);
> @@ -5234,6 +5242,7 @@ static int __reserve_metadata_bytes(struct 
> btrfs_fs_info *fs_info,
>* the list and we will do our own flushing further down.
>*/
>   if (ret && flush != 

Re: [PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code

2018-11-26 Thread Nikolay Borisov



On 21.11.18 г. 21:03 ч., Josef Bacik wrote:
> With my change to no longer take into account the global reserve for
> metadata allocation chunks we have this side-effect for mixed block
> group fs'es where we are no longer allocating enough chunks for the
> data/metadata requirements.  To deal with this add a ALLOC_CHUNK_FORCE
> step to the flushing state machine.  This will only get used if we've
> already made a full loop through the flushing machinery and tried
> committing the transaction.  If we have then we can try and force a
> chunk allocation since we likely need it to make progress.  This
> resolves the issues I was seeing with the mixed bg tests in xfstests
> with my previous patch.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

Still, my observation is that the metadata reclaim code keeps
increasing in complexity for rather niche use cases, and the details
are becoming way too subtle.

> ---
>  fs/btrfs/ctree.h |  3 ++-
>  fs/btrfs/extent-tree.c   | 18 +-
>  include/trace/events/btrfs.h |  1 +
>  3 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0c6d589c8ce4..8ccc5019172b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2750,7 +2750,8 @@ enum btrfs_flush_state {
>   FLUSH_DELALLOC  =   5,
>   FLUSH_DELALLOC_WAIT =   6,
>   ALLOC_CHUNK =   7,
> - COMMIT_TRANS=   8,
> + ALLOC_CHUNK_FORCE   =   8,
> + COMMIT_TRANS=   9,
>  };
>  
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a91b3183dcae..e6bb6ce23c84 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4927,6 +4927,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>   btrfs_end_transaction(trans);
>   break;
>   case ALLOC_CHUNK:
> + case ALLOC_CHUNK_FORCE:
>   trans = btrfs_join_transaction(root);
>   if (IS_ERR(trans)) {
>   ret = PTR_ERR(trans);
> @@ -4934,7 +4935,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>   }
>   ret = do_chunk_alloc(trans,
>btrfs_metadata_alloc_profile(fs_info),
> -  CHUNK_ALLOC_NO_FORCE);
> +  (state == ALLOC_CHUNK) ?
> +  CHUNK_ALLOC_NO_FORCE :
> +  CHUNK_ALLOC_FORCE);
>   btrfs_end_transaction(trans);
>   if (ret > 0 || ret == -ENOSPC)
>   ret = 0;
> @@ -5070,6 +5073,19 @@ static void btrfs_async_reclaim_metadata_space(struct 
> work_struct *work)
>   commit_cycles--;
>   }
>  
> + /*
> +  * We don't want to force a chunk allocation until we've tried
> +  * pretty hard to reclaim space.  Think of the case where we
> +  * free'd up a bunch of space and so have a lot of pinned space
> +  * to reclaim.  We would rather use that than possibly create a
> +  * underutilized metadata chunk.  So if this is our first run
> +  * through the flushing state machine skip ALLOC_CHUNK_FORCE and
> +  * commit the transaction.  If nothing has changed the next go
> +  * around then we can force a chunk allocation.
> +  */
> + if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
> + flush_state++;
> +
>   if (flush_state > COMMIT_TRANS) {
>   commit_cycles++;
>   if (commit_cycles > 2) {
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 63d1f9d8b8c7..dd0e6f8d6b6e 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
>   { FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"},   
> \
>   { FLUSH_DELAYED_REFS,   "FLUSH_ELAYED_REFS"},   
> \
>   { ALLOC_CHUNK,  "ALLOC_CHUNK"}, 
> \
> + { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"},   
> \
>   { COMMIT_TRANS, "COMMIT_TRANS"})
>  
>  TRACE_EVENT(btrfs_flush_space,
> 


Re: [PATCH 3/8] btrfs: don't use global rsv for chunk allocation

2018-11-26 Thread Nikolay Borisov



On 21.11.18 г. 21:03 ч., Josef Bacik wrote:
> We've done this forever because of the voodoo around knowing how much
> space we have.  However we have better ways of doing this now, and on
> normal file systems we'll easily have a global reserve of 512MiB, and
> since metadata chunks are usually 1GiB that means we'll allocate
> metadata chunks more readily.  Instead use the actual used amount when
> determining if we need to allocate a chunk or not.

This explanation could use more concrete wording; currently it's way
too hand-wavy/vague.

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 7a30fbc05e5e..a91b3183dcae 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4388,21 +4388,12 @@ static inline u64 calc_global_rsv_need_space(struct 
> btrfs_block_rsv *global)
>  static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
> struct btrfs_space_info *sinfo, int force)
>  {
> - struct btrfs_block_rsv *global_rsv = _info->global_block_rsv;
>   u64 bytes_used = btrfs_space_info_used(sinfo, false);
>   u64 thresh;
>  
>   if (force == CHUNK_ALLOC_FORCE)
>   return 1;
>  
> - /*
> -  * We need to take into account the global rsv because for all intents
> -  * and purposes it's used space.  Don't worry about locking the
> -  * global_rsv, it doesn't change except when the transaction commits.
> -  */
> - if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
> - bytes_used += calc_global_rsv_need_space(global_rsv);
> -
>   /*
>* in limited mode, we want to have some free space up to
>* about 1% of the FS size.
> 


Re: [PATCH 1/8] btrfs: check if free bgs for commit

2018-11-26 Thread Nikolay Borisov



On 21.11.18 г. 21:03 ч., Josef Bacik wrote:
> may_commit_transaction will skip committing the transaction if we don't
> have enough pinned space or if we're trying to find space for a SYSTEM
> chunk.  However if we have pending free block groups in this transaction
> we still want to commit as we may be able to allocate a chunk to make
> our reservation.  So instead of just returning ENOSPC, check if we have
> free block groups pending, and if so commit the transaction to allow us
> to use that free space.
> 
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 33 +++--
>  1 file changed, 19 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 8af68b13fa27..0dca250dc02e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4843,10 +4843,18 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   if (!bytes)
>   return 0;
>  
> - /* See if there is enough pinned space to make this reservation */
> - if (__percpu_counter_compare(_info->total_bytes_pinned,
> -bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
> + trans = btrfs_join_transaction(fs_info->extent_root);
> + if (IS_ERR(trans))
> + return PTR_ERR(trans);
> +
> + /*
> +  * See if there is enough pinned space to make this reservation, or if
> +  * we have bg's that are going to be freed, allowing us to possibly do a
> +  * chunk allocation the next loop through.
> +  */
> + if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, >transaction->flags) ||
> + __percpu_counter_compare(_info->total_bytes_pinned, bytes,
> +  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
>   goto commit;
>  
>   /*
> @@ -4854,7 +4862,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>* this reservation.
>*/
>   if (space_info != delayed_rsv->space_info)
> - return -ENOSPC;
> + goto enospc;
>  
>   spin_lock(_rsv->lock);
>   reclaim_bytes += delayed_rsv->reserved;
> @@ -4868,17 +4876,14 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   bytes -= reclaim_bytes;
>  
>   if (__percpu_counter_compare(_info->total_bytes_pinned,
> -bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
> - return -ENOSPC;
> - }
> -
> +  bytes,
> +  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
> + goto enospc;
>  commit:
> - trans = btrfs_join_transaction(fs_info->extent_root);
> - if (IS_ERR(trans))
> - return -ENOSPC;
> -
>   return btrfs_commit_transaction(trans);
> +enospc:
> + btrfs_end_transaction(trans);
> + return -ENOSPC;
>  }
>  
>  /*
> 


Re: [PATCH 2/2] btrfs: scrub: fix circular locking dependency warning

2018-11-26 Thread Nikolay Borisov



On 26.11.18 г. 11:07 ч., Anand Jain wrote:
> Circular locking dependency check reports warning[1], that's because
> the btrfs_scrub_dev() calls the stack #0 below with, the
> fs_info::scrub_lock held. The test case leading to this warning..
> 
>   mkfs.btrfs -fq /dev/sdb && mount /dev/sdb /btrfs
>   btrfs scrub start -B /btrfs
> 
> In fact we have fs_info::scrub_workers_refcnt to tack if the init and
> destroy of the scrub workers are needed. So once we have incremented
> and decremented the fs_info::scrub_workers_refcnt value in the thread,
> its ok to drop the scrub_lock, and then actually do the
> btrfs_destroy_workqueue() part. So this patch drops the scrub_lock
> before calling btrfs_destroy_workqueue().
> 
> [1]
> [   76.146826] ==
> [   76.147086] WARNING: possible circular locking dependency detected
> [   76.147316] 4.20.0-rc3+ #41 Not tainted
> [   76.147489] --
> [   76.147722] btrfs/4065 is trying to acquire lock:
> [   76.147984] 38593bc0 ((wq_completion)"%s-%s""btrfs",
> name){+.+.}, at: flush_workqueue+0x70/0x4d0
> [   76.148337]
> but task is already holding lock:
> [   76.148594] 62392ab7 (_info->scrub_lock){+.+.}, at:
> btrfs_scrub_dev+0x316/0x5d0 [btrfs]
> [   76.148909]
> which lock already depends on the new lock.
> 
> [   76.149191]
> the existing dependency chain (in reverse order) is:
> [   76.149446]
> -> #3 (_info->scrub_lock){+.+.}:
> [   76.149707]btrfs_scrub_dev+0x11f/0x5d0 [btrfs]
> [   76.149924]btrfs_ioctl+0x1ac3/0x2d80 [btrfs]
> [   76.150216]do_vfs_ioctl+0xa9/0x6d0
> [   76.150468]ksys_ioctl+0x60/0x90
> [   76.150716]__x64_sys_ioctl+0x16/0x20
> [   76.150911]do_syscall_64+0x50/0x180
> [   76.151182]entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   76.151469]
> -> #2 (_devs->device_list_mutex){+.+.}:
> [   76.151851]reada_start_machine_worker+0xca/0x3f0 [btrfs]
> [   76.152195]normal_work_helper+0xf0/0x4c0 [btrfs]
> [   76.152489]process_one_work+0x1f4/0x520
> [   76.152751]worker_thread+0x46/0x3d0
> [   76.153715]kthread+0xf8/0x130
> [   76.153912]ret_from_fork+0x3a/0x50
> [   76.154178]
> -> #1 ((work_completion)(>normal_work)){+.+.}:
> [   76.154575]worker_thread+0x46/0x3d0
> [   76.154828]kthread+0xf8/0x130
> [   76.155108]ret_from_fork+0x3a/0x50
> [   76.155357]
> -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}:
> [   76.155751]flush_workqueue+0x9a/0x4d0
> [   76.155911]drain_workqueue+0xca/0x1a0
> [   76.156182]destroy_workqueue+0x17/0x230
> [   76.156455]btrfs_destroy_workqueue+0x5d/0x1c0 [btrfs]
> [   76.156756]scrub_workers_put+0x2e/0x60 [btrfs]
> [   76.156931]btrfs_scrub_dev+0x329/0x5d0 [btrfs]
> [   76.157219]btrfs_ioctl+0x1ac3/0x2d80 [btrfs]
> [   76.157491]do_vfs_ioctl+0xa9/0x6d0
> [   76.157742]ksys_ioctl+0x60/0x90
> [   76.157910]__x64_sys_ioctl+0x16/0x20
> [   76.158177]do_syscall_64+0x50/0x180
> [   76.158429]entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   76.158716]
> other info that might help us debug this:
> 
> [   76.158908] Chain exists of:
>   (wq_completion)"%s-%s""btrfs", name --> _devs->device_list_mutex
> --> _info->scrub_lock
> 
> [   76.159629]  Possible unsafe locking scenario:
> 
> [   76.160607]CPU0CPU1
> [   76.160934]
> [   76.161210]   lock(_info->scrub_lock);
> [   76.161458]
> lock(_devs->device_list_mutex);
> [   76.161805]
> lock(_info->scrub_lock);
> [   76.161909]   lock((wq_completion)"%s-%s""btrfs", name);
> [   76.162201]
>  *** DEADLOCK ***
> 
> [   76.162627] 2 locks held by btrfs/4065:
> [   76.162897]  #0: bef2775b (sb_writers#12){.+.+}, at:
> mnt_want_write_file+0x24/0x50
> [   76.163335]  #1: 62392ab7 (_info->scrub_lock){+.+.}, at:
> btrfs_scrub_dev+0x316/0x5d0 [btrfs]
> [   76.163796]
> stack backtrace:
> [   76.163911] CPU: 1 PID: 4065 Comm: btrfs Not tainted 4.20.0-rc3+ #41
> [   76.164228] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
> VirtualBox 12/01/2006
> [   76.164646] Call Trace:
> [   76.164872]  dump_stack+0x5e/0x8b
> [   76.165128]  print_circular_bug.isra.37+0x1f1/0x1fe
> [   76.165398]  __lock_acquire+0x14aa/0x1620
> [   76.165652]  lock_acquire+0xb0/0x190
> [   76.165910]  ? flush_workqueue+0x70/0x4d0
> [   76.166175]  flush_workqueue+0x9a/0x4d0
> [   76.166420]  ? flush_workqueue+0x70/0x4d0
> [   76.166671]  ? drain_workqueue+0x52/0x1a0
> [   76.166911]  drain_workqueue+0xca/0x1a0
> [   76.167167]  destroy_workqueue+0x17/0x230
> [   76.167428]  btrfs_destroy_workqueue+0x5d/0x1c0 [btrfs]
> [   76.167720]  scrub_workers_put+0x2e/0x60 [btrfs]
> [   76.168233]  btrfs_scrub_dev+0x329/0x5d0 [btrfs]
> [   76.168504]  ? __sb_start_write+0x121/0x1b0
> [   76.168759]  ? 

Re: [PATCH 1/2] btrfs: scrub: maintain the unlock order in scrub thread

2018-11-26 Thread Nikolay Borisov



On 26.11.18 г. 11:07 ч., Anand Jain wrote:
> The fs_info::device_list_mutex and fs_info::scrub_lock creates a
> nested locks in btrfs_scrub_dev(). During the lock acquire the
> hierarchy is fs_info::device_list_mutex and then fs_info::scrub_lock,
> so following the same reverse order during unlock, that is
> fs_info::scrub_lock and then fs_info::device_list_mutex.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/scrub.c | 16 +++-
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 902819d3cf41..b1c2d1cdbd4b 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3865,7 +3865,6 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   }
>   sctx->readonly = readonly;
>   dev->scrub_ctx = sctx;
> - mutex_unlock(_info->fs_devices->device_list_mutex);
>  
>   /*
>* checking @scrub_pause_req here, we can avoid
> @@ -3875,15 +3874,14 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, 
> u64 devid, u64 start,
>   atomic_inc(_info->scrubs_running);
>   mutex_unlock(_info->scrub_lock);
>  
> - if (!is_dev_replace) {
> - /*
> -  * by holding device list mutex, we can
> -  * kick off writing super in log tree sync.
> -  */
> - mutex_lock(_info->fs_devices->device_list_mutex);
> + /*
> +  * by holding device list mutex, we can kick off writing super in log
> +  * tree sync.
> +  */
> + if (!is_dev_replace)
>   ret = scrub_supers(sctx, dev);
> - mutex_unlock(_info->fs_devices->device_list_mutex);
> - }
> +
> + mutex_unlock(_info->fs_devices->device_list_mutex);

Have you considered whether this change will have any negative impact
due to the fact that __scrub_blocked_if_needed() can now go to sleep
for an arbitrary amount of time with device_list_mutex held?

>  
>   if (!ret)
>   ret = scrub_enumerate_chunks(sctx, dev, start, end);
> 


Re: [PATCH 6/6] btrfs: fix truncate throttling

2018-11-26 Thread Nikolay Borisov



On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> We have a bunch of magic to make sure we're throttling delayed refs when
> truncating a file.  Now that we have a delayed refs rsv and a mechanism
> for refilling that reserve simply use that instead of all of this magic.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/inode.c | 79 
> 
>  1 file changed, 17 insertions(+), 62 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8532a2eb56d1..cae30f6c095f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4437,31 +4437,6 @@ static int btrfs_rmdir(struct inode *dir, struct 
> dentry *dentry)
>   return err;
>  }
>  
> -static int truncate_space_check(struct btrfs_trans_handle *trans,
> - struct btrfs_root *root,
> - u64 bytes_deleted)
> -{
> - struct btrfs_fs_info *fs_info = root->fs_info;
> - int ret;
> -
> - /*
> -  * This is only used to apply pressure to the enospc system, we don't
> -  * intend to use this reservation at all.
> -  */
> - bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted);
> - bytes_deleted *= fs_info->nodesize;
> - ret = btrfs_block_rsv_add(root, _info->trans_block_rsv,
> -   bytes_deleted, BTRFS_RESERVE_NO_FLUSH);
> - if (!ret) {
> - trace_btrfs_space_reservation(fs_info, "transaction",
> -   trans->transid,
> -   bytes_deleted, 1);
> - trans->bytes_reserved += bytes_deleted;
> - }
> - return ret;
> -
> -}
> -
>  /*
>   * Return this if we need to call truncate_block for the last bit of the
>   * truncate.
> @@ -4506,7 +4481,6 @@ int btrfs_truncate_inode_items(struct 
> btrfs_trans_handle *trans,
>   u64 bytes_deleted = 0;
>   bool be_nice = false;
>   bool should_throttle = false;
> - bool should_end = false;
>  
>   BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
>  
> @@ -4719,15 +4693,7 @@ int btrfs_truncate_inode_items(struct 
> btrfs_trans_handle *trans,
>   btrfs_abort_transaction(trans, ret);
>   break;
>   }
> - if (btrfs_should_throttle_delayed_refs(trans))
> - btrfs_async_run_delayed_refs(fs_info,
> - trans->delayed_ref_updates * 2,
> - trans->transid, 0);
>   if (be_nice) {
> - if (truncate_space_check(trans, root,
> -  extent_num_bytes)) {
> - should_end = true;
> - }
>   if (btrfs_should_throttle_delayed_refs(trans))
>   should_throttle = true;
>   }
> @@ -4738,7 +4704,7 @@ int btrfs_truncate_inode_items(struct 
> btrfs_trans_handle *trans,
>  
>   if (path->slots[0] == 0 ||
>   path->slots[0] != pending_del_slot ||
> - should_throttle || should_end) {
> + should_throttle) {
>   if (pending_del_nr) {
>   ret = btrfs_del_items(trans, root, path,
>   pending_del_slot,
> @@ -4750,23 +4716,24 @@ int btrfs_truncate_inode_items(struct 
> btrfs_trans_handle *trans,
>   pending_del_nr = 0;
>   }
>   btrfs_release_path(path);
> - if (should_throttle) {
> - unsigned long updates = 
> trans->delayed_ref_updates;
> - if (updates) {
> - trans->delayed_ref_updates = 0;
> - ret = btrfs_run_delayed_refs(trans,
> -updates * 2);
> - if (ret)
> - break;
> - }
> - }
> +
>   /*
> -  * if we failed to refill our space rsv, bail out
> -  * and let the transaction restart
> +  * We can generate a lot of delayed refs, so we need to
> +   

Re: [PATCH 5/6] btrfs: introduce delayed_refs_rsv

2018-11-26 Thread Nikolay Borisov



On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv.  This works most
> of the time, except when it doesn't.  We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction.  Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we often
> get wrong.
> 
> So instead give them their own block_rsv.  This way we can always know
> exactly how much outstanding space we need for delayed refs.  This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system.  Instead of doing math to see if its a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
> 
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums.  We could have a
> separate rsv for the block group updates, but the csum deletion stuff is
> still handled via the delayed_refs so that will stay there.

Couldn't the same "no premature ENOSPC in the critical section" effect
be achieved if we simply reserved 2 * num_bytes in start_transaction()
without adding the additional granularity for delayed refs? There seems
to be a lot of supporting code added, and it obfuscates the simple idea
that we now do a double reservation and put it in a separate block_rsv
structure.
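
To illustrate, the alternative I'm asking about would be roughly this
in start_transaction(), just a sketch reusing the existing reservation
path there (function and field names as in the current transaction.c):

        num_bytes = 2 * btrfs_calc_trans_metadata_size(fs_info, num_items);
        ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv,
                                  num_bytes, flush);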

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h |  26 ++--
>  fs/btrfs/delayed-ref.c   |  28 -
>  fs/btrfs/disk-io.c   |   4 +
>  fs/btrfs/extent-tree.c   | 279 
> +++
>  fs/btrfs/inode.c |   4 +-
>  fs/btrfs/transaction.c   |  77 ++--
>  include/trace/events/btrfs.h |   2 +
>  7 files changed, 313 insertions(+), 107 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 8b41ec42f405..0c6d589c8ce4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -448,8 +448,9 @@ struct btrfs_space_info {
>  #define  BTRFS_BLOCK_RSV_TRANS   3
>  #define  BTRFS_BLOCK_RSV_CHUNK   4
>  #define  BTRFS_BLOCK_RSV_DELOPS  5
> -#define  BTRFS_BLOCK_RSV_EMPTY   6
> -#define  BTRFS_BLOCK_RSV_TEMP7
> +#define BTRFS_BLOCK_RSV_DELREFS  6
> +#define  BTRFS_BLOCK_RSV_EMPTY   7
> +#define  BTRFS_BLOCK_RSV_TEMP8
>  
>  struct btrfs_block_rsv {
>   u64 size;
> @@ -812,6 +813,8 @@ struct btrfs_fs_info {
>   struct btrfs_block_rsv chunk_block_rsv;
>   /* block reservation for delayed operations */
>   struct btrfs_block_rsv delayed_block_rsv;
> + /* block reservation for delayed refs */
> + struct btrfs_block_rsv delayed_refs_rsv;
>  
>   struct btrfs_block_rsv empty_block_rsv;
>  
> @@ -2628,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct 
> btrfs_fs_info *fs_info,
>  }
>  
>  int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
>  void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
>const u64 start);
>  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
> @@ -2742,10 +2745,12 @@ enum btrfs_reserve_flush_enum {
>  enum btrfs_flush_state {
>   FLUSH_DELAYED_ITEMS_NR  =   1,
>   FLUSH_DELAYED_ITEMS =   2,
> - FLUSH_DELALLOC  =   3,
> - FLUSH_DELALLOC_WAIT =   4,
> - ALLOC_CHUNK =   5,
> - COMMIT_TRANS=   6,
> + FLUSH_DELAYED_REFS_NR   =   3,
> + FLUSH_DELAYED_REFS  =   4,
> + FLUSH_DELALLOC  =   5,
> + FLUSH_DELALLOC_WAIT =   6,
> + ALLOC_CHUNK =   7,
> + COMMIT_TRANS=   8,
>  };
>  
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> @@ -2796,6 +2801,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info 
> *fs_info,
>  void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>struct btrfs_block_rsv *block_rsv,
>u64 num_bytes);
> +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
> +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
> +int btrfs_throttle_delayed_refs(struct btrfs_fs_info *fs_info,
> + enum btrfs_reserve_flush_enum flush);
> +void 

Re: [PATCH v4] Btrfs: fix deadlock with memory reclaim during scrub

2018-11-25 Thread Nikolay Borisov



On 23.11.18 г. 20:25 ч., fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When a transaction commit starts, it attempts to pause scrub and it blocks
> until the scrub is paused. So while the transaction is blocked waiting for
> scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
> otherwise we risk getting into a deadlock with reclaim.
> 
> Checking for scrub pause requests is done early at the beginning of the
> while loop of scrub_stripe() and later in the loop, scrub_extent() and
> scrub_raid56_parity() are called, which in turn call scrub_pages() and
> scrub_pages_for_parity() respectively. These last two functions do memory
> allocations using GFP_KERNEL. Same problem could happen while scrubbing
> the super blocks, since it calls scrub_pages().
> 
> So make sure GFP_NOFS is used for the memory allocations because at any
> time a scrub pause request can happen from another task that started to
> commit a transaction.
> 
> Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
> Signed-off-by: Filipe Manana 

Reviewed-by: Nikolay Borisov 

> ---
> 
> V2: Make using GFP_NOFS unconditionial. Previous version was racy, as pausing
> requests migth happen just after we checked for them.
> 
> V3: Use memalloc_nofs_save() just like V1 did.
> 
> V4: Similar problem happened for raid56, which was previously missed, so
> deal with it as well as the case for scrub_supers().
> 
>  fs/btrfs/scrub.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 3be1456b5116..e08b7502d1f0 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   struct scrub_ctx *sctx;
>   int ret;
>   struct btrfs_device *dev;
> + unsigned int nofs_flag;
>  
>   if (btrfs_fs_closing(fs_info))
>   return -EINVAL;
> @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   atomic_inc(_info->scrubs_running);
>   mutex_unlock(_info->scrub_lock);
>  
> + /*
> +  * In order to avoid deadlock with reclaim when there is a transaction
> +  * trying to pause scrub, make sure we use GFP_NOFS for all the
> +  * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity()
> +  * invoked by our callees. The pausing request is done when the
> +  * transaction commit starts, and it blocks the transaction until scrub
> +  * is paused (done at specific points at scrub_stripe() or right above
> +  * before incrementing fs_info->scrubs_running).
> +  */
> + nofs_flag = memalloc_nofs_save();
>   if (!is_dev_replace) {
>   /*
>* by holding device list mutex, we can
> @@ -3895,6 +3906,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   if (!ret)
>   ret = scrub_enumerate_chunks(sctx, dev, start, end,
>is_dev_replace);
> + memalloc_nofs_restore(nofs_flag);
>  
>   wait_event(sctx->list_wait, atomic_read(>bios_in_flight) == 0);
>   atomic_dec(_info->scrubs_running);
> 


Re: [PATCH v2] Btrfs: fix deadlock with memory reclaim during scrub

2018-11-23 Thread Nikolay Borisov



On 23.11.18 г. 18:05 ч., fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When a transaction commit starts, it attempts to pause scrub and it blocks
> until the scrub is paused. So while the transaction is blocked waiting for
> scrub to pause, we can not do memory allocation with GFP_KERNEL while scrub
> is running, we must use GFP_NOS to avoid deadlock with reclaim. Checking
> for pause requests is done early in the while loop of scrub_stripe(), and
> later in the loop, scrub_extent() is called, which in turns calls
> scrub_pages(), which does memory allocations using GFP_KERNEL. So use
> GFP_NOFS for the memory allocations because at any time a scrub pause
> request can happen from another task that started to commit a transaction.
> 
> Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
> Signed-off-by: Filipe Manana 
> ---
> 
> V2: Make using GFP_NOFS unconditionial. Previous version was racy, as pausing
> requests migth happen just after we checked for them.
> 
>  fs/btrfs/scrub.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 3be1456b5116..0630ea0881bc 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -2205,7 +2205,13 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 
> logical, u64 len,
>   struct scrub_block *sblock;
>   int index;
>  
> - sblock = kzalloc(sizeof(*sblock), GFP_KERNEL);
> + /*
> +  * In order to avoid deadlock with reclaim when there is a transaction
> +  * trying to pause scrub, use GFP_NOFS. The pausing request is done when
> +  * the transaction commit starts, and it blocks the transaction until
> +  * scrub is paused (done at specific points at scrub_stripe()).
> +  */
> + sblock = kzalloc(sizeof(*sblock), GFP_NOFS);

Newer code shouldn't use GFP_NOFS; rather, leave GFP_KERNEL as is and
instead use memalloc_nofs_save()/memalloc_nofs_restore(). For background
information refer to Documentation/core-api/gfp_mask-from-fs-io.rst.
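
I.e. keep the GFP_KERNEL allocations as they are and wrap the section,
a minimal sketch (memalloc_nofs_save() is declared in
<linux/sched/mm.h>):

        unsigned int nofs_flag;

        nofs_flag = memalloc_nofs_save();
        sblock = kzalloc(sizeof(*sblock), GFP_KERNEL);
        /* ... the rest of the allocations on this path ... */
        memalloc_nofs_restore(nofs_flag);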

>   if (!sblock) {
>   spin_lock(>stat_lock);
>   sctx->stat.malloc_errors++;
> @@ -2223,7 +2229,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 
> logical, u64 len,
>   struct scrub_page *spage;
>   u64 l = min_t(u64, len, PAGE_SIZE);
>  
> - spage = kzalloc(sizeof(*spage), GFP_KERNEL);
> + spage = kzalloc(sizeof(*spage), GFP_NOFS);
>   if (!spage) {
>  leave_nomem:
>   spin_lock(>stat_lock);
> @@ -2250,7 +2256,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 
> logical, u64 len,
>   spage->have_csum = 0;
>   }
>   sblock->page_count++;
> - spage->page = alloc_page(GFP_KERNEL);
> + spage->page = alloc_page(GFP_NOFS);
>   if (!spage->page)
>   goto leave_nomem;
>   len -= l;
> 


Re: [PATCH 1/6] btrfs: add btrfs_delete_ref_head helper

2018-11-23 Thread Nikolay Borisov



On 23.11.18 г. 15:45 ч., David Sterba wrote:
> On Thu, Nov 22, 2018 at 11:42:28AM +0200, Nikolay Borisov wrote:
>>
>>
>> On 22.11.18 г. 11:12 ч., Nikolay Borisov wrote:
>>>
>>>
>>> On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
>>>> From: Josef Bacik 
>>>>
>>>> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
>>>> into a helper and cleanup the calling functions.
>>>>
>>>> Signed-off-by: Josef Bacik 
>>>> Reviewed-by: Omar Sandoval 
>>>> ---
>>>>  fs/btrfs/delayed-ref.c | 14 ++
>>>>  fs/btrfs/delayed-ref.h |  3 ++-
>>>>  fs/btrfs/extent-tree.c | 22 +++---
>>>>  3 files changed, 19 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>>>> index 9301b3ad9217..b3e4c9fcb664 100644
>>>> --- a/fs/btrfs/delayed-ref.c
>>>> +++ b/fs/btrfs/delayed-ref.c
>>>> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>>>>return head;
>>>>  }
>>>>  
>>>> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
>>>> + struct btrfs_delayed_ref_head *head)
>>>> +{
>>>> +  lockdep_assert_held(_refs->lock);
>>>> +  lockdep_assert_held(>lock);
>>>> +
>>>> +  rb_erase_cached(>href_node, _refs->href_root);
>>>> +  RB_CLEAR_NODE(>href_node);
>>>> +  atomic_dec(_refs->num_entries);
>>>> +  delayed_refs->num_heads--;
>>>> +  if (head->processing == 0)
>>>> +  delayed_refs->num_heads_ready--;
>>>
>>> num_heads_ready will never execute in cleanup_ref_head, since
>>> processing == 0 only when the ref head is unselected. Perhaps those 2
>>> lines shouldn't be in this function? I find it a bit confusing that if
>>> processing is 0 we decrement num_heads_ready in check_ref_cleanup, but
>>> in unselect_delayed_ref_head we set it to 0 and increment it.
>>>
>>>> +}
>>>> +
>>>>  /*
>>>>   * Helper to insert the ref_node to the tail or merge with tail.
>>>>   *
>>>> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
>>>> index 8e20c5cb5404..d2af974f68a1 100644
>>>> --- a/fs/btrfs/delayed-ref.h
>>>> +++ b/fs/btrfs/delayed-ref.h
>>>> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
>>>> btrfs_delayed_ref_head *head)
>>>>  {
>>>>mutex_unlock(>mutex);
>>>>  }
>>>> -
>>>> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
>>>> + struct btrfs_delayed_ref_head *head);
>>>>  
>>>>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>>>>struct btrfs_delayed_ref_root *delayed_refs);
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index d242a1174e50..c36b3a42f2bb 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct 
>>>> btrfs_trans_handle *trans,
>>>>spin_unlock(_refs->lock);
>>>>return 1;
>>>>}
>>>> -  delayed_refs->num_heads--;
>>>> -  rb_erase_cached(>href_node, _refs->href_root);
>>>> -  RB_CLEAR_NODE(>href_node);
>>>> +  btrfs_delete_ref_head(delayed_refs, head);
>>>>spin_unlock(>lock);
>>>>spin_unlock(_refs->lock);
>>>> -  atomic_dec(_refs->num_entries);
>>>>  
>>>>trace_run_delayed_ref_head(fs_info, head, 0);
>>>>  
>>>> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
>>>> btrfs_trans_handle *trans,
>>>>if (!mutex_trylock(>mutex))
>>>>goto out;
>>>>  
>>>> -  /*
>>>> -   * at this point we have a head with no other entries.  Go
>>>> -   * ahead and process it.
>>>> -   */
>>>> -  rb_erase_cached(>href_node, _refs->href_root);
>>>> -  RB_CLEAR_NODE(>href_node);
>>>> -  atomic_dec(_refs->num_entries);
>>>> -
>>>> -  /*
>>>> -   * we don't take a ref on the node because we're removing it from the
>>>&

Re: [PATCH] btrfs: remove btrfs_bio_end_io_t

2018-11-23 Thread Nikolay Borisov



On 23.11.18 г. 10:42 ч., Johannes Thumshirn wrote:
> The btrfs_bio_end_io_t typedef was introduced with commit
> a1d3c4786a4b ("btrfs: btrfs_multi_bio replaced with btrfs_bio")
> but never used anywhere. This commit also introduced a forward declaration
> of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t.
> 
> Remove both as they're not needed anywhere.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/volumes.h | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 8b092bb1e2ee..c93097b0b469 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -295,9 +295,6 @@ struct btrfs_bio_stripe {
>   u64 length; /* only used for discard mappings */
>  };
>  
> -struct btrfs_bio;
> -typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
> -
>  struct btrfs_bio {
>   refcount_t refs;
>   atomic_t stripes_pending;
> 


Re: [PATCH 4/4] btrfs: replace btrfs_io_bio::end_io with a simple helper

2018-11-22 Thread Nikolay Borisov



On 22.11.18 г. 18:16 ч., David Sterba wrote:
> The end_io callback implemented as btrfs_io_bio_endio_readpage only
> calls kfree. Also the callback is set only in case the csum buffer is
> allocated and not pointing to the inline buffer. We can use that
> information to drop the indirection and call a helper that will free the
> csums only in the right case.
> 
> This shrinks struct btrfs_io_bio by 8 bytes.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent_io.c |  3 +--
>  fs/btrfs/file-item.c |  9 -
>  fs/btrfs/inode.c |  7 ++-
>  fs/btrfs/volumes.h   | 10 --
>  4 files changed, 11 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4ea808d6cfbc..aef3c9866ff0 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2623,8 +2623,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>   if (extent_len)
>   endio_readpage_release_extent(tree, extent_start, extent_len,
> uptodate);
> - if (io_bio->end_io)
> - io_bio->end_io(io_bio, blk_status_to_errno(bio->bi_status));
> + btrfs_io_bio_free_csum(io_bio);
>   bio_put(bio);
>  }
>  
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 1f2d0a6ab634..920bf3b4b0ef 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -142,14 +142,6 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle 
> *trans,
>   return ret;
>  }
>  
> -static void btrfs_io_bio_endio_readpage(struct btrfs_io_bio *bio, int err)
> -{
> - if (bio->csum != bio->csum_inline) {
> - kfree(bio->csum);
> - bio->csum = NULL;
> - }
> -}
> -
>  static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio 
> *bio,
>  u64 logical_offset, u32 *dst, int dio)
>  {
> @@ -184,7 +176,6 @@ static blk_status_t __btrfs_lookup_bio_sums(struct inode 
> *inode, struct bio *bio
>   btrfs_free_path(path);
>   return BLK_STS_RESOURCE;
>   }
> - btrfs_bio->end_io = btrfs_io_bio_endio_readpage;
>   } else {
>   btrfs_bio->csum = btrfs_bio->csum_inline;
>   }
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 26b8bec7c2dc..6bfd37e58924 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8017,9 +8017,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
>  
>   dio_bio->bi_status = err;
>   dio_end_io(dio_bio);
> -
> - if (io_bio->end_io)
> - io_bio->end_io(io_bio, blk_status_to_errno(err));
> + btrfs_io_bio_free_csum(io_bio);
>   bio_put(bio);
>  }
>  
> @@ -8372,8 +8370,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
> struct inode *inode,
>   if (!ret)
>   return;
>  
> - if (io_bio->end_io)
> - io_bio->end_io(io_bio, ret);
> + btrfs_io_bio_free_csum(io_bio);
>  
>  free_ordered:
>   /*
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 9a764f2d462e..a13045fcfc45 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -267,14 +267,12 @@ struct btrfs_fs_devices {
>   * we allocate are actually btrfs_io_bios.  We'll cram as much of
>   * struct btrfs_bio as we can into this over time.
>   */
> -typedef void (btrfs_io_bio_end_io_t) (struct btrfs_io_bio *bio, int err);
>  struct btrfs_io_bio {
>   unsigned int mirror_num;
>   unsigned int stripe_index;
>   u64 logical;
>   u8 *csum;
>   u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> - btrfs_io_bio_end_io_t *end_io;
>   struct bvec_iter iter;
>   /*
>* This member must come last, bio_alloc_bioset will allocate enough
> @@ -288,6 +286,14 @@ static inline struct btrfs_io_bio *btrfs_io_bio(struct 
> bio *bio)
>   return container_of(bio, struct btrfs_io_bio, bio);
>  }
>  
> +static inline void btrfs_io_bio_free_csum(struct btrfs_io_bio *io_bio)
> +{
> + if (io_bio->csum != io_bio->csum_inline) {
> + kfree(io_bio->csum);
> + io_bio->csum = NULL;
> + }
> +}
> +
>  struct btrfs_bio_stripe {
>   struct btrfs_device *dev;
>   u64 physical;
> 


Re: [PATCH 3/4] btrfs: remove redundant csum buffer in btrfs_io_bio

2018-11-22 Thread Nikolay Borisov



On 22.11.18 г. 18:16 ч., David Sterba wrote:
> The io_bio tracks checksums and has an inline buffer or an allocated
> one. And there's a third member that points to the right one, but we
> don't need to use an extra pointer for that. Let btrfs_io_bio::csum
> point to the right buffer and check that the inline buffer is not
> accidentally freed.
> 
> This shrinks struct btrfs_io_bio by 8 bytes.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/file-item.c | 12 +++-
>  fs/btrfs/volumes.h   |  1 -
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index ba74827beb32..1f2d0a6ab634 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -144,7 +144,10 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle 
> *trans,
>  
>  static void btrfs_io_bio_endio_readpage(struct btrfs_io_bio *bio, int err)
>  {
> - kfree(bio->csum_allocated);
> + if (bio->csum != bio->csum_inline) {
> + kfree(bio->csum);
> + bio->csum = NULL;
> + }
>  }
>  
>  static blk_status_t __btrfs_lookup_bio_sums(struct inode *inode, struct bio 
> *bio,
> @@ -175,13 +178,12 @@ static blk_status_t __btrfs_lookup_bio_sums(struct 
> inode *inode, struct bio *bio
>   nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
>   if (!dst) {
>   if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
> - btrfs_bio->csum_allocated = kmalloc_array(nblocks,
> - csum_size, GFP_NOFS);
> - if (!btrfs_bio->csum_allocated) {
> + btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
> + GFP_NOFS);
> + if (!btrfs_bio->csum) {
>   btrfs_free_path(path);
>   return BLK_STS_RESOURCE;
>   }
> - btrfs_bio->csum = btrfs_bio->csum_allocated;
>   btrfs_bio->end_io = btrfs_io_bio_endio_readpage;
>   } else {
>   btrfs_bio->csum = btrfs_bio->csum_inline;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 1d936ce282c3..9a764f2d462e 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -274,7 +274,6 @@ struct btrfs_io_bio {
>   u64 logical;
>   u8 *csum;
>   u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> - u8 *csum_allocated;
>   btrfs_io_bio_end_io_t *end_io;
>   struct bvec_iter iter;
>   /*
> 


Re: [PATCH 2/4] btrfs: replace async_cow::root with fs_info

2018-11-22 Thread Nikolay Borisov



On 22.11.18 г. 18:16 ч., David Sterba wrote:
> The async_cow::root is used to propagate fs_info to async_cow_submit.
> We can't use inode to reach it because it could become NULL after
> write without compression in async_cow_start.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/inode.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a88122b89f50..26b8bec7c2dc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -358,7 +358,7 @@ struct async_extent {
>  
>  struct async_cow {
>   struct inode *inode;
> - struct btrfs_root *root;
> + struct btrfs_fs_info *fs_info;
>   struct page *locked_page;
>   u64 start;
>   u64 end;
> @@ -1144,13 +1144,11 @@ static noinline void async_cow_submit(struct 
> btrfs_work *work)
>  {
>   struct btrfs_fs_info *fs_info;
>   struct async_cow *async_cow;
> - struct btrfs_root *root;
>   unsigned long nr_pages;
>  
>   async_cow = container_of(work, struct async_cow, work);
>  
> - root = async_cow->root;
> - fs_info = root->fs_info;
> + fs_info = async_cow->fs_info;
>   nr_pages = (async_cow->end - async_cow->start + PAGE_SIZE) >>
>   PAGE_SHIFT;
>  
> @@ -1179,7 +1177,6 @@ static int cow_file_range_async(struct inode *inode, 
> struct page *locked_page,
>  {
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   struct async_cow *async_cow;
> - struct btrfs_root *root = BTRFS_I(inode)->root;
>   unsigned long nr_pages;
>   u64 cur_end;
>  
> @@ -1189,7 +1186,7 @@ static int cow_file_range_async(struct inode *inode, 
> struct page *locked_page,
>   async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS);
>   BUG_ON(!async_cow); /* -ENOMEM */
>   async_cow->inode = igrab(inode);
> - async_cow->root = root;
> + async_cow->fs_info = fs_info;
>   async_cow->locked_page = locked_page;
>   async_cow->start = start;
>   async_cow->write_flags = write_flags;
> 


Re: [PATCH 1/4] btrfs: merge btrfs_submit_bio_done to its caller

2018-11-22 Thread Nikolay Borisov



On 22.11.18 г. 18:16 ч., David Sterba wrote:
> There's one caller and its code is simple, we can open code it in
> run_one_async_done. The errors are passed through bio.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/disk-io.c | 18 +-
>  fs/btrfs/inode.c   | 23 ---
>  2 files changed, 17 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index feb67dfd663d..2f5c5442cf00 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -764,11 +764,22 @@ static void run_one_async_start(struct btrfs_work *work)
>   async->status = ret;
>  }
>  
> +/*
> + * In order to insert checksums into the metadata in large chunks, we wait
> + * until bio submission time.   All the pages in the bio are checksummed and
> + * sums are attached onto the ordered extent record.
> + *
> + * At IO completion time the csums attached on the ordered extent record are
> + * inserted into the tree.
> + */
>  static void run_one_async_done(struct btrfs_work *work)
>  {
>   struct async_submit_bio *async;
> + struct inode *inode;
> + blk_status_t ret;
>  
>   async = container_of(work, struct  async_submit_bio, work);
> + inode = async->private_data;
>  
>   /* If an error occurred we just want to clean up the bio and move on */
>   if (async->status) {
> @@ -777,7 +788,12 @@ static void run_one_async_done(struct btrfs_work *work)
>   return;
>   }
>  
> - btrfs_submit_bio_done(async->private_data, async->bio, 
> async->mirror_num);
> + ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio,
> + async->mirror_num, 1);
> + if (ret) {
> + async->bio->bi_status = ret;
> + bio_endio(async->bio);
> + }
>  }
>  
>  static void run_one_async_free(struct btrfs_work *work)
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9becf8543489..a88122b89f50 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1924,29 +1924,6 @@ static blk_status_t btrfs_submit_bio_start(void 
> *private_data, struct bio *bio,
>   return 0;
>  }
>  
> -/*
> - * in order to insert checksums into the metadata in large chunks,
> - * we wait until bio submission time.   All the pages in the bio are
> - * checksummed and sums are attached onto the ordered extent record.
> - *
> - * At IO completion time the cums attached on the ordered extent record
> - * are inserted into the btree
> - */
> -blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio,
> -   int mirror_num)
> -{
> - struct inode *inode = private_data;
> - struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> - blk_status_t ret;
> -
> - ret = btrfs_map_bio(fs_info, bio, mirror_num, 1);
> - if (ret) {
> - bio->bi_status = ret;
> - bio_endio(bio);
> - }
> - return ret;
> -}
> -
>  /*
>   * extent_io.c submission hook. This does the right thing for csum 
> calculation
>   * on write, or reading the csums from the tree before a read.
> 


Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Nikolay Borisov



On 22.11.18 г. 14:31 ч., Tomasz Chmielewski wrote:
> Yet another system upgraded to 4.19 and showing strange issues.
> 
> btrfs-cleaner is showing as ~90-100% busy in iotop:
> 
>    TID  PRIO  USER DISK READ  DISK WRITE  SWAPIN IO>    COMMAND
>   1340 be/4 root    0.00 B/s    0.00 B/s  0.00 % 92.88 %
> [btrfs-cleaner]
> 
> 
> Note disk read, disk write are 0.00 B/s.
> 
> 
> iostat -mx 1 shows all disks are ~100% busy, yet there are 0 reads and 0
> writes to them:
> 
> Device    r/s w/s rMB/s wMB/s   rrqm/s   wrqm/s 
> %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> sda  0.00    0.00  0.00  0.00 0.00 0.00  
> 0.00   0.00    0.00    0.00   0.91 0.00 0.00   0.00  90.80
> sdb  0.00    0.00  0.00  0.00 0.00 0.00  
> 0.00   0.00    0.00    0.00   1.00 0.00 0.00   0.00 100.00
> sdc  0.00    0.00  0.00  0.00 0.00 0.00  
> 0.00   0.00    0.00    0.00   1.00 0.00 0.00   0.00 100.00
> 
> 
> 
> btrfs-cleaner persists 100% busy after reboots. The system is generally
> not very responsive.
> 
> 
> 
> # echo w > /proc/sysrq-trigger
> 
> # dmesg -c
> [  931.585611] sysrq: SysRq : Show Blocked State
> [  931.585715]   task    PC stack   pid father
> [  931.590168] btrfs-cleaner   D    0  1340  2 0x8000
> [  931.590175] Call Trace:
> [  931.590190]  __schedule+0x29e/0x840
> [  931.590195]  schedule+0x2c/0x80
> [  931.590199]  schedule_timeout+0x258/0x360
> [  931.590204]  io_schedule_timeout+0x1e/0x50
> [  931.590208]  wait_for_completion_io+0xb7/0x140
> [  931.590214]  ? wake_up_q+0x80/0x80
> [  931.590219]  submit_bio_wait+0x61/0x90
> [  931.590225]  blkdev_issue_discard+0x7a/0xd0
> [  931.590266]  btrfs_issue_discard+0x123/0x160 [btrfs]
> [  931.590299]  btrfs_discard_extent+0xd8/0x160 [btrfs]
> [  931.590335]  btrfs_finish_extent_commit+0xe2/0x240 [btrfs]
> [  931.590382]  btrfs_commit_transaction+0x573/0x840 [btrfs]
> [  931.590415]  ? btrfs_block_rsv_check+0x25/0x70 [btrfs]
> [  931.590456]  __btrfs_end_transaction+0x2be/0x2d0 [btrfs]
> [  931.590493]  btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
> [  931.590530]  btrfs_drop_snapshot+0x489/0x800 [btrfs]
> [  931.590567]  btrfs_clean_one_deleted_snapshot+0xbb/0xf0 [btrfs]
> [  931.590607]  cleaner_kthread+0x136/0x160 [btrfs]
> [  931.590612]  kthread+0x120/0x140
> [  931.590646]  ? btree_submit_bio_start+0x20/0x20 [btrfs]
> [  931.590658]  ? kthread_bind+0x40/0x40
> [  931.590661]  ret_from_fork+0x22/0x40
> 



It seems your filesystem is mounted with the DISCARD option, meaning
every delete will result in a discard; this is highly suboptimal for SSDs.
Try remounting the fs without the discard option and see if it helps.
Generally for discard you want to submit it in big batches (which is what
fstrim does) so that the FTL on the SSD can apply any optimisations it
might have up its sleeve.
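
For reference, the batching that fstrim relies on is just the FITRIM
ioctl issued once over the whole filesystem. A minimal sketch (error
handling kept short) of what fstrim(8) effectively does:

    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>           /* FITRIM, struct fstrim_range */

    int main(int argc, char **argv)
    {
            struct fstrim_range range;
            int fd;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                    return 1;
            }

            fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            memset(&range, 0, sizeof(range));
            range.len = ULLONG_MAX;         /* trim all free space */

            /* One ioctl, kernel batches the discards internally. */
            if (ioctl(fd, FITRIM, &range) < 0) {
                    perror("FITRIM");
                    return 1;
            }

            printf("trimmed %llu bytes\n", (unsigned long long)range.len);
            return 0;
    }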


Would you finally care to share the SMART data, plus the model and make
of the SSD?

> 
> 
> 
> After rebooting to 4.19.3, I've started seeing read errors. There were
> no errors with a previous kernel (4.17.14); disks are healthy according
> to SMART; no errors reported when we read the whole surface with dd.
> 
> # grep -i btrfs.*corrected /var/log/syslog|wc -l
> 156
> 
> Things like:
> 
> Nov 22 12:17:43 lxd05 kernel: [  711.538836] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083145216 (dev /dev/sdc2 sector
> 101088384)
> Nov 22 12:17:43 lxd05 kernel: [  711.538905] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083149312 (dev /dev/sdc2 sector
> 101088392)
> Nov 22 12:17:43 lxd05 kernel: [  711.538958] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083153408 (dev /dev/sdc2 sector
> 101088400)
> Nov 22 12:17:43 lxd05 kernel: [  711.539006] BTRFS info (device sda2):
> read error corrected: ino 0 off 3739083157504 (dev /dev/sdc2 sector
> 101088408)
> 
> 
> Yet - 0 errors, according to stats, not sure if it's expected or not:

Since btrfs was able to fix the read failure on the fly, it doesn't
record the error.  IMO you have problems with your storage below the
filesystem level.

> 
> # btrfs device stats /data/lxd
> [/dev/sda2].write_io_errs    0
> [/dev/sda2].read_io_errs 0
> [/dev/sda2].flush_io_errs    0
> [/dev/sda2].corruption_errs  0
> [/dev/sda2].generation_errs  0
> [/dev/sdb2].write_io_errs    0
> [/dev/sdb2].read_io_errs 0
> [/dev/sdb2].flush_io_errs    0
> [/dev/sdb2].corruption_errs  0
> [/dev/sdb2].generation_errs  0
> [/dev/sdc2].write_io_errs    0
> [/dev/sdc2].read_io_errs 0
> [/dev/sdc2].flush_io_errs    0
> [/dev/sdc2].corruption_errs  0
> [/dev/sdc2].generation_errs  0
> 
> 




Re: [PATCH 4/6] btrfs: only track ref_heads in delayed_ref_updates

2018-11-22 Thread Nikolay Borisov



On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter what the number of items on it.  Fix the accounting to only be
> adjusted when we add/remove a ref head.

LGTM:

Reviewed-by: Nikolay Borisov 

However, what if we kill delayed_ref_updates, since the name is a bit
ambiguous, and instead migrate num_heads_ready from delayed_refs to trans
and use that? Otherwise, as stated previously, num_heads_ready is
currently unused and could be removed.
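
If we did migrate it, a rough sketch of the accounting (purely
illustrative; these names do not exist in btrfs):

    /* Count ready ref heads on the transaction handle instead of
     * keeping the ambiguous delayed_ref_updates counter. */
    struct trans_sketch {
            unsigned long ref_heads_ready;
    };

    /* Called when a new ref head is inserted into the rbtree. */
    static void sketch_head_added(struct trans_sketch *trans)
    {
            trans->ref_heads_ready++;
    }

    /* Called from the delete-head helper when a head is removed/run. */
    static void sketch_head_removed(struct trans_sketch *trans)
    {
            if (trans->ref_heads_ready)
                    trans->ref_heads_ready--;
    }

    /* Throttling decisions then read the per-transaction count. */
    static int sketch_should_run_refs(const struct trans_sketch *trans,
                                      unsigned long threshold)
    {
            return trans->ref_heads_ready > threshold;
    }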



> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index b3e4c9fcb664..48725fa757a3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct 
> btrfs_trans_handle *trans,
>   ref->in_tree = 0;
>   btrfs_put_delayed_ref(ref);
>   atomic_dec(_refs->num_entries);
> - if (trans->delayed_ref_updates)
> - trans->delayed_ref_updates--;
>  }
>  
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
> *trans,
>   if (ref->action == BTRFS_ADD_DELAYED_REF)
>   list_add_tail(>add_list, >ref_add_list);
>   atomic_inc(>num_entries);
> - trans->delayed_ref_updates++;
>   spin_unlock(>lock);
>   return ret;
>  }
> 


Re: [PATCH 3/6] btrfs: cleanup extent_op handling

2018-11-22 Thread Nikolay Borisov



On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> The cleanup_extent_op function actually would run the extent_op if it
> needed running, which made the name sort of a misnomer.  Change it to
> run_and_cleanup_extent_op, and move the actual cleanup work to
> cleanup_extent_op so it can be used by check_ref_cleanup() in order to
> unify the extent op handling.

The whole name extent_op is actually a misnomer, since AFAIR this is some
sort of modification of the references of metadata nodes. I don't see
why it can't be made yet another type of reference which is run for a
given node.

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 36 +++-
>  1 file changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e3ed3507018d..8a776dc9cb38 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2424,19 +2424,33 @@ static void unselect_delayed_ref_head(struct 
> btrfs_delayed_ref_root *delayed_ref
>   btrfs_delayed_ref_unlock(head);
>  }
>  
> -static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> -  struct btrfs_delayed_ref_head *head)
> +static struct btrfs_delayed_extent_op *
> +cleanup_extent_op(struct btrfs_trans_handle *trans,
> +   struct btrfs_delayed_ref_head *head)
>  {
>   struct btrfs_delayed_extent_op *extent_op = head->extent_op;
> - int ret;
>  
>   if (!extent_op)
> - return 0;
> - head->extent_op = NULL;
> + return NULL;
> +
>   if (head->must_insert_reserved) {
> + head->extent_op = NULL;
>   btrfs_free_delayed_extent_op(extent_op);
> - return 0;
> + return NULL;
>   }
> + return extent_op;
> +}
> +
> +static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
> +  struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_delayed_extent_op *extent_op =
> + cleanup_extent_op(trans, head);
> + int ret;
> +
> + if (!extent_op)
> + return 0;
> + head->extent_op = NULL;
>   spin_unlock(>lock);
>   ret = run_delayed_extent_op(trans, head, extent_op);
>   btrfs_free_delayed_extent_op(extent_op);
> @@ -2488,7 +2502,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>  
>   delayed_refs = >transaction->delayed_refs;
>  
> - ret = cleanup_extent_op(trans, head);
> + ret = run_and_cleanup_extent_op(trans, head);
>   if (ret < 0) {
>   unselect_delayed_ref_head(delayed_refs, head);
>   btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
> @@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!RB_EMPTY_ROOT(>ref_tree.rb_root))
>   goto out;
>  
> - if (head->extent_op) {
> - if (!head->must_insert_reserved)
> - goto out;
> - btrfs_free_delayed_extent_op(head->extent_op);
> - head->extent_op = NULL;
> - }
> + if (cleanup_extent_op(trans, head) != NULL)
> + goto out;
>  
>   /*
>* waiting for the lock here would deadlock.  If someone else has it
> 


Re: [PATCH 1/6] btrfs: add btrfs_delete_ref_head helper

2018-11-22 Thread Nikolay Borisov



On 22.11.18 г. 11:12 ч., Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
>> From: Josef Bacik 
>>
>> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
>> into a helper and cleanup the calling functions.
>>
>> Signed-off-by: Josef Bacik 
>> Reviewed-by: Omar Sandoval 
>> ---
>>  fs/btrfs/delayed-ref.c | 14 ++
>>  fs/btrfs/delayed-ref.h |  3 ++-
>>  fs/btrfs/extent-tree.c | 22 +++---
>>  3 files changed, 19 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index 9301b3ad9217..b3e4c9fcb664 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>>  return head;
>>  }
>>  
>> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
>> +   struct btrfs_delayed_ref_head *head)
>> +{
>> +lockdep_assert_held(_refs->lock);
>> +lockdep_assert_held(>lock);
>> +
>> +rb_erase_cached(>href_node, _refs->href_root);
>> +RB_CLEAR_NODE(>href_node);
>> +atomic_dec(_refs->num_entries);
>> +delayed_refs->num_heads--;
>> +if (head->processing == 0)
>> +delayed_refs->num_heads_ready--;
> 
> The num_heads_ready decrement will never execute in cleanup_ref_head,
> since processing == 0 only when the ref head is unselected. Perhaps
> those 2 lines shouldn't be in this function? I find it a bit confusing
> that if processing is 0 we decrement num_heads_ready in
> check_ref_cleanup, but in unselect_delayed_ref_head we set it to 0 and
> increment it.
> 
>> +}
>> +
>>  /*
>>   * Helper to insert the ref_node to the tail or merge with tail.
>>   *
>> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
>> index 8e20c5cb5404..d2af974f68a1 100644
>> --- a/fs/btrfs/delayed-ref.h
>> +++ b/fs/btrfs/delayed-ref.h
>> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
>> btrfs_delayed_ref_head *head)
>>  {
>>  mutex_unlock(>mutex);
>>  }
>> -
>> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
>> +   struct btrfs_delayed_ref_head *head);
>>  
>>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>>  struct btrfs_delayed_ref_root *delayed_refs);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index d242a1174e50..c36b3a42f2bb 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
>> *trans,
>>  spin_unlock(_refs->lock);
>>  return 1;
>>  }
>> -delayed_refs->num_heads--;
>> -rb_erase_cached(>href_node, _refs->href_root);
>> -RB_CLEAR_NODE(>href_node);
>> +btrfs_delete_ref_head(delayed_refs, head);
>>  spin_unlock(>lock);
>>  spin_unlock(_refs->lock);
>> -atomic_dec(_refs->num_entries);
>>  
>>  trace_run_delayed_ref_head(fs_info, head, 0);
>>  
>> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
>> btrfs_trans_handle *trans,
>>  if (!mutex_trylock(>mutex))
>>  goto out;
>>  
>> -/*
>> - * at this point we have a head with no other entries.  Go
>> - * ahead and process it.
>> - */
>> -rb_erase_cached(>href_node, _refs->href_root);
>> -RB_CLEAR_NODE(>href_node);
>> -atomic_dec(_refs->num_entries);
>> -
>> -/*
>> - * we don't take a ref on the node because we're removing it from the
>> - * tree, so we just steal the ref the tree was holding.
>> - */
>> -delayed_refs->num_heads--;
>> -if (head->processing == 0)
>> -delayed_refs->num_heads_ready--;
>> +btrfs_delete_ref_head(delayed_refs, head);
>>  head->processing = 0;

On closer inspection, I think here we can do:

ASSERT(head->processing == 0) because at that point we've taken the
head->lock spinlock which is held during ordinary delayed refs
processing (in __btrfs_run_delayed_refs) when the head is selected (and
processing is 1). So head->processing == 0 here I think is a hard
invariant of the code. The decrement here should pair with the increment
when the head was initially added to the tree.

In cleanup_ref_head we don't need to ever worry about num_heads_ready
since it has already been decremented by btrfs_select_ref_head.

As a matter of fact, this counter is not used anywhere, so we might as
well just remove it.
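
A minimal sketch of how the tail of check_ref_cleanup could then read,
using the btrfs_delete_ref_head() helper introduced in this patch
(illustrative only, not a final hunk):

	/*
	 * Sketch of the suggestion: head->lock is held here, and
	 * __btrfs_run_delayed_refs holds it while working on a selected
	 * head (processing == 1), so processing must still be 0 and the
	 * head->processing = 0 assignment becomes redundant.
	 */
	ASSERT(head->processing == 0);
	btrfs_delete_ref_head(delayed_refs, head);

	spin_unlock(&head->lock);
	spin_unlock(&delayed_refs->lock);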

> 
> Something is fishy here: before, the code checked for processing == 0
> and then also set it to 0?
> 
>> +
>>  spin_unlock(>lock);
>>  spin_unlock(_refs->lock);
>>  
>>
> 


Re: [PATCH 1/6] btrfs: add btrfs_delete_ref_head helper

2018-11-22 Thread Nikolay Borisov



On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and cleanup the calling functions.
> 
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 
> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 22 +++---
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..b3e4c9fcb664 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   return head;
>  }
>  
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head)
> +{
> + lockdep_assert_held(_refs->lock);
> + lockdep_assert_held(>lock);
> +
> + rb_erase_cached(>href_node, _refs->href_root);
> + RB_CLEAR_NODE(>href_node);
> + atomic_dec(_refs->num_entries);
> + delayed_refs->num_heads--;
> + if (head->processing == 0)
> + delayed_refs->num_heads_ready--;

The num_heads_ready decrement will never execute in cleanup_ref_head,
since processing == 0 only when the ref head is unselected. Perhaps those
2 lines shouldn't be in this function? I find it a bit confusing that if
processing is 0 we decrement num_heads_ready in check_ref_cleanup, but
in unselect_delayed_ref_head we set it to 0 and increment it.

> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 8e20c5cb5404..d2af974f68a1 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
> btrfs_delayed_ref_head *head)
>  {
>   mutex_unlock(>mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head);
>  
>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   struct btrfs_delayed_ref_root *delayed_refs);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d242a1174e50..c36b3a42f2bb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(_refs->lock);
>   return 1;
>   }
> - delayed_refs->num_heads--;
> - rb_erase_cached(>href_node, _refs->href_root);
> - RB_CLEAR_NODE(>href_node);
> + btrfs_delete_ref_head(delayed_refs, head);
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
> - atomic_dec(_refs->num_entries);
>  
>   trace_run_delayed_ref_head(fs_info, head, 0);
>  
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!mutex_trylock(>mutex))
>   goto out;
>  
> - /*
> -  * at this point we have a head with no other entries.  Go
> -  * ahead and process it.
> -  */
> - rb_erase_cached(>href_node, _refs->href_root);
> - RB_CLEAR_NODE(>href_node);
> - atomic_dec(_refs->num_entries);
> -
> - /*
> -  * we don't take a ref on the node because we're removing it from the
> -  * tree, so we just steal the ref the tree was holding.
> -  */
> - delayed_refs->num_heads--;
> - if (head->processing == 0)
> - delayed_refs->num_heads_ready--;
> + btrfs_delete_ref_head(delayed_refs, head);
>   head->processing = 0;

Something is fishy here: before, the code checked for processing == 0
and then also set it to 0?

> +
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
>  
> 


[PATCH] btrfs: Remove extent_io_ops::readpage_io_failed_hook

2018-11-22 Thread Nikolay Borisov
For data inodes this hook does nothing but return -EAGAIN, which is
used to signal to the endio routines that this bio belongs to a data
inode. If this is the case, the actual retrying is handled by
bio_readpage_error. Alternatively, if this bio belongs to the btree
inode, then btree_io_failed_hook just does some cleanup and doesn't
retry anything.

This patch simplifies the code flow by eliminating
readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
end_bio_extent_readpage. Also eliminate some needless checks, since IO is
always performed on either the data inode or the btree inode, both of
which are guaranteed to have their extent_io_tree::ops set.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/disk-io.c   | 16 --
 fs/btrfs/extent_io.c | 69 ++--
 fs/btrfs/extent_io.h |  1 -
 fs/btrfs/inode.c |  7 -
 4 files changed, 34 insertions(+), 59 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index feb67dfd663d..5024d9374163 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -673,19 +673,6 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio 
*io_bio,
return ret;
 }
 
-static int btree_io_failed_hook(struct page *page, int failed_mirror)
-{
-   struct extent_buffer *eb;
-
-   eb = (struct extent_buffer *)page->private;
-   set_bit(EXTENT_BUFFER_READ_ERR, >bflags);
-   eb->read_mirror = failed_mirror;
-   atomic_dec(>io_pages);
-   if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, >bflags))
-   btree_readahead_hook(eb, -EIO);
-   return -EIO;/* we fixed nothing */
-}
-
 static void end_workqueue_bio(struct bio *bio)
 {
struct btrfs_end_io_wq *end_io_wq = bio->bi_private;
@@ -4541,7 +4528,4 @@ static const struct extent_io_ops btree_extent_io_ops = {
/* mandatory callbacks */
.submit_bio_hook = btree_submit_bio_hook,
.readpage_end_io_hook = btree_readpage_end_io_hook,
-   .readpage_io_failed_hook = btree_io_failed_hook,
-
-   /* optional callbacks */
 };
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4ea808d6cfbc..cca3bccd142e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2333,11 +2333,10 @@ struct bio *btrfs_create_repair_bio(struct inode 
*inode, struct bio *failed_bio,
 }
 
 /*
- * this is a generic handler for readpage errors (default
- * readpage_io_failed_hook). if other copies exist, read those and write back
- * good data to the failed position. does not investigate in remapping the
- * failed extent elsewhere, hoping the device will be smart enough to do this 
as
- * needed
+ * This is a generic handler for readpage errors. If other copies exist, read
+ * those and write back good data to the failed position. Does not investigate
+ * in remapping the failed extent elsewhere, hoping the device will be smart
+ * enough to do this as needed
  */
 
 static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
@@ -2501,6 +2500,8 @@ static void end_bio_extent_readpage(struct bio *bio)
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+   bool data_inode = btrfs_ino(BTRFS_I(inode))
+   != BTRFS_BTREE_INODE_OBJECTID;
 
btrfs_debug(fs_info,
"end_bio_extent_readpage: bi_sector=%llu, err=%d, 
mirror=%u",
@@ -2530,7 +2531,7 @@ static void end_bio_extent_readpage(struct bio *bio)
len = bvec->bv_len;
 
mirror = io_bio->mirror_num;
-   if (likely(uptodate && tree->ops)) {
+   if (likely(uptodate)) {
ret = tree->ops->readpage_end_io_hook(io_bio, offset,
  page, start, end,
  mirror);
@@ -2546,38 +2547,36 @@ static void end_bio_extent_readpage(struct bio *bio)
if (likely(uptodate))
goto readpage_ok;
 
-   if (tree->ops) {
-   ret = tree->ops->readpage_io_failed_hook(page, mirror);
-   if (ret == -EAGAIN) {
-   /*
-* Data inode's readpage_io_failed_hook() always
-* returns -EAGAIN.
-*
-* The generic bio_readpage_error handles errors
-* the following way: If possible, new read
-* requests are created and submitted and will
-* end up in end_bio_extent_readpage as well (if
-* we're lucky, not in the !uptodate case). In

[PATCH] btrfs: Remove unnecessary code from __btrfs_balance

2018-11-22 Thread Nikolay Borisov
The first step of the rebalance process is ensuring there is 1MB free on
each device. This number seems rather small. And in fact, when talking
to the original authors their opinions were:

"man that's a little bonkers"
"i don't think we even need that code anymore"
"I think it was there to make sure we had room for the blank 1M at the
beginning. I bet it goes all the way back to v0"
"we just don't need any of that tho, i say we just delete it"

Clearly, this piece of code has lost its original intent throughout
the years. It doesn't really bring any real practical benefits to the
relocation process. No functional changes.

Signed-off-by: Nikolay Borisov 
Suggested-by: Josef Bacik 
---
 fs/btrfs/volumes.c | 53 --
 1 file changed, 53 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e36cbb355df..eb9fa8a6429c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3645,17 +3645,11 @@ static int __btrfs_balance(struct btrfs_fs_info 
*fs_info)
 {
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
struct btrfs_root *chunk_root = fs_info->chunk_root;
-   struct btrfs_root *dev_root = fs_info->dev_root;
-   struct list_head *devices;
-   struct btrfs_device *device;
-   u64 old_size;
-   u64 size_to_free;
u64 chunk_type;
struct btrfs_chunk *chunk;
struct btrfs_path *path = NULL;
struct btrfs_key key;
struct btrfs_key found_key;
-   struct btrfs_trans_handle *trans;
struct extent_buffer *leaf;
int slot;
int ret;
@@ -3670,53 +3664,6 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
u32 count_sys = 0;
int chunk_reserved = 0;
 
-   /* step one make some room on all the devices */
-   devices = _info->fs_devices->devices;
-   list_for_each_entry(device, devices, dev_list) {
-   old_size = btrfs_device_get_total_bytes(device);
-   size_to_free = div_factor(old_size, 1);
-   size_to_free = min_t(u64, size_to_free, SZ_1M);
-   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state) ||
-   btrfs_device_get_total_bytes(device) -
-   btrfs_device_get_bytes_used(device) > size_to_free ||
-   test_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state))
-   continue;
-
-   ret = btrfs_shrink_device(device, old_size - size_to_free);
-   if (ret == -ENOSPC)
-   break;
-   if (ret) {
-   /* btrfs_shrink_device never returns ret > 0 */
-   WARN_ON(ret > 0);
-   goto error;
-   }
-
-   trans = btrfs_start_transaction(dev_root, 0);
-   if (IS_ERR(trans)) {
-   ret = PTR_ERR(trans);
-   btrfs_info_in_rcu(fs_info,
-"resize: unable to start transaction after shrinking device %s 
(error %d), old size %llu, new size %llu",
- rcu_str_deref(device->name), ret,
- old_size, old_size - size_to_free);
-   goto error;
-   }
-
-   ret = btrfs_grow_device(trans, device, old_size);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   /* btrfs_grow_device never returns ret > 0 */
-   WARN_ON(ret > 0);
-   btrfs_info_in_rcu(fs_info,
-"resize: unable to grow device after shrinking device %s 
(error %d), old size %llu, new size %llu",
- rcu_str_deref(device->name), ret,
- old_size, old_size - size_to_free);
-   goto error;
-   }
-
-   btrfs_end_transaction(trans);
-   }
-
-   /* step two, relocate all the chunks */
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
-- 
2.17.1



Re: [PATCH] btrfs: only run delayed refs if we're committing

2018-11-21 Thread Nikolay Borisov



On 21.11.18 г. 21:10 ч., Josef Bacik wrote:
> I noticed in a giant dbench run that we spent a lot of time on lock
> contention while running transaction commit.  This is because dbench
> results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> they all run the delayed refs first thing, so they all contend with
> each other.  This leads to seconds of 0 throughput.  Change this to only
> run the delayed refs if we're the ones committing the transaction.  This
> makes the latency go away and we get no more lock contention.
> 
> Reviewed-by: Omar Sandoval 
> Signed-off-by: Josef Bacik 
Wohooo, we no longer run delayed refs at arbitrary points during
transaction commit.

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/transaction.c | 24 +---
>  1 file changed, 9 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 3c1be9db897c..41cc96cc59a3 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   btrfs_trans_release_metadata(trans);
>   trans->block_rsv = NULL;
>  
> - /* make a pass through all the delayed refs we have so far
> -  * any runnings procs may add more while we are here
> -  */
> - ret = btrfs_run_delayed_refs(trans, 0);
> - if (ret) {
> - btrfs_end_transaction(trans);
> - return ret;
> - }
> -
>   cur_trans = trans->transaction;
>  
>   /*
> @@ -1938,12 +1929,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>  
>   btrfs_create_pending_block_groups(trans);
>  
> - ret = btrfs_run_delayed_refs(trans, 0);
> - if (ret) {
> - btrfs_end_transaction(trans);
> - return ret;
> - }
> -
>   if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) {
>   int run_it = 0;
>  
> @@ -2014,6 +1999,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   spin_unlock(_info->trans_lock);
>   }
>  
> + /*
> +  * We are now the only one in the commit area, we can run delayed refs
> +  * without hitting a bunch of lock contention from a lot of people
> +  * trying to commit the transaction at once.
> +  */
> + ret = btrfs_run_delayed_refs(trans, 0);
> + if (ret)
> + goto cleanup_transaction;
> +
>   extwriter_counter_dec(cur_trans, trans->type);
>  
>   ret = btrfs_start_delalloc_flush(fs_info);
> 


[PATCHv3] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-11-21 Thread Nikolay Borisov
Running btrfs/124 in a loop hung up on me sporadically with the
following call trace:
btrfs   D0  5760   5324 0x
Call Trace:
 ? __schedule+0x243/0x800
 schedule+0x33/0x90
 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
 ? wait_woken+0xa0/0xa0
 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
 btrfs_relocate_chunk+0x49/0x100 [btrfs]
 btrfs_balance+0xbeb/0x1740 [btrfs]
 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
 btrfs_ioctl+0x1691/0x3110 [btrfs]
 ? lockdep_hardirqs_on+0xed/0x180
 ? __handle_mm_fault+0x8e7/0xfb0
 ? _raw_spin_unlock+0x24/0x30
 ? __handle_mm_fault+0x8e7/0xfb0
 ? do_vfs_ioctl+0xa5/0x6e0
 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
 do_vfs_ioctl+0xa5/0x6e0
 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
 ksys_ioctl+0x3a/0x70
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x60/0x1b0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

This happens because during page writeback it's valid for
writepage_delalloc to instantiate a delalloc range which doesn't
belong to the page currently being written back.

The reason this case is valid is that find_lock_delalloc_range
returns any available range after the passed delalloc_start and
ignores whether the page under writeback is within that range.
In turn, ordered extents (OE) are always created for the range returned
by find_lock_delalloc_range. If, however, a failure occurs while the OEs
are being created, then the clean up code in btrfs_cleanup_ordered_extents
will be called.

Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
the case of such a 'foreign' range being processed and instead always
assumes that the range the OEs are created for belongs to the page. This
leads to the first page of such a foreign range not being cleaned up,
since it's deliberately skipped by the current cleaning up code.

Fix this by correctly checking whether the current page belongs to the
range being instantiated and, if so, adjusting the range parameters
passed for cleaning up. If it doesn't, then just clean the whole OE
range directly.

Signed-off-by: Nikolay Borisov 
Reviewed-by: Josef Bacik 
---
V3:
 * Re-worded the commit for easier comprehension
 * Added RB tag from Josef

V2:
 * Fix compilation failure due to missing parentheses
 * Fixed the "Fixes" tag.
 fs/btrfs/inode.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a0fc564626df..9f65f131f244 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -110,17 +110,17 @@ static void __endio_write_update_ordered(struct inode 
*inode,
  * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
  * and EXTENT_DELALLOC simultaneously, because that causes the reserved 
metadata
  * to be released, which we want to happen only when finishing the ordered
- * extent (btrfs_finish_ordered_io()). Also note that the caller of
- * btrfs_run_delalloc_range already does proper cleanup for the first page of
- * the range, that is, it invokes the callback writepage_end_io_hook() for the
- * range of the first page.
+ * extent (btrfs_finish_ordered_io()).
  */
 static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
-const u64 offset,
-const u64 bytes)
+struct page *locked_page,
+u64 offset, u64 bytes)
 {
unsigned long index = offset >> PAGE_SHIFT;
unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
+   u64 page_start = page_offset(locked_page);
+   u64 page_end = page_start + PAGE_SIZE - 1;
+
struct page *page;
 
while (index <= end_index) {
@@ -131,8 +131,18 @@ static inline void btrfs_cleanup_ordered_extents(struct 
inode *inode,
ClearPagePrivate2(page);
put_page(page);
}
-   return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
-   bytes - PAGE_SIZE, false);
+
+   /*
+* In case this page belongs to the delalloc range being instantiated
+* then skip it, since the first page of a range is going to be
+* properly cleaned up by the caller of run_delalloc_range
+*/
+   if (page_start >= offset && page_end <= (offset + bytes - 1)) {
+   offset += PAGE_SIZE;
+   bytes -= PAGE_SIZE;
+   }
+
+   return __endio_write_update_ordered(inode, offset, bytes, false);
 }
 
 static int btrfs_dirty_inode(struct inode *inode);
@@ -1605,7 +1615,8 @@ int btrfs_run_delalloc_range(void *

Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-11-20 Thread Nikolay Borisov



On 20.11.18 г. 21:00 ч., Josef Bacik wrote:
> On Fri, Oct 26, 2018 at 02:41:55PM +0300, Nikolay Borisov wrote:
>> Running btrfs/124 in a loop hung up on me sporadically with the
>> following call trace:
>>  btrfs   D0  5760   5324 0x
>>  Call Trace:
>>   ? __schedule+0x243/0x800
>>   schedule+0x33/0x90
>>   btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>>   ? wait_woken+0xa0/0xa0
>>   btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>>   btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>>   btrfs_relocate_chunk+0x49/0x100 [btrfs]
>>   btrfs_balance+0xbeb/0x1740 [btrfs]
>>   btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>>   btrfs_ioctl+0x1691/0x3110 [btrfs]
>>   ? lockdep_hardirqs_on+0xed/0x180
>>   ? __handle_mm_fault+0x8e7/0xfb0
>>   ? _raw_spin_unlock+0x24/0x30
>>   ? __handle_mm_fault+0x8e7/0xfb0
>>   ? do_vfs_ioctl+0xa5/0x6e0
>>   ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>>   do_vfs_ioctl+0xa5/0x6e0
>>   ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>>   ksys_ioctl+0x3a/0x70
>>   __x64_sys_ioctl+0x16/0x20
>>   do_syscall_64+0x60/0x1b0
>>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Turns out during page writeback it's possible that the code in
>> writepage_delalloc can instantiate a delalloc range which doesn't
>> belong to the page currently being written back. This happens since
>> find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
>> range when asked and doesn't really consider the range of the passed
>> page. When such a foreign range is found the code proceeds to
>> run_delalloc_range and calls the appropriate function to fill the
>> delalloc and create ordered extents. If, however, a failure occurs
>> while this operation is in effect then the clean up code in
>> btrfs_cleanup_ordered_extents will be called. This function has the
>> wrong assumption that caller of run_delalloc_range always properly
>> cleans the first page of the range hence when it calls
>> __endio_write_update_ordered it explicitly omits the first page of
>> the delalloc range. This is wrong because this function could be
>> cleaning a delalloc range that doesn't belong to the current page. This
>> in turn means that the page cleanup code in __extent_writepage will
>> not really free the initial page for the range, leaving a hanging
>> ordered extent with bytes_left set to 4k. This bug has been present
>> ever since the original introduction of the cleanup code in 524272607e88.
>>
>> Fix this by correctly checking whether the current page belongs to the
>> range being instantiated and if so correctly adjust the range parameters
>> passed for cleaning up. If it doesn't, then just clean the whole OE
>> range directly.
>>
>> Signed-off-by: Nikolay Borisov 
>> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid 
>> ordered extent hang")
> 
> Can we just remove the endio cleanup in __extent_writepage() and make this do
> the proper cleanup?  I'm not sure if that is feasible or not, but seems like 
> it
> would make the cleanup stuff less confusing and more straightforward.  If not
> you can add

Quickly skimming the code I think the cleanup in __extent_writepage
could be moved into __extent_writepage_io where we have 2 branches that
set PageError. So I guess it could be done, but I will have to
experiment with it.

> 
> Reviewed-by: Josef Bacik 
> 
> Thanks,
> 
> Josef
> 


Re: [PATCH] btrfs: qgroup: Skip delayed data ref for reloc trees

2018-11-20 Thread Nikolay Borisov



On 20.11.18 г. 11:07 ч., Qu Wenruo wrote:
> 
> 
> On 2018/11/20 下午4:51, Nikolay Borisov wrote:

>> I'm beginning to wonder, should we document
>> btrfs_add_delayed_data_ref/btrfs_add_tree_ref arguments separately for
>> each function, or should only the differences be documented - in this
>> case the newly added root parameter. The rest of the arguments are being
>> documented at init_delayed_ref_common.
> 
> You won't be happy with my later plan, it will add new parameter for
> btrfs_add_delayed_tree_ref(), and it may not be @root, but some bool.

You are right, but I'm starting to think that the interface for adding
those references is wrong, because we shouldn't really need to keep
adding more and more arguments. All of this feels like piling hack on
top of hack for some legacy infrastructure which no one bothers to fix
at a high level.
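
To make that concrete, one possible direction (purely a sketch, not an
existing btrfs interface; all names below are made up) is to bundle the
call's context into a descriptor so new information stops widening every
prototype:

	/* Sketch only: one descriptor that callers fill in once, so a new
	 * piece of context (such as the originating root) doesn't change
	 * every btrfs_add_delayed_*_ref() prototype. */
	struct delayed_ref_desc {
		u64 bytenr;
		u64 num_bytes;
		u64 parent;
		u64 ref_root;		/* tree the extent is referenced from */
		u64 owner;		/* inode objectid for data, level for metadata */
		u64 offset;
		int action;		/* BTRFS_ADD_DELAYED_REF / BTRFS_DROP_DELAYED_REF */
		bool skip_qgroup;	/* e.g. true when the change comes from a reloc tree */
	};

	/* Callers would pass the descriptor instead of a long argument list. */
	int add_delayed_data_ref_sketch(struct btrfs_trans_handle *trans,
					const struct delayed_ref_desc *desc);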


> 
> So I think we may need to document at least the difference.
> 
> Thanks,
> Qu
> 
>>



Re: [PATCH] btrfs: qgroup: Skip delayed data ref for reloc trees

2018-11-20 Thread Nikolay Borisov



On 20.11.18 г. 10:46 ч., Qu Wenruo wrote:
> Currently we only check @ref_root in btrfs_add_delayed_data_ref() to
> determine whether a data delayed ref is for reloc tree.
> 
> Such a check is insufficient, as for relocation we could pass @ref_root
> as the source file tree, causing qgroup to trace unchanged data extents
> even when we're only relocating metadata chunks.
> 
> We could insert a qgroup extent record for the following call trace even
> when we're only relocating a metadata block group:
> 
> do_relocation()
> |- btrfs_cow_block(root=reloc_root)
>|- update_ref_for_cow(root=reloc_root)
>   |- __btrfs_mod_ref(root=reloc_root)
>  |- ref_root = btrfs_header_owner()
>  |- btrfs_add_delayed_data_ref(ref_root=source_root)
> 
> And another case when dropping reloc tree:
> 
> clean_dirty_root()
> |- btrfs_drop_snapshot(root=reloc_root)
>|- walk_up_tree(root=reloc_root)
>   |- walk_up_proc(root=reloc_root)
>  |- btrfs_dec_ref(root=reloc_root)
> |- __btrfs_mod_ref(root=reloc_root)
>|- ref_root = btrfs_header_owner()
>|- btrfs_add_delayed_data_ref(ref_root=source_root)
> 
> This patch will introduce @root parameter for
> btrfs_add_delayed_data_ref(), so that we could determine if this data
> extent belongs to reloc tree or not.
> 
> This could skip data extents which aren't really modified during
> relocation.
> 
> For the same real world 4G data 16 snapshots 4K nodesize metadata
> balance test:
>                      | v4.20-rc1 + delayed* | w/ patch | diff
> --------------------------------------------------------------
> relocated extents    | 22773                | 22656    | -0.1%
> qgroup dirty extents | 122879               | 74316    | -39.5%
> time (real)          | 23.073s              | 14.971s  | -35.1%
> 
> *: v4.20-rc1 + delayed subtree scan patchset
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/delayed-ref.c | 3 ++-
>  fs/btrfs/delayed-ref.h | 1 +
>  fs/btrfs/extent-tree.c | 6 +++---
>  3 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..269bd6ecb8f3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -798,6 +798,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
> *trans,
>   * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
>   */
>  int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
> +struct btrfs_root *root,

I'm beginning to wonder, should we document
btrfs_add_delayed_data_ref/btrfs_add_tree_ref arguments separately for
each function, or should only the differences be documented - in this
case the newly added root parameter. The rest of the arguments are being
documented at init_delayed_ref_common.

>  u64 bytenr, u64 num_bytes,
>  u64 parent, u64 ref_root,
>  u64 owner, u64 offset, u64 reserved, int action,
> @@ -835,7 +836,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle 
> *trans,
>   }
>  
>   if (test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags) &&
> - is_fstree(ref_root)) {
> + is_fstree(ref_root) && is_fstree(root->root_key.objectid)) {
>   record = kmalloc(sizeof(*record), GFP_NOFS);
>   if (!record) {
>   kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 8e20c5cb5404..6c60737e55d6 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -240,6 +240,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
> *trans,
>  struct btrfs_delayed_extent_op *extent_op,
>  int *old_ref_mod, int *new_ref_mod);
>  int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
> +struct btrfs_root *root,
>  u64 bytenr, u64 num_bytes,
>  u64 parent, u64 ref_root,
>  u64 owner, u64 offset, u64 reserved, int action,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a1febf155747..0554d2cc2ea1 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2046,7 +2046,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
> *trans,
>BTRFS_ADD_DELAYED_REF, NULL,
>_ref_mod, _ref_mod);
>   } else {
> - ret = btrfs_add_delayed_data_ref(trans, bytenr,
> + ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
>num_bytes, parent,
>root_objectid, owner, offset,
>0, BTRFS_ADD_DELAYED_REF,
> @@ -7104,7 

[PATCH] btrfs: Add sysfs support for metadata_uuid feature

2018-11-19 Thread Nikolay Borisov
Since metadata_uuid is a new incompat feature it requires the
respective sysfs hooks. This patch adds the 'metadata_uuid' feature to
be shown if it is supported by the kernel. Additionally it adds the
/sys/fs/btrfs/UUID/metadata_uuid attribute which allows one to read
the current metadata_uuid.

Signed-off-by: Nikolay Borisov 
---

I completely forgot sysfs also needs to be hooked so here it is. 

 fs/btrfs/sysfs.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 3717c864ba23..5a5930e3d32b 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -191,6 +191,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(extended_iref, EXTENDED_IREF);
 BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
 BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
+BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
@@ -204,6 +205,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
BTRFS_FEAT_ATTR_PTR(raid56),
BTRFS_FEAT_ATTR_PTR(skinny_metadata),
BTRFS_FEAT_ATTR_PTR(no_holes),
+   BTRFS_FEAT_ATTR_PTR(metadata_uuid),
BTRFS_FEAT_ATTR_PTR(free_space_tree),
NULL
 };
@@ -505,12 +507,24 @@ static ssize_t quota_override_store(struct kobject *kobj,
 
 BTRFS_ATTR_RW(, quota_override, quota_override_show, quota_override_store);
 
+static ssize_t btrfs_metadata_uuid_show(struct kobject *kobj,
+   struct kobj_attribute *a, char *buf)
+{
+   struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+
+   return snprintf(buf, PAGE_SIZE, "%pU\n",
+   fs_info->fs_devices->metadata_uuid);
+}
+
+BTRFS_ATTR(, metadata_uuid, btrfs_metadata_uuid_show);
+
 static const struct attribute *btrfs_attrs[] = {
BTRFS_ATTR_PTR(, label),
BTRFS_ATTR_PTR(, nodesize),
BTRFS_ATTR_PTR(, sectorsize),
BTRFS_ATTR_PTR(, clone_alignment),
BTRFS_ATTR_PTR(, quota_override),
+   BTRFS_ATTR_PTR(, metadata_uuid),
NULL,
 };
 
-- 
2.17.1



Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Nikolay Borisov



On 19.11.18 г. 16:48 ч., Qu Wenruo wrote:
> There may be some qgroup reserved space related problem in such a case,
> but I'm not 100% sure I can foresee such a problem.

But why is this a problem - we always queue a quota rescan following
quota enable; that should take care of proper accounting, no?

> 
> 
> The best way to do this is to commit the transaction first, and set the
> bit before anyone else tries to start a transaction.
> However I can't find such infrastructure now (I still remember we used
> to have a pending bit to change the quota enabled flag, but it was
> removed later)


Re: [PATCH] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Nikolay Borisov



On 19.11.18 г. 11:48 ч., fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> If the quota enable and snapshot creation ioctls are called concurrently
> we can get into a deadlock where the task enabling quotas will deadlock
> on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
> twice. The following time diagram shows how this happens.
> 
>CPU 0CPU 1
> 
>  btrfs_ioctl()
>   btrfs_ioctl_quota_ctl()
>btrfs_quota_enable()
> mutex_lock(fs_info->qgroup_ioctl_lock)
> btrfs_start_transaction()
> 
>  btrfs_ioctl()
>   btrfs_ioctl_snap_create_v2
>create_snapshot()
> --> adds snapshot to the
> list pending_snapshots
> of the current
> transaction
> 
> btrfs_commit_transaction()
>  create_pending_snapshots()
>create_pending_snapshot()
> qgroup_account_snapshot()
>  btrfs_qgroup_inherit()
>  mutex_lock(fs_info->qgroup_ioctl_lock)
>   --> deadlock, mutex already locked
>   by this task at
>   btrfs_quota_enable()
> 
> So fix this by adding a flag to the transaction handle that signals if the
> transaction is being used for enabling quotas (only seen by the task doing
> it) and do not lock the mutex qgroup_ioctl_lock at btrfs_qgroup_inherit()
> if the transaction handle corresponds to the one being used to enable the
> quotas.
> 
> Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounting when creating 
> snapshot")
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/qgroup.c  | 10 --
>  fs/btrfs/transaction.h |  1 +
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index d4917c0cddf5..3aec3bfa3d70 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -908,6 +908,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>   trans = NULL;
>   goto out;
>   }
> + trans->enabling_quotas = true;
>  
>   fs_info->qgroup_ulist = ulist_alloc(GFP_KERNEL);
>   if (!fs_info->qgroup_ulist) {
> @@ -2250,7 +2251,11 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
> *trans, u64 srcid,
>   u32 level_size = 0;
>   u64 nums;
>  
> - mutex_lock(_info->qgroup_ioctl_lock);
> + if (trans->enabling_quotas)
> + lockdep_assert_held(_info->qgroup_ioctl_lock);
> + else
> + mutex_lock(_info->qgroup_ioctl_lock);
> +

nit: That's a bit ugly for my taste, but I don't think the 
alternative is any better: 

ASSERT((trans->enabling_quotas &&
	!lockdep_assert_held(&fs_info->qgroup_ioctl_lock)) || !trans->enabling_quotas)

if (!trans->enabling_quotas)
	mutex_lock(...)


>   if (!test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags))
>   goto out;
>  
> @@ -2413,7 +2418,8 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
> *trans, u64 srcid,
>  unlock:
>   spin_unlock(_info->qgroup_lock);
>  out:
> - mutex_unlock(_info->qgroup_ioctl_lock);
> + if (!trans->enabling_quotas)
> + mutex_unlock(_info->qgroup_ioctl_lock);
>   return ret;
>  }
>  
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 703d5116a2fc..a5553a1dee30 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -122,6 +122,7 @@ struct btrfs_trans_handle {
>   bool reloc_reserved;
>   bool sync;
>   bool dirty;
> + bool enabling_quotas;
>   struct btrfs_root *root;
>   struct btrfs_fs_info *fs_info;
>   struct list_head new_bgs;
> 

