Re: [PATCH] Btrfs: remove redundant btrfs_trans_release_metadata

2018-09-04 Thread Nikolay Borisov



On  5.09.2018 04:14, Liu Bo wrote:
> __btrfs_end_transaction() does the metadata release twice, probably
> because it used to process delayed refs in between, but now that we
> don't process delayed refs any more, the second release is always
> a noop.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/transaction.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index bb1b9f526e98..94b036a74d11 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   return 0;
>   }
>  
> - btrfs_trans_release_metadata(trans);
> - trans->block_rsv = NULL;
> -
> - if (!list_empty(&trans->new_bgs))
> - btrfs_create_pending_block_groups(trans);
> -

The only code which can have any implications for the transaction reserve
is btrfs_create_pending_block_groups, since it does insert items. But
at this point trans->block_rsv is already NULL, and additionally, even if
more reservations are made for this transaction further down, either
btrfs_commit_transaction is called or the transaction kthread will
commit it. So this change really seems inconsequential.

>   trans->delayed_ref_updates = 0;
>   if (!trans->sync) {
>   must_run_delayed_refs =
> 


Re: [PATCH 5/8] btrfs-progs: Wire up delayed refs

2018-09-04 Thread Qu Wenruo



On 2018/9/5 下午1:42, Nikolay Borisov wrote:
> 
> 
> On  5.09.2018 05:10, Qu Wenruo wrote:
>>
>>
>> On 2018/8/16 下午9:10, Nikolay Borisov wrote:
>>> This commit enables the delayed refs infrastructure. This entails doing
>>> the following:
>>>
>>> 1. Replacing existing calls of btrfs_extent_post_op (which is the
>>> equivalent of delayed refs) with the proper btrfs_run_delayed_refs.
>>> As well as eliminating open-coded calls to finish_current_insert and
>>> del_pending_extents which execute the delayed ops.
>>>
>>> 2. Wiring up the addition of delayed refs when freeing extents
>>> (btrfs_free_extent) and when adding new extents (alloc_tree_block).
>>>
>>> 3. Adding calls to btrfs_run_delayed_refs in the transaction commit
>>> path alongside comments why every call is needed, since it's not always
>>> obvious (those call sites were derived empirically by running and
>>> debugging existing tests)
>>>
>>> 4. Correctly flagging the transaction in which we are reinitialising
>>> the extent tree.
>>>
>>> 5. Moving the call to btrfs_write_dirty_block_groups into the transaction
>>> commit path, since block groups should be written to disk after the last
>>> delayed refs have been run.
>>>
>>> Signed-off-by: Nikolay Borisov 
>>> Signed-off-by: David Sterba 
>>
>> Is there something (maybe btrfs_run_delayed_refs()?) missing in btrfs-image?
>>
>> btrfs-image from the devel branch can't restore the image correctly; the
>> block group used bytes are not correct, so it can't pass the misc or fsck
>> tests.
> 
> This is really strange, all fsck/misc tests passed with those patches.
> Can you be more specific about which tests you mean?

One case is fsck/020 with lowmem mode. (Original mode lacks the block
group->used check.)

More specifically, fsck/020/keyed_data_ref_with_shared_leaf.img

Using btrfs-image from my distribution (v4.17.1) and devel branch btrfs
check: (cwd is btrfs-progs, devel branch)

$ btrfs-image -r
tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_with_shared_leaf.img 
~/test.img
$ btrfs check --mode=lowmem ~/test.img
Opening filesystem to check...
Checking filesystem on /home/adam/test.img
UUID: 12dabcf2-d4da-4a70-9701-9f3d48074e73
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs done with fs roots in lowmem mode, skipping
[7/7] checking quota groups skipped (not enabled on this FS)
found 1208320 bytes used, no error found
total csum bytes: 512
total tree bytes: 684032
total fs tree bytes: 638976
total extent tree bytes: 16384
btree space waste bytes: 305606
file data blocks allocated: 93847552
 referenced 1773568

But when using btrfs-image with your delayed refs patches:
$ ./btrfs-image -r
tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_with_shared_leaf.img 
~/test.img

# No matter if I'm using btrfs-check from devel or 4.17.1
$ btrfs check --mode=lowmem ~/test.img
Opening filesystem to check...
Checking filesystem on /home/adam/test.img
UUID: 12dabcf2-d4da-4a70-9701-9f3d48074e73
[1/7] checking root items
[2/7] checking extents
ERROR: block group[4194304 8388608] used 20480 but extent items used 24576
ERROR: block group[20971520 16777216] used 659456 but extent items used
655360
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs done with fs roots in lowmem mode, skipping
[7/7] checking quota groups skipped (not enabled on this FS)
found 1208320 bytes used, error(s) found
total csum bytes: 512
total tree bytes: 684032
total fs tree bytes: 638976
total extent tree bytes: 16384
btree space waste bytes: 305606
file data blocks allocated: 93847552
 referenced 1773568

I'd say that although the lowmem check is still far from perfect, it
indeed has extra checks that original mode lacks, and in this case it
indeed exposes a problem.

Thanks,
Qu


> 
>>
>> Thanks,
>> Qu
>>
>>> ---
>>>  check/main.c  |   3 +-
>>>  extent-tree.c | 166 
>>> ++
>>>  transaction.c |  27 +-
>>>  3 files changed, 112 insertions(+), 84 deletions(-)
>>>
>>> diff --git a/check/main.c b/check/main.c
>>> index bc2ee22f7943..b361cd7e26a0 100644
>>> --- a/check/main.c
>>> +++ b/check/main.c
>>> @@ -8710,7 +8710,7 @@ static int reinit_extent_tree(struct 
>>> btrfs_trans_handle *trans,
>>> fprintf(stderr, "Error adding block group\n");
>>> return ret;
>>> }
>>> -   btrfs_extent_post_op(trans);
>>> +   btrfs_run_delayed_refs(trans, -1);
>>> }
>>>  
>>> ret = reset_balance(trans, fs_info);
>>> @@ -9767,6 +9767,7 @@ int cmd_check(int argc, char **argv)
>>> goto close_out;
>>> }
>>>  
>>> +   trans->reinit_extent_tree = true;
>>> if (init_extent_tree) {
>>> printf("Creating a new 

Re: [PATCH 5/8] btrfs-progs: Wire up delayed refs

2018-09-04 Thread Nikolay Borisov



On  5.09.2018 05:10, Qu Wenruo wrote:
> 
> 
> On 2018/8/16 下午9:10, Nikolay Borisov wrote:
>> This commit enables the delayed refs infrastructure. This entails doing
>> the following:
>>
>> 1. Replacing existing calls of btrfs_extent_post_op (which is the
>> equivalent of delayed refs) with the proper btrfs_run_delayed_refs.
>> As well as eliminating open-coded calls to finish_current_insert and
>> del_pending_extents which execute the delayed ops.
>>
>> 2. Wiring up the addition of delayed refs when freeing extents
>> (btrfs_free_extent) and when adding new extents (alloc_tree_block).
>>
>> 3. Adding calls to btrfs_run_delayed_refs in the transaction commit
>> path alongside comments why every call is needed, since it's not always
>> obvious (those call sites were derived empirically by running and
>> debugging existing tests)
>>
>> 4. Correctly flagging the transaction in which we are reinitialising
>> the extent tree.
>>
>> 5. Moving the call to btrfs_write_dirty_block_groups into the transaction
>> commit path, since block groups should be written to disk after the last
>> delayed refs have been run.
>>
>> Signed-off-by: Nikolay Borisov 
>> Signed-off-by: David Sterba 
> 
> Is there something (maybe btrfs_run_delayed_refs()?) missing in btrfs-image?
> 
> btrfs-image from the devel branch can't restore the image correctly; the
> block group used bytes are not correct, so it can't pass the misc or fsck
> tests.

This is really strange, all fsck/misc tests passed with those patches.
Can you be more specific about which tests you mean?

> 
> Thanks,
> Qu
> 
>> ---
>>  check/main.c  |   3 +-
>>  extent-tree.c | 166 
>> ++
>>  transaction.c |  27 +-
>>  3 files changed, 112 insertions(+), 84 deletions(-)
>>
>> diff --git a/check/main.c b/check/main.c
>> index bc2ee22f7943..b361cd7e26a0 100644
>> --- a/check/main.c
>> +++ b/check/main.c
>> @@ -8710,7 +8710,7 @@ static int reinit_extent_tree(struct 
>> btrfs_trans_handle *trans,
>>  fprintf(stderr, "Error adding block group\n");
>>  return ret;
>>  }
>> -btrfs_extent_post_op(trans);
>> +btrfs_run_delayed_refs(trans, -1);
>>  }
>>  
>>  ret = reset_balance(trans, fs_info);
>> @@ -9767,6 +9767,7 @@ int cmd_check(int argc, char **argv)
>>  goto close_out;
>>  }
>>  
>> +trans->reinit_extent_tree = true;
>>  if (init_extent_tree) {
>>  printf("Creating a new extent tree\n");
>>  ret = reinit_extent_tree(trans, info,
>> diff --git a/extent-tree.c b/extent-tree.c
>> index 7d6c37c6b371..2fa51bbc0359 100644
>> --- a/extent-tree.c
>> +++ b/extent-tree.c
>> @@ -1418,8 +1418,6 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
>> *trans,
>>  err = ret;
>>  out:
>>  btrfs_free_path(path);
>> -finish_current_insert(trans);
>> -del_pending_extents(trans);
>>  BUG_ON(err);
>>  return err;
>>  }
>> @@ -1602,8 +1600,6 @@ int btrfs_set_block_flags(struct btrfs_trans_handle 
>> *trans, u64 bytenr,
>>  btrfs_set_extent_flags(l, item, flags);
>>  out:
>>  btrfs_free_path(path);
>> -finish_current_insert(trans);
>> -del_pending_extents(trans);
>>  return ret;
>>  }
>>  
>> @@ -1701,7 +1697,6 @@ static int write_one_cache_group(struct 
>> btrfs_trans_handle *trans,
>>   struct btrfs_block_group_cache *cache)
>>  {
>>  int ret;
>> -int pending_ret;
>>  struct btrfs_root *extent_root = trans->fs_info->extent_root;
>>  unsigned long bi;
>>  struct extent_buffer *leaf;
>> @@ -1717,12 +1712,8 @@ static int write_one_cache_group(struct 
>> btrfs_trans_handle *trans,
>>  btrfs_mark_buffer_dirty(leaf);
>>  btrfs_release_path(path);
>>  fail:
>> -finish_current_insert(trans);
>> -pending_ret = del_pending_extents(trans);
>>  if (ret)
>>  return ret;
>> -if (pending_ret)
>> -return pending_ret;
>>  return 0;
>>  
>>  }
>> @@ -2049,6 +2040,7 @@ static int finish_current_insert(struct 
>> btrfs_trans_handle *trans)
>>  int skinny_metadata =
>>  btrfs_fs_incompat(extent_root->fs_info, SKINNY_METADATA);
>>  
>> +
>>  while(1) {
>>  ret = find_first_extent_bit(&info->extent_ins, 0, &start,
>>  &end, EXTENT_LOCKED);
>> @@ -2080,6 +2072,8 @@ static int finish_current_insert(struct 
>> btrfs_trans_handle *trans)
>>  BUG_ON(1);
>>  }
>>  
>> +
>> +printf("shouldn't be executed\n");
>>  clear_extent_bits(&info->extent_ins, start, end, EXTENT_LOCKED);
>>  kfree(extent_op);
>>  }
>> @@ -2379,7 +2373,6 @@ static int __free_extent(struct btrfs_trans_handle 
>> *trans,
>>  }
>>  fail:
>>  btrfs_free_path(path);
>> -finish_current_insert(trans);
>>  return ret;
>>  }
>>  
>> @@ -2462,33 +2455,30 @@ 

[PATCH] btrfs: qgroup: Don't trace subtree if we're dropping tree reloc tree

2018-09-04 Thread Qu Wenruo
Tree reloc tree doesn't contribute to qgroup numbers, as we have already
accounted for it at balance time (see replace_path()).

Skipping such an unneeded subtree trace should reduce some performance
overhead.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index de6f75f5547b..4588153f414c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8643,7 +8643,13 @@ static noinline int do_walk_down(struct 
btrfs_trans_handle *trans,
parent = 0;
}
 
-   if (need_account) {
+   /*
+* Tree reloc tree doesn't contribute to qgroup numbers, and
+* we have already accounted them at merge time (replace_path),
+* thus we could skip expensive subtree trace here.
+*/
+   if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
+   need_account) {
ret = btrfs_qgroup_trace_subtree(trans, next,
 generation, level - 1);
if (ret) {
-- 
2.18.0



[PATCH] btrfs: defrag: use btrfs_mod_outstanding_extents in cluster_pages_for_defrag

2018-09-04 Thread Su Yue
Since commit 8b62f87bad9c ("Btrfs: rework outstanding_extents"),
manual operations on outstanding_extents in btrfs_inode are replaced by
btrfs_mod_outstanding_extents().
The one in cluster_pages_for_defrag seems to have been missed, so
replace it with btrfs_mod_outstanding_extents().

Fixes: 8b62f87bad9c ("Btrfs: rework outstanding_extents")
Signed-off-by: Su Yue 
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 63600dc2ac4c..c180ded27092 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1308,7 +1308,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 
if (i_done != page_cnt) {
	spin_lock(&BTRFS_I(inode)->lock);
-   BTRFS_I(inode)->outstanding_extents++;
+   btrfs_mod_outstanding_extents(BTRFS_I(inode), 1);
	spin_unlock(&BTRFS_I(inode)->lock);
btrfs_delalloc_release_space(inode, data_reserved,
start_index << PAGE_SHIFT,
-- 
2.18.0





Re: [PATCH 5/8] btrfs-progs: Wire up delayed refs

2018-09-04 Thread Qu Wenruo



On 2018/8/16 下午9:10, Nikolay Borisov wrote:
> This commit enables the delayed refs infrastructure. This entails doing
> the following:
> 
> 1. Replacing existing calls of btrfs_extent_post_op (which is the
> equivalent of delayed refs) with the proper btrfs_run_delayed_refs.
> As well as eliminating open-coded calls to finish_current_insert and
> del_pending_extents which execute the delayed ops.
> 
> 2. Wiring up the addition of delayed refs when freeing extents
> (btrfs_free_extent) and when adding new extents (alloc_tree_block).
> 
> 3. Adding calls to btrfs_run_delayed_refs in the transaction commit
> path alongside comments why every call is needed, since it's not always
> obvious (those call sites were derived empirically by running and
> debugging existing tests)
> 
> 4. Correctly flagging the transaction in which we are reinitialising
> the extent tree.
> 
> 5. Moving the call to btrfs_write_dirty_block_groups into the transaction
> commit path, since block groups should be written to disk after the last
> delayed refs have been run.
> 
> Signed-off-by: Nikolay Borisov 
> Signed-off-by: David Sterba 

Is there something (maybe btrfs_run_delayed_refs()?) missing in btrfs-image?

btrfs-image from the devel branch can't restore the image correctly; the
block group used bytes are not correct, so it can't pass the misc or fsck
tests.

Thanks,
Qu

> ---
>  check/main.c  |   3 +-
>  extent-tree.c | 166 
> ++
>  transaction.c |  27 +-
>  3 files changed, 112 insertions(+), 84 deletions(-)
> 
> diff --git a/check/main.c b/check/main.c
> index bc2ee22f7943..b361cd7e26a0 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -8710,7 +8710,7 @@ static int reinit_extent_tree(struct btrfs_trans_handle 
> *trans,
>   fprintf(stderr, "Error adding block group\n");
>   return ret;
>   }
> - btrfs_extent_post_op(trans);
> + btrfs_run_delayed_refs(trans, -1);
>   }
>  
>   ret = reset_balance(trans, fs_info);
> @@ -9767,6 +9767,7 @@ int cmd_check(int argc, char **argv)
>   goto close_out;
>   }
>  
> + trans->reinit_extent_tree = true;
>   if (init_extent_tree) {
>   printf("Creating a new extent tree\n");
>   ret = reinit_extent_tree(trans, info,
> diff --git a/extent-tree.c b/extent-tree.c
> index 7d6c37c6b371..2fa51bbc0359 100644
> --- a/extent-tree.c
> +++ b/extent-tree.c
> @@ -1418,8 +1418,6 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
> *trans,
>   err = ret;
>  out:
>   btrfs_free_path(path);
> - finish_current_insert(trans);
> - del_pending_extents(trans);
>   BUG_ON(err);
>   return err;
>  }
> @@ -1602,8 +1600,6 @@ int btrfs_set_block_flags(struct btrfs_trans_handle 
> *trans, u64 bytenr,
>   btrfs_set_extent_flags(l, item, flags);
>  out:
>   btrfs_free_path(path);
> - finish_current_insert(trans);
> - del_pending_extents(trans);
>   return ret;
>  }
>  
> @@ -1701,7 +1697,6 @@ static int write_one_cache_group(struct 
> btrfs_trans_handle *trans,
>struct btrfs_block_group_cache *cache)
>  {
>   int ret;
> - int pending_ret;
>   struct btrfs_root *extent_root = trans->fs_info->extent_root;
>   unsigned long bi;
>   struct extent_buffer *leaf;
> @@ -1717,12 +1712,8 @@ static int write_one_cache_group(struct 
> btrfs_trans_handle *trans,
>   btrfs_mark_buffer_dirty(leaf);
>   btrfs_release_path(path);
>  fail:
> - finish_current_insert(trans);
> - pending_ret = del_pending_extents(trans);
>   if (ret)
>   return ret;
> - if (pending_ret)
> - return pending_ret;
>   return 0;
>  
>  }
> @@ -2049,6 +2040,7 @@ static int finish_current_insert(struct 
> btrfs_trans_handle *trans)
>   int skinny_metadata =
>   btrfs_fs_incompat(extent_root->fs_info, SKINNY_METADATA);
>  
> +
>   while(1) {
>   ret = find_first_extent_bit(&info->extent_ins, 0, &start,
>   &end, EXTENT_LOCKED);
> @@ -2080,6 +2072,8 @@ static int finish_current_insert(struct 
> btrfs_trans_handle *trans)
>   BUG_ON(1);
>   }
>  
> +
> + printf("shouldn't be executed\n");
>   clear_extent_bits(&info->extent_ins, start, end, EXTENT_LOCKED);
>   kfree(extent_op);
>   }
> @@ -2379,7 +2373,6 @@ static int __free_extent(struct btrfs_trans_handle 
> *trans,
>   }
>  fail:
>   btrfs_free_path(path);
> - finish_current_insert(trans);
>   return ret;
>  }
>  
> @@ -2462,33 +2455,30 @@ int btrfs_free_extent(struct btrfs_trans_handle 
> *trans,
> u64 bytenr, u64 num_bytes, u64 parent,
> u64 root_objectid, u64 owner, u64 offset)
>  {
> - struct btrfs_root *extent_root = root->fs_info->extent_root;
> -  

[PATCH v2] Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes

2018-09-04 Thread Liu Bo
Here we're not releasing any space, but transferring bytes from
->bytes_may_use to ->bytes_reserved.

Signed-off-by: Liu Bo 
---
v2: Add missing commit log.

 fs/btrfs/extent-tree.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 41a02cbb5a4a..76ee5ebef2b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6401,10 +6401,6 @@ static int btrfs_add_reserved_bytes(struct 
btrfs_block_group_cache *cache,
} else {
cache->reserved += num_bytes;
space_info->bytes_reserved += num_bytes;
-
-   trace_btrfs_space_reservation(cache->fs_info,
-   "space_info", space_info->flags,
-   ram_bytes, 0);
space_info->bytes_may_use -= ram_bytes;
if (delalloc)
cache->delalloc_bytes += num_bytes;
-- 
1.8.3.1



nbdkit as a flexible alternative to loopback mounts

2018-09-04 Thread Chris Murphy
https://rwmj.wordpress.com/2018/09/04/nbdkit-as-a-flexible-alternative-to-loopback-mounts/

This is a pretty cool writeup. I can vouch Btrfs will format mount,
write to, scrub, and btrfs check works on an 8EiB (virtual) disk.

The one thing I thought might cause a problem is that the nbd device has
a 1KiB sector size, while Btrfs (on x86_64) still uses a 4096-byte
"sector", but it all seems to work fine despite that.

Anyway, maybe it's useful for some fstests instead of file-backed
losetup devices?


-- 
Chris Murphy


Re: [PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap

2018-09-04 Thread Liu Bo
On Thu, Aug 30, 2018 at 10:42 AM, Josef Bacik  wrote:
> From: Josef Bacik 
>
> We need to clear the max_extent_size when we clear bits from a bitmap
> since it could have been from the range that contains the
> max_extent_size.
>

Looks OK.
Reviewed-by: Liu Bo 

thanks,
liubo

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/free-space-cache.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 53521027dd78..7faca05e61ea 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1683,6 +1683,8 @@ static inline void __bitmap_clear_bits(struct 
> btrfs_free_space_ctl *ctl,
> bitmap_clear(info->bitmap, start, count);
>
> info->bytes -= bytes;
> +   if (info->max_extent_size > ctl->unit)
> +   info->max_extent_size = 0;
>  }
>
>  static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
> --
> 2.14.3
>


Re: [PATCH 08/35] btrfs: release metadata before running delayed refs

2018-09-04 Thread Liu Bo
On Thu, Aug 30, 2018 at 10:41 AM, Josef Bacik  wrote:
> We want to release the unused reservation we have since it refills the
> delayed refs reserve, which will make everything go smoother when
> running the delayed refs if we're short on our reservation.
>

Looks good.
Reviewed-by: Liu Bo 

thanks,
liubo

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 99741254e27e..ebb0c0405598 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
> return ret;
> }
>
> +   btrfs_trans_release_metadata(trans);
> +   trans->block_rsv = NULL;
> +
> /* make a pass through all the delayed refs we have so far
>  * any runnings procs may add more while we are here
>  */
> @@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
> return ret;
> }
>
> -   btrfs_trans_release_metadata(trans);
> -   trans->block_rsv = NULL;
> -
> cur_trans = trans->transaction;
>
> /*
> --
> 2.14.3
>


[PATCH] Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes

2018-09-04 Thread Liu Bo
Signed-off-by: Liu Bo 
---
 fs/btrfs/extent-tree.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 41a02cbb5a4a..76ee5ebef2b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6401,10 +6401,6 @@ static int btrfs_add_reserved_bytes(struct 
btrfs_block_group_cache *cache,
} else {
cache->reserved += num_bytes;
space_info->bytes_reserved += num_bytes;
-
-   trace_btrfs_space_reservation(cache->fs_info,
-   "space_info", space_info->flags,
-   ram_bytes, 0);
space_info->bytes_may_use -= ram_bytes;
if (delalloc)
cache->delalloc_bytes += num_bytes;
-- 
1.8.3.1



Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order

2018-09-04 Thread Qu Wenruo


On 2018/9/5 上午4:37, Chris Murphy wrote:
> On Tue, Sep 4, 2018 at 10:22 AM, Etienne Champetier
>  wrote:
> 
>> Do you have a procedure to copy all subvolumes & skip error ? (I have
>> ~200 snapshots)
> 
> If they're already read-only snapshots, then script an iteration of
> btrfs send receive to a new volume.

Doesn't simple "cp -r" work here?
(If the important thing is data, not the subvolume layout).

Thanks,
Qu

> 
> Btrfs seed-sprout would be ideal, however in this case I don't think
> can help because a.) it's temporarily one file system, which could
> mean the corruption is inherited; and b.) I'm not sure it's multiple
> device aware, so either btrfstune -S1 might fail on 2+ device
> Btrfs volumes, or possibly it insists on a two device sprout in order
> to replicate a two device seed.
> 
> If they're not already read-only, it's tricky because it sounds like
> mounting rw is possibly risky, and taking read only snapshots might
> fail anyway. There is no way to make read only snapshots unless the
> volume can be written to; and no way to force a rw subvolume to be
> treated as if it were read only even if the volume is mounted read
> only. And it takes a read only subvolume for send to work.
> 
> 





[PATCH] Btrfs: remove redundant btrfs_trans_release_metadata

2018-09-04 Thread Liu Bo
__btrfs_end_transaction() does the metadata release twice, probably
because it used to process delayed refs in between, but now that we
don't process delayed refs any more, the second release is always
a noop.

Signed-off-by: Liu Bo 
---
 fs/btrfs/transaction.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index bb1b9f526e98..94b036a74d11 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
return 0;
}
 
-   btrfs_trans_release_metadata(trans);
-   trans->block_rsv = NULL;
-
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
-
trans->delayed_ref_updates = 0;
if (!trans->sync) {
must_run_delayed_refs =
-- 
1.8.3.1



Re: [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper

2018-09-04 Thread Liu Bo
On Thu, Aug 30, 2018 at 10:41 AM, Josef Bacik  wrote:
> From: Josef Bacik 
>
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
>



Reviewed-by: Liu Bo 

thanks,
liubo
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 67 
> +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6799950fa057..4c9fd35bca07 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
> *trans,
> return ret ? ret : 1;
>  }
>
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> +   struct btrfs_delayed_ref_head *head)
> +{
> +   struct btrfs_fs_info *fs_info = trans->fs_info;
> +   struct btrfs_delayed_ref_root *delayed_refs =
> +   &trans->transaction->delayed_refs;
> +
> +   if (head->total_ref_mod < 0) {
> +   struct btrfs_space_info *space_info;
> +   u64 flags;
> +
> +   if (head->is_data)
> +   flags = BTRFS_BLOCK_GROUP_DATA;
> +   else if (head->is_system)
> +   flags = BTRFS_BLOCK_GROUP_SYSTEM;
> +   else
> +   flags = BTRFS_BLOCK_GROUP_METADATA;
> +   space_info = __find_space_info(fs_info, flags);
> +   ASSERT(space_info);
> +   percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +  -head->num_bytes,
> +  BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> +   if (head->is_data) {
> +   spin_lock(&delayed_refs->lock);
> +   delayed_refs->pending_csums -= head->num_bytes;
> +   spin_unlock(&delayed_refs->lock);
> +   }
> +   }
> +
> +   /* Also free its reserved qgroup space */
> +   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> + head->qgroup_reserved);
> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
> struct btrfs_delayed_ref_head *head)
>  {
> @@ -2496,31 +2531,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
> spin_unlock(&delayed_refs->lock);
> spin_unlock(&head->lock);
>
> -   trace_run_delayed_ref_head(fs_info, head, 0);
> -
> -   if (head->total_ref_mod < 0) {
> -   struct btrfs_space_info *space_info;
> -   u64 flags;
> -
> -   if (head->is_data)
> -   flags = BTRFS_BLOCK_GROUP_DATA;
> -   else if (head->is_system)
> -   flags = BTRFS_BLOCK_GROUP_SYSTEM;
> -   else
> -   flags = BTRFS_BLOCK_GROUP_METADATA;
> -   space_info = __find_space_info(fs_info, flags);
> -   ASSERT(space_info);
> -   percpu_counter_add_batch(&space_info->total_bytes_pinned,
> -  -head->num_bytes,
> -  BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> -   if (head->is_data) {
> -   spin_lock(&delayed_refs->lock);
> -   delayed_refs->pending_csums -= head->num_bytes;
> -   spin_unlock(&delayed_refs->lock);
> -   }
> -   }
> -
> if (head->must_insert_reserved) {
> btrfs_pin_extent(fs_info, head->bytenr,
>  head->num_bytes, 1);
> @@ -2530,9 +2540,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
> }
> }
>
> -   /* Also free its reserved qgroup space */
> -   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> - head->qgroup_reserved);
> +   cleanup_ref_head_accounting(trans, head);
> +
> +   trace_run_delayed_ref_head(fs_info, head, 0);
> btrfs_delayed_ref_unlock(head);
> btrfs_put_delayed_ref_head(head);
> return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
> if (head->must_insert_reserved)
> ret = 1;
>
> +   cleanup_ref_head_accounting(trans, head);
> mutex_unlock(&head->mutex);
> btrfs_put_delayed_ref_head(head);
> return ret;
> --
> 2.14.3
>


[PATCH v5 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-09-04 Thread Mark Fasheh
Right now we return EINVAL if a process does not have permission to dedupe a
file. This was an oversight on my part. EPERM gives a true description of
the nature of our error, and EINVAL is already used for the case that the
filesystem does not support dedupe.

Signed-off-by: Mark Fasheh 
Reviewed-by: Darrick J. Wong 
Acked-by: David Sterba 
---
 fs/read_write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 71e9077f8bc1..7188982e2733 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2050,7 +2050,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
if (info->reserved) {
info->status = -EINVAL;
} else if (!allow_file_dedupe(dst_file)) {
-   info->status = -EINVAL;
+   info->status = -EPERM;
} else if (file->f_path.mnt != dst_file->f_path.mnt) {
info->status = -EXDEV;
} else if (S_ISDIR(dst->i_mode)) {
-- 
2.15.1



[PATCH v5 1/2] vfs: allow dedupe of user owned read-only files

2018-09-04 Thread Mark Fasheh
The permission check in vfs_dedupe_file_range() is too coarse - We
only allow dedupe of the destination file if the user is root, or
they have the file open for write.

This effectively prevents a non-root user from deduping their own read-only
files. In addition, the write file descriptor that the user is forced to
hold open can prevent execution of files. As file data during a dedupe
does not change, the behavior is unexpected and this has caused a number of
issue reports. For an example, see:

https://github.com/markfasheh/duperemove/issues/129

So change the check so we allow dedupe on the target if:

- the root or admin is asking for it
- the process has write access
- the owner of the file is asking for the dedupe
- the process could get write access

That way users can open read-only and still get dedupe.

Signed-off-by: Mark Fasheh 
Acked-by: Darrick J. Wong 
---
 fs/read_write.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index e83bd9744b5d..71e9077f8bc1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1964,6 +1964,20 @@ int vfs_dedupe_file_range_compare(struct inode *src, 
loff_t srcoff,
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
+/* Check whether we are allowed to dedupe the destination file */
+static bool allow_file_dedupe(struct file *file)
+{
+   if (capable(CAP_SYS_ADMIN))
+   return true;
+   if (file->f_mode & FMODE_WRITE)
+   return true;
+   if (uid_eq(current_fsuid(), file_inode(file)->i_uid))
+   return true;
+   if (!inode_permission(file_inode(file), MAY_WRITE))
+   return true;
+   return false;
+}
+
 int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 {
struct file_dedupe_range_info *info;
@@ -1972,7 +1986,6 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
u64 len;
int i;
int ret;
-   bool is_admin = capable(CAP_SYS_ADMIN);
u16 count = same->dest_count;
struct file *dst_file;
loff_t dst_off;
@@ -2036,7 +2049,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
 
if (info->reserved) {
info->status = -EINVAL;
-   } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+   } else if (!allow_file_dedupe(dst_file)) {
info->status = -EINVAL;
} else if (file->f_path.mnt != dst_file->f_path.mnt) {
info->status = -EXDEV;
-- 
2.15.1



[RESEND][PATCH v5 0/2] vfs: fix dedupe permission check

2018-09-04 Thread Mark Fasheh
Hi Andrew/Al,

Could I please have these patches put in a tree for more public testing?
They've hit fsdevel a few times now, I have links to the discussions in the
change log below.


The following patches fix a couple of issues with the permission check
we do in vfs_dedupe_file_range().

The first patch expands our check to allow dedupe of a file if the
user owns it or otherwise would be allowed to write to it.

Current behavior is that we'll allow dedupe only if:

- the user is an admin (root)
- the user has the file open for write

This makes it impossible for a user to dedupe their own file set
unless they do it as root, or ensure that all files have write
permission. There are a couple of duperemove bugs open for this:

https://github.com/markfasheh/duperemove/issues/129
https://github.com/markfasheh/duperemove/issues/86
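The relaxed check boils down to an ordered test: admin, or file opened for
write, or owner, or write permission on the inode. The sketch below is a
rough userspace analogue of that ordering, for illustration only — the real
check is allow_file_dedupe() in the kernel, and the demo file path is an
assumption:

```shell
# Userspace approximation of the new permission ordering:
# CAP_SYS_ADMIN -> opened for write -> owner -> write permission.
# (The "opened for write" case has no shell analogue and is skipped.)
can_dedupe_target() {
    f="$1"
    [ "$(id -u)" = 0 ] && return 0   # CAP_SYS_ADMIN analogue
    [ -O "$f" ] && return 0          # current fsuid owns the inode
    [ -w "$f" ] && return 0          # would pass a MAY_WRITE permission check
    return 1
}

# A file we own but cannot open read-write is now a valid dedupe target:
touch /tmp/dedupe_demo && chmod 400 /tmp/dedupe_demo
if can_dedupe_target /tmp/dedupe_demo; then
    echo "dedupe target allowed"
fi
```

Pre-patch, the equivalent kernel test stopped after the "opened for write"
case, which is why the 0400 file above would have been rejected for
non-root users.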

The other problem we have is also related to forcing the user to open
target files for write - a process trying to exec a file currently
being deduped gets ETXTBUSY. The answer (as above) is to allow them to
open the targets read-only - root can already do this. There was a patch from
Adam Borowski to fix this back in 2016:

https://lkml.org/lkml/2016/7/17/130

which I have incorporated into my changes.


The 2nd patch fixes our return code for permission denied to be
EPERM. For some reason we're returning EINVAL - I think that's
probably my fault. At any rate, we need to be returning something
descriptive of the actual problem, otherwise callers see EINVAL and
can't really make a valid determination of what's gone wrong.

This has also popped up in duperemove, mostly in the form of cryptic
error messages. Because this is a code returned to userspace, I did
check the other users of extent-same that I could find. Both 'bees'
and 'rust-btrfs' do the same as duperemove and simply report the error
(as they should).


Lastly, I have an update to the fi_deduperange manpage to reflect these
changes. That patch is attached below.


Please apply.

git pull https://github.com/markfasheh/linux dedupe-perms

Thanks,
  --Mark

Changes from V4 to V5:
- Rebase and retest on 4.18-rc8
- Place updated manpage patch below, CC linux-api
- V4 discussion: https://patchwork.kernel.org/patch/10530365/

Changes from V3 to V4:
- Add a patch (below) to ioctl_fideduperange.2 explaining our
  changes. I will send this patch once the kernel update is
  accepted. Thanks to Darrick Wong for this suggestion.
- V3 discussion: https://www.spinics.net/lists/linux-btrfs/msg79135.html

Changes from V2 to V3:
- Return bool from allow_file_dedupe
- V2 discussion: https://www.spinics.net/lists/linux-btrfs/msg78421.html

Changes from V1 to V2:
- Add inode_permission check as suggested by Adam Borowski
- V1 discussion: https://marc.info/?l=linux-xfs=152606684017965=2


From: Mark Fasheh 

[PATCH] ioctl_fideduperange.2: clarify permission requirements

dedupe permission checks were recently relaxed - update our man page to
reflect those changes.

Signed-off-by: Mark Fasheh 
---
 man2/ioctl_fideduperange.2 | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/man2/ioctl_fideduperange.2 b/man2/ioctl_fideduperange.2
index 84d20a276..4040ee064 100644
--- a/man2/ioctl_fideduperange.2
+++ b/man2/ioctl_fideduperange.2
@@ -105,9 +105,12 @@ The field
 must be zero.
 During the call,
 .IR src_fd
-must be open for reading and
+must be open for reading.
 .IR dest_fd
-must be open for writing.
+can be open for writing or reading.
+If
+.IR dest_fd
+is open for reading, the user must have write access to the file.
 The combined size of the struct
 .IR file_dedupe_range
 and the struct
@@ -185,8 +188,8 @@ This can appear if the filesystem does not support deduplicating either file
 descriptor, or if either file descriptor refers to special inodes.
 .TP
 .B EPERM
-.IR dest_fd
-is immutable.
+This will be returned if the user lacks permission to dedupe the file referenced by
+.IR dest_fd .
 .TP
 .B ETXTBSY
 One of the files is a swap file.
-- 
2.15.1



Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order

2018-09-04 Thread Chris Murphy
On Tue, Sep 4, 2018 at 10:22 AM, Etienne Champetier
 wrote:

> Do you have a procedure to copy all subvolumes & skip error ? (I have
> ~200 snapshots)

If they're already read-only snapshots, then script an iteration of
btrfs send receive to a new volume.
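Such an iteration might look like the sketch below. The paths, the mock
snapshot layout, and the DRY_RUN switch are assumptions for illustration;
failures are reported and skipped so one bad snapshot does not stop the run:

```shell
# Sketch: send/receive every snapshot under src_dir to dst_dir,
# continuing past failures. DRY_RUN=1 only prints the commands.
migrate_snapshots() {
    src_dir="$1" dst_dir="$2"
    for snap in "$src_dir"/*/; do
        snap="${snap%/}"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "btrfs send $snap | btrfs receive $dst_dir"
        elif ! btrfs send "$snap" | btrfs receive "$dst_dir"; then
            echo "FAILED: $snap (skipping)" >&2
        fi
    done
}

# Dry run against a mock snapshot layout:
mkdir -p /tmp/mocksnaps/snap-2018-09-01 /tmp/mocksnaps/snap-2018-09-02
DRY_RUN=1 migrate_snapshots /tmp/mocksnaps /mnt/new
```

With ~200 snapshots a real run would also want incremental sends
(btrfs send -p parent child) to avoid resending shared extents, at the
cost of a failed parent breaking its whole chain.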

Btrfs seed-sprout would be ideal; however, in this case I don't think
it can help because a.) it's temporarily one file system, which could
mean the corruption is inherited; and b.) I'm not sure it's multiple
device aware, so either btrfstune -S1 might fail on 2+ device
Btrfs volumes, or possibly it insists on a two device sprout in order
to replicate a two device seed.

If they're not already read-only, it's tricky because it sounds like
mounting rw is possibly risky, and taking read only snapshots might
fail anyway. There is no way to make read only snapshots unless the
volume can be written to; and no way to force a rw subvolume to be
treated as if it were read only even if the volume is mounted read
only. And it takes a read only subvolume for send to work.


-- 
Chris Murphy





Re: [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code

2018-09-04 Thread Nikolay Borisov



On  4.09.2018 20:57, Josef Bacik wrote:
> On Mon, Sep 03, 2018 at 05:19:19PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 30.08.2018 20:42, Josef Bacik wrote:
>>> +   if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
>>> +   flush_state++;
>>
>> This is a bit obscure. So if we allocated a chunk and !commit_cycles
>> just break from the loop? What's the reasoning behind this ?
> 
> I'll add a comment, but it doesn't break the loop, it just goes to 
> COMMIT_TRANS.
> The idea is we don't want to force a chunk allocation if we're experiencing a
> little bit of pressure, because we could end up with a drive full of empty
> metadata chunks.  We want to try committing the transaction first, and then if
> we still have issues we can force a chunk allocation.  Thanks,

I think it would be better if this check were moved up, somewhere before
the if (flush_state > COMMIT_TRANS) check.

> 
> Josef
> 


Re: [PATCH 05/35] btrfs: introduce delayed_refs_rsv

2018-09-04 Thread Josef Bacik
On Tue, Sep 04, 2018 at 06:21:23PM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:41, Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > Traditionally we've had voodoo in btrfs to account for the space that
> > delayed refs may take up by having a global_block_rsv.  This works most
> > of the time, except when it doesn't.  We've had issues reported and seen
> > in production where sometimes the global reserve is exhausted during
> > transaction commit before we can run all of our delayed refs, resulting
> > in an aborted transaction.  Because of this voodoo we have equally
> > dubious flushing semantics around throttling delayed refs which we often
> > get wrong.
> > 
> > So instead give them their own block_rsv.  This way we can always know
> > exactly how much outstanding space we need for delayed refs.  This
> > allows us to make sure we are constantly filling that reservation up
> > with space, and allows us to put more precise pressure on the enospc
> > system.  Instead of doing math to see if its a good time to throttle,
> > the normal enospc code will be invoked if we have a lot of delayed refs
> > pending, and they will be run via the normal flushing mechanism.
> > 
> > For now the delayed_refs_rsv will hold the reservations for the delayed
> > refs, the block group updates, and deleting csums.  We could have a
> > separate rsv for the block group updates, but the csum deletion stuff is
> > still handled via the delayed_refs so that will stay there.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.h |  24 +++-
> >  fs/btrfs/delayed-ref.c   |  28 -
> >  fs/btrfs/disk-io.c   |   3 +
> >  fs/btrfs/extent-tree.c   | 268 +++
> >  fs/btrfs/transaction.c   |  68 +--
> >  include/trace/events/btrfs.h |   2 +
> >  6 files changed, 294 insertions(+), 99 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 66f1d3895bca..0a4e55703d48 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -452,8 +452,9 @@ struct btrfs_space_info {
> >  #defineBTRFS_BLOCK_RSV_TRANS   3
> >  #defineBTRFS_BLOCK_RSV_CHUNK   4
> >  #defineBTRFS_BLOCK_RSV_DELOPS  5
> > -#defineBTRFS_BLOCK_RSV_EMPTY   6
> > -#defineBTRFS_BLOCK_RSV_TEMP7
> > +#define BTRFS_BLOCK_RSV_DELREFS6
> > +#defineBTRFS_BLOCK_RSV_EMPTY   7
> > +#defineBTRFS_BLOCK_RSV_TEMP8
> >  
> >  struct btrfs_block_rsv {
> > u64 size;
> > @@ -794,6 +795,8 @@ struct btrfs_fs_info {
> > struct btrfs_block_rsv chunk_block_rsv;
> > /* block reservation for delayed operations */
> > struct btrfs_block_rsv delayed_block_rsv;
> > +   /* block reservation for delayed refs */
> > +   struct btrfs_block_rsv delayed_refs_rsv;
> >  
> > struct btrfs_block_rsv empty_block_rsv;
> >  
> > @@ -2723,10 +2726,12 @@ enum btrfs_reserve_flush_enum {
> >  enum btrfs_flush_state {
> > FLUSH_DELAYED_ITEMS_NR  =   1,
> > FLUSH_DELAYED_ITEMS =   2,
> > -   FLUSH_DELALLOC  =   3,
> > -   FLUSH_DELALLOC_WAIT =   4,
> > -   ALLOC_CHUNK =   5,
> > -   COMMIT_TRANS=   6,
> > +   FLUSH_DELAYED_REFS_NR   =   3,
> > +   FLUSH_DELAYED_REFS  =   4,
> > +   FLUSH_DELALLOC  =   5,
> > +   FLUSH_DELALLOC_WAIT =   6,
> > +   ALLOC_CHUNK =   7,
> > +   COMMIT_TRANS=   8,
> >  };
> >  
> >  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> > @@ -2777,6 +2782,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
> >  void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
> >  struct btrfs_block_rsv *block_rsv,
> >  u64 num_bytes);
> > +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
> > +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
> > +int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> > + enum btrfs_reserve_flush_enum flush);
> > +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> > +  struct btrfs_block_rsv *src,
> > +  u64 num_bytes);
> >  int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache);
> >  void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache);
> >  void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
> > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> > index 27f7dd4e3d52..96ce087747b2 100644
> > --- a/fs/btrfs/delayed-ref.c
> > +++ b/fs/btrfs/delayed-ref.c
> > @@ -467,11 +467,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
> >   * existing and update must have the same bytenr
> >   */
> >  static noinline void
> > 

Re: [PATCH 21/35] btrfs: only run delayed refs if we're committing

2018-09-04 Thread Omar Sandoval
On Tue, Sep 04, 2018 at 01:54:13PM -0400, Josef Bacik wrote:
> On Fri, Aug 31, 2018 at 05:28:09PM -0700, Omar Sandoval wrote:
> > On Thu, Aug 30, 2018 at 01:42:11PM -0400, Josef Bacik wrote:
> > > I noticed in a giant dbench run that we spent a lot of time on lock
> > > contention while running transaction commit.  This is because dbench
> > > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> > > they all run the delayed refs first thing, so they all contend with
> > > each other.  This leads to seconds of 0 throughput.  Change this to only
> > > run the delayed refs if we're the ones committing the transaction.  This
> > > makes the latency go away and we get no more lock contention.
> > 
> > This means that we're going to spend more time running delayed refs
> > while in TRANS_STATE_COMMIT_START, so couldn't we end up blocking new
> > transactions more than before?
> > 
> 
> You'd think that, but the lock contention is enough that it makes it
> unfuckingpossible for anything to run for several seconds while everybody
> competes for either the delayed refs lock or the extent root lock.
> 
> With the delayed refs rsv we actually end up running the delayed refs often
> enough because of the extra ENOSPC pressure that we don't really end up with
> long chunks of time running delayed refs while blocking out START 
> transactions.
> 
> If at some point down the line this turns out to be an actual issue we can
> revisit the best way to do this.  Off the top of my head we do something like
> wrap it in a "run all the delayed refs" mutex so that all the committers just
> wait on whoever wins, and we move it back outside of the start logic in order to
> make it better all the way around. But I don't think that's something we need
> to do at this point.  Thanks,

Ok, that's good enough for me.

Reviewed-by: Omar Sandoval 


Re: [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code

2018-09-04 Thread Josef Bacik
On Mon, Sep 03, 2018 at 05:19:19PM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > +   if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
> > +   flush_state++;
> 
> This is a bit obscure. So if we allocated a chunk and !commit_cycles
> just break from the loop? What's the reasoning behind this ?

I'll add a comment, but it doesn't break the loop, it just goes to COMMIT_TRANS.
The idea is we don't want to force a chunk allocation if we're experiencing a
little bit of pressure, because we could end up with a drive full of empty
metadata chunks.  We want to try committing the transaction first, and then if
we still have issues we can force a chunk allocation.  Thanks,

Josef


Re: [PATCH 21/35] btrfs: only run delayed refs if we're committing

2018-09-04 Thread Josef Bacik
On Fri, Aug 31, 2018 at 05:28:09PM -0700, Omar Sandoval wrote:
> On Thu, Aug 30, 2018 at 01:42:11PM -0400, Josef Bacik wrote:
> > I noticed in a giant dbench run that we spent a lot of time on lock
> > contention while running transaction commit.  This is because dbench
> > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> > they all run the delayed refs first thing, so they all contend with
> > each other.  This leads to seconds of 0 throughput.  Change this to only
> > run the delayed refs if we're the ones committing the transaction.  This
> > makes the latency go away and we get no more lock contention.
> 
> This means that we're going to spend more time running delayed refs
> while in TRANS_STATE_COMMIT_START, so couldn't we end up blocking new
> transactions more than before?
> 

You'd think that, but the lock contention is enough that it makes it
unfuckingpossible for anything to run for several seconds while everybody
competes for either the delayed refs lock or the extent root lock.

With the delayed refs rsv we actually end up running the delayed refs often
enough because of the extra ENOSPC pressure that we don't really end up with
long chunks of time running delayed refs while blocking out START transactions.

If at some point down the line this turns out to be an actual issue we can
revisit the best way to do this.  Off the top of my head we do something like
wrap it in a "run all the delayed refs" mutex so that all the committers just
wait on whoever wins, and we move it back outside of the start logic in order to
make it better all the way around.  But I don't think that's something we need
to do at this point.  Thanks,

Josef


Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order

2018-09-04 Thread Etienne Champetier
Thanks Qu, one last question I think

On Tue, Sep 4, 2018 at 08:33, Qu Wenruo  wrote:
>
> > On 2018/9/4 at 7:53 PM, Etienne Champetier wrote:
> > Hi Qu,
> >
> > On Mon, Sep 3, 2018 at 20:27, Qu Wenruo  wrote:
> >>
> >> On 2018/9/3 at 10:18 PM, Etienne Champetier wrote:
> >>> Hello btrfs hackers,
> >>>
> >>> I have a computer acting as backup server with BTRFS RAID1, and I
> >>> would like to know the different options to rebuild this RAID
> >>> (I saw this thread
> >>> https://www.spinics.net/lists/linux-btrfs/msg68679.html but there was
> >>> no raid 1)
> >>>
> >>> # uname -a
> >>> Linux servmaison 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00
> >>> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> >>>
> >>> # btrfs --version
> >>> btrfs-progs v4.4
> >>>
> >>> # dmesg
> >>> ...
> >>> [ 1955.581972] BTRFS critical (device sda2): corrupt leaf, bad key
> >>> order: block=6020235362304,root=1, slot=63
> >>> [ 1955.582299] BTRFS critical (device sda2): corrupt leaf, bad key
> >>> order: block=6020235362304,root=1, slot=63
> >
> > Now running a Fedora 28 install kernel
> >
> > # uname -a
> > Linux servmaison 4.16.3-301.fc28.x86_64 #1 SMP Mon Apr 23 21:59:58 UTC
> > 2018 x86_64 x86_64 x86_64 GNU/Linux
> > # btrfs --version
> > btrfs-progs v4.15.1
>
> Unfortunately, even for latest btrfs-progs release (v4.17.1, and even
> devel branch), btrfs check will abort checking if free space cache is
> corrupted.
>
> So we didn't get any useful info from btrfs check.
>
> Such diff would help you continue checking (if you really want, other
> than starting salvaging your data)
> --
> diff --git a/check/main.c b/check/main.c
> index b361cd7e26a0..4f720163221e 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -9885,7 +9885,6 @@ int cmd_check(int argc, char **argv)
> error("errors found in free space tree");
> else
> error("errors found in free space cache");
> -   goto out;
> }
>
> /*
> --
>
>
> For dump tree block, the corrupted tree block belongs to extent tree.
> Which could be a good news (depends on how you define GOOD news).
>
> The corruption is not an easy fix, it's not just a swapped slot.
> The corrupted slot (item 64, whole key objectid is 5946810351616) is way
> beyond the extent data range, thus btrfs-progs can't fix it easily.
>
> Considering how much bytenr difference there is and the generation gap
> (53167 vs current generation 1555950), the bug happens a long long time
> ago (days or weeks before 2016-06-04). So it's a little too late to be
> fixed (unless someone could send me a time machine).
>
> On the other hand, this means any WRITE would easily fail due to
> corrupted extent tree, but your fs should be OK if mounted RO, thus you
> could copy your data out.
>

Do you have a procedure to copy all subvolumes and skip errors? (I have
~200 snapshots)

> >
> >>
> >> Please provide the following dump:
> >>
> >> # btrfs inspect dump-tree -t root /dev/sda2
> >> # btrfs inspect dump-tree -b 6020235362304 /dev/sda2
> >
> > All requested dump are in this repo:
> > https://github.com/champtar/debugraidbtrfs
> >
> [snip]
> >>
> >> If it's the only problem, "btrfs check --repair" indeed could fix it.
> >
> > Also available in https://github.com/champtar/debugraidbtrfs, here
> > "btrfs check --readonly /dev/sda2" output
> > 
> > checking extents
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad key ordering 63 64
> > bad block 6020235362304
> > ERROR: errors found in extent allocation tree or chunk allocation
> > checking free space cache
> > there is no free space entry for 6011561750528-5942842273792
> > there is no free space entry for 6011561750528-6012044050432
> > cache appears valid but isn't 6010970308608
> > there is no free space entry for 6015529828352-5946810351616
> > there is no free space entry for 6015529828352-6016339017728
> > cache appears valid but isn't 6015265275904
> > there is no free space entry for 6139476623360-6070757146624
> > there is no free space entry for 6139476623360-6139852881920
> > cache appears valid but isn't 6138779140096
> > ERROR: errors found in free space cache
> > Checking filesystem on /dev/sda2
> > UUID: 4917db5e-fc20-4369-9556-83082a32d4cd
> > found 1321120776195 bytes used, error(s) found
> > total csum bytes: 0
> > total tree bytes: 1163182080
> > total fs tree bytes: 0
> > total extent tree bytes: 1161740288
> > btree space waste bytes: 290512355
> > file data blocks allocated: 618135552
> >  referenced 618135552
> > 
>
> 

Re: [PATCH 05/35] btrfs: introduce delayed_refs_rsv

2018-09-04 Thread Nikolay Borisov



On 30.08.2018 20:41, Josef Bacik wrote:
> From: Josef Bacik 
> 
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv.  This works most
> of the time, except when it doesn't.  We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction.  Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we often
> get wrong.
> 
> So instead give them their own block_rsv.  This way we can always know
> exactly how much outstanding space we need for delayed refs.  This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system.  Instead of doing math to see if its a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
> 
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums.  We could have a
> separate rsv for the block group updates, but the csum deletion stuff is
> still handled via the delayed_refs so that will stay there.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h |  24 +++-
>  fs/btrfs/delayed-ref.c   |  28 -
>  fs/btrfs/disk-io.c   |   3 +
>  fs/btrfs/extent-tree.c   | 268 +++
>  fs/btrfs/transaction.c   |  68 +--
>  include/trace/events/btrfs.h |   2 +
>  6 files changed, 294 insertions(+), 99 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 66f1d3895bca..0a4e55703d48 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -452,8 +452,9 @@ struct btrfs_space_info {
>  #define  BTRFS_BLOCK_RSV_TRANS   3
>  #define  BTRFS_BLOCK_RSV_CHUNK   4
>  #define  BTRFS_BLOCK_RSV_DELOPS  5
> -#define  BTRFS_BLOCK_RSV_EMPTY   6
> -#define  BTRFS_BLOCK_RSV_TEMP7
> +#define BTRFS_BLOCK_RSV_DELREFS  6
> +#define  BTRFS_BLOCK_RSV_EMPTY   7
> +#define  BTRFS_BLOCK_RSV_TEMP8
>  
>  struct btrfs_block_rsv {
>   u64 size;
> @@ -794,6 +795,8 @@ struct btrfs_fs_info {
>   struct btrfs_block_rsv chunk_block_rsv;
>   /* block reservation for delayed operations */
>   struct btrfs_block_rsv delayed_block_rsv;
> + /* block reservation for delayed refs */
> + struct btrfs_block_rsv delayed_refs_rsv;
>  
>   struct btrfs_block_rsv empty_block_rsv;
>  
> @@ -2723,10 +2726,12 @@ enum btrfs_reserve_flush_enum {
>  enum btrfs_flush_state {
>   FLUSH_DELAYED_ITEMS_NR  =   1,
>   FLUSH_DELAYED_ITEMS =   2,
> - FLUSH_DELALLOC  =   3,
> - FLUSH_DELALLOC_WAIT =   4,
> - ALLOC_CHUNK =   5,
> - COMMIT_TRANS=   6,
> + FLUSH_DELAYED_REFS_NR   =   3,
> + FLUSH_DELAYED_REFS  =   4,
> + FLUSH_DELALLOC  =   5,
> + FLUSH_DELALLOC_WAIT =   6,
> + ALLOC_CHUNK =   7,
> + COMMIT_TRANS=   8,
>  };
>  
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> @@ -2777,6 +2782,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
>  void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>struct btrfs_block_rsv *block_rsv,
>u64 num_bytes);
> +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
> +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
> +int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +   enum btrfs_reserve_flush_enum flush);
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +struct btrfs_block_rsv *src,
> +u64 num_bytes);
>  int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache);
>  void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache);
>  void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 27f7dd4e3d52..96ce087747b2 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -467,11 +467,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>   * existing and update must have the same bytenr
>   */
>  static noinline void
> -update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
> +update_existing_head_ref(struct btrfs_trans_handle *trans,
>struct btrfs_delayed_ref_head *existing,
>struct 

[PATCH] btrfs-progs: calibrate extent_end when found a gap

2018-09-04 Thread Lu Fengqi
extent_end is used to check whether there is a gap between this extent
and the next one. If it is not calibrated, check_file_extent will wrongly
report gaps between the remaining extents.

Signed-off-by: Lu Fengqi 
---
 check/mode-lowmem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 1bce44f5658a..0f14a4968e84 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -1972,6 +1972,7 @@ static int check_file_extent(struct btrfs_root *root, struct btrfs_path *path,
root->objectid, fkey.objectid, fkey.offset,
fkey.objectid, *end);
}
+   *end = fkey.offset;
}
 
*end += extent_num_bytes;
-- 
2.18.0





[PATCH] btrfs-progs: Continue checking even we found something wrong in free space cache

2018-09-04 Thread Qu Wenruo
No need to abort checking; especially for a read-only check, the free
space cache is meaningless, while errors in the fs/extent trees are more
interesting for developers.

So continue checking even if something in the free space cache is wrong.

Reported-by: Etienne Champetier 
Signed-off-by: Qu Wenruo 
---
 check/main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/check/main.c b/check/main.c
index b361cd7e26a0..4f720163221e 100644
--- a/check/main.c
+++ b/check/main.c
@@ -9885,7 +9885,6 @@ int cmd_check(int argc, char **argv)
error("errors found in free space tree");
else
error("errors found in free space cache");
-   goto out;
}
 
/*
-- 
2.18.0



Re: [PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option

2018-09-04 Thread Qu Wenruo



On 2018/8/23 at 3:45 PM, Qu Wenruo wrote:
> 
> 
> On 2018/8/23 at 3:36 PM, Nikolay Borisov wrote:
>>
>>
>> On 23.08.2018 10:31, Qu Wenruo wrote:
>>> Introduce --breadth-first option to do breadth-first tree dump.
>>> This is especially handy to inspect high level trees, e.g. comparing
>>> tree reloc tree with its source tree.
>>
>> Would it make sense, instead of exposing another option, to just have a
>> heuristic check that switches to BFS if the tree is higher than,
>> say, 2 levels?
> 
> BFS has one obvious disadvantage here, so it may not be a good idea to
> use it by default.

Well, this is only true for my implementation.

But there are other solutions to do BFS without that heavy memory usage.

> 
> >>> More memory usage <<<
>    It needs to alloc heap memory, and this can be pretty large for
>    leaves.
>    At level 1, it will need to alloc nr_leaves * sizeof(bfs_entry)
>    memory at least.
>    Compared to DFS, it only needs to iterate at most 8 times, and all of
>    its memory usage is function call stack memory.
> 
> It only makes sense for my niche use case (compare tree reloc tree with
> its source).
> For real-world use cases the default DFS should work fine without all
> the memory allocation burden.

Since we have btrfs_path to show where our parents are, it's possible to
use btrfs_path and avoid the current memory burden.

And in that case, your idea of using BFS by default for trees higher than 2
levels makes complete sense.

I'll update the patchset to use a new (if a little more complex) approach
to implement BFS with less memory usage.

Thank you again for the idea,
Qu

> 
> So I still prefer to keep DFS as default and only provide BFS as a
> niche option for weird guys like me.
> 
> Thanks,
> Qu
> 


Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order

2018-09-04 Thread Qu Wenruo


On 2018/9/4 at 7:53 PM, Etienne Champetier wrote:
> Hi Qu,
> 
> On Mon, Sep 3, 2018 at 20:27, Qu Wenruo  wrote:
>>
>> On 2018/9/3 at 10:18 PM, Etienne Champetier wrote:
> >>> Hello btrfs hackers,
>>>
>>> I have a computer acting as backup server with BTRFS RAID1, and I
>>> would like to know the different options to rebuild this RAID
>>> (I saw this thread
>>> https://www.spinics.net/lists/linux-btrfs/msg68679.html but there was
>>> no raid 1)
>>>
>>> # uname -a
>>> Linux servmaison 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00
>>> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> # btrfs --version
>>> btrfs-progs v4.4
>>>
>>> # dmesg
>>> ...
>>> [ 1955.581972] BTRFS critical (device sda2): corrupt leaf, bad key
>>> order: block=6020235362304,root=1, slot=63
>>> [ 1955.582299] BTRFS critical (device sda2): corrupt leaf, bad key
>>> order: block=6020235362304,root=1, slot=63
> 
> Now running a Fedora 28 install kernel
> 
> # uname -a
> Linux servmaison 4.16.3-301.fc28.x86_64 #1 SMP Mon Apr 23 21:59:58 UTC
> 2018 x86_64 x86_64 x86_64 GNU/Linux
> # btrfs --version
> btrfs-progs v4.15.1

Unfortunately, even with the latest btrfs-progs release (v4.17.1, and even
the devel branch), btrfs check will abort if the free space cache is
corrupted.

So we didn't get any useful info from btrfs check.

Such a diff would let you continue checking (if you really want to,
rather than just starting to salvage your data):
--
diff --git a/check/main.c b/check/main.c
index b361cd7e26a0..4f720163221e 100644
--- a/check/main.c
+++ b/check/main.c
@@ -9885,7 +9885,6 @@ int cmd_check(int argc, char **argv)
error("errors found in free space tree");
else
error("errors found in free space cache");
-   goto out;
}

/*
--


From the dumped tree block, the corrupted tree block belongs to the extent
tree, which could be good news (depending on how you define GOOD news).

The corruption is not an easy fix; it's not just a swapped slot.
The corrupted slot (item 64, whose key objectid is 5946810351616) is way
beyond the extent data range, thus btrfs-progs can't fix it easily.

Considering how large the bytenr difference is and the generation gap
(53167 vs the current generation 1555950), the bug happened a long, long
time ago (days or weeks before 2016-06-04). So it's a little too late to be
fixed (unless someone could send me a time machine).

On the other hand, this means any WRITE would easily fail due to
corrupted extent tree, but your fs should be OK if mounted RO, thus you
could copy your data out.
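
A minimal sketch of that salvage step is below. The device, mount point,
and destination are placeholder assumptions, and DRY_RUN=1 (the default
here) only prints the commands so nothing is touched:

```shell
# Sketch: mount the damaged filesystem read-only and copy the data out.
DEV="${DEV:-/dev/sda2}"
MNT="${MNT:-/mnt/rescue}"
DST="${DST:-/mnt/backup}"
DRY_RUN="${DRY_RUN:-1}"

salvage() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "mount -o ro $DEV $MNT"
        echo "rsync -aHAX $MNT/ $DST/"
    else
        # -o ro keeps the corrupted extent tree from ever being written
        mount -o ro "$DEV" "$MNT" && rsync -aHAX "$MNT"/ "$DST"/
    fi
}
salvage
```

rsync -aHAX preserves hard links, ACLs, and xattrs; reflinks and
shared extents are not preserved, so the copy may need more space than
the source reports as used.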

> 
>>
>> Please provide the following dump:
>>
>> # btrfs inspect dump-tree -t root /dev/sda2
>> # btrfs inspect dump-tree -b 6020235362304 /dev/sda2
> 
> All requested dump are in this repo:
> https://github.com/champtar/debugraidbtrfs
> 
[snip]
>>
>> If it's the only problem, "btrfs check --repair" indeed could fix it.
> 
> Also available in https://github.com/champtar/debugraidbtrfs, here
> "btrfs check --readonly /dev/sda2" output
> 
> checking extents
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad key ordering 63 64
> bad block 6020235362304
> ERROR: errors found in extent allocation tree or chunk allocation
> checking free space cache
> there is no free space entry for 6011561750528-5942842273792
> there is no free space entry for 6011561750528-6012044050432
> cache appears valid but isn't 6010970308608
> there is no free space entry for 6015529828352-5946810351616
> there is no free space entry for 6015529828352-6016339017728
> cache appears valid but isn't 6015265275904
> there is no free space entry for 6139476623360-6070757146624
> there is no free space entry for 6139476623360-6139852881920
> cache appears valid but isn't 6138779140096
> ERROR: errors found in free space cache
> Checking filesystem on /dev/sda2
> UUID: 4917db5e-fc20-4369-9556-83082a32d4cd
> found 1321120776195 bytes used, error(s) found
> total csum bytes: 0
> total tree bytes: 1163182080
> total fs tree bytes: 0
> total extent tree bytes: 1161740288
> btree space waste bytes: 290512355
> file data blocks allocated: 618135552
>  referenced 618135552
> 

As expected, btrfs-progs is unable to fix it.

> 
> Thanks
> Etienne
> 
> P.S: sorry for the initial duplicate email, it took a very long time
> to show up in https://www.spinics.net/lists/linux-btrfs/maillist.html,
> thought it was discarded as I was not subscribed to the list

It's pretty common; I have even sent patches twice for the same reason.

And just another kindly note, for "btrfs check" or "btrfs inspect

Re: fsck lowmem mode only: ERROR: errors found in fs roots

2018-09-04 Thread Christoph Anton Mitterer
On Tue, 2018-09-04 at 17:14 +0800, Qu Wenruo wrote:
> However the backtrace can't tell which process caused such fsync
> call.
> (Maybe LVM user space code?)

Well, it was literally just before btrfs-check exited... so I blindly
guessed... but arguably it could be just some coincidence.

LVM tools are installed, but since I no longer use any PVs/LVs/etc. ...
I'd doubt they'd do anything here.


Cheers,
Chris.



Re: fsck lowmem mode only: ERROR: errors found in fs roots

2018-09-04 Thread Qu Wenruo



On 2018/9/4 上午4:24, Christoph Anton Mitterer wrote:
> Hey.
> 
> 
> On Fri, 2018-08-31 at 10:33 +0800, Su Yue wrote:
>> Can you please fetch btrfs-progs from my repo and run lowmem check
>> in readonly?
>> Repo: https://github.com/Damenly/btrfs-progs/tree/lowmem_debug
>> It's based on v4.17.1 plus additional output for debug only.
> 
> I've adapted your patch to 4.17 from Debian (i.e. not the 4.17.1).
> 
> 
> First I ran it again with the pristine 4.17 from Debian:
> # btrfs check --mode=lowmem /dev/mapper/system ; echo $?
> Checking filesystem on /dev/mapper/system
> UUID: 6050ca10-e778-4d08-80e7-6d27b9c89b3c
> checking extents
> checking free space cache
> checking fs roots
> ERROR: errors found in fs roots
> found 435924422656 bytes used, error(s) found
> total csum bytes: 423418948
> total tree bytes: 2218328064
> total fs tree bytes: 1557168128
> total extent tree bytes: 125894656
> btree space waste bytes: 429599230
> file data blocks allocated: 5193373646848
>  referenced 555255164928
> [ 1248.687628] [ cut here ]
> [ 1248.688352] generic_make_request: Trying to write to read-only 
> block-device dm-0 (partno 0)
> [ 1248.689127] WARNING: CPU: 3 PID: 933 at 
> /build/linux-LgHyGB/linux-4.17.17/block/blk-core.c:2180 
> generic_make_request_checks+0x43d/0x610
> [ 1248.689909] Modules linked in: dm_crypt algif_skcipher af_alg dm_mod 
> snd_hda_codec_hdmi snd_hda_codec_realtek intel_rapl snd_hda_codec_generic 
> x86_pkg_temp_thermal intel_powerclamp i915 iwlwifi btusb coretemp btrtl btbcm 
> uvcvideo kvm_intel snd_hda_intel btintel videobuf2_vmalloc bluetooth 
> snd_hda_codec kvm videobuf2_memops videobuf2_v4l2 videobuf2_common cfg80211 
> snd_hda_core irqbypass videodev jitterentropy_rng drm_kms_helper 
> crct10dif_pclmul snd_hwdep crc32_pclmul drbg ghash_clmulni_intel intel_cstate 
> snd_pcm ansi_cprng ppdev intel_uncore drm media ecdh_generic iTCO_wdt 
> snd_timer iTCO_vendor_support rtsx_pci_ms crc16 snd intel_rapl_perf memstick 
> joydev mei_me rfkill evdev soundcore sg parport_pc pcspkr serio_raw 
> fujitsu_laptop mei i2c_algo_bit parport shpchp sparse_keymap pcc_cpufreq 
> lpc_ich button
> [ 1248.693639]  video battery ac ip_tables x_tables autofs4 btrfs 
> zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov 
> async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c 
> crc32c_generic raid1 raid0 multipath linear md_mod sd_mod uas usb_storage 
> crc32c_intel rtsx_pci_sdmmc mmc_core ahci xhci_pci libahci aesni_intel 
> ehci_pci aes_x86_64 libata crypto_simd xhci_hcd ehci_hcd cryptd glue_helper 
> psmouse i2c_i801 scsi_mod rtsx_pci e1000e usbcore usb_common
> [ 1248.696956] CPU: 3 PID: 933 Comm: btrfs Not tainted 4.17.0-3-amd64 #1 
> Debian 4.17.17-1
> [ 1248.698118] Hardware name: FUJITSU LIFEBOOK E782/FJNB253, BIOS Version 
> 2.11 07/15/2014
> [ 1248.699299] RIP: 0010:generic_make_request_checks+0x43d/0x610
> [ 1248.700495] RSP: 0018:ac89827c7d88 EFLAGS: 00010286
> [ 1248.701702] RAX:  RBX: 98f4848a9200 RCX: 
> 0006
> [ 1248.702930] RDX: 0007 RSI: 0082 RDI: 
> 98f49e2d6730
> [ 1248.704170] RBP: 98f484f6d460 R08: 033e R09: 
> 00aa
> [ 1248.705422] R10: ac89827c7e60 R11:  R12: 
> 
> [ 1248.706675] R13: 0001 R14:  R15: 
> 
> [ 1248.707928] FS:  7f92842018c0() GS:98f49e2c() 
> knlGS:
> [ 1248.709190] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1248.710448] CR2: 55fc6fe1a5b0 CR3: 000407f62001 CR4: 
> 001606e0
> [ 1248.711707] Call Trace:
> [ 1248.712960]  ? do_writepages+0x4b/0xe0
> [ 1248.714201]  ? blkdev_readpages+0x20/0x20
> [ 1248.715441]  ? do_writepages+0x4b/0xe0
> [ 1248.716684]  generic_make_request+0x64/0x400
> [ 1248.717935]  ? finish_wait+0x80/0x80
> [ 1248.719181]  ? mempool_alloc+0x67/0x1a0
> [ 1248.720425]  ? submit_bio+0x6c/0x140
> [ 1248.721663]  submit_bio+0x6c/0x140
> [ 1248.722902]  submit_bio_wait+0x53/0x80
> [ 1248.724139]  blkdev_issue_flush+0x7c/0xb0
> [ 1248.725377]  blkdev_fsync+0x2f/0x40
> [ 1248.726612]  do_fsync+0x38/0x60
> [ 1248.727849]  __x64_sys_fsync+0x10/0x20
> [ 1248.729086]  do_syscall_64+0x55/0x110
> [ 1248.730323]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

I don't really think it's "btrfs check" causing the problem.

Btrfs check, just like all offline tools, uses open_ctree_flags to
determine whether it's allowed to write.
Without OPEN_CTREE_FLAGS_WRITE, all devices are opened RO, thus any
write will just return an error without reaching the disk.
Not to mention such an fsync syscall.
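The gist of that flag check can be sketched as follows (a toy model only, not the actual btrfs-progs code; the flag value and helper name are made up for illustration):

```python
import os
import tempfile

OPEN_CTREE_WRITES = 1 << 1  # hypothetical flag value, for illustration only

def open_device(path, ctree_flags):
    # Open read-write only when the caller explicitly asked for writes;
    # otherwise the fd itself rejects writes before anything hits the disk.
    mode = os.O_RDWR if ctree_flags & OPEN_CTREE_WRITES else os.O_RDONLY
    return os.open(path, mode)

# Demo on a scratch file instead of a block device.
with tempfile.NamedTemporaryFile() as f:
    fd = open_device(f.name, 0)   # opened without the WRITE flag
    try:
        os.write(fd, b"x")        # any write fails at the fd level
        wrote = True
    except OSError:
        wrote = False
    os.close(fd)
```

The write fails with EBADF long before any device I/O, which is why a stray fsync in the trace is unlikely to have come from btrfs check itself.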

However the backtrace can't tell which process caused such fsync call.
(Maybe LVM user space code?)

Thanks,
Qu

> [ 1248.731565] RIP: 0033:0x7f928354d161
> [ 1248.732805] RSP: 002b:7ffd35e3f5d8 EFLAGS: 0246 ORIG_RAX: 
> 004a
> [ 1248.734067] RAX: ffda RBX: 55fc09c0c260 RCX: 
> 

[PATCH v10.5 2/5] btrfs-progs: dedupe: Add enable command for dedupe command group

2018-09-04 Thread Lu Fengqi
From: Qu Wenruo 

Add enable subcommand for dedupe command group.

Signed-off-by: Qu Wenruo 
Signed-off-by: Lu Fengqi 
---
 Documentation/btrfs-dedupe-inband.asciidoc | 114 +-
 btrfs-completion   |   6 +-
 cmds-dedupe-ib.c   | 238 +
 ioctl.h|   2 +
 4 files changed, 358 insertions(+), 2 deletions(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc 
b/Documentation/btrfs-dedupe-inband.asciidoc
index 83113f5487e2..d895aafbcf45 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -22,7 +22,119 @@ use with caution.
 
 SUBCOMMAND
 --
-Nothing yet
+*enable* [options] <path>::
+Enable in-band de-duplication for a filesystem.
++
+`Options`
++
-f|--force
Force 'enable' command to be executed.
Will skip memory limit check and allow 'enable' to be executed even if in-band
de-duplication is already enabled.
++
+NOTE: If dedupe is re-enabled with the '-f' option, any unspecified parameter
+will be reset to its default value.
+
-s|--storage-backend <backend>
Specify de-duplication hash storage backend.
Only the 'inmemory' backend is supported yet.
If not specified, default value is 'inmemory'.
++
+Refer to *BACKENDS* section for more information.
+
-b|--blocksize <blocksize>
Specify dedupe block size.
Supported values are powers of 2 from '16K' to '8M'.
Default value is '128K'.
++
+Refer to *BLOCKSIZE* section for more information.
+
-a|--hash-algorithm <algorithm>
+Specify hash algorithm.
+Only 'sha256' is supported yet.
+
-l|--limit-hash <limit>
+Specify maximum number of hashes stored in memory.
+Only works for 'inmemory' backend.
+Conflicts with '-m' option.
++
+Only positive values are valid.
+Default value is '32K'.
+
-m|--limit-memory <limit>
+Specify maximum memory used for hashes.
+Only works for 'inmemory' backend.
+Conflicts with '-l' option.
++
+Only value larger than or equal to '1024' is valid.
+No default value.
++
+NOTE: Memory limit will be rounded down to kernel internal hash size,
+so the memory limit shown in 'btrfs dedupe-inband status' may be different
+from the value specified.
+
+WARNING: Too large value for '-l' or '-m' will easily trigger OOM.
+Please use with caution according to system memory.
+
+NOTE: In-band de-duplication is not compatible with compression yet.
+Compression has higher priority than in-band de-duplication, meaning that if
+compression and de-duplication are enabled at the same time, only compression
+will work.
+
+BACKENDS
+
+Btrfs in-band de-duplication will support different storage backends, with
+different use case and features.
+
+In-memory backend::
+This backend provides backward-compatibility, and more fine-tuning options.
+But hash pool is non-persistent and may exhaust kernel memory if not setup
+properly.
++
+This backend can be used on old btrfs(without '-O dedupe' mkfs option).
+When used on old btrfs, this backend needs to be enabled manually after mount.
++
+Designed for fast hash search speed, the in-memory backend will keep all dedupe
+hashes in memory. (Although overall performance is still much the same as the
+'ondisk' backend if all 'ondisk' hashes can be cached in memory.)
++
+It only keeps a limited number of hashes in memory to avoid exhausting memory.
+Hashes over the limit will be dropped following least-recently-used behavior.
+So this backend has a consistent overhead for a given limit but can\'t ensure
+all duplicated blocks will be de-duplicated.
++
+After umount and mount, the in-memory backend needs to refill its hash pool.
+
+On-disk backend::
+This backend provides a persistent hash pool, with smarter memory management
+of the hash pool.
+But it\'s not backward-compatible, meaning it must be used with '-O dedupe' 
mkfs
+option and older kernel can\'t mount it read-write.
++
+Designed for de-duplication rate, the hash pool is stored as a btrfs B+ tree on disk.
+This behavior may cause extra disk IO for hash search under high memory
+pressure.
++
+After umount and mount, the on-disk backend still has its hashes on disk, so
+there is no need to refill its dedupe hash pool.
+
+Currently, only 'inmemory' backend is supported in btrfs-progs.
+
+DEDUPE BLOCK SIZE
+
+In-band de-duplication is done at dedupe block size.
+Any data smaller than dedupe block size won\'t go through in-band
+de-duplication.
+
+And dedupe block size affects dedupe rate and fragmentation heavily.
+
+Smaller block size will cause more fragments, but higher dedupe rate.
+
+Larger block size will cause less fragments, but lower dedupe rate.
+
+In-band de-duplication rate is highly related to the workload pattern.
+So it\'s highly recommended to align dedupe block size to the workload
+block size to make full use of de-duplication.
 
 EXIT STATUS
 ---
diff --git a/btrfs-completion b/btrfs-completion
index ae683f4ecf61..cfdf70966e47 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -29,7 +29,7 @@ _btrfs()
 
local cmd=${words[1]}
 
-   
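The LRU-limited in-memory hash pool described in the BACKENDS text above can be sketched as follows (a toy model only, not the kernel code; the class and method names are made up for illustration):

```python
from collections import OrderedDict

class InmemHashPool:
    """Toy model of the 'inmemory' dedupe backend: a capped hash -> bytenr
    map that evicts the least-recently-used entry when over the limit."""

    def __init__(self, limit_nr):
        self.limit_nr = limit_nr
        self.pool = OrderedDict()  # hash digest -> block number

    def add(self, digest, bytenr):
        self.pool[digest] = bytenr
        self.pool.move_to_end(digest)           # mark as most recently used
        while len(self.pool) > self.limit_nr:   # drop oldest beyond limit
            self.pool.popitem(last=False)

    def search(self, digest):
        bytenr = self.pool.get(digest)
        if bytenr is not None:
            self.pool.move_to_end(digest)       # a hit refreshes LRU order
        return bytenr

pool = InmemHashPool(limit_nr=2)
pool.add(b"h1", 4096)
pool.add(b"h2", 8192)
pool.search(b"h1")        # refresh h1, so h2 becomes the LRU victim
pool.add(b"h3", 12288)    # evicts h2, the least recently used
```

This is why the backend has a consistent overhead for a given limit but cannot guarantee that every duplicated block is found: a hash evicted before its duplicate is written is simply a missed dedupe opportunity.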

[PATCH v10.5 5/5] btrfs-progs: dedupe: introduce reconfigure subcommand

2018-09-04 Thread Lu Fengqi
From: Qu Wenruo 

Introduce reconfigure subcommand to co-operate with new kernel ioctl
modification.

Signed-off-by: Qu Wenruo 
Signed-off-by: Lu Fengqi 
---
 Documentation/btrfs-dedupe-inband.asciidoc |  7 +++
 btrfs-completion   |  2 +-
 cmds-dedupe-ib.c   | 73 +-
 3 files changed, 66 insertions(+), 16 deletions(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc 
b/Documentation/btrfs-dedupe-inband.asciidoc
index 6096389cb0b4..78c806f772d6 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -86,6 +86,13 @@ And compression has higher priority than in-band 
de-duplication, means if
 compression and de-duplication is enabled at the same time, only compression
 will work.
 
+*reconfigure* [options] <path>::
+Re-configure in-band de-duplication parameters of a filesystem.
++
+In-band de-duplication must be enabled before re-configuration.
++
+[Options] are the same as for 'btrfs dedupe-inband enable'.
+
*status* <path>::
 Show current in-band de-duplication status of a filesystem.
 
diff --git a/btrfs-completion b/btrfs-completion
index 62a7bdd4d0d5..6ff48e4c2f6a 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -41,7 +41,7 @@ _btrfs()
commands_quota='enable disable rescan'
commands_qgroup='assign remove create destroy show limit'
commands_replace='start status cancel'
-   commands_dedupe_inband='enable disable status'
+   commands_dedupe_inband='enable disable status reconfigure'
 
if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
COMPREPLY=( $( compgen -W '--help' -- "$cur" ) )
diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
index e778457e25a8..e52f939c9ced 100644
--- a/cmds-dedupe-ib.c
+++ b/cmds-dedupe-ib.c
@@ -56,7 +56,6 @@ static const char * const cmd_dedupe_ib_enable_usage[] = {
NULL
 };
 
-
 #define report_fatal_parameter(dargs, old, member, type, err_val, fmt) \
 ({ \
if (dargs->member != old->member && \
@@ -88,6 +87,12 @@ static void report_parameter_error(struct 
btrfs_ioctl_dedupe_args *dargs,
}
report_option_parameter(dargs, old, flags, u8, -1, x);
}
+
+   if (dargs->status == 0 && old->cmd == BTRFS_DEDUPE_CTL_RECONF) {
+   error("must enable dedupe before reconfiguration");
+   return;
+   }
+
if (report_fatal_parameter(dargs, old, cmd, u16, -1, u) ||
report_fatal_parameter(dargs, old, blocksize, u64, -1, llu) ||
report_fatal_parameter(dargs, old, backend, u16, -1, u) ||
@@ -100,14 +105,17 @@ static void report_parameter_error(struct 
btrfs_ioctl_dedupe_args *dargs,
old->limit_nr, old->limit_mem);
 }
 
-static int cmd_dedupe_ib_enable(int argc, char **argv)
+static int enable_reconfig_dedupe(int argc, char **argv, int reconf)
 {
int ret;
int fd = -1;
char *path;
u64 blocksize = BTRFS_DEDUPE_BLOCKSIZE_DEFAULT;
+   int blocksize_set = 0;
u16 hash_algo = BTRFS_DEDUPE_HASH_SHA256;
+   int hash_algo_set = 0;
u16 backend = BTRFS_DEDUPE_BACKEND_INMEMORY;
+   int backend_set = 0;
u64 limit_nr = 0;
u64 limit_mem = 0;
u64 sys_mem = 0;
@@ -134,15 +142,17 @@ static int cmd_dedupe_ib_enable(int argc, char **argv)
break;
switch (c) {
case 's':
-   if (!strcasecmp("inmemory", optarg))
+   if (!strcasecmp("inmemory", optarg)) {
backend = BTRFS_DEDUPE_BACKEND_INMEMORY;
-   else {
+   backend_set = 1;
+   } else {
error("unsupported dedupe backend: %s", optarg);
exit(1);
}
break;
case 'b':
blocksize = parse_size(optarg);
+   blocksize_set = 1;
break;
case 'a':
if (strcmp("sha256", optarg)) {
@@ -224,26 +234,40 @@ static int cmd_dedupe_ib_enable(int argc, char **argv)
return 1;
}
memset(&dargs, -1, sizeof(dargs));
-   dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE;
-   dargs.blocksize = blocksize;
-   dargs.hash_algo = hash_algo;
-   dargs.limit_nr = limit_nr;
-   dargs.limit_mem = limit_mem;
-   dargs.backend = backend;
-   if (force)
-   dargs.flags |= BTRFS_DEDUPE_FLAG_FORCE;
-   else
-   dargs.flags = 0;
+   if (reconf) {
+   dargs.cmd = BTRFS_DEDUPE_CTL_RECONF;
+   if (blocksize_set)
+   dargs.blocksize = blocksize;
+   if (hash_algo_set)
+  

[PATCH v10.5 3/5] btrfs-progs: dedupe: Add disable support for inband deduplication

2018-09-04 Thread Lu Fengqi
From: Qu Wenruo 

Add disable subcommand for dedupe command group.

Signed-off-by: Qu Wenruo 
Signed-off-by: Lu Fengqi 
---
 Documentation/btrfs-dedupe-inband.asciidoc |  5 +++
 btrfs-completion   |  2 +-
 cmds-dedupe-ib.c   | 41 ++
 3 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc 
b/Documentation/btrfs-dedupe-inband.asciidoc
index d895aafbcf45..3452f690e3e5 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -22,6 +22,11 @@ use with caution.
 
 SUBCOMMAND
 --
+*disable* <path>::
+Disable in-band de-duplication for a filesystem.
++
+This will trash all stored dedupe hashes.
++
*enable* [options] <path>::
 Enable in-band de-duplication for a filesystem.
 +
diff --git a/btrfs-completion b/btrfs-completion
index cfdf70966e47..a74a23f42022 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -41,7 +41,7 @@ _btrfs()
commands_quota='enable disable rescan'
commands_qgroup='assign remove create destroy show limit'
commands_replace='start status cancel'
-   commands_dedupe_inband='enable'
+   commands_dedupe_inband='enable disable'
 
if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
COMPREPLY=( $( compgen -W '--help' -- "$cur" ) )
diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
index 4d499677d9ae..91b6fe234043 100644
--- a/cmds-dedupe-ib.c
+++ b/cmds-dedupe-ib.c
@@ -259,10 +259,51 @@ out:
return ret;
 }
 
+static const char * const cmd_dedupe_ib_disable_usage[] = {
+   "btrfs dedupe-inband disable <path>",
+   "Disable in-band(write time) de-duplication of a btrfs.",
+   NULL
+};
+
+static int cmd_dedupe_ib_disable(int argc, char **argv)
+{
+   struct btrfs_ioctl_dedupe_args dargs;
+   DIR *dirstream;
+   char *path;
+   int fd;
+   int ret;
+
+   if (check_argc_exact(argc, 2))
+   usage(cmd_dedupe_ib_disable_usage);
+
+   path = argv[1];
+   fd = open_file_or_dir(path, &dirstream);
+   if (fd < 0) {
+   error("failed to open file or directory: %s", path);
+   return 1;
+   }
+   memset(&dargs, 0, sizeof(dargs));
+   dargs.cmd = BTRFS_DEDUPE_CTL_DISABLE;
+
+   ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+   if (ret < 0) {
+   error("failed to disable inband deduplication: %m");
+   ret = 1;
+   goto out;
+   }
+   ret = 0;
+
+out:
+   close_file_or_dir(fd, dirstream);
+   return ret;
+}
+
 const struct cmd_group dedupe_ib_cmd_group = {
dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, {
{ "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage,
  NULL, 0},
+   { "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage,
+ NULL, 0},
NULL_CMD_STRUCT
}
 };
-- 
2.18.0





[PATCH v10.5 4/5] btrfs-progs: dedupe: Add status subcommand

2018-09-04 Thread Lu Fengqi
From: Qu Wenruo 

Add status subcommand for dedupe command group.

Signed-off-by: Qu Wenruo 
Signed-off-by: Lu Fengqi 
---
 Documentation/btrfs-dedupe-inband.asciidoc |  3 +
 btrfs-completion   |  2 +-
 cmds-dedupe-ib.c   | 80 ++
 3 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedupe-inband.asciidoc 
b/Documentation/btrfs-dedupe-inband.asciidoc
index 3452f690e3e5..6096389cb0b4 100644
--- a/Documentation/btrfs-dedupe-inband.asciidoc
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -86,6 +86,9 @@ And compression has higher priority than in-band 
de-duplication, means if
 compression and de-duplication is enabled at the same time, only compression
 will work.
 
+*status* <path>::
+Show current in-band de-duplication status of a filesystem.
+
 BACKENDS
 
 Btrfs in-band de-duplication will support different storage backends, with
diff --git a/btrfs-completion b/btrfs-completion
index a74a23f42022..62a7bdd4d0d5 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -41,7 +41,7 @@ _btrfs()
commands_quota='enable disable rescan'
commands_qgroup='assign remove create destroy show limit'
commands_replace='start status cancel'
-   commands_dedupe_inband='enable disable'
+   commands_dedupe_inband='enable disable status'
 
if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
COMPREPLY=( $( compgen -W '--help' -- "$cur" ) )
diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
index 91b6fe234043..e778457e25a8 100644
--- a/cmds-dedupe-ib.c
+++ b/cmds-dedupe-ib.c
@@ -298,12 +298,92 @@ out:
return 0;
 }
 
+static const char * const cmd_dedupe_ib_status_usage[] = {
+   "btrfs dedupe-inband status <path>",
+   "Show current in-band(write time) de-duplication status of a btrfs.",
+   NULL
+};
+
+static int cmd_dedupe_ib_status(int argc, char **argv)
+{
+   struct btrfs_ioctl_dedupe_args dargs;
+   DIR *dirstream;
+   char *path;
+   int fd;
+   int ret;
+   int print_limit = 1;
+
+   if (check_argc_exact(argc, 2))
+   usage(cmd_dedupe_ib_status_usage);
+
+   path = argv[1];
+   fd = open_file_or_dir(path, &dirstream);
+   if (fd < 0) {
+   error("failed to open file or directory: %s", path);
+   ret = 1;
+   goto out;
+   }
+   memset(&dargs, 0, sizeof(dargs));
+   dargs.cmd = BTRFS_DEDUPE_CTL_STATUS;
+
+   ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+   if (ret < 0) {
+   error("failed to get inband deduplication status: %m");
+   ret = 1;
+   goto out;
+   }
+   ret = 0;
+   if (dargs.status == 0) {
+   printf("Status: \t\t\tDisabled\n");
+   goto out;
+   }
+   printf("Status:\t\t\tEnabled\n");
+
+   if (dargs.hash_algo == BTRFS_DEDUPE_HASH_SHA256)
+   printf("Hash algorithm:\t\tSHA-256\n");
+   else
+   printf("Hash algorithm:\t\tUnrecognized(%x)\n",
+   dargs.hash_algo);
+
+   if (dargs.backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+   printf("Backend:\t\tIn-memory\n");
+   print_limit = 1;
+   } else  {
+   printf("Backend:\t\tUnrecognized(%x)\n",
+   dargs.backend);
+   }
+
+   printf("Dedup Blocksize:\t%llu\n", dargs.blocksize);
+
+   if (print_limit) {
+   u64 cur_mem;
+
+   /* Limit nr may be 0 */
+   if (dargs.limit_nr)
+   cur_mem = dargs.current_nr * (dargs.limit_mem /
+   dargs.limit_nr);
+   else
+   cur_mem = 0;
+
+   printf("Number of hash: \t[%llu/%llu]\n", dargs.current_nr,
+   dargs.limit_nr);
+   printf("Memory usage: \t\t[%s/%s]\n",
+   pretty_size(cur_mem),
+   pretty_size(dargs.limit_mem));
+   }
+out:
+   close_file_or_dir(fd, dirstream);
+   return ret;
+}
+
 const struct cmd_group dedupe_ib_cmd_group = {
dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, {
{ "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage,
  NULL, 0},
{ "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage,
  NULL, 0},
+   { "status", cmd_dedupe_ib_status, cmd_dedupe_ib_status_usage,
+ NULL, 0},
NULL_CMD_STRUCT
}
 };
-- 
2.18.0
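The memory-usage estimate printed by the status subcommand above is derived from a per-hash budget, as in cmd_dedupe_ib_status(); a quick model of that arithmetic (function name is illustrative):

```python
def dedupe_status_mem(current_nr, limit_nr, limit_mem):
    """Mirror of the cur_mem computation in the status patch: estimate
    current memory usage as cached-hash count times the per-hash budget."""
    if limit_nr == 0:                 # limit_nr may be 0; avoid division
        return 0
    return current_nr * (limit_mem // limit_nr)

# e.g. 16384 hashes cached, against a limit of 32768 hashes / 32 MiB budget
usage = dedupe_status_mem(16384, 32768, 32 * 1024 * 1024)
```

With half the hash slots filled, the estimate is half the memory budget (16 MiB here), matching the "[current/limit]" pair the command prints.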





[PATCH v10.5 0/5] In-band de-duplication for btrfs-progs

2018-09-04 Thread Lu Fengqi
Patchset can be fetched from github:
https://github.com/littleroad/btrfs-progs.git dedupe_latest

Inband dedupe(in-memory backend only) ioctl support for btrfs-progs.

v7 changes:
   Update ctree.h to follow kernel structure change
   Update print-tree to follow kernel structure change
V8 changes:
   Move dedup props and on-disk backend support out of the patchset
   Change command group name to "dedupe-inband", to avoid confusion with
   possible out-of-band dedupe. Suggested by Mark.
   Rebase to latest devel branch.
V9 changes:
   Follow the kernel's ioctl change to support the FORCE flag, the new
   reconf ioctl, and more precise error reporting.
v10 changes:
   Rebase to v4.10.
   Add BUILD_ASSERT for btrfs_ioctl_dedupe_args
v10.1 changes:
   Rebase to v4.14.
v10.2 changes:
   Rebase to v4.16.1.
v10.3 changes:
   Rebase to v4.17.
v10.4 changes:
   Deal with offline reviews from Misono Tomohiro.
   1. s/btrfs-dedupe/btrfs-dedupe-inband
   2. Replace strerror(errno) with %m
   3. Use SZ_* instead of intermediate numbers
   4. update btrfs-completion for reconfigure subcommand
v10.5 changes:
   Rebase to v4.17.1.

Qu Wenruo (5):
  btrfs-progs: Basic framework for dedupe-inband command group
  btrfs-progs: dedupe: Add enable command for dedupe command group
  btrfs-progs: dedupe: Add disable support for inband deduplication
  btrfs-progs: dedupe: Add status subcommand
  btrfs-progs: dedupe: introduce reconfigure subcommand

 Documentation/Makefile.in  |   1 +
 Documentation/btrfs-dedupe-inband.asciidoc | 167 
 Documentation/btrfs.asciidoc   |   4 +
 Makefile   |   3 +-
 btrfs-completion   |   6 +-
 btrfs.c|   2 +
 cmds-dedupe-ib.c   | 437 +
 commands.h |   2 +
 dedupe-ib.h|  28 ++
 ioctl.h|  38 ++
 10 files changed, 686 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc
 create mode 100644 cmds-dedupe-ib.c
 create mode 100644 dedupe-ib.h

-- 
2.18.0





[PATCH v10.5 1/5] btrfs-progs: Basic framework for dedupe-inband command group

2018-09-04 Thread Lu Fengqi
From: Qu Wenruo 

Add basic ioctl header and command group framework for later use.
Along with a basic man page doc.

Signed-off-by: Qu Wenruo 
Signed-off-by: Lu Fengqi 
---
 Documentation/Makefile.in  |  1 +
 Documentation/btrfs-dedupe-inband.asciidoc | 40 ++
 Documentation/btrfs.asciidoc   |  4 +++
 Makefile   |  3 +-
 btrfs.c|  2 ++
 cmds-dedupe-ib.c   | 35 +++
 commands.h |  2 ++
 dedupe-ib.h| 28 +++
 ioctl.h| 36 +++
 9 files changed, 150 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc
 create mode 100644 cmds-dedupe-ib.c
 create mode 100644 dedupe-ib.h

diff --git a/Documentation/Makefile.in b/Documentation/Makefile.in
index 184647c41940..402155fae001 100644
--- a/Documentation/Makefile.in
+++ b/Documentation/Makefile.in
@@ -28,6 +28,7 @@ MAN8_TXT += btrfs-qgroup.asciidoc
 MAN8_TXT += btrfs-replace.asciidoc
 MAN8_TXT += btrfs-restore.asciidoc
 MAN8_TXT += btrfs-property.asciidoc
+MAN8_TXT += btrfs-dedupe-inband.asciidoc
 
 # Category 5 manual page
 MAN5_TXT += btrfs-man5.asciidoc
diff --git a/Documentation/btrfs-dedupe-inband.asciidoc 
b/Documentation/btrfs-dedupe-inband.asciidoc
new file mode 100644
index ..83113f5487e2
--- /dev/null
+++ b/Documentation/btrfs-dedupe-inband.asciidoc
@@ -0,0 +1,40 @@
+btrfs-dedupe-inband(8)
+==
+
+NAME
+
+btrfs-dedupe-inband - manage in-band (write time) de-duplication of a btrfs
+filesystem
+
+SYNOPSIS
+
*btrfs dedupe-inband* <subcommand> <args>
+
+DESCRIPTION
+---
+*btrfs dedupe-inband* is used to enable/disable or show current in-band 
de-duplication
+status of a btrfs filesystem.
+
+Kernel support for in-band de-duplication starts from 4.19.
+
+WARNING: In-band de-duplication is still an experimental feature of btrfs,
+use with caution.
+
+SUBCOMMAND
+--
+Nothing yet
+
+EXIT STATUS
+---
+*btrfs dedupe-inband* returns a zero exit status if it succeeds. Non zero is
+returned in case of failure.
+
+AVAILABILITY
+
+*btrfs* is part of btrfs-progs.
+Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for
+further details.
+
+SEE ALSO
+
+`mkfs.btrfs`(8),
diff --git a/Documentation/btrfs.asciidoc b/Documentation/btrfs.asciidoc
index 7316ac094413..1cf5bddec335 100644
--- a/Documentation/btrfs.asciidoc
+++ b/Documentation/btrfs.asciidoc
@@ -50,6 +50,10 @@ COMMANDS
Do off-line check on a btrfs filesystem. +
See `btrfs-check`(8) for details.
 
+*dedupe-inband*::
+   Control btrfs in-band(write time) de-duplication. +
+   See `btrfs-dedupe-inband`(8) for details.
+
 *device*::
Manage devices managed by btrfs, including add/delete/scan and so
on. +
diff --git a/Makefile b/Makefile
index fcfc815a2a5b..4052cecfae4d 100644
--- a/Makefile
+++ b/Makefile
@@ -123,7 +123,8 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o 
cmds-device.o cmds-scrub.o \
   cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
   cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \
   cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o 
\
-  mkfs/common.o check/mode-common.o check/mode-lowmem.o
+  mkfs/common.o check/mode-common.o check/mode-lowmem.o \
+  cmds-dedupe-ib.o
 libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o 
\
   kernel-lib/crc32c.o messages.o \
   uuid-tree.o utils-lib.o rbtree-utils.o
diff --git a/btrfs.c b/btrfs.c
index 2d39f2ced3e8..2168f5a8bc7f 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -255,6 +255,8 @@ static const struct cmd_group btrfs_cmd_group = {
{ "quota", cmd_quota, NULL, &quota_cmd_group, 0 },
{ "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 },
{ "replace", cmd_replace, NULL, &replace_cmd_group, 0 },
+   { "dedupe-inband", cmd_dedupe_ib, NULL, &dedupe_ib_cmd_group,
+   0 },
{ "help", cmd_help, cmd_help_usage, NULL, 0 },
{ "version", cmd_version, cmd_version_usage, NULL, 0 },
NULL_CMD_STRUCT
diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c
new file mode 100644
index ..73c923a797da
--- /dev/null
+++ b/cmds-dedupe-ib.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Fujitsu.  All rights reserved.
+ */
+
+#include 
+#include 
+#include 
+
+#include "ctree.h"
+#include "ioctl.h"
+
+#include "commands.h"
+#include "utils.h"
+#include "kerncompat.h"
+#include "dedupe-ib.h"
+
+static const char * const dedupe_ib_cmd_group_usage[] = {
+   "btrfs dedupe-inband <command> [options] <path>",
+   NULL
+};
+
+static const char dedupe_ib_cmd_group_info[] =

[PATCH v15 09/13] btrfs: introduce type based delalloc metadata reserve

2018-09-04 Thread Lu Fengqi
From: Wang Xiaoguang 

Introduce a type-based metadata reserve parameter for the delalloc space
reservation/freeing functions.

The problem we are going to solve is that btrfs uses different max
extent sizes for different mount options.

For de-duplication, the max extent size can be set by the dedupe ioctl,
while for normal write it's 128M.
Furthermore, the split/merge extent hooks depend heavily on that max
extent size.

Such a situation contributes to quite a lot of false ENOSPC reports.

So this patch introduces the facility to help solve these false ENOSPC
reports related to different max extent sizes.

Currently, only normal 128M extent size is supported. More types will
follow soon.
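The reserve counting that depends on the max extent size boils down to a ceiling division, as in the count_max_extents() change in the diff that follows; a quick model (the 128K dedupe value is a hypothetical blocksize for illustration):

```python
BTRFS_MAX_EXTENT_SIZE = 128 * 1024 * 1024  # normal COW max extent: 128M

def count_max_extents(size, max_extent_size=BTRFS_MAX_EXTENT_SIZE):
    # Ceiling division: how many max-sized extents cover `size` bytes.
    return (size + max_extent_size - 1) // max_extent_size

# A 1 MiB delalloc range counts as one extent for normal COW...
normal = count_max_extents(1024 * 1024)
# ...but as 8 extents if a 128K dedupe blocksize caps the extent size,
# so the metadata reservation must be sized per reserve type.
dedupe = count_max_extents(1024 * 1024, 128 * 1024)
```

Reserving with the wrong divisor either under-reserves (corruption risk) or over-reserves (the false ENOSPC this series targets), which is why the reserve type has to be threaded through all the reserve/release paths.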

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/ctree.h |  43 ++---
 fs/btrfs/extent-tree.c   |  48 ---
 fs/btrfs/file.c  |  30 +
 fs/btrfs/free-space-cache.c  |   6 +-
 fs/btrfs/inode-map.c |   9 ++-
 fs/btrfs/inode.c | 115 +--
 fs/btrfs/ioctl.c |  23 +++
 fs/btrfs/ordered-data.c  |   6 +-
 fs/btrfs/ordered-data.h  |   3 +-
 fs/btrfs/relocation.c|  22 ---
 fs/btrfs/tests/inode-tests.c |  15 +++--
 11 files changed, 223 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 741ef21a6185..4f0b6a12ecb1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -98,11 +98,24 @@ static const int btrfs_csum_sizes[] = { 4 };
 /*
  * Count how many BTRFS_MAX_EXTENT_SIZE cover the @size
  */
-static inline u32 count_max_extents(u64 size)
+static inline u32 count_max_extents(u64 size, u64 max_extent_size)
 {
-   return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
+   return div_u64(size + max_extent_size - 1, max_extent_size);
 }
 
+/*
+ * Type based metadata reserve type
+ * This affects how btrfs reserve metadata space for buffered write.
+ *
+ * This is caused by the different max extent size for normal COW
+ * and further in-band dedupe
+ */
+enum btrfs_metadata_reserve_type {
+   BTRFS_RESERVE_NORMAL,
+};
+
+u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type);
+
 struct btrfs_mapping_tree {
struct extent_map_tree map_tree;
 };
@@ -2742,8 +2755,9 @@ int btrfs_check_data_free_space(struct inode *inode,
 void btrfs_free_reserved_data_space(struct inode *inode,
struct extent_changeset *reserved, u64 start, u64 len);
 void btrfs_delalloc_release_space(struct inode *inode,
- struct extent_changeset *reserved,
- u64 start, u64 len, bool qgroup_free);
+   struct extent_changeset *reserved,
+   u64 start, u64 len, bool qgroup_free,
+   enum btrfs_metadata_reserve_type reserve_type);
 void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
u64 len);
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
@@ -2753,13 +2767,17 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
 void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
  struct btrfs_block_rsv *rsv);
 void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes,
-   bool qgroup_free);
+   bool qgroup_free,
+   enum btrfs_metadata_reserve_type reserve_type);
 
-int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes);
+int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
+   enum btrfs_metadata_reserve_type reserve_type);
 void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes,
-bool qgroup_free);
+   bool qgroup_free,
+   enum btrfs_metadata_reserve_type reserve_type);
 int btrfs_delalloc_reserve_space(struct inode *inode,
-   struct extent_changeset **reserved, u64 start, u64 len);
+   struct extent_changeset **reserved, u64 start, u64 len,
+   enum btrfs_metadata_reserve_type reserve_type);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
  unsigned short type);
@@ -3165,7 +3183,11 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
  unsigned int extra_bits,
- struct extent_state **cached_state, int dedupe);
+   

[PATCH v15 03/13] btrfs: dedupe: Introduce function to add hash into in-memory tree

2018-09-04 Thread Lu Fengqi
From: Wang Xiaoguang 

Introduce static function inmem_add() to add a hash into the in-memory
tree.
And now we can implement the btrfs_dedupe_add() interface.
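The dual-index scheme used by inmem_add() — one tree keyed by hash, one keyed by bytenr, plus an LRU list bounded by a limit — can be sketched in user space. This is a hypothetical Python model for illustration only: dicts and an OrderedDict stand in for the kernel's rb-trees and list_head.

```python
from collections import OrderedDict

class InmemDedupeIndex:
    """Toy model of the in-memory dedupe backend's dual index."""

    def __init__(self, limit_nr):
        self.limit_nr = limit_nr
        self.by_hash = {}                # digest -> bytenr
        self.by_bytenr = OrderedDict()   # bytenr -> digest, oldest first (LRU)

    def add(self, bytenr, digest):
        # Extent already indexed at this bytenr: nothing to do.
        if bytenr in self.by_bytenr:
            return
        # Only one entry is kept per hash, to save memory.
        if digest in self.by_hash:
            return
        self.by_bytenr[bytenr] = digest
        self.by_hash[digest] = bytenr
        # Evict least recently inserted entries beyond the limit.
        while len(self.by_bytenr) > self.limit_nr:
            old_bytenr, old_digest = self.by_bytenr.popitem(last=False)
            del self.by_hash[old_digest]
```

A quick walk-through: with a limit of 2, inserting a third entry evicts the oldest one, and re-adding an already-known digest is a no-op.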

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/dedupe.c | 150 ++
 1 file changed, 150 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 06523162753d..784bb3a8a5ab 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -19,6 +19,14 @@ struct inmem_hash {
u8 hash[];
 };
 
+static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
+{
+   if (WARN_ON(algo >= ARRAY_SIZE(btrfs_hash_sizes)))
+   return NULL;
+   return kzalloc(sizeof(struct inmem_hash) + btrfs_hash_sizes[algo],
+   GFP_NOFS);
+}
+
 static struct btrfs_dedupe_info *
 init_dedupe_info(struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -167,3 +175,145 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
/* Place holder for bisect, will be implemented in later patches */
return 0;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+struct inmem_hash *hash, int hash_len)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+   if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+   p = &(*p)->rb_left;
+   else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->hash_node, parent, p);
+   rb_insert_color(&hash->hash_node, root);
+   return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+  struct inmem_hash *hash)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+   if (hash->bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (hash->bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->bytenr_node, parent, p);
+   rb_insert_color(&hash->bytenr_node, root);
+   return 0;
+}
+
+static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
+   struct inmem_hash *hash)
+{
+   list_del(&hash->lru_list);
+   rb_erase(&hash->hash_node, &dedupe_info->hash_root);
+   rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
+
+   if (!WARN_ON(dedupe_info->current_nr == 0))
+   dedupe_info->current_nr--;
+
+   kfree(hash);
+}
+
+/*
+ * Insert a hash into the in-memory dedupe tree.
+ * Will evict the least recently used hashes if the limit is exceeded.
+ *
+ * If the hash matches an existing one, we won't insert it, to
+ * save memory.
+ */
+static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
+struct btrfs_dedupe_hash *hash)
+{
+   int ret = 0;
+   u16 algo = dedupe_info->hash_algo;
+   struct inmem_hash *ihash;
+
+   ihash = inmem_alloc_hash(algo);
+
+   if (!ihash)
+   return -ENOMEM;
+
+   /* Copy the data out */
+   ihash->bytenr = hash->bytenr;
+   ihash->num_bytes = hash->num_bytes;
+   memcpy(ihash->hash, hash->hash, btrfs_hash_sizes[algo]);
+
+   mutex_lock(&dedupe_info->lock);
+
+   ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash);
+   if (ret > 0) {
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   ret = inmem_insert_hash(&dedupe_info->hash_root, ihash,
+   btrfs_hash_sizes[algo]);
+   if (ret > 0) {
+   /*
+* We only keep one hash in tree to save memory, so if
+* hash conflicts, free the one to insert.
+*/
+   rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   list_add(&ihash->lru_list, &dedupe_info->lru_list);
+   dedupe_info->current_nr++;
+
+   /* Remove the last dedupe hash if we exceed limit */
+   while (dedupe_info->current_nr > dedupe_info->limit_nr) {
+   struct inmem_hash *last;
+
+   last = list_entry(dedupe_info->lru_list.prev,
+ struct inmem_hash, lru_list);
+   __inmem_del(dedupe_info, last);
+   }
+out:
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+
+int btrfs_dedupe_add(struct btrfs_fs_info *fs_info,
+struct btrfs_dedupe_hash *hash)
+{
+   struct btrfs_dedupe_info *dedupe_info 

[PATCH v15 00/13] Btrfs In-band De-duplication

2018-09-04 Thread Lu Fengqi
This patchset can be fetched from github:
https://github.com/littleroad/linux.git dedupe_latest

Now the new base is v4.19-rc2, and the patch about compression has been
dropped since it conflicts with the compression heuristic.

Normal test cases from the auto group expose no regression, and the
ib-dedupe group passes without problem.

xfstests ib-dedupe group can be fetched from github:
https://github.com/littleroad/xfstests-dev.git btrfs_dedupe_latest

Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug
  Add handle for multiple hash on same bytenr corner case to fix abort
  trans error
  Increase dedup rate by enhancing delayed ref handler for both backend.
  Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
  Increase dedup block size up limit to 8M.
v4:
  Add dedup prop for disabling dedup for given files/dirs.
  Merge inmem_search() and ondisk_search() into generic_search() to save
  some code
  Fix another delayed_ref related bug.
  Use the same mutex for both inmem and ondisk backend.
  Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
  rate.
v5:
  Reuse compress routine for much simpler dedup function.
  Slightly improved performance due to above modification.
  Fix race between dedup enable/disable
  Fix for false ENOSPC report
v6:
  Further enable/disable race window fix.
  Minor format change according to checkpatch.
v7:
  Fix one concurrency bug with balance.
  Slightly modify return value from -EINVAL to -EOPNOTSUPP for
  btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
  and wrong parameter.
  Rebased to integration-4.6.
v8:
  Rename 'dedup' to 'dedupe'.
  Add support to allow dedupe and compression work at the same time.
  Fix several balance related bugs. Special thanks to Satoru Takeuchi,
  who exposed most of them.
  Small dedupe hit case performance improvement.
v9:
  Re-order the patchset to completely separate pure in-memory and any
  on-disk format change.
  Fold bug fixes into its original patch.
v10:
  Adding back missing bug fix patch.
  Reduce on-disk item size.
  Hide dedupe ioctl under CONFIG_BTRFS_DEBUG.
v11:
  Remove other backend and props support to focus on the framework and
  in-memory backend. Suggested by David.
  Better disable and buffered write race protection.
  Comprehensive fix to dedupe metadata ENOSPC problem.
v12:
  Stateful 'enable' ioctl and new 'reconf' ioctl
  New FORCE flag for enable ioctl to allow stateless ioctl
  Precise error report and extendable ioctl structure.
v12.1
  Rebase to David's for-next-20160704 branch
  Add co-ordinate patch for subpage and dedupe patchset.
v12.2
  Rebase to David's for-next-20160715 branch
  Add co-ordinate patch for other patchset.
v13
  Rebase to David's for-next-20160906 branch
  Fix a reserved space leak bug, which only frees quota reserved space
  but not space_info->byte_may_use.
v13.1
  Rebase to Chris' for-linux-4.9 branch
v14
  Use generic ENOSPC fix for both compression and dedupe.
v14.1
  Further split ENOSPC fix.
v14.2
  Rebase to v4.11-rc2.
  Co-operate with count_max_extent() to calculate num_extents.
  No longer rely on qgroup fixes.
v14.3
  Rebase to v4.12-rc1.
v14.4
  Rebase to kdave/for-4.13-part1.
v14.5
  Rebase to v4.15-rc3.
v14.6
  Rebase to v4.17-rc5.
v14.7
  Replace SHASH_DESC_ON_STACK with kmalloc to remove VLA.
  Fixed the following errors by switching to div_u64.
  ├── arm-allmodconfig
  │   └── ERROR:__aeabi_uldivmod-fs-btrfs-btrfs.ko-undefined
  └── i386-allmodconfig
  └── ERROR:__udivdi3-fs-btrfs-btrfs.ko-undefined
v14.8
  Rebase to v4.18-rc4.
v15
  Rebase to v4.19-rc2.
  Drop "btrfs: Introduce COMPRESS reserve type to fix false enospc for 
compression".
  Remove the ifdef around btrfs inband dedupe ioctl.

Qu Wenruo (4):
  btrfs: delayed-ref: Add support for increasing data ref under spinlock
  btrfs: dedupe: Inband in-memory only de-duplication implement
  btrfs: relocation: Enhance error handling to avoid BUG_ON
  btrfs: dedupe: Introduce new reconfigure ioctl

Wang Xiaoguang (9):
  btrfs: dedupe: Introduce dedupe framework and its header
  btrfs: dedupe: Introduce function to initialize dedupe info
  btrfs: dedupe: Introduce function to add hash into in-memory tree
  btrfs: dedupe: Introduce function to remove hash from in-memory tree
  btrfs: dedupe: Introduce function to search for an existing hash
  btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  btrfs: ordered-extent: Add support for dedupe
  btrfs: introduce type based delalloc metadata reserve
  btrfs: dedupe: Add ioctl for inband deduplication

 fs/btrfs/Makefile|   2 +-
 fs/btrfs/ctree.h |  52 ++-
 fs/btrfs/dedupe.c| 828 +++
 fs/btrfs/dedupe.h| 175 +++-
 fs/btrfs/delayed-ref.c   |  53 ++-
 fs/btrfs/delayed-ref.h   |  15 +
 fs/btrfs/disk-io.c   |   4 +
 fs/btrfs/extent-tree.c   |  67 ++-
 fs/btrfs/extent_io.c 

[PATCH v15 13/13] btrfs: dedupe: Introduce new reconfigure ioctl

2018-09-04 Thread Lu Fengqi
From: Qu Wenruo 

Introduce new reconfigure ioctl and new FORCE flag for in-band dedupe
ioctls.

Now dedupe enable and reconfigure ioctl are stateful.


| Current state |   Ioctl| Next state  |

| Disabled  |  enable| Enabled |
| Enabled   |  enable| Not allowed |
| Enabled   |  reconf| Enabled |
| Enabled   |  disable   | Disabled|
| Disabled  |  disable   | Disabled|
| Disabled  |  reconf| Not allowed |

(While disable is always stateless)

While for those who prefer stateless ioctls (myself for example), a new
FORCE flag is introduced.

In FORCE mode, enable/disable is completely stateless.

| Current state |   Ioctl| Next state  |

| Disabled  |  enable| Enabled |
| Enabled   |  enable| Enabled |
| Enabled   |  disable   | Disabled|
| Disabled  |  disable   | Disabled|


Also, the reconfigure ioctl will only modify the specified fields,
unlike enable, where unspecified fields are filled with default values.

For example:
 # btrfs dedupe enable --block-size 64k /mnt
 # btrfs dedupe reconfigure --limit-hash 1m /mnt
Will lead to:
 dedupe blocksize: 64K
 dedupe hash limit nr: 1m

While for enable:
 # btrfs dedupe enable --force --block-size 64k /mnt
 # btrfs dedupe enable --force --limit-hash 1m /mnt
Will reset blocksize to default value:
 dedupe blocksize: 128K << reset
 dedupe hash limit nr: 1m
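The two transition tables above can be modeled as a small state function. This is a hypothetical user-space sketch of the rules, not kernel code; the command names and ValueError for "Not allowed" rows are illustrative.

```python
def dedupe_transition(enabled, cmd, force=False):
    """Return the next enabled state per the tables above.

    `enabled` is the current state; `force` models the FORCE flag,
    which makes enable stateless. "Not allowed" rows raise ValueError.
    """
    if cmd == "disable":
        return False                     # disable is always stateless
    if cmd == "enable":
        if enabled and not force:
            raise ValueError("enable while enabled: not allowed")
        return True
    if cmd == "reconf":
        if not enabled:
            raise ValueError("reconf while disabled: not allowed")
        return True
    raise ValueError("unknown command")
```

Each row of both tables maps to one call, e.g. `dedupe_transition(True, "enable", force=True)` stays enabled, while the same call without FORCE raises.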

Suggested-by: David Sterba 
Signed-off-by: Qu Wenruo 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/dedupe.c  | 132 ++---
 fs/btrfs/dedupe.h  |  13 
 fs/btrfs/ioctl.c   |  13 
 include/uapi/linux/btrfs.h |  11 +++-
 4 files changed, 143 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a147e148bbb8..2be3e53acc6a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -29,6 +29,40 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
GFP_NOFS);
 }
 
+/*
+ * Copy from current dedupe info to fill dargs.
+ * For the reconf case, only fill members which are uninitialized.
+ */
+static void get_dedupe_status(struct btrfs_dedupe_info *dedupe_info,
+ struct btrfs_ioctl_dedupe_args *dargs)
+{
+   int reconf = (dargs->cmd == BTRFS_DEDUPE_CTL_RECONF);
+
+   dargs->status = 1;
+
+   if (!reconf || (reconf && dargs->blocksize == (u64)-1))
+   dargs->blocksize = dedupe_info->blocksize;
+   if (!reconf || (reconf && dargs->backend == (u16)-1))
+   dargs->backend = dedupe_info->backend;
+   if (!reconf || (reconf && dargs->hash_algo == (u16)-1))
+   dargs->hash_algo = dedupe_info->hash_algo;
+
+   /*
+* For the re-configure case, if the limits are not being modified,
+* their limit fields will be set to 0, unlike other fields
+*/
+   if (!reconf || !(dargs->limit_nr || dargs->limit_mem)) {
+   dargs->limit_nr = dedupe_info->limit_nr;
+   dargs->limit_mem = dedupe_info->limit_nr *
+   (sizeof(struct inmem_hash) +
+btrfs_hash_sizes[dedupe_info->hash_algo]);
+   }
+
+   /* current_nr doesn't make sense for the reconfigure case */
+   if (!reconf)
+   dargs->current_nr = dedupe_info->current_nr;
+}
+
 void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
 struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -45,15 +79,7 @@ void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
return;
}
	mutex_lock(&dedupe_info->lock);
-   dargs->status = 1;
-   dargs->blocksize = dedupe_info->blocksize;
-   dargs->backend = dedupe_info->backend;
-   dargs->hash_algo = dedupe_info->hash_algo;
-   dargs->limit_nr = dedupe_info->limit_nr;
-   dargs->limit_mem = dedupe_info->limit_nr *
-   (sizeof(struct inmem_hash) +
-btrfs_hash_sizes[dedupe_info->hash_algo]);
-   dargs->current_nr = dedupe_info->current_nr;
+   get_dedupe_status(dedupe_info, dargs);
	mutex_unlock(&dedupe_info->lock);
memset(dargs->__unused, -1, sizeof(dargs->__unused));
 }
@@ -98,17 +124,50 @@ init_dedupe_info(struct btrfs_ioctl_dedupe_args *dargs)
 static int check_dedupe_parameter(struct btrfs_fs_info *fs_info,
  struct btrfs_ioctl_dedupe_args *dargs)
 {
-   u64 blocksize = dargs->blocksize;
-   u64 limit_nr = dargs->limit_nr;
-   u64 limit_mem = dargs->limit_mem;
-   u16 hash_algo = dargs->hash_algo;
-   u8 backend = dargs->backend;
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   u64 blocksize;
+   u64 limit_nr;
+   u64 limit_mem;
+ 

[PATCH v15 12/13] btrfs: relocation: Enhance error handling to avoid BUG_ON

2018-09-04 Thread Lu Fengqi
From: Qu Wenruo 

Since the introduction of btrfs dedupe tree, it's possible that balance can
race with dedupe disabling.

When this happens, dedupe_enabled will make btrfs_get_fs_root() return
ERR_PTR(-ENOENT).
But due to a bug in the error handling branch, when this happens
backref_cache->nr_nodes is increased while the node is neither added to
the backref_cache nor is nr_nodes decreased, causing a BUG_ON() in
backref_cache_cleanup().

[ 2611.668810] [ cut here ]
[ 2611.669946] kernel BUG at
/home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode:  [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034]  []
btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706]  []
btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385]  [] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966]  [] btrfs_ioctl_balance+0x391/0x3a0
[btrfs]
[ 2611.689587]  [] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145]  [] ? lru_cache_add+0x3a/0x80
[ 2611.690647]  [] ?
lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310]  [] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842]  [] ? cp_new_stat+0x153/0x180
[ 2611.692342]  [] ? __vma_link_rb+0xfd/0x110
[ 2611.692842]  [] ? vma_link+0xb9/0xc0
[ 2611.693303]  [] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781]  [] ? __do_page_fault+0x1b4/0x400
[ 2611.694310]  [] SyS_ioctl+0x41/0x70
[ 2611.694758]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0
05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b
0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP  []
relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818]  RSP 

This patch calls remove_backref_node() in the error handling branch,
catches the returned -ENOENT in relocate_tree_blocks() and continues
balancing.

Reported-by: Satoru Takeuchi 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/relocation.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 59a9c22ebf51..5f4b138fcb35 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -845,6 +845,13 @@ struct backref_node *build_backref_tree(struct reloc_control *rc,
root = read_fs_root(rc->extent_root->fs_info, key.offset);
if (IS_ERR(root)) {
err = PTR_ERR(root);
+   /*
+* Don't forget to clean up the current node, as it may not
+* have been added to backref_cache even though nr_nodes was
+* increased.
+* This will cause BUG_ON() in backref_cache_cleanup().
+*/
+   remove_backref_node(&rc->backref_cache, cur);
goto out;
}
 
@@ -3018,14 +3025,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
}
 
rb_node = rb_first(blocks);
-   while (rb_node) {
+   for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
block = rb_entry(rb_node, struct tree_block, rb_node);
 
node = build_backref_tree(rc, &block->key,
  block->level, block->bytenr);
if (IS_ERR(node)) {
+   /*
+* The root (only the dedupe tree for now) of the tree block
+* is going to be freed and can't be reached.
+* Just skip it and continue balancing.
+*/
+   if (PTR_ERR(node) == -ENOENT)
+   continue;
err = PTR_ERR(node);
-   goto out;
+   break;
}
 
ret = relocate_tree_block(trans, rc, node, &block->key,
@@ -3033,11 +3047,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
if (ret < 0) {
if (ret != -EAGAIN || rb_node == rb_first(blocks))
err = ret;
-   goto out;
+   break;
}
-   rb_node = rb_next(rb_node);
}
-out:
err = finish_pending_nodes(trans, rc, path, err);
 
 out_free_path:
-- 
2.18.0





[PATCH v15 07/13] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface

2018-09-04 Thread Lu Fengqi
From: Wang Xiaoguang 

Unlike the in-memory or on-disk dedupe backends, only the SHA256 hash
method is supported so far, so implement the btrfs_dedupe_calc_hash()
interface using SHA256.
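The kernel loop below hashes one dedupe block sector by sector (it maps one page at a time), and an incremental SHA-256 over fixed-size chunks produces the same digest as hashing the whole block at once. A minimal user-space sketch with hashlib, with illustrative parameter names:

```python
import hashlib

def dedupe_calc_hash(data, dedupe_bs, sectorsize=4096):
    """Sketch of sector-wise hashing of one dedupe block.

    `dedupe_bs` stands in for dedupe_info->blocksize and must be a
    multiple of `sectorsize`; only the first dedupe_bs bytes are hashed.
    """
    assert dedupe_bs % sectorsize == 0 and len(data) >= dedupe_bs
    h = hashlib.sha256()
    for off in range(0, dedupe_bs, sectorsize):
        # Mirrors crypto_shash_update() per mapped page/sector.
        h.update(data[off:off + sectorsize])
    return h.digest()
```

Because SHA-256 is a streaming hash, the chunked update is equivalent to a single `hashlib.sha256(data[:dedupe_bs]).digest()` call.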

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/dedupe.c | 50 +++
 1 file changed, 50 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 9c6152b7f0eb..9b0a90dd8e42 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -644,3 +644,53 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
}
return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash)
+{
+   int i;
+   int ret;
+   struct page *p;
+   struct shash_desc *shash;
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+   struct crypto_shash *tfm;
+   u64 dedupe_bs;
+   u64 sectorsize = fs_info->sectorsize;
+
+   if (!fs_info->dedupe_enabled || !hash)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+   /* Only dereference dedupe_info after the checks above */
+   tfm = dedupe_info->dedupe_driver;
+   shash = kmalloc(sizeof(*shash) + crypto_shash_descsize(tfm), GFP_NOFS);
+   if (!shash)
+   return -ENOMEM;
+
+   dedupe_bs = dedupe_info->blocksize;
+
+   shash->tfm = tfm;
+   shash->flags = 0;
+   ret = crypto_shash_init(shash);
+   if (ret)
+   goto out;
+   for (i = 0; sectorsize * i < dedupe_bs; i++) {
+   char *d;
+
+   p = find_get_page(inode->i_mapping,
+ (start >> PAGE_SHIFT) + i);
+   if (WARN_ON(!p)) {
+   ret = -ENOENT;
+   goto out;
+   }
+   d = kmap(p);
+   ret = crypto_shash_update(shash, d, sectorsize);
+   kunmap(p);
+   put_page(p);
+   if (ret)
+   goto out;
+   }
+   ret = crypto_shash_final(shash, hash->hash);
+out:
+   kfree(shash);
+   return ret;
+}
-- 
2.18.0





[PATCH v15 11/13] btrfs: dedupe: Add ioctl for inband deduplication

2018-09-04 Thread Lu Fengqi
From: Wang Xiaoguang 

Add ioctl interface for inband deduplication, which includes:
1) enable
2) disable
3) status

And a pseudo RO compat flag, to imply that btrfs now supports inband
dedupe.
However we don't add any on-disk format change; it's just a pseudo RO
compat flag.

All these ioctl interfaces are stateless, which means callers don't
need to care about the previous dedupe state before calling them, only
about the final desired state.

For example, if a user wants to enable dedupe with a specified block
size and limit, they just fill the ioctl structure and call the enable
ioctl. There is no need to check whether dedupe is already running.

These ioctls will handle things like re-configure or disable quite well.

Also, for invalid parameters, the enable ioctl interface will set the
field of the first encountered invalid parameter to (-1) to inform the
caller, while for limit_nr/limit_mem the value will be (0).
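From the caller's side, that error-reporting convention can be sketched as follows. This is an illustrative user-space model: a dict stands in for struct btrfs_ioctl_dedupe_args (field names match it), and Python's -1 stands in for the (u64)-1/(u16)-1 sentinels.

```python
def first_invalid_field(dargs):
    """After a failed enable ioctl, report which parameter was rejected.

    The first invalid field is set to -1; the limits are reported as 0
    instead, since 0 is the only value a user cannot ask for.
    """
    for name in ("blocksize", "hash_algo", "backend"):
        if dargs[name] == -1:
            return name
    if dargs["limit_nr"] == 0 and dargs["limit_mem"] == 0:
        return "limit"
    return None     # nothing flagged
```

A tool built on the ioctl could use this to print a precise error message instead of a bare EINVAL.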

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/dedupe.c  | 50 +
 fs/btrfs/dedupe.h  | 17 +++---
 fs/btrfs/disk-io.c |  3 ++
 fs/btrfs/ioctl.c   | 65 ++
 fs/btrfs/sysfs.c   |  2 ++
 include/uapi/linux/btrfs.h | 12 ++-
 6 files changed, 143 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 9b0a90dd8e42..a147e148bbb8 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -29,6 +29,35 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
GFP_NOFS);
 }
 
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled || !dedupe_info) {
+   dargs->status = 0;
+   dargs->blocksize = 0;
+   dargs->backend = 0;
+   dargs->hash_algo = 0;
+   dargs->limit_nr = 0;
+   dargs->current_nr = 0;
+   memset(dargs->__unused, -1, sizeof(dargs->__unused));
+   return;
+   }
+   mutex_lock(&dedupe_info->lock);
+   dargs->status = 1;
+   dargs->blocksize = dedupe_info->blocksize;
+   dargs->backend = dedupe_info->backend;
+   dargs->hash_algo = dedupe_info->hash_algo;
+   dargs->limit_nr = dedupe_info->limit_nr;
+   dargs->limit_mem = dedupe_info->limit_nr *
+   (sizeof(struct inmem_hash) +
+btrfs_hash_sizes[dedupe_info->hash_algo]);
+   dargs->current_nr = dedupe_info->current_nr;
+   mutex_unlock(&dedupe_info->lock);
+   memset(dargs->__unused, -1, sizeof(dargs->__unused));
+}
+
 static struct btrfs_dedupe_info *
 init_dedupe_info(struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -402,6 +431,27 @@ static void unblock_all_writers(struct btrfs_fs_info *fs_info)
percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
 }
 
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   fs_info->dedupe_enabled = 0;
+   /* same as disable */
+   smp_wmb();
+   dedupe_info = fs_info->dedupe_info;
+   fs_info->dedupe_info = NULL;
+
+   if (!dedupe_info)
+   return 0;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return 0;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
struct btrfs_dedupe_info *dedupe_info;
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 8157b17c4d11..fdd00355d6b5 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -90,6 +90,15 @@ static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 algo)
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
struct btrfs_ioctl_dedupe_args *dargs);
 
+
+/*
+ * Get inband dedupe info
+ * Since it needs to access different backends' hash size, which
+ * is not exported, we need such simple function.
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs);
+
 /*
  * Disable dedupe and invalidate all its dedupe data.
  * Called at dedupe disable time.
@@ -101,12 +110,10 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
 
 /*
- * Get current dedupe status.
- * Return 0 for success
- * No possible error yet
+ * Cleanup current btrfs_dedupe_info
+ * Called in umount time
  */
-void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
-struct btrfs_ioctl_dedupe_args *dargs);
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
 
 /*
  * Calculate hash for dedupe.
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 06683b3f2f0b..43a0324c749c 100644
--- 

[PATCH v15 04/13] btrfs: dedupe: Introduce function to remove hash from in-memory tree

2018-09-04 Thread Lu Fengqi
From: Wang Xiaoguang 

Introduce static function inmem_del() to remove a hash from the
in-memory dedupe tree.
And implement the btrfs_dedupe_del() and btrfs_dedupe_disable()
interfaces.

Also for btrfs_dedupe_disable(), add new functions to wait for existing
writers and block incoming writers to eliminate all possible races.
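Deletion by bytenr, as in inmem_del(), looks the entry up in the bytenr index and then drops it from both indexes. A self-contained sketch, with plain dicts standing in for the two rb-trees (an illustrative model, not the kernel code):

```python
def inmem_del(by_bytenr, by_hash, bytenr):
    """Drop the entry for `bytenr` from both lookup structures.

    `by_bytenr` maps bytenr -> digest, `by_hash` maps digest -> bytenr.
    Returns 0 either way: a missing hash is not an error, mirroring
    the kernel function's behavior.
    """
    digest = by_bytenr.pop(bytenr, None)
    if digest is None:
        return 0
    del by_hash[digest]
    return 0
```

The key point the sketch captures is that both indexes must stay consistent: removing from one without the other would leave a dangling lookup entry.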

Cc: Mark Fasheh 
Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/dedupe.c | 131 +++---
 1 file changed, 125 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 784bb3a8a5ab..951fefd19fde 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -170,12 +170,6 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
return ret;
 }
 
-int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
-{
-   /* Place holder for bisect, will be implemented in later patches */
-   return 0;
-}
-
 static int inmem_insert_hash(struct rb_root *root,
 struct inmem_hash *hash, int hash_len)
 {
@@ -317,3 +311,128 @@ int btrfs_dedupe_add(struct btrfs_fs_info *fs_info,
return inmem_add(dedupe_info, hash);
return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+   if (bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return entry;
+   }
+
+   return NULL;
+}
+
+/* Delete a hash from in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct inmem_hash *hash;
+
+   mutex_lock(&dedupe_info->lock);
+   hash = inmem_search_bytenr(dedupe_info, bytenr);
+   if (!hash) {
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+   }
+
+   __inmem_del(dedupe_info, hash);
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+
+/* Remove a dedupe hash from dedupe tree */
+int btrfs_dedupe_del(struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   return inmem_del(dedupe_info, bytenr);
+   return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+   struct inmem_hash *entry, *tmp;
+
+   mutex_lock(&dedupe_info->lock);
+   list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
+   __inmem_del(dedupe_info, entry);
+   mutex_unlock(&dedupe_info->lock);
+}
+
+/*
+ * Helper function to wait for and block all incoming writers
+ *
+ * Use the rw_sem introduced for freeze to wait for/block writers.
+ * So during the blocked time, no new write will happen, and we can
+ * do something quite safely; especially helpful for dedupe disable,
+ * as it affects buffered writes.
+ */
+static void block_all_writers(struct btrfs_fs_info *fs_info)
+{
+   struct super_block *sb = fs_info->sb;
+
+   percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+   down_write(&sb->s_umount);
+}
+
+static void unblock_all_writers(struct btrfs_fs_info *fs_info)
+{
+   struct super_block *sb = fs_info->sb;
+
+   up_write(&sb->s_umount);
+   percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+   int ret;
+
+   dedupe_info = fs_info->dedupe_info;
+
+   if (!dedupe_info)
+   return 0;
+
+   /* Don't allow disable status change in RO mount */
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
+   /*
+* Wait for all unfinished writers and block further writers.
+* Then sync the whole fs so all current write will go through
+* dedupe, and all later write won't go through dedupe.
+*/
+   block_all_writers(fs_info);
+   ret = sync_filesystem(fs_info->sb);
+   fs_info->dedupe_enabled = 0;
+   fs_info->dedupe_info = NULL;
+   unblock_all_writers(fs_info);
+   if (ret < 0)
+   return ret;
+
+   /* now we are OK to clean up everything */
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return 0;
+}
-- 
2.18.0





[PATCH v15 02/13] btrfs: dedupe: Introduce function to initialize dedupe info

2018-09-04 Thread Lu Fengqi
From: Wang Xiaoguang 

Add generic function to initialize dedupe info.
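The parameter validation added here (check_dedupe_parameter) enforces that the dedupe blocksize is within range, at least the sector and page size, and a power of two. A user-space sketch of just the blocksize checks; the min value is an assumption for illustration (the real BTRFS_DEDUPE_BLOCKSIZE_MIN/MAX constants live in the uapi header, with max raised to 8M per the cover letter):

```python
def check_blocksize(blocksize, sectorsize=4096, page_size=4096,
                    bs_min=16 * 1024, bs_max=8 * 1024 * 1024):
    """Mirror the blocksize validity checks: range, alignment
    floors, and power-of-two. bs_min/bs_max are assumed values."""
    return (bs_min <= blocksize <= bs_max
            and blocksize >= sectorsize
            and blocksize >= page_size
            and blocksize & (blocksize - 1) == 0)   # power of two
```

The `blocksize & (blocksize - 1) == 0` test is the classic power-of-two check, equivalent to the kernel's is_power_of_2() for nonzero values.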

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/Makefile  |   2 +-
 fs/btrfs/dedupe.c  | 169 +
 fs/btrfs/dedupe.h  |  12 +++
 include/uapi/linux/btrfs.h |   3 +
 4 files changed, 185 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/dedupe.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ca693dd554e9..78fdc87dba39 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-  uuid-tree.o props.o free-space-tree.o tree-checker.o
+  uuid-tree.o props.o free-space-tree.o tree-checker.o dedupe.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
new file mode 100644
index ..06523162753d
--- /dev/null
+++ b/fs/btrfs/dedupe.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ */
+
+#include "ctree.h"
+#include "dedupe.h"
+#include "btrfs_inode.h"
+#include "delayed-ref.h"
+
+struct inmem_hash {
+   struct rb_node hash_node;
+   struct rb_node bytenr_node;
+   struct list_head lru_list;
+
+   u64 bytenr;
+   u32 num_bytes;
+
+   u8 hash[];
+};
+
+static struct btrfs_dedupe_info *
+init_dedupe_info(struct btrfs_ioctl_dedupe_args *dargs)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+   if (!dedupe_info)
+   return ERR_PTR(-ENOMEM);
+
+   dedupe_info->hash_algo = dargs->hash_algo;
+   dedupe_info->backend = dargs->backend;
+   dedupe_info->blocksize = dargs->blocksize;
+   dedupe_info->limit_nr = dargs->limit_nr;
+
+   /* only support SHA256 yet */
+   dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+   if (IS_ERR(dedupe_info->dedupe_driver)) {
+   struct crypto_shash *shash = dedupe_info->dedupe_driver;
+
+   /* Save the error pointer before freeing dedupe_info */
+   kfree(dedupe_info);
+   return ERR_CAST(shash);
+   }
+
+   dedupe_info->hash_root = RB_ROOT;
+   dedupe_info->bytenr_root = RB_ROOT;
+   dedupe_info->current_nr = 0;
+   INIT_LIST_HEAD(&dedupe_info->lru_list);
+   mutex_init(&dedupe_info->lock);
+
+   return dedupe_info;
+}
+
+/*
+ * Helper to check if parameters are valid.
+ * The first invalid field will be set to (-1), to inform the user which
+ * parameter is invalid.
+ * Except dargs->limit_nr or dargs->limit_mem; in that case, 0 will be
+ * returned to inform the user, since the user can specify any value for
+ * a limit except 0.
+ */
+static int check_dedupe_parameter(struct btrfs_fs_info *fs_info,
+ struct btrfs_ioctl_dedupe_args *dargs)
+{
+   u64 blocksize = dargs->blocksize;
+   u64 limit_nr = dargs->limit_nr;
+   u64 limit_mem = dargs->limit_mem;
+   u16 hash_algo = dargs->hash_algo;
+   u8 backend = dargs->backend;
+
+   /*
+* Set all reserved fields to -1 to let the user detect
+* unsupported optional parameters.
+*/
+   memset(dargs->__unused, -1, sizeof(dargs->__unused));
+   if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
+   blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
+   blocksize < fs_info->sectorsize ||
+   !is_power_of_2(blocksize) ||
+   blocksize < PAGE_SIZE) {
+   dargs->blocksize = (u64)-1;
+   return -EINVAL;
+   }
+   if (hash_algo >= ARRAY_SIZE(btrfs_hash_sizes)) {
+   dargs->hash_algo = (u16)-1;
+   return -EINVAL;
+   }
+   if (backend >= BTRFS_DEDUPE_BACKEND_COUNT) {
+   dargs->backend = (u8)-1;
+   return -EINVAL;
+   }
+
+   /* Backend specific check */
+   if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+   /* Only one limit is accepted at enable time */
+   if (dargs->limit_nr && dargs->limit_mem) {
+   dargs->limit_nr = 0;
+   dargs->limit_mem = 0;
+   return -EINVAL;
+   }
+
+   if (!limit_nr && !limit_mem)
+   dargs->limit_nr = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
+   else {
+   u64 tmp = (u64)-1;
+
+   if (limit_mem) {
+   tmp = div_u64(limit_mem,
+   (sizeof(struct inmem_hash)) +
+   btrfs_hash_sizes[hash_algo]);
+   /* Too small limit_mem to fill a hash 

[PATCH v15 10/13] btrfs: dedupe: Inband in-memory only de-duplication implement

2018-09-04 Thread Lu Fengqi
From: Qu Wenruo 

Core implementation of inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses that hash to do inband de-duplication at the extent level.

The workflow is as below:
1) Run delalloc range for an inode
2) Calculate the hash for the delalloc range in units of dedupe_bs
3) For the hash match (duplicated) case, just increase the source extent
   ref and insert the file extent.
   For the hash mismatch case, go through the normal cow_file_range()
   fallback, and add the hash into the dedupe tree.
   Compression for the hash miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to enforce the limit.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/ctree.h   |   4 +-
 fs/btrfs/dedupe.h  |  15 ++
 fs/btrfs/extent-tree.c |  31 +++-
 fs/btrfs/extent_io.c   |   7 +-
 fs/btrfs/extent_io.h   |   1 +
 fs/btrfs/file.c|   4 +
 fs/btrfs/inode.c   | 316 +++--
 fs/btrfs/ioctl.c   |   1 +
 fs/btrfs/relocation.c  |  18 +++
 9 files changed, 341 insertions(+), 56 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4f0b6a12ecb1..627d617e3265 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -112,9 +112,11 @@ static inline u32 count_max_extents(u64 size, u64 
max_extent_size)
  */
 enum btrfs_metadata_reserve_type {
BTRFS_RESERVE_NORMAL,
+   BTRFS_RESERVE_DEDUPE,
 };
 
-u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type);
+u64 btrfs_max_extent_size(struct btrfs_inode *inode,
+ enum btrfs_metadata_reserve_type reserve_type);
 
 struct btrfs_mapping_tree {
struct extent_map_tree map_tree;
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 87f5b7ce7766..8157b17c4d11 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -7,6 +7,7 @@
 #define BTRFS_DEDUPE_H
 
 #include 
+#include "btrfs_inode.h"
 
 /* 32 bytes for SHA256 */
 static const int btrfs_hash_sizes[] = { 32 };
@@ -47,6 +48,20 @@ struct btrfs_dedupe_info {
u64 current_nr;
 };
 
+static inline u64 btrfs_dedupe_blocksize(struct btrfs_inode *inode)
+{
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+   return fs_info->dedupe_info->blocksize;
+}
+
+static inline int inode_need_dedupe(struct inode *inode)
+{
+   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+
+   return fs_info->dedupe_enabled;
+}
+
 static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 {
return (hash && hash->bytenr);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f90233ffcb27..131d48487c84 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -28,6 +28,7 @@
 #include "sysfs.h"
 #include "qgroup.h"
 #include "ref-verify.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2489,6 +2490,17 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
btrfs_pin_extent(fs_info, head->bytenr,
 head->num_bytes, 1);
if (head->is_data) {
+   /*
+* If insert_reserved is given, it means
+* a new extent was reserved, then deleted
+* in one transaction, and inc/dec got merged to 0.
+*
+* In this case, we need to remove its dedupe
+* hash.
+*/
+   ret = btrfs_dedupe_del(fs_info, head->bytenr);
+   if (ret < 0)
+   return ret;
ret = btrfs_del_csums(trans, fs_info, head->bytenr,
  head->num_bytes);
}
@@ -5882,13 +5894,15 @@ static void btrfs_calculate_inode_block_rsv_size(struct 
btrfs_fs_info *fs_info,
spin_unlock(_rsv->lock);
 }
 
-u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type)
+u64 btrfs_max_extent_size(struct btrfs_inode *inode,
+ enum btrfs_metadata_reserve_type reserve_type)
 {
if (reserve_type == BTRFS_RESERVE_NORMAL)
return BTRFS_MAX_EXTENT_SIZE;
-
-   ASSERT(0);
-   return BTRFS_MAX_EXTENT_SIZE;
+   else if (reserve_type == BTRFS_RESERVE_DEDUPE)
+   return btrfs_dedupe_blocksize(inode);
+   else
+   return BTRFS_MAX_EXTENT_SIZE;
 }
 
 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
@@ -5899,7 +5913,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode 
*inode, u64 num_bytes,
enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
int ret = 0;
bool delalloc_lock = true;
-   u64 max_extent_size = btrfs_max_extent_size(reserve_type);
+   u64 max_extent_size = btrfs_max_extent_size(inode, reserve_type);
 
/* If we are a 

[PATCH v15 05/13] btrfs: delayed-ref: Add support for increasing data ref under spinlock

2018-09-04 Thread Lu Fengqi
From: Qu Wenruo 

For in-band dedupe, btrfs needs to increase a data ref with delayed_refs
locked, so add a new function, btrfs_add_delayed_data_ref_locked(), to
increase an extent ref while delayed_refs->lock is already held. Export
init_delayed_ref_head and init_delayed_ref_common for inband dedupe.

Signed-off-by: Qu Wenruo 
Reviewed-by: Josef Bacik 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/delayed-ref.c | 53 +-
 fs/btrfs/delayed-ref.h | 15 
 2 files changed, 52 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 62ff545ba1f7..faca30b334ee 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -526,7 +526,7 @@ update_existing_head_ref(struct btrfs_delayed_ref_root 
*delayed_refs,
spin_unlock(>lock);
 }
 
-static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
+void btrfs_init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
  struct btrfs_qgroup_extent_record *qrecord,
  u64 bytenr, u64 num_bytes, u64 ref_root,
  u64 reserved, int action, bool is_data,
@@ -654,7 +654,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 }
 
 /*
- * init_delayed_ref_common - Initialize the structure which represents a
+ * btrfs_init_delayed_ref_common - Initialize the structure which represents a
 *  modification to an extent.
  *
  * @fs_info:Internal to the mounted filesystem mount structure.
@@ -678,7 +678,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
  * when recording a metadata extent or BTRFS_SHARED_DATA_REF_KEY/
  * BTRFS_EXTENT_DATA_REF_KEY when recording data extent
  */
-static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
+void btrfs_init_delayed_ref_common(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_node *ref,
u64 bytenr, u64 num_bytes, u64 ref_root,
int action, u8 ref_type)
@@ -751,14 +751,14 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
else
ref_type = BTRFS_TREE_BLOCK_REF_KEY;
 
-   init_delayed_ref_common(fs_info, >node, bytenr, num_bytes,
-   ref_root, action, ref_type);
+   btrfs_init_delayed_ref_common(fs_info, >node, bytenr, num_bytes,
+ ref_root, action, ref_type);
ref->root = ref_root;
ref->parent = parent;
ref->level = level;
 
-   init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
- ref_root, 0, action, false, is_system);
+   btrfs_init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
+   ref_root, 0, action, false, is_system);
head_ref->extent_op = extent_op;
 
delayed_refs = >transaction->delayed_refs;
@@ -787,6 +787,29 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
return 0;
 }
 
+/*
+ * Do real delayed data ref insert.
+ * Caller must hold delayed_refs->lock and have allocated memory
+ * for dref, head_ref and record.
+ */
+int btrfs_add_delayed_data_ref_locked(struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   struct btrfs_delayed_data_ref *ref, int action,
+   int *qrecord_inserted_ret, int *old_ref_mod,
+   int *new_ref_mod)
+{
+   struct btrfs_delayed_ref_root *delayed_refs;
+
+   head_ref = add_delayed_ref_head(trans, head_ref, qrecord,
+   action, qrecord_inserted_ret,
+   old_ref_mod, new_ref_mod);
+
+   delayed_refs = >transaction->delayed_refs;
+
+   return insert_delayed_ref(trans, delayed_refs, head_ref, >node);
+}
+
 /*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
@@ -813,7 +836,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle 
*trans,
ref_type = BTRFS_SHARED_DATA_REF_KEY;
else
ref_type = BTRFS_EXTENT_DATA_REF_KEY;
-   init_delayed_ref_common(fs_info, >node, bytenr, num_bytes,
+   btrfs_init_delayed_ref_common(fs_info, >node, bytenr, num_bytes,
ref_root, action, ref_type);
ref->root = ref_root;
ref->parent = parent;
@@ -838,8 +861,8 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle 
*trans,
}
}
 
-   init_delayed_ref_head(head_ref, record, bytenr, num_bytes, ref_root,
- reserved, action, true, false);
+   btrfs_init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
+ ref_root, reserved, 

[PATCH v15 08/13] btrfs: ordered-extent: Add support for dedupe

2018-09-04 Thread Lu Fengqi
From: Wang Xiaoguang 

Add ordered-extent support for dedupe.

Note: the current ordered-extent support only handles non-compressed
source extents.
Support for compressed source extents will be added later.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/ordered-data.c | 46 +
 fs/btrfs/ordered-data.h | 13 
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 0c4ef208b8b9..4b112258a79b 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -12,6 +12,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -170,7 +171,8 @@ static inline struct rb_node *tree_search(struct 
btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int dio, int compress_type)
+ int type, int dio, int compress_type,
+ struct btrfs_dedupe_hash *hash)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -191,6 +193,33 @@ static int __btrfs_add_ordered_extent(struct inode *inode, 
u64 file_offset,
entry->inode = igrab(inode);
entry->compress_type = compress_type;
entry->truncated_len = (u64)-1;
+   entry->hash = NULL;
+   /*
+* A hash hit means we have already incremented the extent's delayed
+* ref.
+* We must handle this even if another process is trying to
+* turn off dedupe, otherwise we will leak a reference.
+*/
+   if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = root->fs_info->dedupe_info;
+   if (WARN_ON(dedupe_info == NULL)) {
+   kmem_cache_free(btrfs_ordered_extent_cache,
+   entry);
+   return -EINVAL;
+   }
+   entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_algo);
+   if (!entry->hash) {
+   kmem_cache_free(btrfs_ordered_extent_cache, entry);
+   return -ENOMEM;
+   }
+   entry->hash->bytenr = hash->bytenr;
+   entry->hash->num_bytes = hash->num_bytes;
+   memcpy(entry->hash->hash, hash->hash,
+  btrfs_hash_sizes[dedupe_info->hash_algo]);
+   }
+
if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
set_bit(type, >flags);
 
@@ -245,15 +274,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 
file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+  u64 start, u64 len, u64 disk_len, int type,
+  struct btrfs_dedupe_hash *hash)
+{
+   return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+ disk_len, type, 0,
+ BTRFS_COMPRESS_NONE, hash);
+}
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 u64 start, u64 len, u64 disk_len, int type)
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 1,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -262,7 +299,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, 
u64 file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- compress_type);
+ compress_type, NULL);
 }
 
 /*
@@ -444,6 +481,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent 
*entry)
list_del(>list);
kfree(sum);
}
+   kfree(entry->hash);
kmem_cache_free(btrfs_ordered_extent_cache, entry);
}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 02d813aaa261..08c7ee986bb9 100644
--- 

[PATCH v15 01/13] btrfs: dedupe: Introduce dedupe framework and its header

2018-09-04 Thread Lu Fengqi
From: Wang Xiaoguang 

Introduce the header for the btrfs in-band (write time) de-duplication
framework and the needed headers.

The new de-duplication framework is going to support 2 different dedupe
methods and 1 dedupe hash.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/ctree.h   |   7 ++
 fs/btrfs/dedupe.h  | 128 -
 fs/btrfs/disk-io.c |   1 +
 include/uapi/linux/btrfs.h |  34 ++
 4 files changed, 168 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 53af9f5253f4..741ef21a6185 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1125,6 +1125,13 @@ struct btrfs_fs_info {
spinlock_t ref_verify_lock;
struct rb_root block_tree;
 #endif
+
+   /*
+* Inband de-duplication related structures
+*/
+   unsigned long dedupe_enabled:1;
+   struct btrfs_dedupe_info *dedupe_info;
+   struct mutex dedupe_ioctl_lock;
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 90281a7a35a8..222ce7b4d827 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -6,7 +6,131 @@
 #ifndef BTRFS_DEDUPE_H
 #define BTRFS_DEDUPE_H
 
-/* later in-band dedupe will expand this struct */
-struct btrfs_dedupe_hash;
+#include 
 
+/* 32 bytes for SHA256 */
+static const int btrfs_hash_sizes[] = { 32 };
+
+/*
+ * For callers outside of dedupe.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+   u64 bytenr;
+   u32 num_bytes;
+
+   /* last field is a variable length array of dedupe hash */
+   u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+   /* dedupe blocksize */
+   u64 blocksize;
+   u16 backend;
+   u16 hash_algo;
+
+   struct crypto_shash *dedupe_driver;
+
+   /*
+* Use a mutex to protect both backends.
+* Even for in-memory backends, the rb-tree can be quite large,
+* so mutex is better for such use case.
+*/
+   struct mutex lock;
+
+   /* following members are only used in in-memory backend */
+   struct rb_root hash_root;
+   struct rb_root bytenr_root;
+   struct list_head lru_list;
+   u64 limit_nr;
+   u64 current_nr;
+};
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+   return (hash && hash->bytenr);
+}
+
+/*
+ * Initialize inband dedupe info.
+ * Called at dedupe enable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (from unsupported param to tree creation error for some backends)
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
+   struct btrfs_ioctl_dedupe_args *dargs);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Get current dedupe status.
+ * Return 0 for success
+ * No possible error yet
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs);
+
+/*
+ * Calculate hash for dedupe.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (error from hash codes)
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode we are writing
+ * @file_pos: offset inside the inode
+ * As we will increase extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss. Nothing is done
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+   struct inode *inode, u64 file_pos,
+   struct btrfs_dedupe_hash *hash);
+
+/*
+ * Add a dedupe hash into dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_add(struct btrfs_fs_info *fs_info,
+struct btrfs_dedupe_hash *hash);
+
+/*
+ * Remove a dedupe hash from dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ *
+ * NOTE: if a hash deletion error is not handled well, it will lead
+ * to a corrupted fs, as a later dedupe write can point to a non-existent
+ * or even wrong extent.
+ */
+int 

[PATCH v15 06/13] btrfs: dedupe: Introduce function to search for an existing hash

2018-09-04 Thread Lu Fengqi
From: Wang Xiaoguang 

Introduce static function inmem_search() to handle the job for in-memory
hash tree.

The trick is, we must ensure the delayed ref head is not being run at
the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search()
interface.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/dedupe.c | 210 +-
 1 file changed, 209 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 951fefd19fde..9c6152b7f0eb 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -7,6 +7,8 @@
 #include "dedupe.h"
 #include "btrfs_inode.h"
 #include "delayed-ref.h"
+#include "qgroup.h"
+#include "transaction.h"
 
 struct inmem_hash {
struct rb_node hash_node;
@@ -242,7 +244,6 @@ static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
struct inmem_hash *ihash;
 
ihash = inmem_alloc_hash(algo);
-
if (!ihash)
return -ENOMEM;
 
@@ -436,3 +437,210 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
kfree(dedupe_info);
return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+   struct rb_node **p = _info->hash_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+   u16 hash_algo = dedupe_info->hash_algo;
+   int hash_len = btrfs_hash_sizes[hash_algo];
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+   if (memcmp(hash, entry->hash, hash_len) < 0) {
+   p = &(*p)->rb_left;
+   } else if (memcmp(hash, entry->hash, hash_len) > 0) {
+   p = &(*p)->rb_right;
+   } else {
+   /* Found, need to re-add it to LRU list head */
+   list_del(>lru_list);
+   list_add(>lru_list, _info->lru_list);
+   return entry;
+   }
+   }
+   return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+   struct inode *inode, u64 file_pos,
+   struct btrfs_dedupe_hash *hash)
+{
+   int ret;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_trans_handle *trans;
+   struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_delayed_ref_head *head;
+   struct btrfs_delayed_ref_head *insert_head;
+   struct btrfs_delayed_data_ref *insert_dref;
+   struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+   struct inmem_hash *found_hash;
+   int free_insert = 1;
+   int qrecord_inserted = 0;
+   u64 ref_root = root->root_key.objectid;
+   u64 bytenr;
+   u32 num_bytes;
+
+   insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+   if (!insert_head)
+   return -ENOMEM;
+   insert_head->extent_op = NULL;
+
+   insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+   if (!insert_dref) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+   return -ENOMEM;
+   }
+   if (test_bit(BTRFS_FS_QUOTA_ENABLED, >fs_info->flags) &&
+   is_fstree(ref_root)) {
+   insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+   if (!insert_qrecord) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep,
+   insert_head);
+   kmem_cache_free(btrfs_delayed_data_ref_cachep,
+   insert_dref);
+   return -ENOMEM;
+   }
+   }
+
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto free_mem;
+   }
+
+again:
+   mutex_lock(_info->lock);
+   found_hash = inmem_search_hash(dedupe_info, hash->hash);
+   /* If we don't find a duplicated extent, just return. */
+   if (!found_hash) {
+   ret = 0;
+   goto out;
+   }
+   bytenr = found_hash->bytenr;
+   num_bytes = found_hash->num_bytes;
+
+   btrfs_init_delayed_ref_head(insert_head, insert_qrecord, bytenr,
+   num_bytes, ref_root, 0, BTRFS_ADD_DELAYED_REF, true,
+   false);
+
+   btrfs_init_delayed_ref_common(trans->fs_info, _dref->node,
+   bytenr, num_bytes, ref_root, BTRFS_ADD_DELAYED_REF,
+   BTRFS_EXTENT_DATA_REF_KEY);
+   insert_dref->root = ref_root;
+   insert_dref->parent = 0;
+   insert_dref->objectid = btrfs_ino(BTRFS_I(inode));
+   insert_dref->offset =