from:"Qu Wenruo"

Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-07 Thread Qu Wenruo



On 2018/12/8 8:47 AM, David Sterba wrote:
> On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/12/7 3:35 AM, David Sterba wrote:
>>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>>>> This patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>>>
>>>>> Which is based on v4.20-rc1.
>>>>
>>>> Thanks, I'll add it to for-next soon.
>>>
>>> The branch was there for some time but not for at least a week (my
>>> mistake I did not notice in time). I've rebased it on top of recent
>>> misc-next, but without the delayed refs patchset from Josef.
>>>
>>> At the moment I'm considering it for merge to 4.21, there's still some
>>> time to pull it out in case it shows up to be too problematic. I'm
>>> mostly worried about the unknown interactions with the enospc updates or
>>
>> For that part, I don't think it would have some obvious problem for
>> enospc updates.
>>
>> As the user-noticeable effect is the delay of reloc tree deletion.
>>
>> Despite that, it's mostly transparent to extent allocation.
>>
>>> generally because of lack of qgroup and reloc code reviews.
>>
>> That's the biggest problem.
>>
>> However, most of the current qgroup + balance optimization is done inside
>> the qgroup code (to skip certain qgroup records), so if we're going to hit
>> a problem, this patchset is the most likely place to hit it.
>>
>> Later patches will mostly just keep tweaking qgroup without affecting
>> other parts.
>>
>> So I'm fine if you decide to pull it out for now.
> 
> I've adapted a stress test that unpacks a large tarball, snapshots
> every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> files from the original subvolume, now enhanced with qgroups just for the
> new snapshots inheriting the toplevel subvolume. Lockup.
> 
> It gets stuck in a snapshot call with the following stacktrace
> 
> [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]

It looks like something is wrong in the original subtree tracing.

Thanks for the report, I'll investigate it.
Qu

> [<0>] do_walk_down+0x681/0xb20 [btrfs]
> [<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
> [<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
> [<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
> [<0>] cleaner_kthread+0xf8/0x170 [btrfs]
> [<0>] kthread+0x121/0x140
> [<0>] ret_from_fork+0x27/0x50
> 
> and that's like the 10th snapshot and ~3rd deletion. This is qgroup show:
> 
> qgroupid         rfer         excl parent
> --------         ----         ---- ------
> 0/5         865.27MiB      1.66MiB ---
> 0/257           0.00B        0.00B ---
> 0/259           0.00B        0.00B ---
> 0/260       806.58MiB    637.25MiB ---
> 0/262           0.00B        0.00B ---
> 0/263           0.00B        0.00B ---
> 0/264           0.00B        0.00B ---
> 0/265           0.00B        0.00B ---
> 0/266           0.00B        0.00B ---
> 0/267           0.00B        0.00B ---
> 0/268           0.00B        0.00B ---
> 0/269           0.00B        0.00B ---
> 0/270       989.04MiB      1.22MiB ---
> 0/271           0.00B        0.00B ---
> 0/272       922.25MiB    416.00KiB ---
> 0/273       931.02MiB      1.50MiB ---
> 0/274       910.94MiB      1.52MiB ---
> 1/1           1.64GiB      1.64GiB
> 0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274
> 
> No IO or cpu activity at this point, the stacktrace and show output
> remains the same.
> 
> So, considering this, I'm not going to add the patchset to 4.21 but will
> keep it in for-next for testing, any fixups or updates will be applied.
> 




Re: System unable to mount partition after a power loss

2018-12-06 Thread Qu Wenruo



On 2018/12/7 1:24 PM, Doni Crosby wrote:
> All,
> 
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
> 
> Note: The vda1 is just the entire disk being passed from the VM host
> to the VM it's not an actual true virtual block device
> 
> [  499.704398] BTRFS info (device vda1): disk space caching is enabled
> [  499.704401] BTRFS info (device vda1): has skinny extents
> [  499.739522] BTRFS error (device vda1): parent transid verify failed
> on 3563231428608 wanted 5184691 found 5183327

A transid mismatch normally means the fs is corrupted to some degree.

And judging by your mount failure, it looks like the fs is badly damaged.

What's the kernel version used in the VM?
I doubt the VM is always running the latest kernel.
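
For reference, the "wanted X found Y" pair in these messages comes from comparing the generation a parent pointer expects with the generation stored in the child block's header. A minimal sketch of that comparison in C, assuming a simplified header layout; `demo_header` and `demo_verify_parent_transid` are hypothetical names for illustration, not the kernel's code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified view of a tree block header. */
struct demo_header {
	uint64_t bytenr;     /* logical address of this block */
	uint64_t generation; /* transid that last wrote this block */
};

/*
 * Return 0 when the child carries the generation its parent recorded,
 * -1 otherwise. The message mimics the shape of the log lines above.
 */
int demo_verify_parent_transid(const struct demo_header *child,
			       uint64_t parent_transid)
{
	assert(child != NULL);

	if (child->generation == parent_transid)
		return 0;

	fprintf(stderr,
		"parent transid verify failed on %llu wanted %llu found %llu\n",
		(unsigned long long)child->bytenr,
		(unsigned long long)parent_transid,
		(unsigned long long)child->generation);
	return -1;
}
```

Fed the numbers from the dmesg line above (block 3563231428608, wanted 5184691, found 5183327), this sketch returns -1, since the child is older than what its parent points to.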

> [  499.740257] BTRFS error (device vda1): parent transid verify failed
> on 3563231428608 wanted 5184691 found 5183327
> [  499.770847] BTRFS error (device vda1): open_ctree failed
> 
> I have tried running btrfsck:
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
> parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
> parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
> parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
> parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
> parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
> parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689

According to your later dump-super output, it looks quite likely that
the corrupted extents all belong to the extent tree.

So it's still possible that your fs tree and other essential trees are OK.

Please dump the following output (with its stderr) to further confirm
the damage.
# btrfs ins dump-tree -b 31801344 --follow /dev/vda1

If your objective is only to recover data, then you could start by trying
btrfs-restore.
It's pretty hard to fix a heavily damaged extent tree.

Thanks,
Qu
> Ignoring transid failure
> Checking filesystem on /dev/vda1
> UUID: 7c76bb05-b3dc-4804-bf56-88d010a214c6
> checking extents
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> Ignoring transid failure
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> Ignoring transid failure
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> Ignoring transid failure
> parent transid verify

Re: BTRFS RAID filesystem unmountable

2018-12-06 Thread Qu Wenruo



On 2018/12/7 7:15 AM, Michael Wade wrote:
> Hi Qu,
> 
> Me again! Having formatted the drives and rebuilt the RAID array I
> seem to be having the same problem as before (no power cut this
> time [I bought a UPS]).

But strangely, your super block shows it has a log tree, which means
you either hit a kernel panic/transaction abort, or an unexpected power
loss.

> The btrfs volume is broken on my ReadyNAS.
> 
> I have attached the results of some of the commands you asked me to
> run last time, and I am hoping you might be able to help me out.

This time, the problem is more serious: some chunk tree blocks are not
even inside the system chunk range, so no wonder it fails to mount.

To confirm it, you could run "btrfs ins dump-tree -b 17725903077376
" and paste the output.

But I don't have any clue about the cause. My guess is some kernel problem
related to new chunk allocation, or the chunk root node itself is already
seriously corrupted.

Considering how old your kernel is (4.4), using btrfs on it is not
recommended, unless it carries a large set of backported btrfs fixes.

Thanks,
Qu

> 
> Kind regards
> Michael
> On Sat, 19 May 2018 at 12:43, Michael Wade wrote:
>>
>> I have let the find root command run for 14+ days, it's produced a
>> pretty huge log file 1.6 GB but still hasn't completed. I think I will
>> start the process of reformatting my drives and starting over.
>>
>> Thanks for your help anyway.
>>
>> Kind regards
>> Michael
>>
>> On 5 May 2018 at 01:43, Qu Wenruo wrote:
>>>
>>>
>>> On 2018-05-05 00:18, Michael Wade wrote:
>>>> Hi Qu,
>>>>
>>>> The tool is still running and the log file is now ~300mb. I guess it
>>>> shouldn't normally take this long.. Is there anything else worth
>>>> trying?
>>>
>>> I'm afraid not much.
>>>
>>> Although it is possible to modify btrfs-find-root to do a much
>>> faster but limited search.
>>>
>>> But from the result, it looks like underlying device corruption, and
>>> there is not much we can do right now.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> Kind regards
>>>> Michael
>>>>
>>>> On 2 May 2018 at 06:29, Michael Wade wrote:
>>>>> Thanks Qu,
>>>>>
>>>>> I actually aborted the run with the old btrfs tools once I saw its
>>>>> output. The new btrfs tools is still running and has produced a log
>>>>> file of ~85mb filled with that content so far.
>>>>>
>>>>> Kind regards
>>>>> Michael
>>>>>
>>>>> On 2 May 2018 at 02:31, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2018-05-01 23:50, Michael Wade wrote:
>>>>>>> Hi Qu,
>>>>>>>
>>>>>>> Oh dear that is not good news!
>>>>>>>
>>>>>>> I have been running the find root command since yesterday but it only
>>>>>>> seems to be outputting the following message:
>>>>>>>
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>
>>>>>> It's mostly fine, as find-root will go through all tree blocks and try
>>>>>> to read them as tree blocks.
>>>>>> Although btrfs-find-root suppresses csum error output, such basic
>>>>>> tree validation checks are not suppressed, thus you get this message.
>>>>>>
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>>
>>>>>>> I tried with the latest btrfs tools compiled from source and the ones
>>>>>>> I have installed with the same result. Is there a CLI utility I could
>>>>>>> use to determine if the log cont

Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-06 Thread Qu Wenruo

On 2018/12/7 3:35 AM, David Sterba wrote:
> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>
>>> Which is based on v4.20-rc1.
>>
>> Thanks, I'll add it to for-next soon.
> 
> The branch was there for some time but not for at least a week (my
> mistake I did not notice in time). I've rebased it on top of recent
> misc-next, but without the delayed refs patchset from Josef.
> 
> At the moment I'm considering it for merge to 4.21, there's still some
> time to pull it out in case it shows up to be too problematic. I'm
> mostly worried about the unknown interactions with the enospc updates or

For that part, I don't think it would have some obvious problem for
enospc updates.

As the user-noticeable effect is the delay of reloc tree deletion.

Despite that, it's mostly transparent to extent allocation.

> generally because of lack of qgroup and reloc code reviews.

That's the biggest problem.

However, most of the current qgroup + balance optimization is done inside
the qgroup code (to skip certain qgroup records), so if we're going to hit
a problem, this patchset is the most likely place to hit it.

Later patches will mostly just keep tweaking qgroup without affecting
other parts.

So I'm fine if you decide to pull it out for now.

Thanks,
Qu

> 
> I'm going to do some testing of the rebased branch before I add it to
> for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
> please check if everything is still ok there. Thanks.
> 


[PATCH 4/8] btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref()

2018-12-05 Thread Qu Wenruo

Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor
btrfs_add_delayed_data_ref().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 19 +--
 fs/btrfs/delayed-ref.h |  8 +++-
 fs/btrfs/extent-tree.c | 24 +++-
 3 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index c42b8ade7b07..09caf1e6fc22 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -800,21 +800,27 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root,
-  u64 bytenr, u64 num_bytes,
-  u64 parent, u64 ref_root,
-  u64 owner, u64 offset, u64 reserved, int action,
-  int *old_ref_mod, int *new_ref_mod)
+  struct btrfs_ref *generic_ref,
+  u64 reserved, int *old_ref_mod,
+  int *new_ref_mod)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_delayed_data_ref *ref;
struct btrfs_delayed_ref_head *head_ref;
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
+   int action = generic_ref->action;
int qrecord_inserted;
int ret;
+   u64 bytenr = generic_ref->bytenr;
+   u64 num_bytes = generic_ref->len;
+   u64 parent = generic_ref->parent;
+   u64 ref_root = generic_ref->data_ref.ref_root;
+   u64 owner = generic_ref->data_ref.ino;
+   u64 offset = generic_ref->data_ref.offset;
u8 ref_type;
 
+   ASSERT(generic_ref && generic_ref->type == BTRFS_REF_DATA && action);
ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
if (!ref)
return -ENOMEM;
@@ -838,7 +844,8 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
}
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-   is_fstree(ref_root) && is_fstree(root->root_key.objectid)) {
+   is_fstree(ref_root) && is_fstree(generic_ref->real_root) &&
+   !generic_ref->skip_qgroup) {
record = kzalloc(sizeof(*record), GFP_NOFS);
if (!record) {
kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index dbe029c4e01b..a8fde33b43fd 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -337,11 +337,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod);
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root,
-  u64 bytenr, u64 num_bytes,
-  u64 parent, u64 ref_root,
-  u64 owner, u64 offset, u64 reserved, int action,
-  int *old_ref_mod, int *new_ref_mod);
+  struct btrfs_ref *generic_ref,
+  u64 reserved, int *old_ref_mod,
+  int *new_ref_mod);
 int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ecfa0234863b..fa5dd3dfe2e7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2049,10 +2049,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref,
NULL, &old_ref_mod, &new_ref_mod);
} else {
-   ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
-num_bytes, parent,
-root_objectid, owner, offset,
-0, BTRFS_ADD_DELAYED_REF,
+   btrfs_init_data_ref(&generic_ref, root_objectid, owner, offset);
+   ret = btrfs_add_delayed_data_ref(trans, &generic_ref, 0,
 &old_ref_mod, &new_ref_mod);
}
 
@@ -7114,10 +7112,8 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
 &old_ref_mod, &new_ref_mod);
} else {
-   ret = btrfs_add_delayed_data_ref(

[PATCH 8/8] btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()

2018-12-05 Thread Qu Wenruo

Similar to btrfs_inc_extent_ref(), just use btrfs_ref to replace the
long parameter list and the confusing @owner parameter.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  5 +---
 fs/btrfs/extent-tree.c | 53 ++
 fs/btrfs/file.c| 23 ++
 fs/btrfs/inode.c   | 13 +++
 fs/btrfs/relocation.c  | 26 +
 5 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index db3df5ce6087..9ed55a29993d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2671,10 +2671,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info,
u64 bytenr, u64 num_bytes, u64 flags,
int level, int is_data);
-int btrfs_free_extent(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
- u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset, bool for_reloc);
+int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref);
 
 int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
   u64 start, u64 len, int delalloc);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ff60091aef6b..8a6a73006dc4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3255,10 +3255,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
if (inc)
ret = btrfs_inc_extent_ref(trans, &generic_ref);
else
-   ret = btrfs_free_extent(trans, root, bytenr,
-   num_bytes, parent, ref_root,
-   key.objectid, key.offset,
-   for_reloc);
+   ret = btrfs_free_extent(trans, &generic_ref);
if (ret)
goto fail;
} else {
@@ -3272,9 +3269,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
if (inc)
ret = btrfs_inc_extent_ref(trans, &generic_ref);
else
-   ret = btrfs_free_extent(trans, root, bytenr,
-   num_bytes, parent, ref_root,
-   level - 1, 0, for_reloc);
+   ret = btrfs_free_extent(trans, &generic_ref);
if (ret)
goto fail;
}
@@ -7073,47 +7068,43 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 }
 
 /* Can return -ENOMEM */
-int btrfs_free_extent(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
- u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset, bool for_reloc)
+int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref)
 {
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   struct btrfs_ref generic_ref = { 0 };
+   struct btrfs_fs_info *fs_info = trans->fs_info;
int old_ref_mod, new_ref_mod;
int ret;
 
if (btrfs_is_testing(fs_info))
return 0;
 
-   btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF, bytenr,
-  num_bytes, root->root_key.objectid, parent);
-   generic_ref.skip_qgroup = for_reloc;
/*
 * tree log blocks never actually go into the extent allocation
 * tree, just update pinning info and exit early.
 */
-   if (root_objectid == BTRFS_TREE_LOG_OBJECTID) {
-   WARN_ON(owner >= BTRFS_FIRST_FREE_OBJECTID);
+   if ((ref->type == BTRFS_REF_METADATA &&
+ref->tree_ref.root == BTRFS_TREE_LOG_OBJECTID) ||
+   (ref->type == BTRFS_REF_DATA &&
+ref->data_ref.ref_root == BTRFS_TREE_LOG_OBJECTID)) {
/* unlocks the pinned mutex */
-   btrfs_pin_extent(fs_info, bytenr, num_bytes, 1);
+   btrfs_pin_extent(fs_info, ref->bytenr, ref->len, 1);
old_ref_mod = new_ref_mod = 0;
ret = 0;
-   } else if (owner < BTRFS_FIRST_FREE_OBJECTID) {
-   btrfs_init_tree_ref(&generic_ref, (int)owner, root_objectid);
-   ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
+   } else if (ref->type == BTRFS_REF_METADATA) {
+   ret = btrfs_add_delayed_tree_ref(trans, ref, NULL,
 &old_ref_mod, &new_ref_mod);
} else {
-   btrfs_init_data_ref(&generic_ref, root_objectid, owner, offset);
-   ret = btrfs_add_

[PATCH 2/8] btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref

2018-12-05 Thread Qu Wenruo

The process_func is never a function hook used anywhere else.

Open code it to make later delayed ref refactor easier, so we can
refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
patches.
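
As an illustration of why open-coding helps: a single function pointer forces both callees to keep one shared signature, while explicit branches let each call site change on its own. A small sketch in C, using hypothetical stand-ins (`demo_inc_ref`, `demo_free_ref`) rather than the real btrfs functions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for btrfs_inc_extent_ref()/btrfs_free_extent(). */
int demo_inc_ref(int *refs)
{
	(*refs)++;
	return 0;
}

int demo_free_ref(int *refs)
{
	if (*refs == 0)
		return -1; /* nothing left to drop */
	(*refs)--;
	return 0;
}

/* Before: one hook, so both callees must share a single prototype. */
int demo_mod_ref_via_hook(int *refs, bool inc)
{
	int (*process_func)(int *) = inc ? demo_inc_ref : demo_free_ref;

	assert(refs != NULL);
	return process_func(refs);
}

/* After: each branch names its callee, so either can be changed alone. */
int demo_mod_ref_open_coded(int *refs, bool inc)
{
	assert(refs != NULL);
	if (inc)
		return demo_inc_ref(refs);
	return demo_free_ref(refs);
}
```

Both variants behave identically here; the open-coded form only matters once one of the two callees gets a different parameter list, which is what the later patches in this series do.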

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ea2c3d5220f0..ea68d288d761 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3220,10 +3220,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
int i;
int level;
int ret = 0;
-   int (*process_func)(struct btrfs_trans_handle *,
-   struct btrfs_root *,
-   u64, u64, u64, u64, u64, u64, bool);
-
 
if (btrfs_is_testing(fs_info))
return 0;
@@ -3235,11 +3231,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state) && level == 0)
return 0;
 
-   if (inc)
-   process_func = btrfs_inc_extent_ref;
-   else
-   process_func = btrfs_free_extent;
-
if (full_backref)
parent = buf->start;
else
@@ -3261,17 +3252,29 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 
num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
key.offset -= btrfs_file_extent_offset(buf, fi);
-   ret = process_func(trans, root, bytenr, num_bytes,
-  parent, ref_root, key.objectid,
-  key.offset, for_reloc);
+   if (inc)
+   ret = btrfs_inc_extent_ref(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   key.objectid, key.offset,
+   for_reloc);
+   else
+   ret = btrfs_free_extent(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   key.objectid, key.offset,
+   for_reloc);
if (ret)
goto fail;
} else {
bytenr = btrfs_node_blockptr(buf, i);
num_bytes = fs_info->nodesize;
-   ret = process_func(trans, root, bytenr, num_bytes,
-  parent, ref_root, level - 1, 0,
-  for_reloc);
+   if (inc)
+   ret = btrfs_inc_extent_ref(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   level - 1, 0, for_reloc);
+   else
+   ret = btrfs_free_extent(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   level - 1, 0, for_reloc);
if (ret)
goto fail;
}
-- 
2.19.2

[PATCH 7/8] btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()

2018-12-05 Thread Qu Wenruo

Now we don't need to play the dirty game of reusing @owner for tree block
level.
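
For readers unfamiliar with the trick being removed: in the old interface a tree block caller passed its level in @owner, and the callee told metadata from data by comparing @owner against BTRFS_FIRST_FREE_OBJECTID (see the removed `owner < BTRFS_FIRST_FREE_OBJECTID` lines in the diff). A minimal sketch of that convention in C, with hypothetical `demo_` names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as the kernel's BTRFS_FIRST_FREE_OBJECTID. */
#define DEMO_FIRST_FREE_OBJECTID 256ULL

/* Old convention: an @owner below the first free objectid is a tree level. */
bool demo_owner_is_metadata(uint64_t owner)
{
	return owner < DEMO_FIRST_FREE_OBJECTID;
}

/* Old-style tree block caller: the level travels in the @owner slot. */
uint64_t demo_owner_for_tree_block(int level)
{
	assert(level >= 0 && (uint64_t)level < DEMO_FIRST_FREE_OBJECTID);
	return (uint64_t)level;
}
```

With an explicit type field in the reference descriptor, the callee no longer has to guess what @owner means from its magnitude.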

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  6 ++---
 fs/btrfs/extent-tree.c | 58 ++
 fs/btrfs/file.c| 20 ++-
 fs/btrfs/inode.c   | 10 +---
 fs/btrfs/ioctl.c   | 17 -
 fs/btrfs/relocation.c  | 44 
 fs/btrfs/tree-log.c| 12 ++---
 7 files changed, 100 insertions(+), 67 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6f4b1e605736..db3df5ce6087 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -40,6 +40,7 @@ extern struct kmem_cache *btrfs_bit_radix_cachep;
 extern struct kmem_cache *btrfs_path_cachep;
 extern struct kmem_cache *btrfs_free_space_cachep;
 struct btrfs_ordered_sum;
+struct btrfs_ref;
 
 #define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */
 
@@ -2682,10 +2683,7 @@ int btrfs_free_and_pin_reserved_extent(struct btrfs_fs_info *fs_info,
 void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info);
 int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans);
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
-struct btrfs_root *root,
-u64 bytenr, u64 num_bytes, u64 parent,
-u64 root_objectid, u64 owner, u64 offset,
-bool for_reloc);
+struct btrfs_ref *generic_ref);
 
 int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans);
 int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 70c05ca30d9a..ff60091aef6b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2026,36 +2026,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 /* Can return -ENOMEM */
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
-struct btrfs_root *root,
-u64 bytenr, u64 num_bytes, u64 parent,
-u64 root_objectid, u64 owner, u64 offset,
-bool for_reloc)
+struct btrfs_ref *generic_ref)
 {
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   struct btrfs_ref generic_ref = { 0 };
+   struct btrfs_fs_info *fs_info = trans->fs_info;
int old_ref_mod, new_ref_mod;
int ret;
 
-   BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID &&
-  root_objectid == BTRFS_TREE_LOG_OBJECTID);
+   BUG_ON(generic_ref->type == BTRFS_REF_NOT_SET ||
+  !generic_ref->action);
+   BUG_ON(generic_ref->type == BTRFS_REF_METADATA &&
+  generic_ref->tree_ref.root == BTRFS_TREE_LOG_OBJECTID);
 
-   btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_REF, bytenr,
-  num_bytes, root->root_key.objectid, parent);
-   generic_ref.skip_qgroup = for_reloc;
-   if (owner < BTRFS_FIRST_FREE_OBJECTID) {
-   btrfs_init_tree_ref(&generic_ref, (int)owner, root_objectid);
-   ret = btrfs_add_delayed_tree_ref(trans, &generic_ref,
+   if (generic_ref->type == BTRFS_REF_METADATA)
+   ret = btrfs_add_delayed_tree_ref(trans, generic_ref,
NULL, &old_ref_mod, &new_ref_mod);
-   } else {
-   btrfs_init_data_ref(&generic_ref, root_objectid, owner, offset);
-   ret = btrfs_add_delayed_data_ref(trans, &generic_ref, 0,
+   else
+   ret = btrfs_add_delayed_data_ref(trans, generic_ref, 0,
 &old_ref_mod, &new_ref_mod);
-   }
 
-   btrfs_ref_tree_mod(fs_info, &generic_ref);
+   btrfs_ref_tree_mod(fs_info, generic_ref);
 
if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0)
-   add_pinned_bytes(fs_info, &generic_ref);
+   add_pinned_bytes(fs_info, generic_ref);
 
return ret;
 }
@@ -3212,8 +3204,10 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
u32 nritems;
struct btrfs_key key;
struct btrfs_file_extent_item *fi;
+   struct btrfs_ref generic_ref = { 0 };
bool for_reloc = btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC);
int i;
+   int action;
int level;
int ret = 0;
 
@@ -3231,6 +3225,10 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
parent = buf->start;
else
parent = 0;
+   if (inc)
+   action = BTRFS_ADD_DELAYED_REF;
+   else
+   action = BTRFS_DROP_DELAYED_REF;
 
for (i = 0; i < nritems; i++) {
if (level == 0) {
@@ -3248,11 +3246,14 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 
num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
key.offset -= btrfs_file_e

[PATCH 5/8] btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod()

2018-12-05 Thread Qu Wenruo

It's a perfect match for btrfs_ref_tree_mod() to use btrfs_ref, as
btrfs_ref describes a metadata/data reference update comprehensively.

Now we have one less function use confusing owner/level trick.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 27 +++--
 fs/btrfs/ref-verify.c  | 53 --
 fs/btrfs/ref-verify.h  | 10 
 3 files changed, 42 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fa5dd3dfe2e7..1d812bc2c7fc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2038,9 +2038,6 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID &&
   root_objectid == BTRFS_TREE_LOG_OBJECTID);
 
-   btrfs_ref_tree_mod(root, bytenr, num_bytes, parent, root_objectid,
-  owner, offset, BTRFS_ADD_DELAYED_REF);
-
btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_REF, bytenr,
   num_bytes, root->root_key.objectid, parent);
generic_ref.skip_qgroup = for_reloc;
@@ -2054,6 +2051,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 &old_ref_mod, &new_ref_mod);
}
 
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
+
if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0) {
bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
 
@@ -7025,10 +7024,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
int old_ref_mod, new_ref_mod;
 
-   btrfs_ref_tree_mod(root, buf->start, buf->len, parent,
-  root->root_key.objectid,
-  btrfs_header_level(buf), 0,
-  BTRFS_DROP_DELAYED_REF);
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
 &old_ref_mod, &new_ref_mod);
BUG_ON(ret); /* -ENOMEM */
@@ -7089,11 +7085,6 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
if (btrfs_is_testing(fs_info))
return 0;
 
-   if (root_objectid != BTRFS_TREE_LOG_OBJECTID)
-   btrfs_ref_tree_mod(root, bytenr, num_bytes, parent,
-  root_objectid, owner, offset,
-  BTRFS_DROP_DELAYED_REF);
-
btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF, bytenr,
   num_bytes, root->root_key.objectid, parent);
generic_ref.skip_qgroup = for_reloc;
@@ -7117,6 +7108,9 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
 &old_ref_mod, &new_ref_mod);
}
 
+   if (root_objectid != BTRFS_TREE_LOG_OBJECTID)
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
+
if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0) {
bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
 
@@ -8083,14 +8077,11 @@ int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 
BUG_ON(root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
 
-   btrfs_ref_tree_mod(root, ins->objectid, ins->offset, 0,
-  root->root_key.objectid, owner, offset,
-  BTRFS_ADD_DELAYED_EXTENT);
-
btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
   ins->objectid, ins->offset,
   root->root_key.objectid, 0);
btrfs_init_data_ref(&generic_ref, root->root_key.objectid, owner, offset);
+   btrfs_ref_tree_mod(root->fs_info, &generic_ref);
ret = btrfs_add_delayed_data_ref(trans, &generic_ref,
 ram_bytes, NULL, NULL);
return ret;
@@ -8338,13 +8329,11 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
extent_op->is_data = false;
extent_op->level = level;
 
-   btrfs_ref_tree_mod(root, ins.objectid, ins.offset, parent,
-  root_objectid, level, 0,
-  BTRFS_ADD_DELAYED_EXTENT);
btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
   ins.objectid, ins.offset,
   root->root_key.objectid, parent);
btrfs_init_tree_ref(&generic_ref, level, root_objectid);
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref,
 extent_op, NULL, NULL);
if (ret)
diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
index d69

[PATCH 3/8] btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref()

2018-12-05 Thread Qu Wenruo

btrfs_add_delayed_tree_ref() has an ever-growing parameter list, and
some callers, like btrfs_inc_extent_ref(), pass the level through @owner
for delayed tree refs.

Instead of making the parameter list even longer, use btrfs_ref to
refactor it, so each parameter assignment is self-explanatory without the
dirty level/owner trick, and provide the basis for later refactoring.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 24 ++---
 fs/btrfs/delayed-ref.h |  4 +---
 fs/btrfs/extent-tree.c | 48 --
 3 files changed, 44 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 11dd46be4017..c42b8ade7b07 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -710,9 +710,7 @@ static void init_delayed_ref_common(struct btrfs_fs_info 
*fs_info,
  * transaction commits.
  */
 int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
-  u64 bytenr, u64 num_bytes, u64 parent,
-  u64 ref_root,  int level, bool for_reloc,
-  int action,
+  struct btrfs_ref *generic_ref,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod)
 {
@@ -722,10 +720,17 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
int qrecord_inserted;
-   bool is_system = (ref_root == BTRFS_CHUNK_TREE_OBJECTID);
+   bool is_system = (generic_ref->real_root == BTRFS_CHUNK_TREE_OBJECTID);
+   int action = generic_ref->action;
+   int level = generic_ref->tree_ref.level;
int ret;
+   u64 bytenr = generic_ref->bytenr;
+   u64 num_bytes = generic_ref->len;
+   u64 parent = generic_ref->parent;
u8 ref_type;
 
+   ASSERT(generic_ref && generic_ref->type == BTRFS_REF_METADATA &&
+   generic_ref->action);
BUG_ON(extent_op && extent_op->is_data);
ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
if (!ref)
@@ -738,7 +743,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
}
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-   is_fstree(ref_root) && !for_reloc) {
+   is_fstree(generic_ref->real_root) &&
+   is_fstree(generic_ref->tree_ref.root) &&
+   !generic_ref->skip_qgroup) {
record = kzalloc(sizeof(*record), GFP_NOFS);
if (!record) {
kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
@@ -753,13 +760,14 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
ref_type = BTRFS_TREE_BLOCK_REF_KEY;
 
init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
-   ref_root, action, ref_type);
-   ref->root = ref_root;
+   generic_ref->tree_ref.root, action, ref_type);
+   ref->root = generic_ref->tree_ref.root;
ref->parent = parent;
ref->level = level;
 
init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
- ref_root, 0, action, false, is_system);
+ generic_ref->tree_ref.root, 0, action, false,
+ is_system);
head_ref->extent_op = extent_op;
 
delayed_refs = &trans->transaction->delayed_refs;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index e36d6b05d85e..dbe029c4e01b 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -333,9 +333,7 @@ static inline void btrfs_put_delayed_ref_head(struct 
btrfs_delayed_ref_head *hea
 }
 
 int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
-  u64 bytenr, u64 num_bytes, u64 parent,
-  u64 ref_root, int level, bool for_reloc,
-  int action,
+  struct btrfs_ref *generic_ref,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod);
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ea68d288d761..ecfa0234863b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2031,6 +2031,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 bool for_reloc)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_ref generic_ref = { 0 };
int old_ref_mod, new_ref_mod;
int ret;
 
@@ -2040,13 +2041,13 @@ in

[PATCH 1/8] btrfs: delayed-ref: Introduce better documented delayed ref structures

2018-12-05 Thread Qu Wenruo

Current delayed ref interface has several problems:
- Longer and longer parameter lists
  bytenr
  num_bytes
  parent
  ref_root
  owner
  offset
  for_reloc << Only qgroup code cares.

- Different interpretations of the same parameter
  For a data ref, @owner above is the inode number owning this extent,
  while for a tree ref it's the level. They are even in different ranges:
  for level we only need 0~8, while for ino it's
  BTRFS_FIRST_FREE_OBJECTID~BTRFS_LAST_FREE_OBJECTID.

  And @offset doesn't even make sense for a tree ref.

  Such parameter reuse may look clever as a hidden union, but it
  destroys code readability.

To solve both problems, we introduce a new structure, btrfs_ref:

- Structure instead of long parameter list
  This makes later expansion easier, and better documented.

- Use btrfs_ref::type to distinguish data and tree ref

- Use proper union to store data/tree ref specific structures.

- Use separate functions to fill data/tree ref data, with a common generic
  function to fill common bytenr/num_bytes members.

All parameters will find their place in btrfs_ref, and an extra member,
real_root, inspired by the ref-verify code, is newly introduced for later
qgroup code, to record which tree triggered this extent modification.

This patch doesn't touch any existing code, but provides the basis for the
incoming refactors.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.h | 109 +
 1 file changed, 109 insertions(+)

diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index d8fa12d3f2cc..e36d6b05d85e 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -176,6 +176,81 @@ struct btrfs_delayed_ref_root {
u64 qgroup_to_skip;
 };
 
+enum btrfs_ref_type {
+   BTRFS_REF_NOT_SET = 0,
+   BTRFS_REF_DATA,
+   BTRFS_REF_METADATA,
+   BTRFS_REF_LAST,
+};
+
+struct btrfs_data_ref {
+   /*
+* For EXTENT_DATA_REF
+*
+* @ref_root:   current owner of the extent.
+*  may differ from btrfs_ref::real_root.
+* @ino:inode number of the owner.
+* @offset: *CALCULATED* offset. Not EXTENT_DATA key offset.
+*
+*/
+   u64 ref_root;
+   u64 ino;
+   u64 offset;
+};
+
+struct btrfs_tree_ref {
+   /* Common for all sub types and skinny combination */
+   int level;
+
+   /*
+* For TREE_BLOCK_REF (skinny metadata, either inline or keyed)
+*
+* root here may differ from btrfs_ref::real_root.
+*/
+   u64 root;
+
+   /* For non-skinny metadata, no special member needed */
+};
+
+struct btrfs_ref {
+   enum btrfs_ref_type type;
+   int action;
+
+   /*
+* Use full backref(SHARED_BLOCK_REF or SHARED_DATA_REF) for this
+* extent and its children.
+* Set for reloc trees.
+*/
+   unsigned int use_fullback:1;
+
+   /*
+* Whether this extent should go through qgroup record.
+* Normally false, but for certain case like delayed subtree scan,
+* this can hugely reduce qgroup overhead.
+*/
+   unsigned int skip_qgroup:1;
+
+   /*
+* Who owns this reference modification, optional.
+*
+* One example:
+* When creating reloc tree for source fs, it will increase tree block
+* ref for children tree blocks.
+* In that case, btrfs_ref::real_root = reloc tree,
+* while btrfs_ref::tree_ref::root = fs tree.
+*/
+   u64 real_root;
+   u64 bytenr;
+   u64 len;
+
+   /* Common @parent for SHARED_DATA_REF/SHARED_BLOCK_REF */
+   u64 parent;
+   union {
+   struct btrfs_data_ref data_ref;
+   struct btrfs_tree_ref tree_ref;
+   };
+};
+
 extern struct kmem_cache *btrfs_delayed_ref_head_cachep;
 extern struct kmem_cache *btrfs_delayed_tree_ref_cachep;
 extern struct kmem_cache *btrfs_delayed_data_ref_cachep;
@@ -184,6 +259,40 @@ extern struct kmem_cache *btrfs_delayed_extent_op_cachep;
 int __init btrfs_delayed_ref_init(void);
 void __cold btrfs_delayed_ref_exit(void);
 
+static inline void btrfs_init_generic_ref(struct btrfs_ref *generic_ref,
+   int action, u64 bytenr, u64 len, u64 real_root,
+   u64 parent)
+{
+   generic_ref->action = action;
+   generic_ref->bytenr = bytenr;
+   generic_ref->len = len;
+   generic_ref->real_root = real_root;
+   generic_ref->parent = parent;
+}
+
+static inline void btrfs_init_tree_ref(struct btrfs_ref *generic_ref,
+   int level, u64 root)
+{
+   /* If @real_root not set, use @root as fallback */
+   if (!generic_ref->real_root)
+   generic_ref->real_root = root;
+   generic_ref->tree_ref.level = level;
+   generic_ref->tree_ref.root = root;
+   generic_ref->type = BTRFS_REF_METADATA;
+}
+
+static inline void

[PATCH 6/8] btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()

2018-12-05 Thread Qu Wenruo

Since add_pinned_bytes() only needs to know if the extent is metadata
and if it's a chunk tree extent, btrfs_ref is a perfect match for it, as
we don't need the various owner/level tricks to determine the extent type.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1d812bc2c7fc..70c05ca30d9a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -738,14 +738,15 @@ static struct btrfs_space_info *__find_space_info(struct 
btrfs_fs_info *info,
return NULL;
 }
 
-static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
-bool metadata, u64 root_objectid)
+static void add_pinned_bytes(struct btrfs_fs_info *fs_info,
+struct btrfs_ref *ref)
 {
struct btrfs_space_info *space_info;
+   s64 num_bytes = -ref->len;
u64 flags;
 
-   if (metadata) {
-   if (root_objectid == BTRFS_CHUNK_TREE_OBJECTID)
+   if (ref->type == BTRFS_REF_METADATA) {
+   if (ref->tree_ref.root == BTRFS_CHUNK_TREE_OBJECTID)
flags = BTRFS_BLOCK_GROUP_SYSTEM;
else
flags = BTRFS_BLOCK_GROUP_METADATA;
@@ -2053,11 +2054,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
*trans,
 
btrfs_ref_tree_mod(fs_info, &generic_ref);
 
-   if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0) {
-   bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
-
-   add_pinned_bytes(fs_info, -num_bytes, metadata, root_objectid);
-   }
+   if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0)
+   add_pinned_bytes(fs_info, &generic_ref);
 
return ret;
 }
@@ -7059,8 +7057,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
*trans,
}
 out:
if (pin)
-   add_pinned_bytes(fs_info, buf->len, true,
-root->root_key.objectid);
+   add_pinned_bytes(fs_info, &generic_ref);
 
if (last_ref) {
/*
@@ -7111,11 +7108,8 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
if (root_objectid != BTRFS_TREE_LOG_OBJECTID)
btrfs_ref_tree_mod(fs_info, &generic_ref);
 
-   if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0) {
-   bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
-
-   add_pinned_bytes(fs_info, num_bytes, metadata, root_objectid);
-   }
+   if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0)
+   add_pinned_bytes(fs_info, &generic_ref);
 
return ret;
 }
-- 
2.19.2
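
As a rough illustration of the commit message above, the following userspace
sketch shows how a space-info flag could be picked from the ref type alone,
without the owner/level comparison. Everything named demo_* or DEMO_* is
hypothetical, the constant values are placeholders for this sketch, and the
data branch is an assumption, since the hunk above is cut off before it.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder constants for this sketch only. */
#define DEMO_BLOCK_GROUP_DATA		(1ULL << 0)
#define DEMO_BLOCK_GROUP_SYSTEM		(1ULL << 1)
#define DEMO_BLOCK_GROUP_METADATA	(1ULL << 2)
#define DEMO_CHUNK_TREE_OBJECTID	3ULL

enum demo_pin_type {
	DEMO_PIN_NOT_SET = 0,
	DEMO_PIN_DATA,
	DEMO_PIN_METADATA,
};

/* Trimmed-down stand-in for a ref; not the kernel structure. */
struct demo_pin_ref {
	enum demo_pin_type type;
	uint64_t len;
	uint64_t tree_root;	/* only meaningful for metadata refs */
};

/* Pick the block group flag from the explicit type tag. */
static uint64_t demo_pinned_flags(const struct demo_pin_ref *ref)
{
	assert(ref->type != DEMO_PIN_NOT_SET);

	if (ref->type == DEMO_PIN_METADATA) {
		if (ref->tree_root == DEMO_CHUNK_TREE_OBJECTID)
			return DEMO_BLOCK_GROUP_SYSTEM;
		return DEMO_BLOCK_GROUP_METADATA;
	}
	/* Assumed: anything that is not metadata is accounted as data. */
	return DEMO_BLOCK_GROUP_DATA;
}
```

The point of the sketch is only that the caller no longer has to pass a
separate "metadata" bool and a root objectid; both come from the ref itself.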

[PATCH 0/8] btrfs: Refactor delayed ref parameter list

2018-12-05 Thread Qu Wenruo

Current delayed ref interface has several problems:
- Longer and longer parameter lists
  bytenr
  num_bytes
  parent
   So far so good
  ref_root
  owner
  offset
   I don't feel well now
  for_reloc
   This parameter only makes sense for qgroup code, but we need
   to pass the parameter a long way.

  This makes any later expansion of the parameter list more and more tricky.

- Different interpretations of the same parameter
  For a data ref, @owner above is the inode number owning this extent,
  while for a tree ref it's the level. They are even in different ranges.

  For level we only need 0~8, while for ino it's
  BTRFS_FIRST_FREE_OBJECTID~BTRFS_LAST_FREE_OBJECTID, so it's still
  possible to distinguish them, but it's never straightforward to
  grasp.

  And @offset doesn't even make sense for a tree ref.

  Such parameter reuse may look clever as a hidden union, but it
  destroys code readability.

This patchset changes how we pass parameters for delayed refs.
Instead of calling the delayed ref interface like:
  ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, parent,
                             ref_root, owner, offset);
Or
  ret = btrfs_inc_extent_ref(trans, root, bytenr, nodesize, parent,
                             level, ref_root, 0);

We now call it like:
  btrfs_init_generic_ref(&ref, bytenr, num_bytes,
                         root->root_key.objectid, parent);
  btrfs_init_data_ref(&ref, ref_root, owner, offset);
  ret = btrfs_inc_extent_ref(trans, &ref);
Or
  btrfs_init_generic_ref(&ref, bytenr, num_bytes,
                         root->root_key.objectid, parent);
  btrfs_init_tree_ref(&ref, level, ref_root);
  ret = btrfs_inc_extent_ref(trans, &ref);

To determine if a ref is tree or data, instead of checking like:
  if (owner < BTRFS_FIRST_FREE_OBJECTID) {
  } else {
  }
We check the type directly:
  if (ref->type == BTRFS_REF_METADATA) {
  } else {
  }

And for new, minor members, we don't need to add a new
parameter to btrfs_add_delayed_tree|data_ref() or
btrfs_inc_extent_ref(); just assign them after the generic/data/tree init:
  btrfs_init_generic_ref(&ref, bytenr, num_bytes,
                         root->root_key.objectid, parent);
  btrfs_init_data_ref(&ref, ref_root, owner, offset);
  ref.skip_qgroup = true; /* @skip_qgroup defaults to false, so new
                             code doesn't need to care */
  ret = btrfs_inc_extent_ref(trans, &ref);

This should improve code readability and make later code easier to
write.


Qu Wenruo (8):
  btrfs: delayed-ref: Introduce better documented delayed ref structures
  btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref
  btrfs: delayed-ref: Use btrfs_ref to refactor
btrfs_add_delayed_tree_ref()
  btrfs: delayed-ref: Use btrfs_ref to refactor
btrfs_add_delayed_data_ref()
  btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod()
  btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()
  btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()
  btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()

 fs/btrfs/ctree.h   |  11 +--
 fs/btrfs/delayed-ref.c |  43 ++---
 fs/btrfs/delayed-ref.h | 121 +++--
 fs/btrfs/extent-tree.c | 195 +++--
 fs/btrfs/file.c|  43 +
 fs/btrfs/inode.c   |  23 +++--
 fs/btrfs/ioctl.c   |  17 ++--
 fs/btrfs/ref-verify.c  |  53 ++-
 fs/btrfs/ref-verify.h  |  10 +--
 fs/btrfs/relocation.c  |  70 +--
 fs/btrfs/tree-log.c|  12 ++-
 11 files changed, 375 insertions(+), 223 deletions(-)

-- 
2.19.2
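
For readers who want to play with the calling convention described in this
cover letter, here is a minimal userspace sketch of the same idea: a generic
part, an explicit type tag, and a union for the data/tree specific members.
All demo_* names are made up for illustration, the struct is a trimmed-down
mock rather than the kernel's btrfs_ref, and the data-ref helper in
particular is an assumption about how such a helper might look.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace mock of the idea only; every demo_* name is hypothetical. */
enum demo_ref_type {
	DEMO_REF_NOT_SET = 0,
	DEMO_REF_DATA,
	DEMO_REF_METADATA,
};

struct demo_ref {
	enum demo_ref_type type;
	int action;
	bool skip_qgroup;	/* false when the struct is zero-initialized */
	uint64_t real_root;
	uint64_t bytenr;
	uint64_t len;
	uint64_t parent;
	union {
		struct { uint64_t ref_root, ino, offset; } data_ref;
		struct { int level; uint64_t root; } tree_ref;
	};
};

/* Fill the members shared by data and tree refs. */
static void demo_init_generic_ref(struct demo_ref *ref, int action,
				  uint64_t bytenr, uint64_t len,
				  uint64_t real_root, uint64_t parent)
{
	assert(ref);
	ref->action = action;
	ref->bytenr = bytenr;
	ref->len = len;
	ref->real_root = real_root;
	ref->parent = parent;
}

/* Fill the tree-specific members and tag the ref as metadata. */
static void demo_init_tree_ref(struct demo_ref *ref, int level, uint64_t root)
{
	/* If real_root was not set, fall back to root. */
	if (!ref->real_root)
		ref->real_root = root;
	ref->tree_ref.level = level;
	ref->tree_ref.root = root;
	ref->type = DEMO_REF_METADATA;
}

/* Fill the data-specific members and tag the ref as data. */
static void demo_init_data_ref(struct demo_ref *ref, uint64_t ref_root,
			       uint64_t ino, uint64_t offset)
{
	if (!ref->real_root)
		ref->real_root = ref_root;
	ref->data_ref.ref_root = ref_root;
	ref->data_ref.ino = ino;
	ref->data_ref.offset = offset;
	ref->type = DEMO_REF_DATA;
}

/* The type check replaces the owner < FIRST_FREE_OBJECTID comparison. */
static bool demo_ref_is_metadata(const struct demo_ref *ref)
{
	return ref->type == DEMO_REF_METADATA;
}
```

A caller would zero-initialize a struct demo_ref, call the generic init, then
exactly one of the tree/data inits, and only then set optional members such
as skip_qgroup.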

Re: [PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-05 Thread Qu Wenruo



On 2018/12/5 下午9:40, David Sterba wrote:
> On Wed, Dec 05, 2018 at 02:40:12PM +0800, Qu Wenruo wrote:
>> GCC 8.2.1 will report the following warning with "make W=1":
>>
>>   ctree.c: In function 'btrfs_next_sibling_tree_block':
>>   ctree.c:2990:21: warning: 'slot' may be used uninitialized in this 
>> function [-Wmaybe-uninitialized]
>> path->slots[level] = slot;
>> ~~~^~
>>
>> The culprit is the following code:
>>
>>  int slot;   << Not initialized
>>  int level = path->lowest_level + 1;
>>  BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>>  while(level < BTRFS_MAX_LEVEL) {
>>  slot = path->slots[level] + 1;
>>  ^^ but we initialize @slot here.
>>  ...
>>  }
>>  path->slots[level] = slot;
>>
>> It's possible that compiler doesn't get enough hint for BUG_ON() on
>> lowest_level + 1 >= BTRFS_MAX_LEVEL case.
>>
>> Fix it by using a do {} while() loop other than while() {} loop, to
>> ensure we will run the loop for at least once.
> 
> I was hoping that we can actually add the hint to BUG_ON so the code
> does not continue if the condition is true.
> 
I checked that method, but I'm not that confident about things like:

bugon_trace()
{
if (!val)
return;
__bugon_trace();
}

__attribute__ ((noreturn))
static inline void __bugon_trace();

This adds just one extra function call, but the original problem was
also just one more function call before hitting abort().

So I gave up tweaking things I'm not comfortable enough with.

The current do {} while() loop is the most direct solution; if gcc one
day still gives such a warning, then I may have some harsh words to say.

Thanks,
Qu
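
To illustrate the do {} while() approach discussed in this reply, here is a
small standalone sketch. It is not the btrfs-progs function; demo_next_slot(),
DEMO_MAX_LEVEL and the slots/nritems arrays are hypothetical stand-ins. The
only point is that the loop body runs at least once, so the compiler can see
that slot is assigned before it is read.

```c
#include <assert.h>

#define DEMO_MAX_LEVEL 8

/*
 * Walk up from lowest_level + 1 until a level has a next slot.
 * Returns that slot and stores the level in *level_out, or returns -1
 * (leaving *level_out untouched) if every level is exhausted.
 */
static int demo_next_slot(const int *slots, const int *nritems,
			  int lowest_level, int *level_out)
{
	int level = lowest_level + 1;
	int slot;	/* deliberately not initialized here */

	assert(level < DEMO_MAX_LEVEL);
	do {
		/* The body runs at least once, so slot is always assigned. */
		slot = slots[level] + 1;
		if (slot < nritems[level])
			break;
		level++;
	} while (level < DEMO_MAX_LEVEL);

	if (level == DEMO_MAX_LEVEL)
		return -1;
	*level_out = level;
	return slot;
}
```

With a plain while() loop the compiler has to prove from the earlier check
that the condition holds on entry; the do {} while() form removes that
dependency.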




Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Qu Wenruo



On 2018/12/5 下午8:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

I'm not familiar with DAX, so it's entirely possible I'm talking like
an idiot.

If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.

Force a page fault for every mmap() read/write (completely disabling the
page cache, like DIO), so that we get some control, since we're informed
to read the page and can do some hacks there.

Thanks,
Qu
> 
> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?
> 
> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.
> 
> 
> [PATCH 01/10] btrfs: create a mount option for dax
> [PATCH 02/10] btrfs: basic dax read
> [PATCH 03/10] btrfs: dax: read zeros from holes
> [PATCH 04/10] Rename __endio_write_update_ordered() to
> [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
> [PATCH 06/10] btrfs: dax write support
> [PATCH 07/10] dax: export functions for use with btrfs
> [PATCH 08/10] btrfs: dax add read mmap path
> [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
> [PATCH 10/10] btrfs: dax mmap write
> 
>  fs/btrfs/Makefile   |1 
>  fs/btrfs/ctree.h|   17 ++
>  fs/btrfs/dax.c  |  303 
> ++--
>  fs/btrfs/file.c |   29 
>  fs/btrfs/inode.c|   54 +
>  fs/btrfs/ioctl.c|5 
>  fs/btrfs/super.c|   15 ++
>  fs/dax.c|   35 --
>  include/linux/dax.h |   16 ++
>  9 files changed, 430 insertions(+), 45 deletions(-)
> 
> 




[PATCH v2 11/13] btrfs-progs: Introduce rescue.h to resolve missing-prototypes for chunk and super rescue

2018-12-04 Thread Qu Wenruo

No header declares btrfs_recover_chunk_tree() or
btrfs_recover_superblocks(), so W=1 gives missing-prototypes warnings
for them.

Fix it by introducing a new header, rescue.h, for these two functions, so
make W=1 is much happier.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 chunk-recover.c |  1 +
 cmds-rescue.c   |  4 +---
 rescue.h| 21 +
 super-recover.c |  1 +
 4 files changed, 24 insertions(+), 3 deletions(-)
 create mode 100644 rescue.h

diff --git a/chunk-recover.c b/chunk-recover.c
index 1d30db51d8ed..1e554b8e8750 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -40,6 +40,7 @@
 #include "utils.h"
 #include "btrfsck.h"
 #include "commands.h"
+#include "rescue.h"
 
 struct recover_control {
int verbose;
diff --git a/cmds-rescue.c b/cmds-rescue.c
index 2bc50c0841ed..36e9e1277e40 100644
--- a/cmds-rescue.c
+++ b/cmds-rescue.c
@@ -26,15 +26,13 @@
 #include "commands.h"
 #include "utils.h"
 #include "help.h"
+#include "rescue.h"
 
 static const char * const rescue_cmd_group_usage[] = {
"btrfs rescue  [options] ",
NULL
 };
 
-int btrfs_recover_chunk_tree(const char *path, int verbose, int yes);
-int btrfs_recover_superblocks(const char *path, int verbose, int yes);
-
 static const char * const cmd_rescue_chunk_recover_usage[] = {
"btrfs rescue chunk-recover [options] ",
"Recover the chunk tree by scanning the devices one by one.",
diff --git a/rescue.h b/rescue.h
new file mode 100644
index ..de486e2e2004
--- /dev/null
+++ b/rescue.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2018 SUSE.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef __BTRFS_RESCUE_H__
+#define __BTRFS_RESCUE_H__
+
+int btrfs_recover_superblocks(const char *path, int verbose, int yes);
+int btrfs_recover_chunk_tree(const char *path, int verbose, int yes);
+
+#endif
diff --git a/super-recover.c b/super-recover.c
index 86b3df9867dc..a1af71786034 100644
--- a/super-recover.c
+++ b/super-recover.c
@@ -34,6 +34,7 @@
 #include "crc32c.h"
 #include "volumes.h"
 #include "commands.h"
+#include "rescue.h"
 
 struct btrfs_recover_superblock {
struct btrfs_fs_devices *fs_devices;
-- 
2.19.2
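
As a side note on -Wmissing-prototypes itself, the sketch below shows the two
usual ways to satisfy it in a single file. The function names are invented
for illustration; in a real tree the prototype would live in a shared header
(as rescue.h does above) rather than next to the definition.

```c
#include <assert.h>

/*
 * Fix 1: a prototype is visible before the definition. Normally this
 * line sits in a header included by both the caller and the definer.
 */
int demo_recover(int verbose);

int demo_recover(int verbose)
{
	return verbose ? 1 : 0;
}

/*
 * Fix 2: the function is only used in this file, so make it static.
 * Static functions do not need a separate prototype.
 */
static int demo_double(int x)
{
	return x * 2;
}
```

Which fix applies depends only on whether the function is called from other
source files.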

[PATCH v2 03/13] btrfs-progs: Makefile.extrawarn: Don't warn on sign compare

2018-12-04 Thread Qu Wenruo

In most cases we just use 'int' where 'unsigned int' is meant, and don't
care about the sign.

-Wsign-compare is causing tons of false alerts.
Suppressing it makes W=1 less noisy so we can focus on real
problems, while still enabling it in W=3 builds.

Signed-off-by: Qu Wenruo 
---
 Makefile.extrawarn | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Makefile.extrawarn b/Makefile.extrawarn
index 0c11f2450802..9b4cace01ce4 100644
--- a/Makefile.extrawarn
+++ b/Makefile.extrawarn
@@ -54,6 +54,7 @@ warning-1 += $(call cc-option, -Wmissing-include-dirs)
 warning-1 += $(call cc-option, -Wunused-but-set-variable)
 warning-1 += $(call cc-disable-warning, missing-field-initializers)
 warning-1 += $(call cc-disable-warning, format-truncation)
+warning-1 += $(call cc-disable-warning, sign-compare)
 
 warning-2 := -Waggregate-return
 warning-2 += -Wcast-align
@@ -74,6 +75,7 @@ warning-3 += -Wredundant-decls
 warning-3 += -Wswitch-default
 warning-3 += $(call cc-option, -Wpacked-bitfield-compat)
 warning-3 += $(call cc-option, -Wvla)
+warning-3 += $(call cc-option, -Wsign-compare)
 
 warning := $(warning-$(findstring 1, $(BUILD_ENABLE_EXTRA_GCC_CHECKS)))
 warning += $(warning-$(findstring 2, $(BUILD_ENABLE_EXTRA_GCC_CHECKS)))
-- 
2.19.2
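
For context on what -Wsign-compare flags, here is a tiny sketch with made-up
function names. It shows the case where a mixed signed/unsigned comparison
really does misbehave, which is why the warning stays available at W=3 even
though most hits are harmless.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Mixed comparison: 'i' is converted to unsigned int before comparing,
 * so a negative 'i' becomes a very large value. This is the pattern
 * -Wsign-compare reports.
 */
static bool demo_less_mixed(int i, unsigned int n)
{
	return i < n;
}

/* Handle negative values explicitly, then compare as unsigned. */
static bool demo_less_checked(int i, unsigned int n)
{
	return i < 0 || (unsigned int)i < n;
}
```

For non-negative values both helpers agree; they only differ once a negative
value sneaks in.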

[PATCH v2 09/13] btrfs-progs: Fix missing-prototypes warning caused by non-static functions

2018-12-04 Thread Qu Wenruo

Make the following functions static to avoid missing-prototypes warning:
 - btrfs.c::handle_special_globals()
 - check/mode-lowmem.c::repair_ternary_lowmem()
 - extent-tree.c::btrfs_search_overlap_extent()
 - free-space-tree.c::convert_free_space_to_bitmaps()
 - free-space-tree.c::convert_free_space_to_extents()
 - free-space-tree.c::__remove_from_free_space_tree()
 - free-space-tree.c::__add_to_free_space_tree()
 - free-space-tree.c::btrfs_create_tree()

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 btrfs.c |  2 +-
 check/mode-lowmem.c |  6 +++---
 extent-tree.c   |  2 +-
 free-space-tree.c   | 30 +++---
 4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/btrfs.c b/btrfs.c
index 2d39f2ced3e8..78c468d2e050 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -210,7 +210,7 @@ static int handle_global_options(int argc, char **argv)
return shift;
 }
 
-void handle_special_globals(int shift, int argc, char **argv)
+static void handle_special_globals(int shift, int argc, char **argv)
 {
int has_help = 0;
int has_full = 0;
diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 14bbc9ee6cb6..f56b5e8d45dc 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -953,9 +953,9 @@ out:
  * returns 0 means success.
  * returns not 0 means on error;
  */
-int repair_ternary_lowmem(struct btrfs_root *root, u64 dir_ino, u64 ino,
- u64 index, char *name, int name_len, u8 filetype,
- int err)
+static int repair_ternary_lowmem(struct btrfs_root *root, u64 dir_ino, u64 ino,
+u64 index, char *name, int name_len,
+u8 filetype, int err)
 {
struct btrfs_trans_handle *trans;
int stage = 0;
diff --git a/extent-tree.c b/extent-tree.c
index cd98633992ac..8c9cdeff3b02 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -3749,7 +3749,7 @@ static void __get_extent_size(struct btrfs_root *root, 
struct btrfs_path *path,
  * Return >0 for not found.
  * Return <0 for err
  */
-int btrfs_search_overlap_extent(struct btrfs_root *root,
+static int btrfs_search_overlap_extent(struct btrfs_root *root,
struct btrfs_path *path, u64 bytenr, u64 len)
 {
struct btrfs_key key;
diff --git a/free-space-tree.c b/free-space-tree.c
index 6641cdfa42ba..b3ffa90f704c 100644
--- a/free-space-tree.c
+++ b/free-space-tree.c
@@ -202,9 +202,9 @@ static void le_bitmap_set(unsigned long *map, unsigned int 
start, int len)
}
 }
 
-int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
- struct btrfs_block_group_cache *block_group,
- struct btrfs_path *path)
+static int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
+   struct btrfs_block_group_cache *block_group,
+   struct btrfs_path *path)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_root *root = fs_info->free_space_root;
@@ -341,9 +341,9 @@ out:
return ret;
 }
 
-int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
- struct btrfs_block_group_cache *block_group,
- struct btrfs_path *path)
+static int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
+   struct btrfs_block_group_cache *block_group,
+   struct btrfs_path *path)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_root *root = fs_info->free_space_root;
@@ -780,9 +780,9 @@ out:
return ret;
 }
 
-int __remove_from_free_space_tree(struct btrfs_trans_handle *trans,
- struct btrfs_block_group_cache *block_group,
- struct btrfs_path *path, u64 start, u64 size)
+static int __remove_from_free_space_tree(struct btrfs_trans_handle *trans,
+   struct btrfs_block_group_cache *block_group,
+   struct btrfs_path *path, u64 start, u64 size)
 {
struct btrfs_free_space_info *info;
u32 flags;
@@ -960,9 +960,9 @@ out:
return ret;
 }
 
-int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
-struct btrfs_block_group_cache *block_group,
-struct btrfs_path *path, u64 start, u64 size)
+static int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
+   struct btrfs_block_group_cache *block_group,
+   struct btrfs_path *path, u64 start, u64 size)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_free_space_info *info;
@@ -1420,9 +1420,9 @@ out:
return ret;
 }
 
-struct btrfs_root *btrfs_create_tree(struct btrf

[PATCH v2 12/13] btrfs-progs: Add utils.h include to solve missing-prototypes warning

2018-12-04 Thread Qu Wenruo

Prototypes for arg_strtou64() and lookup_path_rootid() are declared in
utils.h, but utils-lib.c doesn't include it, resulting in make W=1
warnings for them.

Just include that header to make W=1 happier.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 utils-lib.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/utils-lib.c b/utils-lib.c
index 044f93fc4446..5bb89f2f1a8d 100644
--- a/utils-lib.c
+++ b/utils-lib.c
@@ -1,4 +1,5 @@
 #include "kerncompat.h"
+#include "utils.h"
 #include 
 #include 
 #include 
-- 
2.19.2

[PATCH v2 04/13] btrfs-progs: Fix Wempty-body warning

2018-12-04 Thread Qu Wenruo

messages.h:49:24: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
PRINT_TRACE_ON_ERROR;\

Adding the extra braces solves the problem.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 messages.h | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/messages.h b/messages.h
index ec7d93381a36..16f650d19a4b 100644
--- a/messages.h
+++ b/messages.h
@@ -45,13 +45,16 @@
 
 #define error_on(cond, fmt, ...)   \
do {\
-   if ((cond)) \
+   if ((cond)) {   \
PRINT_TRACE_ON_ERROR;   \
-   if ((cond)) \
+   }   \
+   if ((cond)) {   \
PRINT_VERBOSE_ERROR;\
+   }   \
__btrfs_error_on((cond), (fmt), ##__VA_ARGS__); \
-   if ((cond)) \
+   if ((cond)) {   \
DO_ABORT_ON_ERROR;  \
+   }   \
} while (0)
 
 #define error_btrfs_util(err)  \
@@ -76,10 +79,12 @@
 
 #define warning_on(cond, fmt, ...) \
do {\
-   if ((cond)) \
+   if ((cond)) {   \
PRINT_TRACE_ON_ERROR;   \
-   if ((cond)) \
+   }   \
+   if ((cond)) {   \
PRINT_VERBOSE_ERROR;\
+   }   \
__btrfs_warning_on((cond), (fmt), ##__VA_ARGS__);   \
} while (0)
 
-- 
2.19.2
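
The warning and the fix can be reproduced in a few lines. In this sketch
DEMO_PRINT_TRACE stands in for a macro that expands to nothing in some
configurations; the macro and function names are invented for illustration
and are not the ones from messages.h.

```c
#include <assert.h>

/* Stand-in for a trace macro that is compiled out: expands to nothing. */
#define DEMO_PRINT_TRACE

/*
 * With braces, "if (cond) { ; }" has an explicit body, so the compiler
 * has nothing to complain about even when the macro is empty.
 * Without them, "if (cond) ;" triggers -Wempty-body.
 */
#define demo_check(cond, counter)		\
	do {					\
		if ((cond)) {			\
			DEMO_PRINT_TRACE;	\
		}				\
		if ((cond))			\
			(*(counter))++;		\
	} while (0)

/* Count negative values, going through the macro for each element. */
static int demo_count_errors(const int *vals, int n)
{
	int errors = 0;

	for (int i = 0; i < n; i++)
		demo_check(vals[i] < 0, &errors);
	return errors;
}
```

The behaviour is the same with or without the braces; only the diagnostic
changes.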

[PATCH v2 10/13] btrfs-progs: Move btrfs_check_nodesize to fsfeatures.c to fix missing-prototypes warning

2018-12-04 Thread Qu Wenruo

And fsfeatures.c is indeed a better location for that function.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 fsfeatures.c | 23 +++
 utils.c  | 23 ---
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/fsfeatures.c b/fsfeatures.c
index 7d85d60f1277..13ad030870cd 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -225,3 +225,26 @@ u32 get_running_kernel_version(void)
return version;
 }
 
+int btrfs_check_nodesize(u32 nodesize, u32 sectorsize, u64 features)
+{
+   if (nodesize < sectorsize) {
+   error("illegal nodesize %u (smaller than %u)",
+   nodesize, sectorsize);
+   return -1;
+   } else if (nodesize > BTRFS_MAX_METADATA_BLOCKSIZE) {
+   error("illegal nodesize %u (larger than %u)",
+   nodesize, BTRFS_MAX_METADATA_BLOCKSIZE);
+   return -1;
+   } else if (nodesize & (sectorsize - 1)) {
+   error("illegal nodesize %u (not aligned to %u)",
+   nodesize, sectorsize);
+   return -1;
+   } else if (features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS &&
+  nodesize != sectorsize) {
+   error(
+   "illegal nodesize %u (not equal to %u for mixed block group)",
+   nodesize, sectorsize);
+   return -1;
+   }
+   return 0;
+}
diff --git a/utils.c b/utils.c
index b274f46fdd9d..a7e34b804551 100644
--- a/utils.c
+++ b/utils.c
@@ -2266,29 +2266,6 @@ int btrfs_tree_search2_ioctl_supported(int fd)
return ret;
 }
 
-int btrfs_check_nodesize(u32 nodesize, u32 sectorsize, u64 features)
-{
-   if (nodesize < sectorsize) {
-   error("illegal nodesize %u (smaller than %u)",
-   nodesize, sectorsize);
-   return -1;
-   } else if (nodesize > BTRFS_MAX_METADATA_BLOCKSIZE) {
-   error("illegal nodesize %u (larger than %u)",
-   nodesize, BTRFS_MAX_METADATA_BLOCKSIZE);
-   return -1;
-   } else if (nodesize & (sectorsize - 1)) {
-   error("illegal nodesize %u (not aligned to %u)",
-   nodesize, sectorsize);
-   return -1;
-   } else if (features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS &&
-  nodesize != sectorsize) {
-   error("illegal nodesize %u (not equal to %u for mixed block 
group)",
-   nodesize, sectorsize);
-   return -1;
-   }
-   return 0;
-}
-
 /*
  * Copy a path argument from SRC to DEST and check the SRC length if it's at
  * most PATH_MAX and fits into DEST. DESTLEN is supposed to be exact size of
-- 
2.19.2

[PATCH v2 08/13] btrfs-progs: Fix Wtype-limits warning

2018-12-04 Thread Qu Wenruo

The only hit is the following code:

tlv_len = le16_to_cpu(tlv_hdr->tlv_len);

if (tlv_type == 0 || tlv_type > BTRFS_SEND_A_MAX
|| tlv_len > BTRFS_SEND_BUF_SIZE) {
error("invalid tlv in cmd tlv_type = %hu, tlv_len = 
%hu",
tlv_type, tlv_len);

@tlv_len is u16, while BTRFS_SEND_BUF_SIZE is 64K.
u16 MAX is 64K - 1, so the final check is always false.

Just remove it.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 send-stream.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/send-stream.c b/send-stream.c
index 3b8e39c9486a..25461e92c37b 100644
--- a/send-stream.c
+++ b/send-stream.c
@@ -157,8 +157,7 @@ static int read_cmd(struct btrfs_send_stream *sctx)
tlv_type = le16_to_cpu(tlv_hdr->tlv_type);
tlv_len = le16_to_cpu(tlv_hdr->tlv_len);
 
-   if (tlv_type == 0 || tlv_type > BTRFS_SEND_A_MAX
-   || tlv_len > BTRFS_SEND_BUF_SIZE) {
+   if (tlv_type == 0 || tlv_type > BTRFS_SEND_A_MAX) {
error("invalid tlv in cmd tlv_type = %hu, tlv_len = %hu",
tlv_type, tlv_len);
ret = -EINVAL;
-- 
2.19.2
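
The always-false check described in this patch can be reproduced with a small standalone sketch. The names SEND_BUF_SIZE and tlv_len_exceeds_buf() are hypothetical stand-ins, not btrfs-progs identifiers; the only facts taken from the patch are that the length field is a u16 and that the buffer size is 64K.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for BTRFS_SEND_BUF_SIZE, which the patch says is 64K. */
#define SEND_BUF_SIZE (64 * 1024)

/*
 * Return 1 if a 16-bit TLV length is larger than the 64K buffer, else 0.
 * The value is widened to 32 bits first so the comparison itself does not
 * trigger -Wtype-limits; the result is still 0 for every possible input,
 * because the largest u16 value is 64K - 1.
 */
int tlv_len_exceeds_buf(uint16_t tlv_len)
{
	uint32_t len = tlv_len;

	assert(len <= UINT16_MAX);
	return len > SEND_BUF_SIZE;
}
```

Since tlv_len_exceeds_buf(UINT16_MAX) is 0, dropping the comparison from read_cmd() should not change behaviour.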

[PATCH v2 13/13] btrfs-progs: free-space-tree: Remove unused function

2018-12-04 Thread Qu Wenruo

set_free_space_tree_thresholds() is never used, just remove it to solve
the missing-prototypes warning from make W=1.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 free-space-tree.c | 29 -
 1 file changed, 29 deletions(-)

diff --git a/free-space-tree.c b/free-space-tree.c
index b3ffa90f704c..af141e6e611a 100644
--- a/free-space-tree.c
+++ b/free-space-tree.c
@@ -24,35 +24,6 @@
 #include "bitops.h"
 #include "internal.h"
 
-void set_free_space_tree_thresholds(struct btrfs_block_group_cache *cache,
-   u64 sectorsize)
-{
-   u32 bitmap_range;
-   size_t bitmap_size;
-   u64 num_bitmaps, total_bitmap_size;
-
-   /*
-* We convert to bitmaps when the disk space required for using extents
-* exceeds that required for using bitmaps.
-*/
-   bitmap_range = sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
-   num_bitmaps = div_u64(cache->key.offset + bitmap_range - 1,
- bitmap_range);
-   bitmap_size = sizeof(struct btrfs_item) + BTRFS_FREE_SPACE_BITMAP_SIZE;
-   total_bitmap_size = num_bitmaps * bitmap_size;
-   cache->bitmap_high_thresh = div_u64(total_bitmap_size,
-   sizeof(struct btrfs_item));
-
-   /*
-* We allow for a small buffer between the high threshold and low
-* threshold to avoid thrashing back and forth between the two formats.
-*/
-   if (cache->bitmap_high_thresh > 100)
-   cache->bitmap_low_thresh = cache->bitmap_high_thresh - 100;
-   else
-   cache->bitmap_low_thresh = 0;
-}
-
 static struct btrfs_free_space_info *
 search_free_space_info(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info,
-- 
2.19.2

[PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-04 Thread Qu Wenruo

GCC 8.2.1 will report the following warning with "make W=1":

  ctree.c: In function 'btrfs_next_sibling_tree_block':
  ctree.c:2990:21: warning: 'slot' may be used uninitialized in this function [-Wmaybe-uninitialized]
path->slots[level] = slot;
~~~^~

The culprit is the following code:

int slot;   << Not initialized
int level = path->lowest_level + 1;
BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
while(level < BTRFS_MAX_LEVEL) {
slot = path->slots[level] + 1;
^^ but we initialize @slot here.
...
}
path->slots[level] = slot;

It's possible that the compiler doesn't get enough of a hint from BUG_ON()
for the lowest_level + 1 >= BTRFS_MAX_LEVEL case.

Fix it by using a do {} while() loop rather than a while() {} loop, to
ensure the loop body runs at least once.

Signed-off-by: Qu Wenruo 
---
 ctree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ctree.c b/ctree.c
index 46e2ccedc0bf..867e8b60b199 100644
--- a/ctree.c
+++ b/ctree.c
@@ -2966,7 +2966,7 @@ int btrfs_next_sibling_tree_block(struct btrfs_fs_info *fs_info,
struct extent_buffer *next = NULL;
 
BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
-   while(level < BTRFS_MAX_LEVEL) {
+   do {
if (!path->nodes[level])
return 1;
 
@@ -2986,7 +2986,7 @@ int btrfs_next_sibling_tree_block(struct btrfs_fs_info *fs_info,
if (!extent_buffer_uptodate(next))
return -EIO;
break;
-   }
+   } while (level < BTRFS_MAX_LEVEL);
path->slots[level] = slot;
while(1) {
level--;
-- 
2.19.2
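
The effect of the loop change can be illustrated with a reduced sketch. MAX_LEVEL, MAX_SLOT and next_sibling_slot() are hypothetical and only mimic the shape of the loop in the patch; this is not the btrfs-progs function.

```c
#include <assert.h>

#define MAX_LEVEL 8	/* hypothetical stand-in for BTRFS_MAX_LEVEL */
#define MAX_SLOT  4	/* hypothetical number of slots per node */

/*
 * Walk up from start_level until a level has room for a next slot and
 * return that slot. The do {} while () form makes it visible to the
 * compiler that the body, and therefore the assignment to slot, runs at
 * least once, so no -Wmaybe-uninitialized warning is expected.
 */
int next_sibling_slot(const int slots[MAX_LEVEL], int start_level)
{
	int slot;	/* intentionally not initialized here */
	int level = start_level;

	assert(level >= 0 && level < MAX_LEVEL);
	do {
		slot = slots[level] + 1;
		if (slot < MAX_SLOT)
			break;
		level++;
	} while (level < MAX_LEVEL);
	return slot;
}
```

With a plain while () loop the compiler has to prove from the preceding assertion that the first iteration happens; with do {} while () that proof is not needed.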

[PATCH v2 05/13] btrfs-progs: Fix Wimplicit-fallthrough warning

2018-12-04 Thread Qu Wenruo

Although most fallthrough cases are pretty obvious, we still need to teach
the dumb compiler that each one is an explicit fallthrough.

Also reformat the code to use common indentation.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 utils.c | 30 ++
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/utils.c b/utils.c
index a310300829eb..b274f46fdd9d 100644
--- a/utils.c
+++ b/utils.c
@@ -1134,15 +1134,25 @@ int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mod
num_divs = 0;
last_size = size;
switch (unit_mode & UNITS_MODE_MASK) {
-   case UNITS_TBYTES: base *= mult; num_divs++;
-   case UNITS_GBYTES: base *= mult; num_divs++;
-   case UNITS_MBYTES: base *= mult; num_divs++;
-   case UNITS_KBYTES: num_divs++;
-  break;
+   case UNITS_TBYTES:
+   base *= mult;
+   num_divs++;
+   __attribute__ ((fallthrough));
+   case UNITS_GBYTES:
+   base *= mult;
+   num_divs++;
+   __attribute__ ((fallthrough));
+   case UNITS_MBYTES:
+   base *= mult;
+   num_divs++;
+   __attribute__ ((fallthrough));
+   case UNITS_KBYTES:
+   num_divs++;
+   break;
case UNITS_BYTES:
-  base = 1;
-  num_divs = 0;
-  break;
+   base = 1;
+   num_divs = 0;
+   break;
default:
if (negative) {
s64 ssize = (s64)size;
@@ -1907,13 +1917,17 @@ int test_num_disk_vs_raid(u64 metadata_profile, u64 data_profile,
default:
case 4:
allowed |= BTRFS_BLOCK_GROUP_RAID10;
+   __attribute__ ((fallthrough));
case 3:
allowed |= BTRFS_BLOCK_GROUP_RAID6;
+   __attribute__ ((fallthrough));
case 2:
allowed |= BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
BTRFS_BLOCK_GROUP_RAID5;
+   __attribute__ ((fallthrough));
case 1:
allowed |= BTRFS_BLOCK_GROUP_DUP;
+   __attribute__ ((fallthrough));
}
 
if (dev_cnt > 1 && profile & BTRFS_BLOCK_GROUP_DUP) {
-- 
2.19.2
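
As a minimal sketch of the annotation used in this patch, the following standalone function marks each intended fallthrough explicitly. The enum and unit_base() are hypothetical and simplified; the sketch assumes GCC 7 or newer (or a compiler that accepts the same statement attribute).

```c
#include <assert.h>

/* Hypothetical unit codes, not the UNITS_* constants from utils.h. */
enum unit { UNIT_BYTES, UNIT_KBYTES, UNIT_MBYTES, UNIT_GBYTES };

/*
 * Return mult raised to the power implied by the unit (1 for bytes,
 * mult for KiB, mult^2 for MiB, mult^3 for GiB). Each larger unit
 * multiplies once and then deliberately falls into the next case.
 */
unsigned long long unit_base(enum unit u, unsigned long long mult)
{
	unsigned long long base = 1;

	assert(mult > 0);
	switch (u) {
	case UNIT_GBYTES:
		base *= mult;
		__attribute__ ((fallthrough));
	case UNIT_MBYTES:
		base *= mult;
		__attribute__ ((fallthrough));
	case UNIT_KBYTES:
		base *= mult;
		break;
	case UNIT_BYTES:
		break;
	}
	return base;
}
```

The attribute is a statement of its own and is placed directly before the next case label, which is the position -Wimplicit-fallthrough looks at.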

[PATCH v2 06/13] btrfs-progs: Fix Wsuggest-attribute=format warning

2018-12-04 Thread Qu Wenruo

Add __attribute__ ((format (printf, 4, 0))) to table_vprintf(), the
function that calls the vprintf family, to fix the warning.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 string-table.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/string-table.c b/string-table.c
index 95833768960d..455285702d51 100644
--- a/string-table.c
+++ b/string-table.c
@@ -48,6 +48,7 @@ struct string_table *table_create(int columns, int rows)
  * '>' the text is right aligned. If fmt is equal to '=' the text will
  * be replaced by a '=' dimensioned on the basis of the column width
  */
+__attribute__ ((format (printf, 4, 0)))
 char *table_vprintf(struct string_table *tab, int column, int row,
  const char *fmt, va_list ap)
 {
-- 
2.19.2
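
The difference between the two forms of the format attribute can be sketched as follows. buf_vprintf() and buf_printf() are hypothetical helpers, not part of string-table.c; the third number in the attribute is 0 for a function taking a va_list and is the index of the first variadic argument otherwise.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * va_list variant: the format string is parameter 3 and there are no
 * variadic arguments to check, hence the trailing 0.
 */
__attribute__ ((format (printf, 3, 0)))
int buf_vprintf(char *buf, size_t size, const char *fmt, va_list ap)
{
	assert(buf != NULL && size > 0);
	return vsnprintf(buf, size, fmt, ap);
}

/*
 * Variadic variant: the format string is parameter 3 and the arguments
 * to check against it start at parameter 4.
 */
__attribute__ ((format (printf, 3, 4)))
int buf_printf(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int ret;

	va_start(ap, fmt);
	ret = buf_vprintf(buf, size, fmt, ap);
	va_end(ap);
	return ret;
}
```

With these annotations a call such as buf_printf(buf, sizeof(buf), "%d", "text") should be flagged by -Wformat at compile time.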

[PATCH v2 00/13] btrfs-progs: Make W=1 great (no "again")

2018-12-04 Thread Qu Wenruo

This patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/warning_fixes
Which is based on v4.19 tag.

This patchset makes "make W=1" report no warnings.

It first introduces a fix to Makefile.extrawarn to make
"cc-disable-warning" work, then disables the sign-compare warning
completely, as we really don't want extra "unsigned" prefixes to slow our
typing.

Then it re-uses (ok, in fact reworks) Yanjun's patch to disable the
format-truncation warning.

Finally, it fixes all the remaining warnings reported by make W=1.

Now, we make "make W=1" great (maybe 'again' or not, depending on the
distribution's rolling speed).

changelog:
v1.1:
- Use cc-disable-warning instead of putting -Wno-something to improve
  compatibility.
- Better explanation of the uninitialized variable caused by the
  BUG_ON() branch.
- Also clean up free-space-tree.c

v2:
- Add reviewed-by tags, except for the 7th patch, as it takes a different
  approach to the fix in v2.
- Fix bad port of cc-disable-warning, using $CFLAGS instead of kernel
  flags.
- Make sure fixed warnings still show in W=3.
- Use a do {} while() loop to replace a while() {} loop, so even if the
  compiler doesn't get enough of a hint from BUG_ON(), it won't report an
  uninitialized variable warning.

Qu Wenruo (12):
  btrfs-progs: Makefile.extrawarn: Import cc-disable-warning
  btrfs-progs: Makefile.extrawarn: Don't warn on sign compare
  btrfs-progs: Fix Wempty-body warning
  btrfs-progs: Fix Wimplicit-fallthrough warning
  btrfs-progs: Fix Wsuggest-attribute=format warning
  btrfs-progs: Fix Wmaybe-uninitialized warning
  btrfs-progs: Fix Wtype-limits warning
  btrfs-progs: Fix missing-prototypes warning caused by non-static
    functions
  btrfs-progs: Move btrfs_check_nodesize to fsfeatures.c to fix
    missing-prototypes warning
  btrfs-progs: Introduce rescue.h to resolve missing-prototypes for
    chunk and super rescue
  btrfs-progs: Add utils.h include to solve missing-prototypes warning
  btrfs-progs: free-space-tree: Remove unused function

Su Yanjun (1):
  btrfs-progs: fix gcc8 default build warning caused by
    '-Wformat-truncation'

 Makefile            |  5
 Makefile.extrawarn  | 10
 btrfs.c             |  2 +-
 check/mode-lowmem.c |  6 ++---
 chunk-recover.c     |  1 +
 cmds-rescue.c       |  4 +--
 ctree.c             |  4 +--
 extent-tree.c       |  2 +-
 free-space-tree.c   | 59 -
 fsfeatures.c        | 23 ++
 messages.h          | 15
 rescue.h            | 21
 send-stream.c       |  3 +--
 string-table.c      |  1 +
 super-recover.c     |  1 +
 utils-lib.c         |  1 +
 utils.c             | 53 +---
 17 files changed, 119 insertions(+), 92 deletions(-)
 create mode 100644 rescue.h

-- 
2.19.2

[PATCH v2 01/13] btrfs-progs: Makefile.extrawarn: Import cc-disable-warning

2018-12-04 Thread Qu Wenruo

We imported cc-option but forgot to import cc-disable-warning.

Fixes: b556a992c3ad ("btrfs-progs: build: allow to build with various compiler warnings")
Signed-off-by: Qu Wenruo 
---
 Makefile.extrawarn | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/Makefile.extrawarn b/Makefile.extrawarn
index 1f4bda94a167..18a3a860053e 100644
--- a/Makefile.extrawarn
+++ b/Makefile.extrawarn
@@ -19,6 +19,12 @@ try-run = $(shell set -e;   \
  cc-option = $(call try-run,\
  $(CC) $(CFLAGS) $(1) -c -x c /dev/null -o "$$TMP",$(1),$(2))
 
+# cc-disable-warning
+# Usage: cflags-y += $(call cc-disable-warning,unused-but-set-variable)
+cc-disable-warning = $(call try-run,\
+   $(CC) -Werror $(CFLAGS) -W$(strip $(1)) -c -x c /dev/null -o "$$TMP",-Wno-$(strip $(1)))
+
+
 # From linux.git/scripts/Makefile.extrawarn
 # ==
 #
-- 
2.19.2

[PATCH v2 02/13] btrfs-progs: fix gcc8 default build warning caused by '-Wformat-truncation'

2018-12-04 Thread Qu Wenruo

From: Su Yanjun 

When compiling utils.c with gcc8 + glibc 2.28.5, it complains as below:

  utils.c:852:45: warning: '%s' directive output may be truncated writing
  up to 4095 bytes into a region of size 4084 [-Wformat-truncation=]
 snprintf(path, sizeof(path), "/dev/mapper/%s", name);
 ^~   
  In file included from /usr/include/stdio.h:873,
   from utils.c:20:
  /usr/include/bits/stdio2.h:67:10: note: '__builtin___snprintf_chk'
  output between 13 and 4108 bytes into a destination of size 4096
 return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
^~~~
  __bos (__s), __fmt, __va_arg_pack ());
  ~

This isn't a type of warning we care about, particularly since we expect
the string to be truncated when calling snprintf().

Use the GCC option -Wno-format-truncation to disable this for the default
build and the W=1 build, while still keeping it for W=2/W=3 builds.

Signed-off-by: Su Yanjun 
[Use cc-disable-warning to fix the not working CFLAGS setting in configure.ac]
[Keep the warning in W=2/W=3 build]
Signed-off-by: Qu Wenruo 
---
 Makefile   | 5 +
 Makefile.extrawarn | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/Makefile b/Makefile
index f4ab14ea74c8..a9e57fecb6e6 100644
--- a/Makefile
+++ b/Makefile
@@ -62,6 +62,10 @@ DEBUG_LDFLAGS :=
 ABSTOPDIR = $(shell pwd)
 TOPDIR := .
 
+# Disable certain GCC 8 + glibc 2.28 warning for snprintf()
+# where string truncation for snprintf() is expected.
+DISABLE_WARNING_FLAGS := $(call cc-disable-warning, format-truncation)
+
 # Common build flags
 CFLAGS = $(SUBST_CFLAGS) \
 $(CSTD) \
@@ -73,6 +77,7 @@ CFLAGS = $(SUBST_CFLAGS) \
 -I$(TOPDIR) \
 -I$(TOPDIR)/kernel-lib \
 -I$(TOPDIR)/libbtrfsutil \
+$(DISABLE_WARNING_FLAGS) \
 $(EXTRAWARN_CFLAGS) \
 $(DEBUG_CFLAGS_INTERNAL) \
 $(EXTRA_CFLAGS)
diff --git a/Makefile.extrawarn b/Makefile.extrawarn
index 18a3a860053e..0c11f2450802 100644
--- a/Makefile.extrawarn
+++ b/Makefile.extrawarn
@@ -53,6 +53,7 @@ warning-1 += -Wold-style-definition
 warning-1 += $(call cc-option, -Wmissing-include-dirs)
 warning-1 += $(call cc-option, -Wunused-but-set-variable)
 warning-1 += $(call cc-disable-warning, missing-field-initializers)
+warning-1 += $(call cc-disable-warning, format-truncation)
 
 warning-2 := -Waggregate-return
 warning-2 += -Wcast-align
@@ -61,6 +62,7 @@ warning-2 += -Wnested-externs
 warning-2 += -Wshadow
 warning-2 += $(call cc-option, -Wlogical-op)
 warning-2 += $(call cc-option, -Wmissing-field-initializers)
+warning-2 += $(call cc-option, -Wformat-truncation)
 
 warning-3 := -Wbad-function-cast
 warning-3 += -Wcast-qual
-- 
2.19.2

Re: [PATCH 7/9] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-04 Thread Qu Wenruo



On 2018/12/4 8:17 PM, David Sterba wrote:
> On Fri, Nov 16, 2018 at 03:54:24PM +0800, Qu Wenruo wrote:
>> The only location is the following code:
>>
>>  int level = path->lowest_level + 1;
>>  BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>>  while(level < BTRFS_MAX_LEVEL) {
>>  slot = path->slots[level] + 1;
>>  ...
>>  }
>>  path->slots[level] = slot;
>>
>> Again, it's the stupid compiler needs some hint for the fact that
>> we will always enter the while loop for at least once, thus @slot should
>> always be initialized.
> 
> Harsh words for the compiler, and I say not deserved. The same code
> pasted into the kernel and built with the same version does not report the
> warning, so it's apparently a missing annotation of BUG_ON in
> btrfs-progs that does not give the right hint.
> 
Well, in fact after the recent gcc8 updates (god knows how many times
gcc8 has been updated in Arch since the patchset), it doesn't report this
warning anymore.

But your idea about BUG_ON() lacking a noreturn attribute makes sense.

I'll just add some hint in kerncompat.

Thanks,
Qu




Re: Ran into "invalid block group size" bug, unclear how to proceed.

2018-12-04 Thread Qu Wenruo



On 2018/12/5 6:33 AM, Mike Javorski wrote:
> On Tue, Dec 4, 2018 at 2:18 AM Qu Wenruo  wrote:
>>
>>
>>
>> On 2018/12/4 11:32 AM, Mike Javorski wrote:
>>> Need a bit of advice here ladies / gents. I am running into an issue
>>> which Qu Wenruo seems to have posted a patch for several weeks ago
>>> (see https://patchwork.kernel.org/patch/10694997/).
>>>
>>> Here is the relevant dmesg output which led me to Qu's patch.
>>> 
>>> [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
>>> block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
>>> invalid block group size, have 10804527104 expect (0, 10737418240]
>>> [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
>>> [   10.053365] BTRFS error (device sdb): open_ctree failed
>>> 
>>
>> Exactly the same symptom.
>>
>>>
>>> This server has a 16 disk btrfs filesystem (RAID6) which I boot
>>> periodically to btrfs-send snapshots to. This machine is running
>>> ArchLinux and I had just updated  to their latest 4.19.4 kernel
>>> package (from 4.18.10 which was working fine). I've tried updating to
>>> the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
>>> the issue. From what I can see on kernel.org, the patch above is not
>>> pushed to stable or to Linus' tree.
>>>
>>> At this point the question is what to do. Is my FS toast?
>>
>> If there is no other problem at all, your fs is just fine.
>> It's my original patch too sensitive (the excuse for not checking chunk
>> allocator carefully enough).
>>
>> But since you have the down time, it's never a bad idea to run a btrfs
>> check --readonly to see if your fs is really OK.
>>
> 
> After running for 4 hours...
> 
> UUID: 25b16375-b90b-408e-b592-fb07ed116d58
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space cache
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups
> found 24939616169984 bytes used, no error found
> total csum bytes: 24321980768
> total tree bytes: 41129721856
> total fs tree bytes: 9854648320
> total extent tree bytes: 737804288
> btree space waste bytes: 7483785005
> file data blocks allocated: 212883520618496
>  referenced 212876546314240
> 
> So things appear good to go. I will keep an eye out for the patch to
> land before upgrading the kernel again.
> 
>>> Could I
>>> revert to the 4.18.10 kernel and boot safely?
>>
>> If your btrfs check --readonly doesn't report any problem, then you're
>> completely fine to do so.
>> Although I still recommend to go RAID10 other than RAID5/6.
> 
> I understand the risk, but don't have the funds to buy sufficient
> disks to operate in RAID10.

Then my advice would be: after any power loss, please run a full-disk scrub
(and of course ensure there is not another power loss during scrubbing).

I know this sounds silly and slow, but at least it should work around the
write hole problem.

Thanks,
Qu

> The data is mostly large files and
> activity is predominantly reads, so risk is currently acceptable given
> the backup server. All super critical data is backed up to (very slow)
> cloud storage.
> 
>>
>> Thanks,
>> Qu
>>
>>> I don't know if the 4.19
>>> boot process may have flipped some bits which would make reverting
>>> problematic.
>>>
>>> Thanks much,
>>>
>>> - mike
>>>
>>




Re: [PATCH] btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable

2018-12-04 Thread Qu Wenruo



On 2018/12/4 9:52 PM, David Sterba wrote:
> On Tue, Dec 04, 2018 at 06:15:13PM +0800, Qu Wenruo wrote:
>> Gentle ping.
>>
>> Please put this patch into current release as the new block group size
>> limit check introduced in v4.19 is causing at least 2 reports in mail list.
> 
> BTW, if there's an urgent fix or patch that should be considered for
> current devel cycle, please note that in the subject like
> 
>   [PATCH for 4.20-rc] btrfs: ...
> 
> to make it more visible.
> 

Many thanks for this hint!

I just forgot we have such a tag.

Thanks,
Qu




Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Qu Wenruo




On 2018/12/4 9:07 PM, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 at 20:20, Wilson, Ellis wrote:
>> Hi all,
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
>>
>> With 14TB drives available today, it doesn't take more than a handful of 
>> drives to result in a filesystem that takes around a minute to mount. 
>> As a result of this, I suspect this will become an increasingly problem 
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
>> not a contributor so I have no room to do so -- just shedding some light 
>> on a problem that may deserve attention as filesystem sizes continue to 
>> grow.
> 
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing finger I'd like concrete
> evidence pointed at the offender.

IIRC I submitted such an analysis years ago.

Nowadays the result may have changed due to the chunk <-> bg <-> dev_extents
cross-checking. So yes, it would be a good idea to show such a percentage.

Thanks,
Qu

> 
>>
>> Best,
>>
>> ellis
>>

Re: [PATCH v1.1 9/9] btrfs-progs: Cleanup warning reported by -Wmissing-prototypes

2018-12-04 Thread Qu Wenruo



On 2018/12/4 8:22 PM, David Sterba wrote:
> On Fri, Nov 16, 2018 at 04:04:51PM +0800, Qu Wenruo wrote:
>> The following missing prototypes will be fixed:
>> 1)  btrfs.c::handle_special_globals()
>> 2)  check/mode-lowmem.c::repair_ternary_lowmem()
>> 3)  extent-tree.c::btrfs_search_overlap_extent()
>> Above 3 can be fixed by making them static
>>
>> 4)  utils.c::btrfs_check_nodesize()
>> Fixed by moving it to fsfeatures.c
>>
>> 5)  chunk-recover.c::btrfs_recover_chunk_tree()
>> 6)  super-recover.c::btrfs_recover_superblocks()
>> Fixed by moving the declaration from cmds-rescue.c to rescue.h
>>
>> 7)  utils-lib.c::arg_strtou64()
>> 8)  utils-lib.c::lookup_path_rootid()
>> Fixed by include "utils.h"
>>
>> 9)  free-space-tree.c::set_free_space_tree_thresholds()
>> Fixed by deleting it, as there is no user.
>>
>> 10) free-space-tree.c::convert_free_space_to_bitmaps()
>> 11) free-space-tree.c::convert_free_space_to_extents()
>> 12) free-space-tree.c::__remove_from_free_space_tree()
>> 13) free-space-tree.c::__add_to_free_space_tree()
>> 14) free-space-tree.c::btrfs_create_tree()
>> Fixed by making them static.
> 
> Please split this to more patches grouped by the type of change.

No problem, the numbering is already grouped that way.

Thanks,
Qu
> 
>> --- /dev/null
>> +++ b/rescue.h
>> @@ -0,0 +1,14 @@
>> +/*
>> + * Copyright (C) 2018 SUSE.  All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public
>> + * License v2 as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details.
>> + */
> 
> Missing ifdef to prevent multiple inclusion
> 
>> +int btrfs_recover_superblocks(const char *path, int verbose, int yes);
>> +int btrfs_recover_chunk_tree(const char *path, int verbose, int yes);




Re: [PATCH 2/9] btrfs-progs: fix gcc8 default build warning caused by '-Wformat-truncation'

2018-12-04 Thread Qu Wenruo



On 2018/12/4 7:10 PM, David Sterba wrote:
> On Fri, Nov 16, 2018 at 03:54:19PM +0800, Qu Wenruo wrote:
>> From: Su Yanjun 
>>
>> When using gcc8 compiles utils.c, it complains as below:
>>
>> utils.c:852:45: warning: '%s' directive output may be truncated writing
>> up to 4095 bytes into a region of size 4084 [-Wformat-truncation=]
>>snprintf(path, sizeof(path), "/dev/mapper/%s", name);
>>  ^~   
>> In file included from /usr/include/stdio.h:873,
>>  from utils.c:20:
>> /usr/include/bits/stdio2.h:67:10: note: '__builtin___snprintf_chk'
>> output between 13 and 4108 bytes into a destination of size 4096
>>return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
>>   ^~~~
>> __bos (__s), __fmt, __va_arg_pack ());
>> ~
>>
>> This isn't a type of warning we care about, particularly when PATH_MAX
>> is much less than either.
>>
>> Using the GCC option -Wno-format-truncation to disable this for default
>> build.
>>
>> Signed-off-by: Su Yanjun 
>> [Use cc-disable-warning to fix the not working CFLAGS setting in 
>> configure.ac]
>> Signed-off-by: Qu Wenruo 
>> ---
>>  Makefile   | 1 +
>>  Makefile.extrawarn | 1 +
>>  2 files changed, 2 insertions(+)
>>
>> diff --git a/Makefile b/Makefile
>> index f4ab14ea74c8..252299f8869f 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -49,6 +49,7 @@ CSCOPE_CMD := cscope -u -b -c -q
>>  include Makefile.extrawarn
>>  
>>  EXTRA_CFLAGS :=
>> +EXTRA_CFLAGS += $(call cc-disable-warning, format-truncation)
> 
> Please don't touch EXTRA_CFLAGS, this is for users who want to override
> any defaults that are set by build. This should go to CFLAGS.
> 
>>  EXTRA_LDFLAGS :=
>>  
>>  DEBUG_CFLAGS_DEFAULT = -O0 -U_FORTIFY_SOURCE -ggdb3
>> diff --git a/Makefile.extrawarn b/Makefile.extrawarn
>> index 5849036fd166..bbb2d5173846 100644
>> --- a/Makefile.extrawarn
>> +++ b/Makefile.extrawarn
>> @@ -53,6 +53,7 @@ warning-1 += -Wold-style-definition
>>  warning-1 += $(call cc-option, -Wmissing-include-dirs)
>>  warning-1 += $(call cc-option, -Wunused-but-set-variable)
>>  warning-1 += $(call cc-disable-warning, missing-field-initializers)
>> +warning-1 += $(call cc-disable-warning, format-truncation)
> 
> It's ok to disable for W=1 but eg. the W=3 level could take all
> previously disabled warnings.

Is there any guidance on disabling warnings?

IMHO format truncation in any snprintf()-like function is really more or
less expected, just like missing-field-initializers.

Thanks,
Qu

> 
>>  
>>  warning-2 := -Waggregate-return
>>  warning-2 += -Wcast-align
>> -- 
>> 2.19.1




Re: [PATCH v2 1/5] btrfs-progs: image: Refactor fixup_devices() to fixup_chunks_and_devices()

2018-12-04 Thread Qu Wenruo



On 2018/12/4 6:20 PM, David Sterba wrote:
> On Tue, Nov 27, 2018 at 04:38:24PM +0800, Qu Wenruo wrote:
>> +error:
>> +error(
>> +"failed to fix chunks and devices mapping, the fs may not be mountable: %s",
>> +strerror(-ret));
> 
> Recently the strerror(error code) calls have been converted to %m, so I changed
> this to
> 
>   errno = -ret;
>   error("... %m");
> 

Thanks for mentioning this.

I'll use this format for later patches.

Thanks,
Qu
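
For reference, the two styles discussed above can be compared in a small sketch. The helper names are hypothetical, and plain snprintf() stands in for the btrfs-progs error() macro. %m is a glibc extension to the printf family that expands to the message for the current errno, so this assumes a libc that supports it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Old style: pass strerror() of the positive error code explicitly. */
int msg_with_strerror(char *buf, size_t size, int ret)
{
	assert(ret <= 0);
	return snprintf(buf, size,
		"failed to fix chunks and devices mapping, the fs may not be mountable: %s",
		strerror(-ret));
}

/* New style: store the positive error code in errno and let %m expand it. */
int msg_with_percent_m(char *buf, size_t size, int ret)
{
	assert(ret <= 0);
	errno = -ret;
	return snprintf(buf, size,
		"failed to fix chunks and devices mapping, the fs may not be mountable: %m");
}
```

Both helpers are expected to produce the same text for the same negative return value; the %m form just avoids the extra argument.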




Re: [PATCH v2 1/5] btrfs-progs: image: Refactor fixup_devices() to fixup_chunks_and_devices()

2018-12-04 Thread Qu Wenruo




On 2018/12/4 6:18 PM, David Sterba wrote:
> On Tue, Nov 27, 2018 at 04:50:57PM +0800, Qu Wenruo wrote:
>>>> -static int fixup_devices(struct btrfs_fs_info *fs_info,
>>>> -   struct mdrestore_struct *mdres, off_t dev_size)
>>>> +static int fixup_device_size(struct btrfs_trans_handle *trans,
>>>> +   struct mdrestore_struct *mdres,
>>>> +   off_t dev_size)
>>>>  {
>>>> -  struct btrfs_trans_handle *trans;
>>>> +  struct btrfs_fs_info *fs_info = trans->fs_info;
>>>>struct btrfs_dev_item *dev_item;
>>>>struct btrfs_path path;
>>>> -  struct extent_buffer *leaf;
>>>>struct btrfs_root *root = fs_info->chunk_root;
>>>>struct btrfs_key key;
>>>> +  struct extent_buffer *leaf;
>>>
>>> nit: Unnecessary change
>>
>> Doesn't it look better when all btrfs_ prefix get batched together? :)
> 
> Please don't do unrelated changes like that.

Understood, the github version has that @leaf reverted to the original
location.

Thanks,
Qu

Re: Ran into "invalid block group size" bug, unclear how to proceed.

2018-12-04 Thread Qu Wenruo



On 2018/12/4 11:32 AM, Mike Javorski wrote:
> Need a bit of advice here ladies / gents. I am running into an issue
> which Qu Wenruo seems to have posted a patch for several weeks ago
> (see https://patchwork.kernel.org/patch/10694997/).
> 
> Here is the relevant dmesg output which led me to Qu's patch.
> 
> [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
> block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
> invalid block group size, have 10804527104 expect (0, 10737418240]
> [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> [   10.053365] BTRFS error (device sdb): open_ctree failed
> 

Exactly the same symptom.

> 
> This server has a 16 disk btrfs filesystem (RAID6) which I boot
> periodically to btrfs-send snapshots to. This machine is running
> ArchLinux and I had just updated  to their latest 4.19.4 kernel
> package (from 4.18.10 which was working fine). I've tried updating to
> the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> the issue. From what I can see on kernel.org, the patch above is not
> pushed to stable or to Linus' tree.
> 
> At this point the question is what to do. Is my FS toast?

If there is no other problem at all, your fs is just fine.
It's just that my original patch is too sensitive (I didn't check the chunk
allocator carefully enough).

But since you have the down time, it's never a bad idea to run a btrfs
check --readonly to see if your fs is really OK.

> Could I
> revert to the 4.18.10 kernel and boot safely?

If your btrfs check --readonly doesn't report any problem, then you're
completely fine to do so.
Although I still recommend going with RAID10 rather than RAID5/6.

Thanks,
Qu

> I don't know if the 4.19
> boot process may have flipped some bits which would make reverting
> problematic.
> 
> Thanks much,
> 
> - mike
> 




Re: [PATCH] btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable

2018-12-04 Thread Qu Wenruo

Gentle ping.

Please put this patch into the current release, as the new block group size
limit check introduced in v4.19 has caused at least 2 reports on the mailing list.

Thanks,
Qu

On 2018/11/23 9:06 AM, Qu Wenruo wrote:
> [BUG]
> A completely valid btrfs will refuse to mount, with error message like:
>   BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
> bg_start=12018974720 bg_len=10888413184, invalid block group size, \
> have 10888413184 expect (0, 10737418240]
> 
> Btrfs check returns no error, and all kernels used on this fs are later
> than 2011, which should all have the 10G size limit commit.
> 
> [CAUSE]
> For a 12-device btrfs, we could allocate a chunk larger than 10G due to
> the stripe size bump up.
> 
> __btrfs_alloc_chunk()
> |- max_stripe_size = 1G
> |- max_chunk_size = 10G
> |- data_stripe = 11
> |- if (1G * 11 > 10G) {
>stripe_size = 976128930;
>stripe_size = round_up(976128930, SZ_16M) = 989855744
> 
> However the final stripe_size (989855744) * 11 = 10888413184, which is
> still larger than 10G.
> 
> [FIX]
> For the comprehensive check, we need to do the full check at chunk
> read time, and rely on bg <-> chunk mapping to do the check.
> 
> We could just skip the length check for now.
> 
> Fixes: fce466eab7ac ("btrfs: tree-checker: Verify block_group_item")
> Cc: sta...@vger.kernel.org # v4.19+
> Reported-by: Wang Yugui 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/tree-checker.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index cab0b1f1f741..d8bd5340fbbc 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -389,13 +389,11 @@ static int check_block_group_item(struct btrfs_fs_info *fs_info,
>  
>   /*
>* Here we don't really care about alignment since extent allocator can
> -  * handle it.  We care more about the size, as if one block group is
> -  * larger than maximum size, it's must be some obvious corruption.
> +  * handle it.  We care more about the size.
>*/
> - if (key->offset > BTRFS_MAX_DATA_CHUNK_SIZE || key->offset == 0) {
> + if (key->offset == 0) {
>   block_group_err(fs_info, leaf, slot,
> - "invalid block group size, have %llu expect (0, %llu]",
> - key->offset, BTRFS_MAX_DATA_CHUNK_SIZE);
> + "invalid block group size 0");
>   return -EUCLEAN;
>   }
>  
> 
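The numbers in the [CAUSE] section of the quoted patch can be reproduced with a short sketch. This is only an illustration in Python, not the kernel code: the function name bumped_chunk_size is made up here, and the real __btrfs_alloc_chunk() applies further limits that are left out.

```python
SZ_16M = 16 * 1024 * 1024
SZ_1G = 1024 ** 3


def round_up(value, align):
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align


def bumped_chunk_size(data_stripes, max_stripe_size=SZ_1G,
                      max_chunk_size=10 * SZ_1G):
    """Simplified model of the stripe size calculation (hypothetical helper)."""
    stripe_size = max_stripe_size
    if stripe_size * data_stripes > max_chunk_size:
        # Shrink the stripe so the chunk fits, then align it to 16M.
        stripe_size = max_chunk_size // data_stripes
        stripe_size = round_up(stripe_size, SZ_16M)
    return stripe_size, stripe_size * data_stripes


stripe_size, chunk_size = bumped_chunk_size(data_stripes=11)
print(stripe_size, chunk_size, chunk_size > 10 * SZ_1G)
```

With 11 data stripes the 16M round-up pushes the chunk back over the 10G limit, which is the value the tree-checker rejected.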




Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Qu Wenruo



On 2018/12/4 2:20 AM, Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasing problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

This problem is somewhat known.

If you dig further, it's btrfs_read_block_groups() which will try to
read *ALL* block group items.
And to no one's surprise, the larger the fs grows, the more block group
items need to be read from disk.

We need to defer such reads to improve this case.
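To get a feeling for the scale, here is a rough sketch that estimates how many block group items mount has to read. It assumes every data block group is 1 GiB, which is only an approximation for a single-device (mdraid-backed) filesystem filled with data, and the helper name is made up for this illustration.

```python
GIB = 1024 ** 3
TB = 1000 ** 4


def estimate_block_groups(used_bytes, block_group_bytes=GIB):
    """Rough count of block group items, assuming fixed-size block groups."""
    # Ceiling division: a partially filled block group still has an item.
    return -(-used_bytes // block_group_bytes)


for size_tb in (1, 10, 60):
    print(size_tb, "TB ->", estimate_block_groups(size_tb * TB), "block groups")
```

The count grows linearly with the amount of allocated space, which matches the roughly linear mount time in the graph.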

Thanks,
Qu

> 
> Best,
> 
> ellis
> 




Re: Filesystem Corruption

2018-12-03 Thread Qu Wenruo



On 2018/12/3 5:31 PM, Stefan Malte Schumacher wrote:
> Hello,
> 
> I have noticed an unusual amount of crc-errors in downloaded rars,
> beginning about a week ago. But let's start with the preliminaries. I
> am using Debian Stretch.
> Kernel: Linux mars 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4
> (2018-08-21) x86_64 GNU/Linux
> BTRFS-Tools btrfs-progs  4.7.3-1
> Smartctl shows no errors for any of the drives in the filesystem.
> 
> Btrfs /dev/stats shows zero errors, but dmesg gives me a lot of
> filesystem related error messages.
> 
> [5390748.884929] Buffer I/O error on dev dm-0, logical block
> 976701312, async page read
> This error is shown many times in the log.

No "btrfs:" prefix, so it looks more like an error message from the block
level; no wonder btrfs shows no error at all.

What is the underlying device mapper?

Furthermore, is there any kernel message with "btrfs"
(case-insensitive) in it?

Thanks,
Qu
> 
> This seems to affect just newly written files. This is the output of
> btrfs scrub status:
> scrub status for 1609e4e1-4037-4d31-bf12-f84a691db5d8
> scrub started at Tue Nov 27 06:02:04 2018 and finished after 07:34:16
> total bytes scrubbed: 17.29TiB with 0 errors
> 
> What is the probable cause of these errors? How can I fix this?
> 
> Thanks in advance for your advice
> Stefan
> 




Re: Need help with potential ~45TB dataloss

2018-12-02 Thread Qu Wenruo

On 2018/12/3 4:30 AM, Andrei Borzenkov wrote:
> 02.12.2018 23:14, Patrick Dijkgraaf wrote:
>> I have some additional info.
>>
>> I found the reason the FS got corrupted. It was a single failing drive,
>> which caused the entire cabinet (containing 7 drives) to reset. So the
>> FS suddenly lost 7 drives.
>>
> 
> This remains a mystery to me. btrfs is marketed to be always consistent
> on disk - you either have previous full transaction or current full
> transaction. If current transaction was interrupted the promise is you
> are left with previous valid consistent transaction.
> 
> Obviously this is not what happens in practice. Which nullifies the main
> selling point of btrfs.
> 
> Unless this is expected behavior, it sounds like some barriers are
> missing and summary data is updated before (and without waiting for)
> subordinate data. And if it is expected behavior ...

There is (unfortunately) one known problem for RAID5/6 and one special
problem for RAID6.

The common problem is write hole.
For a RAID5 stripe like:
Disk 1  | Disk 2 | Disk 3
--------------------------
DATA1   | DATA2  | PARITY

If we have written something into DATA1, but a power loss happened before
we updated PARITY on Disk 3, then we can't tolerate the loss of Disk 2,
since DATA1 doesn't match PARITY anymore.

Without the ability to know exactly which blocks we have written, the
write hole problem exists for any parity-based solution, including BTRFS
RAID5/6.
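The write hole can be shown with plain XOR parity. This is a toy sketch in Python, not btrfs code; the block contents and the helper name xor_blocks are made up for the illustration.

```python
def xor_blocks(a, b):
    """XOR two equally sized blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


data1 = b"old-data"
data2 = b"neighbor"
parity = xor_blocks(data1, data2)

# Consistent stripe: a lost Disk 2 can be rebuilt from DATA1 and PARITY.
recovered_before = xor_blocks(data1, parity)

# DATA1 is rewritten, then power is lost before PARITY is updated.
data1 = b"new-data"

# Now the same rebuild of Disk 2 silently returns garbage.
recovered_after = xor_blocks(data1, parity)

print(recovered_before == data2, recovered_after == data2)
```

Note that DATA2 itself was never written during the interrupted update, yet it is the block that becomes unrecoverable.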

According to people on the mailing list, other RAID5/6 implementations
keep their own on-disk record of which blocks are being updated, and
after a power loss they rebuild the involved stripes.

Since btrfs doesn't have such ability, we need to scrub the whole fs to
regain the disk loss tolerance (and hope there will not be another power
loss during the scrub).

The RAID6-specific problem is the lack of rebuild retry logic.
(Fixed in the kernel since 4.16, but btrfs-progs support is still missing.)

For a RAID6 stripe like:
Disk 1  | Disk 2 | Disk 3 | Disk 4
-----------------------------------
DATA1   | DATA2  | P      | Q

If reading DATA1 fails, we have 3 ways to rebuild the data:
1) Using DATA2 and P (just as RAID5)
2) Using P and Q
3) Using DATA2 and Q

However, before 4.16 we didn't retry all possible ways to rebuild it.
(Thanks Liu for solving this problem).
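The three rebuild paths and the retry idea can be sketched for a two-data-disk stripe. This is a toy model in Python, not the kernel implementation: it assumes the usual RAID6 construction (P is plain XOR, Q uses GF(2^8) with polynomial 0x11d and generator 2), uses zlib.crc32 as a stand-in for the btrfs data checksum, and all function names are made up here.

```python
import zlib


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reduction polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_inv(a):
    """Brute-force multiplicative inverse in GF(2^8)."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)


def make_pq(d1, d2):
    """P = D1 ^ D2 and Q = D1 ^ 2*D2 (coefficients g^0 and g^1, g = 2)."""
    p = bytes(a ^ b for a, b in zip(d1, d2))
    q = bytes(a ^ gf_mul(2, b) for a, b in zip(d1, d2))
    return p, q


def rebuild_candidates(d2, p, q):
    """Yield (method, rebuilt DATA1) for the three ways listed above."""
    yield "DATA2+P", bytes(x ^ y for x, y in zip(d2, p))
    inv3 = gf_inv(3)  # P ^ Q = 3*D2, so D2 = (P ^ Q) / 3
    d2_from_pq = bytes(gf_mul(inv3, x ^ y) for x, y in zip(p, q))
    yield "P+Q", bytes(x ^ y for x, y in zip(p, d2_from_pq))
    yield "DATA2+Q", bytes(gf_mul(2, x) ^ y for x, y in zip(d2, q))


def rebuild_with_retry(d2, p, q, expected_csum):
    """Return the first candidate whose checksum matches, else None."""
    for method, candidate in rebuild_candidates(d2, p, q):
        if zlib.crc32(candidate) == expected_csum:
            return method, candidate
    return None


data1, data2 = b"hello world!", b"btrfs raid6!"
p, q = make_pq(data1, data2)
csum = zlib.crc32(data1)

# DATA1 is unreadable and DATA2 is silently corrupted as well.
corrupted_data2 = b"XXXXX raid6!"
method, rebuilt = rebuild_with_retry(corrupted_data2, p, q, csum)
print(method, rebuilt)
```

Without the retry loop, only the first path (the RAID5 way) would be tried, and the corrupted DATA2 would make the rebuild fail even though P and Q alone are sufficient.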

Thanks,
Qu

> 
>> I have removed the failed drive, so the RAID is now degraded. I hope
>> the data is still recoverable... ☹
>>
> 


[PATCH] btrfs-progs: fsck-tests: Move reloc tree images to 020-extent-ref-cases

2018-12-02 Thread Qu Wenruo

A reloc tree, despite its short lifespan, is still special in terms of
backrefs: the reloc tree root backref points back to itself.

So it's more appropriate to put these images into 020-extent-ref-cases.

Signed-off-by: Qu Wenruo 
---
 tests/fsck-tests/015-tree-reloc-tree/test.sh|  16 
 tests/fsck-tests/020-extent-ref-cases/test.sh   |   5 +
 .../tree_reloc_for_data_reloc.img.xz| Bin
 .../tree_reloc_for_fs_tree.img.xz   | Bin
 4 files changed, 5 insertions(+), 16 deletions(-)
 delete mode 100755 tests/fsck-tests/015-tree-reloc-tree/test.sh
 rename tests/fsck-tests/{015-tree-reloc-tree => 
020-extent-ref-cases}/tree_reloc_for_data_reloc.img.xz (100%)
 rename tests/fsck-tests/{015-tree-reloc-tree => 
020-extent-ref-cases}/tree_reloc_for_fs_tree.img.xz (100%)

diff --git a/tests/fsck-tests/015-tree-reloc-tree/test.sh 
b/tests/fsck-tests/015-tree-reloc-tree/test.sh
deleted file mode 100755
index 5d9d5122fd06..
--- a/tests/fsck-tests/015-tree-reloc-tree/test.sh
+++ /dev/null
@@ -1,16 +0,0 @@
-#!/bin/bash
-# Make sure btrfs check won't report any false alerts for valid image with
-# reloc tree.
-#
-# Also due to the short life span of reloc tree, save the as dump example for
-# later usage.
-
-source "$TEST_TOP/common"
-
-check_prereq btrfs
-
-check_image() {
-   run_check "$TOP/btrfs" check "$1"
-}
-
-check_all_images
diff --git a/tests/fsck-tests/020-extent-ref-cases/test.sh 
b/tests/fsck-tests/020-extent-ref-cases/test.sh
index a1bf75b14486..2f5a05cca4d4 100755
--- a/tests/fsck-tests/020-extent-ref-cases/test.sh
+++ b/tests/fsck-tests/020-extent-ref-cases/test.sh
@@ -14,6 +14,11 @@
 #   Containing a block group and its first extent at
 #   the beginning of leaf.
 #   Which caused false alert for lowmem mode.
+#
+# Special cases with some rare backref type
+# * reloc tree
+#   For both fs tree and data reloc tree.
+#   Special for its backref pointing to itself and its short life span.
 
 source "$TEST_TOP/common"
 
diff --git 
a/tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_data_reloc.img.xz 
b/tests/fsck-tests/020-extent-ref-cases/tree_reloc_for_data_reloc.img.xz
similarity index 100%
rename from 
tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_data_reloc.img.xz
rename to tests/fsck-tests/020-extent-ref-cases/tree_reloc_for_data_reloc.img.xz
diff --git a/tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_fs_tree.img.xz 
b/tests/fsck-tests/020-extent-ref-cases/tree_reloc_for_fs_tree.img.xz
similarity index 100%
rename from tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_fs_tree.img.xz
rename to tests/fsck-tests/020-extent-ref-cases/tree_reloc_for_fs_tree.img.xz
-- 
2.19.2

Re: Need help with potential ~45TB dataloss

2018-12-02 Thread Qu Wenruo



On 2018/12/3 8:35 AM, Qu Wenruo wrote:
> 
> 
> On 2018/12/2 5:03 PM, Patrick Dijkgraaf wrote:
>> Hi Qu,
>>
>> Thanks for helping me!
>>
>> Please see the responses in-line.
>> Any suggestions based on this?
>>
>> Thanks!
>>
>>
>> On Sat, 2018-12-01 at 07:57 +0800, Qu Wenruo wrote:
>>> On 2018/11/30 9:53 PM, Patrick Dijkgraaf wrote:
>>>> Hi all,
>>>>
>>>> I have been a happy BTRFS user for quite some time. But now I'm
>>>> facing
>>>> a potential ~45TB dataloss... :-(
>>>> I hope someone can help!
>>>>
>>>> I have Server A and Server B. Both having a 20-devices BTRFS RAID6
>>>> filesystem.

I forgot one important thing here, specifically for RAID6.

If one data device is corrupted, RAID6 will normally try to rebuild it the
RAID5 way, and if another disk is also corrupted, it may not recover
correctly.

The current way to recover is to try *all* combinations.

IIRC Liu Bo submitted such a patch, but it was not merged.

This means current RAID6 can only handle two missing devices in the best
case.
But for corruption, it can only be as good as RAID5.

Thanks,
Qu

> Because of known RAID5/6 risks, Server B was a backup
>>>> of
>>>> Server A.
>>>> After applying updates to server B and reboot, the FS would not
>>>> mount
>>>> anymore. Because it was "just" a backup. I decided to recreate the
>>>> FS
>>>> and perform a new backup. Later, I discovered that the FS was not
>>>> broken, but I faced this issue: 
>>>> https://patchwork.kernel.org/patch/10694997/
>>>>
>>>
>>> Sorry for the inconvenience.
>>>
>>> I didn't realize the max_chunk_size limit isn't reliable at that
>>> timing.
>>
>> No problem, I should not have jumped to the conclusion to recreate the
>> backup volume.
>>
>>>> Anyway, the FS was already recreated, so I needed to do a new
>>>> backup.
>>>> During the backup (using rsync -vah), Server A (the source)
>>>> encountered
>>>> an I/O error and my rsync failed. In an attempt to "quick fix" the
>>>> issue, I rebooted Server A after which the FS would not mount
>>>> anymore.
>>>
>>> Did you have any dmesg about that IO error?
>>
>> Yes there was. But I omitted capturing it... The system is now rebooted
>> and I can't retrieve it anymore. :-(
>>
>>> And how is the reboot scheduled? Forced power off or normal reboot
>>> command?
>>
>> The system was rebooted using a normal reboot command.
> 
> Then the problem is pretty serious.
> 
> Possibly already corrupted before.
> 
>>
>>>> I documented what I have tried, below. I have not yet tried
>>>> anything
>>>> except what is shown, because I am afraid of causing more harm to
>>>> the FS.
>>>
>>> Pretty clever, no btrfs check --repair is a pretty good move.
>>>
>>>> I hope somebody here can give me advice on how to (hopefully)
>>>> retrieve my data...
>>>>
>>>> Thanks in advance!
>>>>
>>>> ==
>>>>
>>>> [root@cornelis ~]# btrfs fi show
>>>> Label: 'cornelis-btrfs'  uuid: ac643516-670e-40f3-aa4c-f329fc3795fd
>>>>Total devices 1 FS bytes used 463.92GiB
>>>>devid    1 size 800.00GiB used 493.02GiB path
>>>> /dev/mapper/cornelis-cornelis--btrfs
>>>>
>>>> Label: 'data'  uuid: 4c66fa8b-8fc6-4bba-9d83-02a2a1d69ad5
>>>>Total devices 20 FS bytes used 44.85TiB
>>>>devid    1 size 3.64TiB used 3.64TiB path /dev/sdn2
>>>>devid    2 size 3.64TiB used 3.64TiB path /dev/sdp2
>>>>devid    3 size 3.64TiB used 3.64TiB path /dev/sdu2
>>>>devid    4 size 3.64TiB used 3.64TiB path /dev/sdx2
>>>>devid    5 size 3.64TiB used 3.64TiB path /dev/sdh2
>>>>devid    6 size 3.64TiB used 3.64TiB path /dev/sdg2
>>>>devid    7 size 3.64TiB used 3.64TiB path /dev/sdm2
>>>>devid    8 size 3.64TiB used 3.64TiB path /dev/sdw2
>>>>devid    9 size 3.64TiB used 3.64TiB path /dev/sdj2
>>>>devid   10 size 3.64TiB used 3.64TiB path /dev/sdt2
>>>>devid   11 size 3.64TiB used 3.64TiB path /dev/sdk2
>>>>devid   12 size 3.64TiB used 3.64TiB path /dev/sdq2
>>>>devid   13 size 3.64TiB used 3.64TiB path /dev/sds2
>>>>devid   14 size 3

Re: Need help with potential ~45TB dataloss

2018-12-02 Thread Qu Wenruo



On 2018/12/2 5:03 PM, Patrick Dijkgraaf wrote:
> Hi Qu,
> 
> Thanks for helping me!
> 
> Please see the responses in-line.
> Any suggestions based on this?
> 
> Thanks!
> 
> 
> On Sat, 2018-12-01 at 07:57 +0800, Qu Wenruo wrote:
>> On 2018/11/30 9:53 PM, Patrick Dijkgraaf wrote:
>>> Hi all,
>>>
>>> I have been a happy BTRFS user for quite some time. But now I'm
>>> facing
>>> a potential ~45TB dataloss... :-(
>>> I hope someone can help!
>>>
>>> I have Server A and Server B. Both having a 20-devices BTRFS RAID6
>>> filesystem. Because of known RAID5/6 risks, Server B was a backup
>>> of
>>> Server A.
>>> After applying updates to server B and reboot, the FS would not
>>> mount
>>> anymore. Because it was "just" a backup. I decided to recreate the
>>> FS
>>> and perform a new backup. Later, I discovered that the FS was not
>>> broken, but I faced this issue: 
>>> https://patchwork.kernel.org/patch/10694997/
>>>
>>
>> Sorry for the inconvenience.
>>
>> I didn't realize the max_chunk_size limit isn't reliable at that
>> timing.
> 
> No problem, I should not have jumped to the conclusion to recreate the
> backup volume.
> 
>>> Anyway, the FS was already recreated, so I needed to do a new
>>> backup.
>>> During the backup (using rsync -vah), Server A (the source)
>>> encountered
>>> an I/O error and my rsync failed. In an attempt to "quick fix" the
>>> issue, I rebooted Server A after which the FS would not mount
>>> anymore.
>>
>> Did you have any dmesg about that IO error?
> 
> Yes there was. But I omitted capturing it... The system is now rebooted
> and I can't retrieve it anymore. :-(
> 
>> And how is the reboot scheduled? Forced power off or normal reboot
>> command?
> 
> The system was rebooted using a normal reboot command.

Then the problem is pretty serious.

Possibly the filesystem was already corrupted before that.

> 
>>> I documented what I have tried, below. I have not yet tried
>>> anything
>>> except what is shown, because I am afraid of causing more harm to
>>> the FS.
>>
>> Pretty clever, no btrfs check --repair is a pretty good move.
>>
>>> I hope somebody here can give me advice on how to (hopefully)
>>> retrieve my data...
>>>
>>> Thanks in advance!
>>>
>>> ==
>>>
>>> [root@cornelis ~]# btrfs fi show
>>> Label: 'cornelis-btrfs'  uuid: ac643516-670e-40f3-aa4c-f329fc3795fd
>>> Total devices 1 FS bytes used 463.92GiB
>>> devid    1 size 800.00GiB used 493.02GiB path
>>> /dev/mapper/cornelis-cornelis--btrfs
>>>
>>> Label: 'data'  uuid: 4c66fa8b-8fc6-4bba-9d83-02a2a1d69ad5
>>> Total devices 20 FS bytes used 44.85TiB
>>> devid    1 size 3.64TiB used 3.64TiB path /dev/sdn2
>>> devid    2 size 3.64TiB used 3.64TiB path /dev/sdp2
>>> devid    3 size 3.64TiB used 3.64TiB path /dev/sdu2
>>> devid    4 size 3.64TiB used 3.64TiB path /dev/sdx2
>>> devid    5 size 3.64TiB used 3.64TiB path /dev/sdh2
>>> devid    6 size 3.64TiB used 3.64TiB path /dev/sdg2
>>> devid    7 size 3.64TiB used 3.64TiB path /dev/sdm2
>>> devid    8 size 3.64TiB used 3.64TiB path /dev/sdw2
>>> devid    9 size 3.64TiB used 3.64TiB path /dev/sdj2
>>> devid   10 size 3.64TiB used 3.64TiB path /dev/sdt2
>>> devid   11 size 3.64TiB used 3.64TiB path /dev/sdk2
>>> devid   12 size 3.64TiB used 3.64TiB path /dev/sdq2
>>> devid   13 size 3.64TiB used 3.64TiB path /dev/sds2
>>> devid   14 size 3.64TiB used 3.64TiB path /dev/sdf2
>>> devid   15 size 7.28TiB used 588.80GiB path /dev/sdr2
>>> devid   16 size 7.28TiB used 588.80GiB path /dev/sdo2
>>> devid   17 size 7.28TiB used 588.80GiB path /dev/sdv2
>>> devid   18 size 7.28TiB used 588.80GiB path /dev/sdi2
>>> devid   19 size 7.28TiB used 588.80GiB path /dev/sdl2
>>> devid   20 size 7.28TiB used 588.80GiB path /dev/sde2
>>>
>>> [root@cornelis ~]# mount /dev/sdn2 /mnt/data
>>> mount: /mnt/data: wrong fs type, bad option, bad superblock on
>>> /dev/sdn2, missing codepage or helper program, or other error.
>>
>> What is the dmesg of the mount failure?
> 
> [Sun Dec  2 09:41:08 2018] BTRFS info (device sdn2): disk space caching
> is enabled
> [Sun Dec  2 09:41:08 2018] BTRFS inf

Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-11-30 Thread Qu Wenruo



On 2018/12/1 12:52 AM, Josef Bacik wrote:
> From: Josef Bacik 
> 
> When debugging some weird extent reference bug I suspected that we were
> changing a snapshot while we were deleting it, which could explain my
> bug.

May I ask under which case we're going to modify an unlinked snapshot?

Maybe metadata relocation?

Thanks,
Qu

> This was indeed what was happening, and this patch helped me
> verify my theory.  It is never correct to modify the snapshot once it's
> being deleted, so mark the root when we are deleting it and make sure we
> complain about it when it happens.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.c   | 3 +++
>  fs/btrfs/ctree.h   | 1 +
>  fs/btrfs/extent-tree.c | 9 +
>  3 files changed, 13 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 5912a97b07a6..5f82f86085e8 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
> *trans,
>   u64 search_start;
>   int ret;
>  
> + if (test_bit(BTRFS_ROOT_DELETING, &root->state))
> + WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being dropped\n");
> +
>   if (trans->transaction != fs_info->running_transaction)
>   WARN(1, KERN_CRIT "trans %llu running %llu\n",
>  trans->transid,
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index facde70c15ed..5a3a94ccb65c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1199,6 +1199,7 @@ enum {
>   BTRFS_ROOT_FORCE_COW,
>   BTRFS_ROOT_MULTI_LOG_TASKS,
>   BTRFS_ROOT_DIRTY,
> + BTRFS_ROOT_DELETING,
>  };
>  
>  /*
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 581c2a0b2945..dcb699dd57f3 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>   if (block_rsv)
>   trans->block_rsv = block_rsv;
>  
> + /*
> +  * This will help us catch people modifying the fs tree while we're
> +  * dropping it.  It is unsafe to mess with the fs tree while it's being
> +  * dropped as we unlock the root node and parent nodes as we walk down
> +  * the tree, assuming nothing will change.  If something does change
> +  * then we'll have stale information and drop references to blocks we've
> +  * already dropped.
> +  */
> + set_bit(BTRFS_ROOT_DELETING, &root->state);
>   if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
>   level = btrfs_header_level(root->node);
>   path->nodes[level] = btrfs_lock_root_node(root);
> 




Re: Need help with potential ~45TB dataloss

2018-11-30 Thread Qu Wenruo



On 2018/11/30 9:53 PM, Patrick Dijkgraaf wrote:
> Hi all,
> 
> I have been a happy BTRFS user for quite some time. But now I'm facing
> a potential ~45TB dataloss... :-(
> I hope someone can help!
> 
> I have Server A and Server B. Both having a 20-devices BTRFS RAID6
> filesystem. Because of known RAID5/6 risks, Server B was a backup of
> Server A.
> After applying updates to server B and reboot, the FS would not mount
> anymore. Because it was "just" a backup. I decided to recreate the FS
> and perform a new backup. Later, I discovered that the FS was not
> broken, but I faced this issue: 
> https://patchwork.kernel.org/patch/10694997/

Sorry for the inconvenience.

I didn't realize the max_chunk_size limit wasn't reliable at that time.

> 
> Anyway, the FS was already recreated, so I needed to do a new backup.
> During the backup (using rsync -vah), Server A (the source) encountered
> an I/O error and my rsync failed. In an attempt to "quick fix" the
> issue, I rebooted Server A after which the FS would not mount anymore.

Did you have any dmesg about that IO error?

And how is the reboot scheduled? Forced power off or normal reboot command?

> 
> I documented what I have tried, below. I have not yet tried anything
> except what is shown, because I am afraid of causing more harm to
> the FS.

Pretty clever; not running btrfs check --repair is a pretty good move.

> I hope somebody here can give me advice on how to (hopefully)
> retrieve my data...
> 
> Thanks in advance!
> 
> ==
> 
> [root@cornelis ~]# btrfs fi show
> Label: 'cornelis-btrfs'  uuid: ac643516-670e-40f3-aa4c-f329fc3795fd
>   Total devices 1 FS bytes used 463.92GiB
>   devid    1 size 800.00GiB used 493.02GiB path
> /dev/mapper/cornelis-cornelis--btrfs
> 
> Label: 'data'  uuid: 4c66fa8b-8fc6-4bba-9d83-02a2a1d69ad5
>   Total devices 20 FS bytes used 44.85TiB
>   devid    1 size 3.64TiB used 3.64TiB path /dev/sdn2
>   devid    2 size 3.64TiB used 3.64TiB path /dev/sdp2
>   devid    3 size 3.64TiB used 3.64TiB path /dev/sdu2
>   devid    4 size 3.64TiB used 3.64TiB path /dev/sdx2
>   devid    5 size 3.64TiB used 3.64TiB path /dev/sdh2
>   devid    6 size 3.64TiB used 3.64TiB path /dev/sdg2
>   devid    7 size 3.64TiB used 3.64TiB path /dev/sdm2
>   devid    8 size 3.64TiB used 3.64TiB path /dev/sdw2
>   devid    9 size 3.64TiB used 3.64TiB path /dev/sdj2
>   devid   10 size 3.64TiB used 3.64TiB path /dev/sdt2
>   devid   11 size 3.64TiB used 3.64TiB path /dev/sdk2
>   devid   12 size 3.64TiB used 3.64TiB path /dev/sdq2
>   devid   13 size 3.64TiB used 3.64TiB path /dev/sds2
>   devid   14 size 3.64TiB used 3.64TiB path /dev/sdf2
>   devid   15 size 7.28TiB used 588.80GiB path /dev/sdr2
>   devid   16 size 7.28TiB used 588.80GiB path /dev/sdo2
>   devid   17 size 7.28TiB used 588.80GiB path /dev/sdv2
>   devid   18 size 7.28TiB used 588.80GiB path /dev/sdi2
>   devid   19 size 7.28TiB used 588.80GiB path /dev/sdl2
>   devid   20 size 7.28TiB used 588.80GiB path /dev/sde2
> 
> [root@cornelis ~]# mount /dev/sdn2 /mnt/data
> mount: /mnt/data: wrong fs type, bad option, bad superblock on
> /dev/sdn2, missing codepage or helper program, or other error.

What is the dmesg of the mount failure?

And have you tried -o ro,degraded ?

> 
> [root@cornelis ~]# btrfs check /dev/sdn2
> Opening filesystem to check...
> parent transid verify failed on 46451963543552 wanted 114401 found
> 114173
> parent transid verify failed on 46451963543552 wanted 114401 found
> 114173
> checksum verify failed on 46451963543552 found A8F2A769 wanted 4C111ADF
> checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
> checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
> bad tree block 46451963543552, bytenr mismatch, want=46451963543552,
> have=75208089814272
> Couldn't read tree root

Would you please also paste the output of "btrfs ins dump-super /dev/sdn2" ?

It looks like your tree root (or at least some tree root nodes/leaves)
got corrupted.

> ERROR: cannot open file system

And since it is your tree root that is corrupted, you could also try
"btrfs-find-root " to try to get a good old copy of your tree root.

But I suspect the corruption happened before you noticed, so the old
tree root may not help much.

Also, the output of "btrfs ins dump-tree -t root " will help.

Thanks,
Qu
> 
> [root@cornelis ~]# btrfs restore /dev/sdn2 /mnt/data/
> parent transid verify failed on 46451963543552 wanted 114401 found
> 114173
> parent transid verify failed on 46451963543552 wanted 114401 found
> 114173
> checksum verify failed on 46451963543552 found A8F2A769 wanted 4C111ADF
> checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
> checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
> bad tree block 46451963543552, bytenr mismatch, want=46451963543552,
> have=75208089814272
> Couldn't read

Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-28 Thread Qu Wenruo




On 2018/11/29 4:08 AM, Filipe Manana wrote:
> On Wed, Nov 28, 2018 at 7:09 PM David Sterba  wrote:
>>
>> On Tue, Nov 27, 2018 at 03:08:08PM -0500, Josef Bacik wrote:
>>> On Tue, Nov 27, 2018 at 07:59:42PM +, Chris Mason wrote:
>>>> On 27 Nov 2018, at 14:54, Josef Bacik wrote:
>>>>
>>>>> On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote:
>>>>>>
>>>>>>
>>>>>> On 21.11.18 21:09, Josef Bacik wrote:
>>>>>>> The cleaner thread usually takes care of delayed iputs, with the
>>>>>>> exception of the btrfs_end_transaction_throttle path.  The cleaner
>>>>>>> thread only gets woken up every 30 seconds, so instead wake it up to do
>>>>>>> its work so that we can free up that space as quickly as possible.
>>>>>>
>>>>>> Have you done any measurements how this affects the overall system. I
>>>>>> suspect this introduces a lot of noise since now we are going to be
>>>>>> doing a thread wakeup on every iput, does this give a chance to have
>>>>>> nice, large batches of iputs that  the cost of wake up can be
>>>>>> amortized
>>>>>> across?
>>>>>
>>>>> I ran the whole patchset with our A/B testing stuff and the patchset
>>>>> was a 5%
>>>>> win overall, so I'm inclined to think it's fine.  Thanks,
>>>>
>>>> It's a good point though, a delayed wakeup may be less overhead.
>>>
>>> Sure, but how do we go about that without it sometimes messing up?  In 
>>> practice
>>> the only time we're doing this is at the end of finish_ordered_io, so 
>>> likely to
>>> not be a constant stream.  I suppose since we have places where we force it 
>>> to
>>> run that we don't really need this.  IDK, I'm fine with dropping it.  
>>> Thanks,
>>
>> The transaction thread wakes up cleaner periodically (commit interval,
>> 30s by default), so the time to process iputs is not unbounded.
>>
>> I have the same concerns as Nikolay, coupling the wakeup to all delayed
>> iputs could result in smaller batches. But some of the callers of
>> btrfs_add_delayed_iput might benefit from the extra wakeup, like
>> btrfs_remove_block_group, so I don't want to leave the idea yet.
> 
> I'm curious, why do you think it would benefit btrfs_remove_block_group()?

Just as Filipe said, I'm not sure why btrfs_remove_block_group() would
benefit from more frequent cleaner thread wakeups.

For an empty block group to really be removed, it also needs to have 0
pinned bytes, which is only possible after a transaction gets committed.

Thanks,
Qu

Re: Balance: invalid convert data profile raid10

2018-11-28 Thread Qu Wenruo



On 2018/11/28 3:20 PM, Mikko Merikivi wrote:
> Well, excuse me for thinking it wouldn't since in md-raid it worked.
> https://wiki.archlinux.org/index.php/RAID#RAID_level_comparison
> 
> Anyway, the error message is truly confusing for someone who doesn't
> know about btrfs's implementation. I suppose in md-raid the near
> layout is actually RAID 1 and far2 uses twice as much space.
> https://en.wikipedia.org/wiki/Non-standard_RAID_levels#LINUX-MD-RAID-10
> 
> Well, I guess I'll go with RAID 1, then. Thanks for clearing up the confusion.

You should check mkfs.btrfs(8).
It has a pretty good table of all supported profiles under the PROFILES
section, and it lists the min/max number of devices for each.
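As a quick illustration of why the balance was rejected, here is a small sketch of the minimum device check. The table values are my reading of the PROFILES section for btrfs-progs of this era and should be treated as assumptions (newer versions may differ); the names MIN_DEVICES and can_convert are made up for this example.

```python
# Minimum number of devices per profile (assumed values, check your
# own mkfs.btrfs(8) PROFILES section before relying on them).
MIN_DEVICES = {
    "single": 1,
    "dup": 1,
    "raid0": 2,
    "raid1": 2,
    "raid10": 4,
    "raid5": 2,
    "raid6": 3,
}


def can_convert(profile, num_devices):
    """Return True if the fs has enough devices for the target profile."""
    return num_devices >= MIN_DEVICES[profile]


for target in ("raid1", "raid10"):
    print(target, "on 2 devices:", can_convert(target, 2))
```

With only red1 and red2 in the filesystem, raid1 passes the check and raid10 does not, which matches the "invalid convert data profile raid10" message.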

Thanks,
Qu


> On Wed, 28 Nov 2018 at 3:14, Qu Wenruo (quwenruo.bt...@gmx.com) wrote:
>>
>>
>>
>> On 2018/11/28 5:16 AM, Mikko Merikivi wrote:
>>> I seem unable to convert an existing btrfs device array to RAID 10.
>>> Since it's pretty much RAID 0 and 1 combined, and 5 and 6 are
>>> unstable, it's what I would like to use.
>>>
>>> After I tried this with 4.19.2-arch1-1-ARCH and btrfs-progs v4.19,
>>> I updated my system and tried btrfs balance again with this system
>>> information:
>>> [mikko@localhost lvdata]$ uname -a
>>> Linux localhost 4.19.4-arch1-1-ARCH #1 SMP PREEMPT Fri Nov 23 09:06:58
>>> UTC 2018 x86_64 GNU/Linux
>>> [mikko@localhost lvdata]$ btrfs --version
>>> btrfs-progs v4.19
>>> [mikko@localhost lvdata]$ sudo btrfs fi show
>>> Label: 'main1'  uuid: c7cbb9c3-8c55-45f1-b03c-48992efe2f11
>>> Total devices 1 FS bytes used 2.90TiB
>>> devid    1 size 3.64TiB used 2.91TiB path /dev/mapper/main
>>>
>>> Label: 'red'  uuid: f3c781a8-0f3e-4019-acbf-0b783cf566d0
>>> Total devices 2 FS bytes used 640.00KiB
>>> devid    1 size 931.51GiB used 2.03GiB path /dev/mapper/red1
>>> devid    2 size 931.51GiB used 2.03GiB path /dev/mapper/red2
>>
>> RAID10 needs at least 4 devices.
>>
>> Thanks,
>> Qu
>>
>>> [mikko@localhost lvdata]$ btrfs fi df /mnt/red/
>>> Data, RAID1: total=1.00GiB, used=512.00KiB
>>> System, RAID1: total=32.00MiB, used=16.00KiB
>>> Metadata, RAID1: total=1.00GiB, used=112.00KiB
>>> GlobalReserve, single: total=16.00MiB, used=0.00B
>>>
>>> ---
>>>
>>> Here are the steps I originally used:
>>>
>>> [mikko@localhost lvdata]$ sudo cryptsetup luksFormat -s 512
>>> --use-random /dev/sdc
>>> [mikko@localhost lvdata]$ sudo cryptsetup luksFormat -s 512
>>> --use-random /dev/sdd
>>> [mikko@localhost lvdata]$ sudo cryptsetup luksOpen /dev/sdc red1
>>> [mikko@localhost lvdata]$ sudo cryptsetup luksOpen /dev/sdd red2
>>> [mikko@localhost lvdata]$ sudo mkfs.btrfs -L red /dev/mapper/red1
>>> btrfs-progs v4.19
>>> See http://btrfs.wiki.kernel.org for more information.
>>>
>>> Label:              red
>>> UUID:               f3c781a8-0f3e-4019-acbf-0b783cf566d0
>>> Node size:          16384
>>> Sector size:        4096
>>> Filesystem size:    931.51GiB
>>> Block group profiles:
>>>   Data:             single            8.00MiB
>>>   Metadata:         DUP               1.00GiB
>>>   System:           DUP               8.00MiB
>>> SSD detected:       no
>>> Incompat features:  extref, skinny-metadata
>>> Number of devices:  1
>>> Devices:
>>>    ID        SIZE  PATH
>>>     1   931.51GiB  /dev/mapper/red1
>>>
>>> [mikko@localhost lvdata]$ sudo mount -t btrfs -o
>>> defaults,noatime,nodiratime,autodefrag,compress=lzo /dev/mapper/red1
>>> /mnt/red
>>> [mikko@localhost lvdata]$ sudo btrfs device add /dev/mapper/red2 /mnt/red
>>> [mikko@localhost lvdata]$ sudo btrfs balance start -dconvert=raid10
>>> -mconvert=raid10 /mnt/red
>>> ERROR: error during balancing '/mnt/red': Invalid argument
>>> There may be more info in syslog - try dmesg | tail
>>> code 1
>>>
>>> [mikko@localhost lvdata]$ dmesg | tail
>>> [12026.263243] BTRFS info (device dm-1): disk space caching is enabled
>>> [12026.263244] BTRFS info (device dm-1): has skinny extents
>>> [12026.263245] BTRFS info (device dm-1): flagging fs with big metadata 
>>> feature
>>> [12026.275153] BTRFS info (device dm-1): checking UUID tree
>>> [12195.431766] BTRFS info (device dm-1): enabling auto defrag
>>> [12195.431769] BTRFS info (device dm-1): use lzo compression, level 0
>>> [12195.431770] BTRFS info (device dm-1): disk space caching is enabled
>>> [12195.431771] BTRFS info (device dm-1): has skinny extents
>>> [12205.815941] BTRFS info (device dm-1): disk added /dev/mapper/red2
>>> [12744.788747] BTRFS error (device dm-1): balance: invalid convert
>>> data profile raid10
>>>
>>> Converting to RAID 1 did work but what can I do to make it RAID 10?
>>> With the up-to-date system it still says "invalid convert data profile
>>> raid10".
>>>
>>



signature.asc
Description: OpenPGP digital signature

Re: [PATCH 0/9] Switch defines to enums

2018-11-28 Thread Qu Wenruo



On 2018/11/28 9:25 PM, David Sterba wrote:
> On Wed, Nov 28, 2018 at 09:33:50AM +0800, Qu Wenruo wrote:
>> On 2018/11/28 3:53 AM, David Sterba wrote:
>>> This is motivated by a merging mistake that happened a few releases ago.
>>> Two patches updated BTRFS_FS_* flags independently to the same value,
>>> git did not see any direct merge conflict. The two values got mixed at
>>> runtime and caused crash.
>>>
>>> Update all #define sequential values, the above merging problem would
>>> not happen as there would be a conflict and the enum value
>>> auto-increment would prevent duplicated values anyway.
>>
>> Just one small question for the bitmap usage.
>>
>> With an enum we can't directly see what the last number is. My concern
>> is: if we're using a u16 as a bitmap and some enum value goes over 15,
>> will we get a compile-time warning, or would a bug just sneak into the
>> kernel?
> 
> Do you have a concrete example where this would happen? Most bits are
> used in 'long' that should be at least 32. I'm not aware of 16bit bits
> flags.
> 

OK, I'm too paranoid here.

set_bit() takes an unsigned long *, and passing a u16 * will cause a
compile-time warning, so it shouldn't be a problem.

And if an enum value goes over 63, reviewers will definitely complain anyway.
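
To make the point about bit numbers and bitmap width concrete, here is a
minimal userspace sketch. The DEMO_FLAG_* names, the DEMO_FLAG_NR sentinel
and the demo_set_bit()/demo_test_bit() helpers are hypothetical stand-ins,
not the kernel's set_bit(). The only properties mirrored are that the helper
takes an unsigned long * and that an enum outgrowing the word can be caught
at compile time.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical flags; values are auto-assigned starting from 0. */
enum {
	DEMO_FLAG_BARRIER,
	DEMO_FLAG_OPEN,
	DEMO_FLAG_QUOTA_ENABLED,
	DEMO_FLAG_NR,		/* sentinel: number of flags defined above */
};

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Fails to compile if a flag would not fit into one unsigned long. */
_Static_assert(DEMO_FLAG_NR <= DEMO_BITS_PER_LONG,
	       "too many flags for a single unsigned long bitmap");

/* Userspace stand-in for set_bit(): non-atomic, single word only. */
static void demo_set_bit(unsigned int nr, unsigned long *addr)
{
	assert(nr < DEMO_BITS_PER_LONG);
	*addr |= 1UL << nr;
}

/* Returns 1 if bit @nr is set in *@addr, 0 otherwise. */
static int demo_test_bit(unsigned int nr, const unsigned long *addr)
{
	assert(nr < DEMO_BITS_PER_LONG);
	return (int)((*addr >> nr) & 1UL);
}
```

Calling demo_set_bit() with the address of a 16-bit variable would pass a
uint16_t * where an unsigned long * is expected, which C compilers diagnose
as an incompatible pointer type.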

Thanks,
Qu




[PATCH] btrfs: qgroup: Introduce more accurate early reloc tree detection

2018-11-27 Thread Qu Wenruo

The biggest challenge for qgroup in skipping reloc tree extents is
detecting the correct owner of reloc tree blocks.

Unlike for most data operations, the owning root of tree reloc tree
blocks can't be easily detected.

For example, for relocation we call btrfs_copy_root to init reloc tree:

btrfs_copy_root(root=257, new_root_objectid=RELOC)
|- btrfs_inc_ref(trans, root=257, cow, 1)
   | << From this point, we won't know this eb will be used for reloc >>
   |- __btrfs_mod_ref(root=257)
      |- btrfs_inc_extent_ref()

In the above case, once btrfs_inc_ref() is called, none of the later
functions are aware that the extent buffer is used for a reloc tree.

This makes it extremely hard for qgroup code to detect tree blocks
allocated for a reloc tree.

The good news is that in btrfs_copy_root(), if @new_root_objectid ==
RELOC, we set BTRFS_HEADER_FLAG_RELOC for that extent buffer.

We can use that flag to detect reloc tree blocks, but then we need to
add an extra parameter to the following functions:
- btrfs_inc_extent_ref
- btrfs_free_extent
- add_delayed_tree_ref

This parameter change affects a lot of callers, but is needed for qgroup
to reduce balance overhead.
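
As a rough illustration of the decision this extra parameter enables, here
is a simplified userspace model. The demo_* names are hypothetical, and
demo_is_fstree() is only an approximation of the kernel's is_fstree() (it
leaves out the qgroup level check). The point shown is that root 257 alone
looks like an ordinary subvolume, so only the for_reloc hint lets the qgroup
record be skipped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Object ids of the default fs tree and of the first subvolume. */
#define DEMO_FS_TREE_OBJECTID		5ULL
#define DEMO_FIRST_FREE_OBJECTID	256ULL

/*
 * Simplified stand-in for is_fstree(): true for the default fs tree and
 * for subvolume/snapshot ids. Special trees use ids that are negative
 * when viewed as signed 64-bit values, hence the casts.
 */
static bool demo_is_fstree(uint64_t rootid)
{
	return rootid == DEMO_FS_TREE_OBJECTID ||
	       (int64_t)rootid >= (int64_t)DEMO_FIRST_FREE_OBJECTID;
}

/*
 * Decide whether a tree block ref should get a qgroup extent record.
 * @for_reloc models the new parameter: it is the only way to tell that
 * a ref with ref_root == 257 actually belongs to a reloc tree block.
 */
static bool demo_need_qgroup_record(bool quota_enabled, uint64_t ref_root,
				    bool for_reloc)
{
	return quota_enabled && demo_is_fstree(ref_root) && !for_reloc;
}
```

With quota enabled, demo_need_qgroup_record(true, 257, false) asks for a
record, while the same root with for_reloc set does not.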

For the benchmark, the same memory-backed VM is used, with a 4G subvolume
and 16 snapshots:

                     | v4.20-rc1 + delayed* | w/ patch | diff
-----------------------------------------------------------------
relocated extents    | 22703                | 22610    | -0.0%
qgroup dirty extents | 74938                | 69292    | -7.5%
time (real)          | 24.567s              | 23.546s  | -4.1%

*: With delayed subtree scan and "btrfs: qgroup: Skip delayed data ref
for reloc trees"

Signed-off-by: Qu Wenruo 
---
The delayed ref API parameters are a mess and need an interface refactor
to tidy things up.
In fact, with the current interface we don't even have a way to know the
real owner of a delayed ref.
That will definitely cause problems for later qgroup + balance
optimizations.
---
 fs/btrfs/ctree.h   |  5 +++--
 fs/btrfs/delayed-ref.c |  5 +++--
 fs/btrfs/delayed-ref.h |  3 ++-
 fs/btrfs/extent-tree.c | 24 +---
 fs/btrfs/file.c| 10 +-
 fs/btrfs/inode.c   |  4 ++--
 fs/btrfs/ioctl.c   |  3 ++-
 fs/btrfs/relocation.c  | 16 +---
 fs/btrfs/tree-log.c|  2 +-
 9 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e32fcf211c8a..6f4b1e605736 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2673,7 +2673,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle 
*trans,
 int btrfs_free_extent(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset);
+ u64 owner, u64 offset, bool for_reloc);
 
 int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
   u64 start, u64 len, int delalloc);
@@ -2684,7 +2684,8 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle 
*trans);
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 bytenr, u64 num_bytes, u64 parent,
-u64 root_objectid, u64 owner, u64 offset);
+u64 root_objectid, u64 owner, u64 offset,
+bool for_reloc);
 
 int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans);
 int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 269bd6ecb8f3..685b21b2dc24 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -718,7 +718,8 @@ static void init_delayed_ref_common(struct btrfs_fs_info 
*fs_info,
  */
 int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes, u64 parent,
-  u64 ref_root,  int level, int action,
+  u64 ref_root,  int level, bool for_reloc,
+  int action,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod)
 {
@@ -744,7 +745,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
}
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-   is_fstree(ref_root)) {
+   is_fstree(ref_root) && !for_reloc) {
record = kmalloc(sizeof(*record), GFP_NOFS);
if (!record) {
kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 6c60737e55d6..35b38410e6bf 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -236,7 +236,8 @@ static inli

Re: [RFC PATCH 00/17] btrfs: implementation of priority aware allocator

2018-11-27 Thread Qu Wenruo




On 2018/11/28 11:11 AM, Su Yue wrote:
> This patchset can be fetched from repo:
> https://github.com/Damenly/btrfs-devel/commits/priority_aware_allocator.
> Since patchset 'btrfs: Refactor find_free_extent()' does a nice work
> to simplify find_free_extent(). This patchset dependents on the refactor.
> The base is the commit in kdave/misc-next:
> 
> commit fcaaa1dfa81f2f87ad88cbe0ab86a07f9f76073c (kdave/misc-next)
> Author: Nikolay Borisov 
> Date:   Tue Nov 6 16:40:20 2018 +0200
> 
> btrfs: Always try all copies when reading extent buffers
> 
> 
> This patchset introduces a new mount option named 'priority_alloc=%s',
> %s is supported to be "usage" and "off" now. The mount option changes
> the way find_free_extent() how to search block groups.
> 
> Previously, block groups are stored in list of btrfs_space_info
> by start position. When call find_free_extent() if no hint,
> block_groups are searched one by one.
> 
> Design of priority aware allocator:
> Block group has its own priority. We split priorities to many levels,
> block groups are split to different trees according priorities.
> And those trees are sorted by their levels and stored in space_info.
> Once find_free_extent() is called, try to search block groups in higher
> priority level then lower level. Then a block group with higher
> priority is more likely to be used.
> 
> Pros:
> 1) Reduce the frequency of balance.
>The block group with a higher usage rate will be used preferentially
>for allocating extents. Free the empty block groups with pinned bytes
>as non-zero.[1]
> 
> 2) The priority of empty block group with pinned bytes as non-zero
>will be set as the lowest.
>
> 3) Support zoned block device.[2]
>For metadata allocation, the block group in conventional zones
>will be used as much as possible regardless of usage rate.
>Will do it in future.

Personally I'm a big fan of the priority aware extent allocator.

So nice job!

>
> Cons:
> 1) Expectable performance regression.
>The degree of the decline is temporarily unknown.
>The user can disable block group priority to get the full performance.
> 
> TESTS:
> 
> If use usage as priority(the only available option), empty block group
> is much harder to be reused.
> 
> About block group usage:
>   Disk: 4 x 1T HDD gathered in LVM.
> 
>   Run script to create files and delete files randomly in loop.
>   The num of files to create are double than to delete.
> 
>   Default mount option result:
>   https://i.loli.net/2018/11/28/5bfdfdf08c760.png
> 
>   Priority aware allocator(usage) result:
>   https://i.loli.net/2018/11/28/5bfdfdf0c1b11.png
> 
>   X coordinate means total disk usage, Y coordinate means avg block
>   group usage.
> 
>   Due to fragmentation of extents, the different are not obvious,
>   only about 1% improvement

I think you're using the wrong indicator to show the difference.

The real indicator should not be overall block group usage, but:
1) Number of block groups
2) Usage distribution of the block groups

If the number of block groups isn't much different, then we should check
the distribution.
E.g. all bgs at 97% usage is not as good as mostly 100% bgs plus several
bgs near 10%.

We should also compare the usage distribution of metadata and data bgs
separately.
Data bgs can hit fragmentation problems, while in metadata bgs all
extents are the same size, so the allocator may do better for metadata.

So we could do better in how the test result is measured.
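
A usage distribution like the one suggested above could be collected with
something along these lines. This is only a sketch: struct demo_block_group
and the demo_* helpers are hypothetical, and the length/used values would
have to be gathered from the filesystem by other means.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_BUCKETS 11	/* 0-9%, 10-19%, ..., 90-99%, 100% */

/* Hypothetical per block group numbers, both in bytes. */
struct demo_block_group {
	uint64_t length;	/* size of the block group */
	uint64_t used;		/* bytes used inside it */
};

/* Usage of one block group in whole percent, clamped to 0..100. */
static unsigned int demo_bg_usage_percent(const struct demo_block_group *bg)
{
	if (bg->length == 0)
		return 0;
	if (bg->used >= bg->length)
		return 100;
	return (unsigned int)(bg->used * 100 / bg->length);
}

/*
 * Count block groups per 10% usage bucket; hist[10] holds the completely
 * full block groups.
 */
static void demo_usage_histogram(const struct demo_block_group *bgs,
				 size_t nr,
				 unsigned int hist[DEMO_NR_BUCKETS])
{
	size_t i;

	for (i = 0; i < DEMO_NR_BUCKETS; i++)
		hist[i] = 0;
	for (i = 0; i < nr; i++)
		hist[demo_bg_usage_percent(&bgs[i]) / 10]++;
}
```

Running it once for data block groups and once for metadata block groups
gives two histograms that can be compared between the default allocator and
priority_alloc=usage, alongside the total number of block groups.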

>  
> Performance regression:
>I have ran sysbench on our machine with SSD in multi combinations,
>no obvious regression found.
>However in theory, the new allocator may cost more time in some
>cases.

Isn't that good news? :)

> 
> [1] https://www.spinics.net/lists/linux-btrfs/msg79508.html
> [2] https://lkml.org/lkml/2018/8/16/174
> 
> ---
> Due to some reasons includes time and hardware, the use-case is not
> outstanding enough.

As discussed offline, another cause would be data extent fragmentation.
E.g. we have a lot of small 4K holes but the request is a big 128M one.
In that case btrfs_reserve_extent() could still trigger allocation of a
new data chunk rather than return the 4K holes it found.
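
The effect can be modelled in a few lines. This is a deliberately
simplified sketch with hypothetical demo_* helpers; it only captures the
"one contiguous hole must be large enough" constraint and none of the
fallbacks of the real allocator.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of all free holes: what an overall usage number reflects. */
static uint64_t demo_total_free(const uint64_t *holes, size_t nr)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		sum += holes[i];
	return sum;
}

/* Largest single contiguous hole. */
static uint64_t demo_largest_hole(const uint64_t *holes, size_t nr)
{
	uint64_t max = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (holes[i] > max)
			max = holes[i];
	return max;
}

/*
 * In this model a request that fits no existing hole forces a new chunk,
 * no matter how much free space the holes add up to.
 */
static bool demo_needs_new_chunk(const uint64_t *holes, size_t nr,
				 uint64_t request)
{
	return demo_largest_hole(holes, nr) < request;
}
```

With four 4K holes and a 128M request, demo_total_free() reports 16K of
free space, yet demo_needs_new_chunk() is true. That is why overall block
group usage alone can hide the allocator's behaviour.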

Thanks,
Qu

> And some codes are dirty but I can't found another
> way. So I named it as RFC.
>  Any comments and suggestions are welcome.
>  
> Su Yue (17):
>   btrfs: priority alloc: prepare of priority aware allocator
>   btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE
>   btrfs: priority alloc: introduce compute_block_group_priority/usage
>   btrfs: priority alloc: add functions to create/remove priority trees
>   btrfs: priority alloc: introduce functions to add block group to
> priority tree
>   btrfs: priority alloc: introduce three macros to mark block group
> status
>   btrfs: priority alloc: add functions to remove block group from
> priority tree
>   btrfs: priority alloc: add

Re: [PATCH 9/9] btrfs: drop extra enum initialization where using defaults

2018-11-27 Thread Qu Wenruo



On 2018/11/28 3:53 AM, David Sterba wrote:
> The first auto-assigned value to enum is 0, we can use that and not
> initialize all members where the auto-increment does the same. This is
> used for values that are not part of on-disk format.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/btrfs_inode.h |  2 +-
>  fs/btrfs/ctree.h   | 28 ++--
>  fs/btrfs/disk-io.h | 10 +-
>  fs/btrfs/qgroup.h  |  2 +-
>  fs/btrfs/sysfs.h   |  2 +-
>  fs/btrfs/transaction.h | 14 +++---
>  6 files changed, 29 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 4de321aee7a5..fc25607304f2 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -20,7 +20,7 @@
>   * new data the application may have written before commit.
>   */
>  enum {
> - BTRFS_INODE_ORDERED_DATA_CLOSE = 0,
> + BTRFS_INODE_ORDERED_DATA_CLOSE,
>   BTRFS_INODE_DUMMY,
>   BTRFS_INODE_IN_DEFRAG,
>   BTRFS_INODE_HAS_ASYNC_EXTENT,
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 4bb0ac3050ff..f1d1c6ba3aa1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -334,7 +334,7 @@ struct btrfs_node {
>   * The slots array records the index of the item or block pointer
>   * used while walking the tree.
>   */
> -enum { READA_NONE = 0, READA_BACK, READA_FORWARD };
> +enum { READA_NONE, READA_BACK, READA_FORWARD };
>  struct btrfs_path {
>   struct extent_buffer *nodes[BTRFS_MAX_LEVEL];
>   int slots[BTRFS_MAX_LEVEL];
> @@ -532,18 +532,18 @@ struct btrfs_free_cluster {
>  };
>  
>  enum btrfs_caching_type {
> - BTRFS_CACHE_NO  = 0,
> - BTRFS_CACHE_STARTED = 1,
> - BTRFS_CACHE_FAST= 2,
> - BTRFS_CACHE_FINISHED= 3,
> - BTRFS_CACHE_ERROR   = 4,
> + BTRFS_CACHE_NO,
> + BTRFS_CACHE_STARTED,
> + BTRFS_CACHE_FAST,
> + BTRFS_CACHE_FINISHED,
> + BTRFS_CACHE_ERROR,
>  };
>  
>  enum btrfs_disk_cache_state {
> - BTRFS_DC_WRITTEN= 0,
> - BTRFS_DC_ERROR  = 1,
> - BTRFS_DC_CLEAR  = 2,
> - BTRFS_DC_SETUP  = 3,
> + BTRFS_DC_WRITTEN,
> + BTRFS_DC_ERROR,
> + BTRFS_DC_CLEAR,
> + BTRFS_DC_SETUP,
>  };
>  
>  struct btrfs_caching_control {
> @@ -2621,10 +2621,10 @@ static inline gfp_t btrfs_alloc_write_mask(struct 
> address_space *mapping)
>  /* extent-tree.c */
>  
>  enum btrfs_inline_ref_type {
> - BTRFS_REF_TYPE_INVALID = 0,
> - BTRFS_REF_TYPE_BLOCK =   1,
> - BTRFS_REF_TYPE_DATA =2,
> - BTRFS_REF_TYPE_ANY = 3,
> + BTRFS_REF_TYPE_INVALID,
> + BTRFS_REF_TYPE_BLOCK,
> + BTRFS_REF_TYPE_DATA,
> + BTRFS_REF_TYPE_ANY,
>  };
>  
>  int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 4cccba22640f..987a64bc0c66 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -21,11 +21,11 @@
>  #define BTRFS_BDEV_BLOCKSIZE (4096)
>  
>  enum btrfs_wq_endio_type {
> - BTRFS_WQ_ENDIO_DATA = 0,
> - BTRFS_WQ_ENDIO_METADATA = 1,
> - BTRFS_WQ_ENDIO_FREE_SPACE = 2,
> - BTRFS_WQ_ENDIO_RAID56 = 3,
> - BTRFS_WQ_ENDIO_DIO_REPAIR = 4,
> + BTRFS_WQ_ENDIO_DATA,
> + BTRFS_WQ_ENDIO_METADATA,
> + BTRFS_WQ_ENDIO_FREE_SPACE,
> + BTRFS_WQ_ENDIO_RAID56,
> + BTRFS_WQ_ENDIO_DIO_REPAIR,
>  };
>  
>  static inline u64 btrfs_sb_offset(int mirror)
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index d8f78f5ab854..e4e6ee44073a 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -70,7 +70,7 @@ struct btrfs_qgroup_extent_record {
>   *   be converted into META_PERTRANS.
>   */
>  enum btrfs_qgroup_rsv_type {
> - BTRFS_QGROUP_RSV_DATA = 0,
> + BTRFS_QGROUP_RSV_DATA,
>   BTRFS_QGROUP_RSV_META_PERTRANS,
>   BTRFS_QGROUP_RSV_META_PREALLOC,
>   BTRFS_QGROUP_RSV_LAST,
> diff --git a/fs/btrfs/sysfs.h b/fs/btrfs/sysfs.h
> index c6ee600aff89..40716b357c1d 100644
> --- a/fs/btrfs/sysfs.h
> +++ b/fs/btrfs/sysfs.h
> @@ -9,7 +9,7 @@
>  extern u64 btrfs_debugfs_test;
>  
>  enum btrfs_feature_set {
> - FEAT_COMPAT = 0,
> + FEAT_COMPAT,
>   FEAT_COMPAT_RO,
>   FEAT_INCOMPAT,
>   FEAT_MAX
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 703d5116a2fc..f1ba78949d1b 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -12,13 +12,13 @@
>  #include "ctree.h"
>  
>  enu

Re: [PATCH 0/9] Switch defines to enums

2018-11-27 Thread Qu Wenruo



On 2018/11/28 3:53 AM, David Sterba wrote:
> This is motivated by a merging mistake that happened a few releases ago.
> Two patches updated BTRFS_FS_* flags independently to the same value,
> git did not see any direct merge conflict. The two values got mixed at
> runtime and caused crash.
> 
> Update all #define sequential values, the above merging problem would
> not happen as there would be a conflict and the enum value
> auto-increment would prevent duplicated values anyway.

Just one small question for the bitmap usage.

With an enum we can't directly see what the last number is. My concern
is: if we're using a u16 as a bitmap and some enum value goes over 15,
will we get a compile-time warning, or would a bug just sneak into the
kernel?

Thanks,
Qu

> 
> David Sterba (9):
>   btrfs: switch BTRFS_FS_STATE_* to enums
>   btrfs: switch BTRFS_BLOCK_RSV_* to enums
>   btrfs: switch BTRFS_FS_* to enums
>   btrfs: switch BTRFS_ROOT_* to enums
>   btrfs: swtich EXTENT_BUFFER_* to enums
>   btrfs: switch EXTENT_FLAG_* to enums
>   btrfs: switch BTRFS_*_LOCK to enums
>   btrfs: switch BTRFS_ORDERED_* to enums
>   btrfs: drop extra enum initialization where using defaults
> 
>  fs/btrfs/btrfs_inode.h  |   2 +-
>  fs/btrfs/ctree.h| 168 ++--
>  fs/btrfs/disk-io.h  |  10 +--
>  fs/btrfs/extent_io.h|  28 ---
>  fs/btrfs/extent_map.h   |  21 +++--
>  fs/btrfs/locking.h  |  10 ++-
>  fs/btrfs/ordered-data.h |  45 ++-
>  fs/btrfs/qgroup.h   |   2 +-
>  fs/btrfs/sysfs.h|   2 +-
>  fs/btrfs/transaction.h  |  14 ++--
>  10 files changed, 169 insertions(+), 133 deletions(-)
> 




Re: [PATCH 8/9] btrfs: switch BTRFS_ORDERED_* to enums

2018-11-27 Thread Qu Wenruo



On 2018/11/28 3:53 AM, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> ordered extent flags.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/ordered-data.h | 45 +++--
>  1 file changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index b10e6765d88f..fb9a161f0215 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -37,26 +37,31 @@ struct btrfs_ordered_sum {
>   * rbtree, just before waking any waiters.  It is used to indicate the
>   * IO is done and any metadata is inserted into the tree.
>   */
> -#define BTRFS_ORDERED_IO_DONE 0 /* set when all the pages are written */
> -
> -#define BTRFS_ORDERED_COMPLETE 1 /* set when removed from the tree */
> -
> -#define BTRFS_ORDERED_NOCOW 2 /* set when we want to write in place */
> -
> -#define BTRFS_ORDERED_COMPRESSED 3 /* writing a zlib compressed extent */
> -
> -#define BTRFS_ORDERED_PREALLOC 4 /* set when writing to preallocated extent 
> */
> -
> -#define BTRFS_ORDERED_DIRECT 5 /* set when we're doing DIO with this extent 
> */
> -
> -#define BTRFS_ORDERED_IOERR 6 /* We had an io error when writing this out */
> -
> -#define BTRFS_ORDERED_UPDATED_ISIZE 7 /* indicates whether this ordered 
> extent
> -* has done its due diligence in updating
> -* the isize. */
> -#define BTRFS_ORDERED_TRUNCATED 8 /* Set when we have to truncate an extent 
> */
> -
> -#define BTRFS_ORDERED_REGULAR 10 /* Regular IO for COW */
> +enum {
> + /* set when all the pages are written */
> + BTRFS_ORDERED_IO_DONE,
> + /* set when removed from the tree */
> + BTRFS_ORDERED_COMPLETE,
> + /* set when we want to write in place */
> + BTRFS_ORDERED_NOCOW,
> + /* writing a zlib compressed extent */
> + BTRFS_ORDERED_COMPRESSED,
> + /* set when writing to preallocated extent */
> + BTRFS_ORDERED_PREALLOC,
> + /* set when we're doing DIO with this extent */
> + BTRFS_ORDERED_DIRECT,
> + /* We had an io error when writing this out */
> + BTRFS_ORDERED_IOERR,
> + /*
> +  * indicates whether this ordered extent has done its due diligence in
> +  * updating the isize
> +  */
> + BTRFS_ORDERED_UPDATED_ISIZE,
> + /* Set when we have to truncate an extent */
> + BTRFS_ORDERED_TRUNCATED,
> + /* Regular IO for COW */
> + BTRFS_ORDERED_REGULAR,
> +};
>  
>  struct btrfs_ordered_extent {
>   /* logical offset in the file */
> 




Re: [PATCH 7/9] btrfs: switch BTRFS_*_LOCK to enums

2018-11-27 Thread Qu Wenruo



On 2018/11/28 3:53 AM, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> tree lock types.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/locking.h | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
> index 29135def468e..684d0ef4faa4 100644
> --- a/fs/btrfs/locking.h
> +++ b/fs/btrfs/locking.h
> @@ -6,10 +6,12 @@
>  #ifndef BTRFS_LOCKING_H
>  #define BTRFS_LOCKING_H
>  
> -#define BTRFS_WRITE_LOCK 1
> -#define BTRFS_READ_LOCK 2
> -#define BTRFS_WRITE_LOCK_BLOCKING 3
> -#define BTRFS_READ_LOCK_BLOCKING 4
> +enum {
> + BTRFS_WRITE_LOCK,
> + BTRFS_READ_LOCK,
> + BTRFS_WRITE_LOCK_BLOCKING,
> + BTRFS_READ_LOCK_BLOCKING,
> +};
>  
>  void btrfs_tree_lock(struct extent_buffer *eb);
>  void btrfs_tree_unlock(struct extent_buffer *eb);
> 




Re: [PATCH 6/9] btrfs: switch EXTENT_FLAG_* to enums

2018-11-27 Thread Qu Wenruo



On 2018/11/28 3:53 AM, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> extent map flags.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/extent_map.h | 21 ++---
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index 31977ffd6190..ef05a0121652 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -11,13 +11,20 @@
>  #define EXTENT_MAP_INLINE ((u64)-2)
>  #define EXTENT_MAP_DELALLOC ((u64)-1)
>  
> -/* bits for the flags field */
> -#define EXTENT_FLAG_PINNED 0 /* this entry not yet on disk, don't free it */
> -#define EXTENT_FLAG_COMPRESSED 1
> -#define EXTENT_FLAG_PREALLOC 3 /* pre-allocated extent */
> -#define EXTENT_FLAG_LOGGING 4 /* Logging this extent */
> -#define EXTENT_FLAG_FILLING 5 /* Filling in a preallocated extent */
> -#define EXTENT_FLAG_FS_MAPPING 6 /* filesystem extent mapping type */
> +/* bits for the extent_map::flags field */
> +enum {
> + /* this entry not yet on disk, don't free it */
> + EXTENT_FLAG_PINNED,
> + EXTENT_FLAG_COMPRESSED,
> + /* pre-allocated extent */
> + EXTENT_FLAG_PREALLOC,
> + /* Logging this extent */
> + EXTENT_FLAG_LOGGING,
> + /* Filling in a preallocated extent */
> + EXTENT_FLAG_FILLING,
> + /* filesystem extent mapping type */
> + EXTENT_FLAG_FS_MAPPING,
> +};
>  
>  struct extent_map {
>   struct rb_node rb_node;
> 




Re: [PATCH 5/9] btrfs: swtich EXTENT_BUFFER_* to enums

2018-11-27 Thread Qu Wenruo



On 2018/11/28 3:53 AM, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> extent buffer flags;
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/extent_io.h | 28 
>  1 file changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index a1d3ea5a0d32..fd42492e62e5 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -37,18 +37,22 @@
>  #define EXTENT_BIO_COMPRESSED 1
>  #define EXTENT_BIO_FLAG_SHIFT 16
>  
> -/* these are bit numbers for test/set bit */
> -#define EXTENT_BUFFER_UPTODATE 0
> -#define EXTENT_BUFFER_DIRTY 2
> -#define EXTENT_BUFFER_CORRUPT 3
> -#define EXTENT_BUFFER_READAHEAD 4/* this got triggered by readahead */
> -#define EXTENT_BUFFER_TREE_REF 5
> -#define EXTENT_BUFFER_STALE 6
> -#define EXTENT_BUFFER_WRITEBACK 7
> -#define EXTENT_BUFFER_READ_ERR 8/* read IO error */
> -#define EXTENT_BUFFER_UNMAPPED 9
> -#define EXTENT_BUFFER_IN_TREE 10
> -#define EXTENT_BUFFER_WRITE_ERR 11/* write IO error */
> +enum {
> + EXTENT_BUFFER_UPTODATE,
> + EXTENT_BUFFER_DIRTY,
> + EXTENT_BUFFER_CORRUPT,
> + /* this got triggered by readahead */
> + EXTENT_BUFFER_READAHEAD,
> + EXTENT_BUFFER_TREE_REF,
> + EXTENT_BUFFER_STALE,
> + EXTENT_BUFFER_WRITEBACK,
> + /* read IO error */
> + EXTENT_BUFFER_READ_ERR,
> + EXTENT_BUFFER_UNMAPPED,
> + EXTENT_BUFFER_IN_TREE,
> + /* write IO error */
> + EXTENT_BUFFER_WRITE_ERR,
> +};
>  
>  /* these are flags for __process_pages_contig */
>  #define PAGE_UNLOCK  (1 << 0)
> 




Re: [PATCH 4/9] btrfs: switch BTRFS_ROOT_* to enums

2018-11-27 Thread Qu Wenruo



On 2018/11/28 3:53 AM, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> root tree flags.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/ctree.h | 33 +
>  1 file changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7176b95b40e7..4bb0ac3050ff 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1180,22 +1180,23 @@ struct btrfs_subvolume_writers {
>  /*
>   * The state of btrfs root
>   */
> -/*
> - * btrfs_record_root_in_trans is a multi-step process,
> - * and it can race with the balancing code.   But the
> - * race is very small, and only the first time the root
> - * is added to each transaction.  So IN_TRANS_SETUP
> - * is used to tell us when more checks are required
> - */
> -#define BTRFS_ROOT_IN_TRANS_SETUP0
> -#define BTRFS_ROOT_REF_COWS  1
> -#define BTRFS_ROOT_TRACK_DIRTY   2
> -#define BTRFS_ROOT_IN_RADIX  3
> -#define BTRFS_ROOT_ORPHAN_ITEM_INSERTED  4
> -#define BTRFS_ROOT_DEFRAG_RUNNING5
> -#define BTRFS_ROOT_FORCE_COW 6
> -#define BTRFS_ROOT_MULTI_LOG_TASKS   7
> -#define BTRFS_ROOT_DIRTY 8
> +enum {
> + /*
> +  * btrfs_record_root_in_trans is a multi-step process, and it can race
> +  * with the balancing code.   But the race is very small, and only the
> +  * first time the root is added to each transaction.  So IN_TRANS_SETUP
> +  * is used to tell us when more checks are required
> +  */
> + BTRFS_ROOT_IN_TRANS_SETUP,
> + BTRFS_ROOT_REF_COWS,
> + BTRFS_ROOT_TRACK_DIRTY,
> + BTRFS_ROOT_IN_RADIX,
> + BTRFS_ROOT_ORPHAN_ITEM_INSERTED,
> + BTRFS_ROOT_DEFRAG_RUNNING,
> + BTRFS_ROOT_FORCE_COW,
> + BTRFS_ROOT_MULTI_LOG_TASKS,
> + BTRFS_ROOT_DIRTY,
> +};
>  
>  /*
>   * in ram representation of the tree.  extent_root is used for all 
> allocations
> 




Re: [PATCH 3/9] btrfs: switch BTRFS_FS_* to enums

2018-11-27 Thread Qu Wenruo



On 2018/11/28 3:53 AM, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> internal filesystem states.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/ctree.h | 63 
>  1 file changed, 31 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 40c405d74a01..7176b95b40e7 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -757,38 +757,37 @@ struct btrfs_swapfile_pin {
>  
>  bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
>  
> -#define BTRFS_FS_BARRIER 1
> -#define BTRFS_FS_CLOSING_START   2
> -#define BTRFS_FS_CLOSING_DONE3
> -#define BTRFS_FS_LOG_RECOVERING  4
> -#define BTRFS_FS_OPEN5
> -#define BTRFS_FS_QUOTA_ENABLED   6
> -#define BTRFS_FS_UPDATE_UUID_TREE_GEN9
> -#define BTRFS_FS_CREATING_FREE_SPACE_TREE10
> -#define BTRFS_FS_BTREE_ERR   11
> -#define BTRFS_FS_LOG1_ERR12
> -#define BTRFS_FS_LOG2_ERR13
> -#define BTRFS_FS_QUOTA_OVERRIDE  14
> -/* Used to record internally whether fs has been frozen */
> -#define BTRFS_FS_FROZEN  15
> -
> -/*
> - * Indicate that a whole-filesystem exclusive operation is running
> - * (device replace, resize, device add/delete, balance)
> - */
> -#define BTRFS_FS_EXCL_OP 16
> -
> -/*
> - * To info transaction_kthread we need an immediate commit so it doesn't
> - * need to wait for commit_interval
> - */
> -#define BTRFS_FS_NEED_ASYNC_COMMIT   17
> -
> -/*
> - * Indicate that balance has been set up from the ioctl and is in the main
> - * phase. The fs_info::balance_ctl is initialized.
> - */
> -#define BTRFS_FS_BALANCE_RUNNING 18
> +enum {
> + BTRFS_FS_BARRIER,
> + BTRFS_FS_CLOSING_START,
> + BTRFS_FS_CLOSING_DONE,
> + BTRFS_FS_LOG_RECOVERING,
> + BTRFS_FS_OPEN,
> + BTRFS_FS_QUOTA_ENABLED,
> + BTRFS_FS_UPDATE_UUID_TREE_GEN,
> + BTRFS_FS_CREATING_FREE_SPACE_TREE,
> + BTRFS_FS_BTREE_ERR,
> + BTRFS_FS_LOG1_ERR,
> + BTRFS_FS_LOG2_ERR,
> + BTRFS_FS_QUOTA_OVERRIDE,
> + /* Used to record internally whether fs has been frozen */
> + BTRFS_FS_FROZEN,
> + /*
> +  * Indicate that a whole-filesystem exclusive operation is running
> +  * (device replace, resize, device add/delete, balance)
> +  */
> + BTRFS_FS_EXCL_OP,
> + /*
> +  * To info transaction_kthread we need an immediate commit so it
> +  * doesn't need to wait for commit_interval
> +  */
> + BTRFS_FS_NEED_ASYNC_COMMIT,
> + /*
> +  * Indicate that balance has been set up from the ioctl and is in the
> +  * main phase. The fs_info::balance_ctl is initialized.
> +  */
> + BTRFS_FS_BALANCE_RUNNING,
> +};
>  
>  struct btrfs_fs_info {
>   u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
> 




Re: [PATCH 2/9] btrfs: switch BTRFS_BLOCK_RSV_* to enums

2018-11-27 Thread Qu Wenruo



On 2018/11/28 3:53 AM, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> block reserve types.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Qu Wenruo 

However, more comments are always a good thing.

Thanks,
Qu

> ---
>  fs/btrfs/ctree.h | 19 ---
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f82ec5e41b0c..40c405d74a01 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -461,13 +461,18 @@ struct btrfs_space_info {
>   struct kobject *block_group_kobjs[BTRFS_NR_RAID_TYPES];
>  };
>  
> -#define  BTRFS_BLOCK_RSV_GLOBAL  1
> -#define  BTRFS_BLOCK_RSV_DELALLOC2
> -#define  BTRFS_BLOCK_RSV_TRANS   3
> -#define  BTRFS_BLOCK_RSV_CHUNK   4
> -#define  BTRFS_BLOCK_RSV_DELOPS  5
> -#define  BTRFS_BLOCK_RSV_EMPTY   6
> -#define  BTRFS_BLOCK_RSV_TEMP7
> +/*
> + * Types of block reserves
> + */
> +enum {
> + BTRFS_BLOCK_RSV_GLOBAL,
> + BTRFS_BLOCK_RSV_DELALLOC,
> + BTRFS_BLOCK_RSV_TRANS,
> + BTRFS_BLOCK_RSV_CHUNK,
> + BTRFS_BLOCK_RSV_DELOPS,
> + BTRFS_BLOCK_RSV_EMPTY,
> + BTRFS_BLOCK_RSV_TEMP,
> +};
>  
>  struct btrfs_block_rsv {
>   u64 size;
> 




Re: [PATCH 1/9] btrfs: switch BTRFS_FS_STATE_* to enums

2018-11-27 Thread Qu Wenruo



On 2018/11/28 3:53 AM, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> global filesystem states.
> 
> Signed-off-by: David Sterba 

Good comment.

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/ctree.h | 25 +++--
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index a98507fa9192..f82ec5e41b0c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -109,13 +109,26 @@ static inline unsigned long btrfs_chunk_item_size(int 
> num_stripes)
>  }
>  
>  /*
> - * File system states
> + * Runtime (in-memory) states of filesystem
>   */
> -#define BTRFS_FS_STATE_ERROR 0
> -#define BTRFS_FS_STATE_REMOUNTING1
> -#define BTRFS_FS_STATE_TRANS_ABORTED 2
> -#define BTRFS_FS_STATE_DEV_REPLACING 3
> -#define BTRFS_FS_STATE_DUMMY_FS_INFO 4
> +enum {
> + /* Global indicator of serious filesysystem errors */
> + BTRFS_FS_STATE_ERROR,
> + /*
> +  * Filesystem is being remounted, allow to skip some operations, like
> +  * defrag
> +  */
> + BTRFS_FS_STATE_REMOUNTING,
> + /* Track if the transaction abort has been reported */
> + BTRFS_FS_STATE_TRANS_ABORTED,
> + /*
> +  * Indicate that replace source or target device state is changed and
> +  * allow to block bio operations
> +  */
> + BTRFS_FS_STATE_DEV_REPLACING,
> + /* The btrfs_fs_info created for self-tests */
> + BTRFS_FS_STATE_DUMMY_FS_INFO,
> +};
>  
>  #define BTRFS_BACKREF_REV_MAX256
>  #define BTRFS_BACKREF_REV_SHIFT  56
> 




Re: Balance: invalid convert data profile raid10

2018-11-27 Thread Qu Wenruo



On 2018/11/28 5:16 AM, Mikko Merikivi wrote:
> I seem unable to convert an existing btrfs device array to RAID 10.
> Since it's pretty much RAID 0 and 1 combined, and 5 and 6 are
> unstable, it's what I would like to use.
> 
> After I did tried this with 4.19.2-arch1-1-ARCH and btrfs-progs v4.19,
> I updated my system and tried btrfs balance again with this system
> information:
> [mikko@localhost lvdata]$ uname -a
> Linux localhost 4.19.4-arch1-1-ARCH #1 SMP PREEMPT Fri Nov 23 09:06:58
> UTC 2018 x86_64 GNU/Linux
> [mikko@localhost lvdata]$ btrfs --version
> btrfs-progs v4.19
> [mikko@localhost lvdata]$ sudo btrfs fi show
> Label: 'main1'  uuid: c7cbb9c3-8c55-45f1-b03c-48992efe2f11
>         Total devices 1 FS bytes used 2.90TiB
>         devid    1 size 3.64TiB used 2.91TiB path /dev/mapper/main
> 
> Label: 'red'  uuid: f3c781a8-0f3e-4019-acbf-0b783cf566d0
>         Total devices 2 FS bytes used 640.00KiB
>         devid    1 size 931.51GiB used 2.03GiB path /dev/mapper/red1
>         devid    2 size 931.51GiB used 2.03GiB path /dev/mapper/red2

RAID10 needs at least 4 devices.
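
For illustration only, the device-count rule can be written as a small lookup table. This is a sketch, not the kernel's code: the names profile_reqs and can_convert are hypothetical, and the minimum counts are assumed to match kernels around 4.19.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table: minimum number of devices per profile (assumed, ~4.19). */
struct profile_req {
	const char *name;
	int devs_min;
};

static const struct profile_req profile_reqs[] = {
	{ "single", 1 },
	{ "dup",    1 },
	{ "raid0",  2 },
	{ "raid1",  2 },
	{ "raid10", 4 },
};

/*
 * Return 1 if a filesystem with @num_devices devices can hold @profile,
 * 0 if it has too few devices, and -1 if the profile name is unknown.
 */
static int can_convert(const char *profile, int num_devices)
{
	size_t i;

	for (i = 0; i < sizeof(profile_reqs) / sizeof(profile_reqs[0]); i++) {
		if (strcmp(profile_reqs[i].name, profile) == 0)
			return num_devices >= profile_reqs[i].devs_min;
	}
	return -1;
}
```

With the two devices shown above, can_convert("raid1", 2) succeeds while can_convert("raid10", 2) does not, which matches the balance result in the report.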

Thanks,
Qu

> [mikko@localhost lvdata]$ btrfs fi df /mnt/red/
> Data, RAID1: total=1.00GiB, used=512.00KiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=1.00GiB, used=112.00KiB
> GlobalReserve, single: total=16.00MiB, used=0.00B
> 
> ---
> 
> Here are the steps I originally used:
> 
> [mikko@localhost lvdata]$ sudo cryptsetup luksFormat -s 512
> --use-random /dev/sdc
> [mikko@localhost lvdata]$ sudo cryptsetup luksFormat -s 512
> --use-random /dev/sdd
> [mikko@localhost lvdata]$ sudo cryptsetup luksOpen /dev/sdc red1
> [mikko@localhost lvdata]$ sudo cryptsetup luksOpen /dev/sdd red2
> [mikko@localhost lvdata]$ sudo mkfs.btrfs -L red /dev/mapper/red1
> btrfs-progs v4.19
> See http://btrfs.wiki.kernel.org for more information.
> 
> Label:              red
> UUID:               f3c781a8-0f3e-4019-acbf-0b783cf566d0
> Node size:          16384
> Sector size:        4096
> Filesystem size:    931.51GiB
> Block group profiles:
>   Data:             single            8.00MiB
>   Metadata:         DUP               1.00GiB
>   System:           DUP               8.00MiB
> SSD detected:       no
> Incompat features:  extref, skinny-metadata
> Number of devices:  1
> Devices:
>    ID        SIZE  PATH
>     1   931.51GiB  /dev/mapper/red1
> 
> [mikko@localhost lvdata]$ sudo mount -t btrfs -o
> defaults,noatime,nodiratime,autodefrag,compress=lzo /dev/mapper/red1
> /mnt/red
> [mikko@localhost lvdata]$ sudo btrfs device add /dev/mapper/red2 /mnt/red
> [mikko@localhost lvdata]$ sudo btrfs balance start -dconvert=raid10
> -mconvert=raid10 /mnt/red
> ERROR: error during balancing '/mnt/red': Invalid argument
> There may be more info in syslog - try dmesg | tail
> code 1
> 
> [mikko@localhost lvdata]$ dmesg | tail
> [12026.263243] BTRFS info (device dm-1): disk space caching is enabled
> [12026.263244] BTRFS info (device dm-1): has skinny extents
> [12026.263245] BTRFS info (device dm-1): flagging fs with big metadata feature
> [12026.275153] BTRFS info (device dm-1): checking UUID tree
> [12195.431766] BTRFS info (device dm-1): enabling auto defrag
> [12195.431769] BTRFS info (device dm-1): use lzo compression, level 0
> [12195.431770] BTRFS info (device dm-1): disk space caching is enabled
> [12195.431771] BTRFS info (device dm-1): has skinny extents
> [12205.815941] BTRFS info (device dm-1): disk added /dev/mapper/red2
> [12744.788747] BTRFS error (device dm-1): balance: invalid convert
> data profile raid10
> 
> Converting to RAID 1 did work but what can I do to make it RAID 10?
> With the up-to-date system it still says "invalid convert data profile
> raid10".
> 




Re: Linux-next regression?

2018-11-27 Thread Qu Wenruo



On 2018/11/27 10:11 PM, Andrea Gelmini wrote:
> On Tue, Nov 27, 2018 at 09:13:02AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
>>>   One question: I can completely trust the ok return status of scrub? I 
>>> know is made for this, but shit happens...
>>
>> No, scrub only checks csum of data and tree blocks, it doesn't ensure
>> the content of tree blocks are OK.
> 
> Hi Qu,
>   and thanks a lot, really. Your answers are always the best: short,
>   detailed and very kind. You rock.
> 
>   I'm going to send a patch to propose to add your explanation above
>   on the relative man page, if you agree.
> 
>> For comprehensive check, go "btrfs check --readonly".
> 
>   I'll do it.
> 
>   At the moment I just compared the file existence between my laptop and
>   latest backup. Everything is fine.
> 
>>
>> However I don't think it's something "btrfs check --readonly" would
>> report, but some strange behavior, maybe from LVM or cryptsetup.
> 
>   Well, I'm using this setup with ext4 and xfs, on same machine, without
>   troubles.

Then it does look like something went wrong in linux-next.

I would recommend doing a bisect if possible.

Since you compared all the data on your laptop against the backup, your
csum and file trees are OK, so there is no corruption in those trees.
Still, something doesn't look right in the extent tree.

It's less of a concern since it hasn't reached the latest RC, so
if you can reproduce it reliably, I'd recommend doing a bisect.

Thanks,
Qu

>   I've got files checksummed on the backup machine, so I can be sure about
>   comparing integrity.
> 
> Anyway, thanks a lot again,
> Andrea
> 




Re: [PATCH v2 1/5] btrfs-progs: image: Refactor fixup_devices() to fixup_chunks_and_devices()

2018-11-27 Thread Qu Wenruo




On 2018/11/27 4:46 PM, Nikolay Borisov wrote:
> 
> 
> On 27.11.18 at 10:38, Qu Wenruo wrote:
>> Current fixup_devices() will only remove DEV_ITEMs and reset DEV_ITEM
>> size.
>> Later we need to do more fixup works, so change the name to
>> fixup_chunks_and_devices() and refactor the original device size fixup
>> to fixup_device_size().
>>
>> Signed-off-by: Qu Wenruo 
> 
> Reviewed-by: Nikolay Borisov 
> 
> However, one minor nit below.
> 
>> ---
>>  image/main.c | 52 
>>  1 file changed, 36 insertions(+), 16 deletions(-)
>>
>> diff --git a/image/main.c b/image/main.c
>> index c680ab19de6c..bbfcf8f19298 100644
>> --- a/image/main.c
>> +++ b/image/main.c
>> @@ -2084,28 +2084,19 @@ static void remap_overlapping_chunks(struct 
>> mdrestore_struct *mdres)
>>  }
>>  }
>>  
>> -static int fixup_devices(struct btrfs_fs_info *fs_info,
>> - struct mdrestore_struct *mdres, off_t dev_size)
>> +static int fixup_device_size(struct btrfs_trans_handle *trans,
>> + struct mdrestore_struct *mdres,
>> + off_t dev_size)
>>  {
>> -struct btrfs_trans_handle *trans;
>> +struct btrfs_fs_info *fs_info = trans->fs_info;
>>  struct btrfs_dev_item *dev_item;
>>  struct btrfs_path path;
>> -struct extent_buffer *leaf;
>>  struct btrfs_root *root = fs_info->chunk_root;
>>  struct btrfs_key key;
>> +struct extent_buffer *leaf;
> 
> nit: Unnecessary change

Doesn't it look better when all the btrfs_-prefixed declarations are grouped together? :)

Thanks,
Qu

> 
>>  u64 devid, cur_devid;
>>  int ret;
>>  
>> -if (btrfs_super_log_root(fs_info->super_copy)) {
>> -warning(
>> -"log tree detected, its generation will not match superblock");
>> -}
>> -trans = btrfs_start_transaction(fs_info->tree_root, 1);
>> -if (IS_ERR(trans)) {
>> -error("cannot starting transaction %ld", PTR_ERR(trans));
>> -return PTR_ERR(trans);
>> -}
>> -
>>  dev_item = &fs_info->super_copy->dev_item;
>>  
>>  devid = btrfs_stack_device_id(dev_item);
>> @@ -2123,7 +2114,7 @@ again:
>>  ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
>>  if (ret < 0) {
>>  error("search failed: %d", ret);
>> -exit(1);
>> +return ret;
>>  }
>>  
>>  while (1) {
>> @@ -2170,12 +2161,41 @@ again:
>>  }
>>  
>>  btrfs_release_path(&path);
>> +return 0;
>> +}
>> +
>> +static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
>> + struct mdrestore_struct *mdres, off_t dev_size)
>> +{
>> +struct btrfs_trans_handle *trans;
>> +int ret;
>> +
>> +if (btrfs_super_log_root(fs_info->super_copy)) {
>> +warning(
>> +"log tree detected, its generation will not match superblock");
>> +}
>> +trans = btrfs_start_transaction(fs_info->tree_root, 1);
>> +if (IS_ERR(trans)) {
>> +error("cannot starting transaction %ld", PTR_ERR(trans));
>> +return PTR_ERR(trans);
>> +}
>> +
>> +ret = fixup_device_size(trans, mdres, dev_size);
>> +if (ret < 0)
>> +goto error;
>> +
>>  ret = btrfs_commit_transaction(trans, fs_info->tree_root);
>>  if (ret) {
>>  error("unable to commit transaction: %d", ret);
>>  return ret;
>>  }
>>  return 0;
>> +error:
>> +error(
>> +"failed to fix chunks and devices mapping, the fs may not be mountable: %s",
>> +strerror(-ret));
>> +btrfs_abort_transaction(trans, ret);
>> +return ret;
>>  }
>>  
>>  static int restore_metadump(const char *input, FILE *out, int old_restore,
>> @@ -2282,7 +2302,7 @@ static int restore_metadump(const char *input, FILE 
>> *out, int old_restore,
>>  return 1;
>>  }
>>  
>> -ret = fixup_devices(info, &mdrestore, st.st_size);
>> +ret = fixup_chunks_and_devices(info, &mdrestore, st.st_size);
>>  close_ctree(info->chunk_root);
>>  if (ret)
>>  goto out;
>>

[PATCH v2 5/5] btrfs-progs: misc-tests/021: Do extra btrfs check before mounting

2018-11-27 Thread Qu Wenruo

Test case misc/021 tests whether we can mount a single-disk btrfs
image recovered from a multi-disk fs.

The problem is that the current kernel has extra checks for block group,
chunk and dev extent items.
This means any image that can't pass btrfs check for the chunk tree will
not mount.

So run an extra btrfs check before mounting; this also makes it easier to
locate problems in btrfs-image.

Signed-off-by: Qu Wenruo 
---
 tests/misc-tests/021-image-multi-devices/test.sh | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/misc-tests/021-image-multi-devices/test.sh b/tests/misc-tests/021-image-multi-devices/test.sh
index 5430847f4e2f..26beae6e4b85 100755
--- a/tests/misc-tests/021-image-multi-devices/test.sh
+++ b/tests/misc-tests/021-image-multi-devices/test.sh
@@ -37,6 +37,9 @@ run_check $SUDO_HELPER wipefs -a "$loop2"
 
 run_check $SUDO_HELPER "$TOP/btrfs-image" -r "$IMAGE" "$loop1"
 
+# Run check to make sure there is nothing wrong for the recovered image
+run_check "$TOP/btrfs" check "$loop1"
+
 run_check $SUDO_HELPER mount "$loop1" "$TEST_MNT"
 new_md5=$(run_check_stdout md5sum "$TEST_MNT/foobar" | cut -d ' ' -f 1)
 run_check $SUDO_HELPER umount "$TEST_MNT"
-- 
2.19.2

[PATCH v2 4/5] btrfs-progs: image: Remove all existing dev extents for later rebuild

2018-11-27 Thread Qu Wenruo

For a multi-disk btrfs image recovered to a single disk, the dev tree
looks like:
	item 2 key (1 DEV_EXTENT 22020096)
		dev extent chunk_tree 3
		chunk_objectid 256 chunk_offset 22020096 length 8388608
	item 3 key (1 DEV_EXTENT 30408704)
		dev extent chunk_tree 3
		chunk_objectid 256 chunk_offset 30408704 length 1073741824
	item 4 key (1 DEV_EXTENT 1104150528)
		dev extent chunk_tree 3
		chunk_objectid 256 chunk_offset 1104150528 length 536870912
	item 5 key (2 DEV_EXTENT 1048576)
		dev extent chunk_tree 3
		chunk_objectid 256 chunk_offset 22020096 length 8388608
	item 6 key (2 DEV_EXTENT 9437184)
		dev extent chunk_tree 3
		chunk_objectid 256 chunk_offset 30408704 length 1073741824
	item 7 key (2 DEV_EXTENT 1083179008)
		dev extent chunk_tree 3
		chunk_objectid 256 chunk_offset 1104150528 length 536870912

However, the chunk tree only uses devid 2, so the devid 1 entries are
complete garbage:
	item 0 key (DEV_ITEMS DEV_ITEM 2)
	item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 16105 itemsize 80
		length 8388608 owner 2 stripe_len 65536 type SYSTEM
		num_stripes 1 sub_stripes 0
			stripe 0 devid 2 offset 1048576
	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 16025 itemsize 80
		length 1073741824 owner 2 stripe_len 65536 type METADATA
		num_stripes 1 sub_stripes 0
			stripe 0 devid 2 offset 9437184
	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 1104150528) itemoff 15945 itemsize 80
		length 1073741824 owner 2 stripe_len 65536 type DATA
		num_stripes 1 sub_stripes 0
			stripe 0 devid 2 offset 1083179008

The most straightforward fix is to remove all existing dev extents and
then re-create the correct dev extents from the chunk tree.

This patch takes that approach, so the final dev extent layout matches
the chunk tree and btrfs check is happy.
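
As a rough sketch of the idea (not the btrfs-progs implementation), the rebuild amounts to starting from an empty set of dev extents and emitting one entry per chunk stripe. The toy_* types and toy_rebuild_dev_extents() below are hypothetical, and each chunk is assumed to have exactly one stripe, as in the SINGLE-profile chunk tree dump above.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified chunk with a single stripe (assumption for this sketch). */
struct toy_chunk {
	uint64_t logical;	/* chunk offset, i.e. the CHUNK_ITEM key offset */
	uint64_t length;
	uint64_t devid;		/* stripe 0 devid */
	uint64_t physical;	/* stripe 0 offset on that device */
};

struct toy_dev_extent {
	uint64_t devid;
	uint64_t physical;	/* DEV_EXTENT key offset */
	uint64_t chunk_offset;
	uint64_t length;
};

/* Values copied from the chunk tree dump above. */
static const struct toy_chunk sample_chunks[] = {
	{ 22020096ULL,   8388608ULL,    2, 1048576ULL },
	{ 30408704ULL,   1073741824ULL, 2, 9437184ULL },
	{ 1104150528ULL, 1073741824ULL, 2, 1083179008ULL },
};

/*
 * Ignore whatever dev extents existed before and fill @out purely from the
 * chunk map. Returns the number of dev extents written (at most @cap).
 */
static size_t toy_rebuild_dev_extents(const struct toy_chunk *chunks,
				      size_t nr_chunks,
				      struct toy_dev_extent *out, size_t cap)
{
	size_t n = 0;
	size_t i;

	for (i = 0; i < nr_chunks; i++) {
		if (n >= cap)
			break;
		out[n].devid = chunks[i].devid;
		out[n].physical = chunks[i].physical;
		out[n].chunk_offset = chunks[i].logical;
		out[n].length = chunks[i].length;
		n++;
	}
	return n;
}
```

Because the output depends only on the chunk map, stale entries such as the devid 1 dev extents cannot survive the rebuild.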

Signed-off-by: Qu Wenruo 
---
 image/main.c | 102 +++
 1 file changed, 102 insertions(+)

diff --git a/image/main.c b/image/main.c
index 9187de34f34a..8f756689a1fa 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2209,6 +2209,104 @@ static void fixup_block_groups(struct btrfs_fs_info 
*fs_info)
}
 }
 
+static int remove_all_dev_extents(struct btrfs_trans_handle *trans)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_root *root = fs_info->dev_root;
+	struct btrfs_path path;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	int slot;
+	int ret;
+
+	key.objectid = 1;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+	key.offset = 0;
+	btrfs_init_path(&path);
+
+	ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
+	if (ret < 0) {
+		error("failed to search dev tree: %s", strerror(-ret));
+		return ret;
+	}
+
+	while (1) {
+		slot = path.slots[0];
+		leaf = path.nodes[0];
+		if (slot >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(root, &path);
+			if (ret < 0) {
+				error("failed to search dev tree: %s",
+					strerror(-ret));
+				goto out;
+			}
+			if (ret > 0) {
+				ret = 0;
+				goto out;
+			}
+		}
+
+		btrfs_item_key_to_cpu(leaf, &key, slot);
+		if (key.type != BTRFS_DEV_EXTENT_KEY)
+			break;
+		ret = btrfs_del_item(trans, root, &path);
+		if (ret < 0) {
+			error("failed to delete dev extent %llu, %llu: %s",
+				key.objectid, key.offset, strerror(-ret));
+			goto out;
+		}
+	}
+out:
+	btrfs_release_path(&path);
+	return ret;
+}
+
+static int fixup_dev_extents(struct btrfs_trans_handle *trans)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+	struct btrfs_device *dev;
+	struct cache_extent *ce;
+	struct map_lookup *map;
+	u64 devid = btrfs_stack_device_id(&fs_info->super_copy->dev_item);
+	int i;
+	int ret;
+
+	ret = remove_all_dev_extents(trans);
+	if (ret < 0)
+		error("failed to remove all existing dev extents: %s",
+			strerror(-ret));
+
+   dev = btrfs_find_device(f

[PATCH v2 0/5] btrfs-progs: image: Fix error when restoring multi-disk image to single disk

2018-11-27 Thread Qu Wenruo

This patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/image_recover

The base commit is as usual, the latest stable tag, v4.19.


Test case misc/021 fails when using the latest upstream kernel.

This is due to the enhanced kernel code that checks the block group <->
chunk <-> dev extent mapping.

This means that, from the very beginning, btrfs-image could never really
restore a multi-disk image to a single disk correctly.

The problem is that although we modified the chunk items, we didn't modify
the block group items' flags or the dev extents.

This patchset resets the block group flags, then rebuilds all dev extents
by first removing the existing ones and then re-adding the correct
dev extents calculated from the chunk map.

With it, all misc tests and fsck tests pass.

Changelog:
v2:
- Parameter list cleanup
  * Use trans->fs_info to remove fs_info parameter
  * Remove the trans parameter from functions that don't need it
- Merge dev extents removal code with rebuild code
- Refactor btrfs_alloc_dev_extent() into 2 functions
  * btrfs_insert_dev_extent() for convert and dev extent rebuild
  * btrfs_alloc_dev_extent() for old use case
  
Qu Wenruo (5):
  btrfs-progs: image: Refactor fixup_devices() to
fixup_chunks_and_devices()
  btrfs-progs: image: Fix block group item flags when restoring
multi-device image to single device
  btrfs-progs: volumes: Refactor btrfs_alloc_dev_extent() into two
functions
  btrfs-progs: image: Remove all existing dev extents for later rebuild
  btrfs-progs: misc-tests/021: Do extra btrfs check before mounting

 image/main.c  | 200 --
 .../021-image-multi-devices/test.sh   |   3 +
 volumes.c |  48 +++--
 volumes.h |   3 +
 4 files changed, 220 insertions(+), 34 deletions(-)

-- 
2.19.2

[PATCH v2 3/5] btrfs-progs: volumes: Refactor btrfs_alloc_dev_extent() into two functions

2018-11-27 Thread Qu Wenruo

btrfs_alloc_dev_extent() accepts a @convert flag to toggle special
handling for convert.

However, the @convert flag only determines whether we call
find_free_dev_extent(), and later we may need to insert dev extents
without searching the dev tree.

So split btrfs_alloc_dev_extent() into two functions:
btrfs_alloc_dev_extent(), which tries to find a free dev extent, and
btrfs_insert_dev_extent(), which just inserts a dev extent.

btrfs_alloc_dev_extent() calls btrfs_insert_dev_extent() internally, so
no code is duplicated.

This removes the need for the @convert parameter and makes
btrfs_insert_dev_extent() public for later use.
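
The shape of the split can be sketched with a toy model. Everything here is hypothetical (toy_device, toy_find_free_dev_extent() and friends are not btrfs-progs APIs), and the free-space search is deliberately simplistic; the point is only that the "alloc" variant is the search plus a call to the "insert" variant.

```c
#include <stdint.h>

#define TOY_MAX_EXTENTS 16

struct toy_extent {
	uint64_t start;
	uint64_t len;
	uint64_t chunk_offset;
};

struct toy_device {
	struct toy_extent extents[TOY_MAX_EXTENTS];
	int nr;
};

/* Toy search: the first byte past every recorded extent counts as free. */
static int toy_find_free_dev_extent(const struct toy_device *dev,
				    uint64_t *start)
{
	uint64_t end = 0;
	int i;

	for (i = 0; i < dev->nr; i++) {
		uint64_t e = dev->extents[i].start + dev->extents[i].len;

		if (e > end)
			end = e;
	}
	*start = end;
	return 0;
}

/* Insert only: the caller guarantees [start, start + num_bytes) is free. */
static int toy_insert_dev_extent(struct toy_device *dev, uint64_t chunk_offset,
				 uint64_t num_bytes, uint64_t start)
{
	if (dev->nr >= TOY_MAX_EXTENTS)
		return -1;
	dev->extents[dev->nr].start = start;
	dev->extents[dev->nr].len = num_bytes;
	dev->extents[dev->nr].chunk_offset = chunk_offset;
	dev->nr++;
	return 0;
}

/* Allocate: search for free space first, then reuse the insert helper. */
static int toy_alloc_dev_extent(struct toy_device *dev, uint64_t chunk_offset,
				uint64_t num_bytes, uint64_t *start)
{
	int ret;

	ret = toy_find_free_dev_extent(dev, start);
	if (ret)
		return ret;
	return toy_insert_dev_extent(dev, chunk_offset, num_bytes, *start);
}
```

A caller that already knows the physical offset (convert, or a dev extent rebuild) calls the insert helper directly and passes the offset by value, so no boolean flag is needed.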

Signed-off-by: Qu Wenruo 
---
 volumes.c | 48 ++--
 volumes.h |  3 +++
 2 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/volumes.c b/volumes.c
index 30090ce5f8e8..0dd082cd1718 100644
--- a/volumes.c
+++ b/volumes.c
@@ -530,10 +530,12 @@ static int find_free_dev_extent(struct btrfs_device 
*device, u64 num_bytes,
return find_free_dev_extent_start(device, num_bytes, 0, start, len);
 }
 
-static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
- struct btrfs_device *device,
- u64 chunk_offset, u64 num_bytes, u64 *start,
- int convert)
+/*
+ * Insert one device extent into the fs.
+ */
+int btrfs_insert_dev_extent(struct btrfs_trans_handle *trans,
+   struct btrfs_device *device,
+   u64 chunk_offset, u64 num_bytes, u64 start)
 {
int ret;
struct btrfs_path *path;
@@ -546,18 +548,8 @@ static int btrfs_alloc_dev_extent(struct 
btrfs_trans_handle *trans,
if (!path)
return -ENOMEM;
 
-   /*
-* For convert case, just skip search free dev_extent, as caller
-* is responsible to make sure it's free.
-*/
-   if (!convert) {
-   ret = find_free_dev_extent(device, num_bytes, start, NULL);
-   if (ret)
-   goto err;
-   }
-
 	key.objectid = device->devid;
-	key.offset = *start;
+	key.offset = start;
 	key.type = BTRFS_DEV_EXTENT_KEY;
 	ret = btrfs_insert_empty_item(trans, root, path, &key,
 				      sizeof(*extent));
@@ -583,6 +575,22 @@ err:
return ret;
 }
 
+/*
+ * Allocate one free dev extent and insert it into the fs.
+ */
+static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device,
+ u64 chunk_offset, u64 num_bytes, u64 *start)
+{
+   int ret;
+
+   ret = find_free_dev_extent(device, num_bytes, start, NULL);
+   if (ret)
+   return ret;
+   return btrfs_insert_dev_extent(trans, device, chunk_offset, num_bytes,
+   *start);
+}
+
 static int find_next_chunk(struct btrfs_fs_info *fs_info, u64 *offset)
 {
struct btrfs_root *root = fs_info->chunk_root;
@@ -1107,7 +1115,7 @@ again:
 		list_move_tail(&device->dev_list, dev_list);
 
 		ret = btrfs_alloc_dev_extent(trans, device, key.offset,
-					     calc_size, &dev_offset, 0);
+					     calc_size, &dev_offset);
if (ret < 0)
goto out_chunk_map;
 
@@ -1241,8 +1249,12 @@ int btrfs_alloc_data_chunk(struct btrfs_trans_handle 
*trans,
while (index < num_stripes) {
struct btrfs_stripe *stripe;
 
-		ret = btrfs_alloc_dev_extent(trans, device, key.offset,
-					     calc_size, &dev_offset, convert);
+		if (convert)
+			ret = btrfs_insert_dev_extent(trans, device, key.offset,
+					calc_size, dev_offset);
+		else
+			ret = btrfs_alloc_dev_extent(trans, device, key.offset,
+					calc_size, &dev_offset);
BUG_ON(ret);
 
device->bytes_used += calc_size;
diff --git a/volumes.h b/volumes.h
index b4ea93f0bec3..44284ee75adb 100644
--- a/volumes.h
+++ b/volumes.h
@@ -268,6 +268,9 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
   int flags);
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
 void btrfs_close_all_devices(void);
+int btrfs_insert_dev_extent(struct btrfs_trans_handle *trans,
+   struct btrfs_device *device,
+   u64 chunk_offset, u64 num_bytes, u64 start);
 int btrfs_add_device(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info,
 struct btrfs_device *device);
-- 
2.19.2

[PATCH v2 2/5] btrfs-progs: image: Fix block group item flags when restoring multi-device image to single device

2018-11-27 Thread Qu Wenruo

Since btrfs-image just restores tree blocks without really checking whether
their contents make sense, the block group items of a multi-device image
keep their now-incorrect block group flags.

For example, for a btrfs with RAID1 metadata and RAID0 data recovered to a
single disk, the chunk tree looks like:

item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096)
length 8388608 owner 2 stripe_len 65536 type SYSTEM
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704)
length 1073741824 owner 2 stripe_len 65536 type METADATA
item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 1104150528)
length 1073741824 owner 2 stripe_len 65536 type DATA

All chunks have correct type (SINGLE).

While its block group items will look like:

item 1 key (22020096 BLOCK_GROUP_ITEM 8388608)
block group used 16384 chunk_objectid 256 flags SYSTEM|RAID1
item 3 key (30408704 BLOCK_GROUP_ITEM 1073741824)
block group used 114688 chunk_objectid 256 flags METADATA|RAID1
item 11 key (1104150528 BLOCK_GROUP_ITEM 1073741824)
block group used 1572864 chunk_objectid 256 flags DATA|RAID0

All block group items still have the wrong profiles.

And btrfs check (lowmem mode, for better output) reports errors for such an
image:

  ERROR: chunk[22020096 30408704) related block group item flags mismatch, wanted: 2, have: 18
  ERROR: chunk[30408704 1104150528) related block group item flags mismatch, wanted: 4, have: 20
  ERROR: chunk[1104150528 2177892352) related block group item flags mismatch, wanted: 1, have: 9

This patch does an extra repair pass over the block group items to fix
their profiles.
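
To read the "wanted"/"have" numbers, it helps to decode them as bit flags. The helper below is only an illustration: describe_bg_flags() and the shortened BG_* names are hypothetical, while the bit positions follow the on-disk block group flag layout (type bits in the low three bits, profile bits above them).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shortened, hypothetical names for the block group type and profile bits. */
#define BG_DATA		(1ULL << 0)
#define BG_SYSTEM	(1ULL << 1)
#define BG_METADATA	(1ULL << 2)
#define BG_RAID0	(1ULL << 3)
#define BG_RAID1	(1ULL << 4)
#define BG_DUP		(1ULL << 5)
#define BG_RAID10	(1ULL << 6)

static const struct {
	uint64_t bit;
	const char *name;
} bg_flag_names[] = {
	{ BG_DATA, "DATA" },
	{ BG_SYSTEM, "SYSTEM" },
	{ BG_METADATA, "METADATA" },
	{ BG_RAID0, "RAID0" },
	{ BG_RAID1, "RAID1" },
	{ BG_DUP, "DUP" },
	{ BG_RAID10, "RAID10" },
};

/* Write a "TYPE|PROFILE" style description of @flags into @buf. */
static void describe_bg_flags(uint64_t flags, char *buf, size_t len)
{
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';
	for (i = 0; i < sizeof(bg_flag_names) / sizeof(bg_flag_names[0]); i++) {
		if (!(flags & bg_flag_names[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, bg_flag_names[i].name, len - strlen(buf) - 1);
	}
}
```

Decoded this way, 18 is SYSTEM|RAID1 while the chunk wants plain SYSTEM (2), 20 is METADATA|RAID1 versus METADATA (4), and 9 is DATA|RAID0 versus DATA (1), which is the leftover profile bit the patch clears.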

Signed-off-by: Qu Wenruo 
---
 image/main.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/image/main.c b/image/main.c
index bbfcf8f19298..9187de34f34a 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2164,6 +2164,51 @@ again:
return 0;
 }
 
+static void fixup_block_groups(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_block_group_cache *bg;
+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+	struct cache_extent *ce;
+	struct map_lookup *map;
+	u64 extra_flags;
+
+	for (ce = search_cache_extent(&map_tree->cache_tree, 0); ce;
+	     ce = next_cache_extent(ce)) {
+		map = container_of(ce, struct map_lookup, ce);
+
+		bg = btrfs_lookup_block_group(fs_info, ce->start);
+		if (!bg) {
+			warning(
+		"can't find block group %llu, result fs may not be mountable",
+				ce->start);
+			continue;
+		}
+		extra_flags = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+		if (bg->flags == map->type)
+			continue;
+
+		/* Update the block group item and mark the bg dirty */
+		bg->flags = map->type;
+		btrfs_set_block_group_flags(&bg->item, bg->flags);
+		set_extent_bits(&fs_info->block_group_cache, ce->start,
+				ce->start + ce->size - 1, BLOCK_GROUP_DIRTY);
+
+		/*
+		 * Chunk and bg flags can be different, changing bg flags
+		 * without update avail_data/meta_alloc_bits will lead to
+		 * ENOSPC.
+		 * So here we set avail_*_alloc_bits to match chunk types.
+		 */
+		if (map->type & BTRFS_BLOCK_GROUP_DATA)
+			fs_info->avail_data_alloc_bits = extra_flags;
+		if (map->type & BTRFS_BLOCK_GROUP_METADATA)
+			fs_info->avail_metadata_alloc_bits = extra_flags;
+		if (map->type & BTRFS_BLOCK_GROUP_SYSTEM)
+			fs_info->avail_system_alloc_bits = extra_flags;
+	}
+}
+
 static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
 struct mdrestore_struct *mdres, off_t dev_size)
 {
@@ -2180,6 +2225,7 @@ static int fixup_chunks_and_devices(struct btrfs_fs_info 
*fs_info,
return PTR_ERR(trans);
}
 
+   fixup_block_groups(fs_info);
ret = fixup_device_size(trans, mdres, dev_size);
if (ret < 0)
goto error;
-- 
2.19.2

[PATCH v2 1/5] btrfs-progs: image: Refactor fixup_devices() to fixup_chunks_and_devices()

2018-11-27 Thread Qu Wenruo

Currently fixup_devices() only removes DEV_ITEMs and resets the DEV_ITEM
size.
Later patches need to do more fixup work, so rename it to
fixup_chunks_and_devices() and move the original device size fixup
into fixup_device_size().

Signed-off-by: Qu Wenruo 
---
 image/main.c | 52 
 1 file changed, 36 insertions(+), 16 deletions(-)

diff --git a/image/main.c b/image/main.c
index c680ab19de6c..bbfcf8f19298 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2084,28 +2084,19 @@ static void remap_overlapping_chunks(struct 
mdrestore_struct *mdres)
}
 }
 
-static int fixup_devices(struct btrfs_fs_info *fs_info,
-struct mdrestore_struct *mdres, off_t dev_size)
+static int fixup_device_size(struct btrfs_trans_handle *trans,
+struct mdrestore_struct *mdres,
+off_t dev_size)
 {
-   struct btrfs_trans_handle *trans;
+   struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_dev_item *dev_item;
struct btrfs_path path;
-   struct extent_buffer *leaf;
struct btrfs_root *root = fs_info->chunk_root;
struct btrfs_key key;
+   struct extent_buffer *leaf;
u64 devid, cur_devid;
int ret;
 
-   if (btrfs_super_log_root(fs_info->super_copy)) {
-   warning(
-   "log tree detected, its generation will not match superblock");
-   }
-   trans = btrfs_start_transaction(fs_info->tree_root, 1);
-   if (IS_ERR(trans)) {
-   error("cannot starting transaction %ld", PTR_ERR(trans));
-   return PTR_ERR(trans);
-   }
-
dev_item = &fs_info->super_copy->dev_item;
 
devid = btrfs_stack_device_id(dev_item);
@@ -2123,7 +2114,7 @@ again:
ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
if (ret < 0) {
error("search failed: %d", ret);
-   exit(1);
+   return ret;
}
 
while (1) {
@@ -2170,12 +2161,41 @@ again:
}
 
btrfs_release_path(&path);
+   return 0;
+}
+
+static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
+struct mdrestore_struct *mdres, off_t dev_size)
+{
+   struct btrfs_trans_handle *trans;
+   int ret;
+
+   if (btrfs_super_log_root(fs_info->super_copy)) {
+   warning(
+   "log tree detected, its generation will not match superblock");
+   }
+   trans = btrfs_start_transaction(fs_info->tree_root, 1);
+   if (IS_ERR(trans)) {
+   error("cannot starting transaction %ld", PTR_ERR(trans));
+   return PTR_ERR(trans);
+   }
+
+   ret = fixup_device_size(trans, mdres, dev_size);
+   if (ret < 0)
+   goto error;
+
ret = btrfs_commit_transaction(trans, fs_info->tree_root);
if (ret) {
error("unable to commit transaction: %d", ret);
return ret;
}
return 0;
+error:
+   error(
+"failed to fix chunks and devices mapping, the fs may not be mountable: %s",
+   strerror(-ret));
+   btrfs_abort_transaction(trans, ret);
+   return ret;
 }
 
 static int restore_metadump(const char *input, FILE *out, int old_restore,
@@ -2282,7 +2302,7 @@ static int restore_metadump(const char *input, FILE *out, 
int old_restore,
return 1;
}
 
-   ret = fixup_devices(info, &mdrestore, st.st_size);
+   ret = fixup_chunks_and_devices(info, &mdrestore, st.st_size);
close_ctree(info->chunk_root);
if (ret)
goto out;
-- 
2.19.2

Re: [PATCH 5/5] btrfs-progs: misc-tests/021: Do extra btrfs check before mounting

2018-11-26 Thread Qu Wenruo




On 2018/11/27 3:29 PM, Nikolay Borisov wrote:
> 
> 
> On 27.11.18 at 4:33, Qu Wenruo wrote:
>> Test case misc/021 is testing if we could mount a single disk btrfs
>> image recovered from multi disk fs.
>>
>> The problem is, current kernel has extra check for block group, chunk
>> and dev extent.
>> This means any image can't pass btrfs check for chunk tree will not
>> mount.
>>
>> So do extra btrfs check before mount, this will also help us to locate
>> the problem in btrfs-image easier.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  tests/misc-tests/021-image-multi-devices/test.sh | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/tests/misc-tests/021-image-multi-devices/test.sh 
>> b/tests/misc-tests/021-image-multi-devices/test.sh
>> index 5430847f4e2f..26beae6e4b85 100755
>> --- a/tests/misc-tests/021-image-multi-devices/test.sh
>> +++ b/tests/misc-tests/021-image-multi-devices/test.sh
>> @@ -37,6 +37,9 @@ run_check $SUDO_HELPER wipefs -a "$loop2"
>>  
>>  run_check $SUDO_HELPER "$TOP/btrfs-image" -r "$IMAGE" "$loop1"
>>  
>> +# Run check to make sure there is nothing wrong for the recovered image
>> +run_check "$TOP/btrfs" check "$loop1"
> 
> I think this needs to be run_check $SUDO_HELPER "$TOP/btrfs" check "$loop1"

For a read-only check that's OK; I always run the tests as a normal
user and there is no privilege problem.

Thanks,
Qu

>> +
>>  run_check $SUDO_HELPER mount "$loop1" "$TEST_MNT"
>>  new_md5=$(run_check_stdout md5sum "$TEST_MNT/foobar" | cut -d ' ' -f 1)
>>  run_check $SUDO_HELPER umount "$TEST_MNT"
>>

Re: [PATCH 4/5] btrfs-progs: image: Rebuild dev extents using chunk tree

2018-11-26 Thread Qu Wenruo




On 2018/11/27 3:28 PM, Nikolay Borisov wrote:
> 
> 
> On 27.11.18 at 4:33, Qu Wenruo wrote:
>> With existing dev extents cleaned up, now we can rebuild dev extents
>> using the correct chunk tree.
>>
>> Since new dev extents are all rebuild from scratch, even we're restoring
>> image from multi-device fs to single disk, we won't have any problem
>> reported by btrfs check.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  image/main.c | 34 ++
>>  volumes.c| 10 +-
>>  volumes.h|  4 
>>  3 files changed, 43 insertions(+), 5 deletions(-)
>>
>> diff --git a/image/main.c b/image/main.c
>> index 707568f22e01..626eb933d5cc 100644
>> --- a/image/main.c
>> +++ b/image/main.c
>> @@ -2265,12 +2265,46 @@ out:
>>  static int fixup_dev_extents(struct btrfs_trans_handle *trans,
>>   struct btrfs_fs_info *fs_info)
>>  {
>> +struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
>> +struct btrfs_device *dev;
>> +struct cache_extent *ce;
>> +struct map_lookup *map;
>> +u64 devid = btrfs_stack_device_id(&fs_info->super_copy->dev_item);
>> +int i;
>>  int ret;
>>  
>>  ret = remove_all_dev_extents(trans, fs_info);
>>  if (ret < 0)
>>  error("failed to remove all existing dev extents: %s",
>>  strerror(-ret));
>> +
>> +dev = btrfs_find_device(fs_info, devid, NULL, NULL);
>> +if (!dev) {
>> +error("faild to find devid %llu", devid);
>> +return -ENODEV;
>> +}
>> +
>> +/* Rebuild all dev extents using chunk maps */
>> +for (ce = search_cache_extent(&map_tree->cache_tree, 0); ce;
>> + ce = next_cache_extent(ce)) {
>> +u64 stripe_len;
>> +
>> +map = container_of(ce, struct map_lookup, ce);
>> +stripe_len = calc_stripe_length(map->type, ce->size,
>> +map->num_stripes);
>> +for (i = 0; i < map->num_stripes; i++) {
>> +ret = btrfs_alloc_dev_extent(trans, dev, ce->start,
>> +stripe_len, &map->stripes[i].physical, 1);
>> +if (ret < 0) {
>> +error(
>> +"failed to insert dev extent %llu %llu: %s",
>> +devid, map->stripes[i].physical,
>> +strerror(-ret));
>> +goto out;
>> +}
>> +}
>> +}
>> +out:
>>  return ret;
>>  }
>>  
>> diff --git a/volumes.c b/volumes.c
>> index 30090ce5f8e8..73c9204fa7d1 100644
>> --- a/volumes.c
>> +++ b/volumes.c
>> @@ -530,10 +530,10 @@ static int find_free_dev_extent(struct btrfs_device 
>> *device, u64 num_bytes,
>>  return find_free_dev_extent_start(device, num_bytes, 0, start, len);
>>  }
>>  
>> -static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
>> -  struct btrfs_device *device,
>> -  u64 chunk_offset, u64 num_bytes, u64 *start,
>> -  int convert)
>> +int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
>> +   struct btrfs_device *device,
>> +   u64 chunk_offset, u64 num_bytes, u64 *start,
>> +   int insert_asis)
> 
> Make that parameter a bool. Also why do you rename it ?

Because it's no longer used only by convert.

The best option may be two functions: btrfs_insert_device_extent() and
btrfs_alloc_device_extent().

For convert and for this use case, we are not allocating a dev extent,
just inserting one.

What do you think of that naming?

Thanks,
Qu

> 
>>  {
>>  int ret;
>>  struct btrfs_path *path;
>> @@ -550,7 +550,7 @@ static int btrfs_alloc_dev_extent(struct 
>> btrfs_trans_handle *trans,
>>   * For convert case, just skip search free dev_extent, as caller
>>   * is responsible to make sure it's free.
>>   */
>> -if (!convert) {
>> +if (!insert_asis) {
>>  ret = find_free_dev_extent(device, num_bytes, start, NULL);
>>  if (ret)
>>  goto err;
>> diff --git a/volumes.h b/volumes.h
>> index b4ea93f0bec3..5ca2779ebd45 100644
>> --- a/volumes.h
>> +++ b/volumes.h
>> @@ -271,6 +271,10 @@ void btrfs_close_all_devices(void);
>>  int btrfs_add_device(struct btrfs_trans_handle *trans,
>>   struct btrfs_fs_info *fs_info,
>>   struct btrfs_device *device);
>> +int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
>> +   struct btrfs_device *device,
>> +   u64 chunk_offset, u64 num_bytes, u64 *start,
>> +   int insert_asis);
>>  int btrfs_update_device(struct btrfs_trans_handle *trans,
>>  struct btrfs_device *device);
>>  int btrfs_scan_one_device(int fd, const char *path,
>>

Re: [PATCH 2/5] btrfs-progs: image: Fix block group item flags when restoring multi-device image to single device

2018-11-26 Thread Qu Wenruo




On 2018/11/27 3:15 PM, Nikolay Borisov wrote:
> 
> 
> On 27.11.18 at 4:33, Qu Wenruo wrote:
>> Since btrfs-image is just restoring tree blocks without really check if
>> that tree block contents makes sense, for multi-device image, block
>> group items will keep that incorrect block group flags.
>>
>> For example, for a metadata RAID1 data RAID0 btrfs recovered to a single
>> disk, its chunk tree will look like:
>>
>>  item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096)
>>  length 8388608 owner 2 stripe_len 65536 type SYSTEM
>>  item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704)
>>  length 1073741824 owner 2 stripe_len 65536 type METADATA
>>  item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 1104150528)
>>  length 1073741824 owner 2 stripe_len 65536 type DATA
>>
>> All chunks have correct type (SINGLE).
>>
>> While its block group items will look like:
>>
>>  item 1 key (22020096 BLOCK_GROUP_ITEM 8388608)
>>  block group used 16384 chunk_objectid 256 flags SYSTEM|RAID1
>>  item 3 key (30408704 BLOCK_GROUP_ITEM 1073741824)
>>  block group used 114688 chunk_objectid 256 flags METADATA|RAID1
>>  item 11 key (1104150528 BLOCK_GROUP_ITEM 1073741824)
>>  block group used 1572864 chunk_objectid 256 flags DATA|RAID0
>>
>> All block group items still have the wrong profiles.
>>
>> And btrfs check (lowmem mode for better output) will report error for such 
>> image:
>>
>>   ERROR: chunk[22020096 30408704) related block group item flags mismatch, 
>> wanted: 2, have: 18
>>   ERROR: chunk[30408704 1104150528) related block group item flags mismatch, 
>> wanted: 4, have: 20
>>   ERROR: chunk[1104150528 2177892352) related block group item flags 
>> mismatch, wanted: 1, have: 9
>>
>> This patch will do an extra repair for block group items to fix the
>> profile of them.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  image/main.c | 47 +++
>>  1 file changed, 47 insertions(+)
>>
>> diff --git a/image/main.c b/image/main.c
>> index 36b5c95ea146..9060f6b1b665 100644
>> --- a/image/main.c
>> +++ b/image/main.c
>> @@ -2164,6 +2164,52 @@ again:
>>  return 0;
>>  }
>>  
>> +static void fixup_block_groups(struct btrfs_trans_handle *trans,
>> +  struct btrfs_fs_info *fs_info)
> 
> You are not even using the trans handle in this function, why pass it?

Bad habit again.

Will definitely do something with that.

Thanks,
Qu

> 
>> +{
>> +struct btrfs_block_group_cache *bg;
>> +struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
>> +struct cache_extent *ce;
>> +struct map_lookup *map;
>> +u64 extra_flags;
>> +
>> +for (ce = search_cache_extent(&map_tree->cache_tree, 0); ce;
>> + ce = next_cache_extent(ce)) {
>> +map = container_of(ce, struct map_lookup, ce);
>> +
>> +bg = btrfs_lookup_block_group(fs_info, ce->start);
>> +if (!bg) {
>> +warning(
>> +"can't find block group %llu, result fs may not be mountable",
>> +ce->start);
>> +continue;
>> +}
>> +extra_flags = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
>> +
>> +if (bg->flags == map->type)
>> +continue;
>> +
>> +/* Update the block group item and mark the bg dirty */
>> +bg->flags = map->type;
>> +btrfs_set_block_group_flags(&bg->item, bg->flags);
>> +set_extent_bits(&fs_info->block_group_cache, ce->start,
>> +ce->start + ce->size - 1, BLOCK_GROUP_DIRTY);
>> +
>> +/*
>> + * Chunk and bg flags can be different, changing bg flags
>> + * without update avail_data/meta_alloc_bits will lead to
>> + * ENOSPC.
>> + * So here we set avail_*_alloc_bits to match chunk types.
>> + */
>> +if (map->type & BTRFS_BLOCK_GROUP_DATA)
>> +fs_info->avail_data_alloc_bits = extra_flags;
>> +if (map->type & BTRFS_BLOCK_GROUP_METADATA)
>> +fs_info->avail_metadata_alloc_bits = extra_flags;
>> +if (map->type & BTRFS_BLOCK_GROUP_SYSTEM)
>> +fs_info->avail_system_alloc_bits = extra_flags;
>> +}
>> +}
>> +
>>  static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
>>   struct mdrestore_struct *mdres, off_t dev_size)
>>  {
>> @@ -2180,6 +2226,7 @@ static int fixup_chunks_and_devices(struct 
>> btrfs_fs_info *fs_info,
>>  return PTR_ERR(trans);
>>  }
>>  
>> +fixup_block_groups(trans, fs_info);
>>  ret = fixup_device_size(trans, fs_info, mdres, dev_size);
>>  if (ret < 0)
>>  goto error;
>>

Re: [PATCH 1/5] btrfs-progs: image: Refactor fixup_devices() to fixup_chunks_and_devices()

2018-11-26 Thread Qu Wenruo




On 2018/11/27 3:13 PM, Nikolay Borisov wrote:
> 
> 
> On 27.11.18 at 4:33, Qu Wenruo wrote:
>> Current fixup_devices() will only remove DEV_ITEMs and reset DEV_ITEM
>> size.
>> Later we need to do more fixup works, so change the name to
>> fixup_chunks_and_devices() and refactor the original device size fixup
>> to fixup_device_size().
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  image/main.c | 52 
>>  1 file changed, 36 insertions(+), 16 deletions(-)
>>
>> diff --git a/image/main.c b/image/main.c
>> index c680ab19de6c..36b5c95ea146 100644
>> --- a/image/main.c
>> +++ b/image/main.c
>> @@ -2084,28 +2084,19 @@ static void remap_overlapping_chunks(struct 
>> mdrestore_struct *mdres)
>>  }
>>  }
>>  
>> -static int fixup_devices(struct btrfs_fs_info *fs_info,
>> - struct mdrestore_struct *mdres, off_t dev_size)
>> +static int fixup_device_size(struct btrfs_trans_handle *trans,
>> + struct btrfs_fs_info *fs_info,
> 
> trans already has a handle to the fs_info so you can drop it from the
> param list.

Indeed! My bad habit of passing trans and then fs_info definitely needs
to be corrected.

Thanks,
Qu

> 
>> + struct mdrestore_struct *mdres,
>> + off_t dev_size)
>>  {
>> -struct btrfs_trans_handle *trans;
>>  struct btrfs_dev_item *dev_item;
>>  struct btrfs_path path;
>> -struct extent_buffer *leaf;
>>  struct btrfs_root *root = fs_info->chunk_root;
>>  struct btrfs_key key;
>> +struct extent_buffer *leaf;
>>  u64 devid, cur_devid;
>>  int ret;
>>  
>> -if (btrfs_super_log_root(fs_info->super_copy)) {
>> -warning(
>> -"log tree detected, its generation will not match superblock");
>> -}
>> -trans = btrfs_start_transaction(fs_info->tree_root, 1);
>> -if (IS_ERR(trans)) {
>> -error("cannot starting transaction %ld", PTR_ERR(trans));
>> -return PTR_ERR(trans);
>> -}
>> -
>>  dev_item = &fs_info->super_copy->dev_item;
>>  
>>  devid = btrfs_stack_device_id(dev_item);
>> @@ -2123,7 +2114,7 @@ again:
>>  ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
>>  if (ret < 0) {
>>  error("search failed: %d", ret);
>> -exit(1);
>> +return ret;
>>  }
>>  
>>  while (1) {
>> @@ -2170,12 +2161,41 @@ again:
>>  }
>>  
>>  btrfs_release_path(&path);
>> +return 0;
>> +}
>> +
>> +static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
>> + struct mdrestore_struct *mdres, off_t dev_size)
>> +{
>> +struct btrfs_trans_handle *trans;
>> +int ret;
>> +
>> +if (btrfs_super_log_root(fs_info->super_copy)) {
>> +warning(
>> +"log tree detected, its generation will not match superblock");
>> +}
>> +trans = btrfs_start_transaction(fs_info->tree_root, 1);
>> +if (IS_ERR(trans)) {
>> +error("cannot starting transaction %ld", PTR_ERR(trans));
>> +return PTR_ERR(trans);
>> +}
>> +
>> +ret = fixup_device_size(trans, fs_info, mdres, dev_size);
>> +if (ret < 0)
>> +goto error;
>> +
>>  ret = btrfs_commit_transaction(trans, fs_info->tree_root);
>>  if (ret) {
>>  error("unable to commit transaction: %d", ret);
>>  return ret;
>>  }
>>  return 0;
>> +error:
>> +error(
>> +"failed to fix chunks and devices mapping, the fs may not be mountable: %s",
>> +strerror(-ret));
>> +btrfs_abort_transaction(trans, ret);
>> +return ret;
>>  }
>>  
>>  static int restore_metadump(const char *input, FILE *out, int old_restore,
>> @@ -2282,7 +2302,7 @@ static int restore_metadump(const char *input, FILE 
>> *out, int old_restore,
>>  return 1;
>>  }
>>  
>> -ret = fixup_devices(info, , st.st_size);
>> +ret = fixup_chunks_and_devices(info, , st.st_size);
>>  close_ctree(info->chunk_root);
>>  if (ret)
>>  goto out;
>>

[PATCH 5/5] btrfs-progs: misc-tests/021: Do extra btrfs check before mounting

2018-11-26 Thread Qu Wenruo

Test case misc/021 tests whether we can mount a single-disk btrfs
image recovered from a multi-disk fs.

The problem is that the current kernel has extra checks for block groups,
chunks and dev extents.
This means any image that can't pass the btrfs check of its chunk tree
will not mount.

So run an extra btrfs check before mounting; this also makes it easier to
locate problems in btrfs-image.

Signed-off-by: Qu Wenruo 
---
 tests/misc-tests/021-image-multi-devices/test.sh | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/misc-tests/021-image-multi-devices/test.sh 
b/tests/misc-tests/021-image-multi-devices/test.sh
index 5430847f4e2f..26beae6e4b85 100755
--- a/tests/misc-tests/021-image-multi-devices/test.sh
+++ b/tests/misc-tests/021-image-multi-devices/test.sh
@@ -37,6 +37,9 @@ run_check $SUDO_HELPER wipefs -a "$loop2"
 
 run_check $SUDO_HELPER "$TOP/btrfs-image" -r "$IMAGE" "$loop1"
 
+# Run check to make sure there is nothing wrong for the recovered image
+run_check "$TOP/btrfs" check "$loop1"
+
 run_check $SUDO_HELPER mount "$loop1" "$TEST_MNT"
 new_md5=$(run_check_stdout md5sum "$TEST_MNT/foobar" | cut -d ' ' -f 1)
 run_check $SUDO_HELPER umount "$TEST_MNT"
-- 
2.19.2

[PATCH 3/5] btrfs-progs: image: Remove all existing dev extents for later rebuild

2018-11-26 Thread Qu Wenruo

This patch will remove all existing dev extents for later rebuild.

Signed-off-by: Qu Wenruo 
---
 image/main.c | 68 
 1 file changed, 68 insertions(+)

diff --git a/image/main.c b/image/main.c
index 9060f6b1b665..707568f22e01 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2210,6 +2210,70 @@ static void fixup_block_groups(struct btrfs_trans_handle 
*trans,
}
 }
 
+static int remove_all_dev_extents(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_root *root = fs_info->dev_root;
+   struct btrfs_path path;
+   struct btrfs_key key;
+   struct extent_buffer *leaf;
+   int slot;
+   int ret;
+
+   key.objectid = 1;
+   key.type = BTRFS_DEV_EXTENT_KEY;
+   key.offset = 0;
+   btrfs_init_path(&path);
+
+   ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
+   if (ret < 0) {
+   error("failed to search dev tree: %s", strerror(-ret));
+   return ret;
+   }
+
+   while (1) {
+   slot = path.slots[0];
+   leaf = path.nodes[0];
+   if (slot >= btrfs_header_nritems(leaf)) {
+   ret = btrfs_next_leaf(root, &path);
+   if (ret < 0) {
+   error("failed to search dev tree: %s",
+   strerror(-ret));
+   goto out;
+   }
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   }
+
+   btrfs_item_key_to_cpu(leaf, &key, slot);
+   if (key.type != BTRFS_DEV_EXTENT_KEY)
+   break;
+   ret = btrfs_del_item(trans, root, &path);
+   if (ret < 0) {
+   error("failed to delete dev extent %llu, %llu: %s",
+   key.objectid, key.offset, strerror(-ret));
+   goto out;
+   }
+   }
+out:
+   btrfs_release_path(&path);
+   return ret;
+}
+
+static int fixup_dev_extents(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info)
+{
+   int ret;
+
+   ret = remove_all_dev_extents(trans, fs_info);
+   if (ret < 0)
+   error("failed to remove all existing dev extents: %s",
+   strerror(-ret));
+   return ret;
+}
+
 static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
 struct mdrestore_struct *mdres, off_t dev_size)
 {
@@ -2227,6 +2291,10 @@ static int fixup_chunks_and_devices(struct btrfs_fs_info 
*fs_info,
}
 
fixup_block_groups(trans, fs_info);
+   ret = fixup_dev_extents(trans, fs_info);
+   if (ret < 0)
+   goto error;
+
ret = fixup_device_size(trans, fs_info, mdres, dev_size);
if (ret < 0)
goto error;
-- 
2.19.2

[PATCH 2/5] btrfs-progs: image: Fix block group item flags when restoring multi-device image to single device

2018-11-26 Thread Qu Wenruo

Since btrfs-image just restores tree blocks without checking whether
their contents make sense, block group items of a multi-device image
keep their now-incorrect block group flags.

For example, for a metadata RAID1 data RAID0 btrfs recovered to a single
disk, its chunk tree will look like:

item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096)
length 8388608 owner 2 stripe_len 65536 type SYSTEM
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704)
length 1073741824 owner 2 stripe_len 65536 type METADATA
item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 1104150528)
length 1073741824 owner 2 stripe_len 65536 type DATA

All chunks have correct type (SINGLE).

While its block group items will look like:

item 1 key (22020096 BLOCK_GROUP_ITEM 8388608)
block group used 16384 chunk_objectid 256 flags SYSTEM|RAID1
item 3 key (30408704 BLOCK_GROUP_ITEM 1073741824)
block group used 114688 chunk_objectid 256 flags METADATA|RAID1
item 11 key (1104150528 BLOCK_GROUP_ITEM 1073741824)
block group used 1572864 chunk_objectid 256 flags DATA|RAID0

All block group items still have the wrong profiles.

btrfs check (lowmem mode, for better output) reports errors for such an
image:

  ERROR: chunk[22020096 30408704) related block group item flags mismatch, wanted: 2, have: 18
  ERROR: chunk[30408704 1104150528) related block group item flags mismatch, wanted: 4, have: 20
  ERROR: chunk[1104150528 2177892352) related block group item flags mismatch, wanted: 1, have: 9

This patch adds an extra repair step for block group items to fix their
profiles.
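
The "wanted"/"have" numbers in the check output are raw block group flag
bits. A minimal sketch of how they decode, assuming the on-disk flag values
of the btrfs format (DATA = 1, SYSTEM = 2, METADATA = 4, RAID0 = 8,
RAID1 = 16). The BG_* names, profile_of() and as_single() are hypothetical
helpers for illustration, not btrfs-progs API:

```c
#include <stdint.h>

/* Block group type and profile bits; values assumed from the btrfs on-disk format. */
#define BG_DATA     (1ULL << 0)
#define BG_SYSTEM   (1ULL << 1)
#define BG_METADATA (1ULL << 2)
#define BG_RAID0    (1ULL << 3)
#define BG_RAID1    (1ULL << 4)
/* Only the two profiles that appear in this example. */
#define BG_PROFILE_MASK (BG_RAID0 | BG_RAID1)

/* Return only the profile bits of a flags value; 0 means SINGLE. */
uint64_t profile_of(uint64_t flags)
{
	return flags & BG_PROFILE_MASK;
}

/* Return the flags with the profile bits cleared, i.e. the SINGLE variant. */
uint64_t as_single(uint64_t flags)
{
	return flags & ~BG_PROFILE_MASK;
}
```

Under these assumptions 18 is SYSTEM|RAID1 and as_single(18) is 2, which
lines up with "wanted: 2, have: 18" for the first chunk.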

Signed-off-by: Qu Wenruo 
---
 image/main.c | 47 +++
 1 file changed, 47 insertions(+)

diff --git a/image/main.c b/image/main.c
index 36b5c95ea146..9060f6b1b665 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2164,6 +2164,52 @@ again:
return 0;
 }
 
+static void fixup_block_groups(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_block_group_cache *bg;
+   struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+   struct cache_extent *ce;
+   struct map_lookup *map;
+   u64 extra_flags;
+
+   for (ce = search_cache_extent(&map_tree->cache_tree, 0); ce;
+ce = next_cache_extent(ce)) {
+   map = container_of(ce, struct map_lookup, ce);
+
+   bg = btrfs_lookup_block_group(fs_info, ce->start);
+   if (!bg) {
+   warning(
+   "can't find block group %llu, result fs may not be mountable",
+   ce->start);
+   continue;
+   }
+   extra_flags = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+   if (bg->flags == map->type)
+   continue;
+
+   /* Update the block group item and mark the bg dirty */
+   bg->flags = map->type;
+   btrfs_set_block_group_flags(&bg->item, bg->flags);
+   set_extent_bits(&fs_info->block_group_cache, ce->start,
+   ce->start + ce->size - 1, BLOCK_GROUP_DIRTY);
+
+   /*
+* Chunk and bg flags can be different, changing bg flags
+* without update avail_data/meta_alloc_bits will lead to
+* ENOSPC.
+* So here we set avail_*_alloc_bits to match chunk types.
+*/
+   if (map->type & BTRFS_BLOCK_GROUP_DATA)
+   fs_info->avail_data_alloc_bits = extra_flags;
+   if (map->type & BTRFS_BLOCK_GROUP_METADATA)
+   fs_info->avail_metadata_alloc_bits = extra_flags;
+   if (map->type & BTRFS_BLOCK_GROUP_SYSTEM)
+   fs_info->avail_system_alloc_bits = extra_flags;
+   }
+}
+
 static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
 struct mdrestore_struct *mdres, off_t dev_size)
 {
@@ -2180,6 +2226,7 @@ static int fixup_chunks_and_devices(struct btrfs_fs_info 
*fs_info,
return PTR_ERR(trans);
}
 
+   fixup_block_groups(trans, fs_info);
ret = fixup_device_size(trans, fs_info, mdres, dev_size);
if (ret < 0)
goto error;
-- 
2.19.2

[PATCH 4/5] btrfs-progs: image: Rebuild dev extents using chunk tree

2018-11-26 Thread Qu Wenruo

With the existing dev extents cleaned up, we can now rebuild dev extents
using the correct chunk tree.

Since the new dev extents are all rebuilt from scratch, btrfs check will
not report any problem even when we restore an image from a multi-device
fs to a single disk.
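
The rebuild step can be pictured with a toy model. The sketch below is not
the btrfs-progs code: struct toy_chunk, struct toy_dev_extent and
rebuild_dev_extents() are hypothetical, and it assumes every restored chunk
is SINGLE with exactly one stripe, so the dev extent length equals the chunk
length.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one chunk mapping (single stripe). */
struct toy_chunk {
	uint64_t logical;   /* chunk start in the logical address space */
	uint64_t length;    /* chunk length in bytes */
	uint64_t physical;  /* where its only stripe starts on the device */
};

/* Hypothetical, simplified stand-in for a dev extent item. */
struct toy_dev_extent {
	uint64_t physical;      /* start on the device */
	uint64_t length;        /* bytes covered on the device */
	uint64_t chunk_offset;  /* logical start of the owning chunk */
};

/*
 * Emit one dev extent per chunk into out[]. Returns the number of
 * dev extents written, or -1 if out[] is too small.
 */
int rebuild_dev_extents(const struct toy_chunk *chunks, size_t nr_chunks,
			struct toy_dev_extent *out, size_t out_cap)
{
	size_t i;

	if (nr_chunks > out_cap)
		return -1;
	for (i = 0; i < nr_chunks; i++) {
		out[i].physical = chunks[i].physical;
		out[i].length = chunks[i].length;
		out[i].chunk_offset = chunks[i].logical;
	}
	return (int)nr_chunks;
}
```

Because each dev extent is derived directly from a chunk, the chunk <-> dev
extent mapping is consistent by construction in this model.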

Signed-off-by: Qu Wenruo 
---
 image/main.c | 34 ++
 volumes.c| 10 +-
 volumes.h|  4 
 3 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/image/main.c b/image/main.c
index 707568f22e01..626eb933d5cc 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2265,12 +2265,46 @@ out:
 static int fixup_dev_extents(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info)
 {
+   struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+   struct btrfs_device *dev;
+   struct cache_extent *ce;
+   struct map_lookup *map;
+   u64 devid = btrfs_stack_device_id(&fs_info->super_copy->dev_item);
+   int i;
int ret;
 
ret = remove_all_dev_extents(trans, fs_info);
if (ret < 0)
error("failed to remove all existing dev extents: %s",
strerror(-ret));
+
+   dev = btrfs_find_device(fs_info, devid, NULL, NULL);
+   if (!dev) {
+   error("failed to find devid %llu", devid);
+   return -ENODEV;
+   }
+
+   /* Rebuild all dev extents using chunk maps */
+   for (ce = search_cache_extent(&map_tree->cache_tree, 0); ce;
+ce = next_cache_extent(ce)) {
+   u64 stripe_len;
+
+   map = container_of(ce, struct map_lookup, ce);
+   stripe_len = calc_stripe_length(map->type, ce->size,
+   map->num_stripes);
+   for (i = 0; i < map->num_stripes; i++) {
+   ret = btrfs_alloc_dev_extent(trans, dev, ce->start,
+   stripe_len, &map->stripes[i].physical, 1);
+   if (ret < 0) {
+   error(
+   "failed to insert dev extent %llu %llu: %s",
+   devid, map->stripes[i].physical,
+   strerror(-ret));
+   goto out;
+   }
+   }
+   }
+out:
return ret;
 }
 
diff --git a/volumes.c b/volumes.c
index 30090ce5f8e8..73c9204fa7d1 100644
--- a/volumes.c
+++ b/volumes.c
@@ -530,10 +530,10 @@ static int find_free_dev_extent(struct btrfs_device 
*device, u64 num_bytes,
return find_free_dev_extent_start(device, num_bytes, 0, start, len);
 }
 
-static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
- struct btrfs_device *device,
- u64 chunk_offset, u64 num_bytes, u64 *start,
- int convert)
+int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
+  struct btrfs_device *device,
+  u64 chunk_offset, u64 num_bytes, u64 *start,
+  int insert_asis)
 {
int ret;
struct btrfs_path *path;
@@ -550,7 +550,7 @@ static int btrfs_alloc_dev_extent(struct btrfs_trans_handle 
*trans,
 * For convert case, just skip search free dev_extent, as caller
 * is responsible to make sure it's free.
 */
-   if (!convert) {
+   if (!insert_asis) {
ret = find_free_dev_extent(device, num_bytes, start, NULL);
if (ret)
goto err;
diff --git a/volumes.h b/volumes.h
index b4ea93f0bec3..5ca2779ebd45 100644
--- a/volumes.h
+++ b/volumes.h
@@ -271,6 +271,10 @@ void btrfs_close_all_devices(void);
 int btrfs_add_device(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info,
 struct btrfs_device *device);
+int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
+  struct btrfs_device *device,
+  u64 chunk_offset, u64 num_bytes, u64 *start,
+  int insert_asis);
 int btrfs_update_device(struct btrfs_trans_handle *trans,
struct btrfs_device *device);
 int btrfs_scan_one_device(int fd, const char *path,
-- 
2.19.2

[PATCH 1/5] btrfs-progs: image: Refactor fixup_devices() to fixup_chunks_and_devices()

2018-11-26 Thread Qu Wenruo

The current fixup_devices() only removes DEV_ITEMs and resets the DEV_ITEM
size.
Later patches need to do more fixup work, so rename it to
fixup_chunks_and_devices() and move the original device size fixup into
fixup_device_size().

Signed-off-by: Qu Wenruo 
---
 image/main.c | 52 
 1 file changed, 36 insertions(+), 16 deletions(-)

diff --git a/image/main.c b/image/main.c
index c680ab19de6c..36b5c95ea146 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2084,28 +2084,19 @@ static void remap_overlapping_chunks(struct 
mdrestore_struct *mdres)
}
 }
 
-static int fixup_devices(struct btrfs_fs_info *fs_info,
-struct mdrestore_struct *mdres, off_t dev_size)
+static int fixup_device_size(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info,
+struct mdrestore_struct *mdres,
+off_t dev_size)
 {
-   struct btrfs_trans_handle *trans;
struct btrfs_dev_item *dev_item;
struct btrfs_path path;
-   struct extent_buffer *leaf;
struct btrfs_root *root = fs_info->chunk_root;
struct btrfs_key key;
+   struct extent_buffer *leaf;
u64 devid, cur_devid;
int ret;
 
-   if (btrfs_super_log_root(fs_info->super_copy)) {
-   warning(
-   "log tree detected, its generation will not match superblock");
-   }
-   trans = btrfs_start_transaction(fs_info->tree_root, 1);
-   if (IS_ERR(trans)) {
-   error("cannot starting transaction %ld", PTR_ERR(trans));
-   return PTR_ERR(trans);
-   }
-
dev_item = &fs_info->super_copy->dev_item;
 
devid = btrfs_stack_device_id(dev_item);
@@ -2123,7 +2114,7 @@ again:
ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
if (ret < 0) {
error("search failed: %d", ret);
-   exit(1);
+   return ret;
}
 
while (1) {
@@ -2170,12 +2161,41 @@ again:
}
 
btrfs_release_path(&path);
+   return 0;
+}
+
+static int fixup_chunks_and_devices(struct btrfs_fs_info *fs_info,
+struct mdrestore_struct *mdres, off_t dev_size)
+{
+   struct btrfs_trans_handle *trans;
+   int ret;
+
+   if (btrfs_super_log_root(fs_info->super_copy)) {
+   warning(
+   "log tree detected, its generation will not match superblock");
+   }
+   trans = btrfs_start_transaction(fs_info->tree_root, 1);
+   if (IS_ERR(trans)) {
+   error("cannot starting transaction %ld", PTR_ERR(trans));
+   return PTR_ERR(trans);
+   }
+
+   ret = fixup_device_size(trans, fs_info, mdres, dev_size);
+   if (ret < 0)
+   goto error;
+
ret = btrfs_commit_transaction(trans, fs_info->tree_root);
if (ret) {
error("unable to commit transaction: %d", ret);
return ret;
}
return 0;
+error:
+   error(
+"failed to fix chunks and devices mapping, the fs may not be mountable: %s",
+   strerror(-ret));
+   btrfs_abort_transaction(trans, ret);
+   return ret;
 }
 
 static int restore_metadump(const char *input, FILE *out, int old_restore,
@@ -2282,7 +2302,7 @@ static int restore_metadump(const char *input, FILE *out, 
int old_restore,
return 1;
}
 
-   ret = fixup_devices(info, , st.st_size);
+   ret = fixup_chunks_and_devices(info, , st.st_size);
close_ctree(info->chunk_root);
if (ret)
goto out;
-- 
2.19.2

[PATCH 0/5] btrfs-progs: image: Fix error when restoring multi-disk image to single disk

2018-11-26 Thread Qu Wenruo

This patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/image_recover

The base commit is as usual, the latest stable tag, v4.19.


Test case misc/021 fails when run on the latest upstream kernel.

This is due to the enhanced kernel code that checks the block group <->
chunk <-> dev extent mapping.

It means that, from the very beginning, btrfs-image couldn't really
restore a multi-disk image to a single disk correctly.

The problem is that although we modified the chunk items, we didn't modify
the block group items' flags or the dev extents.

This patchset resets the block group flags, then rebuilds all dev extents
by first removing the existing ones and then re-adding the correct dev
extents calculated from the chunk map.

With this, all misc tests and fsck tests pass.

Qu Wenruo (5):
  btrfs-progs: image: Refactor fixup_devices() to
fixup_chunks_and_devices()
  btrfs-progs: image: Fix block group item flags when restoring
multi-device image to single device
  btrfs-progs: image: Remove all existing dev extents for later rebuild
  btrfs-progs: image: Rebuild dev extents using chunk tree
  btrfs-progs: misc-tests/021: Do extra btrfs check before mounting

 image/main.c  | 201 --
 .../021-image-multi-devices/test.sh   |   3 +
 volumes.c |  10 +-
 volumes.h |   4 +
 4 files changed, 197 insertions(+), 21 deletions(-)

-- 
2.19.2

Re: Linux-next regression?

2018-11-26 Thread Qu Wenruo



On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
> Hi everybody,
>and thanks a lot for your work.
> 
>I'm using BTRFS over LVM over cryptsetup, over Samsung SSD 860 EVO (latest 
> git of btrfs-progs).
>Usually I run kernel in development, because I know BTRFS is young and 
> there are still lots of bugs and corner case to fix.
> 
>Anyway, I just want to submit to you a - maybe - useful info.
> 
>Yesterday I compiled and booted latest linux-next,¹ and I've got this:
> 
> ---
> nov 26 01:18:22 glet kernel: Btrfs loaded, crc32c=crc32c-intel
> nov 26 01:18:22 glet kernel: BTRFS: device label home devid 1 transid 32759 
> /dev/mapper/cry-home
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): force lzo compression, 
> level 0
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): disk space caching is 
> enabled
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): has skinny extents
> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): bad tree block start, 
> want 2152002191360 have 8829432654847901262

This means we failed to read one extent tree block, which caused the
problem.

If you're using the default mkfs profile, it should retry using the
extra copy, but that doesn't seem to be the case here.

BTW, does it happen every time, or only intermittently?
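
For reference, a minimal sketch of the kind of check behind the "bad tree
block start" message, assuming "want" is the logical address being read and
"have" is the address recorded in the block's own header. struct toy_header
and check_tree_block_start() are hypothetical, and endianness conversion is
ignored:

```c
#include <stdint.h>
#include <string.h>

/* Toy tree block header holding only the field relevant here (hypothetical layout). */
struct toy_header {
	uint64_t bytenr;  /* logical address the block believes it lives at */
};

/*
 * Return 0 if the block read from address `want` records the same address
 * in its header, -1 otherwise. The recorded address is stored in *have.
 */
int check_tree_block_start(const void *block, uint64_t want, uint64_t *have)
{
	struct toy_header hdr;

	memcpy(&hdr, block, sizeof(hdr));
	*have = hdr.bytenr;
	return hdr.bytenr == want ? 0 : -1;
}
```

A "have" value that looks like random data, as in the log above, suggests
the bytes returned for that address were not a tree block at all.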

> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): failed to read block 
> groups: -5
> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): open_ctree failed
> ---
> 
>Now, rebooting with 4.19.0-041900 (downloaded from here)², or 4.20-rc4 
> (compiled on this machine), the problem disappears.
> 
>Now, running scrub a few times, and copying data (all files of the logical 
> volume) to external device, gives no complain
Would you please also try "btrfs check --readonly"?

> 
>Here I stop. This is my primary dev laptop, and at the moment I can't 
> spend time switching/rebooting/testing. I'm comparing the data with last 
> backup (I rsync each hour), but it takes time (it's more then 3TB).
> 
>So, that was about to let you know. Well, it's Ubuntu 18.10, and between 
> reboots no dist-upgrade or changes in booting related packages or systemd.
> 
>   One question: I can completely trust the ok return status of scrub? I know 
> is made for this, but shit happens...

No, scrub only verifies the csums of data and tree blocks; it doesn't
ensure that the contents of tree blocks are OK.

For a comprehensive check, run "btrfs check --readonly".

However, I don't think this is something "btrfs check --readonly" would
report; it looks more like some strange behavior, maybe from LVM or
cryptsetup.
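
To illustrate the difference, here is a toy sketch (not btrfs code) of a
block whose checksum is valid while its contents are not. toy_csum() is
FNV-1a used as a stand-in for the real checksum, and struct toy_block,
csum_ok() and keys_sorted() are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in checksum (FNV-1a); btrfs itself uses crc32c by default. */
uint32_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/* Hypothetical toy tree block: a checksum followed by a few keys. */
struct toy_block {
	uint32_t csum;
	uint64_t keys[4];
};

/* What a scrub-like pass verifies: the stored csum matches the payload. */
int csum_ok(const struct toy_block *b)
{
	return b->csum == toy_csum((const uint8_t *)b->keys, sizeof(b->keys));
}

/* What a check-like pass also verifies: keys are in strictly ascending order. */
int keys_sorted(const struct toy_block *b)
{
	size_t i;

	for (i = 1; i < 4; i++)
		if (b->keys[i - 1] >= b->keys[i])
			return 0;
	return 1;
}
```

A block written with keys 1, 5, 3, 9 and a checksum computed over exactly
those bytes passes csum_ok() but fails keys_sorted(); only the second kind
of pass would notice it.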

Thanks,
Qu

> 
> Kisses,
> Gelma   
> 
> -
> ¹ commit:  8c9733fd9806c71e7f2313a280f98cb3051f93df
>   "Add linux-next specific files for 20181123"
> ² http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19/
> 




Re: unable to fixup (regular) error

2018-11-26 Thread Qu Wenruo



On 2018/11/26 3:19 PM, Alexander Fieroch wrote:
> Hi,
> 
> My data partition with btrfs RAID 0 (/dev/sdc0 and /dev/sdd0) shows
> errors in syslog:
> 
> BTRFS error (device sdc): cleaner transaction attach returned -30
> BTRFS info (device sdc): disk space caching is enabled
> BTRFS info (device sdc): has skinny extents
> BTRFS info (device sdc): bdev /dev/sdc errs: wr 0, rd 0, flush 0,
> corrupt 3, gen 1
> BTRFS info (device sdc): bdev /dev/sdd errs: wr 0, rd 0, flush 0,
> corrupt 6, gen 2

Generation mismatch means something more serious.

> 
> 
> BTRFS error (device sdc): scrub: tree block 858803990528 spanning
> stripes, ignored. logical=858803929088

The "spanning stripes" message, on the other hand, only means the scrub
code can't check that tree block because it crosses a stripe boundary.

It's normally nothing to worry about, and is normally caused by an old
kernel. Newer kernels avoid creating such tree blocks, but will just skip
existing ones.
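
A minimal sketch of the boundary test, assuming a 64KiB stripe unit and a
16KiB nodesize (the mkfs default; the report doesn't state it).
crosses_stripe() is a hypothetical helper, not the kernel function:

```c
#include <stdbool.h>
#include <stdint.h>

#define STRIPE_LEN 65536ULL  /* 64KiB, the stripe unit assumed here */

/* True if the byte range [start, start + len) touches more than one stripe unit. */
bool crosses_stripe(uint64_t start, uint64_t len)
{
	return start / STRIPE_LEN != (start + len - 1) / STRIPE_LEN;
}
```

For the first tree block in the log, 858803990528 sits 61440 bytes into the
stripe unit starting at 858803929088, so a 16KiB block runs past the next
boundary at 858803994624. Those are the two "logical" values printed for it.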

> BTRFS error (device sdc): scrub: tree block 858803990528 spanning
> stripes, ignored. logical=858803994624
> BTRFS warning (device sdc): checksum error at logical 858803961856 on
> dev /dev/sdd, physical 385263894528: metadata leaf (level 0) in tree 7
> BTRFS warning (device sdc): checksum error at logical 858803961856 on
> dev /dev/sdd, physical 385263894528: metadata leaf (level 0) in tree 7

This means some csum tree blocks are corrupted.

> BTRFS error (device sdc): bdev /dev/sdd errs: wr 0, rd 0, flush 0,
> corrupt 4, gen 1
> BTRFS error (device sdc): scrub: tree block 858820505600 spanning
> stripes, ignored. logical=858820444160
> BTRFS error (device sdc): scrub: tree block 858820505600 spanning
> stripes, ignored. logical=858820509696
> BTRFS error (device sdc): unable to fixup (regular) error at logical
> 858803961856 on dev /dev/sdd
> BTRFS error (device sdc): scrub: tree block 858821292032 spanning
> stripes, ignored. logical=858821230592
> BTRFS error (device sdc): scrub: tree block 858821292032 spanning
> stripes, ignored. logical=858821296128
> BTRFS warning (device sdc): checksum error at logical 858821263360 on
> dev /dev/sdd, physical 385281196032: metadata leaf (level 0) in tree 7
> BTRFS warning (device sdc): checksum error at logical 858821263360 on
> dev /dev/sdd, physical 385281196032: metadata leaf (level 0) in tree 7
> BTRFS error (device sdc): bdev /dev/sdd errs: wr 0, rd 0, flush 0,
> corrupt 5, gen 1
> BTRFS error (device sdc): unable to fixup (regular) error at logical
> 858821263360 on dev /dev/sdd
> BTRFS warning (device sdc): checksum/header error at logical
> 858820476928 on dev /dev/sdd, physical 385280409600: metadata leaf
> (level 0) in tree 7
> BTRFS warning (device sdc): checksum/header error at logical
> 858820476928 on dev /dev/sdd, physical 385280409600: metadata leaf
> (level 0) in tree 7
> BTRFS error (device sdc): bdev /dev/sdd errs: wr 0, rd 0, flush 0,
> corrupt 5, gen 2
> BTRFS warning (device sdc): checksum error at logical 858820489216 on
> dev /dev/sdd, physical 385280421888: metadata leaf (level 0) in tree 2
> BTRFS warning (device sdc): checksum error at logical 858820489216 on
> dev /dev/sdd, physical 385280421888: metadata leaf (level 0) in tree 2

This is an error in the extent tree, and I'd say it's a serious problem
that may affect later write operations.

> BTRFS error (device sdc): bdev /dev/sdd errs: wr 0, rd 0, flush 0,
> corrupt 6, gen 2
> BTRFS error (device sdc): unable to fixup (regular) error at logical
> 858820476928 on dev /dev/sdd
> BTRFS error (device sdc): unable to fixup (regular) error at logical
> 858820489216 on dev /dev/sdd0
> 
> 
> $ btrfs filesystem show /mnt/data/
> Label: none  uuid: 5e6506b0-bf15-4b2e-b5f4-322c44b89db6
>   Total devices 2 FS bytes used 10.17TiB
>   devid    1 size 5.46TiB used 5.43TiB path /dev/sdc
>   devid    2 size 5.46TiB used 5.43TiB path /dev/sdd
> 
> $ btrfs --version
> btrfs-progs v4.15.1
> 
> $ uname -a
> Linux gpur1 4.15.0-39-generic #42-Ubuntu SMP Tue Oct 23 15:48:01 UTC
> 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> $ btrfs dev stats /dev/sdc
> [/dev/sdc].write_io_errs    0
> [/dev/sdc].read_io_errs 0
> [/dev/sdc].flush_io_errs    0
> [/dev/sdc].corruption_errs  3
> [/dev/sdc].generation_errs  1
> 
> $ btrfs dev stats /dev/sdd
> [/dev/sdd].write_io_errs    0
> [/dev/sdd].read_io_errs 0
> [/dev/sdd].flush_io_errs    0
> [/dev/sdd].corruption_errs  3
> [/dev/sdd].generation_errs  1
> 
> $ btrfs fi show
> Label: 'system'  uuid: ae121e8e-d483-45f4-8568-2817f5c5d497
>     Total devices 1 FS bytes used 194.05GiB
>     devid    1 size 228.66GiB used 199.03GiB path /dev/sda3
> Label: none  uuid: 5e6506b0-bf15-4b2e-b5f4-322c44b89db6
>     Total devices 2 FS bytes used 10.17TiB
>     devid    1 size 5.46TiB used 5.43TiB path /dev/sdc
>     devid    2 size 5.46TiB used 5.43TiB path /dev/sdd
> 
> $ btrfs fi df /mnt/data/
> Data, RAID0: total=10.84TiB, used=10.15TiB
> System, RAID1: total=8.00MiB, used=896.00KiB
> Metadata,

[PATCH 0/5] btrfs-progs: check update

2018-11-25 Thread Qu Wenruo

Hi David,

Please merge this pull request:
https://github.com/kdave/btrfs-progs/pull/155


This patchset can be fetched from the following branch:
https://github.com/adam900710/btrfs-progs/tree/check-next

The base commit is:
commit 5d64c40240135cc22f4ba2b902bfe20418a599ea (david/devel)
Author: David Sterba 
Date:   Tue Nov 20 11:13:08 2018 +0100

btrfs-progs: docs: fix rendering of exponents in manual pages

Reported on IRC that the inode number limit appears to be 264, while the
actual value is 2^64. Fix that for the manual page backend by redefining
the format.

Signed-off-by: David Sterba 


This small patchset contains two check-related features:
1) Ability to repair a dir item with a mismatched hash
   For both lowmem and original check, along with a test case update.

2) Check whether the qgroup limit is exceeded
   The patch to deprecate BTRFS_QGROUP_LIMIT_RSV_RFER|EXCL is dropped
   for this pull request, as it's still uncertain if it's OK or not, and
   it's very easy to rebase that patch.

The patchset passes all selftests with one exception:
misc-tests/021-image-multi-devices

That test case failure is caused by a poorly recovered multi-device
btrfs-image.
The problem has existed from the very beginning; it's just that recently
enhanced kernel code now refuses to mount such an image. I'll address it
later in another patchset.

There are some other check-related patches, named "btrfs-progs: fixes of
file extent in original and lowmem check" from Fujitsu, which are still
being updated. I'll push them in the next update.

Thanks,
Qu

Qu Wenruo (5):
  btrfs-progs: lowmem check: Add ability to repair dir item with
mismatch hash
  btrfs-progs: original check: Use mismatch_dir_hash_record to record
bad dir items
  btrfs-progs: original check: Add ability to repair dir item with
invalid hash
  btrfs-progs: fsck-tests: Make 026-bad-dir-item-name test case to
verify if btrfs-check can also repair it
  btrfs-progs: qgroup-verify: Check if qgroup limit is exceeded

 check/main.c  | 121 +-
 check/mode-common.c   |  51 
 check/mode-common.h   |   5 +-
 check/mode-lowmem.c   |  46 ++-
 check/mode-lowmem.h   |   1 +
 check/mode-original.h |  14 ++
 ctree.h   |   3 +
 dir-item.c|   6 +-
 qgroup-verify.c   |  82 
 .../026-bad-dir-item-name/description.txt |  41 ++
 .../fsck-tests/026-bad-dir-item-name/test.sh  |  13 --
 11 files changed, 354 insertions(+), 29 deletions(-)
 create mode 100644 tests/fsck-tests/026-bad-dir-item-name/description.txt
 delete mode 100755 tests/fsck-tests/026-bad-dir-item-name/test.sh

-- 
2.19.2

Re: [[Missing subject]]

2018-11-23 Thread Qu Wenruo



On 2018/11/23 2:41 PM, Andy Leadbetter wrote:
> I have a failing 2TB disk that is part of a 4 disk RAID 6 system.  I
> have added a new 2TB disk to the computer, and started a BTRFS replace
> for the old and new disk.  The process starts correctly however some
> hours into the job, there is an error and kernel oops. relevant log
> below.
> 
> The disks are configured on top of bcache, in 5 arrays with a small
> 128GB SSD cache shared.  The system in this configuration has worked
> perfectly for 3 years, until 2 weeks ago csum errors started
> appearing.  I have a crashplan backup of all files on the disk, so I
> am not concerned about data loss, but I would like to avoid rebuild
> the system.
> 
> btrfs dev stats shows
> [/dev/bcache0].write_io_errs    0
> [/dev/bcache0].read_io_errs     0
> [/dev/bcache0].flush_io_errs    0
> [/dev/bcache0].corruption_errs  0
> [/dev/bcache0].generation_errs  0
> [/dev/bcache1].write_io_errs    0
> [/dev/bcache1].read_io_errs     20
> [/dev/bcache1].flush_io_errs    0
> [/dev/bcache1].corruption_errs  0
> [/dev/bcache1].generation_errs  14

Unfortunately, this is not a sign of a degrading disk; something
really went wrong and screwed up some metadata.

In such a case, it's recommended to run "btrfs check --readonly" to
see how serious the problem is.

It could be corruption in some subvolume or in some non-essential tree,
but either way the generation mismatch is a problem for which neither the
kernel nor btrfs-progs has a really good solution.

So please at least consider rebuilding the fs.

Apart from that, it would help to provide the versions of all the
kernels that have run on the fs, along with the mount options used.

We have had some similar reports of such generation mismatches, but we
still don't have a convincing cause for them.
Suspects range from old kernels to space cache corruption to power loss
combined with space cache corruption.

> [/dev/bcache3].write_io_errs    0
> [/dev/bcache3].read_io_errs     0
> [/dev/bcache3].flush_io_errs    0
> [/dev/bcache3].corruption_errs  0
> [/dev/bcache3].generation_errs  19
> [/dev/bcache2].write_io_errs    0
> [/dev/bcache2].read_io_errs     0
> [/dev/bcache2].flush_io_errs    0
> [/dev/bcache2].corruption_errs  0
> [/dev/bcache2].generation_errs  2
> 
> and a smart test of the backing disk /dev/bcache1 shows a high read
> error rate, and lot of reallocated sectors.  The disk is 10 years old,
> and has clearly started to fail.
> 
> I've tried the latest kernel, and the latest tools, but nothing will
> allow me to replace, or delete the failed disk.
> 
> [  884.171025] BTRFS info (device bcache0): dev_replace from
> /dev/bcache1 (devid 2) to /dev/bcache4 started
> [ 3301.101958] BTRFS error (device bcache0): parent transid verify
> failed on 8251260944384 wanted 640926 found 640907
> [ 3301.241214] BTRFS error (device bcache0): parent transid verify
> failed on 8251260944384 wanted 640926 found 640907
> [ 3301.241398] BTRFS error (device bcache0): parent transid verify
> failed on 8251260944384 wanted 640926 found 640907
> [ 3301.241513] BTRFS error (device bcache0): parent transid verify
> failed on 8251260944384 wanted 640926 found 640907

If btrfs check --readonly only reports this problem, it may be possible
for us to fix it.

Please also do a tree block dump on this block by:
# btrfs ins dump-tree -b 8251260944384 /dev/bcache0

If btrfs check --readonly reports a lot of problems, then it's strongly
recommended to rebuild the filesystem.

Thanks,
Qu

> [ 3302.381094] BTRFS error (device bcache0):
> btrfs_scrub_dev(/dev/bcache1, 2, /dev/bcache4) failed -5
> [ 3302.394612] WARNING: CPU: 0 PID: 5936 at
> /build/linux-5s7Xkn/linux-4.15.0/fs/btrfs/dev-replace.c:413
> btrfs_dev_replace_start+0x281/0x320 [btrfs]
> [ 3302.394613] Modules linked in: btrfs zstd_compress xor raid6_pq
> bcache intel_rapl x86_pkg_temp_thermal intel_powerclamp
> snd_hda_codec_hdmi coretemp kvm_intel snd_hda_codec_realtek kvm
> snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hwdep
> snd_pcm pcbc snd_seq_midi aesni_intel snd_seq_midi_event joydev
> input_leds aes_x86_64 snd_rawmidi crypto_simd glue_helper snd_seq
> eeepc_wmi cryptd asus_wmi snd_seq_device snd_timer wmi_bmof
> sparse_keymap snd intel_cstate intel_rapl_perf soundcore mei_me mei
> shpchp mac_hid sch_fq_codel acpi_pad parport_pc ppdev lp parport
> ip_tables x_tables autofs4 overlay nls_iso8859_1 dm_mirror
> dm_region_hash dm_log hid_generic usbhid hid uas usb_storage i915
> i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
> [ 3302.394640]  sysimgblt fb_sys_fops r8169 mxm_wmi mii drm ahci
> libahci wmi video
> [ 3302.394646] CPU: 0 PID: 5936 Comm: btrfs Not tainted
> 4.15.0-20-generic #21-Ubuntu
> [ 3302.394646] Hardware name: System manufacturer System Product
> Name/H110M-R, BIOS 3404 10/10/2017
> [ 3302.394658] RIP: 0010:btrfs_dev_replace_start+0x281/0x320 [btrfs]
> [ 3302.394659] RSP: 0018:a8b582b5fd18 EFLAGS: 00010282
> [ 3302.394660] RAX: fffb RBX:

[PATCH] btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable

2018-11-22 Thread Qu Wenruo

[BUG]
A completely valid btrfs will refuse to mount, with error message like:
  BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
bg_start=12018974720 bg_len=10888413184, invalid block group size, \
have 10888413184 expect (0, 10737418240]

Btrfs check returns no error, and all kernels used on this fs are later
than 2011, so they should all have the 10G size limit commit.

[CAUSE]
For a btrfs with 12 devices, we could allocate a chunk larger than 10G due
to the stripe size being rounded up.

__btrfs_alloc_chunk()
|- max_stripe_size = 1G
|- max_chunk_size = 10G
|- data_stripe = 11
|- if (1G * 11 > 10G) {
       stripe_size = 10G / 11 = 976128930;
       stripe_size = round_up(976128930, SZ_16M) = 989855744;
   }

However the final stripe_size (989855744) * 11 = 10888413184, which is
still larger than 10G.
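
As a quick sanity check of the numbers above, here is a minimal user-space
sketch that replays the same arithmetic. It is not kernel code:
round_up_u64() and chunk_len_after_bump() are made-up names, and the sketch
ignores the later clamp against the available device space.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_16M (16ULL * 1024 * 1024)
#define SZ_1G  (1024ULL * 1024 * 1024)

/* Round x up to the next multiple of align (align must be non-zero). */
static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	return (x + align - 1) / align * align;
}

/*
 * Hypothetical helper: mirrors the clamp and round-up steps from the
 * call trace above and returns the resulting chunk length in bytes.
 */
static uint64_t chunk_len_after_bump(uint64_t max_stripe_size,
				     uint64_t max_chunk_size,
				     unsigned int data_stripes)
{
	uint64_t stripe_size = max_stripe_size;

	assert(data_stripes > 0);
	if (stripe_size * data_stripes > max_chunk_size) {
		/* Shrink the stripe so the chunk fits the limit ... */
		stripe_size = max_chunk_size / data_stripes;
		/* ... then the 16M round-up pushes it over the limit again. */
		stripe_size = round_up_u64(stripe_size, SZ_16M);
	}
	return stripe_size * data_stripes;
}
```

With max_stripe_size = 1G, max_chunk_size = 10G and 11 data stripes this
yields 10888413184 bytes, the same bg_len as in the error message.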

[FIX]
For the comprehensive check, we need to do the full check at chunk
read time, and rely on bg <-> chunk mapping to do the check.

We could just skip the length check for now.

Fixes: fce466eab7ac ("btrfs: tree-checker: Verify block_group_item")
Cc: sta...@vger.kernel.org # v4.19+
Reported-by: Wang Yugui 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/tree-checker.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index cab0b1f1f741..d8bd5340fbbc 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -389,13 +389,11 @@ static int check_block_group_item(struct btrfs_fs_info 
*fs_info,
 
/*
 * Here we don't really care about alignment since extent allocator can
-* handle it.  We care more about the size, as if one block group is
-* larger than maximum size, it's must be some obvious corruption.
+* handle it.  We care more about the size.
 */
-   if (key->offset > BTRFS_MAX_DATA_CHUNK_SIZE || key->offset == 0) {
+   if (key->offset == 0) {
block_group_err(fs_info, leaf, slot,
-   "invalid block group size, have %llu expect (0, %llu]",
-   key->offset, BTRFS_MAX_DATA_CHUNK_SIZE);
+   "invalid block group size 0");
return -EUCLEAN;
}
 
-- 
2.19.1

Re: btrfs-cleaner 100% busy on an idle filesystem with 4.19.3

2018-11-22 Thread Qu Wenruo




On 2018/11/22 10:03 PM, Roman Mamedov wrote:
> On Thu, 22 Nov 2018 22:07:25 +0900
> Tomasz Chmielewski  wrote:
> 
>> Spot on!
>>
>> Removed "discard" from fstab and added "ssd", rebooted - no more 
>> btrfs-cleaner running.
> 
> Recently there has been a bugfix for TRIM in Btrfs:
>   
>   btrfs: Ensure btrfs_trim_fs can trim the whole fs
>   https://patchwork.kernel.org/patch/10579539/
> 
> Perhaps your upgraded kernel is the first one to contain it, and for the first
> time you're seeing TRIM to actually *work*, with the actual performance impact
> of it on a large fragmented FS, instead of a few contiguous unallocated areas.
> 
That fix only affects btrfs_trim_fs(), and you can see it's only called
from the ioctl interface, so it's definitely not the cause here.

Thanks,
Qu

Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-22 Thread Qu Wenruo



On 2018/11/22 9:12 PM, David Sterba wrote:
> On Tue, Nov 20, 2018 at 08:32:42AM +0800, Qu Wenruo wrote:
>>>>> @@ -1013,16 +1013,22 @@ int btrfs_quota_enable(struct btrfs_fs_info 
>>>>> *fs_info)
>>>>>   btrfs_abort_transaction(trans, ret);
>>>>>   goto out_free_path;
>>>>>   }
>>>>> - spin_lock(&fs_info->qgroup_lock);
>>>>> - fs_info->quota_root = quota_root;
>>>>> - set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
>>>>> - spin_unlock(&fs_info->qgroup_lock);
>>>>>
>>>>>   ret = btrfs_commit_transaction(trans);
>>>>>   trans = NULL;
>>>>>   if (ret)
>>>>>   goto out_free_path;
>>>>
>>>> The main concern here is, if we don't set qgroup enabled bit before we
>>>> commit transaction, there will be a window where new tree modification
>>>> could happen before QGROUP_ENABLED bit set.
>>>
>>> That doesn't seem to make much sense to me, if I understood correctly.
>>> Essentially you're saying stuff done to any tree in the transaction
>>> we use to enable quotas must be accounted for. In that case the quota
>>> enabled bit should be set as soon as the transaction is started,
>>> because before we set it and after we started (or joined) a
>>> transaction, a lot of modifications may have happened.
>>> Nevertheless I don't think it's unexpected for anyone to have the
>>> accounting happening only after the quota enable ioctl returns success.
>>
>> The problem is not accounting; the qgroup numbers won't cause problems.
>>
>> It's the reserved space. For example, some pages are dirtied before
>> quota is enabled, but by run_dealloc() time quota is enabled.
>>
>> For such a case we have the io_tree based method to avoid underflow, so
>> it should be OK.
>>
>> So v2 patch looks OK.
> 
> Does that mean reviewed-by? In case there's an evolved discussion under a
> patch, a clear yes/no is appreciated and an explicit Reviewed-by even
> more. I'm about to add this patch to the rc4 pull, there's still some time
> to add the tag. Thanks.
> 

I'd like to add a reviewed-by tag, but I'm still not 100% sure whether this
will cause extra qgroup reserved space related problems.

At least from my side, I can't directly see a case where it would cause a
problem.

Does such a case warrant a reviewed-by tag? Or something like
LGTM-but-uncertain?

Thanks,
Qu




Re: [PATCH 2/6] btrfs: add cleanup_ref_head_accounting helper

2018-11-21 Thread Qu Wenruo



On 2018/11/22 2:59 AM, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
> 
> Reviewed-by: Omar Sandoval 
> Reviewed-by: Liu Bo 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 67 
> +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c36b3a42f2bb..e3ed3507018d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
> *trans,
>   return ret ? ret : 1;
>  }
>  
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> + struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_delayed_ref_root *delayed_refs =
> + &trans->transaction->delayed_refs;
> +
> + if (head->total_ref_mod < 0) {
> + struct btrfs_space_info *space_info;
> + u64 flags;
> +
> + if (head->is_data)
> + flags = BTRFS_BLOCK_GROUP_DATA;
> + else if (head->is_system)
> + flags = BTRFS_BLOCK_GROUP_SYSTEM;
> + else
> + flags = BTRFS_BLOCK_GROUP_METADATA;
> + space_info = __find_space_info(fs_info, flags);
> + ASSERT(space_info);
> + percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +-head->num_bytes,
> +BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> + if (head->is_data) {
> + spin_lock(&delayed_refs->lock);
> + delayed_refs->pending_csums -= head->num_bytes;
> + spin_unlock(&delayed_refs->lock);
> + }
> + }
> +
> + /* Also free its reserved qgroup space */
> + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> +   head->qgroup_reserved);

This part will be removed in patch "[PATCH] btrfs: qgroup: Move reserved
data account from btrfs_delayed_ref_head to btrfs_qgroup_extent_record".

So there is one less thing to worry about in delayed ref head.

Thanks,
Qu

> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>   struct btrfs_delayed_ref_head *head)
>  {
> @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
>  
> - trace_run_delayed_ref_head(fs_info, head, 0);
> -
> - if (head->total_ref_mod < 0) {
> - struct btrfs_space_info *space_info;
> - u64 flags;
> -
> - if (head->is_data)
> - flags = BTRFS_BLOCK_GROUP_DATA;
> - else if (head->is_system)
> - flags = BTRFS_BLOCK_GROUP_SYSTEM;
> - else
> - flags = BTRFS_BLOCK_GROUP_METADATA;
> - space_info = __find_space_info(fs_info, flags);
> - ASSERT(space_info);
> - percpu_counter_add_batch(&space_info->total_bytes_pinned,
> --head->num_bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> - if (head->is_data) {
> - spin_lock(&delayed_refs->lock);
> - delayed_refs->pending_csums -= head->num_bytes;
> - spin_unlock(&delayed_refs->lock);
> - }
> - }
> -
>   if (head->must_insert_reserved) {
>   btrfs_pin_extent(fs_info, head->bytenr,
>head->num_bytes, 1);
> @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   }
>   }
>  
> - /* Also free its reserved qgroup space */
> - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> -   head->qgroup_reserved);
> + cleanup_ref_head_accounting(trans, head);
> +
> + trace_run_delayed_ref_head(fs_info, head, 0);
>   btrfs_delayed_ref_unlock(head);
>   btrfs_put_delayed_ref_head(head);
>   return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (head->must_insert_reserved)
>   ret = 1;
>  
> + cleanup_ref_head_accounting(trans, head);
>   mutex_unlock(&head->mutex);
>   btrfs_put_delayed_ref_head(head);
>   return ret;
> 




Re: [PATCH] btrfs: qgroup: Skip delayed data ref for reloc trees

2018-11-20 Thread Qu Wenruo

On 2018/11/20 5:11 PM, Nikolay Borisov wrote:
> 
> 
> On 20.11.18 11:07, Qu Wenruo wrote:
>>
>>
>> On 2018/11/20 4:51 PM, Nikolay Borisov wrote:
> 
>>> I'm beginning to wonder, should we document
>>> btrfs_add_delayed_data_ref/btrfs_add_tree_ref arguments separate for
>>> each function, or should only the differences be documented - in this
>>> case the newly added root parameter. The rest of the arguments are being
>>> documented at init_delayed_ref_common.
>>
>> You won't be happy with my later plan, it will add new parameter for
>> btrfs_add_delayed_tree_ref(), and it may not be @root, but some bool.
> 
> You are right, but I'm starting to think that the interface of adding
> those references is wrong because we shouldn't really need adding more
> and more arguments. All of this feels like piling hack on top of hack
> for some legacy infrastructure which no one bothers fixing from a
> high-level.

Couldn't agree more!

But the problem is, such a big change to the low-level delayed-ref
infrastructure could easily cause new bugs, so there isn't much
driving force to change it.

I'm considering changing the ever-growing parameter list into a
structure as the first cleanup step.
(By the nature of structures and unions, some parameters can easily be
merged into a union, which makes the parameter structure easier to read.)

Feel free to share if you have any better suggestion.
(I also hate the current btrfs_inc_extent_ref() and btrfs_inc_ref()
interfaces, but I don't have enough confidence to change them right now.)

Thanks,
Qu

> 
> 
>>
>> So I think we may need to document at least the difference.
>>
>> Thanks,
>> Qu
>>
>>>
> 
>

Re: [PATCH] btrfs: qgroup: Skip delayed data ref for reloc trees

2018-11-20 Thread Qu Wenruo




On 2018/11/20 4:51 PM, Nikolay Borisov wrote:
> 
> 
> On 20.11.18 10:46, Qu Wenruo wrote:
>> Currently we only check @ref_root in btrfs_add_delayed_data_ref() to
>> determine whether a data delayed ref is for a reloc tree.
>>
>> Such a check is insufficient, as for relocation we could pass @ref_root
>> as the source file tree, causing qgroup to trace unchanged data extents
>> even when we're only relocating metadata chunks.
>>
>> We could insert a qgroup extent record for the following call trace even
>> when we're only relocating a metadata block group:
>>
>> do_relocation()
>> |- btrfs_cow_block(root=reloc_root)
>>    |- update_ref_for_cow(root=reloc_root)
>>       |- __btrfs_mod_ref(root=reloc_root)
>>          |- ref_root = btrfs_header_owner()
>>          |- btrfs_add_delayed_data_ref(ref_root=source_root)
>>
>> And another case when dropping reloc tree:
>>
>> clean_dirty_root()
>> |- btrfs_drop_snapshot(root=reloc_root)
>>    |- walk_up_tree(root=reloc_root)
>>       |- walk_up_proc(root=reloc_root)
>>          |- btrfs_dec_ref(root=reloc_root)
>>             |- __btrfs_mod_ref(root=reloc_root)
>>                |- ref_root = btrfs_header_owner()
>>                |- btrfs_add_delayed_data_ref(ref_root=source_root)
>>
>> This patch will introduce @root parameter for
>> btrfs_add_delayed_data_ref(), so that we could determine if this data
>> extent belongs to reloc tree or not.
>>
>> This could skip data extents which aren't really modified during
>> relocation.
>>
>> For the same real-world metadata balance test (4G data, 16 snapshots,
>> 4K nodesize):
>>
>>                      | v4.20-rc1 + delayed* | w/ patch | diff
>> ---------------------------------------------------------------
>> relocated extents    | 22773                | 22656    | -0.1%
>> qgroup dirty extents | 122879               | 74316    | -39.5%
>> time (real)          | 23.073s              | 14.971s  | -35.1%
>>
>> *: v4.20-rc1 + delayed subtree scan patchset
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/delayed-ref.c | 3 ++-
>>  fs/btrfs/delayed-ref.h | 1 +
>>  fs/btrfs/extent-tree.c | 6 +++---
>>  3 files changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index 9301b3ad9217..269bd6ecb8f3 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -798,6 +798,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
>> *trans,
>>   * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
>>   */
>>  int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>> +   struct btrfs_root *root,
> 
> I'm beginning to wonder, should we document
> btrfs_add_delayed_data_ref/btrfs_add_tree_ref arguments separate for
> each function, or should only the differences be documented - in this
> case the newly added root parameter. The rest of the arguments are being
> documented at init_delayed_ref_common.

You won't be happy with my later plan: it will add a new parameter to
btrfs_add_delayed_tree_ref(), and it may not be @root, but some bool.

So I think we may need to document at least the difference.

Thanks,
Qu

> 
>> u64 bytenr, u64 num_bytes,
>> u64 parent, u64 ref_root,
>> u64 owner, u64 offset, u64 reserved, int action,
>> @@ -835,7 +836,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle 
>> *trans,
>>  }
>>  
>>  if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
>> -is_fstree(ref_root)) {
>> +is_fstree(ref_root) && is_fstree(root->root_key.objectid)) {
>>  record = kmalloc(sizeof(*record), GFP_NOFS);
>>  if (!record) {
>>  kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
>> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
>> index 8e20c5cb5404..6c60737e55d6 100644
>> --- a/fs/btrfs/delayed-ref.h
>> +++ b/fs/btrfs/delayed-ref.h
>> @@ -240,6 +240,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
>> *trans,
>> struct btrfs_delayed_extent_op *extent_op,
>> int *old_ref_mod, int *new_ref_mod);
>>  int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>> +   struct btrfs_root *root,
>> u64 bytenr, u64 num_bytes,
>>

[PATCH] btrfs: qgroup: Skip delayed data ref for reloc trees

2018-11-20 Thread Qu Wenruo

Currently we only check @ref_root in btrfs_add_delayed_data_ref() to
determine whether a data delayed ref is for a reloc tree.

Such a check is insufficient, as for relocation we could pass @ref_root
as the source file tree, causing qgroup to trace unchanged data extents
even when we're only relocating metadata chunks.

We could insert a qgroup extent record for the following call trace even
when we're only relocating a metadata block group:

do_relocation()
|- btrfs_cow_block(root=reloc_root)
   |- update_ref_for_cow(root=reloc_root)
      |- __btrfs_mod_ref(root=reloc_root)
         |- ref_root = btrfs_header_owner()
         |- btrfs_add_delayed_data_ref(ref_root=source_root)

And another case when dropping reloc tree:

clean_dirty_root()
|- btrfs_drop_snapshot(root=reloc_root)
   |- walk_up_tree(root=reloc_root)
      |- walk_up_proc(root=reloc_root)
         |- btrfs_dec_ref(root=reloc_root)
            |- __btrfs_mod_ref(root=reloc_root)
               |- ref_root = btrfs_header_owner()
               |- btrfs_add_delayed_data_ref(ref_root=source_root)

This patch introduces a @root parameter for
btrfs_add_delayed_data_ref(), so that we can determine whether this data
extent belongs to a reloc tree or not.

This could skip data extents which aren't really modified during
relocation.
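
To illustrate why checking @root helps, below is a simplified user-space
mirror of the condition. It is a sketch under assumptions, not the kernel
code: the constants are written from memory of the btrfs on-disk format
definitions, and should_record_qgroup_extent() is a made-up wrapper that
leaves out the quota-enabled bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Object ids, assumed to match the btrfs on-disk format definitions. */
#define FS_TREE_OBJECTID	5ULL
#define FIRST_FREE_OBJECTID	256ULL
#define TREE_RELOC_OBJECTID	((uint64_t)-8)
#define QGROUP_LEVEL_SHIFT	48

/* True for the default fs tree and for subvolume trees (level 0 ids). */
static bool is_fstree(uint64_t rootid)
{
	return rootid == FS_TREE_OBJECTID ||
	       ((int64_t)rootid >= (int64_t)FIRST_FREE_OBJECTID &&
		(rootid >> QGROUP_LEVEL_SHIFT) == 0);
}

/*
 * Hypothetical wrapper for the new check: record the extent only if both
 * the owner (@ref_root) and the tree being modified (@root) are fs trees.
 */
static bool should_record_qgroup_extent(uint64_t ref_root,
					uint64_t root_objectid)
{
	return is_fstree(ref_root) && is_fstree(root_objectid);
}
```

With ref_root = 257 (a subvolume) the old check alone passes, but once the
modified tree is the reloc tree (objectid -8) the combined check fails and
no qgroup extent record is inserted.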

For the same real-world metadata balance test (4G data, 16 snapshots,
4K nodesize):

                     | v4.20-rc1 + delayed* | w/ patch | diff
---------------------------------------------------------------
relocated extents    | 22773                | 22656    | -0.1%
qgroup dirty extents | 122879               | 74316    | -39.5%
time (real)          | 23.073s              | 14.971s  | -35.1%

*: v4.20-rc1 + delayed subtree scan patchset

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 3 ++-
 fs/btrfs/delayed-ref.h | 1 +
 fs/btrfs/extent-tree.c | 6 +++---
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 9301b3ad9217..269bd6ecb8f3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -798,6 +798,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
   u64 bytenr, u64 num_bytes,
   u64 parent, u64 ref_root,
   u64 owner, u64 offset, u64 reserved, int action,
@@ -835,7 +836,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle 
*trans,
}
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-   is_fstree(ref_root)) {
+   is_fstree(ref_root) && is_fstree(root->root_key.objectid)) {
record = kmalloc(sizeof(*record), GFP_NOFS);
if (!record) {
kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8e20c5cb5404..6c60737e55d6 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -240,6 +240,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod);
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
   u64 bytenr, u64 num_bytes,
   u64 parent, u64 ref_root,
   u64 owner, u64 offset, u64 reserved, int action,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a1febf155747..0554d2cc2ea1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2046,7 +2046,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 BTRFS_ADD_DELAYED_REF, NULL,
 &old_ref_mod, &new_ref_mod);
} else {
-   ret = btrfs_add_delayed_data_ref(trans, bytenr,
+   ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
 num_bytes, parent,
 root_objectid, owner, offset,
 0, BTRFS_ADD_DELAYED_REF,
@@ -7104,7 +7104,7 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
 BTRFS_DROP_DELAYED_REF, NULL,
 &old_ref_mod, &new_ref_mod);
} else {
-   ret = btrfs_add_delayed_data_ref(trans, bytenr,
+   ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
 num_bytes, parent,

Re: [PATCH] Btrfs: fix race between enabling quotas and subvolume creation

2018-11-19 Thread Qu Wenruo



On 2018/11/20 12:20 AM, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> We have a race between enabling quotas and subvolume creation that causes
> subvolume creation to fail with -EINVAL, and the following diagram shows
> how it happens:
> 
>            CPU 0                                       CPU 1
>
>  btrfs_ioctl()
>   btrfs_ioctl_quota_ctl()
>    btrfs_quota_enable()
>     mutex_lock(fs_info->qgroup_ioctl_lock)
>
>                                            btrfs_ioctl()
>                                             create_subvol()
>                                              btrfs_qgroup_inherit()
>                                               -> save fs_info->quota_root
>                                                  into quota_root
>                                               -> stores a NULL value
>                                               -> tries to lock the mutex
>                                                  qgroup_ioctl_lock
>                                                  -> blocks waiting for
>                                                     the task at CPU0
>
>    -> sets BTRFS_FS_QUOTA_ENABLED in fs_info
>    -> sets quota_root in fs_info->quota_root
>       (non-NULL value)
>
>    mutex_unlock(fs_info->qgroup_ioctl_lock)
>
>                                               -> checks quota enabled
>                                                  flag is set
>                                               -> returns -EINVAL because
>                                                  fs_info->quota_root was
>                                                  NULL before it acquired
>                                                  the mutex
>                                                  qgroup_ioctl_lock
>                                             -> ioctl returns -EINVAL
> 
> Returning -EINVAL to user space will be confusing if all the arguments
> passed to the subvolume creation ioctl were valid.
> 
> Fix it by grabbing the value from fs_info->quota_root after acquiring
> the mutex.
> 
> Signed-off-by: Filipe Manana 

Reviewed-by: Qu Wenruo 
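
For reference, the ordering in the diagram can be replayed deterministically
in user space. This is only a sketch under assumptions: struct fs_state,
finish_quota_enable() and inherit_replay() are made-up names, and the call
to finish_quota_enable() stands in for "CPU 0 finishes while CPU 1 is
blocked on the mutex"; no real locking is involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the two fs_info members involved in the race. */
struct fs_state {
	bool quota_enabled;
	const void *quota_root;
};

/* What CPU 0 does right before it releases the mutex. */
static void finish_quota_enable(struct fs_state *fs, const void *root)
{
	fs->quota_root = root;
	fs->quota_enabled = true;
}

/*
 * Replay of the inherit path on CPU 1. With read_after_lock == false the
 * root pointer is sampled before waiting for the mutex (old behaviour);
 * with true it is sampled afterwards (the behaviour after the fix).
 */
static int inherit_replay(struct fs_state *fs, const void *new_root,
			  bool read_after_lock)
{
	const void *quota_root = NULL;

	assert(fs != NULL);
	if (!read_after_lock)
		quota_root = fs->quota_root;	/* sampled too early */

	/* CPU 1 blocks on the mutex here while CPU 0 finishes enabling. */
	finish_quota_enable(fs, new_root);

	if (!fs->quota_enabled)
		return 0;
	if (read_after_lock)
		quota_root = fs->quota_root;	/* sampled under the mutex */
	if (!quota_root)
		return -EINVAL;
	return 0;
}
```

Sampling before the wait returns -EINVAL even though quotas end up enabled
with a valid root; sampling afterwards returns 0.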

Thanks,
Qu

> ---
>  fs/btrfs/qgroup.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index ae1358253b7b..0bdf28499790 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2250,7 +2250,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
> *trans, u64 srcid,
>   int i;
>   u64 *i_qgroups;
>   struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_root *quota_root = fs_info->quota_root;
> + struct btrfs_root *quota_root;
>   struct btrfs_qgroup *srcgroup;
>   struct btrfs_qgroup *dstgroup;
>   u32 level_size = 0;
> @@ -2260,6 +2260,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>   if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
>   goto out;
>  
> + quota_root = fs_info->quota_root;
>   if (!quota_root) {
>   ret = -EINVAL;
>   goto out;
> 
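The ordering problem fixed by this patch can be shown with a small userspace model. This is only a sketch, not the kernel code: struct model_fs and the function names are made up for illustration, and the concurrent quota enable is simulated by a direct call at the point where the real task would be blocked on qgroup_ioctl_lock.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two btrfs_fs_info fields involved. */
struct model_fs {
	bool quota_enabled;	/* models BTRFS_FS_QUOTA_ENABLED */
	void *quota_root;	/* models fs_info->quota_root */
};

static int dummy_root;

/* Models CPU 0 finishing btrfs_quota_enable() while CPU 1 is blocked. */
static void enable_quota(struct model_fs *fs)
{
	fs->quota_enabled = true;
	fs->quota_root = &dummy_root;
}

/* Old order: sample quota_root first, then wait for the mutex. */
static int inherit_read_before_lock(struct model_fs *fs)
{
	void *quota_root = fs->quota_root;	/* NULL, quotas not enabled yet */

	enable_quota(fs);	/* happens while we "block" on the mutex */

	if (!fs->quota_enabled)
		return 0;
	if (!quota_root)
		return -EINVAL;	/* stale NULL although quotas are enabled */
	return 0;
}

/* Fixed order: wait for the mutex first, then read quota_root. */
static int inherit_read_after_lock(struct model_fs *fs)
{
	void *quota_root;

	enable_quota(fs);	/* happens while we "block" on the mutex */

	if (!fs->quota_enabled)
		return 0;
	quota_root = fs->quota_root;
	if (!quota_root)
		return -EINVAL;
	return 0;
}
```

Starting from a filesystem with quotas disabled, the first function returns -EINVAL and the second returns 0, which matches the behaviour change described in the changelog.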




Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Qu Wenruo



On 2018/11/19 11:24 PM, Filipe Manana wrote:
> On Mon, Nov 19, 2018 at 2:48 PM Qu Wenruo wrote:
>>
>>
>>
>> On 2018/11/19 10:15 PM, fdman...@kernel.org wrote:
>>> From: Filipe Manana 
>>>
>>> If the quota enable and snapshot creation ioctls are called concurrently
>>> we can get into a deadlock where the task enabling quotas will deadlock
>>> on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
>>> twice, or the task creating a snapshot tries to commit the transaction
>>> while the task enabling quota waits for the former task to commit the
>>> transaction while holding the mutex. The following time diagrams show how
>>> both cases happen.
>>>
>>> First scenario:
>>>
>>>           CPU 0                                    CPU 1
>>>
>>> btrfs_ioctl()
>>>  btrfs_ioctl_quota_ctl()
>>>   btrfs_quota_enable()
>>>    mutex_lock(fs_info->qgroup_ioctl_lock)
>>>    btrfs_start_transaction()
>>>
>>>                                          btrfs_ioctl()
>>>                                           btrfs_ioctl_snap_create_v2
>>>                                            create_snapshot()
>>>                                             --> adds snapshot to the
>>>                                                 list pending_snapshots
>>>                                                 of the current
>>>                                                 transaction
>>>
>>>    btrfs_commit_transaction()
>>>     create_pending_snapshots()
>>>       create_pending_snapshot()
>>>        qgroup_account_snapshot()
>>>         btrfs_qgroup_inherit()
>>>          mutex_lock(fs_info->qgroup_ioctl_lock)
>>>           --> deadlock, mutex already locked
>>>               by this task at
>>>               btrfs_quota_enable()
>>>
>>> Second scenario:
>>>
>>>           CPU 0                                    CPU 1
>>>
>>> btrfs_ioctl()
>>>  btrfs_ioctl_quota_ctl()
>>>   btrfs_quota_enable()
>>>    mutex_lock(fs_info->qgroup_ioctl_lock)
>>>    btrfs_start_transaction()
>>>
>>>                                          btrfs_ioctl()
>>>                                           btrfs_ioctl_snap_create_v2
>>>                                            create_snapshot()
>>>                                             --> adds snapshot to the
>>>                                                 list pending_snapshots
>>>                                                 of the current
>>>                                                 transaction
>>>
>>>                                          btrfs_commit_transaction()
>>>                                           --> waits for task at
>>>                                               CPU 0 to release
>>>                                               its transaction
>>>                                               handle
>>>
>>>    btrfs_commit_transaction()
>>>     --> sees another task started
>>>         the transaction commit first
>>>     --> releases its transaction
>>>         handle
>>>     --> waits for the transaction
>>>         commit to be completed by
>>>         the task at CPU 1
>>>
>>>                                           create_pending_snapshot()
>>>                                            qgroup_account_snapshot()
>>>                                             btrfs_qgroup_inherit()
>>>                                              mutex_lock(fs_info->qgroup_ioctl_lock)
>>>                                               --> deadlock, task at CPU 0
>>>                                                   has the mutex locked but
>>>                                                   it is waiting for us to
>>>                                                   finish the transaction
>>>                                                   commit
>>>
>>> So fix this by setting the quota enabled flag in fs_info after committing
>>> the transaction at btrfs_quota_enable(). Thi

Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Qu Wenruo




On 2018/11/19 11:36 PM, Nikolay Borisov wrote:
> 
> 
> On 19.11.18 16:48, Qu Wenruo wrote:
>> There may be some problem related to qgroup reserved space in such a case,
>> but I can't foresee one with 100% certainty.
> 
> But why is this a problem - we always queue quota rescan following QUOTA
> enable, that should take care of proper accounting, no?

For reserved data/metadata space, not qgroup numbers.

But it turns out we already handle this case for data.

So I'm overreacting to this problem.

> 
>>
>>
>> The best way to do this is to commit the transaction first and then set
>> the bit before anyone else tries to start a transaction.
>> However, I can't find such infrastructure now (I remember we used to have
>> a pending bit for changing the quota enabled flag, but it was removed later).
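The effect of the ordering (set the enabled bit before or after the commit) can be modelled in userspace. This is a rough sketch covering only the first deadlock scenario of the patch, where one task locks the same mutex twice. All names are hypothetical, the mutex is a toy that reports -EDEADLK instead of hanging, and the model assumes the commit path skips qgroup inheritance while the enabled flag is clear.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy non-recursive mutex that reports self-deadlock instead of hanging. */
struct model_mutex {
	bool locked;
	int owner;
};

struct model_fs {
	struct model_mutex ioctl_lock;	/* models qgroup_ioctl_lock */
	bool quota_enabled;		/* models BTRFS_FS_QUOTA_ENABLED */
	bool snapshot_pending;		/* a snapshot queued in this transaction */
};

static int model_lock(struct model_mutex *m, int task)
{
	if (m->locked)
		return m->owner == task ? -EDEADLK : -EBUSY;
	m->locked = true;
	m->owner = task;
	return 0;
}

static void model_unlock(struct model_mutex *m)
{
	m->locked = false;
}

/* Models the commit path: pending snapshots inherit qgroups if enabled. */
static int model_commit(struct model_fs *fs, int task)
{
	int ret;

	/* Assumption: inheritance is skipped while quotas look disabled. */
	if (!fs->snapshot_pending || !fs->quota_enabled)
		return 0;
	ret = model_lock(&fs->ioctl_lock, task);
	if (ret)
		return ret;
	model_unlock(&fs->ioctl_lock);
	return 0;
}

/* Models quota enable; set_flag_first selects the old ordering. */
static int model_quota_enable(struct model_fs *fs, int task, bool set_flag_first)
{
	int ret = model_lock(&fs->ioctl_lock, task);

	if (ret)
		return ret;
	if (set_flag_first)
		fs->quota_enabled = true;
	ret = model_commit(fs, task);
	if (!ret)
		fs->quota_enabled = true;	/* new ordering: set after commit */
	model_unlock(&fs->ioctl_lock);
	return ret;
}
```

With a snapshot pending, the old ordering makes the enabling task try to take its own mutex again during the commit, while the new ordering lets the commit finish before the flag becomes visible.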

Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Qu Wenruo



On 2018/11/19 10:15 PM, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> If the quota enable and snapshot creation ioctls are called concurrently
> we can get into a deadlock where the task enabling quotas will deadlock
> on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
> twice, or the task creating a snapshot tries to commit the transaction
> while the task enabling quota waits for the former task to commit the
> transaction while holding the mutex. The following time diagrams show how
> both cases happen.
> 
> First scenario:
> 
>           CPU 0                                    CPU 1
> 
> btrfs_ioctl()
>  btrfs_ioctl_quota_ctl()
>   btrfs_quota_enable()
>    mutex_lock(fs_info->qgroup_ioctl_lock)
>    btrfs_start_transaction()
> 
>                                          btrfs_ioctl()
>                                           btrfs_ioctl_snap_create_v2
>                                            create_snapshot()
>                                             --> adds snapshot to the
>                                                 list pending_snapshots
>                                                 of the current
>                                                 transaction
> 
>    btrfs_commit_transaction()
>     create_pending_snapshots()
>       create_pending_snapshot()
>        qgroup_account_snapshot()
>         btrfs_qgroup_inherit()
>          mutex_lock(fs_info->qgroup_ioctl_lock)
>           --> deadlock, mutex already locked
>               by this task at
>               btrfs_quota_enable()
> 
> Second scenario:
> 
>           CPU 0                                    CPU 1
> 
> btrfs_ioctl()
>  btrfs_ioctl_quota_ctl()
>   btrfs_quota_enable()
>    mutex_lock(fs_info->qgroup_ioctl_lock)
>    btrfs_start_transaction()
> 
>                                          btrfs_ioctl()
>                                           btrfs_ioctl_snap_create_v2
>                                            create_snapshot()
>                                             --> adds snapshot to the
>                                                 list pending_snapshots
>                                                 of the current
>                                                 transaction
> 
>                                          btrfs_commit_transaction()
>                                           --> waits for task at
>                                               CPU 0 to release
>                                               its transaction
>                                               handle
> 
>    btrfs_commit_transaction()
>     --> sees another task started
>         the transaction commit first
>     --> releases its transaction
>         handle
>     --> waits for the transaction
>         commit to be completed by
>         the task at CPU 1
> 
>                                           create_pending_snapshot()
>                                            qgroup_account_snapshot()
>                                             btrfs_qgroup_inherit()
>                                              mutex_lock(fs_info->qgroup_ioctl_lock)
>                                               --> deadlock, task at CPU 0
>                                                   has the mutex locked but
>                                                   it is waiting for us to
>                                                   finish the transaction
>                                                   commit
> 
> So fix this by setting the quota enabled flag in fs_info after committing
> the transaction at btrfs_quota_enable(). This ends up serializing quota
> enable and snapshot creation as if the snapshot creation happened just
> before the quota enable request. The quota rescan task, scheduled after
> committing the transaction in btrfs_quota_enable(), will do the accounting.
> 
> Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
> Signed-off-by: Filipe Manana 
> ---
> 
> V2: Added second deadlock example to changelog and changed the fix to deal
> with that case as well.
> 
>  fs/btrfs/qgroup.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index d4917c0cddf5..ae1358253b7b 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1013,16 +1013,22 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>   btrfs_abort_transaction(trans, ret);
>   goto out_free_path;
>   }
> - spin_lock(&fs_info->qgroup_lock);
> - fs_info->quota_root = quota_root;
> - set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
> - spin_unlock(&fs_info->qgroup_lock);
>  
>   ret = btrfs_commit_transaction(trans);
>   trans =

Re: [PATCH] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Qu Wenruo



On 2018/11/19 7:52 PM, Filipe Manana wrote:
> On Mon, Nov 19, 2018 at 11:35 AM Qu Wenruo wrote:
>>
>>
>>
>> On 2018/11/19 7:13 PM, Filipe Manana wrote:
>>> On Mon, Nov 19, 2018 at 11:09 AM Qu Wenruo wrote:
>>>>
>>>>
>>>>
>>>> On 2018/11/19 5:48 PM, fdman...@kernel.org wrote:
>>>>> From: Filipe Manana 
>>>>>
>>>>> If the quota enable and snapshot creation ioctls are called concurrently
>>>>> we can get into a deadlock where the task enabling quotas will deadlock
>>>>> on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
>>>>> twice. The following time diagram shows how this happens.
>>>>>
>>>>>           CPU 0                                    CPU 1
>>>>>
>>>>> btrfs_ioctl()
>>>>>  btrfs_ioctl_quota_ctl()
>>>>>   btrfs_quota_enable()
>>>>>    mutex_lock(fs_info->qgroup_ioctl_lock)
>>>>>    btrfs_start_transaction()
>>>>>
>>>>>                                          btrfs_ioctl()
>>>>>                                           btrfs_ioctl_snap_create_v2
>>>>>                                            create_snapshot()
>>>>>                                             --> adds snapshot to the
>>>>>                                                 list pending_snapshots
>>>>>                                                 of the current
>>>>>                                                 transaction
>>>>>
>>>>>    btrfs_commit_transaction()
>>>>>     create_pending_snapshots()
>>>>>       create_pending_snapshot()
>>>>>        qgroup_account_snapshot()
>>>>>         btrfs_qgroup_inherit()
>>>>>          mutex_lock(fs_info->qgroup_ioctl_lock)
>>>>>           --> deadlock, mutex already locked
>>>>>               by this task at
>>>>>               btrfs_quota_enable()
>>>>
>>>> The backtrace looks valid.
>>>>
>>>>>
>>>>> So fix this by adding a flag to the transaction handle that signals if the
>>>>> transaction is being used for enabling quotas (only seen by the task doing
>>>>> it) and do not lock the mutex qgroup_ioctl_lock at btrfs_qgroup_inherit()
>>>>> if the transaction handle corresponds to the one being used to enable the
>>>>> quotas.
>>>>>
>>>>> Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
>>>>> Signed-off-by: Filipe Manana 
>>>>> ---
>>>>>  fs/btrfs/qgroup.c      | 10 ++++++++--
>>>>>  fs/btrfs/transaction.h |  1 +
>>>>>  2 files changed, 9 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>>>>> index d4917c0cddf5..3aec3bfa3d70 100644
>>>>> --- a/fs/btrfs/qgroup.c
>>>>> +++ b/fs/btrfs/qgroup.c
>>>>> @@ -908,6 +908,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>>>>>   trans = NULL;
>>>>>   goto out;
>>>>>   }
>>>>> + trans->enabling_quotas = true;
>>>>
>>>> Should we put enabling_quotas bit into btrfs_transaction instead of
>>>> btrfs_trans_handle?
>>>
>>> Why?
>>> Only the task which is enabling quotas needs to know about it.
>>
>> But it is btrfs_qgroup_inherit() that uses the trans handle to avoid the
>> deadlock.
>>
>> What ensures that btrfs_qgroup_inherit() gets exactly the same trans
>> handle that was allocated here?
> 
> If it's the other task (the one creating a snapshot) that starts the
> transaction commit, it will have to wait for the task enabling quotas to
> release the transaction. Once that task also calls commit_transaction(),
> it will skip doing the commit itself and wait for the snapshot one to
> finish the commit, while holding the qgroup mutex (this part I missed
> before).
> So yes, we'll have to use a bit in the transaction itself instead.
> 
>>
>>>
>>>>
>>>> Isn't it possible to have different trans handles pointing to the same
>>>> transaction?
>>>
>>> Yes.
>>>
>>>>
>>>> And I'm not rea

Re: [PATCH] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Qu Wenruo



On 2018/11/19 7:13 PM, Filipe Manana wrote:
> On Mon, Nov 19, 2018 at 11:09 AM Qu Wenruo wrote:
>>
>>
>>
>> On 2018/11/19 5:48 PM, fdman...@kernel.org wrote:
>>> From: Filipe Manana 
>>>
>>> If the quota enable and snapshot creation ioctls are called concurrently
>>> we can get into a deadlock where the task enabling quotas will deadlock
>>> on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
>>> twice. The following time diagram shows how this happens.
>>>
>>>           CPU 0                                    CPU 1
>>>
>>> btrfs_ioctl()
>>>  btrfs_ioctl_quota_ctl()
>>>   btrfs_quota_enable()
>>>    mutex_lock(fs_info->qgroup_ioctl_lock)
>>>    btrfs_start_transaction()
>>>
>>>                                          btrfs_ioctl()
>>>                                           btrfs_ioctl_snap_create_v2
>>>                                            create_snapshot()
>>>                                             --> adds snapshot to the
>>>                                                 list pending_snapshots
>>>                                                 of the current
>>>                                                 transaction
>>>
>>>    btrfs_commit_transaction()
>>>     create_pending_snapshots()
>>>       create_pending_snapshot()
>>>        qgroup_account_snapshot()
>>>         btrfs_qgroup_inherit()
>>>          mutex_lock(fs_info->qgroup_ioctl_lock)
>>>           --> deadlock, mutex already locked
>>>               by this task at
>>>               btrfs_quota_enable()
>>
>> The backtrace looks valid.
>>
>>>
>>> So fix this by adding a flag to the transaction handle that signals if the
>>> transaction is being used for enabling quotas (only seen by the task doing
>>> it) and do not lock the mutex qgroup_ioctl_lock at btrfs_qgroup_inherit()
>>> if the transaction handle corresponds to the one being used to enable the
>>> quotas.
>>>
>>> Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
>>> Signed-off-by: Filipe Manana 
>>> ---
>>>  fs/btrfs/qgroup.c      | 10 ++++++++--
>>>  fs/btrfs/transaction.h |  1 +
>>>  2 files changed, 9 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>>> index d4917c0cddf5..3aec3bfa3d70 100644
>>> --- a/fs/btrfs/qgroup.c
>>> +++ b/fs/btrfs/qgroup.c
>>> @@ -908,6 +908,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>>>   trans = NULL;
>>>   goto out;
>>>   }
>>> + trans->enabling_quotas = true;
>>
>> Should we put enabling_quotas bit into btrfs_transaction instead of
>> btrfs_trans_handle?
> 
> Why?
> Only the task which is enabling quotas needs to know about it.

But it is btrfs_qgroup_inherit() that uses the trans handle to avoid the
deadlock.

What ensures that btrfs_qgroup_inherit() gets exactly the same trans
handle that was allocated here?
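
To illustrate the concern, here is a reduced sketch with made-up structures (model_transaction and model_trans_handle are not the real btrfs definitions). It assumes, as discussed in this thread, that every task joining the running transaction gets its own handle pointing at the same transaction.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-ins for the two btrfs structures. */
struct model_transaction {
	bool enabling_quotas;	/* shared by every handle of the transaction */
};

struct model_trans_handle {
	struct model_transaction *transaction;
	bool enabling_quotas;	/* private to the task holding this handle */
};

/* Each task that joins the running transaction gets its own handle. */
static struct model_trans_handle model_join(struct model_transaction *t)
{
	struct model_trans_handle h = { .transaction = t, .enabling_quotas = false };

	return h;
}

/* What the committing task can learn from its own handle alone. */
static bool seen_via_handle(const struct model_trans_handle *h)
{
	return h->enabling_quotas;
}

/* What it can learn through the shared transaction. */
static bool seen_via_transaction(const struct model_trans_handle *h)
{
	return h->transaction->enabling_quotas;
}
```

If the task enabling quotas sets the flag only on its own handle, a second handle joined to the same transaction still reads false, so a commit driven by the snapshot task would not see it. A flag stored in the transaction is visible through both handles.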

> 
>>
>> Isn't it possible to have different trans handles pointing to the same
>> transaction?
> 
> Yes.
> 
>>
>> And I'm not really sure about the naming "enabling_quotas".
>> What about "quota_ioctl_mutex_hold"? (Well, this also sounds awful)
> 
> Too long.

Anyway, the current name doesn't really show why we can skip the mutex
locking. I just hope we can find a better name.

Thanks,
Qu

> 
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>>   fs_info->qgroup_ulist = ulist_alloc(GFP_KERNEL);
>>>   if (!fs_info->qgroup_ulist) {
>>> @@ -2250,7 +2251,11 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>>>   u32 level_size = 0;
>>>   u64 nums;
>>>
>>> - mutex_lock(&fs_info->qgroup_ioctl_lock);
>>> + if (trans->enabling_quotas)
>>> + lockdep_assert_held(&fs_info->qgroup_ioctl_lock);
>>> + else
>>> + mutex_lock(&fs_info->qgroup_ioctl_lock);
>>> +
>>>   if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
>>>   goto out;
>>>
>>> @@ -2413,7 +2418,8 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>>>  unlock:
>>>   spin_unlock(&fs_info->qgroup_lock);
>>>  out:
>>> - mutex_unlock(&fs_info->qgroup_ioctl_lock);
>>> + if (!trans->enabling_quotas)
>>> + mutex_unlock(&fs_info->qgroup_ioctl_lock);
>>>   return ret;
>>>  }
>>>
>>> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
>>> index 703d5116a2fc..a5553a1dee30 100644
>>> --- a/fs/btrfs/transaction.h
>>> +++ b/fs/btrfs/transaction.h
>>> @@ -122,6 +122,7 @@ struct btrfs_trans_handle {
>>>   bool reloc_reserved;
>>>   bool sync;
>>>   bool dirty;
>>> + bool enabling_quotas;
>>>   struct btrfs_root *root;
>>>   struct btrfs_fs_info *fs_info;
>>>   struct list_head new_bgs;
>>>
>>




Re: [PATCH] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Qu Wenruo



On 2018/11/19 5:48 PM, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> If the quota enable and snapshot creation ioctls are called concurrently
> we can get into a deadlock where the task enabling quotas will deadlock
> on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
> twice. The following time diagram shows how this happens.
> 
>           CPU 0                                    CPU 1
> 
> btrfs_ioctl()
>  btrfs_ioctl_quota_ctl()
>   btrfs_quota_enable()
>    mutex_lock(fs_info->qgroup_ioctl_lock)
>    btrfs_start_transaction()
> 
>                                          btrfs_ioctl()
>                                           btrfs_ioctl_snap_create_v2
>                                            create_snapshot()
>                                             --> adds snapshot to the
>                                                 list pending_snapshots
>                                                 of the current
>                                                 transaction
> 
>    btrfs_commit_transaction()
>     create_pending_snapshots()
>       create_pending_snapshot()
>        qgroup_account_snapshot()
>         btrfs_qgroup_inherit()
>          mutex_lock(fs_info->qgroup_ioctl_lock)
>           --> deadlock, mutex already locked
>               by this task at
>               btrfs_quota_enable()

The backtrace looks valid.

> 
> So fix this by adding a flag to the transaction handle that signals if the
> transaction is being used for enabling quotas (only seen by the task doing
> it) and do not lock the mutex qgroup_ioctl_lock at btrfs_qgroup_inherit()
> if the transaction handle corresponds to the one being used to enable the
> quotas.
> 
> Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/qgroup.c      | 10 ++++++++--
>  fs/btrfs/transaction.h |  1 +
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index d4917c0cddf5..3aec3bfa3d70 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -908,6 +908,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>   trans = NULL;
>   goto out;
>   }
> + trans->enabling_quotas = true;

Should we put enabling_quotas bit into btrfs_transaction instead of
btrfs_trans_handle?

Isn't it possible to have different trans handles pointing to the same
transaction?

And I'm not really sure about the naming "enabling_quotas".
What about "quota_ioctl_mutex_hold"? (Well, this also sounds awful)

Thanks,
Qu

>  
>   fs_info->qgroup_ulist = ulist_alloc(GFP_KERNEL);
>   if (!fs_info->qgroup_ulist) {
> @@ -2250,7 +2251,11 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>   u32 level_size = 0;
>   u64 nums;
>  
> - mutex_lock(&fs_info->qgroup_ioctl_lock);
> + if (trans->enabling_quotas)
> + lockdep_assert_held(&fs_info->qgroup_ioctl_lock);
> + else
> + mutex_lock(&fs_info->qgroup_ioctl_lock);
> +
>   if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
>   goto out;
>  
> @@ -2413,7 +2418,8 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  unlock:
>   spin_unlock(&fs_info->qgroup_lock);
>  out:
> - mutex_unlock(&fs_info->qgroup_ioctl_lock);
> + if (!trans->enabling_quotas)
> + mutex_unlock(&fs_info->qgroup_ioctl_lock);
>   return ret;
>  }
>  
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 703d5116a2fc..a5553a1dee30 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -122,6 +122,7 @@ struct btrfs_trans_handle {
>   bool reloc_reserved;
>   bool sync;
>   bool dirty;
> + bool enabling_quotas;
>   struct btrfs_root *root;
>   struct btrfs_fs_info *fs_info;
>   struct list_head new_bgs;
> 



signature.asc
Description: OpenPGP digital signature
