Re: [PATCH] Btrfs: remove redundant btrfs_trans_release_metadata
On 5.09.2018 04:14, Liu Bo wrote:
> __btrfs_end_transaction() has done the metadata release twice,
> probably because it used to process delayed refs in between, but now
> that we don't process delayed refs any more, the 2nd release is always
> a noop.
>
> Signed-off-by: Liu Bo

Reviewed-by: Nikolay Borisov

> ---
>  fs/btrfs/transaction.c | 6 ------
>  1 file changed, 6 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index bb1b9f526e98..94b036a74d11 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
>  		return 0;
>  	}
>
> -	btrfs_trans_release_metadata(trans);
> -	trans->block_rsv = NULL;
> -
> -	if (!list_empty(&trans->new_bgs))
> -		btrfs_create_pending_block_groups(trans);
> -

The only code which can have any implications for the transaction reserve is btrfs_create_pending_block_groups, since it does insert items. But at this point trans->block_rsv is already NULL, and additionally, even if more reservations are made for this transaction further down, either btrfs_commit_transaction is called or the transaction kthread is, which is going to commit it. So this change really seems inconsequential.

>  	trans->delayed_ref_updates = 0;
>  	if (!trans->sync) {
>  		must_run_delayed_refs =
Re: [PATCH 5/8] btrfs-progs: Wire up delayed refs
On 2018/9/5 下午1:42, Nikolay Borisov wrote: > > > On 5.09.2018 05:10, Qu Wenruo wrote: >> >> >> On 2018/8/16 下午9:10, Nikolay Borisov wrote: >>> This commit enables the delayed refs infrastructures. This entails doing >>> the following: >>> >>> 1. Replacing existing calls of btrfs_extent_post_op (which is the >>> equivalent of delayed refs) with the proper btrfs_run_delayed_refs. >>> As well as eliminating open-coded calls to finish_current_insert and >>> del_pending_extents which execute the delayed ops. >>> >>> 2. Wiring up the addition of delayed refs when freeing extents >>> (btrfs_free_extent) and when adding new extents (alloc_tree_block). >>> >>> 3. Adding calls to btrfs_run_delayed refs in the transaction commit >>> path alongside comments why every call is needed, since it's not always >>> obvious (those call sites were derived empirically by running and >>> debugging existing tests) >>> >>> 4. Correctly flagging the transaction in which we are reinitialising >>> the extent tree. >>> >>> 5 Moving btrfs_write_dirty_block_groups to btrfs_write_dirty_block_groups >>> since blockgroups should be written to disk after the last delayed refs >>> have been run. >>> >>> Signed-off-by: Nikolay Borisov >>> Signed-off-by: David Sterba >> >> Is there something (maybe btrfs_run_delayed_refs()?) missing in btrfs-image? >> >> btrfs-image from devel branch can't restore image correctly, the block >> group used bytes is not correct, thus it can't pass misc nor fsck tests. > > This is really strange, all fsck/misc tests passed with those patches. > Can you be more specific which tests exactly you mean ? One case is fsck/020 with lowmem mode. (Original mode lacks block group->used check). 
More specifically, fsck/020/keyed_data_ref_with_shared_leaf.img.

Using btrfs-image from my distribution (v4.17.1) and devel branch btrfs check (cwd is btrfs-progs, devel branch):

$ btrfs-image -r tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_with_shared_leaf.img ~/test.img
$ btrfs check --mode=lowmem ~/test.img
Opening filesystem to check...
Checking filesystem on /home/adam/test.img
UUID: 12dabcf2-d4da-4a70-9701-9f3d48074e73
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
done with fs roots in lowmem mode, skipping
[7/7] checking quota groups skipped (not enabled on this FS)
found 1208320 bytes used, no error found
total csum bytes: 512
total tree bytes: 684032
total fs tree bytes: 638976
total extent tree bytes: 16384
btree space waste bytes: 305606
file data blocks allocated: 93847552
 referenced 1773568

But if using btrfs-image with your delayed ref patch:

$ ./btrfs-image -r tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_with_shared_leaf.img ~/test.img
# No matter if I'm using btrfs check from devel or 4.17.1
$ btrfs check --mode=lowmem ~/test.img
Opening filesystem to check...
Checking filesystem on /home/adam/test.img UUID: 12dabcf2-d4da-4a70-9701-9f3d48074e73 [1/7] checking root items [2/7] checking extents ERROR: block group[4194304 8388608] used 20480 but extent items used 24576 ERROR: block group[20971520 16777216] used 659456 but extent items used 655360 ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space cache [4/7] checking fs roots [5/7] checking only csums items (without verifying data) [6/7] checking root refs done with fs roots in lowmem mode, skipping [7/7] checking quota groups skipped (not enabled on this FS) found 1208320 bytes used, error(s) found total csum bytes: 512 total tree bytes: 684032 total fs tree bytes: 638976 total extent tree bytes: 16384 btree space waste bytes: 305606 file data blocks allocated: 93847552 referenced 1773568 I'd say, although lowmem check is still far from perfect, it indeed has extra checks original mode lacks, and in this case it indeed exposes problem. Thanks, Qu > >> >> Thanks, >> Qu >> >>> --- >>> check/main.c | 3 +- >>> extent-tree.c | 166 >>> ++ >>> transaction.c | 27 +- >>> 3 files changed, 112 insertions(+), 84 deletions(-) >>> >>> diff --git a/check/main.c b/check/main.c >>> index bc2ee22f7943..b361cd7e26a0 100644 >>> --- a/check/main.c >>> +++ b/check/main.c >>> @@ -8710,7 +8710,7 @@ static int reinit_extent_tree(struct >>> btrfs_trans_handle *trans, >>> fprintf(stderr, "Error adding block group\n"); >>> return ret; >>> } >>> - btrfs_extent_post_op(trans); >>> + btrfs_run_delayed_refs(trans, -1); >>> } >>> >>> ret = reset_balance(trans, fs_info); >>> @@ -9767,6 +9767,7 @@ int cmd_check(int argc, char **argv) >>> goto close_out; >>> } >>> >>> + trans->reinit_extent_tree = true; >>> if (init_extent_tree) { >>> printf("Creating a new
Re: [PATCH 5/8] btrfs-progs: Wire up delayed refs
On 5.09.2018 05:10, Qu Wenruo wrote:
>
> On 2018/8/16 9:10 PM, Nikolay Borisov wrote:
>> This commit enables the delayed refs infrastructure. This entails doing
>> the following:
>>
>> 1. Replacing existing calls of btrfs_extent_post_op (which is the
>> equivalent of delayed refs) with the proper btrfs_run_delayed_refs.
>> As well as eliminating open-coded calls to finish_current_insert and
>> del_pending_extents which execute the delayed ops.
>>
>> 2. Wiring up the addition of delayed refs when freeing extents
>> (btrfs_free_extent) and when adding new extents (alloc_tree_block).
>>
>> 3. Adding calls to btrfs_run_delayed_refs in the transaction commit
>> path alongside comments why every call is needed, since it's not always
>> obvious (those call sites were derived empirically by running and
>> debugging existing tests).
>>
>> 4. Correctly flagging the transaction in which we are reinitialising
>> the extent tree.
>>
>> 5. Moving btrfs_write_dirty_block_groups to btrfs_write_dirty_block_groups
>> since blockgroups should be written to disk after the last delayed refs
>> have been run.
>>
>> Signed-off-by: Nikolay Borisov
>> Signed-off-by: David Sterba
>
> Is there something (maybe btrfs_run_delayed_refs()?) missing in btrfs-image?
>
> btrfs-image from devel branch can't restore the image correctly; the block
> group used bytes is not correct, thus it can't pass misc nor fsck tests.

This is really strange, all fsck/misc tests passed with those patches.
Can you be more specific which tests exactly you mean?
> > Thanks, > Qu > >> --- >> check/main.c | 3 +- >> extent-tree.c | 166 >> ++ >> transaction.c | 27 +- >> 3 files changed, 112 insertions(+), 84 deletions(-) >> >> diff --git a/check/main.c b/check/main.c >> index bc2ee22f7943..b361cd7e26a0 100644 >> --- a/check/main.c >> +++ b/check/main.c >> @@ -8710,7 +8710,7 @@ static int reinit_extent_tree(struct >> btrfs_trans_handle *trans, >> fprintf(stderr, "Error adding block group\n"); >> return ret; >> } >> -btrfs_extent_post_op(trans); >> +btrfs_run_delayed_refs(trans, -1); >> } >> >> ret = reset_balance(trans, fs_info); >> @@ -9767,6 +9767,7 @@ int cmd_check(int argc, char **argv) >> goto close_out; >> } >> >> +trans->reinit_extent_tree = true; >> if (init_extent_tree) { >> printf("Creating a new extent tree\n"); >> ret = reinit_extent_tree(trans, info, >> diff --git a/extent-tree.c b/extent-tree.c >> index 7d6c37c6b371..2fa51bbc0359 100644 >> --- a/extent-tree.c >> +++ b/extent-tree.c >> @@ -1418,8 +1418,6 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle >> *trans, >> err = ret; >> out: >> btrfs_free_path(path); >> -finish_current_insert(trans); >> -del_pending_extents(trans); >> BUG_ON(err); >> return err; >> } >> @@ -1602,8 +1600,6 @@ int btrfs_set_block_flags(struct btrfs_trans_handle >> *trans, u64 bytenr, >> btrfs_set_extent_flags(l, item, flags); >> out: >> btrfs_free_path(path); >> -finish_current_insert(trans); >> -del_pending_extents(trans); >> return ret; >> } >> >> @@ -1701,7 +1697,6 @@ static int write_one_cache_group(struct >> btrfs_trans_handle *trans, >> struct btrfs_block_group_cache *cache) >> { >> int ret; >> -int pending_ret; >> struct btrfs_root *extent_root = trans->fs_info->extent_root; >> unsigned long bi; >> struct extent_buffer *leaf; >> @@ -1717,12 +1712,8 @@ static int write_one_cache_group(struct >> btrfs_trans_handle *trans, >> btrfs_mark_buffer_dirty(leaf); >> btrfs_release_path(path); >> fail: >> -finish_current_insert(trans); >> -pending_ret = del_pending_extents(trans); >> if 
(ret) >> return ret; >> -if (pending_ret) >> -return pending_ret; >> return 0; >> >> } >> @@ -2049,6 +2040,7 @@ static int finish_current_insert(struct >> btrfs_trans_handle *trans) >> int skinny_metadata = >> btrfs_fs_incompat(extent_root->fs_info, SKINNY_METADATA); >> >> + >> while(1) { >> ret = find_first_extent_bit(>extent_ins, 0, , >> , EXTENT_LOCKED); >> @@ -2080,6 +2072,8 @@ static int finish_current_insert(struct >> btrfs_trans_handle *trans) >> BUG_ON(1); >> } >> >> + >> +printf("shouldn't be executed\n"); >> clear_extent_bits(>extent_ins, start, end, EXTENT_LOCKED); >> kfree(extent_op); >> } >> @@ -2379,7 +2373,6 @@ static int __free_extent(struct btrfs_trans_handle >> *trans, >> } >> fail: >> btrfs_free_path(path); >> -finish_current_insert(trans); >> return ret; >> } >> >> @@ -2462,33 +2455,30 @@
[PATCH] btrfs: qgroup: Don't trace subtree if we're dropping tree reloc tree
Tree reloc tree doesn't contribute to qgroup numbers, as we have
accounted them at balance time (check replace_path()).

Skipping such unneeded subtree traces should reduce some performance
overhead.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/extent-tree.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index de6f75f5547b..4588153f414c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8643,7 +8643,13 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 		parent = 0;
 	}

-	if (need_account) {
+	/*
+	 * Tree reloc tree doesn't contribute to qgroup numbers, and
+	 * we have already accounted them at merge time (replace_path),
+	 * thus we could skip expensive subtree trace here.
+	 */
+	if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
+	    need_account) {
 		ret = btrfs_qgroup_trace_subtree(trans, next,
 						 generation, level - 1);
 		if (ret) {
--
2.18.0
[PATCH] btrfs: defrag: use btrfs_mod_outstanding_extents in cluster_pages_for_defrag
Since commit 8b62f87bad9c ("Btrfs: rework outstanding_extents"), manual
operations on outstanding_extents in btrfs_inode have been replaced by
btrfs_mod_outstanding_extents(). The one in cluster_pages_for_defrag()
seems to have been missed, so replace it with
btrfs_mod_outstanding_extents().

Fixes: 8b62f87bad9c ("Btrfs: rework outstanding_extents")
Signed-off-by: Su Yue
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 63600dc2ac4c..c180ded27092 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1308,7 +1308,7 @@ static int cluster_pages_for_defrag(struct inode *inode,

 	if (i_done != page_cnt) {
 		spin_lock(&BTRFS_I(inode)->lock);
-		BTRFS_I(inode)->outstanding_extents++;
+		btrfs_mod_outstanding_extents(BTRFS_I(inode), 1);
 		spin_unlock(&BTRFS_I(inode)->lock);
 		btrfs_delalloc_release_space(inode, data_reserved,
 					     start_index << PAGE_SHIFT,
--
2.18.0
Re: [PATCH 5/8] btrfs-progs: Wire up delayed refs
On 2018/8/16 下午9:10, Nikolay Borisov wrote: > This commit enables the delayed refs infrastructures. This entails doing > the following: > > 1. Replacing existing calls of btrfs_extent_post_op (which is the > equivalent of delayed refs) with the proper btrfs_run_delayed_refs. > As well as eliminating open-coded calls to finish_current_insert and > del_pending_extents which execute the delayed ops. > > 2. Wiring up the addition of delayed refs when freeing extents > (btrfs_free_extent) and when adding new extents (alloc_tree_block). > > 3. Adding calls to btrfs_run_delayed refs in the transaction commit > path alongside comments why every call is needed, since it's not always > obvious (those call sites were derived empirically by running and > debugging existing tests) > > 4. Correctly flagging the transaction in which we are reinitialising > the extent tree. > > 5 Moving btrfs_write_dirty_block_groups to btrfs_write_dirty_block_groups > since blockgroups should be written to disk after the last delayed refs > have been run. > > Signed-off-by: Nikolay Borisov > Signed-off-by: David Sterba Is there something (maybe btrfs_run_delayed_refs()?) missing in btrfs-image? btrfs-image from devel branch can't restore image correctly, the block group used bytes is not correct, thus it can't pass misc nor fsck tests. 
Thanks, Qu > --- > check/main.c | 3 +- > extent-tree.c | 166 > ++ > transaction.c | 27 +- > 3 files changed, 112 insertions(+), 84 deletions(-) > > diff --git a/check/main.c b/check/main.c > index bc2ee22f7943..b361cd7e26a0 100644 > --- a/check/main.c > +++ b/check/main.c > @@ -8710,7 +8710,7 @@ static int reinit_extent_tree(struct btrfs_trans_handle > *trans, > fprintf(stderr, "Error adding block group\n"); > return ret; > } > - btrfs_extent_post_op(trans); > + btrfs_run_delayed_refs(trans, -1); > } > > ret = reset_balance(trans, fs_info); > @@ -9767,6 +9767,7 @@ int cmd_check(int argc, char **argv) > goto close_out; > } > > + trans->reinit_extent_tree = true; > if (init_extent_tree) { > printf("Creating a new extent tree\n"); > ret = reinit_extent_tree(trans, info, > diff --git a/extent-tree.c b/extent-tree.c > index 7d6c37c6b371..2fa51bbc0359 100644 > --- a/extent-tree.c > +++ b/extent-tree.c > @@ -1418,8 +1418,6 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle > *trans, > err = ret; > out: > btrfs_free_path(path); > - finish_current_insert(trans); > - del_pending_extents(trans); > BUG_ON(err); > return err; > } > @@ -1602,8 +1600,6 @@ int btrfs_set_block_flags(struct btrfs_trans_handle > *trans, u64 bytenr, > btrfs_set_extent_flags(l, item, flags); > out: > btrfs_free_path(path); > - finish_current_insert(trans); > - del_pending_extents(trans); > return ret; > } > > @@ -1701,7 +1697,6 @@ static int write_one_cache_group(struct > btrfs_trans_handle *trans, >struct btrfs_block_group_cache *cache) > { > int ret; > - int pending_ret; > struct btrfs_root *extent_root = trans->fs_info->extent_root; > unsigned long bi; > struct extent_buffer *leaf; > @@ -1717,12 +1712,8 @@ static int write_one_cache_group(struct > btrfs_trans_handle *trans, > btrfs_mark_buffer_dirty(leaf); > btrfs_release_path(path); > fail: > - finish_current_insert(trans); > - pending_ret = del_pending_extents(trans); > if (ret) > return ret; > - if (pending_ret) > - return pending_ret; > 
return 0; > > } > @@ -2049,6 +2040,7 @@ static int finish_current_insert(struct > btrfs_trans_handle *trans) > int skinny_metadata = > btrfs_fs_incompat(extent_root->fs_info, SKINNY_METADATA); > > + > while(1) { > ret = find_first_extent_bit(>extent_ins, 0, , > , EXTENT_LOCKED); > @@ -2080,6 +2072,8 @@ static int finish_current_insert(struct > btrfs_trans_handle *trans) > BUG_ON(1); > } > > + > + printf("shouldn't be executed\n"); > clear_extent_bits(>extent_ins, start, end, EXTENT_LOCKED); > kfree(extent_op); > } > @@ -2379,7 +2373,6 @@ static int __free_extent(struct btrfs_trans_handle > *trans, > } > fail: > btrfs_free_path(path); > - finish_current_insert(trans); > return ret; > } > > @@ -2462,33 +2455,30 @@ int btrfs_free_extent(struct btrfs_trans_handle > *trans, > u64 bytenr, u64 num_bytes, u64 parent, > u64 root_objectid, u64 owner, u64 offset) > { > - struct btrfs_root *extent_root = root->fs_info->extent_root; > -
[PATCH v2] Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes
Here we're not releasing any space, but transferring bytes from
->bytes_may_use to ->bytes_reserved.

Signed-off-by: Liu Bo
---
v2: Add missing commit log.

 fs/btrfs/extent-tree.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 41a02cbb5a4a..76ee5ebef2b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6401,10 +6401,6 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
 	} else {
 		cache->reserved += num_bytes;
 		space_info->bytes_reserved += num_bytes;
-
-		trace_btrfs_space_reservation(cache->fs_info,
-					      "space_info", space_info->flags,
-					      ram_bytes, 0);
 		space_info->bytes_may_use -= ram_bytes;
 		if (delalloc)
 			cache->delalloc_bytes += num_bytes;
--
1.8.3.1
nbdkit as a flexible alternative to loopback mounts
https://rwmj.wordpress.com/2018/09/04/nbdkit-as-a-flexible-alternative-to-loopback-mounts/

This is a pretty cool writeup. I can vouch that Btrfs will format, mount, write, and scrub, and that btrfs check works, on an 8EiB (virtual) disk. The one thing I thought might cause a problem is that the nbd device has a 1KiB sector size, but Btrfs (on x86_64) still uses a 4096 byte "sector", and it all seems to work fine despite that.

Anyway, maybe it's useful for some fstests instead of file-backed losetup devices?

--
Chris Murphy
Re: [PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap
On Thu, Aug 30, 2018 at 10:42 AM, Josef Bacik wrote:
> From: Josef Bacik
>
> We need to clear the max_extent_size when we clear bits from a bitmap
> since it could have been from the range that contains the
> max_extent_size.

Looks OK.

Reviewed-by: Liu Bo

thanks,
liubo

> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/free-space-cache.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 53521027dd78..7faca05e61ea 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1683,6 +1683,8 @@ static inline void __bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
>  	bitmap_clear(info->bitmap, start, count);
>
>  	info->bytes -= bytes;
> +	if (info->max_extent_size > ctl->unit)
> +		info->max_extent_size = 0;
>  }
>
>  static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
> --
> 2.14.3
Re: [PATCH 08/35] btrfs: release metadata before running delayed refs
On Thu, Aug 30, 2018 at 10:41 AM, Josef Bacik wrote:
> We want to release the unused reservation we have since it refills the
> delayed refs reserve, which will make everything go smoother when
> running the delayed refs if we're short on our reservation.

Looks good.

Reviewed-by: Liu Bo

thanks,
liubo

> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/transaction.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 99741254e27e..ebb0c0405598 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  		return ret;
>  	}
>
> +	btrfs_trans_release_metadata(trans);
> +	trans->block_rsv = NULL;
> +
>  	/* make a pass through all the delayed refs we have so far
>  	 * any runnings procs may add more while we are here
>  	 */
> @@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  		return ret;
>  	}
>
> -	btrfs_trans_release_metadata(trans);
> -	trans->block_rsv = NULL;
> -
>  	cur_trans = trans->transaction;
>
>  	/*
> --
> 2.14.3
[PATCH] Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes
Signed-off-by: Liu Bo
---
 fs/btrfs/extent-tree.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 41a02cbb5a4a..76ee5ebef2b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6401,10 +6401,6 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
 	} else {
 		cache->reserved += num_bytes;
 		space_info->bytes_reserved += num_bytes;
-
-		trace_btrfs_space_reservation(cache->fs_info,
-					      "space_info", space_info->flags,
-					      ram_bytes, 0);
 		space_info->bytes_may_use -= ram_bytes;
 		if (delalloc)
 			cache->delalloc_bytes += num_bytes;
--
1.8.3.1
Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order
On 2018/9/5 4:37 AM, Chris Murphy wrote:
> On Tue, Sep 4, 2018 at 10:22 AM, Etienne Champetier wrote:
>
>> Do you have a procedure to copy all subvolumes & skip errors? (I have
>> ~200 snapshots)
>
> If they're already read-only snapshots, then script an iteration of
> btrfs send/receive to a new volume.

Doesn't simple "cp -r" work here? (If the important thing is the data, not the subvolume layout.)

Thanks,
Qu

> Btrfs seed-sprout would be ideal, however in this case I don't think it
> can help because a.) it's temporarily one file system, which could
> mean the corruption is inherited; and b.) I'm not sure it's multiple
> device aware, so either the btrfstune -S1 might fail on 2+ device
> Btrfs volumes, or possibly it insists on a two device sprout in order
> to replicate a two device seed.
>
> If they're not already read-only, it's tricky because it sounds like
> mounting rw is possibly risky, and taking read only snapshots might
> fail anyway. There is no way to make read only snapshots unless the
> volume can be written to; and no way to force a rw subvolume to be
> treated as if it were read only even if the volume is mounted read
> only. And it takes a read only subvolume for send to work.
[PATCH] Btrfs: remove redundant btrfs_trans_release_metadata
__btrfs_end_transaction() has done the metadata release twice, probably
because it used to process delayed refs in between, but now that we
don't process delayed refs any more, the 2nd release is always a noop.

Signed-off-by: Liu Bo
---
 fs/btrfs/transaction.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index bb1b9f526e98..94b036a74d11 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 		return 0;
 	}

-	btrfs_trans_release_metadata(trans);
-	trans->block_rsv = NULL;
-
-	if (!list_empty(&trans->new_bgs))
-		btrfs_create_pending_block_groups(trans);
-
 	trans->delayed_ref_updates = 0;
 	if (!trans->sync) {
 		must_run_delayed_refs =
--
1.8.3.1
Re: [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper
On Thu, Aug 30, 2018 at 10:41 AM, Josef Bacik wrote: > From: Josef Bacik > > We were missing some quota cleanups in check_ref_cleanup, so break the > ref head accounting cleanup into a helper and call that from both > check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that > we don't screw up accounting in the future for other things that we add. > Reviewed-by: Liu Bo thanks, liubo > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 67 > +- > 1 file changed, 39 insertions(+), 28 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 6799950fa057..4c9fd35bca07 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle > *trans, > return ret ? ret : 1; > } > > +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, > + struct btrfs_delayed_ref_head *head) > +{ > + struct btrfs_fs_info *fs_info = trans->fs_info; > + struct btrfs_delayed_ref_root *delayed_refs = > + >transaction->delayed_refs; > + > + if (head->total_ref_mod < 0) { > + struct btrfs_space_info *space_info; > + u64 flags; > + > + if (head->is_data) > + flags = BTRFS_BLOCK_GROUP_DATA; > + else if (head->is_system) > + flags = BTRFS_BLOCK_GROUP_SYSTEM; > + else > + flags = BTRFS_BLOCK_GROUP_METADATA; > + space_info = __find_space_info(fs_info, flags); > + ASSERT(space_info); > + percpu_counter_add_batch(_info->total_bytes_pinned, > + -head->num_bytes, > + BTRFS_TOTAL_BYTES_PINNED_BATCH); > + > + if (head->is_data) { > + spin_lock(_refs->lock); > + delayed_refs->pending_csums -= head->num_bytes; > + spin_unlock(_refs->lock); > + } > + } > + > + /* Also free its reserved qgroup space */ > + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, > + head->qgroup_reserved); > +} > + > static int cleanup_ref_head(struct btrfs_trans_handle *trans, > struct btrfs_delayed_ref_head *head) > { > @@ -2496,31 +2531,6 @@ static int 
cleanup_ref_head(struct btrfs_trans_handle > *trans, > spin_unlock(_refs->lock); > spin_unlock(>lock); > > - trace_run_delayed_ref_head(fs_info, head, 0); > - > - if (head->total_ref_mod < 0) { > - struct btrfs_space_info *space_info; > - u64 flags; > - > - if (head->is_data) > - flags = BTRFS_BLOCK_GROUP_DATA; > - else if (head->is_system) > - flags = BTRFS_BLOCK_GROUP_SYSTEM; > - else > - flags = BTRFS_BLOCK_GROUP_METADATA; > - space_info = __find_space_info(fs_info, flags); > - ASSERT(space_info); > - percpu_counter_add_batch(_info->total_bytes_pinned, > - -head->num_bytes, > - BTRFS_TOTAL_BYTES_PINNED_BATCH); > - > - if (head->is_data) { > - spin_lock(_refs->lock); > - delayed_refs->pending_csums -= head->num_bytes; > - spin_unlock(_refs->lock); > - } > - } > - > if (head->must_insert_reserved) { > btrfs_pin_extent(fs_info, head->bytenr, > head->num_bytes, 1); > @@ -2530,9 +2540,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle > *trans, > } > } > > - /* Also free its reserved qgroup space */ > - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, > - head->qgroup_reserved); > + cleanup_ref_head_accounting(trans, head); > + > + trace_run_delayed_ref_head(fs_info, head, 0); > btrfs_delayed_ref_unlock(head); > btrfs_put_delayed_ref_head(head); > return 0; > @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct > btrfs_trans_handle *trans, > if (head->must_insert_reserved) > ret = 1; > > + cleanup_ref_head_accounting(trans, head); > mutex_unlock(>mutex); > btrfs_put_delayed_ref_head(head); > return ret; > -- > 2.14.3 >
[PATCH v5 2/2] vfs: dedupe should return EPERM if permission is not granted
Right now we return EINVAL if a process does not have permission to
dedupe a file. This was an oversight on my part. EPERM gives a true
description of the nature of our error, and EINVAL is already used for
the case that the filesystem does not support dedupe.

Signed-off-by: Mark Fasheh
Reviewed-by: Darrick J. Wong
Acked-by: David Sterba
---
 fs/read_write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 71e9077f8bc1..7188982e2733 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2050,7 +2050,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 		if (info->reserved) {
 			info->status = -EINVAL;
 		} else if (!allow_file_dedupe(dst_file)) {
-			info->status = -EINVAL;
+			info->status = -EPERM;
 		} else if (file->f_path.mnt != dst_file->f_path.mnt) {
 			info->status = -EXDEV;
 		} else if (S_ISDIR(dst->i_mode)) {
--
2.15.1
[PATCH v5 1/2] vfs: allow dedupe of user owned read-only files
The permission check in vfs_dedupe_file_range() is too coarse - we only
allow dedupe of the destination file if the user is root, or they have
the file open for write.

This effectively limits a non-root user from deduping their own
read-only files. In addition, the write file descriptor that the user
is forced to hold open can prevent execution of files. As file data
during a dedupe does not change, the behavior is unexpected and this
has caused a number of issue reports. For an example, see:

https://github.com/markfasheh/duperemove/issues/129

So change the check so we allow dedupe on the target if:

- the root or admin is asking for it
- the process has write access
- the owner of the file is asking for the dedupe
- the process could get write access

That way users can open read-only and still get dedupe.

Signed-off-by: Mark Fasheh
Acked-by: Darrick J. Wong
---
 fs/read_write.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index e83bd9744b5d..71e9077f8bc1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1964,6 +1964,20 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range_compare);

+/* Check whether we are allowed to dedupe the destination file */
+static bool allow_file_dedupe(struct file *file)
+{
+	if (capable(CAP_SYS_ADMIN))
+		return true;
+	if (file->f_mode & FMODE_WRITE)
+		return true;
+	if (uid_eq(current_fsuid(), file_inode(file)->i_uid))
+		return true;
+	if (!inode_permission(file_inode(file), MAY_WRITE))
+		return true;
+	return false;
+}
+
 int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 {
 	struct file_dedupe_range_info *info;
@@ -1972,7 +1986,6 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	u64 len;
 	int i;
 	int ret;
-	bool is_admin = capable(CAP_SYS_ADMIN);
 	u16 count = same->dest_count;
 	struct file *dst_file;
 	loff_t dst_off;
@@ -2036,7 +2049,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 		if (info->reserved) {
 			info->status = -EINVAL;
-		} else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+		} else if (!allow_file_dedupe(dst_file)) {
 			info->status = -EINVAL;
 		} else if (file->f_path.mnt != dst_file->f_path.mnt) {
 			info->status = -EXDEV;
--
2.15.1
[RESEND][PATCH v5 0/2] vfs: fix dedupe permission check
Hi Andrew/Al,

Could I please have these patches put in a tree for more public testing? They've hit fsdevel a few times now; I have links to the discussions in the change log below.

The following patches fix a couple of issues with the permission check we do in vfs_dedupe_file_range().

The first patch expands our check to allow dedupe of a file if the user owns it or otherwise would be allowed to write to it. Current behavior is that we'll allow dedupe only if:

- the user is an admin (root)
- the user has the file open for write

This makes it impossible for a user to dedupe their own file set unless they do it as root, or ensure that all files have write permission. There are a couple of duperemove bugs open for this:

https://github.com/markfasheh/duperemove/issues/129
https://github.com/markfasheh/duperemove/issues/86

The other problem we have is also related to forcing the user to open target files for write - a process trying to exec a file currently being deduped gets ETXTBUSY. The answer (as above) is to allow them to open the targets ro - root can already do this. There was a patch from Adam Borowski to fix this back in 2016:

https://lkml.org/lkml/2016/7/17/130

which I have incorporated into my changes.

The 2nd patch fixes our return code for permission denied to be EPERM. For some reason we're returning EINVAL - I think that's probably my fault. At any rate, we need to be returning something descriptive of the actual problem, otherwise callers see EINVAL and can't really make a valid determination of what's gone wrong. This has also popped up in duperemove, mostly in the form of cryptic error messages. Because this is a code returned to userspace, I did check the other users of extent-same that I could find. Both 'bees' and 'rust-btrfs' do the same as duperemove and simply report the error (as they should).

Lastly, I have an update to the fi_deduperange manpage to reflect these changes. That patch is attached below.

Please apply.
git pull https://github.com/markfasheh/linux dedupe-perms

Thanks,
--Mark

Changes from V4 to V5:
- Rebase and retest on 4.18-rc8
- Place updated manpage patch below, CC linux-api
- V4 discussion: https://patchwork.kernel.org/patch/10530365/

Changes from V3 to V4:
- Add a patch (below) to ioctl_fideduperange.2 explaining our changes.
  I will send this patch once the kernel update is accepted. Thanks to
  Darrick Wong for this suggestion.
- V3 discussion: https://www.spinics.net/lists/linux-btrfs/msg79135.html

Changes from V2 to V3:
- Return bool from allow_file_dedupe
- V2 discussion: https://www.spinics.net/lists/linux-btrfs/msg78421.html

Changes from V1 to V2:
- Add inode_permission check as suggested by Adam Borowski
- V1 discussion: https://marc.info/?l=linux-xfs=152606684017965=2

From: Mark Fasheh

[PATCH] ioctl_fideduperange.2: clarify permission requirements

dedupe permission checks were recently relaxed - update our man page to
reflect those changes.

Signed-off-by: Mark Fasheh
---
 man2/ioctl_fideduperange.2 | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/man2/ioctl_fideduperange.2 b/man2/ioctl_fideduperange.2
index 84d20a276..4040ee064 100644
--- a/man2/ioctl_fideduperange.2
+++ b/man2/ioctl_fideduperange.2
@@ -105,9 +105,12 @@
 The field must be zero.
 During the call,
 .IR src_fd
-must be open for reading and
+must be open for reading.
 .IR dest_fd
-must be open for writing.
+can be open for writing, or reading.
+If
+.IR dest_fd
+is open for reading, the user must have write access to the file.
 The combined size of the struct
 .IR file_dedupe_range
 and the struct
@@ -185,8 +188,8 @@
 This can appear if the filesystem does not support deduplicating either file descriptor, or if either file descriptor refers to special inodes.
 .TP
 .B EPERM
-.IR dest_fd
-is immutable.
+This will be returned if the user lacks permission to dedupe the file referenced by
+.IR dest_fd .
 .TP
 .B ETXTBSY
 One of the files is a swap file.
-- 
2.15.1
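The relaxed rule described in the cover letter can be modeled in a stand-alone sketch. The struct and its fields below are illustrative stand-ins, not the kernel's types: the in-kernel check consults capable(CAP_SYS_ADMIN), file->f_mode, the inode owner, and inode_permission() instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel state the real check consults. */
struct dedupe_target {
	bool user_is_admin;     /* capable(CAP_SYS_ADMIN) */
	bool open_for_write;    /* file->f_mode & FMODE_WRITE */
	bool user_owns_file;    /* current fsuid matches inode->i_uid */
	bool user_may_write;    /* inode_permission(inode, MAY_WRITE) == 0 */
};

/*
 * Sketch of the relaxed rule: dedupe is allowed if the old conditions
 * hold (admin, or target open for write) OR the caller owns the file
 * or would be allowed to write to it.
 */
static bool allow_file_dedupe(const struct dedupe_target *t)
{
	if (t->user_is_admin)
		return true;
	if (t->open_for_write)
		return true;
	if (t->user_owns_file)
		return true;
	if (t->user_may_write)
		return true;
	return false;
}
```

With this shape, a user who merely owns a file, or could have opened it for writing, passes the check even on a read-only descriptor, and a false result maps naturally to EPERM rather than EINVAL.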
Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order
On Tue, Sep 4, 2018 at 10:22 AM, Etienne Champetier wrote:
> Do you have a procedure to copy all subvolumes & skip error ? (I have
> ~200 snapshots)

If they're already read-only snapshots, then script an iteration of btrfs send/receive to a new volume.

Btrfs seed-sprout would be ideal, however in this case I don't think it can help because a.) it's temporarily one file system, which could mean the corruption is inherited; and b.) I'm not sure it's multiple device aware, so either the btrfstune -S 1 might fail on 2+ device Btrfs volumes, or possibly it insists on a two device sprout in order to replicate a two device seed.

If they're not already read-only, it's tricky because it sounds like mounting rw is possibly risky, and taking read-only snapshots might fail anyway. There is no way to make read-only snapshots unless the volume can be written to; and no way to force a rw subvolume to be treated as if it were read-only even if the volume is mounted read-only. And it takes a read-only subvolume for send to work.

-- Chris Murphy
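Such an iteration can be sketched as below. The paths and the $BTRFS override are assumptions for illustration, not a blessed recovery procedure; the loop deliberately keeps going (and counts) when one snapshot fails to transfer. Note that without pipefail only the receive side's exit status is checked.

```shell
# Sketch: replicate every read-only snapshot under $1 to $2 with
# btrfs send/receive, skipping (but counting) snapshots that fail.
# $BTRFS defaults to the real tool; point it at a stub for a dry run.
: "${BTRFS:=btrfs}"

replicate_snapshots() {
	src=$1 dst=$2 failed=0
	for snap in "$src"/*; do
		[ -d "$snap" ] || continue
		# note: without pipefail, only the receive status is checked
		if ! $BTRFS send "$snap" | $BTRFS receive "$dst"; then
			echo "skipped: $snap" >&2
			failed=$((failed + 1))
		fi
	done
	echo "$failed"
}
```

Usage: `replicate_snapshots /mnt/old/.snapshots /mnt/new/.snapshots` prints how many snapshots had to be skipped.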
Re: [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
On 4.09.2018 20:57, Josef Bacik wrote:
> On Mon, Sep 03, 2018 at 05:19:19PM +0300, Nikolay Borisov wrote:
>>
>> On 30.08.2018 20:42, Josef Bacik wrote:
>>> +		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
>>> +			flush_state++;
>>
>> This is a bit obscure. So if we allocated a chunk and !commit_cycles
>> just break from the loop? What's the reasoning behind this ?
>
> I'll add a comment, but it doesn't break the loop, it just goes to
> COMMIT_TRANS.
> The idea is we don't want to force a chunk allocation if we're experiencing a
> little bit of pressure, because we could end up with a drive full of empty
> metadata chunks. We want to try committing the transaction first, and then if
> we still have issues we can force a chunk allocation. Thanks,

I think it will be better if this check is moved up somewhere before the
if (flush_state > COMMIT_TRANS) check.

>
> Josef
>
Re: [PATCH 05/35] btrfs: introduce delayed_refs_rsv
On Tue, Sep 04, 2018 at 06:21:23PM +0300, Nikolay Borisov wrote: > > > On 30.08.2018 20:41, Josef Bacik wrote: > > From: Josef Bacik > > > > Traditionally we've had voodoo in btrfs to account for the space that > > delayed refs may take up by having a global_block_rsv. This works most > > of the time, except when it doesn't. We've had issues reported and seen > > in production where sometimes the global reserve is exhausted during > > transaction commit before we can run all of our delayed refs, resulting > > in an aborted transaction. Because of this voodoo we have equally > > dubious flushing semantics around throttling delayed refs which we often > > get wrong. > > > > So instead give them their own block_rsv. This way we can always know > > exactly how much outstanding space we need for delayed refs. This > > allows us to make sure we are constantly filling that reservation up > > with space, and allows us to put more precise pressure on the enospc > > system. Instead of doing math to see if its a good time to throttle, > > the normal enospc code will be invoked if we have a lot of delayed refs > > pending, and they will be run via the normal flushing mechanism. > > > > For now the delayed_refs_rsv will hold the reservations for the delayed > > refs, the block group updates, and deleting csums. We could have a > > separate rsv for the block group updates, but the csum deletion stuff is > > still handled via the delayed_refs so that will stay there. 
> > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/ctree.h | 24 +++- > > fs/btrfs/delayed-ref.c | 28 - > > fs/btrfs/disk-io.c | 3 + > > fs/btrfs/extent-tree.c | 268 > > +++ > > fs/btrfs/transaction.c | 68 +-- > > include/trace/events/btrfs.h | 2 + > > 6 files changed, 294 insertions(+), 99 deletions(-) > > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > > index 66f1d3895bca..0a4e55703d48 100644 > > --- a/fs/btrfs/ctree.h > > +++ b/fs/btrfs/ctree.h > > @@ -452,8 +452,9 @@ struct btrfs_space_info { > > #defineBTRFS_BLOCK_RSV_TRANS 3 > > #defineBTRFS_BLOCK_RSV_CHUNK 4 > > #defineBTRFS_BLOCK_RSV_DELOPS 5 > > -#defineBTRFS_BLOCK_RSV_EMPTY 6 > > -#defineBTRFS_BLOCK_RSV_TEMP7 > > +#define BTRFS_BLOCK_RSV_DELREFS6 > > +#defineBTRFS_BLOCK_RSV_EMPTY 7 > > +#defineBTRFS_BLOCK_RSV_TEMP8 > > > > struct btrfs_block_rsv { > > u64 size; > > @@ -794,6 +795,8 @@ struct btrfs_fs_info { > > struct btrfs_block_rsv chunk_block_rsv; > > /* block reservation for delayed operations */ > > struct btrfs_block_rsv delayed_block_rsv; > > + /* block reservation for delayed refs */ > > + struct btrfs_block_rsv delayed_refs_rsv; > > > > struct btrfs_block_rsv empty_block_rsv; > > > > @@ -2723,10 +2726,12 @@ enum btrfs_reserve_flush_enum { > > enum btrfs_flush_state { > > FLUSH_DELAYED_ITEMS_NR = 1, > > FLUSH_DELAYED_ITEMS = 2, > > - FLUSH_DELALLOC = 3, > > - FLUSH_DELALLOC_WAIT = 4, > > - ALLOC_CHUNK = 5, > > - COMMIT_TRANS= 6, > > + FLUSH_DELAYED_REFS_NR = 3, > > + FLUSH_DELAYED_REFS = 4, > > + FLUSH_DELALLOC = 5, > > + FLUSH_DELALLOC_WAIT = 6, > > + ALLOC_CHUNK = 7, > > + COMMIT_TRANS= 8, > > }; > > > > int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); > > @@ -2777,6 +2782,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info > > *fs_info, > > void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, > > struct btrfs_block_rsv *block_rsv, > > u64 num_bytes); > > +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr); > > +void 
btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans); > > +int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info, > > + enum btrfs_reserve_flush_enum flush); > > +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, > > + struct btrfs_block_rsv *src, > > + u64 num_bytes); > > int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache); > > void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache); > > void btrfs_put_block_group_cache(struct btrfs_fs_info *info); > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > > index 27f7dd4e3d52..96ce087747b2 100644 > > --- a/fs/btrfs/delayed-ref.c > > +++ b/fs/btrfs/delayed-ref.c > > @@ -467,11 +467,14 @@ static int insert_delayed_ref(struct > > btrfs_trans_handle *trans, > > * existing and update must have the same bytenr > > */ > > static noinline void > >
Re: [PATCH 21/35] btrfs: only run delayed refs if we're committing
On Tue, Sep 04, 2018 at 01:54:13PM -0400, Josef Bacik wrote: > On Fri, Aug 31, 2018 at 05:28:09PM -0700, Omar Sandoval wrote: > > On Thu, Aug 30, 2018 at 01:42:11PM -0400, Josef Bacik wrote: > > > I noticed in a giant dbench run that we spent a lot of time on lock > > > contention while running transaction commit. This is because dbench > > > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and > > > they all run the delayed refs first thing, so they all contend with > > > each other. This leads to seconds of 0 throughput. Change this to only > > > run the delayed refs if we're the ones committing the transaction. This > > > makes the latency go away and we get no more lock contention. > > > > This means that we're going to spend more time running delayed refs > > while in TRANS_STATE_COMMIT_START, so couldn't we end up blocking new > > transactions more than before? > > > > You'd think that, but the lock contention is enough that it makes it > unfuckingpossible for anything to run for several seconds while everybody > competes for either the delayed refs lock or the extent root lock. > > With the delayed refs rsv we actually end up running the delayed refs often > enough because of the extra ENOSPC pressure that we don't really end up with > long chunks of time running delayed refs while blocking out START > transactions. > > If at some point down the line this turns out to be an actual issue we can > revisit the best way to do this. Off the top of my head we do something like > wrap it in a "run all the delayed refs" mutex so that all the committers just > wait on whoever wins, and we move it back outside of the start logic in order > to > make it better all the way around. But I don't think that's something we need > to do at this point. Thanks, Ok, that's good enough for me. Reviewed-by: Omar Sandoval
Re: [PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
On Mon, Sep 03, 2018 at 05:19:19PM +0300, Nikolay Borisov wrote: > > > On 30.08.2018 20:42, Josef Bacik wrote: > > + if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles) > > + flush_state++; > > This is a bit obscure. So if we allocated a chunk and !commit_cycles > just break from the loop? What's the reasoning behind this ? I'll add a comment, but it doesn't break the loop, it just goes to COMMIT_TRANS. The idea is we don't want to force a chunk allocation if we're experiencing a little bit of pressure, because we could end up with a drive full of empty metadata chunks. We want to try committing the transaction first, and then if we still have issues we can force a chunk allocation. Thanks, Josef
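The stepping logic under discussion, reduced to a stand-alone model: the real enum carries additional states (e.g. ALLOC_CHUNK before ALLOC_CHUNK_FORCE); this sketch keeps only enough states to show the skip on the first cycle.

```c
#include <assert.h>

/* Simplified model of the flushing state machine discussed above. */
enum flush_state {
	FLUSH_DELAYED_ITEMS_NR = 1,
	FLUSH_DELAYED_ITEMS,
	FLUSH_DELALLOC,
	FLUSH_DELALLOC_WAIT,
	ALLOC_CHUNK_FORCE,
	COMMIT_TRANS,
};

/*
 * Advance to the next flush state. On the first pass (no commit cycles
 * yet) ALLOC_CHUNK_FORCE is skipped, so a transaction commit is tried
 * before metadata chunks are force-allocated -- the "don't fill the
 * drive with empty metadata chunks" rationale from the thread.
 */
static int next_flush_state(int flush_state, int commit_cycles)
{
	flush_state++;
	if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
		flush_state++;
	return flush_state;
}
```

In this model, the first cycle jumps straight from FLUSH_DELALLOC_WAIT to COMMIT_TRANS; only after a commit cycle does ALLOC_CHUNK_FORCE get its turn.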
Re: [PATCH 21/35] btrfs: only run delayed refs if we're committing
On Fri, Aug 31, 2018 at 05:28:09PM -0700, Omar Sandoval wrote: > On Thu, Aug 30, 2018 at 01:42:11PM -0400, Josef Bacik wrote: > > I noticed in a giant dbench run that we spent a lot of time on lock > > contention while running transaction commit. This is because dbench > > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and > > they all run the delayed refs first thing, so they all contend with > > each other. This leads to seconds of 0 throughput. Change this to only > > run the delayed refs if we're the ones committing the transaction. This > > makes the latency go away and we get no more lock contention. > > This means that we're going to spend more time running delayed refs > while in TRANS_STATE_COMMIT_START, so couldn't we end up blocking new > transactions more than before? > You'd think that, but the lock contention is enough that it makes it unfuckingpossible for anything to run for several seconds while everybody competes for either the delayed refs lock or the extent root lock. With the delayed refs rsv we actually end up running the delayed refs often enough because of the extra ENOSPC pressure that we don't really end up with long chunks of time running delayed refs while blocking out START transactions. If at some point down the line this turns out to be an actual issue we can revisit the best way to do this. Off the top of my head we do something like wrap it in a "run all the delayed refs" mutex so that all the committers just wait on whoever wins, and we move it back outside of the start logic in order to make it better all the way around. But I don't think that's something we need to do at this point. Thanks, Josef
Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order
Thanks Qu, one last question I think Le mar. 4 sept. 2018 à 08:33, Qu Wenruo a écrit : > > On 2018/9/4 下午7:53, Etienne Champetier wrote: > > Hi Qu, > > > > Le lun. 3 sept. 2018 à 20:27, Qu Wenruo a écrit : > >> > >> On 2018/9/3 下午10:18, Etienne Champetier wrote: > >>> Hello btfrs hackers, > >>> > >>> I have a computer acting as backup server with BTRFS RAID1, and I > >>> would like to know the different options to rebuild this RAID > >>> (I saw this thread > >>> https://www.spinics.net/lists/linux-btrfs/msg68679.html but there was > >>> no raid 1) > >>> > >>> # uname -a > >>> Linux servmaison 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 > >>> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > >>> > >>> # btrfs --version > >>> btrfs-progs v4.4 > >>> > >>> # dmesg > >>> ... > >>> [ 1955.581972] BTRFS critical (device sda2): corrupt leaf, bad key > >>> order: block=6020235362304,root=1, slot=63 > >>> [ 1955.582299] BTRFS critical (device sda2): corrupt leaf, bad key > >>> order: block=6020235362304,root=1, slot=63 > > > > Now running a Fedora 28 install kernel > > > > # uname -a > > Linux servmaison 4.16.3-301.fc28.x86_64 #1 SMP Mon Apr 23 21:59:58 UTC > > 2018 x86_64 x86_64 x86_64 GNU/Linux > > # btrfs --version > > btrfs-progs v4.15.1 > > Unfortunately, even for latest btrfs-progs release (v4.17.1, and even > devel branch), btrfs check will abort checking if free space cache is > corrupted. > > So we didn't get any useful info from btrfs check. > > Such diff would help you continue checking (if you really want, other > than starting salvaging your data) > -- > diff --git a/check/main.c b/check/main.c > index b361cd7e26a0..4f720163221e 100644 > --- a/check/main.c > +++ b/check/main.c > @@ -9885,7 +9885,6 @@ int cmd_check(int argc, char **argv) > error("errors found in free space tree"); > else > error("errors found in free space cache"); > - goto out; > } > > /* > -- > > > For dump tree block, the corrupted tree block belongs to extent tree. 
> Which could be a good news (depends on how you define GOOD news). > > The corruption is not an easy fix, it's not just a swapped slot. > The corrupted slot (item 64, whole key objectid is 5946810351616) is way > beyond the extent data range, thus btrfs-progs can't fix it easily. > > Considering how much bytenr difference there is and the generation gap > (53167 vs current generation 1555950), the bug happens a long long time > ago (days or weeks before 2016-06-04). So it's a little too late to be > fixed (unless someone could send me a time machine). > > On the other hand, this means any WRITE would easily fail due to > corrupted extent tree, but your fs should be OK if mounted RO, thus you > could copy your data out. > Do you have a procedure to copy all subvolumes & skip error ? (I have ~200 snapshots) > > > >> > >> Please provide the following dump: > >> > >> # btrfs inspect dump-tree -t root /dev/sda2 > >> # btrfs inspect dump-tree -b 6020235362304 /dev/sda2 > > > > All requested dump are in this repo: > > https://github.com/champtar/debugraidbtrfs > > > [snip] > >> > >> If it's the only problem, "btrfs check --repair" indeed could fix it. 
> > > > Also available in https://github.com/champtar/debugraidbtrfs, here > > "btrfs check --readonly /dev/sda2" output > > > > checking extents > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad key ordering 63 64 > > bad block 6020235362304 > > ERROR: errors found in extent allocation tree or chunk allocation > > checking free space cache > > there is no free space entry for 6011561750528-5942842273792 > > there is no free space entry for 6011561750528-6012044050432 > > cache appears valid but isn't 6010970308608 > > there is no free space entry for 6015529828352-5946810351616 > > there is no free space entry for 6015529828352-6016339017728 > > cache appears valid but isn't 6015265275904 > > there is no free space entry for 6139476623360-6070757146624 > > there is no free space entry for 6139476623360-6139852881920 > > cache appears valid but isn't 6138779140096 > > ERROR: errors found in free space cache > > Checking filesystem on /dev/sda2 > > UUID: 4917db5e-fc20-4369-9556-83082a32d4cd > > found 1321120776195 bytes used, error(s) found > > total csum bytes: 0 > > total tree bytes: 1163182080 > > total fs tree bytes: 0 > > total extent tree bytes: 1161740288 > > btree space waste bytes: 290512355 > > file data blocks allocated: 618135552 > > referenced 618135552 > > > >
Re: [PATCH 05/35] btrfs: introduce delayed_refs_rsv
On 30.08.2018 20:41, Josef Bacik wrote: > From: Josef Bacik > > Traditionally we've had voodoo in btrfs to account for the space that > delayed refs may take up by having a global_block_rsv. This works most > of the time, except when it doesn't. We've had issues reported and seen > in production where sometimes the global reserve is exhausted during > transaction commit before we can run all of our delayed refs, resulting > in an aborted transaction. Because of this voodoo we have equally > dubious flushing semantics around throttling delayed refs which we often > get wrong. > > So instead give them their own block_rsv. This way we can always know > exactly how much outstanding space we need for delayed refs. This > allows us to make sure we are constantly filling that reservation up > with space, and allows us to put more precise pressure on the enospc > system. Instead of doing math to see if its a good time to throttle, > the normal enospc code will be invoked if we have a lot of delayed refs > pending, and they will be run via the normal flushing mechanism. > > For now the delayed_refs_rsv will hold the reservations for the delayed > refs, the block group updates, and deleting csums. We could have a > separate rsv for the block group updates, but the csum deletion stuff is > still handled via the delayed_refs so that will stay there. 
> > Signed-off-by: Josef Bacik > --- > fs/btrfs/ctree.h | 24 +++- > fs/btrfs/delayed-ref.c | 28 - > fs/btrfs/disk-io.c | 3 + > fs/btrfs/extent-tree.c | 268 > +++ > fs/btrfs/transaction.c | 68 +-- > include/trace/events/btrfs.h | 2 + > 6 files changed, 294 insertions(+), 99 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 66f1d3895bca..0a4e55703d48 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -452,8 +452,9 @@ struct btrfs_space_info { > #define BTRFS_BLOCK_RSV_TRANS 3 > #define BTRFS_BLOCK_RSV_CHUNK 4 > #define BTRFS_BLOCK_RSV_DELOPS 5 > -#define BTRFS_BLOCK_RSV_EMPTY 6 > -#define BTRFS_BLOCK_RSV_TEMP7 > +#define BTRFS_BLOCK_RSV_DELREFS 6 > +#define BTRFS_BLOCK_RSV_EMPTY 7 > +#define BTRFS_BLOCK_RSV_TEMP8 > > struct btrfs_block_rsv { > u64 size; > @@ -794,6 +795,8 @@ struct btrfs_fs_info { > struct btrfs_block_rsv chunk_block_rsv; > /* block reservation for delayed operations */ > struct btrfs_block_rsv delayed_block_rsv; > + /* block reservation for delayed refs */ > + struct btrfs_block_rsv delayed_refs_rsv; > > struct btrfs_block_rsv empty_block_rsv; > > @@ -2723,10 +2726,12 @@ enum btrfs_reserve_flush_enum { > enum btrfs_flush_state { > FLUSH_DELAYED_ITEMS_NR = 1, > FLUSH_DELAYED_ITEMS = 2, > - FLUSH_DELALLOC = 3, > - FLUSH_DELALLOC_WAIT = 4, > - ALLOC_CHUNK = 5, > - COMMIT_TRANS= 6, > + FLUSH_DELAYED_REFS_NR = 3, > + FLUSH_DELAYED_REFS = 4, > + FLUSH_DELALLOC = 5, > + FLUSH_DELALLOC_WAIT = 6, > + ALLOC_CHUNK = 7, > + COMMIT_TRANS= 8, > }; > > int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); > @@ -2777,6 +2782,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info > *fs_info, > void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, >struct btrfs_block_rsv *block_rsv, >u64 num_bytes); > +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr); > +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans); > +int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info 
*fs_info, > + enum btrfs_reserve_flush_enum flush); > +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, > +struct btrfs_block_rsv *src, > +u64 num_bytes); > int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache); > void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache); > void btrfs_put_block_group_cache(struct btrfs_fs_info *info); > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > index 27f7dd4e3d52..96ce087747b2 100644 > --- a/fs/btrfs/delayed-ref.c > +++ b/fs/btrfs/delayed-ref.c > @@ -467,11 +467,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle > *trans, > * existing and update must have the same bytenr > */ > static noinline void > -update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs, > +update_existing_head_ref(struct btrfs_trans_handle *trans, >struct btrfs_delayed_ref_head *existing, >struct
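The bookkeeping idea behind the dedicated reservation - its size tracks the outstanding ref heads, flushing fills it, and running heads drains it and hands excess back - can be shown with a toy model. The flat BYTES_PER_HEAD and the function names here are simplifications of mine; the real per-head reservation is derived from tree heights and csum items.

```c
#include <assert.h>

/* Toy model of a dedicated delayed-refs reservation. */
struct delayed_refs_rsv {
	unsigned long long size;     /* bytes the rsv should hold */
	unsigned long long reserved; /* bytes actually reserved */
};

#define BYTES_PER_HEAD 4096ULL  /* stand-in for the per-head worst case */

/* Queueing a new ref head grows the target size of the reservation. */
static void delayed_refs_add_head(struct delayed_refs_rsv *rsv)
{
	rsv->size += BYTES_PER_HEAD;
}

/*
 * Called as heads are run: shrink the target size and return however
 * many reserved bytes now exceed it, to be handed back to the
 * space_info for reuse.
 */
static unsigned long long delayed_refs_release(struct delayed_refs_rsv *rsv,
					       unsigned int nr_heads)
{
	unsigned long long freed = BYTES_PER_HEAD * nr_heads;

	rsv->size = rsv->size >= freed ? rsv->size - freed : 0;
	if (rsv->reserved > rsv->size) {
		unsigned long long excess = rsv->reserved - rsv->size;

		rsv->reserved = rsv->size;
		return excess;
	}
	return 0;
}
```

The point of the patch is that this running total is exact, so the enospc code can be asked to refill precisely `size - reserved` instead of guessing from the global reserve.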
[PATCH] btrfs-progs: calibrate extent_end when found a gap
The extent_end will be used to check whether there is a gap between this extent and the next one. If it is not calibrated, check_file_extent will wrongly conclude that there are gaps between the remaining extents.

Signed-off-by: Lu Fengqi
---
 check/mode-lowmem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 1bce44f5658a..0f14a4968e84 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -1972,6 +1972,7 @@ static int check_file_extent(struct btrfs_root *root, struct btrfs_path *path,
 				root->objectid, fkey.objectid, fkey.offset,
 				fkey.objectid, *end);
 		}
+		*end = fkey.offset;
 	}
 
 	*end += extent_num_bytes;
-- 
2.18.0
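The effect of the missing calibration is easy to see in a reduced model of the gap check. The names below are illustrative; the real check_file_extent() walks EXTENT_DATA items in the fs tree.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Reduced model of the gap detection: `end` tracks where the previous
 * extent stopped, and an extent starting past `end` is a gap. Resetting
 * `end` to the extent's offset after reporting a gap (the line the
 * patch adds) keeps one real hole from being reported again for every
 * following extent.
 */
static int count_gaps(const unsigned long long *offsets,
		      const unsigned long long *lengths, size_t n)
{
	unsigned long long end = 0;
	int gaps = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (offsets[i] > end) {
			gaps++;
			end = offsets[i];  /* the calibration the patch adds */
		}
		end += lengths[i];
	}
	return gaps;
}
```

With the calibration, extents at offsets {0, 8192, 12288} with 4096-byte lengths report exactly one gap (4096-8192); drop the `end = offsets[i];` line and the stale `end` of 8192 would also flag the third, perfectly contiguous extent.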
[PATCH] btrfs-progs: Continue checking even we found something wrong in free space cache
There is no need to abort checking. Especially for a read-only check the free space cache is of little interest; the errors in the fs/extent trees are more interesting for developers. So continue checking even if something in the free space cache is wrong.

Reported-by: Etienne Champetier
Signed-off-by: Qu Wenruo
---
 check/main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/check/main.c b/check/main.c
index b361cd7e26a0..4f720163221e 100644
--- a/check/main.c
+++ b/check/main.c
@@ -9885,7 +9885,6 @@ int cmd_check(int argc, char **argv)
 			error("errors found in free space tree");
 		else
 			error("errors found in free space cache");
-		goto out;
 	}
 
 	/*
-- 
2.18.0
Re: [PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option
On 2018/8/23 3:45 PM, Qu Wenruo wrote:
>
> On 2018/8/23 3:36 PM, Nikolay Borisov wrote:
>>
>> On 23.08.2018 10:31, Qu Wenruo wrote:
>>> Introduce --breadth-first option to do breadth-first tree dump.
>>> This is especially handy to inspect high level trees, e.g. comparing
>>> tree reloc tree with its source tree.
>>
>> Would it make sense, instead of exposing another option, to just have a
>> heuristic check that switches to BFS if the tree is higher than,
>> say, 2 levels?
>
> BFS has one obvious disadvantage here, so it may not be a good idea to
> use it as the default.

Well, this is only true for my implementation. But there are other
solutions to do BFS without that heavy memory usage.

>
> >> More memory usage <<
> It needs to alloc heap memory, and this can be pretty large for
> leaves.
> At level 1, it will need to alloc nr_leaves * sizeof(bfs_entry)
> memory at least.
> Compared to DFS, it only needs to iterate at most 8 times, and all of
> its memory usage is function call stack memory.
>
> It only makes sense for my niche use case (compare tree reloc tree with
> its source).
> For real world use cases the default DFS should work fine without all
> the memory allocation burden.

Since we have btrfs_path to show where our parents are, it's possible to
use btrfs_path and avoid the current memory burden.

And in that case, your idea of using BFS by default for trees higher than
2 levels completely makes sense.

I'll update the patchset to use a new (while a little more complex) way
to implement BFS with less memory usage.

Thank you again for the idea,
Qu

> So I still prefer to keep DFS as default and only provide BFS as a
> niche option for weird guys like me.
>
> Thanks,
> Qu
>
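The memory trade-off debated above is visible in a minimal queue-based BFS. This is a plain C model of an n-ary tree, not btrfs code: the queue's high-water mark grows with the widest level (potentially every leaf), whereas a DFS walk keeps only one root-to-leaf path of at most `level` nodes.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal n-ary node standing in for a btrfs tree block. */
struct node {
	int level;
	size_t nr_children;
	struct node **children;
};

/*
 * Queue-based breadth-first walk. max_queued records the queue's
 * high-water mark -- the cost Qu is referring to. Returns the number
 * of nodes visited, or -1 on allocation failure.
 */
static long bfs_count(struct node *root, size_t *max_queued)
{
	size_t cap = 16, head = 0, tail = 0;
	struct node **queue = malloc(cap * sizeof(*queue));
	long visited = 0;

	if (!queue)
		return -1;
	queue[tail++] = root;
	*max_queued = 1;
	while (head < tail) {
		struct node *n = queue[head++];
		size_t i;

		visited++;
		for (i = 0; i < n->nr_children; i++) {
			if (tail == cap) {
				struct node **nq;

				cap *= 2;
				nq = realloc(queue, cap * sizeof(*queue));
				if (!nq) {
					free(queue);
					return -1;
				}
				queue = nq;
			}
			queue[tail++] = n->children[i];
		}
		if (tail - head > *max_queued)
			*max_queued = tail - head;
	}
	free(queue);
	return visited;
}
```

A btrfs_path-based variant, as proposed above, would instead re-walk down from the root for each sibling, trading a little CPU for O(height) memory.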
Re: RAID1 & BTRFS critical (device sda2): corrupt leaf, bad key order
On 2018/9/4 下午7:53, Etienne Champetier wrote: > Hi Qu, > > Le lun. 3 sept. 2018 à 20:27, Qu Wenruo a écrit : >> >> On 2018/9/3 下午10:18, Etienne Champetier wrote: >>> Hello btfrs hackers, >>> >>> I have a computer acting as backup server with BTRFS RAID1, and I >>> would like to know the different options to rebuild this RAID >>> (I saw this thread >>> https://www.spinics.net/lists/linux-btrfs/msg68679.html but there was >>> no raid 1) >>> >>> # uname -a >>> Linux servmaison 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 >>> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux >>> >>> # btrfs --version >>> btrfs-progs v4.4 >>> >>> # dmesg >>> ... >>> [ 1955.581972] BTRFS critical (device sda2): corrupt leaf, bad key >>> order: block=6020235362304,root=1, slot=63 >>> [ 1955.582299] BTRFS critical (device sda2): corrupt leaf, bad key >>> order: block=6020235362304,root=1, slot=63 > > Now running a Fedora 28 install kernel > > # uname -a > Linux servmaison 4.16.3-301.fc28.x86_64 #1 SMP Mon Apr 23 21:59:58 UTC > 2018 x86_64 x86_64 x86_64 GNU/Linux > # btrfs --version > btrfs-progs v4.15.1 Unfortunately, even for latest btrfs-progs release (v4.17.1, and even devel branch), btrfs check will abort checking if free space cache is corrupted. So we didn't get any useful info from btrfs check. Such diff would help you continue checking (if you really want, other than starting salvaging your data) -- diff --git a/check/main.c b/check/main.c index b361cd7e26a0..4f720163221e 100644 --- a/check/main.c +++ b/check/main.c @@ -9885,7 +9885,6 @@ int cmd_check(int argc, char **argv) error("errors found in free space tree"); else error("errors found in free space cache"); - goto out; } /* -- For dump tree block, the corrupted tree block belongs to extent tree. Which could be a good news (depends on how you define GOOD news). The corruption is not an easy fix, it's not just a swapped slot. 
The corrupted slot (item 64, whole key objectid is 5946810351616) is way beyond the extent data range, thus btrfs-progs can't fix it easily. Considering how much bytenr difference there is and the generation gap (53167 vs current generation 1555950), the bug happens a long long time ago (days or weeks before 2016-06-04). So it's a little too late to be fixed (unless someone could send me a time machine). On the other hand, this means any WRITE would easily fail due to corrupted extent tree, but your fs should be OK if mounted RO, thus you could copy your data out. > >> >> Please provide the following dump: >> >> # btrfs inspect dump-tree -t root /dev/sda2 >> # btrfs inspect dump-tree -b 6020235362304 /dev/sda2 > > All requested dump are in this repo: > https://github.com/champtar/debugraidbtrfs > [snip] >> >> If it's the only problem, "btrfs check --repair" indeed could fix it. > > Also available in https://github.com/champtar/debugraidbtrfs, here > "btrfs check --readonly /dev/sda2" output > > checking extents > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad key ordering 63 64 > bad block 6020235362304 > ERROR: errors found in extent allocation tree or chunk allocation > checking free space cache > there is no free space entry for 6011561750528-5942842273792 > there is no free space entry for 6011561750528-6012044050432 > cache appears valid but isn't 6010970308608 > there is no free space entry for 6015529828352-5946810351616 > there is no free space entry for 6015529828352-6016339017728 > cache appears valid but isn't 6015265275904 > there is no free space entry for 
6139476623360-6070757146624 > there is no free space entry for 6139476623360-6139852881920 > cache appears valid but isn't 6138779140096 > ERROR: errors found in free space cache > Checking filesystem on /dev/sda2 > UUID: 4917db5e-fc20-4369-9556-83082a32d4cd > found 1321120776195 bytes used, error(s) found > total csum bytes: 0 > total tree bytes: 1163182080 > total fs tree bytes: 0 > total extent tree bytes: 1161740288 > btree space waste bytes: 290512355 > file data blocks allocated: 618135552 > referenced 618135552 > As expected, btrfs-progs is unable to fix it. > > Thanks > Etienne > > P.S: sorry for the initial duplicate email, it took a very long time > to show up in https://www.spinics.net/lists/linux-btrfs/maillist.html, > thought it was discarded as I was not subscribed to the list It's pretty common, I even sometimes sent patches twice for the same reason. And just another kindly note, for "btrfs check" or "btrfs inspect
Re: fsck lowmem mode only: ERROR: errors found in fs roots
On Tue, 2018-09-04 at 17:14 +0800, Qu Wenruo wrote:
> However the backtrace can't tell which process caused such fsync
> call.
> (Maybe LVM user space code?)

Well, it was literally just before btrfs-check exited... so I blindly guessed... but arguably it could be just some coincidence.

LVM tools are installed, but since I no longer use any PVs/LVs/etc., I'd doubt they'd do anything here.

Cheers,
Chris.
Re: fsck lowmem mode only: ERROR: errors found in fs roots
On 2018/9/4 上午4:24, Christoph Anton Mitterer wrote: > Hey. > > > On Fri, 2018-08-31 at 10:33 +0800, Su Yue wrote: >> Can you please fetch btrfs-progs from my repo and run lowmem check >> in readonly? >> Repo: https://github.com/Damenly/btrfs-progs/tree/lowmem_debug >> It's based on v4.17.1 plus additonal output for debug only. > > I've adapted your patch to 4.17 from Debian (i.e. not the 4.17.1). > > > First I ran it again with the pristine 4.17 from Debian: > # btrfs check --mode=lowmem /dev/mapper/system ; echo $? > Checking filesystem on /dev/mapper/system > UUID: 6050ca10-e778-4d08-80e7-6d27b9c89b3c > checking extents > checking free space cache > checking fs roots > ERROR: errors found in fs roots > found 435924422656 bytes used, error(s) found > total csum bytes: 423418948 > total tree bytes: 2218328064 > total fs tree bytes: 1557168128 > total extent tree bytes: 125894656 > btree space waste bytes: 429599230 > file data blocks allocated: 5193373646848 > referenced 555255164928 > [ 1248.687628] [ cut here ] > [ 1248.688352] generic_make_request: Trying to write to read-only > block-device dm-0 (partno 0) > [ 1248.689127] WARNING: CPU: 3 PID: 933 at > /build/linux-LgHyGB/linux-4.17.17/block/blk-core.c:2180 > generic_make_request_checks+0x43d/0x610 > [ 1248.689909] Modules linked in: dm_crypt algif_skcipher af_alg dm_mod > snd_hda_codec_hdmi snd_hda_codec_realtek intel_rapl snd_hda_codec_generic > x86_pkg_temp_thermal intel_powerclamp i915 iwlwifi btusb coretemp btrtl btbcm > uvcvideo kvm_intel snd_hda_intel btintel videobuf2_vmalloc bluetooth > snd_hda_codec kvm videobuf2_memops videobuf2_v4l2 videobuf2_common cfg80211 > snd_hda_core irqbypass videodev jitterentropy_rng drm_kms_helper > crct10dif_pclmul snd_hwdep crc32_pclmul drbg ghash_clmulni_intel intel_cstate > snd_pcm ansi_cprng ppdev intel_uncore drm media ecdh_generic iTCO_wdt > snd_timer iTCO_vendor_support rtsx_pci_ms crc16 snd intel_rapl_perf memstick > joydev mei_me rfkill evdev soundcore sg 
parport_pc pcspkr serio_raw > fujitsu_laptop mei i2c_algo_bit parport shpchp sparse_keymap pcc_cpufreq > lpc_ich button > [ 1248.693639] video battery ac ip_tables x_tables autofs4 btrfs > zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov > async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c > crc32c_generic raid1 raid0 multipath linear md_mod sd_mod uas usb_storage > crc32c_intel rtsx_pci_sdmmc mmc_core ahci xhci_pci libahci aesni_intel > ehci_pci aes_x86_64 libata crypto_simd xhci_hcd ehci_hcd cryptd glue_helper > psmouse i2c_i801 scsi_mod rtsx_pci e1000e usbcore usb_common > [ 1248.696956] CPU: 3 PID: 933 Comm: btrfs Not tainted 4.17.0-3-amd64 #1 > Debian 4.17.17-1 > [ 1248.698118] Hardware name: FUJITSU LIFEBOOK E782/FJNB253, BIOS Version > 2.11 07/15/2014 > [ 1248.699299] RIP: 0010:generic_make_request_checks+0x43d/0x610 > [ 1248.700495] RSP: 0018:ac89827c7d88 EFLAGS: 00010286 > [ 1248.701702] RAX: RBX: 98f4848a9200 RCX: > 0006 > [ 1248.702930] RDX: 0007 RSI: 0082 RDI: > 98f49e2d6730 > [ 1248.704170] RBP: 98f484f6d460 R08: 033e R09: > 00aa > [ 1248.705422] R10: ac89827c7e60 R11: R12: > > [ 1248.706675] R13: 0001 R14: R15: > > [ 1248.707928] FS: 7f92842018c0() GS:98f49e2c() > knlGS: > [ 1248.709190] CS: 0010 DS: ES: CR0: 80050033 > [ 1248.710448] CR2: 55fc6fe1a5b0 CR3: 000407f62001 CR4: > 001606e0 > [ 1248.711707] Call Trace: > [ 1248.712960] ? do_writepages+0x4b/0xe0 > [ 1248.714201] ? blkdev_readpages+0x20/0x20 > [ 1248.715441] ? do_writepages+0x4b/0xe0 > [ 1248.716684] generic_make_request+0x64/0x400 > [ 1248.717935] ? finish_wait+0x80/0x80 > [ 1248.719181] ? mempool_alloc+0x67/0x1a0 > [ 1248.720425] ? 
submit_bio+0x6c/0x140 > [ 1248.721663] submit_bio+0x6c/0x140 > [ 1248.722902] submit_bio_wait+0x53/0x80 > [ 1248.724139] blkdev_issue_flush+0x7c/0xb0 > [ 1248.725377] blkdev_fsync+0x2f/0x40 > [ 1248.726612] do_fsync+0x38/0x60 > [ 1248.727849] __x64_sys_fsync+0x10/0x20 > [ 1248.729086] do_syscall_64+0x55/0x110 > [ 1248.730323] entry_SYSCALL_64_after_hwframe+0x44/0xa9 I don't really think it's "btrfs check" causing the problem. Btrfs check, just like all offline tools, uses open_ctree_flags to determine if it's allowed to do write. Without OPEN_CTREE_FLAGS_WRITE, all devices are opened RO, thus any write will just return error without reaching disk. Not to mention such fsync syscall. However the backtrace can't tell which process caused such fsync call. (Maybe LVM user space code?) Thanks, Qu > [ 1248.731565] RIP: 0033:0x7f928354d161 > [ 1248.732805] RSP: 002b:7ffd35e3f5d8 EFLAGS: 0246 ORIG_RAX: > 004a > [ 1248.734067] RAX: ffda RBX: 55fc09c0c260 RCX: >
[PATCH v10.5 2/5] btrfs-progs: dedupe: Add enable command for dedupe command group
From: Qu Wenruo Add enable subcommand for the dedupe command group. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- Documentation/btrfs-dedupe-inband.asciidoc | 114 +- btrfs-completion | 6 +- cmds-dedupe-ib.c | 238 + ioctl.h| 2 + 4 files changed, 358 insertions(+), 2 deletions(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index 83113f5487e2..d895aafbcf45 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -22,7 +22,119 @@ use with caution. SUBCOMMAND -- -Nothing yet +*enable* [options] :: +Enable in-band de-duplication for a filesystem. ++ +`Options` ++ +-f|--force +Force the 'enable' command to be executed. +Will skip the memory limit check and allow 'enable' to be executed even if in-band +de-duplication is already enabled. ++ +NOTE: If re-enabling dedupe with the '-f' option, any unspecified parameter will be +reset to its default value. + +-s|--storage-backend +Specify the de-duplication hash storage backend. +Only the 'inmemory' backend is supported yet. +If not specified, the default value is 'inmemory'. ++ +Refer to the *BACKENDS* section for more information. + +-b|--blocksize +Specify the dedupe block size. +Supported values are powers of 2 from '16K' to '8M'. +The default value is '128K'. ++ +Refer to the *DEDUPE BLOCK SIZE* section for more information. + +-a|--hash-algorithm +Specify the hash algorithm. +Only 'sha256' is supported yet. + +-l|--limit-hash +Specify the maximum number of hashes stored in memory. +Only works for the 'inmemory' backend. +Conflicts with the '-m' option. ++ +Only positive values are valid. +The default value is '32K'. + +-m|--limit-memory +Specify the maximum memory used for hashes. +Only works for the 'inmemory' backend. +Conflicts with the '-l' option. ++ +Only values larger than or equal to '1024' are valid. +No default value. ++ +NOTE: The memory limit will be rounded down to the kernel internal hash size, +so the memory limit shown in 'btrfs dedupe-inband status' may be different +from the . 
+ +WARNING: Too large a value for '-l' or '-m' will easily trigger OOM. +Please use with caution according to system memory. + +NOTE: In-band de-duplication is not compatible with compression yet. +And compression has higher priority than in-band de-duplication, meaning that if +compression and de-duplication are enabled at the same time, only compression +will work. + +BACKENDS + +Btrfs in-band de-duplication will support different storage backends, with +different use cases and features. + +In-memory backend:: +This backend provides backward-compatibility, and more fine-tuning options. +But the hash pool is non-persistent and may exhaust kernel memory if not set up +properly. ++ +This backend can be used on old btrfs (without the '-O dedupe' mkfs option). +When used on old btrfs, this backend needs to be enabled manually after mount. ++ +Designed for fast hash search speed, the in-memory backend will keep all dedupe +hashes in memory. (Although overall performance is still much the same as the +'ondisk' backend if all 'ondisk' hashes can be cached in memory) ++ +And it only keeps a limited number of hashes in memory to avoid exhausting memory. +Hashes over the limit will be dropped in least-recently-used order. +So this backend has a consistent overhead for a given limit but can\'t ensure +all duplicated blocks will be de-duplicated. ++ +After umount and mount, the in-memory backend needs to refill its hash pool. + +On-disk backend:: +This backend provides a persistent hash pool, with smarter memory management +for the hash pool. +But it\'s not backward-compatible, meaning it must be used with the '-O dedupe' mkfs +option and older kernels can\'t mount it read-write. ++ +Designed for de-duplication rate, the hash pool is stored as a btrfs B+ tree on disk. +This behavior may cause extra disk IO for hash search under high memory +pressure. ++ +After umount and mount, the on-disk backend still has its hashes on disk, no need to +refill its dedupe hash pool. 
+ +Currently, only 'inmemory' backend is supported in btrfs-progs. + +DEDUPE BLOCK SIZE + +In-band de-duplication is done at dedupe block size. +Any data smaller than dedupe block size won\'t go through in-band +de-duplication. + +And dedupe block size affects dedupe rate and fragmentation heavily. + +Smaller block size will cause more fragments, but higher dedupe rate. + +Larger block size will cause less fragments, but lower dedupe rate. + +In-band de-duplication rate is highly related to the workload pattern. +So it\'s highly recommended to align dedupe block size to the workload +block size to make full use of de-duplication. EXIT STATUS --- diff --git a/btrfs-completion b/btrfs-completion index ae683f4ecf61..cfdf70966e47 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -29,7 +29,7 @@ _btrfs() local cmd=${words[1]} -
[PATCH v10.5 5/5] btrfs-progs: dedupe: introduce reconfigure subcommand
From: Qu Wenruo Introduce the reconfigure subcommand to cooperate with the new kernel ioctl modification. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- Documentation/btrfs-dedupe-inband.asciidoc | 7 +++ btrfs-completion | 2 +- cmds-dedupe-ib.c | 73 +- 3 files changed, 66 insertions(+), 16 deletions(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index 6096389cb0b4..78c806f772d6 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -86,6 +86,13 @@ And compression has higher priority than in-band de-duplication, meaning that if compression and de-duplication are enabled at the same time, only compression will work. +*reconfigure* [options] :: +Re-configure in-band de-duplication parameters of a filesystem. ++ +In-band de-duplication must be enabled first before re-configuration. ++ +[Options] are the same as 'btrfs dedupe-inband enable'. + *status* :: Show current in-band de-duplication status of a filesystem. 
diff --git a/btrfs-completion b/btrfs-completion index 62a7bdd4d0d5..6ff48e4c2f6a 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -41,7 +41,7 @@ _btrfs() commands_quota='enable disable rescan' commands_qgroup='assign remove create destroy show limit' commands_replace='start status cancel' - commands_dedupe_inband='enable disable status' + commands_dedupe_inband='enable disable status reconfigure' if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then COMPREPLY=( $( compgen -W '--help' -- "$cur" ) ) diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index e778457e25a8..e52f939c9ced 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -56,7 +56,6 @@ static const char * const cmd_dedupe_ib_enable_usage[] = { NULL }; - #define report_fatal_parameter(dargs, old, member, type, err_val, fmt) \ ({ \ if (dargs->member != old->member && \ @@ -88,6 +87,12 @@ static void report_parameter_error(struct btrfs_ioctl_dedupe_args *dargs, } report_option_parameter(dargs, old, flags, u8, -1, x); } + + if (dargs->status == 0 && old->cmd == BTRFS_DEDUPE_CTL_RECONF) { + error("must enable dedupe before reconfiguration"); + return; + } + if (report_fatal_parameter(dargs, old, cmd, u16, -1, u) || report_fatal_parameter(dargs, old, blocksize, u64, -1, llu) || report_fatal_parameter(dargs, old, backend, u16, -1, u) || @@ -100,14 +105,17 @@ static void report_parameter_error(struct btrfs_ioctl_dedupe_args *dargs, old->limit_nr, old->limit_mem); } -static int cmd_dedupe_ib_enable(int argc, char **argv) +static int enable_reconfig_dedupe(int argc, char **argv, int reconf) { int ret; int fd = -1; char *path; u64 blocksize = BTRFS_DEDUPE_BLOCKSIZE_DEFAULT; + int blocksize_set = 0; u16 hash_algo = BTRFS_DEDUPE_HASH_SHA256; + int hash_algo_set = 0; u16 backend = BTRFS_DEDUPE_BACKEND_INMEMORY; + int backend_set = 0; u64 limit_nr = 0; u64 limit_mem = 0; u64 sys_mem = 0; @@ -134,15 +142,17 @@ static int cmd_dedupe_ib_enable(int argc, char **argv) break; switch (c) { case 's': - if 
(!strcasecmp("inmemory", optarg)) + if (!strcasecmp("inmemory", optarg)) { backend = BTRFS_DEDUPE_BACKEND_INMEMORY; - else { + backend_set = 1; + } else { error("unsupported dedupe backend: %s", optarg); exit(1); } break; case 'b': blocksize = parse_size(optarg); + blocksize_set = 1; break; case 'a': if (strcmp("sha256", optarg)) { @@ -224,26 +234,40 @@ static int cmd_dedupe_ib_enable(int argc, char **argv) return 1; } memset(&dargs, -1, sizeof(dargs)); - dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE; - dargs.blocksize = blocksize; - dargs.hash_algo = hash_algo; - dargs.limit_nr = limit_nr; - dargs.limit_mem = limit_mem; - dargs.backend = backend; - if (force) - dargs.flags |= BTRFS_DEDUPE_FLAG_FORCE; - else - dargs.flags = 0; + if (reconf) { + dargs.cmd = BTRFS_DEDUPE_CTL_RECONF; + if (blocksize_set) + dargs.blocksize = blocksize; + if (hash_algo_set) +
[PATCH v10.5 3/5] btrfs-progs: dedupe: Add disable support for inband deduplication
From: Qu Wenruo Add disable subcommand for the dedupe command group. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- Documentation/btrfs-dedupe-inband.asciidoc | 5 +++ btrfs-completion | 2 +- cmds-dedupe-ib.c | 41 ++ 3 files changed, 47 insertions(+), 1 deletion(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index d895aafbcf45..3452f690e3e5 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -22,6 +22,11 @@ use with caution. SUBCOMMAND -- +*disable* :: +Disable in-band de-duplication for a filesystem. ++ +This will trash all stored dedupe hashes. ++ *enable* [options] :: +Enable in-band de-duplication for a filesystem. + diff --git a/btrfs-completion b/btrfs-completion index cfdf70966e47..a74a23f42022 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -41,7 +41,7 @@ _btrfs() commands_quota='enable disable rescan' commands_qgroup='assign remove create destroy show limit' commands_replace='start status cancel' - commands_dedupe_inband='enable' + commands_dedupe_inband='enable disable' if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then COMPREPLY=( $( compgen -W '--help' -- "$cur" ) ) diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index 4d499677d9ae..91b6fe234043 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -259,10 +259,51 @@ out: return ret; } +static const char * const cmd_dedupe_ib_disable_usage[] = { + "btrfs dedupe-inband disable ", + "Disable in-band(write time) de-duplication of a btrfs.", + NULL +}; + +static int cmd_dedupe_ib_disable(int argc, char **argv) +{ + struct btrfs_ioctl_dedupe_args dargs; + DIR *dirstream; + char *path; + int fd; + int ret; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedupe_ib_disable_usage); + + path = argv[1]; + fd = open_file_or_dir(path, &dirstream); + if (fd < 0) { + error("failed to open file or directory: %s", path); + return 1; + } + memset(&dargs, 0, sizeof(dargs)); + dargs.cmd = 
BTRFS_DEDUPE_CTL_DISABLE; + + ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs); + if (ret < 0) { + error("failed to disable inband deduplication: %m"); + ret = 1; + goto out; + } + ret = 0; + +out: + close_file_or_dir(fd, dirstream); + return ret; +} + const struct cmd_group dedupe_ib_cmd_group = { dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, { { "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage, NULL, 0}, + { "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage, + NULL, 0}, NULL_CMD_STRUCT } }; -- 2.18.0
[PATCH v10.5 4/5] btrfs-progs: dedupe: Add status subcommand
From: Qu Wenruo Add status subcommand for dedupe command group. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- Documentation/btrfs-dedupe-inband.asciidoc | 3 + btrfs-completion | 2 +- cmds-dedupe-ib.c | 80 ++ 3 files changed, 84 insertions(+), 1 deletion(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index 3452f690e3e5..6096389cb0b4 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -86,6 +86,9 @@ And compression has higher priority than in-band de-duplication, meaning that if compression and de-duplication are enabled at the same time, only compression will work. +*status* :: +Show current in-band de-duplication status of a filesystem. + BACKENDS Btrfs in-band de-duplication will support different storage backends, with diff --git a/btrfs-completion b/btrfs-completion index a74a23f42022..62a7bdd4d0d5 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -41,7 +41,7 @@ _btrfs() commands_quota='enable disable rescan' commands_qgroup='assign remove create destroy show limit' commands_replace='start status cancel' - commands_dedupe_inband='enable disable' + commands_dedupe_inband='enable disable status' if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then COMPREPLY=( $( compgen -W '--help' -- "$cur" ) ) diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index 91b6fe234043..e778457e25a8 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -298,12 +298,92 @@ out: return 0; } +static const char * const cmd_dedupe_ib_status_usage[] = { + "btrfs dedupe-inband status ", + "Show current in-band(write time) de-duplication status of a btrfs.", + NULL +}; + +static int cmd_dedupe_ib_status(int argc, char **argv) +{ + struct btrfs_ioctl_dedupe_args dargs; + DIR *dirstream; + char *path; + int fd; + int ret; + int print_limit = 1; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedupe_ib_status_usage); + + path = argv[1]; + fd = open_file_or_dir(path, &dirstream); + 
if (fd < 0) { + error("failed to open file or directory: %s", path); + ret = 1; + goto out; + } + memset(&dargs, 0, sizeof(dargs)); + dargs.cmd = BTRFS_DEDUPE_CTL_STATUS; + + ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs); + if (ret < 0) { + error("failed to get inband deduplication status: %m"); + ret = 1; + goto out; + } + ret = 0; + if (dargs.status == 0) { + printf("Status: \t\t\tDisabled\n"); + goto out; + } + printf("Status:\t\t\tEnabled\n"); + + if (dargs.hash_algo == BTRFS_DEDUPE_HASH_SHA256) + printf("Hash algorithm:\t\tSHA-256\n"); + else + printf("Hash algorithm:\t\tUnrecognized(%x)\n", + dargs.hash_algo); + + if (dargs.backend == BTRFS_DEDUPE_BACKEND_INMEMORY) { + printf("Backend:\t\tIn-memory\n"); + print_limit = 1; + } else { + printf("Backend:\t\tUnrecognized(%x)\n", + dargs.backend); + } + + printf("Dedup Blocksize:\t%llu\n", dargs.blocksize); + + if (print_limit) { + u64 cur_mem; + + /* Limit nr may be 0 */ + if (dargs.limit_nr) + cur_mem = dargs.current_nr * (dargs.limit_mem / + dargs.limit_nr); + else + cur_mem = 0; + + printf("Number of hash: \t[%llu/%llu]\n", dargs.current_nr, + dargs.limit_nr); + printf("Memory usage: \t\t[%s/%s]\n", + pretty_size(cur_mem), + pretty_size(dargs.limit_mem)); + } +out: + close_file_or_dir(fd, dirstream); + return ret; +} + const struct cmd_group dedupe_ib_cmd_group = { dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, { { "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage, NULL, 0}, { "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage, NULL, 0}, + { "status", cmd_dedupe_ib_status, cmd_dedupe_ib_status_usage, + NULL, 0}, NULL_CMD_STRUCT } }; -- 2.18.0
[PATCH v10.5 0/5] In-band de-duplication for btrfs-progs
Patchset can be fetched from github: https://github.com/littleroad/btrfs-progs.git dedupe_latest Inband dedupe (in-memory backend only) ioctl support for btrfs-progs. v7 changes: Update ctree.h to follow kernel structure change Update print-tree to follow kernel structure change V8 changes: Move dedup props and on-disk backend support out of the patchset Change command group name to "dedupe-inband", to avoid confusion with possible out-of-band dedupe. Suggested by Mark. Rebase to latest devel branch. V9 changes: Follow the kernel's ioctl change to support the FORCE flag, the new reconf ioctl, and more precise error reporting. v10 changes: Rebase to v4.10. Add BUILD_ASSERT for btrfs_ioctl_dedupe_args v10.1 changes: Rebase to v4.14. v10.2 changes: Rebase to v4.16.1. v10.3 changes: Rebase to v4.17. v10.4 changes: Deal with offline reviews from Misono Tomohiro. 1. s/btrfs-dedupe/btrfs-dedupe-inband 2. Replace strerror(errno) with %m 3. Use SZ_* instead of intermediate numbers 4. update btrfs-completion for reconfigure subcommand v10.5 changes: Rebase to v4.17.1. Qu Wenruo (5): btrfs-progs: Basic framework for dedupe-inband command group btrfs-progs: dedupe: Add enable command for dedupe command group btrfs-progs: dedupe: Add disable support for inband deduplication btrfs-progs: dedupe: Add status subcommand btrfs-progs: dedupe: introduce reconfigure subcommand Documentation/Makefile.in | 1 + Documentation/btrfs-dedupe-inband.asciidoc | 167 Documentation/btrfs.asciidoc | 4 + Makefile | 3 +- btrfs-completion | 6 +- btrfs.c| 2 + cmds-dedupe-ib.c | 437 + commands.h | 2 + dedupe-ib.h| 28 ++ ioctl.h| 38 ++ 10 files changed, 686 insertions(+), 2 deletions(-) create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc create mode 100644 cmds-dedupe-ib.c create mode 100644 dedupe-ib.h -- 2.18.0
[PATCH v10.5 1/5] btrfs-progs: Basic framework for dedupe-inband command group
From: Qu Wenruo Add basic ioctl header and command group framework for later use. Along with a basic man page doc. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- Documentation/Makefile.in | 1 + Documentation/btrfs-dedupe-inband.asciidoc | 40 ++ Documentation/btrfs.asciidoc | 4 +++ Makefile | 3 +- btrfs.c| 2 ++ cmds-dedupe-ib.c | 35 +++ commands.h | 2 ++ dedupe-ib.h| 28 +++ ioctl.h| 36 +++ 9 files changed, 150 insertions(+), 1 deletion(-) create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc create mode 100644 cmds-dedupe-ib.c create mode 100644 dedupe-ib.h diff --git a/Documentation/Makefile.in b/Documentation/Makefile.in index 184647c41940..402155fae001 100644 --- a/Documentation/Makefile.in +++ b/Documentation/Makefile.in @@ -28,6 +28,7 @@ MAN8_TXT += btrfs-qgroup.asciidoc MAN8_TXT += btrfs-replace.asciidoc MAN8_TXT += btrfs-restore.asciidoc MAN8_TXT += btrfs-property.asciidoc +MAN8_TXT += btrfs-dedupe-inband.asciidoc # Category 5 manual page MAN5_TXT += btrfs-man5.asciidoc diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc new file mode 100644 index ..83113f5487e2 --- /dev/null +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -0,0 +1,40 @@ +btrfs-dedupe-inband(8) +== + +NAME + +btrfs-dedupe-inband - manage in-band (write time) de-duplication of a btrfs +filesystem + +SYNOPSIS + +*btrfs dedupe-inband* + +DESCRIPTION +--- +*btrfs dedupe-inband* is used to enable/disable or show the current in-band de-duplication +status of a btrfs filesystem. + +Kernel support for in-band de-duplication starts from 4.19. + +WARNING: In-band de-duplication is still an experimental feature of btrfs, +use with caution. + +SUBCOMMAND +-- +Nothing yet + +EXIT STATUS +--- +*btrfs dedupe-inband* returns a zero exit status if it succeeds. Non-zero is +returned in case of failure. + +AVAILABILITY + +*btrfs* is part of btrfs-progs. +Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for +further details. 
+ +SEE ALSO + +`mkfs.btrfs`(8), diff --git a/Documentation/btrfs.asciidoc b/Documentation/btrfs.asciidoc index 7316ac094413..1cf5bddec335 100644 --- a/Documentation/btrfs.asciidoc +++ b/Documentation/btrfs.asciidoc @@ -50,6 +50,10 @@ COMMANDS Do off-line check on a btrfs filesystem. + See `btrfs-check`(8) for details. +*dedupe-inband*:: + Control btrfs in-band(write time) de-duplication. + + See `btrfs-dedupe-inband`(8) for details. + *device*:: Manage devices managed by btrfs, including add/delete/scan and so on. + diff --git a/Makefile b/Makefile index fcfc815a2a5b..4052cecfae4d 100644 --- a/Makefile +++ b/Makefile @@ -123,7 +123,8 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \ cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \ cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o \ - mkfs/common.o check/mode-common.o check/mode-lowmem.o + mkfs/common.o check/mode-common.o check/mode-lowmem.o \ + cmds-dedupe-ib.o libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o \ kernel-lib/crc32c.o messages.o \ uuid-tree.o utils-lib.o rbtree-utils.o diff --git a/btrfs.c b/btrfs.c index 2d39f2ced3e8..2168f5a8bc7f 100644 --- a/btrfs.c +++ b/btrfs.c @@ -255,6 +255,8 @@ static const struct cmd_group btrfs_cmd_group = { { "quota", cmd_quota, NULL, &quota_cmd_group, 0 }, { "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 }, { "replace", cmd_replace, NULL, &replace_cmd_group, 0 }, + { "dedupe-inband", cmd_dedupe_ib, NULL, &dedupe_ib_cmd_group, + 0 }, { "help", cmd_help, cmd_help_usage, NULL, 0 }, { "version", cmd_version, cmd_version_usage, NULL, 0 }, NULL_CMD_STRUCT diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c new file mode 100644 index ..73c923a797da --- /dev/null +++ b/cmds-dedupe-ib.c @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2017 Fujitsu. All rights reserved. 
+ */ + +#include +#include +#include + +#include "ctree.h" +#include "ioctl.h" + +#include "commands.h" +#include "utils.h" +#include "kerncompat.h" +#include "dedupe-ib.h" + +static const char * const dedupe_ib_cmd_group_usage[] = { + "btrfs dedupe-inband [options] ", + NULL +}; + +static const char dedupe_ib_cmd_group_info[] =
[PATCH v15 09/13] btrfs: introduce type based delalloc metadata reserve
From: Wang Xiaoguang Introduce type based metadata reserve parameter for delalloc space reservation/freeing function. The problem we are going to solve is, btrfs use different max extent size for different mount options. For de-duplication, the max extent size can be set by the dedupe ioctl, while for normal write it's 128M. And furthermore, split/merge extent hook highly depends that max extent size. Such situation contributes to quite a lot of false ENOSPC. So this patch introduces the facility to help solve these false ENOSPC related to different max extent size. Currently, only normal 128M extent size is supported. More types will follow soon. Signed-off-by: Wang Xiaoguang Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- fs/btrfs/ctree.h | 43 ++--- fs/btrfs/extent-tree.c | 48 --- fs/btrfs/file.c | 30 + fs/btrfs/free-space-cache.c | 6 +- fs/btrfs/inode-map.c | 9 ++- fs/btrfs/inode.c | 115 +-- fs/btrfs/ioctl.c | 23 +++ fs/btrfs/ordered-data.c | 6 +- fs/btrfs/ordered-data.h | 3 +- fs/btrfs/relocation.c| 22 --- fs/btrfs/tests/inode-tests.c | 15 +++-- 11 files changed, 223 insertions(+), 97 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 741ef21a6185..4f0b6a12ecb1 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -98,11 +98,24 @@ static const int btrfs_csum_sizes[] = { 4 }; /* * Count how many BTRFS_MAX_EXTENT_SIZE cover the @size */ -static inline u32 count_max_extents(u64 size) +static inline u32 count_max_extents(u64 size, u64 max_extent_size) { - return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE); + return div_u64(size + max_extent_size - 1, max_extent_size); } +/* + * Type based metadata reserve type + * This affects how btrfs reserve metadata space for buffered write. 
+ * + * This is caused by the different max extent size for normal COW + * and further in-band dedupe + */ +enum btrfs_metadata_reserve_type { + BTRFS_RESERVE_NORMAL, +}; + +u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type); + struct btrfs_mapping_tree { struct extent_map_tree map_tree; }; @@ -2742,8 +2755,9 @@ int btrfs_check_data_free_space(struct inode *inode, void btrfs_free_reserved_data_space(struct inode *inode, struct extent_changeset *reserved, u64 start, u64 len); void btrfs_delalloc_release_space(struct inode *inode, - struct extent_changeset *reserved, - u64 start, u64 len, bool qgroup_free); + struct extent_changeset *reserved, + u64 start, u64 len, bool qgroup_free, + enum btrfs_metadata_reserve_type reserve_type); void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start, u64 len); void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans); @@ -2753,13 +2767,17 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root, void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *rsv); void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes, - bool qgroup_free); + bool qgroup_free, + enum btrfs_metadata_reserve_type reserve_type); -int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes); +int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, + enum btrfs_metadata_reserve_type reserve_type); void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes, -bool qgroup_free); + bool qgroup_free, + enum btrfs_metadata_reserve_type reserve_type); int btrfs_delalloc_reserve_space(struct inode *inode, - struct extent_changeset **reserved, u64 start, u64 len); + struct extent_changeset **reserved, u64 start, u64 len, + enum btrfs_metadata_reserve_type reserve_type); void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type); struct btrfs_block_rsv 
*btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info, unsigned short type); @@ -3165,7 +3183,11 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root); int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int nr); int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end, unsigned int extra_bits, - struct extent_state **cached_state, int dedupe); +
[PATCH v15 03/13] btrfs: dedupe: Introduce function to add hash into in-memory tree
From: Wang Xiaoguang Introduce static function inmem_add() to add hash into in-memory tree. And now we can implement the btrfs_dedupe_add() interface. Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Reviewed-by: Josef Bacik Signed-off-by: Lu Fengqi --- fs/btrfs/dedupe.c | 150 ++ 1 file changed, 150 insertions(+) diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c index 06523162753d..784bb3a8a5ab 100644 --- a/fs/btrfs/dedupe.c +++ b/fs/btrfs/dedupe.c @@ -19,6 +19,14 @@ struct inmem_hash { u8 hash[]; }; +static inline struct inmem_hash *inmem_alloc_hash(u16 algo) +{ + if (WARN_ON(algo >= ARRAY_SIZE(btrfs_hash_sizes))) + return NULL; + return kzalloc(sizeof(struct inmem_hash) + btrfs_hash_sizes[algo], + GFP_NOFS); +} + static struct btrfs_dedupe_info * init_dedupe_info(struct btrfs_ioctl_dedupe_args *dargs) { @@ -167,3 +175,145 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) /* Place holder for bisect, will be implemented in later patches */ return 0; } + +static int inmem_insert_hash(struct rb_root *root, +struct inmem_hash *hash, int hash_len) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct inmem_hash *entry = NULL; + + while (*p) { + parent = *p; + entry = rb_entry(parent, struct inmem_hash, hash_node); + if (memcmp(hash->hash, entry->hash, hash_len) < 0) + p = &(*p)->rb_left; + else if (memcmp(hash->hash, entry->hash, hash_len) > 0) + p = &(*p)->rb_right; + else + return 1; + } + rb_link_node(&hash->hash_node, parent, p); + rb_insert_color(&hash->hash_node, root); + return 0; +} + +static int inmem_insert_bytenr(struct rb_root *root, + struct inmem_hash *hash) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct inmem_hash *entry = NULL; + + while (*p) { + parent = *p; + entry = rb_entry(parent, struct inmem_hash, bytenr_node); + if (hash->bytenr < entry->bytenr) + p = &(*p)->rb_left; + else if (hash->bytenr > entry->bytenr) + p = &(*p)->rb_right; + else + return 1; + } + rb_link_node(&hash->bytenr_node, 
parent, p); + rb_insert_color(&hash->bytenr_node, root); + return 0; +} + +static void __inmem_del(struct btrfs_dedupe_info *dedupe_info, + struct inmem_hash *hash) +{ + list_del(&hash->lru_list); + rb_erase(&hash->hash_node, &dedupe_info->hash_root); + rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root); + + if (!WARN_ON(dedupe_info->current_nr == 0)) + dedupe_info->current_nr--; + + kfree(hash); +} + +/* + * Insert a hash into in-memory dedupe tree + * Will remove the least-recently-used hashes when exceeding the limit. + * + * If the hash matched with an existing one, we won't insert it, to + * save memory + */ +static int inmem_add(struct btrfs_dedupe_info *dedupe_info, +struct btrfs_dedupe_hash *hash) +{ + int ret = 0; + u16 algo = dedupe_info->hash_algo; + struct inmem_hash *ihash; + + ihash = inmem_alloc_hash(algo); + + if (!ihash) + return -ENOMEM; + + /* Copy the data out */ + ihash->bytenr = hash->bytenr; + ihash->num_bytes = hash->num_bytes; + memcpy(ihash->hash, hash->hash, btrfs_hash_sizes[algo]); + + mutex_lock(&dedupe_info->lock); + + ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash); + if (ret > 0) { + kfree(ihash); + ret = 0; + goto out; + } + + ret = inmem_insert_hash(&dedupe_info->hash_root, ihash, + btrfs_hash_sizes[algo]); + if (ret > 0) { + /* +* We only keep one hash in tree to save memory, so if +* hash conflicts, free the one to insert. +*/ + rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root); + kfree(ihash); + ret = 0; + goto out; + } + + list_add(&ihash->lru_list, &dedupe_info->lru_list); + dedupe_info->current_nr++; + + /* Remove the last dedupe hash if we exceed limit */ + while (dedupe_info->current_nr > dedupe_info->limit_nr) { + struct inmem_hash *last; + + last = list_entry(dedupe_info->lru_list.prev, + struct inmem_hash, lru_list); + __inmem_del(dedupe_info, last); + } +out: + mutex_unlock(&dedupe_info->lock); + return 0; +} + +int btrfs_dedupe_add(struct btrfs_fs_info *fs_info, +struct btrfs_dedupe_hash *hash) +{ + struct btrfs_dedupe_info *dedupe_info
[PATCH v15 00/13] Btrfs In-band De-duplication
This patchset can be fetched from github: https://github.com/littleroad/linux.git dedupe_latest Now the new base is v4.19-rc2, and drop the patch about compression which conflict with compression heuristic. Normal test cases from auto group exposes no regression, and ib-dedupe group can pass without problem. xfstests ib-dedupe group can be fetched from github: https://github.com/littleroad/xfstests-dev.git btrfs_dedupe_latest Changelog: v2: Totally reworked to handle multiple backends v3: Fix a stupid but deadly on-disk backend bug Add handle for multiple hash on same bytenr corner case to fix abort trans error Increase dedup rate by enhancing delayed ref handler for both backend. Move dedup_add() to run_delayed_ref() time, to fix abort trans error. Increase dedup block size up limit to 8M. v4: Add dedup prop for disabling dedup for given files/dirs. Merge inmem_search() and ondisk_search() into generic_search() to save some code Fix another delayed_ref related bug. Use the same mutex for both inmem and ondisk backend. Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup rate. v5: Reuse compress routine for much simpler dedup function. Slightly improved performance due to above modification. Fix race between dedup enable/disable Fix for false ENOSPC report v6: Further enable/disable race window fix. Minor format change according to checkpatch. v7: Fix one concurrency bug with balance. Slightly modify return value from -EINVAL to -EOPNOTSUPP for btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands and wrong parameter. Rebased to integration-4.6. v8: Rename 'dedup' to 'dedupe'. Add support to allow dedupe and compression work at the same time. Fix several balance related bugs. Special thanks to Satoru Takeuchi, who exposed most of them. Small dedupe hit case performance improvement. v9: Re-order the patchset to completely separate pure in-memory and any on-disk format change. Fold bug fixes into its original patch. 
v10: Adding back missing bug fix patch. Reduce on-disk item size. Hide dedupe ioctl under CONFIG_BTRFS_DEBUG. v11: Remove other backend and props support to focus on the framework and in-memory backend. Suggested by David. Better disable and buffered write race protection. Comprehensive fix to dedupe metadata ENOSPC problem. v12: Stateful 'enable' ioctl and new 'reconf' ioctl New FORCE flag for enable ioctl to allow stateless ioctl Precise error report and extendable ioctl structure. v12.1 Rebase to David's for-next-20160704 branch Add co-ordinate patch for subpage and dedupe patchset. v12.2 Rebase to David's for-next-20160715 branch Add co-ordinate patch for other patchset. v13 Rebase to David's for-next-20160906 branch Fix a reserved space leak bug, which only frees quota reserved space but not space_info->byte_may_use. v13.1 Rebase to Chris' for-linux-4.9 branch v14 Use generic ENOSPC fix for both compression and dedupe. v14.1 Further split ENOSPC fix. v14.2 Rebase to v4.11-rc2. Co-operate with count_max_extent() to calculate num_extents. No longer rely on qgroup fixes. v14.3 Rebase to v4.12-rc1. v14.4 Rebase to kdave/for-4.13-part1. v14.5 Rebase to v4.15-rc3. v14.6 Rebase to v4.17-rc5. v14.7 Replace SHASH_DESC_ON_STACK with kmalloc to remove VLA. Fixed the following errors by switching to div_u64. ├── arm-allmodconfig │ └── ERROR:__aeabi_uldivmod-fs-btrfs-btrfs.ko-undefined └── i386-allmodconfig └── ERROR:__udivdi3-fs-btrfs-btrfs.ko-undefined v14.8 Rebase to v4.18-rc4. v15 Rebase to v4.19-rc2. Drop "btrfs: Introduce COMPRESS reserve type to fix false enospc for compression". Remove the ifdef around btrfs inband dedupe ioctl. 
Qu Wenruo (4): btrfs: delayed-ref: Add support for increasing data ref under spinlock btrfs: dedupe: Inband in-memory only de-duplication implement btrfs: relocation: Enhance error handling to avoid BUG_ON btrfs: dedupe: Introduce new reconfigure ioctl Wang Xiaoguang (9): btrfs: dedupe: Introduce dedupe framework and its header btrfs: dedupe: Introduce function to initialize dedupe info btrfs: dedupe: Introduce function to add hash into in-memory tree btrfs: dedupe: Introduce function to remove hash from in-memory tree btrfs: dedupe: Introduce function to search for an existing hash btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface btrfs: ordered-extent: Add support for dedupe btrfs: introduce type based delalloc metadata reserve btrfs: dedupe: Add ioctl for inband deduplication fs/btrfs/Makefile| 2 +- fs/btrfs/ctree.h | 52 ++- fs/btrfs/dedupe.c| 828 +++ fs/btrfs/dedupe.h| 175 +++- fs/btrfs/delayed-ref.c | 53 ++- fs/btrfs/delayed-ref.h | 15 + fs/btrfs/disk-io.c | 4 + fs/btrfs/extent-tree.c | 67 ++- fs/btrfs/extent_io.c
[PATCH v15 13/13] btrfs: dedupe: Introduce new reconfigure ioctl
From: Qu Wenruo

Introduce a new reconfigure ioctl and a new FORCE flag for the in-band dedupe ioctls.

Now the dedupe enable and reconfigure ioctls are stateful.

| Current state | Ioctl   | Next state  |
| Disabled      | enable  | Enabled     |
| Enabled       | enable  | Not allowed |
| Enabled       | reconf  | Enabled     |
| Enabled       | disable | Disabled    |
| Disabled      | disable | Disabled    |
| Disabled      | reconf  | Not allowed |
(While disable is always stateless)

For those who prefer stateless ioctls (myself, for example), a new FORCE flag is introduced. In FORCE mode, enable/disable is completely stateless.

| Current state | Ioctl   | Next state |
| Disabled      | enable  | Enabled    |
| Enabled       | enable  | Enabled    |
| Enabled       | disable | Disabled   |
| Disabled      | disable | Disabled   |

Also, the reconfigure ioctl will only modify the specified fields, unlike enable, where unspecified fields are filled with default values.

For example:
 # btrfs dedupe enable --block-size 64k /mnt
 # btrfs dedupe reconfigure --limit-hash 1m /mnt
will lead to:
 dedupe blocksize: 64K
 dedupe hash limit nr: 1m

While for enable:
 # btrfs dedupe enable --force --block-size 64k /mnt
 # btrfs dedupe enable --force --limit-hash 1m /mnt
will reset the blocksize to its default value:
 dedupe blocksize: 128K << reset
 dedupe hash limit nr: 1m

Suggested-by: David Sterba
Signed-off-by: Qu Wenruo
Signed-off-by: Lu Fengqi
---
 fs/btrfs/dedupe.c          | 132 ++---
 fs/btrfs/dedupe.h          | 13
 fs/btrfs/ioctl.c           | 13
 include/uapi/linux/btrfs.h | 11 +++-
 4 files changed, 143 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a147e148bbb8..2be3e53acc6a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -29,6 +29,40 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
 		GFP_NOFS);
 }
 
+/*
+ * Copy from the current dedupe info to fill dargs.
+ * For the reconf case, only fill members which are uninitialized.
+ */
+static void get_dedupe_status(struct btrfs_dedupe_info *dedupe_info,
+			      struct btrfs_ioctl_dedupe_args *dargs)
+{
+	int reconf = (dargs->cmd == BTRFS_DEDUPE_CTL_RECONF);
+
+	dargs->status = 1;
+
+	if (!reconf || (reconf && dargs->blocksize == (u64)-1))
+		dargs->blocksize = dedupe_info->blocksize;
+	if (!reconf || (reconf && dargs->backend == (u16)-1))
+		dargs->backend = dedupe_info->backend;
+	if (!reconf || (reconf && dargs->hash_algo == (u16)-1))
+		dargs->hash_algo = dedupe_info->hash_algo;
+
+	/*
+	 * For the re-configure case, if the limits are not being modified,
+	 * their limit will be set to 0, unlike the other fields.
+	 */
+	if (!reconf || !(dargs->limit_nr || dargs->limit_mem)) {
+		dargs->limit_nr = dedupe_info->limit_nr;
+		dargs->limit_mem = dedupe_info->limit_nr *
+			(sizeof(struct inmem_hash) +
+			 btrfs_hash_sizes[dedupe_info->hash_algo]);
+	}
+
+	/* current_nr doesn't make sense for the reconfigure case */
+	if (!reconf)
+		dargs->current_nr = dedupe_info->current_nr;
+}
+
 void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
 			 struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -45,15 +79,7 @@ void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
 		return;
 	}
 	mutex_lock(&dedupe_info->lock);
-	dargs->status = 1;
-	dargs->blocksize = dedupe_info->blocksize;
-	dargs->backend = dedupe_info->backend;
-	dargs->hash_algo = dedupe_info->hash_algo;
-	dargs->limit_nr = dedupe_info->limit_nr;
-	dargs->limit_mem = dedupe_info->limit_nr *
-		(sizeof(struct inmem_hash) +
-		 btrfs_hash_sizes[dedupe_info->hash_algo]);
-	dargs->current_nr = dedupe_info->current_nr;
+	get_dedupe_status(dedupe_info, dargs);
 	mutex_unlock(&dedupe_info->lock);
 	memset(dargs->__unused, -1, sizeof(dargs->__unused));
 }
@@ -98,17 +124,50 @@ init_dedupe_info(struct btrfs_ioctl_dedupe_args *dargs)
 static int check_dedupe_parameter(struct btrfs_fs_info *fs_info,
 				  struct btrfs_ioctl_dedupe_args *dargs)
 {
-	u64 blocksize = dargs->blocksize;
-	u64 limit_nr = dargs->limit_nr;
-	u64 limit_mem = dargs->limit_mem;
-	u16 hash_algo = dargs->hash_algo;
-	u8
backend = dargs->backend; + struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info; + + u64 blocksize; + u64 limit_nr; + u64 limit_mem; +
[PATCH v15 12/13] btrfs: relocation: Enhance error handling to avoid BUG_ON
From: Qu Wenruo

Since the introduction of the btrfs dedupe tree, it's possible for balance to race with dedupe disabling. When this happens, dedupe_enabled will make btrfs_get_fs_root() return ERR_PTR(-ENOENT).

But due to a bug in the error handling branch, when this happens backref_cache->nr_nodes is increased, yet the node is neither added to the backref_cache nor is nr_nodes decreased, causing a BUG_ON() in backref_cache_cleanup():

[ 2611.668810] [ cut here ]
[ 2611.669946] kernel BUG at /home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode: [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034] [] btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706] [] btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385] [] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966] [] btrfs_ioctl_balance+0x391/0x3a0 [btrfs]
[ 2611.689587] [] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145] [] ? lru_cache_add+0x3a/0x80
[ 2611.690647] [] ? lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310] [] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842] [] ? cp_new_stat+0x153/0x180
[ 2611.692342] [] ? __vma_link_rb+0xfd/0x110
[ 2611.692842] [] ? vma_link+0xb9/0xc0
[ 2611.693303] [] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781] [] ? __do_page_fault+0x1b4/0x400
[ 2611.694310] [] SyS_ioctl+0x41/0x70
[ 2611.694758] [] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0 05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP [] relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818] RSP

This patch calls remove_backref_node() in the error handling branch, catches the returned -ENOENT in relocate_tree_blocks(), and continues balancing.
Reported-by: Satoru Takeuchi Signed-off-by: Qu Wenruo --- fs/btrfs/relocation.c | 22 +- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 59a9c22ebf51..5f4b138fcb35 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -845,6 +845,13 @@ struct backref_node *build_backref_tree(struct reloc_control *rc, root = read_fs_root(rc->extent_root->fs_info, key.offset); if (IS_ERR(root)) { err = PTR_ERR(root); + /* +* Don't forget to cleanup current node. +* As it may not be added to backref_cache but nr_node +* increased. +* This will cause BUG_ON() in backref_cache_cleanup(). +*/ + remove_backref_node(>backref_cache, cur); goto out; } @@ -3018,14 +3025,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans, } rb_node = rb_first(blocks); - while (rb_node) { + for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) { block = rb_entry(rb_node, struct tree_block, rb_node); node = build_backref_tree(rc, >key, block->level, block->bytenr); if (IS_ERR(node)) { + /* +* The root(dedupe tree yet) of the tree block is +* going to be freed and can't be reached. +* Just skip it and continue balancing. +*/ + if (PTR_ERR(node) == -ENOENT) + continue; err = PTR_ERR(node); - goto out; + break; } ret = relocate_tree_block(trans, rc, node, >key, @@ -3033,11 +3047,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans, if (ret < 0) { if (ret != -EAGAIN || rb_node == rb_first(blocks)) err = ret; - goto out; + break; } - rb_node = rb_next(rb_node); } -out: err = finish_pending_nodes(trans, rc, path, err); out_free_path: -- 2.18.0
[PATCH v15 07/13] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
From: Wang Xiaoguang Unlike in-memory or on-disk dedupe method, only SHA256 hash method is supported yet, so implement btrfs_dedupe_calc_hash() interface using SHA256. Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Reviewed-by: Josef Bacik Signed-off-by: Lu Fengqi --- fs/btrfs/dedupe.c | 50 +++ 1 file changed, 50 insertions(+) diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c index 9c6152b7f0eb..9b0a90dd8e42 100644 --- a/fs/btrfs/dedupe.c +++ b/fs/btrfs/dedupe.c @@ -644,3 +644,53 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info, } return ret; } + +int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info, + struct inode *inode, u64 start, + struct btrfs_dedupe_hash *hash) +{ + int i; + int ret; + struct page *p; + struct shash_desc *shash; + struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info; + struct crypto_shash *tfm = dedupe_info->dedupe_driver; + u64 dedupe_bs; + u64 sectorsize = fs_info->sectorsize; + + shash = kmalloc(sizeof(*shash) + crypto_shash_descsize(tfm), GFP_NOFS); + if (!shash) + return -ENOMEM; + + if (!fs_info->dedupe_enabled || !hash) + return 0; + + if (WARN_ON(dedupe_info == NULL)) + return -EINVAL; + + WARN_ON(!IS_ALIGNED(start, sectorsize)); + + dedupe_bs = dedupe_info->blocksize; + + shash->tfm = tfm; + shash->flags = 0; + ret = crypto_shash_init(shash); + if (ret) + return ret; + for (i = 0; sectorsize * i < dedupe_bs; i++) { + char *d; + + p = find_get_page(inode->i_mapping, + (start >> PAGE_SHIFT) + i); + if (WARN_ON(!p)) + return -ENOENT; + d = kmap(p); + ret = crypto_shash_update(shash, d, sectorsize); + kunmap(p); + put_page(p); + if (ret) + return ret; + } + ret = crypto_shash_final(shash, hash->hash); + return ret; +} -- 2.18.0
[PATCH v15 11/13] btrfs: dedupe: Add ioctl for inband deduplication
From: Wang Xiaoguang

Add an ioctl interface for inband deduplication, which includes:
 1) enable
 2) disable
 3) status

And a pseudo RO compat flag, to imply that btrfs now supports inband dedup. However we don't add any on-disk format change; it's just a pseudo RO compat flag.

All these ioctl interfaces are stateless, which means callers don't need to consider the previous dedupe state before calling them, and only need to care about the final desired state. For example, if the user wants to enable dedupe with a specified block size and limit, just fill the ioctl structure and call the enable ioctl. No need to check if dedupe is already running. These ioctls will handle things like re-configure or disable quite well.

Also, for invalid parameters, the enable ioctl interface will set the field of the first encountered invalid parameter to (-1) to inform the caller, while for limit_nr/limit_mem, the value will be (0).

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
Signed-off-by: Lu Fengqi
---
 fs/btrfs/dedupe.c          | 50 +
 fs/btrfs/dedupe.h          | 17 +++---
 fs/btrfs/disk-io.c         |  3 ++
 fs/btrfs/ioctl.c           | 65 ++
 fs/btrfs/sysfs.c           |  2 ++
 include/uapi/linux/btrfs.h | 12 ++-
 6 files changed, 143 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 9b0a90dd8e42..a147e148bbb8 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -29,6 +29,35 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
 		GFP_NOFS);
 }
 
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled || !dedupe_info) {
+		dargs->status = 0;
+		dargs->blocksize = 0;
+		dargs->backend = 0;
+		dargs->hash_algo = 0;
+		dargs->limit_nr = 0;
+		dargs->current_nr = 0;
+		memset(dargs->__unused, -1, sizeof(dargs->__unused));
+		return;
+	}
+	mutex_lock(&dedupe_info->lock);
+	dargs->status = 1;
+	dargs->blocksize = dedupe_info->blocksize;
+	dargs->backend = dedupe_info->backend;
+
dargs->hash_algo = dedupe_info->hash_algo; + dargs->limit_nr = dedupe_info->limit_nr; + dargs->limit_mem = dedupe_info->limit_nr * + (sizeof(struct inmem_hash) + +btrfs_hash_sizes[dedupe_info->hash_algo]); + dargs->current_nr = dedupe_info->current_nr; + mutex_unlock(_info->lock); + memset(dargs->__unused, -1, sizeof(dargs->__unused)); +} + static struct btrfs_dedupe_info * init_dedupe_info(struct btrfs_ioctl_dedupe_args *dargs) { @@ -402,6 +431,27 @@ static void unblock_all_writers(struct btrfs_fs_info *fs_info) percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1); } +int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info) +{ + struct btrfs_dedupe_info *dedupe_info; + + fs_info->dedupe_enabled = 0; + /* same as disable */ + smp_wmb(); + dedupe_info = fs_info->dedupe_info; + fs_info->dedupe_info = NULL; + + if (!dedupe_info) + return 0; + + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) + inmem_destroy(dedupe_info); + + crypto_free_shash(dedupe_info->dedupe_driver); + kfree(dedupe_info); + return 0; +} + int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) { struct btrfs_dedupe_info *dedupe_info; diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h index 8157b17c4d11..fdd00355d6b5 100644 --- a/fs/btrfs/dedupe.h +++ b/fs/btrfs/dedupe.h @@ -90,6 +90,15 @@ static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 algo) int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, struct btrfs_ioctl_dedupe_args *dargs); + +/* + * Get inband dedupe info + * Since it needs to access different backends' hash size, which + * is not exported, we need such simple function. + */ +void btrfs_dedupe_status(struct btrfs_fs_info *fs_info, +struct btrfs_ioctl_dedupe_args *dargs); + /* * Disable dedupe and invalidate all its dedupe data. * Called at dedupe disable time. @@ -101,12 +110,10 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info); /* - * Get current dedupe status. 
- * Return 0 for success - * No possible error yet + * Cleanup current btrfs_dedupe_info + * Called in umount time */ -void btrfs_dedupe_status(struct btrfs_fs_info *fs_info, -struct btrfs_ioctl_dedupe_args *dargs); +int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info); /* * Calculate hash for dedupe. diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 06683b3f2f0b..43a0324c749c 100644 ---
[PATCH v15 04/13] btrfs: dedupe: Introduce function to remove hash from in-memory tree
From: Wang Xiaoguang

Introduce the static function inmem_del() to remove a hash from the in-memory dedupe tree, and implement the btrfs_dedupe_del() and btrfs_dedupe_disable() interfaces.

Also for btrfs_dedupe_disable(), add new functions to wait for existing writers and block incoming writers, to eliminate all possible races.

Cc: Mark Fasheh
Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
Signed-off-by: Lu Fengqi
---
 fs/btrfs/dedupe.c | 131 +++---
 1 file changed, 125 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 784bb3a8a5ab..951fefd19fde 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -170,12 +170,6 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
-int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
-{
-	/* Place holder for bisect, will be implemented in later patches */
-	return 0;
-}
-
 static int inmem_insert_hash(struct rb_root *root,
 			     struct inmem_hash *hash, int hash_len)
 {
@@ -317,3 +311,128 @@ int btrfs_dedupe_add(struct btrfs_fs_info *fs_info,
 		return inmem_add(dedupe_info, hash);
 	return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+		if (bytenr < entry->bytenr)
+			p = &(*p)->rb_left;
+		else if (bytenr > entry->bytenr)
+			p = &(*p)->rb_right;
+		else
+			return entry;
+	}
+
+	return NULL;
+}
+
+/* Delete a hash from the in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct inmem_hash *hash;
+
+	mutex_lock(&dedupe_info->lock);
+	hash = inmem_search_bytenr(dedupe_info, bytenr);
+	if (!hash) {
+		mutex_unlock(&dedupe_info->lock);
+		return 0;
+	}
+
+	__inmem_del(dedupe_info, hash);
+	mutex_unlock(&dedupe_info->lock);
+	return 0;
+}
+
+/* Remove a dedupe hash from the dedupe tree */
+int btrfs_dedupe_del(struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		return inmem_del(dedupe_info, bytenr);
+	return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+	struct inmem_hash *entry, *tmp;
+
+	mutex_lock(&dedupe_info->lock);
+	list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
+		__inmem_del(dedupe_info, entry);
+	mutex_unlock(&dedupe_info->lock);
+}
+
+/*
+ * Helper function to wait for and block all incoming writers
+ *
+ * Use the rw_sem introduced for freeze to wait for/block writers.
+ * So during the blocked time, no new write will happen, so we can
+ * do something quite safe, especially helpful for dedupe disable,
+ * as it affects buffered writes.
+ */
+static void block_all_writers(struct btrfs_fs_info *fs_info)
+{
+	struct super_block *sb = fs_info->sb;
+
+	percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+	down_write(&sb->s_umount);
+}
+
+static void unblock_all_writers(struct btrfs_fs_info *fs_info)
+{
+	struct super_block *sb = fs_info->sb;
+
+	up_write(&sb->s_umount);
+	percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+	int ret;
+
+	dedupe_info = fs_info->dedupe_info;
+
+	if (!dedupe_info)
+		return 0;
+
+	/* Don't allow disable status change in RO mount */
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
+
+	/*
+	 * Wait for all unfinished writers and block further writers.
+	 * Then sync the whole fs so all current writes go through
+	 * dedupe, and all later writes won't go through dedupe.
+*/ + block_all_writers(fs_info); + ret = sync_filesystem(fs_info->sb); + fs_info->dedupe_enabled = 0; + fs_info->dedupe_info = NULL; + unblock_all_writers(fs_info); + if (ret < 0) + return ret; + + /* now we are OK to clean up everything */ + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) + inmem_destroy(dedupe_info); + + crypto_free_shash(dedupe_info->dedupe_driver); + kfree(dedupe_info); + return 0; +} -- 2.18.0
[PATCH v15 02/13] btrfs: dedupe: Introduce function to initialize dedupe info
From: Wang Xiaoguang Add generic function to initialize dedupe info. Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang Reviewed-by: Josef Bacik Signed-off-by: Lu Fengqi --- fs/btrfs/Makefile | 2 +- fs/btrfs/dedupe.c | 169 + fs/btrfs/dedupe.h | 12 +++ include/uapi/linux/btrfs.h | 3 + 4 files changed, 185 insertions(+), 1 deletion(-) create mode 100644 fs/btrfs/dedupe.c diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index ca693dd554e9..78fdc87dba39 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ - uuid-tree.o props.o free-space-tree.o tree-checker.o + uuid-tree.o props.o free-space-tree.o tree-checker.o dedupe.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c new file mode 100644 index ..06523162753d --- /dev/null +++ b/fs/btrfs/dedupe.c @@ -0,0 +1,169 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2016 Fujitsu. All rights reserved. 
+ */
+
+#include "ctree.h"
+#include "dedupe.h"
+#include "btrfs_inode.h"
+#include "delayed-ref.h"
+
+struct inmem_hash {
+	struct rb_node hash_node;
+	struct rb_node bytenr_node;
+	struct list_head lru_list;
+
+	u64 bytenr;
+	u32 num_bytes;
+
+	u8 hash[];
+};
+
+static struct btrfs_dedupe_info *
+init_dedupe_info(struct btrfs_ioctl_dedupe_args *dargs)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+
+	dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+	if (!dedupe_info)
+		return ERR_PTR(-ENOMEM);
+
+	dedupe_info->hash_algo = dargs->hash_algo;
+	dedupe_info->backend = dargs->backend;
+	dedupe_info->blocksize = dargs->blocksize;
+	dedupe_info->limit_nr = dargs->limit_nr;
+
+	/* only support SHA256 yet */
+	dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+	if (IS_ERR(dedupe_info->dedupe_driver)) {
+		kfree(dedupe_info);
+		return ERR_CAST(dedupe_info->dedupe_driver);
+	}
+
+	dedupe_info->hash_root = RB_ROOT;
+	dedupe_info->bytenr_root = RB_ROOT;
+	dedupe_info->current_nr = 0;
+	INIT_LIST_HEAD(&dedupe_info->lru_list);
+	mutex_init(&dedupe_info->lock);
+
+	return dedupe_info;
+}
+
+/*
+ * Helper to check if the parameters are valid.
+ * The first invalid field will be set to (-1) to inform the user which
+ * parameter is invalid.
+ * Except for dargs->limit_nr and dargs->limit_mem; in that case, 0 will be
+ * returned to inform the user, since the user can specify any value as a
+ * limit, except 0.
+ */
+static int check_dedupe_parameter(struct btrfs_fs_info *fs_info,
+				  struct btrfs_ioctl_dedupe_args *dargs)
+{
+	u64 blocksize = dargs->blocksize;
+	u64 limit_nr = dargs->limit_nr;
+	u64 limit_mem = dargs->limit_mem;
+	u16 hash_algo = dargs->hash_algo;
+	u8 backend = dargs->backend;
+
+	/*
+	 * Set all reserved fields to -1, allow user to detect
+	 * unsupported optional parameters.
+*/ + memset(dargs->__unused, -1, sizeof(dargs->__unused)); + if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX || + blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN || + blocksize < fs_info->sectorsize || + !is_power_of_2(blocksize) || + blocksize < PAGE_SIZE) { + dargs->blocksize = (u64)-1; + return -EINVAL; + } + if (hash_algo >= ARRAY_SIZE(btrfs_hash_sizes)) { + dargs->hash_algo = (u16)-1; + return -EINVAL; + } + if (backend >= BTRFS_DEDUPE_BACKEND_COUNT) { + dargs->backend = (u8)-1; + return -EINVAL; + } + + /* Backend specific check */ + if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) { + /* only one limit is accepted for enable*/ + if (dargs->limit_nr && dargs->limit_mem) { + dargs->limit_nr = 0; + dargs->limit_mem = 0; + return -EINVAL; + } + + if (!limit_nr && !limit_mem) + dargs->limit_nr = BTRFS_DEDUPE_LIMIT_NR_DEFAULT; + else { + u64 tmp = (u64)-1; + + if (limit_mem) { + tmp = div_u64(limit_mem, + (sizeof(struct inmem_hash)) + + btrfs_hash_sizes[hash_algo]); + /* Too small limit_mem to fill a hash
[PATCH v15 10/13] btrfs: dedupe: Inband in-memory only de-duplication implement
From: Qu Wenruo Core implement for inband de-duplication. It reuses the async_cow_start() facility to do the calculate dedupe hash. And use dedupe hash to do inband de-duplication at extent level. The workflow is as below: 1) Run delalloc range for an inode 2) Calculate hash for the delalloc range at the unit of dedupe_bs 3) For hash match(duplicated) case, just increase source extent ref and insert file extent. For hash mismatch case, go through the normal cow_file_range() fallback, and add hash into dedupe_tree. Compress for hash miss case is not supported yet. Current implement restore all dedupe hash in memory rb-tree, with LRU behavior to control the limit. Signed-off-by: Wang Xiaoguang Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- fs/btrfs/ctree.h | 4 +- fs/btrfs/dedupe.h | 15 ++ fs/btrfs/extent-tree.c | 31 +++- fs/btrfs/extent_io.c | 7 +- fs/btrfs/extent_io.h | 1 + fs/btrfs/file.c| 4 + fs/btrfs/inode.c | 316 +++-- fs/btrfs/ioctl.c | 1 + fs/btrfs/relocation.c | 18 +++ 9 files changed, 341 insertions(+), 56 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 4f0b6a12ecb1..627d617e3265 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -112,9 +112,11 @@ static inline u32 count_max_extents(u64 size, u64 max_extent_size) */ enum btrfs_metadata_reserve_type { BTRFS_RESERVE_NORMAL, + BTRFS_RESERVE_DEDUPE, }; -u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type); +u64 btrfs_max_extent_size(struct btrfs_inode *inode, + enum btrfs_metadata_reserve_type reserve_type); struct btrfs_mapping_tree { struct extent_map_tree map_tree; diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h index 87f5b7ce7766..8157b17c4d11 100644 --- a/fs/btrfs/dedupe.h +++ b/fs/btrfs/dedupe.h @@ -7,6 +7,7 @@ #define BTRFS_DEDUPE_H #include +#include "btrfs_inode.h" /* 32 bytes for SHA256 */ static const int btrfs_hash_sizes[] = { 32 }; @@ -47,6 +48,20 @@ struct btrfs_dedupe_info { u64 current_nr; }; +static inline u64 btrfs_dedupe_blocksize(struct 
btrfs_inode *inode)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+	return fs_info->dedupe_info->blocksize;
+}
+
+static inline int inode_need_dedupe(struct inode *inode)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+
+	return fs_info->dedupe_enabled;
+}
+
 static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 {
 	return (hash && hash->bytenr);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f90233ffcb27..131d48487c84 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -28,6 +28,7 @@
 #include "sysfs.h"
 #include "qgroup.h"
 #include "ref-verify.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2489,6 +2490,17 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1);
 	if (head->is_data) {
+		/*
+		 * If insert_reserved is given, it means
+		 * a new extent is reserved, then deleted
+		 * in one transaction, and the inc/dec refs get merged to 0.
+		 *
+		 * In this case, we need to remove its dedupe
+		 * hash.
+*/ + ret = btrfs_dedupe_del(fs_info, head->bytenr); + if (ret < 0) + return ret; ret = btrfs_del_csums(trans, fs_info, head->bytenr, head->num_bytes); } @@ -5882,13 +5894,15 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info, spin_unlock(_rsv->lock); } -u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type) +u64 btrfs_max_extent_size(struct btrfs_inode *inode, + enum btrfs_metadata_reserve_type reserve_type) { if (reserve_type == BTRFS_RESERVE_NORMAL) return BTRFS_MAX_EXTENT_SIZE; - - ASSERT(0); - return BTRFS_MAX_EXTENT_SIZE; + else if (reserve_type == BTRFS_RESERVE_DEDUPE) + return btrfs_dedupe_blocksize(inode); + else + return BTRFS_MAX_EXTENT_SIZE; } int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, @@ -5899,7 +5913,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL; int ret = 0; bool delalloc_lock = true; - u64 max_extent_size = btrfs_max_extent_size(reserve_type); + u64 max_extent_size = btrfs_max_extent_size(inode, reserve_type); /* If we are a
[PATCH v15 05/13] btrfs: delayed-ref: Add support for increasing data ref under spinlock
From: Qu Wenruo

For in-band dedupe, btrfs needs to increase a data ref while the
delayed_refs lock is held, so add a new function
btrfs_add_delayed_data_ref_locked() to increase an extent ref with
delayed_refs already locked.

Export init_delayed_ref_head and init_delayed_ref_common for in-band
dedupe.

Signed-off-by: Qu Wenruo
Reviewed-by: Josef Bacik
Signed-off-by: Lu Fengqi
---
 fs/btrfs/delayed-ref.c | 53 +-
 fs/btrfs/delayed-ref.h | 15 +
 2 files changed, 52 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 62ff545ba1f7..faca30b334ee 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -526,7 +526,7 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
 	spin_unlock(&existing->lock);
 }
 
-static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
+void btrfs_init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
 				  struct btrfs_qgroup_extent_record *qrecord,
 				  u64 bytenr, u64 num_bytes, u64 ref_root,
 				  u64 reserved, int action, bool is_data,
@@ -654,7 +654,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 }
 
 /*
- * init_delayed_ref_common - Initialize the structure which represents a
+ * btrfs_init_delayed_ref_common - Initialize the structure which represents a
  * modification to an extent.
  *
  * @fs_info:    Internal to the mounted filesystem mount structure.
@@ -678,7 +678,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
  *		when recording a metadata extent or BTRFS_SHARED_DATA_REF_KEY/
  *		BTRFS_EXTENT_DATA_REF_KEY when recording data extent
  */
-static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
+void btrfs_init_delayed_ref_common(struct btrfs_fs_info *fs_info,
 				    struct btrfs_delayed_ref_node *ref,
 				    u64 bytenr, u64 num_bytes, u64 ref_root,
 				    int action, u8 ref_type)
@@ -751,14 +751,14 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 	else
 		ref_type = BTRFS_TREE_BLOCK_REF_KEY;
 
-	init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
-				ref_root, action, ref_type);
+	btrfs_init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
+				      ref_root, action, ref_type);
 	ref->root = ref_root;
 	ref->parent = parent;
 	ref->level = level;
 
-	init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
-			      ref_root, 0, action, false, is_system);
+	btrfs_init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
+				    ref_root, 0, action, false, is_system);
 	head_ref->extent_op = extent_op;
 
 	delayed_refs = &trans->transaction->delayed_refs;
@@ -787,6 +787,29 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
+/*
+ * Do the real delayed data ref insertion.
+ * The caller must hold delayed_refs->lock and have allocated memory
+ * for dref, head_ref and record.
+ */
+int btrfs_add_delayed_data_ref_locked(struct btrfs_trans_handle *trans,
+			struct btrfs_delayed_ref_head *head_ref,
+			struct btrfs_qgroup_extent_record *qrecord,
+			struct btrfs_delayed_data_ref *ref, int action,
+			int *qrecord_inserted_ret, int *old_ref_mod,
+			int *new_ref_mod)
+{
+	struct btrfs_delayed_ref_root *delayed_refs;
+
+	head_ref = add_delayed_ref_head(trans, head_ref, qrecord,
+					action, qrecord_inserted_ret,
+					old_ref_mod, new_ref_mod);
+
+	delayed_refs = &trans->transaction->delayed_refs;
+
+	return insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
+}
+
 /*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
@@ -813,7 +836,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 		ref_type = BTRFS_SHARED_DATA_REF_KEY;
 	else
 		ref_type = BTRFS_EXTENT_DATA_REF_KEY;
-	init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
+	btrfs_init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
 				ref_root, action, ref_type);
 	ref->root = ref_root;
 	ref->parent = parent;
@@ -838,8 +861,8 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	init_delayed_ref_head(head_ref, record, bytenr, num_bytes, ref_root,
-			      reserved, action, true, false);
+	btrfs_init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
+				    ref_root, reserved,
[PATCH v15 08/13] btrfs: ordered-extent: Add support for dedupe
From: Wang Xiaoguang

Add ordered-extent support for dedupe.

Note that the current ordered-extent support only handles
non-compressed source extents. Support for compressed source extents
will be added later.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
Reviewed-by: Josef Bacik
---
 fs/btrfs/ordered-data.c | 46 +-
 fs/btrfs/ordered-data.h | 13 +
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 0c4ef208b8b9..4b112258a79b 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -12,6 +12,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -170,7 +171,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 				      u64 start, u64 len, u64 disk_len,
-				      int type, int dio, int compress_type)
+				      int type, int dio, int compress_type,
+				      struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -191,6 +193,33 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	entry->inode = igrab(inode);
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
+	entry->hash = NULL;
+	/*
+	 * A hash hit means we have already incremented the extent's
+	 * delayed ref.
+	 * We must handle this even if another process is trying to
+	 * turn off dedupe, otherwise we will leak a reference.
+	 */
+	if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+		struct btrfs_dedupe_info *dedupe_info;
+
+		dedupe_info = root->fs_info->dedupe_info;
+		if (WARN_ON(dedupe_info == NULL)) {
+			kmem_cache_free(btrfs_ordered_extent_cache,
+					entry);
+			return -EINVAL;
+		}
+		entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_algo);
+		if (!entry->hash) {
+			kmem_cache_free(btrfs_ordered_extent_cache, entry);
+			return -ENOMEM;
+		}
+		entry->hash->bytenr = hash->bytenr;
+		entry->hash->num_bytes = hash->num_bytes;
+		memcpy(entry->hash->hash, hash->hash,
+		       btrfs_hash_sizes[dedupe_info->hash_algo]);
+	}
+
 	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
 		set_bit(type, &entry->flags);
 
@@ -245,15 +274,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  BTRFS_COMPRESS_NONE);
+					  BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+				    u64 start, u64 len, u64 disk_len,
+				    int type, struct btrfs_dedupe_hash *hash)
+{
+	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+					  disk_len, type, 0,
+					  BTRFS_COMPRESS_NONE, hash);
+}
+
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 				 u64 start, u64 len, u64 disk_len, int type)
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 1,
-					  BTRFS_COMPRESS_NONE);
+					  BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -262,7 +299,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  compress_type);
+					  compress_type, NULL);
 }
 
 /*
@@ -444,6 +481,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
 			list_del(&sum->list);
 			kfree(sum);
 		}
+		kfree(entry->hash);
 		kmem_cache_free(btrfs_ordered_extent_cache, entry);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 02d813aaa261..08c7ee986bb9 100644
---
[PATCH v15 01/13] btrfs: dedupe: Introduce dedupe framework and its header
From: Wang Xiaoguang

Introduce the header for the btrfs in-band (write time) de-duplication
framework.

The new de-duplication framework is going to support 2 different dedupe
methods and 1 dedupe hash.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
Signed-off-by: Lu Fengqi
---
 fs/btrfs/ctree.h           |   7 ++
 fs/btrfs/dedupe.h          | 128 +-
 fs/btrfs/disk-io.c         |   1 +
 include/uapi/linux/btrfs.h |  34 ++
 4 files changed, 168 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 53af9f5253f4..741ef21a6185 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1125,6 +1125,13 @@ struct btrfs_fs_info {
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
 #endif
+
+	/*
+	 * Inband de-duplication related structures
+	 */
+	unsigned long dedupe_enabled:1;
+	struct btrfs_dedupe_info *dedupe_info;
+	struct mutex dedupe_ioctl_lock;
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 90281a7a35a8..222ce7b4d827 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -6,7 +6,131 @@
 #ifndef BTRFS_DEDUPE_H
 #define BTRFS_DEDUPE_H
 
-/* later in-band dedupe will expand this struct */
-struct btrfs_dedupe_hash;
+#include
+
+/* 32 bytes for SHA256 */
+static const int btrfs_hash_sizes[] = { 32 };
+
+/*
+ * For callers outside of dedupe.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+	u64 bytenr;
+	u32 num_bytes;
+
+	/* last field is a variable length array of dedupe hash */
+	u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+	/* dedupe blocksize */
+	u64 blocksize;
+	u16 backend;
+	u16 hash_algo;
+
+	struct crypto_shash *dedupe_driver;
+
+	/*
+	 * Use a mutex to protect both backends.
+	 * Even for in-memory backends, the rb-tree can be quite large,
+	 * so a mutex is better for such a use case.
+	 */
+	struct mutex lock;
+
+	/* following members are only used in in-memory backend */
+	struct rb_root hash_root;
+	struct rb_root bytenr_root;
+	struct list_head lru_list;
+	u64 limit_nr;
+	u64 current_nr;
+};
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+	return (hash && hash->bytenr);
+}
+
+/*
+ * Initialize in-band dedupe info.
+ * Called at dedupe enable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (from unsupported param to tree creation error for some backends)
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
+			struct btrfs_ioctl_dedupe_args *dargs);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Get current dedupe status.
+ * Return 0 for success
+ * No possible error yet
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs);
+
+/*
+ * Calculate hash for dedupe.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (error from hash codes)
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for a duplicated extent by the calculated hash.
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode:	the inode we are writing to
+ * @file_pos:	offset inside the inode
+ *		As we will increase the extent ref immediately after a
+ *		hash match, we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss. Nothing is done
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash);
+
+/*
+ * Add a dedupe hash into dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_add(struct btrfs_fs_info *fs_info,
+		     struct btrfs_dedupe_hash *hash);
+
+/*
+ * Remove a dedupe hash from dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ *
+ * NOTE: if hash deletion error is not handled well, it will lead
+ * to a corrupted fs, as a later dedupe write can point to a
+ * non-existent or even wrong extent.
+ */
+int
[PATCH v15 06/13] btrfs: dedupe: Introduce function to search for an existing hash
From: Wang Xiaoguang

Introduce the static function inmem_search() to handle the job for the
in-memory hash tree.

The trick is, we must ensure the delayed ref head is not being run at
the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search()
interface.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
Reviewed-by: Josef Bacik
Signed-off-by: Lu Fengqi
---
 fs/btrfs/dedupe.c | 210 +-
 1 file changed, 209 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 951fefd19fde..9c6152b7f0eb 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -7,6 +7,8 @@
 #include "dedupe.h"
 #include "btrfs_inode.h"
 #include "delayed-ref.h"
+#include "qgroup.h"
+#include "transaction.h"
 
 struct inmem_hash {
 	struct rb_node hash_node;
@@ -242,7 +244,6 @@ static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
 	struct inmem_hash *ihash;
 
 	ihash = inmem_alloc_hash(algo);
-
 	if (!ihash)
 		return -ENOMEM;
 
@@ -436,3 +437,210 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 	kfree(dedupe_info);
 	return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+	struct rb_node **p = &dedupe_info->hash_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+	u16 hash_algo = dedupe_info->hash_algo;
+	int hash_len = btrfs_hash_sizes[hash_algo];
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+		if (memcmp(hash, entry->hash, hash_len) < 0) {
+			p = &(*p)->rb_left;
+		} else if (memcmp(hash, entry->hash, hash_len) > 0) {
+			p = &(*p)->rb_right;
+		} else {
+			/* Found, need to re-add it to LRU list head */
+			list_del(&entry->lru_list);
+			list_add(&entry->lru_list, &dedupe_info->lru_list);
+			return entry;
+		}
+	}
+	return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash)
+{
+	int ret;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_delayed_ref_root *delayed_refs;
+	struct btrfs_delayed_ref_head *head;
+	struct btrfs_delayed_ref_head *insert_head;
+	struct btrfs_delayed_data_ref *insert_dref;
+	struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+	struct inmem_hash *found_hash;
+	int free_insert = 1;
+	int qrecord_inserted = 0;
+	u64 ref_root = root->root_key.objectid;
+	u64 bytenr;
+	u32 num_bytes;
+
+	insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+	if (!insert_head)
+		return -ENOMEM;
+	insert_head->extent_op = NULL;
+
+	insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+	if (!insert_dref) {
+		kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+		return -ENOMEM;
+	}
+	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &root->fs_info->flags) &&
+	    is_fstree(ref_root)) {
+		insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+		if (!insert_qrecord) {
+			kmem_cache_free(btrfs_delayed_ref_head_cachep,
+					insert_head);
+			kmem_cache_free(btrfs_delayed_data_ref_cachep,
+					insert_dref);
+			return -ENOMEM;
+		}
+	}
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto free_mem;
+	}
+
+again:
+	mutex_lock(&dedupe_info->lock);
+	found_hash = inmem_search_hash(dedupe_info, hash->hash);
+	/* If we don't find a duplicated extent, just return. */
+	if (!found_hash) {
+		ret = 0;
+		goto out;
+	}
+	bytenr = found_hash->bytenr;
+	num_bytes = found_hash->num_bytes;
+
+	btrfs_init_delayed_ref_head(insert_head, insert_qrecord, bytenr,
+			num_bytes, ref_root, 0, BTRFS_ADD_DELAYED_REF, true,
+			false);
+
+	btrfs_init_delayed_ref_common(trans->fs_info, &insert_dref->node,
+			bytenr, num_bytes, ref_root, BTRFS_ADD_DELAYED_REF,
+			BTRFS_EXTENT_DATA_REF_KEY);
+	insert_dref->root = ref_root;
+	insert_dref->parent = 0;
+	insert_dref->objectid = btrfs_ino(BTRFS_I(inode));
+	insert_dref->offset =