Re: Qgroups are not applied when snapshotting a subvol?
If we were going to reserve something, it should be a high number, not a low one. Having 0 reserved makes some sense, but reserving other low numbers seems odd when they aren't already reserved.

I did some experiments. Currently, assigning a higher-level qgroup to a lower-level qgroup is not possible; consequently, assigning anything to a 0-level qgroup is not possible. On the other hand, assigning qgroups while skipping levels (e.g. qgroup 2/P to 10/Q) is possible. So setting the default snapshot level high is technically possible, but you will not be able to assign these high-level qgroups anywhere lower later.

> although I hadn't realized that the snapshot command _does not_ have
> this argument, when it absolutely should.

It does here in 4.4, it's just not documented :) I too found it by accident.

> Perhaps have an option

Options always suit everyone except developers who need to implement and support them :) Here I'd like to wrap up, since I seriously doubt any real btrfs developers are still reading our discussion :)

--
With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Qgroups are not applied when snapshotting a subvol?
Austin S. Hemmelgarn posted on Tue, 28 Mar 2017 07:44:56 -0400 as excerpted:

> On 2017-03-27 21:49, Qu Wenruo wrote:
>> The problem is, how should we treat subvolume.
>>
>> Btrfs subvolume sits in the middle of directory and (logical) volume
>> used in traditional stacked solution.
>>
>> While we allow normal user to create/delete/modify dir as long as they
>> follow access control, we require privilege to create/delete/modify
>> volumes.
> No, we require privilege to do certain modifications or delete
> subvolumes. Regular users can create subvolumes with no privileges
> whatsoever, and most basic directory operations (rename, chown, chmod,
> etc) work just fine within normal UNIX DAC permissions. Unless you're
> running some specially patched kernel or some LSM (SELinux possibly)
> that somehow restricts access to the ioctl, you can always create
> subvolumes.

(I believe) You misread what QW was trying to say. Note the word "volume" as opposed to subvolume. Like most regulars here, Qu's too deep into btrfs to make that sort of mistake, so he wasn't talking about subvolumes.

Rather, he's placing btrfs subvolumes in the middle between directories and (generic/logical) volumes as normally used, and saying that for (generic) directories normal users can create/modify/delete without special privilege (beyond write in the parent), while for (generic) volumes special privileges are necessary to create/modify/delete. Btrfs subvolumes fall in between, so there's a real decision to be made about required privileges: do they follow directories and require nothing special, or volumes and require special privs?

Meanwhile, my own position as I've argued is that regardless of the theoretical debate, as long as we have the very real practical issue of hard scaling limits for subvolumes/snapshots, there's an equally real security issue in allowing unrestricted snapshot creation and, to a lesser extent, normal subvolume creation.
So while we might /like/ to make subvolumes more like directories and require no special privs, until the snapshot scaling and thus security issue is fixed or at least greatly reduced, as a purely practical security matter we really need to restrict snapshot creation to privileged users.

OTOH, it's worth pointing out that snapshots aren't unique in this regard: reflinked files have exactly the /same/ scaling issues, except they're created one-by-one while snapshots tend to be a whole bunch of reflinked files at once. Which means the question must then be asked: if we choose the privileged route for snapshots/subvolumes due to the reflinking security issues, why aren't we making all reflinking ops privileged, or should we? If we did, cp --reflink=always, among other things, would obviously fail as a normal user.

So I guess it's a rather harder question than I considered at first. Maybe we /should/ just treat subvolumes as directories in privilege terms as well. In that case, however, we really need to have the same default for subvolume deletion as for creation.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [PATCH] Btrfs: fix wrong failed mirror_num of read-repair on raid56
On Mon, Mar 27, 2017 at 07:07:15PM +0200, David Sterba wrote:
> On Fri, Mar 24, 2017 at 12:13:42PM -0700, Liu Bo wrote:
> > In the raid56 scenario, after trying parity recovery, we didn't set
> > mirror_num for btrfs_bio with failed mirror_num, hence
> > end_bio_extent_readpage() will report a random mirror_num in dmesg
> > log.
> >
> > Cc: David Sterba
> > Signed-off-by: Liu Bo
> > ---
> >  fs/btrfs/volumes.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 73d56ee..be64e4a 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6197,6 +6197,7 @@ int btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
> >  		} else {
> >  			ret = raid56_parity_recover(fs_info, bio, bbio,
> >  						    map_length, mirror_num, 1);
> > +			btrfs_io_bio(bio)->mirror_num = mirror_num;
>
> Should the mirror be set inside raid56_parity_recover? There's another
> caller, scrub_submit_raid56_bio_wait, that does not update the
> bio->mirror_num. I am not sure if this is the same case though.

scrub_submit_raid56_bio_wait doesn't need to, but yes, we could set the mirror inside raid56_parity_recover iff @generic_io is 1.

Thanks,

-liubo

> >  		}
> >
> >  		btrfs_bio_counter_dec(fs_info);
Re: [PATCH] Btrfs: enable repair during read for raid56 profile
On Mon, Mar 27, 2017 at 06:59:44PM +0200, David Sterba wrote:
> On Fri, Mar 24, 2017 at 12:13:35PM -0700, Liu Bo wrote:
> > Now that scrub can fix data errors with the help of parity for the raid56
> > profile, repair during read is able to as well.
> >
> > Although the mirror num in the raid56 senario has different meanings, i.e.
> (typo: scenario)

Thanks!

> > 0 or 1: read data directly
> > >1:     do recover with parity,
> > it could be fit into how we repair bad block during read.
>
> Could we possibly add some symbolic names for the RAID56 case?

Yes, we could do that here, but I'm afraid it might not be helpful because most places still use mirror or mirror_num, such as btrfs_submit_bio_hook and btrfs_map_bio.

> > The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the
> > device and position on it.
>
> Please also document the trick in the code before the following.

Good point.

Thanks,

-liubo

> > +	if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num)) {
> > +		/* use BTRFS_MAP_READ to get the phy dev and sector */
> > +		ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
> > +				      &map_length, &bbio, 0);
> > +		if (ret) {
> > +			btrfs_bio_counter_dec(fs_info);
> > +			bio_put(bio);
> > +			return -EIO;
> > +		}
> > +		ASSERT(bbio->mirror_num == 1);
> > +	} else {
> > +		ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical,
> > +				      &map_length, &bbio, mirror_num);
> > +		if (ret) {
> > +			btrfs_bio_counter_dec(fs_info);
> > +			bio_put(bio);
> > +			return -EIO;
> > +		}
> > +		BUG_ON(mirror_num != bbio->mirror_num);
> >  	}
> > -	BUG_ON(mirror_num != bbio->mirror_num);
> > -	sector = bbio->stripes[mirror_num-1].physical >> 9;
> > +
> > +	sector = bbio->stripes[bbio->mirror_num - 1].physical >> 9;
> >  	bio->bi_iter.bi_sector = sector;
> > -	dev = bbio->stripes[mirror_num-1].dev;
> > +	dev = bbio->stripes[bbio->mirror_num - 1].dev;
> >  	btrfs_put_bbio(bbio);
> >  	if (!dev || !dev->bdev || !dev->writeable) {
> >  		btrfs_bio_counter_dec(fs_info);
[PATCH 2/2] btrfs: clear the RAID5/6 incompat flag once no longer needed
There's no known taint on a filesystem that was RAID5/6 once but has been
since converted to something non-experimental.

Signed-off-by: Adam Borowski
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a4c3e6628ec1..c720a7bc1b15 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3089,6 +3089,8 @@ int open_ctree(struct super_block *sb,
 	    BTRFS_BLOCK_GROUP_RAID56_MASK) {
 		btrfs_alert(fs_info,
 	"btrfs RAID5/6 is EXPERIMENTAL and has known data-loss bugs");
+	} else {
+		btrfs_clear_fs_incompat(fs_info, RAID56);
 	}
 
 	/*
-- 
2.11.0
[PATCH 1/2] btrfs: warn about RAID5/6 being experimental at mount time
Too many people come complaining about losing their data -- and indeed,
there's no warning outside a wiki and the mailing list tribal knowledge.

Message severity chosen for consistency with XFS -- "alert" makes dmesg
produce a nice red background which should get the point across.

Signed-off-by: Adam Borowski
---
 fs/btrfs/disk-io.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index affd7aada057..a4c3e6628ec1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3083,6 +3083,14 @@ int open_ctree(struct super_block *sb,
 		btrfs_set_opt(fs_info->mount_opt, SSD);
 	}
 
+	if ((fs_info->avail_data_alloc_bits |
+	     fs_info->avail_metadata_alloc_bits |
+	     fs_info->avail_system_alloc_bits) &
+	    BTRFS_BLOCK_GROUP_RAID56_MASK) {
+		btrfs_alert(fs_info,
+	"btrfs RAID5/6 is EXPERIMENTAL and has known data-loss bugs");
+	}
+
 	/*
 	 * Mount does not set all options immediately, we can do it now and do
 	 * not have to wait for transaction commit
-- 
2.11.0
[PATCH v3 0/5] raid56: scrub related fixes
This patchset can be fetched from my github repo:
https://github.com/adam900710/linux.git raid56_fixes

It's based on v4.11-rc2; the last two patches are modified according to
the advice from Liu Bo.

The patchset fixes the following bugs:

1) False alert or wrong csum error number when scrubbing RAID5/6
   The bug itself won't cause any damage to the fs, it's a pure race.
   It can be triggered by scrubbing a 64K corrupted data stripe.
   Normally scrub will report 16 recovered csum errors, but sometimes
   it will report more than 16, and in rare cases even an unrecoverable
   error can be reported.

2) Corrupted data stripe rebuild corrupts P/Q
   So scrub turns one error into another, not really fixing anything.
   Since kernel scrub doesn't report parity errors, either offline
   scrub or a manual check is needed to expose such an error.

3) Use-after-free caused by cancelling dev-replace
   This is quite a deadly bug, since cancelling dev-replace can cause
   a kernel panic. Can be triggered by btrfs/069.

v2:
  Use bio_counter to protect rbio against dev-replace cancel, instead
  of the original btrfs_device refcount, which is too restrictive and
  would require disabling the rbio cache, suggested by Liu Bo.

v3:
  Add a fix for another possible use-after-free when rechecking a
  recovered full stripe.
  Squash two patches as they fix the same problem, to make bisecting
  easier.
  Use a mutex rather than a spinlock to protect the full stripe locks
  tree; this allows us to allocate memory inside the critical section
  on demand.
  Encapsulate the rb_root and mutex into btrfs_full_stripe_locks_tree.
  Rename scrub_full_stripe_lock to full_stripe_lock inside scrub.c.
  Rename related functions to have unified naming.
  Code style changes to follow the existing scrub code style.
Qu Wenruo (5):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition
  btrfs: scrub: Don't append on-disk pages for raid56 scrub
  btrfs: Wait for in-flight bios before freeing target device for raid56
  btrfs: Prevent scrub recheck from racing with dev replace

 fs/btrfs/ctree.h       |  17 ++
 fs/btrfs/dev-replace.c |   2 +
 fs/btrfs/extent-tree.c |   2 +
 fs/btrfs/raid56.c      |  14 ++
 fs/btrfs/scrub.c       | 248 +++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 275 insertions(+), 8 deletions(-)

-- 
2.12.1
[PATCH v3 2/5] btrfs: scrub: Fix RAID56 recovery race condition
When scrubbing a RAID5 filesystem which has recoverable data corruption
(only one data stripe corrupted), sometimes scrub will report more csum
errors than expected. Sometimes even an unrecoverable error will be
reported.

The problem can be easily reproduced by the following steps:
1) Create a btrfs with RAID5 data profile with 3 devs
2) Mount it with nospace_cache or space_cache=v2
   To avoid extra data space usage.
3) Create a 128K file and sync the fs, unmount it
   Now the 128K file lies at the beginning of the data chunk
4) Locate the physical bytenr of the data chunk on dev3
   Dev3 is the 1st data stripe.
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it

The correct csum error number should be 16 (assuming x86_64). A larger
csum error number can be reported in a 1/3 chance, and an unrecoverable
error can also be reported in a 1/10 chance.

The root cause of the problem is that the RAID5/6 recover code has a
race condition, due to the fact that a full scrub is initiated per
device. For other, mirror-based profiles each mirror is independent of
the others, so a race won't cause any big problem.

For example:
        Corrupted        |    Correct          |    Correct
  Scrub dev3 (D1)        | Scrub dev2 (D2)     | Scrub dev1 (P)
 ------------------------+---------------------+----------------------
  Read out D1            | Read out D2         | Read full stripe
  Check csum             | Check csum          | Check parity
  Csum mismatch          | Csum match, cont.   | Parity mismatch
  handle_errored_block   |                     | handle_errored_block
  Read out full stripe   |                     | Read out full stripe
  D1 csum error (err++)  |                     | D1 csum error (err++)
  Recover D1             |                     | Recover D1

So D1's csum error is counted twice, just because
handle_errored_block() doesn't have enough protection, and the race can
happen. In an even worse case, D1's recovery code may be re-writing
D1/D2/P while P's recovery code is just reading out the full stripe,
which can cause an unrecoverable error.

This patch uses the previously introduced lock_full_stripe() and
unlock_full_stripe() to protect the whole scrub_handle_errored_block()
function for RAID56 recovery.
So no more extra csum errors nor unrecoverable errors.

Reported-by: Goffredo Baroncelli
Signed-off-by: Qu Wenruo
---
 fs/btrfs/scrub.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ab33b9a8aac2..f92d2512f4f3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1129,6 +1129,17 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	have_csum = sblock_to_check->pagev[0]->have_csum;
 	dev = sblock_to_check->pagev[0]->dev;
 
+	/*
+	 * For RAID5/6, race can happen for different dev scrub threads.
+	 * For data corruption, Parity and Data threads will both try
+	 * to recover the data.
+	 * Race can lead to doubly added csum error, or even unrecoverable
+	 * error.
+	 */
+	ret = lock_full_stripe(fs_info, logical);
+	if (ret < 0)
+		return ret;
+
 	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
 		sblocks_for_recheck = NULL;
 		goto nodatasum_case;
@@ -1463,6 +1474,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		kfree(sblocks_for_recheck);
 	}
 
+	ret = unlock_full_stripe(fs_info, logical);
+	if (ret < 0)
+		return ret;
 	return 0;
 }
-- 
2.12.1
[PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
Unlike mirror based profiles, RAID5/6 recovery needs to read out the
whole full stripe, and if we don't do proper protection, it can easily
cause a race condition.

Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe()
for RAID5/6, which store an rb_tree of mutexes for full stripes, so
scrub callers can use them to lock a full stripe to avoid races.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/ctree.h       |  17 +++
 fs/btrfs/extent-tree.c |   2 +
 fs/btrfs/scrub.c       | 212 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 231 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29b7fc28c607..9fe56da21fed 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -538,6 +538,14 @@ struct btrfs_io_ctl {
 	unsigned check_crcs:1;
 };
 
+/*
+ * Tree to record all locked full stripes of a RAID5/6 block group
+ */
+struct btrfs_full_stripe_locks_tree {
+	struct rb_root root;
+	struct mutex lock;
+};
+
 struct btrfs_block_group_cache {
 	struct btrfs_key key;
 	struct btrfs_block_group_item item;
@@ -648,6 +656,9 @@ struct btrfs_block_group_cache {
 	 * Protected by free_space_lock.
 	 */
 	int needs_free_space;
+
+	/* Record locked full stripes for RAID5/6 block group */
+	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
 
 /* delayed seq elem */
@@ -3647,6 +3658,12 @@ int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info,
 			   struct btrfs_device *dev);
 int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid,
 			 struct btrfs_scrub_progress *progress);
+static inline void btrfs_init_full_stripe_locks_tree(
+			struct btrfs_full_stripe_locks_tree *locks_root)
+{
+	locks_root->root = RB_ROOT;
+	mutex_init(&locks_root->lock);
+}
 
 /* dev-replace.c */
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be5477676cc8..e4d48997d927 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -131,6 +131,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
 	if (atomic_dec_and_test(&cache->count)) {
 		WARN_ON(cache->pinned > 0);
 		WARN_ON(cache->reserved > 0);
+		WARN_ON(!RB_EMPTY_ROOT(&cache->full_stripe_locks_root.root));
 		kfree(cache->free_space_ctl);
 		kfree(cache);
 	}
@@ -9917,6 +9918,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
+	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
 
 	return cache;
 }
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b0251eb1239f..ab33b9a8aac2 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -240,6 +240,13 @@ struct scrub_warning {
 	struct btrfs_device	*dev;
 };
 
+struct full_stripe_lock {
+	struct rb_node node;
+	u64 logical;
+	u64 refs;
+	struct mutex mutex;
+};
+
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
 static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
@@ -349,6 +356,211 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
 }
 
 /*
+ * Insert new full stripe lock into full stripe locks tree
+ *
+ * Return pointer to existing or newly inserted full_stripe_lock structure if
+ * everything works well.
+ * Return ERR_PTR(-ENOMEM) if we failed to allocate memory
+ *
+ * NOTE: caller must hold full_stripe_locks_root->lock before calling this
+ * function
+ */
+static struct full_stripe_lock *insert_full_stripe_lock(
+		struct btrfs_full_stripe_locks_tree *locks_root,
+		u64 fstripe_logical)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct full_stripe_lock *entry;
+	struct full_stripe_lock *ret;
+
+	WARN_ON(!mutex_is_locked(&locks_root->lock));
+
+	p = &locks_root->root.rb_node;
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct full_stripe_lock, node);
+		if (fstripe_logical < entry->logical) {
+			p = &(*p)->rb_left;
+		} else if (fstripe_logical > entry->logical) {
+			p = &(*p)->rb_right;
+		} else {
+			entry->refs++;
+			return entry;
+		}
+	}
+
+	/* Insert new lock */
+	ret = kmalloc(sizeof(*ret), GFP_KERNEL);
+	if (!ret)
+		return ERR_PTR(-ENOMEM);
+	ret->logical = fstripe_logical;
+	ret->refs = 1;
+	mutex_init(&ret->mutex);
+
+	rb_link_node(&ret->node, parent, p);
+	rb_insert_color(&ret->node, &locks_root->root);
+	return ret;
+}
+
+/*
+ * Search for a full stripe lock
[PATCH v3 3/5] btrfs: scrub: Don't append on-disk pages for raid56 scrub
In the following situation, scrub will calculate the wrong parity and
overwrite the correct one:

RAID5 full stripe:

Before:
        | Dev 1          | Dev 2          | Dev 3         |
        | Data stripe 1  | Data stripe 2  | Parity stripe |
--- 0   | 0x0000 (Bad)   | 0xcdcd         | 0x0000        |
--- 4K  | 0xcdcd         | 0xcdcd         | 0x0000        |
 ...    | 0xcdcd         | 0xcdcd         | 0x0000        |
--- 64K

After scrubbing dev3 only:
        | Dev 1          | Dev 2          | Dev 3         |
        | Data stripe 1  | Data stripe 2  | Parity stripe |
--- 0   | 0xcdcd (Good)  | 0xcdcd         | 0xcdcd (Bad)  |
--- 4K  | 0xcdcd         | 0xcdcd         | 0x0000        |
 ...    | 0xcdcd         | 0xcdcd         | 0x0000        |
--- 64K

The reason is that after the raid56 read-rebuild, rbio->stripe_pages
are all correctly recovered (0xcd for data stripes). However, when we
check and repair parity in scrub_parity_check_and_repair(), we append
the pages in the sparity->spages list to rbio->bio_pages[], which
contain the old on-disk data.

When we submit parity data to disk, we calculate parity using
rbio->bio_pages[] first; only if a page is not found there do we fall
back to rbio->stripe_pages[].

The patch fixes it by not appending pages from sparity->spages, so
finish_parity_scrub() will use rbio->stripe_pages[], which is correct.
Signed-off-by: Qu Wenruo
---
 fs/btrfs/scrub.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f92d2512f4f3..2fd259dbf4db 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2991,7 +2991,6 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	struct bio *bio;
 	struct btrfs_raid_bio *rbio;
-	struct scrub_page *spage;
 	struct btrfs_bio *bbio = NULL;
 	u64 length;
 	int ret;
@@ -3021,9 +3020,6 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
 	if (!rbio)
 		goto rbio_out;
 
-	list_for_each_entry(spage, &sparity->spages, list)
-		raid56_add_scrub_pages(rbio, spage->page, spage->logical);
-
 	scrub_pending_bio_inc(sctx);
 	raid56_parity_submit_scrub_rbio(rbio);
 	return;
-- 
2.12.1
[PATCH v3 5/5] btrfs: Prevent scrub recheck from racing with dev replace
scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses
bbio without the protection of the bio_counter.

This can lead to use-after-free if it races with dev-replace cancel.

Fix it by increasing the bio_counter before calling btrfs_map_sblock()
and decreasing it when the corresponding recover is finished.

Cc: Liu Bo
Reported-by: Liu Bo
Signed-off-by: Qu Wenruo
---
 fs/btrfs/scrub.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b8c49074d1b3..84b077c993c0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1072,9 +1072,11 @@ static inline void scrub_get_recover(struct scrub_recover *recover)
 	atomic_inc(&recover->refs);
 }
 
-static inline void scrub_put_recover(struct scrub_recover *recover)
+static inline void scrub_put_recover(struct btrfs_fs_info *fs_info,
+				     struct scrub_recover *recover)
 {
 	if (atomic_dec_and_test(&recover->refs)) {
+		btrfs_bio_counter_dec(fs_info);
 		btrfs_put_bbio(recover->bbio);
 		kfree(recover);
 	}
@@ -1464,7 +1466,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 			sblock->pagev[page_index]->sblock = NULL;
 			recover = sblock->pagev[page_index]->recover;
 			if (recover) {
-				scrub_put_recover(recover);
+				scrub_put_recover(fs_info, recover);
 				sblock->pagev[page_index]->recover = NULL;
 			}
@@ -1556,16 +1558,19 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 		 * with a length of PAGE_SIZE, each returned stripe
 		 * represents one mirror
 		 */
+		btrfs_bio_counter_inc_blocked(fs_info);
 		ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
 				       logical, &mapped_length, &bbio, 0, 1);
 		if (ret || !bbio || mapped_length < sublen) {
 			btrfs_put_bbio(bbio);
+			btrfs_bio_counter_dec(fs_info);
 			return -EIO;
 		}
 
 		recover = kzalloc(sizeof(struct scrub_recover), GFP_NOFS);
 		if (!recover) {
 			btrfs_put_bbio(bbio);
+			btrfs_bio_counter_dec(fs_info);
 			return -ENOMEM;
 		}
 
@@ -1591,7 +1596,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 				spin_lock(&sctx->stat_lock);
 				sctx->stat.malloc_errors++;
 				spin_unlock(&sctx->stat_lock);
-				scrub_put_recover(recover);
+				scrub_put_recover(fs_info, recover);
 				return -ENOMEM;
 			}
 			scrub_page_get(page);
@@ -1633,7 +1638,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 				scrub_get_recover(recover);
 				page->recover = recover;
 			}
-			scrub_put_recover(recover);
+			scrub_put_recover(fs_info, recover);
 			length -= sublen;
 			logical += sublen;
 			page_index++;
-- 
2.12.1
[PATCH v3 4/5] btrfs: Wait for in-flight bios before freeing target device for raid56
When a raid56 dev-replace is cancelled by a running scrub, we will free
the target device without waiting for in-flight bios, causing the
following NULL pointer dereference or general protection fault.

 BUG: unable to handle kernel NULL pointer dereference at 05e0
 IP: generic_make_request_checks+0x4d/0x610
 CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G  O  4.11.0-rc2 #72
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
 Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
 task: 88002875b4c0 task.stack: c90001334000
 RIP: 0010:generic_make_request_checks+0x4d/0x610
 Call Trace:
  ? generic_make_request+0xc7/0x360
  generic_make_request+0x24/0x360
  ? generic_make_request+0xc7/0x360
  submit_bio+0x64/0x120
  ? page_in_rbio+0x4d/0x80 [btrfs]
  ? rbio_orig_end_io+0x80/0x80 [btrfs]
  finish_rmw+0x3f4/0x540 [btrfs]
  validate_rbio_for_rmw+0x36/0x40 [btrfs]
  raid_rmw_end_io+0x7a/0x90 [btrfs]
  bio_endio+0x56/0x60
  end_workqueue_fn+0x3c/0x40 [btrfs]
  btrfs_scrubparity_helper+0xef/0x620 [btrfs]
  btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
  process_one_work+0x2af/0x720
  ? process_one_work+0x22b/0x720
  worker_thread+0x4b/0x4f0
  kthread+0x10f/0x150
  ? process_one_work+0x720/0x720
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x2e/0x40
 RIP: generic_make_request_checks+0x4d/0x610 RSP: c90001337bb8

In btrfs_dev_replace_finishing(), we call btrfs_rm_dev_replace_blocked()
to wait for bios before destroying the target device when scrub finishes
normally. However, when scrub is aborted, either due to an error or
cancellation, we don't wait for bios, and this can lead to a
use-after-free if there are bios still holding the target device.

Furthermore, for raid56 scrub, at least 2 places call btrfs_map_sblock()
without the protection of the bio_counter, leading to the same problem.

This patch fixes the problem by:
1) Waiting for the bio_counter before freeing the target device when
   cancelling a replace
2) Using the bio_counter to protect the btrfs_map_sblock() calls for
   raid56

Cc: Liu Bo
Signed-off-by: Qu Wenruo
---
 fs/btrfs/dev-replace.c |  2 ++
 fs/btrfs/raid56.c      | 14 ++++++++++++++
 fs/btrfs/scrub.c       |  5 +++++
 3 files changed, 21 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e653921f05d9..b9d88136b5a9 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -546,8 +546,10 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	mutex_unlock(&fs_info->chunk_mutex);
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 	mutex_unlock(&uuid_mutex);
+	btrfs_rm_dev_replace_blocked(fs_info);
 	if (tgt_device)
 		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
+	btrfs_rm_dev_replace_unblocked(fs_info);
 	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
 	return scrub_ret;
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1571bf26dc07..5c180fea32ab 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2194,6 +2194,8 @@ static void read_rebuild_work(struct btrfs_work *work)
 /*
  * The following code is used to scrub/replace the parity stripe
  *
+ * Caller must have already increased bio_counter for getting @bbio.
+ *
  * Note: We need make sure all the pages that add into the scrub/replace
  * raid bio are correct and not be changed during the scrub/replace. That
  * is those pages just hold metadata or file data with checksum.
  */
@@ -2231,6 +2233,12 @@ raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	ASSERT(rbio->stripe_npages == stripe_nsectors);
 	bitmap_copy(rbio->dbitmap, dbitmap, stripe_nsectors);
 
+	/*
+	 * We have already increased bio_counter when getting bbio, record it
+	 * so we can free it at rbio_orig_end_io().
+	 */
+	rbio->generic_bio_cnt = 1;
+
 	return rbio;
 }
 
@@ -2673,6 +2681,12 @@ raid56_alloc_missing_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		return NULL;
 	}
 
+	/*
+	 * When we get bbio, we have already increased bio_counter, record it
+	 * so we can free it at rbio_orig_end_io()
+	 */
+	rbio->generic_bio_cnt = 1;
+
 	return rbio;
 }
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 2fd259dbf4db..b8c49074d1b3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2413,6 +2413,7 @@ static void scrub_missing_raid56_pages(struct scrub_block *sblock)
 	int ret;
 	int i;
 
+	btrfs_bio_counter_inc_blocked(fs_info);
 	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
 			       &length, &bbio, 0, 1);
 	if (ret || !bbio || !bbio->raid_map)
@@ -2457,6 +2458,7 @@ static void scrub_missing_raid56_pages(struct scrub_block *sblock)
 rbio_out:
 	bio_put(bio);
Re: How to test in-band dedupe
At 03/29/2017 01:52 AM, Jitendra wrote:
> Hi Qu/All,
>
> I am looking into in-memory in-band dedupe. I have cloned your git tree
> from the following urls.
>
> Linux tree:
> https://github.com/adam900710/linux.git wang_dedupe_latest
>
> btrfs-progs:
> https://g...@github.com:adam900710/btrfs-progs.git dedupe_20170316
>
> Then ran the following test:
> https://btrfs.wiki.kernel.org/index.php/User_notes_on_dedupe

Thanks for having interest in in-band dedupe.

> Is there any stuff that needs to be tested for it? If so, please point
> it out to me.

Please note that CONFIG_BTRFS_DEBUG must be set, or the in-band dedupe
ioctl will still be hidden.

Basically, you could try all xfstests with in-band dedupe enabled. While
it's not officially supported yet, you could modify _scratch_mount and
_test_mount to always enable dedupe just after mount.

> My goal is to understand in-memory dedup and try to participate in
> on-disk de-dup.

The on-disk dedupe backend is not provided yet, although it is planned
and not hard to implement. Our plan is to add the on-disk dedupe backend
after the in-memory dedupe is merged and tested.

And about understanding the current dedupe algorithm: it should not be
that hard to understand, while the really hard part is the kernel
itself. Tons of facilities are involved in the fs, from memory
management to VFS to page cache to delayed allocation.

Thanks,
Qu

> ---
> Jitendra
Re: send snapshot from snapshot incremental
On Tue, Mar 28, 2017 at 06:40:25PM -0400, J. Hart wrote:
> Don't be embarrassed.
> I'm a native speaker and still have trouble with most explanations. :-)

You should try writing them. ;)

Hugo ("darkling").

> On 03/28/2017 06:01 PM, Jakob Schürz wrote:
> > Thanks for that explanation.
> >
> > I'm sure, i didn't understand the -c option... and my english is pretty
> > good enough for the most things I need to know in Linux-things... but
> > not for this. :-(

-- 
Hugo Mills             | What's a Nazgûl like you doing in a place like this?
hugo@... carfax.org.uk | http://carfax.org.uk/ |
PGP: E2AB1DE4          |                                              Illiad
Re: send snapshot from snapshot incremental
Don't be embarrassed. I'm a native speaker and still have trouble with most explanations. :-)

On 03/28/2017 06:01 PM, Jakob Schürz wrote:
> Thanks for that explanation.
>
> I'm sure I didn't understand the -c option... and my English is pretty good enough for most things I need to know in Linux... but not for this. :-(
Re: send snapshot from snapshot incremental
Thanks for that explanation. I'm sure I didn't understand the -c option... and my English is pretty good enough for most things I need to know in Linux... but not for this. :-(

On 2017-03-26 at 22:07, Peter Grandi wrote:
> [ ... ]
>> BUT if I take a snapshot from the system, and want to transfer
>> it to the external HD, I can not set a parent subvolume,
>> because there isn't any.
>
> Questions like this are based on incomplete understanding of
> 'send' and 'receive', and on IRC user "darkling" explained it
> fairly well:
>
>> When you use -c, you're telling the FS that it can expect to
>> find a sent copy of that subvol on the receiving side, and
>> that anything shared with it can be sent by reference. OK, so
>> with -c on its own, you're telling the FS that "all the data
>> in this subvol already exists on the remote".
>
>> So, when you send your subvol, *all* of the subvol's metadata
>> is sent, and where that metadata refers to an extent that's
>> shared with the -c subvol, the extent data isn't sent, because
>> it's known to be on the other end already, and can be shared
>> directly from there.
>
>> OK. So, with -p, there's a "base" subvol. The send subvol and
>> the -p reference subvol are both snapshots of that base (at
>> different times). The -p reference subvol, as with -c, is
>> assumed to be on the remote FS. However, because it's known to
>> be an earlier version of the same data, you can be more
>> efficient in the sending by saying "start from the earlier
>> version, and modify it in this way to get the new version".
>
>> So, with -p, not all of the metadata is sent, because you know
>> you've already got most of it on the remote in the form of the
>> earlier version.
>> So -p is "take this thing and apply these differences to it"
>> and -c is "build this thing from scratch, but you can share
>> some of the data with these sources".
>
> Also here some additional details:
>
> http://logs.tvrrug.org.uk/logs/%23btrfs/2016-06-29.html#2016-06-29T22:39:59
>
> The requirement for read-only is because in that way it is
> pretty sure that the same stuff is on both origin and target
> volume.
>
> It may help to compare with RSYNC: it has to scan both the full
> origin and target trees, because it cannot be told that there is
> a parent tree that is the same on origin and target; but with
> option '--link-dest' it can do something similar to 'send -c'.

There is subvolume A on both the send and receive side, and subvolume AA on the send side, snapshotted from A. The parent ID of the sent AA is the ID of A. The received ID of A on the receive side is the ID of A. To send AA, I use the command:

btrfs send -p A AA | btrfs receive /path/to/receiveFS/

The received ID of AA on the receive side is then the ID of AA. Now I take a snapshot of AA on the send side, called AAA. If I try to send AAA to the receiving FS with

btrfs send -p AA AAA | btrfs receive /path/to/receiveFS/

no parent snapshot is found. So I should use the -c option instead:

btrfs send -c AA AAA | btrfs receive /path/to/receiveFS/

Am I right? (Sorry, I cannot test it now, I do not have my external drive here.) I seem to remember that this didn't work in the past...

Best regards
Jakob
Re: [PATCH v2] Btrfs: fix unexpected file hole after disk errors
On Tue, Mar 28, 2017 at 02:50:06PM +0200, David Sterba wrote:
> On Mon, Mar 06, 2017 at 12:23:30PM -0800, Liu Bo wrote:
> > Btrfs creates hole extents to cover any unwritten section right before
> > doing buffer writes after commit 3ac0d7b96a26 ("btrfs: Change the expanding
> > write sequence to fix snapshot related bug.").
> >
> > However, that takes the start position of the buffered write to compare
> > against the current EOF, hole extents would be created only if (EOF <
> > start).
> >
> > If the EOF is at the middle of the buffered write, no hole extents will be
> > created and a file hole without a hole extent is left in this file.
> >
> > This bug was revealed by generic/019 in fstests. 'fsstress' in this test
> > may create the above situation and the test then fails all requests
> > including writes, so the buffer write which is supposed to cover the
> > hole (without the hole extent) couldn't make it on disk. Running fsck
> > against such btrfs ends up with detecting file extent holes.
> >
> > Things could be more serious, some stale data would be exposed to
> > userspace if files with this kind of hole are truncated to a position of
> > the hole, because the on-disk inode size is beyond the last extent in the
> > file.
> >
> > This fixes the bug by comparing the end position against the EOF.
>
> Is the test reliable? As I read it, it should be possible to craft the
> file extents and trigger the bug. And verify the fix.

It's not; running generic/019 ten times produces a btrfsck error about once on my box.

Actually, I think we don't need this patch any more, since we're going to remove hole extents.

I made a mistake when writing the 'more serious' part above: in fact, truncating to a hole doesn't expose stale data, as we don't have any file extent that points to the hole, and reading the hole part returns all zeroes, so there's no serious problem. It was a different bug, about accidentally setting the on-disk isize to zero and losing data, which has been fixed.
Thanks,
-liubo
How to test in-bound dedupe
Hi Qu/All,

I am looking into in-memory in-band dedupe. I have cloned your git tree from the following URLs.

Linux tree: https://github.com/adam900710/linux.git wang_dedupe_latest
btrfs-progs: https://g...@github.com:adam900710/btrfs-progs.git dedupe_20170316

Then I ran the following test: https://btrfs.wiki.kernel.org/index.php/User_notes_on_dedupe.

Is there any stuff that needs to be tested for it? If so, please point it out to me.

My goal is to understand in-memory dedupe and try to participate in on-disk dedupe.

---
Jitendra
Re: [PATCH V2 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error
On Tue, Mar 14, 2017 at 04:26:11PM +0800, Anand Jain wrote:
> The objective of this patch is to cleanup barrier_all_devices()
> so that the error checking is in a separate loop independent of
> the loop which submits and waits on the device flush requests.

I think that if we get completely rid of the ENOMEM as suggested under patch 2, the split will not be necessary. The send failures will be gone, so we can simply count failures from the waiting write_dev_flush(..., 1). There, any returned error also implies the device stat increase for FLUSH_ERRS. Then barrier_all_devices will be a bit simpler.

> By doing this it helps to further develop patches which would tune
> the error-actions as needed.
>
> Here functions such as btrfs_dev_stats_dirty() couldn't be used
> because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS.
>
> Signed-off-by: Anand Jain
> ---
> v2: Address Qu review comments viz..
>   Add meaningful names, like cp_list (for checkpoint_list head).
>   (And actually it does not need a new struct type just to hold
>   the head pointer, list node is already named as device_checkpoint).
>   Check return value of add_device_checkpoint()
>   Check if the device is already added at add_device_checkpoint()
>   Rename fini_devices_checkpoint() to rel_devices_checkpoint()
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5719e036048b..d0c401884643 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3566,6 +3566,86 @@ static int write_dev_flush(struct btrfs_device *device, int wait)
>  	return 0;
>  }
>
> +struct device_checkpoint {
> +	struct list_head cp_node;
> +	struct btrfs_device *device;
> +	int stat_value_checkpoint;
> +};

I find this approach particularly bad for several reasons. You need to store just one int per device but introduce a bloated temporary structure. If anything, the int could be embedded in btrfs_device, set and checked at the right points.
Instead, you add memory allocations to a critical context and more potential failure paths.

> +static int add_device_checkpoint(struct btrfs_device *device,
> +				 struct list_head *cp_list)
> +{
...
> +	cdev = kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL);
...
> +}

The memory allocations are GFP_KERNEL; this could lead to strange lockups if the allocator tries to free memory and asks the filesystem(s) to flush their data.

Traversing the structures leads to quadratic complexity in check_barrier_error(), where the checkpoint list is traversed for each iteration of the device list.

> @@ -3574,8 +3654,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
>  {
>  	struct list_head *head;
>  	struct btrfs_device *dev;
> -	int dropouts = 0;
>  	int ret;
> +	static LIST_HEAD(cp_list);
        ^^
What is this supposed to mean? What if several filesystems end up in barrier_all_devices modifying the list?
Re: Shrinking a device - performance?
On 2017-03-28 10:43, Peter Grandi wrote:
> This is going to be long because I am writing something detailed,
> hoping pointlessly that someone in the future will find it by
> searching the list archives while doing research before setting
> up a new storage system, and they will be the kind of person
> that tolerates reading messages longer than Twitter. :-)
>
>> I’m currently shrinking a device and it seems that the
>> performance of shrink is abysmal.
>
> When I read this kind of statement I am reminded of all the
> cases where someone left me to decatastrophize a storage system
> built on "optimistic" assumptions. The usual "optimism" is what
> I call the "syntactic approach", that is the axiomatic belief
> that any syntactically valid combination of features not only
> will "work", but very fast too and reliably despite slow cheap
> hardware and "unattentive" configuration. Some people call that
> the expectation that system developers provide or should provide
> an "O_PONIES" option. In particular I get very saddened when
> people use "performance" to mean "speed", as the difference
> between the two is very great.
>
> As a general consideration, shrinking a large filetree online
> in-place is an amazingly risky, difficult, slow operation and
> should be a last desperate resort (as apparently in this case),
> regardless of the filesystem type, and expecting otherwise is
> "optimistic".
>
> My guess is that very complex risky slow operations like that
> are provided by "clever" filesystem developers for "marketing"
> purposes, to win box-ticking competitions. That applies to those
> system developers who do know better; I suspect that even some
> filesystem developers are "optimistic" as to what they can
> actually achieve.

There are cases where there really is no other sane option. Not everyone has the kind of budget needed for proper HA setups, and if you need maximal uptime and as a result have to reprovision the system online, then you pretty much need a filesystem that supports online shrinking. Also, it's not really all that slow on most filesystems; BTRFS is just hurt by its comparatively poor performance and the COW metadata updates that are needed.

>> I intended to shrink a ~22TiB filesystem down to 20TiB. This is
>> still using LVM underneath so that I can’t just remove a device
>> from the filesystem but have to use the resize command.
>
> That is actually a very good idea because Btrfs multi-device is
> not quite as reliable as DM/LVM2 multi-device.

This depends on how much you trust your storage hardware relative to how much you trust the kernel code. For raid5/6, yes, BTRFS multi-device is currently crap. For most people raid10 in BTRFS is too. For raid1 mode, however, it really is personal opinion.

>> Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>>   Total devices 1 FS bytes used 18.21TiB
>>   devid 1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy
>
> Maybe 'balance' should have been used a bit more.
>
>> This has been running since last Thursday, so roughly 3.5 days
>> now. The “used” number in devid 1 has moved about 1TiB in this
>> time. The filesystem is seeing regular usage (read and write)
>> and when I’m suspending any application traffic I see about
>> 1GiB of movement every now and then. Maybe once every 30
>> seconds or so. Does this sound fishy or normal to you?
>
> With consistent "optimism" this is a request to assess whether
> "performance" of some operations is adequate on a filetree
> without telling us what the filetree contents look like, what
> the regular workload is, or what the storage layer looks like.
>
> Being one of the few system administrators crippled by lack of
> psychic powers :-), I rely on guesses and inferences here,
> having read the whole thread containing some belated details.
>
> From the ~22TB total capacity my guess is that the storage layer
> involves rotating hard disks, and from later details the
> filesystem contents seem to be heavily reflinked files of
> several GB in size, and the workload seems to be backups to
> those files from several source hosts. Considering the general
> level of "optimism" in the situation, my wild guess is that the
> storage layer is based on large slow cheap rotating disks in the
> 4TB-8TB range, with very low IOPS-per-TB.
>
>> Thanks for that info. The 1min per 1GiB is what I saw too -
>> the “it can take longer” wasn’t really explainable to me.
>
> A contemporary rotating disk device can do from around 0.5MB/s
> transfer rate with small random accesses with barriers, up to
> around 80-160MB/s in purely sequential access without barriers.
>
> 1GB/m of simultaneous read-write means around 16MB/s reads plus
> 16MB/s writes, which is fairly good *performance* (even if slow
> *speed*) considering that moving extents around, even across
> disks, involves quite a bit of randomish same-disk updates of
> metadata; because it all depends usually on how much randomish
> metadata updates need to be done, on any filesystem type, as
> those must be done with barriers.
>
>> As I’m not using snapshots: would large files (100+gb)
>
> Using
Re: [PATCH 2/4] btrfs: Communicate back ENOMEM when it occurs
On Mon, Mar 13, 2017 at 03:42:12PM +0800, Anand Jain wrote:
> The only error that write dev flush (send) will fail is due
> to the ENOMEM then, as its not a device specific error and
> rather a system wide issue, we should rather stop further
> iterations and perpetuate the -ENOMEM error to the caller.

I think we should try harder, as flushing is a critical operation and a simple yet unlikely memory allocation failure should not stop it.

The device::flush_bio is being allocated each time we start a flush and freed afterwards. This seems unnecessary. The proper fix IMO is to preallocate the bio at the time the device is added to the list. The bio's lifetime is then the same as the device's.
Re: Shrinking a device - performance?
I’ve glazed over on “Not only that …” … can you make a youtube video of that?

> On 28 Mar 2017, at 16:06, Peter Grandi wrote:
>
>> I glazed over at “This is going to be long” … :)
>>> [ ... ]
>
> Not only that, you also top-posted while quoting it pointlessly
> in its entirety, to the whole mailing list. Well played :-).

It’s because I’m special :*

On a real note, thanks for giving a f to provide a detailed comment … too much of open source stuff is based on short comments :/
Re: Qgroups are not applied when snapshotting a subvol?
On 2017-03-28 09:53, Marat Khalili wrote: There are a couple of reasons I'm advocating the specific behavior I outlined: Some of your points are valid, but some break current behaviour and expectations or create technical difficulties. 1. It doesn't require any specific qgroup setup. By definition, you can be 100% certain that the destination qgroup exists, and that you won't need to create new qgroups behind the user's back (given your suggestion, what happens when qgroup 1/N doesn't exist?). This is a general problem with current qgroups: you have to reference them by some random numbers, not by user-assigned names like files. It would need to be fixed sooner or later with syntax like L: in place of L/N, or even some special syntax made specifically for path snapshots. BTW, what about reserving level 1 for qgroups describing subvolumes and all their snapshots and forbidding manual management of qgroups at this level? If we were going to reserve something, it should be a high number, not a low one. Having 0 reserved makes some sense, but reserving other low numbers seems kind of odd when they aren't already reserved. 2. Just because it's the default, doesn't mean that the subvolume can't be reassigned to a different qgroup. This also would not remove the ability to assign a specific qgroup through the snapshot creation command. This is arguably a general point in favor of having any default of course, but it's still worth pointing out. Currently 0/N qgroups are special in that they are created automatically and belong to the bottom of the hierarchy. It would be very nice to keep it this way. Changing qgroup assignments after snapshot creation is very inconvenient because it requires quota rescan and thus blocks all other quota operations. The subvolume create command lets you set a specific qgroup on subvolume creation. 
When I said it won't prevent assigning a specific qgroup, I meant using that, not some after-the-fact assignment, although I hadn't realized that the snapshot command _does not_ have this argument, when it absolutely should.

3. Because BTRFS has COW semantics, the new snapshot should take up near zero space in the qgroup of its parent.

Indeed it works this way in my experiments if you assign the snapshot to the 1/N qgroup at creation where 0/N also belongs. Moreover, it does not require a quota rescan, which is very nice.

I mean if it's placed in that qgroup itself. If you create a snapshot of an idle subvolume, you only effectively double the metadata (and possibly not even that completely), so unless it has a very large number of small files, it should take up next to zero space if it gets put in its parent's qgroup.

4. This correlates with the behavior most people expect based on ZFS and LVM, which is that snapshots are tied to their parent.

I'm not against tying it to the parent. I'm against removing the snapshot's own qgroup.

Perhaps have an option (in the filesystem, not the mount options; this is tied to qgroup policy, which should be independent of who mounts the FS where) that specifies one of: 1. Current behavior. 2. Join parent's qgroup. 3. Join some other predefined qgroup.

At a minimum, it should belong to _some_ qgroup. This could also be covered by having a designated 'default' qgroup that all new subvolumes created without a specified qgroup get put in, but I feel that that is somewhat orthogonal to the issue of how snapshots are handled.

It belongs to its own 0/N' qgroup, but this is probably not what you mean.

The 0/N groups are accounting-only until some limit is set, correct? If that is the case, then that really isn't what I mean. Quotas can be used for accounting, but when used, they are often intended for restricting resource usage.
New subvolumes going into unrestricted qgroups by default completely eliminates any usability for resource restriction the moment anybody but the admin has any shell access to the system, and that's my biggest complaint against the current behavior. Putting a snapshot in its parent's qgroup would completely close this hole and provide semantics that most people are likely to understand pretty easily.
Re: Shrinking a device - performance?
> I glazed over at “This is going to be long” … :)
>> [ ... ]

Not only that, you also top-posted while quoting it pointlessly in its entirety, to the whole mailing list. Well played :-).
Re: [PATCH 1/4] btrfs: REQ_PREFLUSH does not use btrfs_end_bio() completion callback
On Mon, Mar 13, 2017 at 03:42:11PM +0800, Anand Jain wrote:
> REQ_PREFLUSH bio to flush dev cache uses btrfs_end_empty_barrier()
> completion callback only, as of now, and there it accounts for dev
> stat flush errors BTRFS_DEV_STAT_FLUSH_ERRS, so remove it from the
> btrfs_end_bio().

Can you please be more specific? I was trying to find the code path that does the supposedly duplicate accounting for BTRFS_DEV_STAT_FLUSH_ERRS, but am still not there after half an hour.

submit_stripe_bio
  btrfsic_submit_bio
    btrfs_end_bio -> stats accounting

write_dev_flush
  btrfsic_submit_bio
    btrfs_end_empty_barrier
      complete

Now here it wakes up flush_wait, which is only waited for in write_dev_flush; there the FLUSH_ERRS accounting happens, but ... I'm not sure whether I've gotten lost in the bio handling maze.
Re: Shrinking a device - performance?
> [ ... ] slaps together a large storage system in the cheapest
> and quickest way knowing that while it is mostly empty it will
> seem very fast regardless and therefore to have awesome
> performance, and then the "clever" sysadm disappears surrounded
> by a halo of glory before the storage system gets full workload
> and fills up; [ ... ]

Fortunately or unfortunately Btrfs is particularly suitable for this technique, as it has an enormous number of checkbox-ticking, awesome-looking features: transparent compression, dynamic add/remove, online balance/scrub, different-sized member devices, online grow/shrink, online defrag, limitless scalability, online dedup, arbitrary subvolumes and snapshots, COW and reflinking, online conversion of RAID profiles, ... and one can use all of them at the same time; and for the initial period, while volume workload is low and not much space is used, it will look absolutely fantastic: cheap, flexible, always available, fast, the work of genius of a very cool sysadm.
Re: Shrinking a device - performance?
> [ ... ] reminded of all the cases where someone left me to
> decatastrophize a storage system built on "optimistic"
> assumptions.

In particular when some "clever" sysadm with a "clever" (or dumb) manager slaps together a large storage system in the cheapest and quickest way, knowing that while it is mostly empty it will seem very fast regardless, and therefore appear to have awesome performance; and then the "clever" sysadm disappears surrounded by a halo of glory before the storage system gets its full workload and fills up. When that happens, usually I get to inherit it. BTW, the same technique can also be used with HPC clusters.

>> I intended to shrink a ~22TiB filesystem down to 20TiB. This
>> is still using LVM underneath so that I can’t just remove a
>> device from the filesystem but have to use the resize
>> command.
>
>> Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>>   Total devices 1 FS bytes used 18.21TiB
>>   devid 1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy

Ahh, it is indeed a filled-up storage system now running a full workload. At least it wasn't me who inherited it this time. :-)
Re: Shrinking a device - performance?
I glazed over at “This is going to be long” … :)

> On 28 Mar 2017, at 15:43, Peter Grandi wrote:
>
> [ ... message quoted in its entirety; see the original below ... ]
Re: Shrinking a device - performance?
This is going to be long because I am writing something detailed hoping pointlessly that someone in the future will find it by searching the list archives while doing research before setting up a new storage system, and they will be the kind of person that tolerates reading messages longer than Twitter. :-). > I’m currently shrinking a device and it seems that the > performance of shrink is abysmal. When I read this kind of statement I am reminded of all the cases where someone left me to decatastrophize a storage system built on "optimistic" assumptions. The usual "optimism" is what I call the "syntactic approach", that is the axiomatic belief that any syntactically valid combination of features not only will "work", but very fast too and reliably despite slow cheap hardware and "unattentive" configuration. Some people call that the expectation that system developers provide or should provide an "O_PONIES" option. In particular I get very saddened when people use "performance" to mean "speed", as the difference between the two is very great. As a general consideration, shrinking a large filetree online in-place is an amazingly risky, difficult, slow operation and should be a last desperate resort (as apparently in this case), regardless of the filesystem type, and expecting otherwise is "optimistic". My guess is that very complex risky slow operations like that are provided by "clever" filesystem developers for "marketing" purposes, to win box-ticking competitions. That applies to those system developers who do know better; I suspect that even some filesystem developers are "optimistic" as to what they can actually achieve. > I intended to shrink a ~22TiB filesystem down to 20TiB. This is > still using LVM underneath so that I can’t just remove a device > from the filesystem but have to use the resize command. That is actually a very good idea because Btrfs multi-device is not quite as reliable as DM/LVM2 multi-device. 
> Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4 > Total devices 1 FS bytes used 18.21TiB > devid 1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy Maybe 'balance' should have been used a bit more. > This has been running since last Thursday, so roughly 3.5 days > now. The “used” number in devid 1 has moved about 1TiB in this > time. The filesystem is seeing regular usage (read and write) > and when I’m suspending any application traffic I see about > 1GiB of movement every now and then. Maybe once every 30 > seconds or so. Does this sound fishy or normal to you? With consistent "optimism" this is a request to assess whether "performance" of some operations is adequate on a filetree without telling us either what the filetree contents look like, what the regular workload is, or what the storage layer looks like. Being one of the few system administrators crippled by lack of psychic powers :-), I rely on guesses and inferences here, and having read the whole thread containing some belated details. From the ~22TB total capacity my guess is that the storage layer involves rotating hard disks, and from later details the filesystem contents seem to be heavily reflinked files of several GB in size, and the workload seems to be backups to those files from several source hosts. Considering the general level of "optimism" in the situation my wild guess is that the storage layer is based on large slow cheap rotating disks in the 4TB-8TB range, with very low IOPS-per-TB. > Thanks for that info. The 1min per 1GiB is what I saw too - > the “it can take longer” wasn’t really explainable to me. A contemporary rotating disk device can do around 0.5MB/s transfer rate with small random accesses with barriers up to around 80-160MB/s in purely sequential access without barriers. 
1GB/m of simultaneous read-write means around 16MB/s reads plus 16MB/s writes which is fairly good *performance* (even if slow *speed*) considering that moving extents around, even across disks, involves quite a bit of randomish same-disk updates of metadata; because it all depends usually on how much randomish metadata updates need to be done, on any filesystem type, as those must be done with barriers. > As I’m not using snapshots: would large files (100+gb) Using 100GB sized VM virtual disks (never mind with COW) seems very unwise to me to start with, but of course a lot of other people know better :-). Just like a lot of other people know better that large single pool storage systems are awesome in every respect :-): cost, reliability, speed, flexibility, maintenance, etc. > with long chains of CoW history (specifically reflink copies) > also hurt? Oh yes... They are about one of the worst cases for using Btrfs. But also very "optimistic" to think that kind of stuff can work awesomely on *any* filesystem type. > Something I’d like to verify: does having traffic on the > volume have the potential to delay this infinitely? [ ... ] > it’s just slow and we’re looking forward to about
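For what it's worth, the figures quoted earlier in this message (~1TiB relocated in ~3.5 days, devid used 20.71TiB against a 20.00TiB target) can be turned into a back-of-envelope completion estimate. This is a sketch with assumed round numbers, not a prediction: relocation rate varies with fragmentation, reflink density, and concurrent traffic, and shell integer arithmetic truncates.

```shell
# Rough estimate only; inputs are the approximate figures from the thread.
moved_gib=1024                              # ~1 TiB relocated so far
elapsed_min=$((3 * 24 * 60 + 12 * 60))      # ~3.5 days, in minutes
min_per_gib=$((elapsed_min / moved_gib))    # observed relocation rate
remaining_gib=727                           # ~20.71 TiB used - 20.00 TiB target
echo "~${min_per_gib} min/GiB, ~$((min_per_gib * remaining_gib / 1440)) days left"
```

Without truncation the rate works out to ~4.9 min/GiB, i.e. roughly 2.5 more days — consistent with the ~1 min/GiB "best case" seen with application traffic suspended being several times faster than the long-run average.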
Re: Qgroups are not applied when snapshotting a subvol?
There are a couple of reasons I'm advocating the specific behavior I outlined: Some of your points are valid, but some break current behaviour and expectations or create technical difficulties. 1. It doesn't require any specific qgroup setup. By definition, you can be 100% certain that the destination qgroup exists, and that you won't need to create new qgroups behind the user's back (given your suggestion, what happens when qgroup 1/N doesn't exist?). This is a general problem with current qgroups: you have to reference them by some random numbers, not by user-assigned names like files. It would need to be fixed sooner or later with syntax like L: in place of L/N, or even some special syntax made specifically for path snapshots. BTW, what about reserving level 1 for qgroups describing subvolumes and all their snapshots and forbidding manual management of qgroups at this level? 2. Just because it's the default doesn't mean that the subvolume can't be reassigned to a different qgroup. This also would not remove the ability to assign a specific qgroup through the snapshot creation command. This is arguably a general point in favor of having any default of course, but it's still worth pointing out. Currently 0/N qgroups are special in that they are created automatically and belong to the bottom of the hierarchy. It would be very nice to keep it this way. Changing qgroup assignments after snapshot creation is very inconvenient because it requires a quota rescan and thus blocks all other quota operations. 3. Because BTRFS has COW semantics, the new snapshot should take up near zero space in the qgroup of its parent. Indeed it works this way in my experiments if you assign the snapshot to the 1/N qgroup at creation where 0/N also belongs. Moreover, it does not require a quota rescan, which is very nice. 4. This correlates with the behavior most people expect based on ZFS and LVM, which is that snapshots are tied to their parent. I'm not against tying it to the parent. 
I'm against removing the snapshot's own qgroup. At a minimum, it should belong to _some_ qgroup. This could also be covered by having a designated 'default' qgroup that all new subvolumes created without a specified qgroup get put in, but I feel that that is somewhat orthogonal to the issue of how snapshots are handled. It belongs to its own 0/N' qgroup, but this is probably not what you mean. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
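The 0/N-inside-1/N arrangement Marat describes can be set up with stock btrfs-progs commands; a hypothetical session (mount point, subvolume name, and qgroup IDs are made up for illustration — the auto-created 0/N ID must be looked up with `btrfs subvolume list`):

```shell
btrfs quota enable /mnt
btrfs subvolume create /mnt/A                # kernel auto-creates A's 0/N qgroup
btrfs qgroup create 1/1 /mnt                 # higher-level qgroup for A + its snapshots
btrfs qgroup assign 0/257 1/1 /mnt           # 0/257: A's auto qgroup (ID hypothetical)
btrfs qgroup limit 5m 1/1 /mnt               # limit covers the subvolume and snapshots
btrfs subvolume snapshot -i 1/1 /mnt/A /mnt/A-snap   # snapshot joins 1/1 at creation
```

The key step is the `-i 1/1` on the snapshot command: assigning the snapshot's 0/N' qgroup to 1/1 at creation time avoids the quota rescan that a later `btrfs qgroup assign` would trigger.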
[PATCH] btrfs: sink GFP flags parameter to tree_mod_log_insert_move
All (1) callers pass the same value.

Signed-off-by: David Sterba
---
 fs/btrfs/ctree.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 7dc8844037e0..d034d47c5470 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -567,7 +567,7 @@ tree_mod_log_insert_key(struct btrfs_fs_info *fs_info,
 static noinline int
 tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
                         struct extent_buffer *eb, int dst_slot, int src_slot,
-                        int nr_items, gfp_t flags)
+                        int nr_items)
 {
        struct tree_mod_elem *tm = NULL;
        struct tree_mod_elem **tm_list = NULL;
@@ -578,11 +578,11 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
        if (!tree_mod_need_log(fs_info, eb))
                return 0;

-       tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), flags);
+       tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), GFP_NOFS);
        if (!tm_list)
                return -ENOMEM;

-       tm = kzalloc(sizeof(*tm), flags);
+       tm = kzalloc(sizeof(*tm), GFP_NOFS);
        if (!tm) {
                ret = -ENOMEM;
                goto free_tms;
        }
@@ -596,7 +596,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
        for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
                tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
-                               MOD_LOG_KEY_REMOVE_WHILE_MOVING, flags);
+                               MOD_LOG_KEY_REMOVE_WHILE_MOVING, GFP_NOFS);
                if (!tm_list[i]) {
                        ret = -ENOMEM;
                        goto free_tms;
                }
@@ -873,7 +873,7 @@ tree_mod_log_eb_move(struct btrfs_fs_info *fs_info, struct extent_buffer *dst,
 {
        int ret;
        ret = tree_mod_log_insert_move(fs_info, dst, dst_offset, src_offset,
-                                      nr_items, GFP_NOFS);
+                                      nr_items);
        BUG_ON(ret < 0);
 }
-- 
2.12.0
[PATCH] btrfs: sink GFP flags parameter to tree_mod_log_insert_root
All (1) callers pass the same value.

Signed-off-by: David Sterba
---
 fs/btrfs/ctree.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d034d47c5470..165e7ec12af7 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -663,7 +663,7 @@ __tree_mod_log_free_eb(struct btrfs_fs_info *fs_info,
 static noinline int
 tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
                         struct extent_buffer *old_root,
-                        struct extent_buffer *new_root, gfp_t flags,
+                        struct extent_buffer *new_root,
                         int log_removal)
 {
        struct tree_mod_elem *tm = NULL;
@@ -678,14 +678,14 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
        if (log_removal && btrfs_header_level(old_root) > 0) {
                nritems = btrfs_header_nritems(old_root);
                tm_list = kcalloc(nritems, sizeof(struct tree_mod_elem *),
-                                 flags);
+                                 GFP_NOFS);
                if (!tm_list) {
                        ret = -ENOMEM;
                        goto free_tms;
                }
                for (i = 0; i < nritems; i++) {
                        tm_list[i] = alloc_tree_mod_elem(old_root, i,
-                           MOD_LOG_KEY_REMOVE_WHILE_FREEING, flags);
+                           MOD_LOG_KEY_REMOVE_WHILE_FREEING, GFP_NOFS);
                        if (!tm_list[i]) {
                                ret = -ENOMEM;
                                goto free_tms;
@@ -693,7 +693,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
                }
        }

-       tm = kzalloc(sizeof(*tm), flags);
+       tm = kzalloc(sizeof(*tm), GFP_NOFS);
        if (!tm) {
                ret = -ENOMEM;
                goto free_tms;
@@ -943,7 +943,7 @@ tree_mod_log_set_root_pointer(struct btrfs_root *root,
 {
        int ret;
        ret = tree_mod_log_insert_root(root->fs_info, root->node,
-                                      new_root_node, GFP_NOFS, log_removal);
+                                      new_root_node, log_removal);
        BUG_ON(ret < 0);
 }
-- 
2.12.0
[PATCH] btrfs: track exclusive filesystem operation in flags
There are several operations, usually started from ioctls, that cannot run concurrently. The status is tracked in mutually_exclusive_operation_running as an atomic_t. We can easily track the status as one of the per-filesystem flag bits with same synchronization guarantees. The conversion replaces:

* atomic_xchg(..., 1) -> test_and_set_bit(FLAG, ...)
* atomic_set(..., 0)  -> clear_bit(FLAG, ...)

Signed-off-by: David Sterba
---
 fs/btrfs/ctree.h       |  7 +--
 fs/btrfs/dev-replace.c |  5 ++---
 fs/btrfs/ioctl.c       | 33 +++--
 fs/btrfs/volumes.c     |  6 +++---
 4 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29b7fc28c607..2521752c4e4a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -702,6 +702,11 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_BTREE_ERR                     11
 #define BTRFS_FS_LOG1_ERR                      12
 #define BTRFS_FS_LOG2_ERR                      13
+/*
+ * Indicate that a whole-filesystem exclusive operation is running
+ * (device replace, resize, device add/delete, balance)
+ */
+#define BTRFS_FS_EXCL_OP                       14

 struct btrfs_fs_info {
        u8 fsid[BTRFS_FSID_SIZE];
@@ -1067,8 +1072,6 @@ struct btrfs_fs_info {
        /* device replace state */
        struct btrfs_dev_replace dev_replace;

-       atomic_t mutually_exclusive_operation_running;
-
        struct percpu_counter bio_counter;
        wait_queue_head_t replace_wait;

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e653921f05d9..de7b2c897fe0 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -784,8 +784,7 @@ int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info)
        }
        btrfs_dev_replace_unlock(dev_replace, 1);

-       WARN_ON(atomic_xchg(
-               &fs_info->mutually_exclusive_operation_running, 1));
+       WARN_ON(test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags));
        task = kthread_run(btrfs_dev_replace_kthread, fs_info, "btrfs-devrepl");
        return PTR_ERR_OR_ZERO(task);
 }
@@ -814,7 +813,7 @@ static int btrfs_dev_replace_kthread(void *data)
                        (unsigned int)progress);
        }
        btrfs_dev_replace_continue_on_mount(fs_info);
-       atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
+       clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);

        return 0;
 }
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index dabfc7ac48a6..a29dc3fd7152 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1504,7 +1504,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
        if (ret)
                return ret;

-       if (atomic_xchg(&fs_info->mutually_exclusive_operation_running, 1)) {
+       if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
                mnt_drop_write_file(file);
                return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
        }
@@ -1619,7 +1619,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
        kfree(vol_args);
 out:
        mutex_unlock(&fs_info->volume_mutex);
-       atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
+       clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
        mnt_drop_write_file(file);
        return ret;
 }
@@ -2661,7 +2661,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg)
        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;

-       if (atomic_xchg(&fs_info->mutually_exclusive_operation_running, 1))
+       if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
                return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;

        mutex_lock(&fs_info->volume_mutex);
@@ -2680,7 +2680,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg)
        kfree(vol_args);
 out:
        mutex_unlock(&fs_info->volume_mutex);
-       atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
+       clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
        return ret;
 }
@@ -2708,7 +2708,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
        if (vol_args->flags & ~BTRFS_VOL_ARG_V2_FLAGS_SUPPORTED)
                return -EOPNOTSUPP;

-       if (atomic_xchg(&fs_info->mutually_exclusive_operation_running, 1)) {
+       if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
                ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
                goto out;
        }
@@ -2721,7 +2721,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
                ret = btrfs_rm_device(fs_info, vol_args->name, 0);
        }
        mutex_unlock(&fs_info->volume_mutex);
-       atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
+       clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);

        if (!ret) {
                if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
@@ -2752,7 +2752,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
        if (ret)
Re: [PATCH v2] Btrfs: fix unexpected file hole after disk errors
On Mon, Mar 06, 2017 at 12:23:30PM -0800, Liu Bo wrote: > Btrfs creates hole extents to cover any unwritten section right before > doing buffer writes after commit 3ac0d7b96a26 ("btrfs: Change the expanding > write sequence to fix snapshot related bug."). > > However, that takes the start position of the buffered write to compare > against the current EOF; hole extents would be created only if (EOF < > start). > > If the EOF is in the middle of the buffered write, no hole extents will be > created and a file hole without a hole extent is left in this file. > > This bug was revealed by generic/019 in fstests. 'fsstress' in this test > may create the above situation and the test then fails all requests > including writes, so the buffered write which is supposed to cover the > hole (without the hole extent) couldn't make it on disk. Running fsck > against such a btrfs ends up detecting file extent holes. > > Things could be more serious: some stale data would be exposed to > userspace if files with this kind of hole are truncated to a position in > the hole, because the on-disk inode size is beyond the last extent in the > file. > > This fixes the bug by comparing the end position against the EOF. Is the test reliable? As I read it, it should be possible to craft the file extents and trigger the bug. And verify the fix. 
[PATCH] btrfs: drop redundant parameters from btrfs_map_sblock
All callers pass 0 for mirror_num and 1 for need_raid_map.

Signed-off-by: David Sterba
---
 fs/btrfs/scrub.c   | 6 +++---
 fs/btrfs/volumes.c | 6 ++
 fs/btrfs/volumes.h | 3 +--
 3 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b0251eb1239f..c1a1fa6e9cce 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1331,7 +1331,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
                 * represents one mirror
                 */
                ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
-                               logical, &mapped_length, &bbio, 0, 1);
+                               logical, &mapped_length, &bbio);
                if (ret || !bbio || mapped_length < sublen) {
                        btrfs_put_bbio(bbio);
                        return -EIO;
                }
@@ -2188,7 +2188,7 @@ static void scrub_missing_raid56_pages(struct scrub_block *sblock)
        int i;

        ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
-                       &length, &bbio, 0, 1);
+                       &length, &bbio);
        if (ret || !bbio || !bbio->raid_map)
                goto bbio_out;
@@ -2776,7 +2776,7 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
        length = sparity->logic_end - sparity->logic_start;
        ret = btrfs_map_sblock(fs_info, BTRFS_MAP_WRITE, sparity->logic_start,
-                       &length, &bbio, 0, 1);
+                       &length, &bbio);
        if (ret || !bbio || !bbio->raid_map)
                goto bbio_out;

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 73d56eef5e60..e8e7b1f9b6d9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5886,11 +5886,9 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 /* For Scrub/replace */
 int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
                     u64 logical, u64 *length,
-                    struct btrfs_bio **bbio_ret, int mirror_num,
-                    int need_raid_map)
+                    struct btrfs_bio **bbio_ret)
 {
-       return __btrfs_map_block(fs_info, op, logical, length, bbio_ret,
-                                mirror_num, need_raid_map);
+       return __btrfs_map_block(fs_info, op, logical, length, bbio_ret, 0, 1);
 }

 int btrfs_rmap_block(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 59be81206dd7..a399d40686dc 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -400,8 +400,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
                    struct btrfs_bio **bbio_ret, int mirror_num);
 int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
                     u64 logical, u64 *length,
-                    struct btrfs_bio **bbio_ret, int mirror_num,
-                    int need_raid_map);
+                    struct btrfs_bio **bbio_ret);
 int btrfs_rmap_block(struct btrfs_fs_info *fs_info,
                     u64 chunk_start, u64 physical, u64 devid,
                     u64 **logical, int *naddrs, int *stripe_len);
-- 
2.12.0
Re: Qgroups are not applied when snapshotting a subvol?
On 2017-03-28 08:00, Marat Khalili wrote: The default should be to inherit the qgroup of the parent subvolume. This behaviour is only good for this particular use-case. In the general case, qgroups of a subvolume and its snapshots should exist separately, and both can be included in some higher level qgroup (after all, that's what the qgroup hierarchy is for). In my system I found it convenient to include a subvolume and its snapshots in qgroup 1/N, where 0/N is the qgroup of the bare subvolume. I think adopting this behaviour as the default would be more sensible. There are a couple of reasons I'm advocating the specific behavior I outlined: 1. It doesn't require any specific qgroup setup. By definition, you can be 100% certain that the destination qgroup exists, and that you won't need to create new qgroups behind the user's back (given your suggestion, what happens when qgroup 1/N doesn't exist?). 2. Just because it's the default doesn't mean that the subvolume can't be reassigned to a different qgroup. This also would not remove the ability to assign a specific qgroup through the snapshot creation command. This is arguably a general point in favor of having any default of course, but it's still worth pointing out. 3. Because BTRFS has COW semantics, the new snapshot should take up near zero space in the qgroup of its parent. 4. This correlates with the behavior most people expect based on ZFS and LVM, which is that snapshots are tied to their parent. At a minimum, it should belong to _some_ qgroup. This could also be covered by having a designated 'default' qgroup that all new subvolumes created without a specified qgroup get put in, but I feel that that is somewhat orthogonal to the issue of how snapshots are handled. -- With Best Regards, Marat Khalili On 28/03/17 14:24, Austin S. Hemmelgarn wrote: On 2017-03-27 15:32, Chris Murphy wrote: How about if qgroups are enabled, then non-root user is prevented from creating new subvolumes? 
Or is there a way for a new nested subvolume to be included in its parent's quota, rather than the new subvolume having a whole new quota limit? Tricky problem. The default should be to inherit the qgroup of the parent subvolume. The organization of subvolumes is hierarchical, and sane people expect things to behave as they look. Taking another angle, on ZFS, 'nested' (nested in quotes because ZFS' definition of 'nested' zvols is weird) inherit their parent's quota and reservations (essentially reverse quota), and they're not even inherently nested in the filesystem like subvolumes are, so we're differing from the only other widely used system that implements things in a similar manner. As far as the subvolume thing, there should be an option to disable user creation of subvolumes, and ideally it should be on by default because: 1. Users can't delete subvolumes by default. This means they can create but not destroy a resource by default, which means that a user can pretty easily accidentally cause issues for the system as a whole. 2. Correlating with 1, users being able to delete subvolumes by default is not safe on multiple levels (easy accidental data loss, numerous other issues), and thus user subvolume removal being off by default is significantly safer.
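Austin's points 1 and 2 above are easy to see in a terminal; a hypothetical session on a default-mounted btrfs filesystem (paths made up for illustration):

```shell
# As an unprivileged user, on a btrfs filesystem mounted at /mnt:
btrfs subvolume create /mnt/home/user/sub    # succeeds: creation needs no privilege
btrfs subvolume delete /mnt/home/user/sub    # fails with EPERM under default mount options
# Non-root deletion only works when the filesystem is mounted with
# -o user_subvol_rm_allowed (and the user has the usual write+exec permissions).
```

This is the create-but-not-destroy asymmetry: by default an unprivileged user can allocate a kernel-tracked resource (a subvolume) that only an administrator can remove again.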
Re: [PATCH v5] qgroup: Retry after commit on getting EDQUOT
On Mon, Mar 27, 2017 at 01:13:46PM -0500, Goldwyn Rodrigues wrote: > > > On 03/27/2017 12:36 PM, David Sterba wrote: > > On Mon, Mar 27, 2017 at 12:29:57PM -0500, Goldwyn Rodrigues wrote: > >> From: Goldwyn Rodrigues> >> > >> We are facing the same problem with EDQUOT which was experienced with > >> ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but > >> here is a quick fix, which may be too big a hammer. > >> > >> Quotas are reserved during the start of an operation, incrementing > >> qg->reserved. However, it is written to disk in a commit_transaction > >> which could take as long as commit_interval. In the meantime there > >> could be deletions which are not accounted for because deletions are > >> accounted for only while committed (free_refroot). So, when we get > >> a EDQUOT flush the data to disk and try again. > >> > >> This fixes fstests btrfs/139. > >> > >> Signed-off-by: Goldwyn Rodrigues > >> --- > >> Changes since v1: > >> - Changed start_delalloc_roots() to start_delalloc_inode() to target > >>the root in question only to reduce the amount of flush to be done. > >> - Added wait_ordered_extents(). > >> > >> Changes since v2: > >> - Revised patch header > >> - removed comment on combining conditions > >> - removed test case, to be done in fstests > >> > >> Changes sinve v3: > >> - testcase reinstated > >> - return value checks > >> > >> Changes since v4: > >> - removed testcase since btrfs/139 got incorporated in fstests > > > > The point was to keep the test inside the changelog as well. > > > Yes, that was before we had btrfs/139 in fstest. I have put it in the > patch header. However, if you really want the test script back, I can > put it there. Let me know. I think the test steps are worth keeping in the changelog, even if there's a fstest. I'll copy it from v4, patch added to 4.12 queue. 
Re: Qgroups are not applied when snapshotting a subvol?
The default should be to inherit the qgroup of the parent subvolume. This behaviour is only good for this particular use-case. In general case, qgroups of subvolume and snapshots should exist separately, and both can be included in some higher level qgroup (after all, that's what qgroup hierarchy is for). In my system I found it convenient to include subvolume and its snapshots in qgroup 1/N, where 0/N is qgroup of bare subvolume. I think adopting this behaviour as default would be more sensible. -- With Best Regards, Marat Khalili On 28/03/17 14:24, Austin S. Hemmelgarn wrote: On 2017-03-27 15:32, Chris Murphy wrote: How about if qgroups are enabled, then non-root user is prevented from creating new subvolumes? Or is there a way for a new nested subvolume to be included in its parent's quota, rather than the new subvolume having a whole new quota limit? Tricky problem. The default should be to inherit the qgroup of the parent subvolume. The organization of subvolumes is hierarchical, and sane people expect things to behave as they look. Taking another angle, on ZFS, 'nested' (nested in quotes because ZFS' definition of 'nested' zvols is weird) inherit their parent's quota and reservations (essentially reverse quota), and they're not even inherently nested in the filesystem like subvolumes are, so we're differing from the only other widely used system that implements things in a similar manner. As far as the subvolume thing, there should be an option to disable user creation of subvolumes, and ideally it should be on by default because: 1. Users can't delete subvolumes by default. This means they can create but not destroy a resource by default, which means that a user can pretty easily accidentally cause issues for the system as a whole. 2. 
Correlating with 1, users being able to delete subvolumes by default is not safe on multiple levels (easy accidental data loss, numerous other issues), and thus user subvolume removal being off by default is significantly safer.
Re: Qgroups are not applied when snapshotting a subvol?
On 2017-03-27 21:49, Qu Wenruo wrote: At 03/27/2017 08:01 PM, Austin S. Hemmelgarn wrote: On 2017-03-27 07:02, Moritz Sichert wrote: On 27.03.2017 05:46, Qu Wenruo wrote: At 03/27/2017 11:26 AM, Andrei Borzenkov wrote: 27.03.2017 03:39, Qu Wenruo wrote: At 03/26/2017 06:03 AM, Moritz Sichert wrote: Hi, I tried to configure qgroups on a btrfs filesystem but was really surprised that when you snapshot a subvolume, the snapshot will not be assigned to the qgroup the subvolume was in. As an example consider the small terminal session in the attachment: I create a subvol A, assign it to qgroup 1/1 and set a limit of 5M on that qgroup. Then I write a file into A and eventually get "disk quota exceeded". Then I create a snapshot of A and call it B. B will not be assigned to 1/1 and writing a file into B confirms that no limits at all are imposed for B. I feel like I must be missing something here. Considering that creating a snapshot does not require root privileges this would mean that any user can just circumvent any quota and therefore make them useless. Is there a way to enforce quotas even when a user creates snapshots? Yes, there is always a method to attach the subvolume/snapshot to a specified higher-level qgroup. Just use "btrfs subvolume snapshot -i 1/1". This requires cooperation from whoever creates the subvolume, while the question was - is it possible to enforce it, without need for an explicit option/action when the snapshot is created. To reiterate - if the user omits "-i 1/1" (s)he "escapes" from quota enforcement. What if the user really wants to create a subvolume assigned to another qgroup? You're implying a *policy* that if the source subvolume belongs to a higher-level qgroup, then a snapshot created from it should also follow that higher-level qgroup. However, the kernel should only provide *mechanism*, not *policy*. And btrfs does it: it provides a method to do it; whether to do so or not is the user's responsibility. 
If you want to implement that policy, please do it in a higher level, something like SUSE snapper, not in kernel. The problem is, I can't enforce the policy because *every user* can create snapshots. Even if I would restrict the btrfs executable so that only root can execute it, this doesn't help. As using the ioctl for btrfs is allowed for any user, they could just get the executable from somewhere else. To reiterate and reinforce this: If it is not possible to enforce new subvolumes counting for their parent quota, and there is no option to prevent non-root (or non-CAP_SYS_ADMIN) users from creating new subvolumes, then BTRFS qgroups are useless on any system with shell access because a user can trivially escape their quota restrictions (or hide from accounting) by creating a new subvolume which is outside of their qgroup and storing data there. Ideally, there should be an option to disable user subvolume creation (it arguably should be the default, because of resource exhaustion issues, but that's a separate argument), and there should be an option in the kernel to force specific behavior. Both cases are policy, but they are policy that can only be concretely enforced _by the kernel_. The problem is, how should we treat subvolume. Btrfs subvolume sits in the middle of directory and (logical) volume used in traditional stacked solution. While we allow normal user to create/delete/modify dir as long as they follow access control, we require privilege to create/delete/modify volumes. No, we require privilege to do certain modifications or delete subvolumes. Regular users can create subvolumes with no privileges whatsoever, and most basic directory operations (rename, chown, chmod, etc) work just fine within normal UNIX DAC permissions. Unless you're running some specially patched kernel or some LSM (SELinux possibly) that somehow restricts access to the ioctl, you can always create subvolumes. 
This is part of the reason that I'm personally hesitant to use BTRFS on systems where end users have shell access; it's a DoS waiting to happen. Developers chose to treat the btrfs subvolume as a directory, which makes it quite easy to operate for the normal use case, sacrificing the qgroup limit, which was not a major function (or did not even exist) at that time. IIRC in the early days of btrfs, we didn't have a full idea of what the use cases could be. This is common; a lot of problems (even bad design) can only be found after enough feedback from end users. Personally speaking, I prefer to restrict subvolume creation/deletion to privileged users only, and use a daemon as a proxy to do such privileged operations. Then we can do better accounting/access control without bothering the kernel. I will agree that a daemon for this can be useful, but even if we add a mount option to restrict operations on subvolumes by normal users, we can still provide the option of a daemon. On something like a single user system, there is not much advantage to having some complex access control in place. I'm not saying we should
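The quota escape that started this thread (Moritz's session, reiterated by Austin above) boils down to two unprivileged commands. A sketch with made-up paths, assuming subvolume A's 0/N qgroup is already in 1/1 and 1/1 carries a 5M limit that is exhausted:

```shell
dd if=/dev/zero of=/mnt/A/f bs=1M count=10     # fails: "disk quota exceeded" (1/1 is full)
btrfs subvolume snapshot /mnt/A /mnt/B          # no -i given, no privilege required
dd if=/dev/zero of=/mnt/B/f bs=1M count=10      # succeeds: B's fresh 0/N' qgroup is unlimited
```

Because the snapshot ioctl is available to any user with the right directory permissions, nothing forces the `-i 1/1` step, which is exactly the enforcement gap being debated.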
Re: [PATCH v2 0/2] Cleanup for some hardcoded constants
On Thu, Mar 16, 2017 at 10:04:32AM -0600, ednadol...@gmail.com wrote: > From: Edmund Nadolski> > This series replaces several hard-coded values with descriptive > symbols. > > --- > v2: > + rename SEQ_NONE to SEQ_LAST and move definition to ctree.h > + clarify comment at __merge_refs() > > Edmund Nadolski (2): > btrfs: provide enumeration for __merge_refs mode argument > btrfs: replace hardcoded value with SEQ_LAST macro Added to 4.12 queue, thanks.
Re: Qgroups are not applied when snapshotting a subvol?
On 2017-03-27 15:32, Chris Murphy wrote:
> How about if qgroups are enabled, then non-root user is prevented from
> creating new subvolumes? Or is there a way for a new nested subvolume
> to be included in its parent's quota, rather than the new subvolume
> having a whole new quota limit?
>
> Tricky problem.

The default should be to inherit the qgroup of the parent subvolume. The organization of subvolumes is hierarchical, and sane people expect things to behave as they look. Taking another angle: on ZFS, 'nested' zvols ('nested' in quotes because ZFS' definition of nesting is weird) inherit their parent's quotas and reservations (essentially reverse quotas), and they aren't even inherently nested in the filesystem the way subvolumes are, so we're differing from the only other widely used system that implements things in a similar manner.

As for the subvolume question, there should be an option to disable user creation of subvolumes, and ideally it should be on by default, because:

1. Users can't delete subvolumes by default. This means they can create but not destroy a resource by default, so a user can pretty easily, even accidentally, cause issues for the system as a whole.

2. Correlating with 1, users being able to delete subvolumes by default is not safe on multiple levels (easy accidental data loss, numerous other issues), and thus user subvolume removal being off by default is significantly safer.
Re: [PATCH] btrfs-progs: fix missing __error symbol in libbtrfs.so.0
On Mon, Mar 27, 2017 at 10:07:20PM +0100, sly...@gmail.com wrote:
> From: Sergei Trofimovich
>
> The easiest way to reproduce the error is to try to build
> btrfs-progs with
>     $ make LDFLAGS=-Wl,--no-undefined
>
>     btrfs-list.o: In function `lookup_ino_path':
>     btrfs-list.c:(.text+0x7d2): undefined reference to `__error'
>
> Noticed by Denis Descheneaux when the snapper tool
> stopped working after an upgrade to btrfs-progs-4.10.
>
> As the soname didn't change in the 4.9 -> 4.10 release,
> I assume it's just an object file missing from the
> library's dependencies, and not an API/ABI change
> of the library's error printing.
>
> Cc: linux-btrfs@vger.kernel.org
> Cc: Mike Gilbert
> Reported-by: Denis Descheneaux
> Bug: https://bugs.gentoo.org/show_bug.cgi?id=613890
>
> Signed-off-by: Sergei Trofimovich

I've replaced my patch with yours, thanks.
Re: __link_block_group uses GFP_KERNEL
On 3/27/17, David Sterba wrote:
> On Sat, Mar 25, 2017 at 09:48:28AM +0300, Denis Kirjanov wrote:
>> On 3/25/17, Jeff Mahoney wrote:
>>> On 3/24/17 5:02 AM, Denis Kirjanov wrote:
>>>> Hi guys,
>>>>
>>>> Looks like the current code does a GFP_KERNEL allocation inside
>>>> __link_block_group: the function invokes kobject_add, which
>>>> internally creates sysfs files with the GFP_KERNEL flag set.
>>>
>>> Yep, that's a bug.
>>>
>>>> But since do_chunk_alloc executes inside a btrfs transaction, it's
>>>> not allowed to sleep.
>>>
>>> It's allowed to sleep, but it isn't allowed to do reclaim that
>>> involves filesystem writeback. Michal Hocko's allocation-context
>>> idea would fix this, but it's not there yet, so we'll need to defer
>>> the kobject_add until we can use GFP_KERNEL.
>>
>> Ok, I see. Can you point me to the initial patchset?
>
> https://lwn.net/Articles/716323/
>
> Fixing this properly is a lot of work, so we might need to add a
> temporary workaround, as Jeff suggests, to move the call into sysfs to
> a later time.
>
> Care to send a patch?

Or I can dig a bit. Thanks!
Re: [PATCH v2 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
Thanks for the review first.

At 03/28/2017 12:38 AM, David Sterba wrote:
> On Fri, Mar 24, 2017 at 10:00:23AM +0800, Qu Wenruo wrote:
>> Unlike mirror based profiles, RAID5/6 recovery needs to read out the
>> whole full stripe. And if we don't do proper protect, it can easily
>> cause race condition.
>>
>> Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe()
>> for RAID5/6. Which stores a rb_tree of mutex for full stripes, so
>> scrub callers can use them to lock a full stripe to avoid race.
>>
>> Signed-off-by: Qu Wenruo
>> ---
>>  fs/btrfs/ctree.h       |   4 ++
>>  fs/btrfs/extent-tree.c |   3 +
>>  fs/btrfs/scrub.c       | 177 ++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 184 insertions(+)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 29b7fc28c607..4d570e1040e6 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -648,6 +648,10 @@ struct btrfs_block_group_cache {
>>  	 * Protected by free_space_lock.
>>  	 */
>>  	int needs_free_space;
>> +
>> +	/* Scrub full stripe lock tree for RAID5/6 scrub */
>> +	struct rb_root scrub_lock_root;
>
> The name scrub_lock_root is a bit confusing, it should better describe
> what object the tree stores.

I'll encapsulate this rb_root and spinlock into a new structure called full_stripe_locks_tree.

I tried renaming them to other names, but "lock" and "root/tree" make it quite hard to name the spinlock. (A name like "full_stripe_locks_root_lock" is just super stupid.)

>> +	spinlock_t scrub_lock;
>
> And the lock could be named accordingly.
>>  };
>>
>>  /* delayed seq elem */
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index be5477676cc8..db5d9f84535e 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -131,6 +131,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
>>  	if (atomic_dec_and_test(&cache->count)) {
>>  		WARN_ON(cache->pinned > 0);
>>  		WARN_ON(cache->reserved > 0);
>> +		WARN_ON(!RB_EMPTY_ROOT(&cache->scrub_lock_root));
>>  		kfree(cache->free_space_ctl);
>>  		kfree(cache);
>>  	}
>> @@ -9907,6 +9908,8 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>>  	atomic_set(&cache->count, 1);
>>  	spin_lock_init(&cache->lock);
>> +	spin_lock_init(&cache->scrub_lock);
>> +	cache->scrub_lock_root = RB_ROOT;
>>  	init_rwsem(&cache->data_rwsem);
>>  	INIT_LIST_HEAD(&cache->list);
>>  	INIT_LIST_HEAD(&cache->cluster_list);
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index b0251eb1239f..38b300ac4567 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -240,6 +240,13 @@ struct scrub_warning {
>>  	struct btrfs_device *dev;
>>  };
>>
>> +struct scrub_full_stripe_lock {
>> +	struct rb_node node;
>> +	u64 logical;
>> +	u64 refs;
>> +	struct mutex mutex;
>> +};
>> +
>>  static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
>>  static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
>>  static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
>> @@ -349,6 +356,176 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
>>  }
>>
>>  /*
>> + * Caller must hold cache->scrub_root_lock.
>> + *
>> + * Return existing full stripe and increase it refs
>> + * Or return NULL, and insert @fstripe_lock into the bg cache
>> + */
>> +static struct scrub_full_stripe_lock *
>> +add_scrub_lock(struct btrfs_block_group_cache *cache,
>> +	       struct scrub_full_stripe_lock *fstripe_lock)
>
> Please stick to the function prototype common in that file, ie. type
> and name on the same line. The arguments can go below it.

Understood.

> The function name is also quite confusing.

What about insert_full_stripe_lock()?
>> +{
>> +	struct rb_node **p;
>> +	struct rb_node *parent = NULL;
>> +	struct scrub_full_stripe_lock *entry;
>> +
>> +	p = &cache->scrub_lock_root.rb_node;
>> +	while (*p) {
>> +		parent = *p;
>> +		entry = rb_entry(parent, struct scrub_full_stripe_lock, node);
>> +		if (fstripe_lock->logical < entry->logical) {
>> +			p = &(*p)->rb_left;
>> +		} else if (fstripe_lock->logical > entry->logical) {
>> +			p = &(*p)->rb_right;
>> +		} else {
>> +			entry->refs++;
>> +			return entry;
>> +		}
>> +	}
>> +	/* Insert new one */
>> +	rb_link_node(&fstripe_lock->node, parent, p);
>> +	rb_insert_color(&fstripe_lock->node, &cache->scrub_lock_root);
>> +
>> +	return NULL;
>> +}
>> +
>> +static struct scrub_full_stripe_lock *
>> +search_scrub_lock(struct btrfs_block_group_cache *cache, u64 bytenr)
>
> Same here.

What about search_full_stripe_lock()?

>> +{
>> +	struct rb_node *node;
>> +	struct scrub_full_stripe_lock *entry;
>> +
>> +	node = cache->scrub_lock_root.rb_node;
>> +	while (node) {
>> +		entry = rb_entry(node, struct scrub_full_stripe_lock, node);
>> +		if (bytenr < entry->logical)
>> +			node = node->rb_left;
>> +		else if (bytenr > entry->logical)