Re: [PATCH 0/2] Policy to balance read across mirrored devices
A little question about mount -o read_mirror_policy=. How would this work with RAID1 over 3 or 4 HDD's? In particular, if the desired block is not available on device . Could i repeat this option like the device-option to specify a order/priority like this: mount -o read_mirror_policy= 1,read_mirror_policy= 3 2018-01-30 7:30 GMT+01:00 Anand Jain: > In case of RAID1 and RAID10 devices are mirror-ed, a read IO can > pick any device for reading. This choice of picking a device for > reading should be configurable. In short not one policy would > satisfy all types of workload and configs. > > So before we add more policies, this patch-set makes existing > $pid policy configurable from the mount option and adds a devid > based read_mirror device policy. And keeps $pid based policy as > the default option for now. So this mount option helps to try out > different read mirror load balances. > > Further we can add more policies on top of it, for example.. > > mount -o read_mirror_policy=pid (current, default) [1] > mount -o read_mirror_policy= [2] > mount -o read_mirror_policy=lba (illustration only) [3] > mount -o read_mirror_policy=ssd (illustration only) [4] > mount -o read_mirror_policy=io (illustration only) [5] > mount -o read_mirror_policy= (illustration only) [6] > > [1] > Current PID based read mirror balance. > > [2] > Set the devid of the device which should be used for read. That > means all the normal read will go to that particular device only. > This also helps testing and gives a better control for the test > scripts including mount context reads. > > [3] > In case of SAN storage, some storage prefers that host access the > LUN based on the LBA so that there won't be duplications of > caching on the storage. > > [4] > In case of mix of SSD and HD we may want to use SSD as the primary > read device. > > [5] > If storage caching is not the bottleneck but the transport layer > is then read load should be tuned based on the IO load. > > [6] > Or a combination of any of above as per the priority. > Timofey should consider to base his patch on top of this. > Btrfs: enchanse raid1/10 balance heuristic > > This patch set is on top of the preparatory patch set: > [PATCH 0/2] Preparatory to add read_mirror mount option > > > Anand Jain (2): > btrfs: add mount option read_mirror_policy > btrfs: add read_mirror_policy parameter devid > > fs/btrfs/ctree.h | 2 ++ > fs/btrfs/super.c | 31 +++ > fs/btrfs/volumes.c | 18 +- > fs/btrfs/volumes.h | 7 +++ > 4 files changed, 57 insertions(+), 1 deletion(-) > > -- > 2.7.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: volumes: Cleanup stripe size calculation
On 31.01.2018 08:16, Qu Wenruo wrote: > Cleanup the following things: > 1) open coded SZ_16M round up > 2) use min() to replace open-coded size comparison > 3) code style > > Signed-off-by: Qu WenruoReviewed-by: Nikolay Borisov > --- > fs/btrfs/volumes.c | 11 +-- > 1 file changed, 5 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index cb0a8d27661b..90ad716d2b50 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -4761,18 +4761,17 @@ static int __btrfs_alloc_chunk(struct > btrfs_trans_handle *trans, >* and compare that answer with the max chunk size >*/ > if (stripe_size * data_stripes > max_chunk_size) { > - u64 mask = (1ULL << 24) - 1; > - > stripe_size = div_u64(max_chunk_size, data_stripes); > > /* bump the answer up to a 16MB boundary */ > - stripe_size = (stripe_size + mask) & ~mask; > + stripe_size = round_up(stripe_size, SZ_16M); > > - /* but don't go higher than the limits we found > + /* > + * but don't go higher than the limits we found >* while searching for free extents >*/ > - if (stripe_size > devices_info[ndevs-1].max_avail) > - stripe_size = devices_info[ndevs-1].max_avail; > + stripe_size = min(devices_info[ndevs - 1].max_avail, > + stripe_size); > } > > stripe_size = div_u64(stripe_size, dev_stripes); > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: volumes: Remove the meaningless condition of minimal nr_devs when allocating a chunk
On 31.01.2018 07:56, Qu Wenruo wrote: > When checking the minimal nr_devs, there is one dead and meaningless > condition: > > if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) { > > > This condition is meaningless, @devs_increment has nothing to do with > @sub_stripes. > > In fact, in btrfs_raid_array[], profile with sub_stripes larger than 1 > (RAID10) already has the @devs_increment set to 2. > So no need to multiple it by @sub_stripes. > > And above condition is also dead. > For RAID10, @devs_increment * @sub_stripes equals 4, which is also the > @devs_min of RAID10. > For other profiles, @sub_stripes is always 1, and since @ndevs is > rounded down to @devs_increment, the condition will always be true. > > Remove the meaningless condition to make later reader wander less. > > Signed-off-by: Qu WenruoReviewed-by: Nikolay Borisov Quick question : What exactly is a substripe? Stripe is essentially how many contiguous portions of disk are necessary to satisfy the profile, right? So for raid1 we write 1 copy of the data per device (hence dev_stripes = 1). For DUP we have 2 copies of the data on the same disk hence dev_stripes 2. How does sub_stripes fit in the grand scheme of things? > --- > fs/btrfs/volumes.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 215e85e22c8e..cb0a8d27661b 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -4729,7 +4729,7 @@ static int __btrfs_alloc_chunk(struct > btrfs_trans_handle *trans, > /* round down to number of usable stripes */ > ndevs = round_down(ndevs, devs_increment); > > - if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) { > + if (ndevs < devs_min) { > ret = -ENOSPC; > goto error; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/2] fix device orders consistency
On 31.01.2018 04:28, Anand Jain wrote: > v3->v2: > rename device_sort() to device_cmp(). > v1->v2: > No code change. Change log updated to include the type > of problem that this consistency would help. And > I don't see patch 2/2 in the ML. So trying to resend. > > By maintaining the device order (some) consistency it makes reproducing > the missing chunk related problems more consistent. (More fixes of this > sort is coming up). > > Anand Jain (2): > btrfs: fix device order consistency > btrfs: fix alloc device order consistency > > fs/btrfs/volumes.c | 18 +- > 1 file changed, 17 insertions(+), 1 deletion(-) > For the whole series: Reviewed-by: Nikolay Borisov-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: volumes: Cleanup stripe size calculation
Cleanup the following things: 1) open coded SZ_16M round up 2) use min() to replace open-coded size comparison 3) code style Signed-off-by: Qu Wenruo--- fs/btrfs/volumes.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index cb0a8d27661b..90ad716d2b50 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4761,18 +4761,17 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, * and compare that answer with the max chunk size */ if (stripe_size * data_stripes > max_chunk_size) { - u64 mask = (1ULL << 24) - 1; - stripe_size = div_u64(max_chunk_size, data_stripes); /* bump the answer up to a 16MB boundary */ - stripe_size = (stripe_size + mask) & ~mask; + stripe_size = round_up(stripe_size, SZ_16M); - /* but don't go higher than the limits we found + /* +* but don't go higher than the limits we found * while searching for free extents */ - if (stripe_size > devices_info[ndevs-1].max_avail) - stripe_size = devices_info[ndevs-1].max_avail; + stripe_size = min(devices_info[ndevs - 1].max_avail, + stripe_size); } stripe_size = div_u64(stripe_size, dev_stripes); -- 2.16.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: volumes: Remove the meaningless condition of minimal nr_devs when allocating a chunk
When checking the minimal nr_devs, there is one dead and meaningless condition: if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) { This condition is meaningless, @devs_increment has nothing to do with @sub_stripes. In fact, in btrfs_raid_array[], profile with sub_stripes larger than 1 (RAID10) already has the @devs_increment set to 2. So no need to multiple it by @sub_stripes. And above condition is also dead. For RAID10, @devs_increment * @sub_stripes equals 4, which is also the @devs_min of RAID10. For other profiles, @sub_stripes is always 1, and since @ndevs is rounded down to @devs_increment, the condition will always be true. Remove the meaningless condition to make later reader wander less. Signed-off-by: Qu Wenruo--- fs/btrfs/volumes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 215e85e22c8e..cb0a8d27661b 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4729,7 +4729,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, /* round down to number of usable stripes */ ndevs = round_down(ndevs, devs_increment); - if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) { + if (ndevs < devs_min) { ret = -ENOSPC; goto error; } -- 2.16.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 2/2] btrfs: fix alloc device order consistency
Add opened device to the tail of dev_alloc_list instead of head, so that it maintains the same order as dev_list. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 0109f370ad5b..9c3b4ff7f505 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -708,7 +708,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, if (test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state) && device->devid != BTRFS_DEV_REPLACE_DEVID) { fs_devices->rw_devices++; - list_add(>dev_alloc_list, _devices->alloc_list); + list_add_tail(>dev_alloc_list, _devices->alloc_list); } brelse(bh); -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix null pointer dereference when replacing missing device
On 01/31/2018 02:40 AM, fdman...@kernel.org wrote: From: Filipe MananaWhen we are replacing a missing device we mount the filesystem with the degraded mode option in which case we are allowed to have a btrfs device structure without a backing device member (its bdev member is NULL) and therefore we can't dereference that member. Commit 38b5f68e9811 ("btrfs: drop btrfs_device::can_discard to query directly") started to dereference that member when discarding extents, resulting in a null pointer dereference: [ 3145.322257] BTRFS warning (device sdf): devid 2 uuid 4d922414-58eb-4880-8fed-9c3840f6c5d5 is missing [ 3145.364116] BTRFS info (device sdf): dev_replace from (devid 2) to /dev/sdg started [ 3145.413489] BUG: unable to handle kernel NULL pointer dereference at 00e0 [ 3145.415085] IP: btrfs_discard_extent+0x6a/0xf8 [btrfs] [ 3145.415085] PGD 0 P4D 0 [ 3145.415085] Oops: [#1] PREEMPT SMP PTI [ 3145.415085] Modules linked in: ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse parport_pc serio_raw i2c_piix4 i2 [ 3145.415085] CPU: 0 PID: 11989 Comm: btrfs Tainted: GW 4.15.0-rc9-btrfs-next-55+ #1 [ 3145.415085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [ 3145.415085] RIP: 0010:btrfs_discard_extent+0x6a/0xf8 [btrfs] [ 3145.415085] RSP: 0018:c90004813c60 EFLAGS: 00010293 [ 3145.415085] RAX: 88020d39cc00 RBX: 88020c4ea2a0 RCX: 0002 [ 3145.415085] RDX: RSI: 88020c4ea240 RDI: [ 3145.415085] RBP: R08: 4000 R09: [ 3145.415085] R10: c90004813ae8 R11: R12: [ 3145.415085] R13: 88020c418000 R14: R15: [ 3145.415085] FS: 7f565681f8c0() GS:88023fc0() knlGS: [ 3145.415085] CS: 0010 DS: ES: CR0: 80050033 [ 3145.415085] CR2: 00e0 CR3: 00020d208006 CR4: 001606f0 [ 3145.415085] Call Trace: [ 3145.415085] btrfs_finish_extent_commit+0x9a/0x1be [btrfs] [ 3145.415085] btrfs_commit_transaction+0x649/0x7a0 [btrfs] [ 3145.415085] ? start_transaction+0x2b0/0x3b3 [btrfs] [ 3145.415085] btrfs_dev_replace_start+0x274/0x30c [btrfs] [ 3145.415085] btrfs_dev_replace_by_ioctl+0x45/0x59 [btrfs] [ 3145.415085] btrfs_ioctl+0x1a91/0x1d62 [btrfs] [ 3145.415085] ? lock_acquire+0x16a/0x1af [ 3145.415085] ? vfs_ioctl+0x1b/0x28 [ 3145.415085] ? trace_hardirqs_on_caller+0x14c/0x1a6 [ 3145.415085] vfs_ioctl+0x1b/0x28 [ 3145.415085] do_vfs_ioctl+0x5a9/0x5e0 [ 3145.415085] ? _raw_spin_unlock_irq+0x34/0x46 [ 3145.415085] ? entry_SYSCALL_64_fastpath+0x5/0x8b [ 3145.415085] ? trace_hardirqs_on_caller+0x14c/0x1a6 [ 3145.415085] SyS_ioctl+0x52/0x76 [ 3145.415085] entry_SYSCALL_64_fastpath+0x1e/0x8b [ 3145.415085] RIP: 0033:0x7f56558b3c47 [ 3145.415085] RSP: 002b:7ffdcfac4c58 EFLAGS: 0202 [ 3145.415085] Code: be 02 00 00 00 4c 89 ef e8 b9 e7 03 00 85 c0 89 c5 75 75 48 8b 44 24 08 45 31 f6 48 8d 58 60 eb 52 48 8b 03 48 8b b8 a0 00 00 00 <48> 8b 87 e0 00 [ 3145.415085] RIP: btrfs_discard_extent+0x6a/0xf8 [btrfs] RSP: c90004813c60 [ 3145.415085] CR2: 00e0 [ 3145.458185] ---[ end trace 06302e7ac31902bf ]--- This is trivially reproduced by running the test btrfs/027 from fstests like this: $ MOUNT_OPTIONS="-o discard" ./check btrfs/027 Fix this by skipping devices without a backing device before attempting to discard. Fixes: 38b5f68e9811 ("btrfs: drop btrfs_device::can_discard to query directly") Signed-off-by: Filipe Manana oh no, I missed the context of degraded when writing this patch. Thanks for the fix. Reviewed-by: Anand Jain Thanks, Anand --- fs/btrfs/extent-tree.c | 4 1 file changed, 4 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 9d220b276c8f..d59ee24645e3 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2147,6 +2147,10 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, u64 bytes; struct request_queue *req_q; + if (!stripe->dev->bdev) { + ASSERT(btrfs_test_opt(fs_info, DEGRADED)); + continue; + } req_q = bdev_get_queue(stripe->dev->bdev); if (!blk_queue_discard(req_q)) continue; -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 7/7] btrfs-progs: Cleanup use of root in leaf_data_end
In function leaf_data_end, root is just used to get fs_info, so change the parameter of this function from btrfs_root to btrfs_fs_info. And also make it consistent with kernel. Changelog: v3->v2: Add const to parameter leaf of function btrfs_item_offset_nr to keep type consistent with leaf_data_end. Signed-off-by: Gu JinxiangReviewed-by: Qu Wenruo --- ctree.c | 32 +--- ctree.h | 2 +- 2 files changed, 18 insertions(+), 16 deletions(-) diff --git a/ctree.c b/ctree.c index 4efc96f..45b368c 100644 --- a/ctree.c +++ b/ctree.c @@ -408,12 +408,12 @@ static int btrfs_comp_keys(struct btrfs_disk_key *disk, struct btrfs_key *k2) * this returns the address of the start of the last item, * which is the stop of the leaf data stack */ -static inline unsigned int leaf_data_end(struct btrfs_root *root, -struct extent_buffer *leaf) +static inline unsigned int leaf_data_end(const struct btrfs_fs_info *fs_info, +const struct extent_buffer *leaf) { u32 nr = btrfs_header_nritems(leaf); if (nr == 0) - return BTRFS_LEAF_DATA_SIZE(root->fs_info); + return BTRFS_LEAF_DATA_SIZE(fs_info); return btrfs_item_offset_nr(leaf, nr - 1); } @@ -1735,10 +1735,10 @@ static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root right_nritems = btrfs_header_nritems(right); push_space = btrfs_item_end_nr(left, left_nritems - push_items); - push_space -= leaf_data_end(root, left); + push_space -= leaf_data_end(fs_info, left); /* make room in the right data area */ - data_end = leaf_data_end(root, right); + data_end = leaf_data_end(fs_info, right); memmove_extent_buffer(right, btrfs_leaf_data(right) + data_end - push_space, btrfs_leaf_data(right) + data_end, @@ -1747,7 +1747,7 @@ static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root /* copy from the left data area */ copy_extent_buffer(right, left, btrfs_leaf_data(right) + BTRFS_LEAF_DATA_SIZE(root->fs_info) - push_space, -btrfs_leaf_data(left) + leaf_data_end(root, left), +btrfs_leaf_data(left) + leaf_data_end(fs_info, left), push_space); memmove_extent_buffer(right, btrfs_item_nr_offset(push_items), @@ -1885,7 +1885,7 @@ static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root btrfs_item_offset_nr(right, push_items -1); copy_extent_buffer(left, right, btrfs_leaf_data(left) + -leaf_data_end(root, left) - push_space, +leaf_data_end(fs_info, left) - push_space, btrfs_leaf_data(right) + btrfs_item_offset_nr(right, push_items - 1), push_space); @@ -1912,12 +1912,13 @@ static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root if (push_items < right_nritems) { push_space = btrfs_item_offset_nr(right, push_items - 1) - - leaf_data_end(root, right); + leaf_data_end(fs_info, right); memmove_extent_buffer(right, btrfs_leaf_data(right) + BTRFS_LEAF_DATA_SIZE(root->fs_info) - push_space, btrfs_leaf_data(right) + - leaf_data_end(root, right), push_space); + leaf_data_end(fs_info, right), + push_space); memmove_extent_buffer(right, btrfs_item_nr_offset(0), btrfs_item_nr_offset(push_items), @@ -1976,7 +1977,8 @@ static noinline int copy_for_split(struct btrfs_trans_handle *trans, nritems = nritems - mid; btrfs_set_header_nritems(right, nritems); - data_copy_size = btrfs_item_end_nr(l, mid) - leaf_data_end(root, l); + data_copy_size = btrfs_item_end_nr(l, mid) - + leaf_data_end(root->fs_info, l); copy_extent_buffer(right, l, btrfs_item_nr_offset(0), btrfs_item_nr_offset(mid), @@ -1986,7 +1988,7 @@ static noinline int copy_for_split(struct btrfs_trans_handle *trans, btrfs_leaf_data(right) + BTRFS_LEAF_DATA_SIZE(root->fs_info) - data_copy_size, btrfs_leaf_data(l) + -leaf_data_end(root, l), data_copy_size); +leaf_data_end(root->fs_info, l), data_copy_size); rt_data_off = BTRFS_LEAF_DATA_SIZE(root->fs_info) - btrfs_item_end_nr(l, mid); @@ -2324,7
[PATCH v3 3/7] btrfs-progs: Sync code with kernel for BTRFS_MAX_INLINE_DATA_SIZE
Do a cleanup. Also make it consistent with kernel. Use fs_info instead of root for BTRFS_MAX_INLINE_DATA_SIZE, since maybe in some situation we do not know root, but just know fs_info. Changelog: v2->v1: Change macro to inline function to be consistent with kernel. And change the function body to match kernel. v3->v2: Change the title of this commit to make it more exact. Signed-off-by: Gu JinxiangReviewed-by: Qu Wenruo --- convert/source-ext2.c | 2 +- convert/source-reiserfs.c | 2 +- ctree.h | 17 ++--- mkfs/main.c | 4 ++-- 4 files changed, 18 insertions(+), 7 deletions(-) diff --git a/convert/source-ext2.c b/convert/source-ext2.c index e5c2a94..f5ecd8c 100644 --- a/convert/source-ext2.c +++ b/convert/source-ext2.c @@ -309,7 +309,7 @@ static int ext2_create_file_extents(struct btrfs_trans_handle *trans, goto fail; if ((convert_flags & CONVERT_FLAG_INLINE_DATA) && data.first_block == 0 && data.num_blocks > 0 - && inode_size <= BTRFS_MAX_INLINE_DATA_SIZE(root)) { + && inode_size <= BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info)) { u64 num_bytes = data.num_blocks * sectorsize; u64 disk_bytenr = data.disk_block * sectorsize; u64 nbytes; diff --git a/convert/source-reiserfs.c b/convert/source-reiserfs.c index e3582bd..39d6f07 100644 --- a/convert/source-reiserfs.c +++ b/convert/source-reiserfs.c @@ -376,7 +376,7 @@ static int reiserfs_convert_tail(struct btrfs_trans_handle *trans, u64 isize; int ret; - if (length >= BTRFS_MAX_INLINE_DATA_SIZE(root)) + if (length >= BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info)) return convert_direct(trans, root, objectid, inode, body, length, offset, convert_flags); diff --git a/ctree.h b/ctree.h index f3964f1..7be48e2 100644 --- a/ctree.h +++ b/ctree.h @@ -359,9 +359,6 @@ struct btrfs_header { #define __BTRFS_LEAF_DATA_SIZE(bs) ((bs) - sizeof(struct btrfs_header)) #define BTRFS_LEAF_DATA_SIZE(fs_info) \ (__BTRFS_LEAF_DATA_SIZE(fs_info->nodesize)) -#define BTRFS_MAX_INLINE_DATA_SIZE(r) (BTRFS_LEAF_DATA_SIZE(r->fs_info) - \ - sizeof(struct btrfs_item) - \ - sizeof(struct btrfs_file_extent_item)) #define BTRFS_MAX_XATTR_SIZE(r)(BTRFS_LEAF_DATA_SIZE(r->fs_info) - \ sizeof(struct btrfs_item) -\ sizeof(struct btrfs_dir_item)) @@ -1188,10 +1185,24 @@ struct btrfs_root { struct rb_node rb_node; }; +static inline u32 BTRFS_MAX_ITEM_SIZE(const struct btrfs_fs_info *info) +{ + return BTRFS_LEAF_DATA_SIZE(info) - sizeof(struct btrfs_item); +} + static inline u32 BTRFS_NODEPTRS_PER_BLOCK(const struct btrfs_fs_info *info) { return BTRFS_LEAF_DATA_SIZE(info) / sizeof(struct btrfs_key_ptr); } + +#define BTRFS_FILE_EXTENT_INLINE_DATA_START\ + (offsetof(struct btrfs_file_extent_item, disk_bytenr)) +static inline u32 BTRFS_MAX_INLINE_DATA_SIZE(const struct btrfs_fs_info *info) +{ + return BTRFS_MAX_ITEM_SIZE(info) - + BTRFS_FILE_EXTENT_INLINE_DATA_START; +} + /* * inode items have the data typically returned from stat and store other * info about object characteristics. There is one for every file and dir in diff --git a/mkfs/main.c b/mkfs/main.c index d817ad8..a301efc 100644 --- a/mkfs/main.c +++ b/mkfs/main.c @@ -498,7 +498,7 @@ static int fill_inode_item(struct btrfs_trans_handle *trans, } if (S_ISREG(src->st_mode)) { btrfs_set_stack_inode_size(dst, (u64)src->st_size); - if (src->st_size <= BTRFS_MAX_INLINE_DATA_SIZE(root)) + if (src->st_size <= BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info)) btrfs_set_stack_inode_nbytes(dst, src->st_size); else { blocks = src->st_size / sectorsize; @@ -686,7 +686,7 @@ static int add_file_items(struct btrfs_trans_handle *trans, if (st->st_size % sectorsize) blocks += 1; - if (st->st_size <= BTRFS_MAX_INLINE_DATA_SIZE(root)) { + if (st->st_size <= BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info)) { char *buffer = malloc(st->st_size); if (!buffer) { -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 2/7] btrfs-progs: Use fs_info instead of root for BTRFS_NODEPTRS_PER_BLOCK
Do a cleanup. Also make it consistent with kernel. Use fs_info instead of root for BTRFS_NODEPTRS_PER_BLOCK, since maybe in some situation we do not know root, but just know fs_info. Changelog: v2->v1: To be consistent with kernel, change macro to inline function. Signed-off-by: Gu JinxiangReviewed-by: Qu Wenruo --- cmds-check.c | 4 ++-- ctree.c | 18 +- ctree.h | 7 --- print-tree.c | 2 +- quick-test.c | 2 +- 5 files changed, 17 insertions(+), 16 deletions(-) diff --git a/cmds-check.c b/cmds-check.c index 8d54564..5495dbe 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -2517,7 +2517,7 @@ static void account_bytes(struct btrfs_root *root, struct btrfs_path *path, if (level == 0) { btree_space_waste += btrfs_leaf_free_space(root, eb); } else { - free_nrs = (BTRFS_NODEPTRS_PER_BLOCK(root) - + free_nrs = (BTRFS_NODEPTRS_PER_BLOCK(root->fs_info) - btrfs_header_nritems(eb)); btree_space_waste += free_nrs * sizeof(struct btrfs_key_ptr); } @@ -9475,7 +9475,7 @@ static int run_next_block(struct btrfs_root *root, add_pending(pending, seen, ptr, size); } } - btree_space_waste += (BTRFS_NODEPTRS_PER_BLOCK(root) - + btree_space_waste += (BTRFS_NODEPTRS_PER_BLOCK(fs_info) - nritems) * sizeof(struct btrfs_key_ptr); } total_btree_bytes += buf->len; diff --git a/ctree.c b/ctree.c index 6d130f5..4efc96f 100644 --- a/ctree.c +++ b/ctree.c @@ -427,7 +427,7 @@ btrfs_check_node(struct btrfs_root *root, struct btrfs_disk_key *parent_key, u32 nritems = btrfs_header_nritems(buf); enum btrfs_tree_block_status ret = BTRFS_TREE_BLOCK_INVALID_NRITEMS; - if (nritems == 0 || nritems > BTRFS_NODEPTRS_PER_BLOCK(root)) + if (nritems == 0 || nritems > BTRFS_NODEPTRS_PER_BLOCK(root->fs_info)) goto fail; ret = BTRFS_TREE_BLOCK_INVALID_PARENT_KEY; @@ -714,7 +714,7 @@ static int balance_level(struct btrfs_trans_handle *trans, return ret; } if (btrfs_header_nritems(mid) > - BTRFS_NODEPTRS_PER_BLOCK(root) / 4) + BTRFS_NODEPTRS_PER_BLOCK(fs_info) / 4) return 0; left = read_node_slot(fs_info, parent, pslot - 1); @@ -882,7 +882,7 @@ static int noinline push_nodes_for_insert(struct btrfs_trans_handle *trans, if (extent_buffer_uptodate(left)) { u32 left_nr; left_nr = btrfs_header_nritems(left); - if (left_nr >= BTRFS_NODEPTRS_PER_BLOCK(root) - 1) { + if (left_nr >= BTRFS_NODEPTRS_PER_BLOCK(fs_info) - 1) { wret = 1; } else { ret = btrfs_cow_block(trans, root, left, parent, @@ -925,7 +925,7 @@ static int noinline push_nodes_for_insert(struct btrfs_trans_handle *trans, if (extent_buffer_uptodate(right)) { u32 right_nr; right_nr = btrfs_header_nritems(right); - if (right_nr >= BTRFS_NODEPTRS_PER_BLOCK(root) - 1) { + if (right_nr >= BTRFS_NODEPTRS_PER_BLOCK(root->fs_info) - 1) { wret = 1; } else { ret = btrfs_cow_block(trans, root, right, @@ -1144,7 +1144,7 @@ again: p->slots[level] = slot; if ((p->search_for_split || ins_len > 0) && btrfs_header_nritems(b) >= - BTRFS_NODEPTRS_PER_BLOCK(root) - 3) { + BTRFS_NODEPTRS_PER_BLOCK(fs_info) - 3) { int sret = split_node(trans, root, p, level); BUG_ON(sret > 0); if (sret) @@ -1290,7 +1290,7 @@ static int push_node_left(struct btrfs_trans_handle *trans, src_nritems = btrfs_header_nritems(src); dst_nritems = btrfs_header_nritems(dst); - push_items = BTRFS_NODEPTRS_PER_BLOCK(root) - dst_nritems; + push_items = BTRFS_NODEPTRS_PER_BLOCK(root->fs_info) - dst_nritems; WARN_ON(btrfs_header_generation(src) != trans->transid); WARN_ON(btrfs_header_generation(dst) != trans->transid); @@ -1360,7 +1360,7 @@ static int balance_node_right(struct btrfs_trans_handle *trans, src_nritems = btrfs_header_nritems(src); dst_nritems = btrfs_header_nritems(dst); - push_items = BTRFS_NODEPTRS_PER_BLOCK(root) - dst_nritems; + push_items = BTRFS_NODEPTRS_PER_BLOCK(root->fs_info) - dst_nritems; if (push_items <= 0) { return 1; } @@ -1488,7 +1488,7 @@ static int insert_ptr(struct btrfs_trans_handle *trans, struct btrfs_root nritems = btrfs_header_nritems(lower);
[PATCH v3 1/7] btrfs-progs: Use fs_info instead of root for BTRFS_LEAF_DATA_SIZE
Do a cleanup. Also make it consistent with kernel. Use fs_info instead of root for BTRFS_LEAF_DATA_SIZE, since maybe in some situation we do not know root, but just know fs_info. Signed-off-by: Gu JinxiangReviewed-by: Qu Wenruo --- cmds-check.c | 6 +++--- convert/source-ext2.c | 2 +- convert/source-reiserfs.c | 4 ++-- ctree.c | 49 +++ ctree.h | 10 ++ file-item.c | 2 +- volumes.c | 2 +- 7 files changed, 42 insertions(+), 33 deletions(-) diff --git a/cmds-check.c b/cmds-check.c index 84803a2..8d54564 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -7234,9 +7234,9 @@ again: unsigned int shift = 0, offset; if (i == 0 && btrfs_item_end_nr(buf, i) != - BTRFS_LEAF_DATA_SIZE(root)) { + BTRFS_LEAF_DATA_SIZE(root->fs_info)) { if (btrfs_item_end_nr(buf, i) > - BTRFS_LEAF_DATA_SIZE(root)) { + BTRFS_LEAF_DATA_SIZE(root->fs_info)) { ret = delete_bogus_item(root, path, buf, i); if (!ret) goto again; @@ -7245,7 +7245,7 @@ again: ret = -EIO; break; } - shift = BTRFS_LEAF_DATA_SIZE(root) - + shift = BTRFS_LEAF_DATA_SIZE(root->fs_info) - btrfs_item_end_nr(buf, i); } else if (i > 0 && btrfs_item_end_nr(buf, i) != btrfs_item_offset_nr(buf, i - 1)) { diff --git a/convert/source-ext2.c b/convert/source-ext2.c index e927721..e5c2a94 100644 --- a/convert/source-ext2.c +++ b/convert/source-ext2.c @@ -520,7 +520,7 @@ static int ext2_copy_single_xattr(struct btrfs_trans_handle *trans, } strncpy(namebuf, xattr_prefix_table[name_index], XATTR_NAME_MAX); strncat(namebuf, EXT2_EXT_ATTR_NAME(entry), entry->e_name_len); - if (name_len + datalen > BTRFS_LEAF_DATA_SIZE(root) - + if (name_len + datalen > BTRFS_LEAF_DATA_SIZE(root->fs_info) - sizeof(struct btrfs_item) - sizeof(struct btrfs_dir_item)) { fprintf(stderr, "skip large xattr on inode %Lu name %.*s\n", objectid - INO_OFFSET, name_len, namebuf); diff --git a/convert/source-reiserfs.c b/convert/source-reiserfs.c index be79d8e..e3582bd 100644 --- a/convert/source-reiserfs.c +++ b/convert/source-reiserfs.c @@ -676,7 +676,7 @@ static int reiserfs_xattr_indirect_fn(reiserfs_filsys_t fs, u64 position, size_t alloc = min(position + num_blocks * fs->fs_blocksize, size); char *body; - if (size > BTRFS_LEAF_DATA_SIZE(xa_data->root) - + if (size > BTRFS_LEAF_DATA_SIZE(xa_data->root->fs_info) - sizeof(struct btrfs_item) - sizeof(struct btrfs_dir_item)) { fprintf(stderr, "skip large xattr on objectid %llu name %.*s\n", xa_data->target_oid, (int)xa_data->namelen, @@ -714,7 +714,7 @@ static int reiserfs_xattr_direct_fn(reiserfs_filsys_t fs, __u64 position, struct reiserfs_xattr_data *xa_data = data; char *newbody; - if (size > BTRFS_LEAF_DATA_SIZE(xa_data->root) - + if (size > BTRFS_LEAF_DATA_SIZE(xa_data->root->fs_info) - sizeof(struct btrfs_item) - sizeof(struct btrfs_dir_item)) { fprintf(stderr, "skip large xattr on objectid %llu name %.*s\n", xa_data->target_oid, (int)xa_data->namelen, diff --git a/ctree.c b/ctree.c index 4fc33b1..6d130f5 100644 --- a/ctree.c +++ b/ctree.c @@ -413,7 +413,7 @@ static inline unsigned int leaf_data_end(struct btrfs_root *root, { u32 nr = btrfs_header_nritems(leaf); if (nr == 0) - return BTRFS_LEAF_DATA_SIZE(root); + return BTRFS_LEAF_DATA_SIZE(root->fs_info); return btrfs_item_offset_nr(leaf, nr - 1); } @@ -515,24 +515,26 @@ btrfs_check_leaf(struct btrfs_root *root, struct btrfs_disk_key *parent_key, goto fail; } if (i == 0 && btrfs_item_end_nr(buf, i) != - BTRFS_LEAF_DATA_SIZE(root)) { + BTRFS_LEAF_DATA_SIZE(root->fs_info)) { ret = BTRFS_TREE_BLOCK_INVALID_OFFSETS; fprintf(stderr, "bad item end %u wanted %u\n", btrfs_item_end_nr(buf, i), - (unsigned)BTRFS_LEAF_DATA_SIZE(root)); + (unsigned)BTRFS_LEAF_DATA_SIZE(root->fs_info)); goto fail; } } for (i = 0; i < nritems; i++) { - if (btrfs_item_end_nr(buf, i) >
[PATCH v3 5/7] btrfs-progs: do clean up for redundancy value assignment
Although skinny_metadata's type is int, its value just can be 0/1. And if condition be true only when skinny_metadata equals 1, so in if's executive part, set skinny_metadata to 1 is redundancy. Remove it. Signed-off-by: Gu JinxiangReviewed-by: Qu Wenruo --- extent-tree.c | 1 - 1 file changed, 1 deletion(-) diff --git a/extent-tree.c b/extent-tree.c index 055582c..9201c56 100644 --- a/extent-tree.c +++ b/extent-tree.c @@ -1005,7 +1005,6 @@ static int lookup_inline_extent_backref(struct btrfs_trans_handle *trans, extra_size = -1; if (owner < BTRFS_FIRST_FREE_OBJECTID && skinny_metadata) { - skinny_metadata = 1; key.type = BTRFS_METADATA_ITEM_KEY; key.offset = owner; } else if (skinny_metadata) { -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 6/7] btrfs-progs: remove no longer be used btrfs_alloc_extent
Do a cleanup. Function btrfs_alloc_extent is no longer be used. So let's remove it. Signed-off-by: Gu JinxiangReviewed-by: Qu Wenruo --- ctree.h | 6 -- 1 file changed, 6 deletions(-) diff --git a/ctree.h b/ctree.h index 679bbc9..41b14b5 100644 --- a/ctree.h +++ b/ctree.h @@ -2501,12 +2501,6 @@ struct extent_buffer *btrfs_alloc_free_block(struct btrfs_trans_handle *trans, u32 blocksize, u64 root_objectid, struct btrfs_disk_key *key, int level, u64 hint, u64 empty_size); -int btrfs_alloc_extent(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 num_bytes, u64 parent, - u64 root_objectid, u64 ref_generation, - u64 owner, u64 empty_size, u64 hint_byte, - u64 search_end, struct btrfs_key *ins, int data); int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 bytenr, u64 offset, int metadata, u64 *refs, u64 *flags); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/7] btrfs-progs: Do some clean up to be consistent with kernel
Patch 1~4 and 7: clean up use of btrfs_root. Beside it, syncs code with kernel. Patch 5: remove redundancy value assignment. Patch 6: remove no longer be used function define. Changelog: v2->v1: Patch 2,4: To be consistent with kernel, change macro to inline function. Patch 3: Change macro to inline function to be consistent with kernel. And change the function body to match kernel. v3->v2: Patch 3: Change the title to make it more exact. Patch 7: Add const to parameter leaf of function btrfs_item_offset_nr to keep type consistent with leaf_data_end. Gu Jinxiang (7): btrfs-progs: Use fs_info instead of root for BTRFS_LEAF_DATA_SIZE btrfs-progs: Use fs_info instead of root for BTRFS_NODEPTRS_PER_BLOCK btrfs-progs: Sync code with kernel for BTRFS_MAX_INLINE_DATA_SIZE btrfs-progs: Use fs_info instead of root for BTRFS_MAX_XATTR_SIZE btrfs-progs: do clean up for redundancy value assignment btrfs-progs: remove no longer be used btrfs_alloc_extent btrfs-progs: Cleanup use of root in leaf_data_end cmds-check.c | 10 ++--- convert/source-ext2.c | 4 +- convert/source-reiserfs.c | 6 +-- ctree.c | 97 ++- ctree.h | 47 +-- dir-item.c| 3 +- extent-tree.c | 1 - file-item.c | 2 +- mkfs/main.c | 4 +- print-tree.c | 2 +- quick-test.c | 2 +- volumes.c | 2 +- 12 files changed, 99 insertions(+), 81 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 4/7] btrfs-progs: Use fs_info instead of root for BTRFS_MAX_XATTR_SIZE
Do a cleanup. Also make it consistent with kernel. Use fs_info instead of root for BTRFS_MAX_XATTR_SIZE, since maybe in some situation we do not know root, but just know fs_info. Changelog: v2->v1: To be consistent with kernel, change macro to inline function. Signed-off-by: Gu JinxiangReviewed-by: Qu Wenruo --- ctree.h| 9 + dir-item.c | 3 ++- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/ctree.h b/ctree.h index 7be48e2..679bbc9 100644 --- a/ctree.h +++ b/ctree.h @@ -359,10 +359,6 @@ struct btrfs_header { #define __BTRFS_LEAF_DATA_SIZE(bs) ((bs) - sizeof(struct btrfs_header)) #define BTRFS_LEAF_DATA_SIZE(fs_info) \ (__BTRFS_LEAF_DATA_SIZE(fs_info->nodesize)) -#define BTRFS_MAX_XATTR_SIZE(r)(BTRFS_LEAF_DATA_SIZE(r->fs_info) - \ -sizeof(struct btrfs_item) -\ -sizeof(struct btrfs_dir_item)) - /* * this is a very generous portion of the super block, giving us @@ -1203,6 +1199,11 @@ static inline u32 BTRFS_MAX_INLINE_DATA_SIZE(const struct btrfs_fs_info *info) BTRFS_FILE_EXTENT_INLINE_DATA_START; } +static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info) +{ + return BTRFS_MAX_ITEM_SIZE(info) - sizeof(struct btrfs_dir_item); +} + /* * inode items have the data typically returned from stat and store other * info about object characteristics. There is one for every file and dir in diff --git a/dir-item.c b/dir-item.c index 462546c..0b7250c 100644 --- a/dir-item.c +++ b/dir-item.c @@ -311,7 +311,8 @@ static int verify_dir_item(struct btrfs_root *root, /* BTRFS_MAX_XATTR_SIZE is the same for all dir items */ if ((btrfs_dir_data_len(leaf, dir_item) + -btrfs_dir_name_len(leaf, dir_item)) > BTRFS_MAX_XATTR_SIZE(root)) { +btrfs_dir_name_len(leaf, dir_item)) > + BTRFS_MAX_XATTR_SIZE(root->fs_info)) { fprintf(stderr, "invalid dir item name + data len: %u + %u\n", (unsigned)btrfs_dir_name_len(leaf, dir_item), (unsigned)btrfs_dir_data_len(leaf, dir_item)); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: fsck-tests: Rename tree-reloc-tree test number
There are 2 fsck tests with the same number 027: tree-reloc-tree bad-extent-inline-ref-type And we also have a hole in 015, so just rename tree-reloc-tree to 015, to get rid of the duplicated test number and fill in the hole. Signed-off-by: Qu Wenruo--- .../{027-tree-reloc-tree => 015-tree-reloc-tree}/test.sh| 0 .../tree_reloc_for_data_reloc.img.xz| Bin .../tree_reloc_for_fs_tree.img.xz | Bin 3 files changed, 0 insertions(+), 0 deletions(-) rename tests/fsck-tests/{027-tree-reloc-tree => 015-tree-reloc-tree}/test.sh (100%) rename tests/fsck-tests/{027-tree-reloc-tree => 015-tree-reloc-tree}/tree_reloc_for_data_reloc.img.xz (100%) rename tests/fsck-tests/{027-tree-reloc-tree => 015-tree-reloc-tree}/tree_reloc_for_fs_tree.img.xz (100%) diff --git a/tests/fsck-tests/027-tree-reloc-tree/test.sh b/tests/fsck-tests/015-tree-reloc-tree/test.sh similarity index 100% rename from tests/fsck-tests/027-tree-reloc-tree/test.sh rename to tests/fsck-tests/015-tree-reloc-tree/test.sh diff --git a/tests/fsck-tests/027-tree-reloc-tree/tree_reloc_for_data_reloc.img.xz b/tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_data_reloc.img.xz similarity index 100% rename from tests/fsck-tests/027-tree-reloc-tree/tree_reloc_for_data_reloc.img.xz rename to tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_data_reloc.img.xz diff --git a/tests/fsck-tests/027-tree-reloc-tree/tree_reloc_for_fs_tree.img.xz b/tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_fs_tree.img.xz similarity index 100% rename from tests/fsck-tests/027-tree-reloc-tree/tree_reloc_for_fs_tree.img.xz rename to tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_fs_tree.img.xz -- 2.16.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/2] fix device orders consistency
v3->v2: rename device_sort() to device_cmp(). v1->v2: No code change. Change log updated to include the type of problem that this consistency would help. And I don't see patch 2/2 in the ML. So trying to resend. By maintaining the device order (some) consistency it makes reproducing the missing chunk related problems more consistent. (More fixes of this sort is coming up). Anand Jain (2): btrfs: fix device order consistency btrfs: fix alloc device order consistency fs/btrfs/volumes.c | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 1/2] btrfs: fix device order consistency
By maintaining the device order consistency it makes reproducing the problems related to missing chunk in the degraded mode much more consistent. So fix this by sorting the devices by devid within the kernel. So that we know which device is assigned to the struct fs_info::latest_bdev when all the devices are having and same SB generation. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 16 1 file changed, 16 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index b5036bd69e6a..0109f370ad5b 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" @@ -1102,6 +1103,20 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices, return ret; } +static int device_cmp(void *priv, struct list_head *a, struct list_head *b) +{ + struct btrfs_device *dev1, *dev2; + + dev1 = list_entry(a, struct btrfs_device, dev_list); + dev2 = list_entry(b, struct btrfs_device, dev_list); + + if (dev1->devid < dev2->devid) + return -1; + else if (dev1->devid > dev2->devid) + return 1; + return 0; +} + int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, fmode_t flags, void *holder) { @@ -1112,6 +1127,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, fs_devices->opened++; ret = 0; } else { + list_sort(NULL, _devices->devices, device_cmp); ret = __btrfs_open_devices(fs_devices, flags, holder); } mutex_unlock(_mutex); -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] btrfs: fix device order consistency
+static int device_sort(void *priv, struct list_head *a, struct list_head *b) I'll rename that to devid_cmp as it's the comparator and not really a sorting function. Ok. I will make that change in v3. Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: print error if primary super block write fails
On 2018年01月31日 06:33, Howard McLauchlan wrote: > Presently, failing a primary super block write but succeeding in at > least one super block write in general will appear to users as if > nothing important went wrong. However, upon unmounting and re-mounting, > the file system will be in a rolled back state. This was discovered > with a BCC program that uses bpf_override_return() to fail super block > writes. > > This patch outputs an error clarifying that the primary super block > write has failed, so users can expect potentially erroneous behaviour. > It also forces wait_dev_supers() to return an error to its caller if > the primary super block write fails. > > Signed-off-by: Howard McLauchlanReviewed-by: Qu Wenruo Thanks, Qu > --- > V2: Added devid to output, removed unnecessary fs_info parameter > > fs/btrfs/disk-io.c | 13 - > 1 file changed, 12 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 5da18ebc9222..6d98f2f21d5f 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -3298,6 +3298,7 @@ static int wait_dev_supers(struct btrfs_device *device, > int max_mirrors) > struct buffer_head *bh; > int i; > int errors = 0; > + bool primary_failed = false; > u64 bytenr; > > if (max_mirrors == 0) > @@ -3314,11 +3315,14 @@ static int wait_dev_supers(struct btrfs_device > *device, int max_mirrors) > BTRFS_SUPER_INFO_SIZE); > if (!bh) { > errors++; > + primary_failed = (i == 0) || primary_failed; > continue; > } > wait_on_buffer(bh); > - if (!buffer_uptodate(bh)) > + if (!buffer_uptodate(bh)) { > errors++; > + primary_failed = (i == 0) || primary_failed; > + } > > /* drop our reference */ > brelse(bh); > @@ -3327,6 +3331,13 @@ static int wait_dev_supers(struct btrfs_device > *device, int max_mirrors) > brelse(bh); > } > > + /* log error, force error return */ > + if (primary_failed) { > + btrfs_err(device->fs_info, "error writing primary super block > to device %llu", > + device->devid); > + return -1; > + } > + > return errors < i ? 0 : -1; > } > > signature.asc Description: OpenPGP digital signature
[PATCH v2] btrfs: print error if primary super block write fails
Presently, failing a primary super block write but succeeding in at least one super block write in general will appear to users as if nothing important went wrong. However, upon unmounting and re-mounting, the file system will be in a rolled back state. This was discovered with a BCC program that uses bpf_override_return() to fail super block writes. This patch outputs an error clarifying that the primary super block write has failed, so users can expect potentially erroneous behaviour. It also forces wait_dev_supers() to return an error to its caller if the primary super block write fails. Signed-off-by: Howard McLauchlan--- V2: Added devid to output, removed unnecessary fs_info parameter fs/btrfs/disk-io.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 5da18ebc9222..6d98f2f21d5f 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3298,6 +3298,7 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) struct buffer_head *bh; int i; int errors = 0; + bool primary_failed = false; u64 bytenr; if (max_mirrors == 0) @@ -3314,11 +3315,14 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) BTRFS_SUPER_INFO_SIZE); if (!bh) { errors++; + primary_failed = (i == 0) || primary_failed; continue; } wait_on_buffer(bh); - if (!buffer_uptodate(bh)) + if (!buffer_uptodate(bh)) { errors++; + primary_failed = (i == 0) || primary_failed; + } /* drop our reference */ brelse(bh); @@ -3327,6 +3331,13 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) brelse(bh); } + /* log error, force error return */ + if (primary_failed) { + btrfs_err(device->fs_info, "error writing primary super block to device %llu", + device->devid); + return -1; + } + return errors < i ? 0 : -1; } -- 2.14.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] iversion: make inode_cmp_iversion{+raw} return bool instead of s64
Ack. Should I expect this in a future pull request, or take it directly? There's no hurry about this, since none of the existing users of that function actually do anything but test the return value against zero, and nobody saves it into anything but a "bool" (which has magical casting properties and does not lose upper bits). Linus -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: degraded permanent mount option
On 2018-01-30 14:50, Tomasz Pala wrote: On Tue, Jan 30, 2018 at 08:46:32 -0500, Austin S. Hemmelgarn wrote: I personally think the degraded mount option is a mistake as this assumes that a lightly degraded system is not able to work which is false. If the system can mount to some working state then it should mount regardless if it is fully operative or not. If the array is in a bad state you need to learn about it by issuing a command or something. The same goes for a MD array (and yes, I am aware of the block layer vs filesystem thing here). The problem with this is that right now, it is not safe to run a BTRFS volume degraded and writable, but for an even remotely usable system Mounting read-only is still better than not mounting at all. Agreed, but what most people who are asking about this are asking for is to have the system just run missing a drive. For example, my emergency.target has limited network access and starts ssh server so I could recover from this situation remotely. with pretty much any modern distro, you need your root filesystem to be writable (or you need to have jumped through the hoops to make sure /var and /tmp are writable even if / isn't). Easy to handle by systemd. Not only this, but much more is planned: http://0pointer.net/blog/projects/stateless.html It's reasonably easy to handle even in a normal init system. The issue is that most distros don't really support it well. Arch and Gentoo make it trivial, but they let you configure storage however the hell you want. Pretty much everybody else is mostly designed to assume that /var is a part of /, they mostly work if it's not, but certain odd things cause problems, and you have to go through somewhat unfriendly configuration work during install to get a system set up that way (well, unfriendly if you're a regular user, it's perfectly fine for a seasoned sysadmin). Also, slightly OT, but has anyone involved in the development described in the article you linked every looked beyond the typical Fedora/Debian environment for any of the stuff the conclusions section says you're trying to achieve? Just curious, since NixOS can do almost all of it with near zero effort except for the vendor data part (NixOS still stores it's config in /etc, but it can work with just one or two files), and a handful of the other specific items have reasonably easy ways to implement them that just aren't widely supported (for example, factory resets have at least three options already, OverlayFS (bottom layer is your base image, stored in a read-only verified manner, top layer is writable for user customization), BTRFS seed devices (similar to an overlay, just at the block level), and bootable, self-installing, compressed system images). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] iversion: make inode_cmp_iversion{+raw} return bool instead of s64
From: Jeff LaytonAs Linus points out: The inode_cmp_iversion{+raw}() functions are pure and utter crap. Why? You say that they return 0/negative/positive, but they do so in a completely broken manner. They return that ternary value as the sequence number difference in a 's64', which means that if you actually care about that ternary value, and do the *sane* thing that the kernel-doc of the function implies is the right thing, you would do int cmp = inode_cmp_iversion(inode, old); if (cmp < 0 ... and as a result you get code that looks sane, but that doesn't actually *WORK* right. Since none of the callers actually care about the ternary value here, convert the inode_cmp_iversion{+raw} functions to just return a boolean value (false for matching, true for non-matching). This matches the existing use of these functions just fine, and makes it simple to convert them to return a ternary value in the future if we grow callers that need it. With this change we can also reimplement inode_cmp_iversion in a simpler way using inode_peek_iversion. Signed-off-by: Jeff Layton --- include/linux/iversion.h | 20 1 file changed, 8 insertions(+), 12 deletions(-) diff --git a/include/linux/iversion.h b/include/linux/iversion.h index 858463fca249..3d2fd06495ec 100644 --- a/include/linux/iversion.h +++ b/include/linux/iversion.h @@ -309,13 +309,13 @@ inode_query_iversion(struct inode *inode) * @inode: inode to check * @old: old value to check against its i_version * - * Compare the current raw i_version counter with a previous one. Returns 0 if - * they are the same or non-zero if they are different. + * Compare the current raw i_version counter with a previous one. Returns false + * if they are the same or true if they are different. */ -static inline s64 +static inline bool inode_cmp_iversion_raw(const struct inode *inode, u64 old) { - return (s64)inode_peek_iversion_raw(inode) - (s64)old; + return inode_peek_iversion_raw(inode) != old; } /** @@ -323,19 +323,15 @@ inode_cmp_iversion_raw(const struct inode *inode, u64 old) * @inode: inode to check * @old: old value to check against its i_version * - * Compare an i_version counter with a previous one. Returns 0 if they are - * the same, a positive value if the one in the inode appears newer than @old, - * and a negative value if @old appears to be newer than the one in the - * inode. + * Compare an i_version counter with a previous one. Returns false if they are + * the same, and true if they are different. * * Note that we don't need to set the QUERIED flag in this case, as the value * in the inode is not being recorded for later use. */ - -static inline s64 +static inline bool inode_cmp_iversion(const struct inode *inode, u64 old) { - return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) - - (s64)(old << I_VERSION_QUERIED_SHIFT); + return inode_peek_iversion(inode) != old; } #endif -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs check: backref lost, mismatch with its hash -- can't repair
Finally restored by merging the last snapshot with what btrfs restore gave me -- surprisingly almost the whole bunch of it :-) Thanks for helping! Cheers, Marco Qu Wenruo wrote on 2018-01-30 02:24: > > > On 2018年01月30日 02:16, ^m'e wrote: >> Thanks! >> >> Got these >> >> # ./btrfs.static inspect dump-super -fFa /dev/sdb3 |grep >> backup_tree_root: | sort -u >> backup_tree_root:180410073088gen: 463765level: 1 >> backup_tree_root:180415758336gen: 463766level: 1 >> backup_tree_root:180416364544gen: 463767level: 1 >> backup_tree_root:4194304gen: 463764level: 1 >> >> but, nada: all have transid failures... > > That's why I call it "small chance" > >> >> The backup snapshots are OK as per original check. > > Then you should be OK to restore. > > Thanks, > Qu > >> >> >> On Mon, Jan 29, 2018 at 3:09 PM, Qu Wenruowrote: >>> >>> >>> On 2018年01月29日 22:49, ^m'e wrote: On Mon, Jan 29, 2018 at 2:04 PM, Qu Wenruo wrote: > > > On 2018年01月29日 21:58, ^m'e wrote: >> Thanks for the advice, Qu! >> >> I used the system for a while, did some package upgrades -- writing in >> the suspect corrupted area. Then tried a btrfs-send to my backup vol, >> and it failed miserably with a nice kernel oops. >> >> So I went for a lowmem repair: >> >> # ./btrfsck.static check --repair --mode=lowmem /dev/sdb3 2>&1 | tee >> /mnt/custom/rescue/btrfs-recovery/btrfs-repair.BTR-POOL.1.log >> WARNING: low-memory mode repair support is only partial >> Fixed 0 roots. >> checking extents >> checking free space cache >> checking fs roots >> ERROR: failed to add inode 28891726 as orphan item root 257 >> ERROR: root 257 INODE[28891726] is orphan item > > At least I need dig the kernel code further to determine if the orphan > inode handling in btrfs-progs is correct or not. > > So there won't be more dirty fix soon. > > Hopefully you could get some good backup and restore the system. > > At least the problem is limited to a very small range, and it's > something we could handle easily. > > Thanks for all your report, > Qu > > Right. Meanwhile, could you please suggest the best course of action? btrfs rescue or restore? I have snapshots of my two subvols (rootfs, home -- now fs-checking them just in case...) >>> >>> Don't run --repair any more. >>> It seems to make the case worse. >>> >>> While the RW mount with orphan cleanup abort seems to screwed up the >>> filesystem. >>> >>> In this case, it's pretty hard to recover, but still has small chance. >>> >>> Use btrfs inspec dump-super to get backup roots: >>> >>> # btrfs inspect dump-super -fFa |grep backup_tree_root: | sort >>> | uniq >>> >>> And try all the 4 numbers in the following commands: >>> >>> # btrfs check --tree-root >>> >>> To see if is there any good one without transid error. >>> >>> Thanks, >>> Qu >>> signature.asc Description: OpenPGP digital signature
Re: degraded permanent mount option
On Tue, Jan 30, 2018 at 08:46:32 -0500, Austin S. Hemmelgarn wrote: >> I personally think the degraded mount option is a mistake as this >> assumes that a lightly degraded system is not able to work which is false. >> If the system can mount to some working state then it should mount >> regardless if it is fully operative or not. If the array is in a bad >> state you need to learn about it by issuing a command or something. The >> same goes for a MD array (and yes, I am aware of the block layer vs >> filesystem thing here). > The problem with this is that right now, it is not safe to run a BTRFS > volume degraded and writable, but for an even remotely usable system Mounting read-only is still better than not mounting at all. For example, my emergency.target has limited network access and starts ssh server so I could recover from this situation remotely. > with pretty much any modern distro, you need your root filesystem to be > writable (or you need to have jumped through the hoops to make sure /var > and /tmp are writable even if / isn't). Easy to handle by systemd. Not only this, but much more is planned: http://0pointer.net/blog/projects/stateless.html -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] iversion: make inode_cmp_iversion{+raw} return bool instead of s64
On Tue, 2018-01-30 at 17:50 +, Trond Myklebust wrote: > On Tue, 2018-01-30 at 12:31 -0500, Jeff Layton wrote: > > From: Jeff Layton> > > > As Linus points out: > > > > The inode_cmp_iversion{+raw}() functions are pure and utter crap. > > > > Why? > > > > You say that they return 0/negative/positive, but they do so in a > > completely broken manner. They return that ternary value as the > > sequence number difference in a 's64', which means that if you > > actually care about that ternary value, and do the *sane* thing > > that > > the kernel-doc of the function implies is the right thing, you > > would > > do > > > > int cmp = inode_cmp_iversion(inode, old); > > if (cmp < 0 ... > > > > and as a result you get code that looks sane, but that doesn't > > actually *WORK* right. > > > > Since none of the callers actually care about the ternary value here, > > convert the inode_cmp_iversion{+raw} functions to just return a > > boolean > > value (false for matching, true for non-matching). > > > > This matches the existing use of these functions just fine, and makes > > it > > simple to convert them to return a ternary value in the future if we > > grow callers that need it. > > > > Signed-off-by: Jeff Layton > > --- > > include/linux/iversion.h | 20 +--- > > 1 file changed, 9 insertions(+), 11 deletions(-) > > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > > index 858463fca249..ace32775c5f0 100644 > > --- a/include/linux/iversion.h > > +++ b/include/linux/iversion.h > > @@ -309,13 +309,13 @@ inode_query_iversion(struct inode *inode) > > * @inode: inode to check > > * @old: old value to check against its i_version > > * > > - * Compare the current raw i_version counter with a previous one. > > Returns 0 if > > - * they are the same or non-zero if they are different. > > + * Compare the current raw i_version counter with a previous one. > > Returns false > > + * if they are the same or true if they are different. > > */ > > -static inline s64 > > +static inline bool > > inode_cmp_iversion_raw(const struct inode *inode, u64 old) > > { > > - return (s64)inode_peek_iversion_raw(inode) - (s64)old; > > + return inode_peek_iversion_raw(inode) != old; > > } > > > > /** > > @@ -323,19 +323,17 @@ inode_cmp_iversion_raw(const struct inode > > *inode, u64 old) > > * @inode: inode to check > > * @old: old value to check against its i_version > > * > > - * Compare an i_version counter with a previous one. Returns 0 if > > they are > > - * the same, a positive value if the one in the inode appears newer > > than @old, > > - * and a negative value if @old appears to be newer than the one in > > the > > - * inode. > > + * Compare an i_version counter with a previous one. Returns false > > if they are > > + * the same, and true if they are different. > > * > > * Note that we don't need to set the QUERIED flag in this case, as > > the value > > * in the inode is not being recorded for later use. > > */ > > > > -static inline s64 > > +static inline bool > > inode_cmp_iversion(const struct inode *inode, u64 old) > > { > > - return (s64)(inode_peek_iversion_raw(inode) & > > ~I_VERSION_QUERIED) - > > - (s64)(old << I_VERSION_QUERIED_SHIFT); > > + return (inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) > > != > > + (old << I_VERSION_QUERIED_SHIFT); > > } > > Is there any reason why this couldn't just use inode_peek_iversion() > instead of having to both mask the output from > inode_peek_iversion_raw() and shift 'old'? None at all. I'll send a v2. -- Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: degraded permanent mount option
Just one final word, as all was already said: On Tue, Jan 30, 2018 at 11:30:31 -0500, Austin S. Hemmelgarn wrote: >> In other words, is it: >> - the systemd that threats btrfs WORSE than distributed filesystems, OR >> - btrfs that requires from systemd to be threaded BETTER than other fss? > Or maybe it's both? I'm more than willing to admit that what BTRFS does > expose currently is crap in terms of usability. The reason it hasn't > changed is that we (that is, the BTRFS people and the systemd people) > can't agree on what it should look like. Hard to agree with someone who refuses to do _anything_. You can choose to follow whatever, MD, LVM, ZFS, invent something totally different, write custom daemon or put timeout logic inside the kernel itself. It doesn't matter. You know the ecosystem - it is the udev that must be signalled somehow and systemd WILL follow. -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: degraded permanent mount option
On Tue, Jan 30, 2018 at 11:30:31 -0500, Austin S. Hemmelgarn wrote: >> - crypto below software RAID means double-encryption (wasted CPU), > It also means you leak no information about your storage stack. If JBOD > you're sufficiently worried about data protection that you're using > block-level encryption, you should be thinking _very_ hard about whether > or not that's an acceptable risk (and it usually isn't). Nonsense. Block-level encryption is the last resource protection, your primary concern is to encrypt at the highest level possible. Anyway, I don't need to care at all about encryption, one of my customer might. Just stop extending justification of your tight usage pattern to the rest of the world. BTW if YOU are sufficiently worried about data protection you need to use some hardware solution, like OPAL and completely avoid using consumer-grade (especially SSD) drives. This also saves CPU cycles, but let's not discuss here the gory details. If you can't imagine people have different requirements than you, then this is your mental problem, go solve it somewhere else. >> - RAID below LVM means you're stuck with the same RAID-profile for all >>the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for >>system and RAID0 for various system caches (like ccache on software >>builder machine) or transient LVM-level snapshots. > Then you skip MD and do the RAID work in LVM with DM-RAID (which > technically _is_ MD, just with a different frontend). 1. how is write-mostly handled by LVM-initiated RAID1? 2. how can one split LVM RAID1 to separate volumes in case of bit-rot situation that requires manual intervention to recover specific copy of a data (just like btrfs checksumming does automatically in raid1 mode)? >> - RAID below filesystem means loosing btrfs-RAID extra functionality, >>like recovering data from different mirror when CRC mismatch happens, > That depends on your choice of RAID and the exact configuration of the There is no data checksumming in MD-RAID, there is no voting in MD-RAID. There is FEC mode in dm-verity. > storage stack. As long as you expose two RAID devices, BTRFS > replication works just fine on top of them. Taking up 4 times the space? Or going crazy with 2*MD-RAID0? >> - crypto below LVN means encrypting everything, including data that is >>not sensitive - more CPU wasted, > Encrypting only sensitive data is never a good idea unless you can prove Encrypting the sensitive data _AT_or_ABOVE_ the filesystem level is crucial for any really sensitive data. > with certainty that you will keep it properly segregated, and even then > it's still a pretty bad idea because it makes it obvious exactly where > the information you consider sensitive is stored. ROTFL Do you really think this would make breaking the XTS easier, than in would be if the _entire_ drive would be encrypted using THE SAME secret? With attacker having access to the plain texts _AND_ ciphers? Wow... - stop doing the crypto, seriously. You do this wrong. Do you think that my customer or cooperative would happily share HIS secret with mine, just because we're running on the same server? Have you ever heard about zero-knowledge databases? Can you imagine, that some might want to do the decryption remotely, when he doesn't trust me as the owner of the machine? How is that me KNOWING, that their data is encrypted, eases the attack? >> - RAID below LVM means no way to use SSD acceleration of part of the HDD >>space using MD write-mostly functionality. > Again, just use LVM's DM-RAID and throw in DM-cache. Also, there were Obviously you've never used write-mostly, as you're apparently not aware about the difference in maintenance burden. > some patches just posted for BTRFS that indirectly allow for this > (specifically, they let you change the read-selection algorithm, with > the option of specifying to preferentially read from a specific device). When they will be available in LTS kernel, some will definitely use it and create even more complicated stacks. >> It is the bottom layer, but I might be attached into volumes at virtually >> any place of the logical topology tree. E.g. bare network drive added as >> device-mapper mirror target for on-line volume cloning. > And you seriously think that that's going to be a persistent setup? Persistent setups are archeology in IT. > One-shot stuff like that is almost never an issue unless your init > system is absolutely brain-dead _and_ you need it working as it was > immediately (and a live-clone of a device doesn't if you're doing it right). Brain-dead is a state of mind, when you reject usage scenarios that you completely don't understand, hopefully due to the small experience only. >> The point is: mainaining all of this logic is NOT the job for init system. >> With systemd you need exactly N-N=0 lines of code to make this work. > So, I find it very hard to believe that systemd requires absolutely
[PATCH] Btrfs: fix null pointer dereference when replacing missing device
From: Filipe MananaWhen we are replacing a missing device we mount the filesystem with the degraded mode option in which case we are allowed to have a btrfs device structure without a backing device member (its bdev member is NULL) and therefore we can't dereference that member. Commit 38b5f68e9811 ("btrfs: drop btrfs_device::can_discard to query directly") started to dereference that member when discarding extents, resulting in a null pointer dereference: [ 3145.322257] BTRFS warning (device sdf): devid 2 uuid 4d922414-58eb-4880-8fed-9c3840f6c5d5 is missing [ 3145.364116] BTRFS info (device sdf): dev_replace from (devid 2) to /dev/sdg started [ 3145.413489] BUG: unable to handle kernel NULL pointer dereference at 00e0 [ 3145.415085] IP: btrfs_discard_extent+0x6a/0xf8 [btrfs] [ 3145.415085] PGD 0 P4D 0 [ 3145.415085] Oops: [#1] PREEMPT SMP PTI [ 3145.415085] Modules linked in: ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse parport_pc serio_raw i2c_piix4 i2 [ 3145.415085] CPU: 0 PID: 11989 Comm: btrfs Tainted: GW 4.15.0-rc9-btrfs-next-55+ #1 [ 3145.415085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [ 3145.415085] RIP: 0010:btrfs_discard_extent+0x6a/0xf8 [btrfs] [ 3145.415085] RSP: 0018:c90004813c60 EFLAGS: 00010293 [ 3145.415085] RAX: 88020d39cc00 RBX: 88020c4ea2a0 RCX: 0002 [ 3145.415085] RDX: RSI: 88020c4ea240 RDI: [ 3145.415085] RBP: R08: 4000 R09: [ 3145.415085] R10: c90004813ae8 R11: R12: [ 3145.415085] R13: 88020c418000 R14: R15: [ 3145.415085] FS: 7f565681f8c0() GS:88023fc0() knlGS: [ 3145.415085] CS: 0010 DS: ES: CR0: 80050033 [ 3145.415085] CR2: 00e0 CR3: 00020d208006 CR4: 001606f0 [ 3145.415085] Call Trace: [ 3145.415085] btrfs_finish_extent_commit+0x9a/0x1be [btrfs] [ 3145.415085] btrfs_commit_transaction+0x649/0x7a0 [btrfs] [ 3145.415085] ? start_transaction+0x2b0/0x3b3 [btrfs] [ 3145.415085] btrfs_dev_replace_start+0x274/0x30c [btrfs] [ 3145.415085] btrfs_dev_replace_by_ioctl+0x45/0x59 [btrfs] [ 3145.415085] btrfs_ioctl+0x1a91/0x1d62 [btrfs] [ 3145.415085] ? lock_acquire+0x16a/0x1af [ 3145.415085] ? vfs_ioctl+0x1b/0x28 [ 3145.415085] ? trace_hardirqs_on_caller+0x14c/0x1a6 [ 3145.415085] vfs_ioctl+0x1b/0x28 [ 3145.415085] do_vfs_ioctl+0x5a9/0x5e0 [ 3145.415085] ? _raw_spin_unlock_irq+0x34/0x46 [ 3145.415085] ? entry_SYSCALL_64_fastpath+0x5/0x8b [ 3145.415085] ? trace_hardirqs_on_caller+0x14c/0x1a6 [ 3145.415085] SyS_ioctl+0x52/0x76 [ 3145.415085] entry_SYSCALL_64_fastpath+0x1e/0x8b [ 3145.415085] RIP: 0033:0x7f56558b3c47 [ 3145.415085] RSP: 002b:7ffdcfac4c58 EFLAGS: 0202 [ 3145.415085] Code: be 02 00 00 00 4c 89 ef e8 b9 e7 03 00 85 c0 89 c5 75 75 48 8b 44 24 08 45 31 f6 48 8d 58 60 eb 52 48 8b 03 48 8b b8 a0 00 00 00 <48> 8b 87 e0 00 [ 3145.415085] RIP: btrfs_discard_extent+0x6a/0xf8 [btrfs] RSP: c90004813c60 [ 3145.415085] CR2: 00e0 [ 3145.458185] ---[ end trace 06302e7ac31902bf ]--- This is trivially reproduced by running the test btrfs/027 from fstests like this: $ MOUNT_OPTIONS="-o discard" ./check btrfs/027 Fix this by skipping devices without a backing device before attempting to discard. Fixes: 38b5f68e9811 ("btrfs: drop btrfs_device::can_discard to query directly") Signed-off-by: Filipe Manana --- fs/btrfs/extent-tree.c | 4 1 file changed, 4 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 9d220b276c8f..d59ee24645e3 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2147,6 +2147,10 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, u64 bytes; struct request_queue *req_q; + if (!stripe->dev->bdev) { + ASSERT(btrfs_test_opt(fs_info, DEGRADED)); + continue; + } req_q = bdev_get_queue(stripe->dev->bdev); if (!blk_queue_discard(req_q)) continue; -- 2.11.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] iversion: make inode_cmp_iversion{+raw} return bool instead of s64
On Tue, 2018-01-30 at 12:31 -0500, Jeff Layton wrote: > From: Jeff Layton> > As Linus points out: > > The inode_cmp_iversion{+raw}() functions are pure and utter crap. > > Why? > > You say that they return 0/negative/positive, but they do so in a > completely broken manner. They return that ternary value as the > sequence number difference in a 's64', which means that if you > actually care about that ternary value, and do the *sane* thing > that > the kernel-doc of the function implies is the right thing, you > would > do > > int cmp = inode_cmp_iversion(inode, old); > if (cmp < 0 ... > > and as a result you get code that looks sane, but that doesn't > actually *WORK* right. > > Since none of the callers actually care about the ternary value here, > convert the inode_cmp_iversion{+raw} functions to just return a > boolean > value (false for matching, true for non-matching). > > This matches the existing use of these functions just fine, and makes > it > simple to convert them to return a ternary value in the future if we > grow callers that need it. > > Signed-off-by: Jeff Layton > --- > include/linux/iversion.h | 20 +--- > 1 file changed, 9 insertions(+), 11 deletions(-) > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > index 858463fca249..ace32775c5f0 100644 > --- a/include/linux/iversion.h > +++ b/include/linux/iversion.h > @@ -309,13 +309,13 @@ inode_query_iversion(struct inode *inode) > * @inode: inode to check > * @old: old value to check against its i_version > * > - * Compare the current raw i_version counter with a previous one. > Returns 0 if > - * they are the same or non-zero if they are different. > + * Compare the current raw i_version counter with a previous one. > Returns false > + * if they are the same or true if they are different. > */ > -static inline s64 > +static inline bool > inode_cmp_iversion_raw(const struct inode *inode, u64 old) > { > - return (s64)inode_peek_iversion_raw(inode) - (s64)old; > + return inode_peek_iversion_raw(inode) != old; > } > > /** > @@ -323,19 +323,17 @@ inode_cmp_iversion_raw(const struct inode > *inode, u64 old) > * @inode: inode to check > * @old: old value to check against its i_version > * > - * Compare an i_version counter with a previous one. Returns 0 if > they are > - * the same, a positive value if the one in the inode appears newer > than @old, > - * and a negative value if @old appears to be newer than the one in > the > - * inode. > + * Compare an i_version counter with a previous one. Returns false > if they are > + * the same, and true if they are different. > * > * Note that we don't need to set the QUERIED flag in this case, as > the value > * in the inode is not being recorded for later use. > */ > > -static inline s64 > +static inline bool > inode_cmp_iversion(const struct inode *inode, u64 old) > { > - return (s64)(inode_peek_iversion_raw(inode) & > ~I_VERSION_QUERIED) - > -(s64)(old << I_VERSION_QUERIED_SHIFT); > + return (inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) > != > + (old << I_VERSION_QUERIED_SHIFT); > } Is there any reason why this couldn't just use inode_peek_iversion() instead of having to both mask the output from inode_peek_iversion_raw() and shift 'old'? > #endif -- Trond Myklebust Linux NFS client maintainer, PrimaryData trond.mykleb...@primarydata.com
Re: [GIT PULL] inode->i_version rework for v4.16
On Tue, Jan 30, 2018 at 4:05 AM, Jeff Laytonwrote: > > My intent here was to have this handle wraparound using the same sort of > method that the time_before/time_after macros do. Obviously, I didn't > document that well. Oh, the intent was clear. The implementation was wrong. Note that "time_before()" returns a _boolean_. So yes, the time comparison functions do a 64-bit subtraction, exactly like yours do. BUT THEY DO NOT RETURN THAT DIFFERENCE. They test the sign in 64 bits and return that boolean single-bit value. > I want to make sure I understand what's actually broken here thoug. Is > it only broken when the two values are more than 2**63 apart, or is > there something else more fundamentally wrong here? There's something fundamentally wrong. The _intent_ is to allow numbers up to 2**63 apart, but if somebody does that int cmp = inode_cmp_iversion(inode, old); if (cmp < 0 ... then it actually only ever tests numbers 2**31 apart, because the high 32 bits will have been thrown away when the 64-bit difference is cast to "int". And what used to be a sign bit (in 64 bits) no longer exists, and the above tests the *new* sign bit that is bit #31, not #63. So you are a factor of four billion off. And while 2**63 is a big number and perfectly ok for a filesystem event difference, a difference of 2**31 is a "this can actually happen". Yes, even 2**31 is still a big difference, and it will take a very unusual situation, and quite a long time to trigger: something that does a thousand filesystem ops per second will not see the problem for 24 days. So you'll never see it in a test. But 24 days happens in real life.. That's why you need to do the comparison against zero inside the actual helper functions like the time comparisons do. Because if you return the 64-bit difference, it will be trivially lost, and the code will _look_ right, but not work right. The fact that you didn't even seem to see the problem in my example calling sequence just proves that point. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: degraded permanent mount option
On 2018-01-30 10:09, Tomasz Pala wrote: On Mon, Jan 29, 2018 at 08:42:32 -0500, Austin S. Hemmelgarn wrote: Yes. They are stupid enough to fail miserably with any more complicated setups, like stacking volume managers, crypto layer, network attached storage etc. I think you mean any setup that isn't sensibly layered. No, I mean any setup that wasn't considered by init system authors. Your 'sensibly' is not sensible for me. BCP for over a decade has been to put multipathing at the bottom, then crypto, then software RAID, than LVM, and then whatever filesystem you're using. Really? Let's enumerate some caveats of this: - crypto below software RAID means double-encryption (wasted CPU), It also means you leak no information about your storage stack. If you're sufficiently worried about data protection that you're using block-level encryption, you should be thinking _very_ hard about whether or not that's an acceptable risk (and it usually isn't). - RAID below LVM means you're stuck with the same RAID-profile for all the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for system and RAID0 for various system caches (like ccache on software builder machine) or transient LVM-level snapshots. Then you skip MD and do the RAID work in LVM with DM-RAID (which technically _is_ MD, just with a different frontend). - RAID below filesystem means loosing btrfs-RAID extra functionality, like recovering data from different mirror when CRC mismatch happens, That depends on your choice of RAID and the exact configuration of the storage stack. As long as you expose two RAID devices, BTRFS replication works just fine on top of them. - crypto below LVN means encrypting everything, including data that is not sensitive - more CPU wasted, Encrypting only sensitive data is never a good idea unless you can prove with certainty that you will keep it properly segregated, and even then it's still a pretty bad idea because it makes it obvious exactly where the information you consider sensitive is stored. - RAID below LVM means no way to use SSD acceleration of part of the HDD space using MD write-mostly functionality. Again, just use LVM's DM-RAID and throw in DM-cache. Also, there were some patches just posted for BTRFS that indirectly allow for this (specifically, they let you change the read-selection algorithm, with the option of specifying to preferentially read from a specific device). What you present is only some sane default, which doesn't mean it covers all the real-world cases. My recent server is using: - raw partitioning for base volumes, - LVM, - MD on top of some LVs (varying levels), - paritioned SSD cache attached to specific VGs, - crypto on top of selected LV/MD, - btrfs RAID1 on top of non-MDed LVs. Multipathing has to be the bottom layer for a given node because it interacts directly with hardware topology which gets obscured by the other layers. It is the bottom layer, but I might be attached into volumes at virtually any place of the logical topology tree. E.g. bare network drive added as device-mapper mirror target for on-line volume cloning. And you seriously think that that's going to be a persistent setup? One-shot stuff like that is almost never an issue unless your init system is absolutely brain-dead _and_ you need it working as it was immediately (and a live-clone of a device doesn't if you're doing it right). Crypto essentially has to be next, otherwise you leak info about the storage stack. I'm encrypting only the containers that require block-level encryption. Others might have more effective filesystem-level encryption or even be some TrueCrypt/whatever images. Again, you're leaking information by doing so. At a minimum, you're leaking info about where the data you consider sensitive is stored, and that's not counting volume names (exposed by LVM), container configuration (possibly exposed depending on how your container stack handles it), and other storage stack configuration info (exposed by the metadata of the various layers and possibly by files in /etc if you don't have your root filesystem encrypted). Swapping LVM and software RAID ends up giving you a setup which is difficult for most people to understand and therefore is hard to reliably maintain. It's more difficult, as you need to maintain manually two (or more) separate VGs with matching LVs inside. Harder, but more flexible. And could also be trivially simplified by eliminating MD and using LVM's native support for DM-RAID, which provides essentially the exact same functionality because DM-RAID is largely just a DM fronted for MD. Other init systems enforce things being this way because it maintains people's sanity, not because they have significant difficulty doing things differently (and in fact, it is _trivial_ to change the ordering in some of them, OpenRC on Gentoo for example quite literally requires exactly N-1 lines to change in each of N
Re: degraded permanent mount option
On Tue, Jan 30, 2018 at 16:09:50 +0100, Tomasz Pala wrote: >> BCP for over a >> decade has been to put multipathing at the bottom, then crypto, then >> software RAID, than LVM, and then whatever filesystem you're using. > > Really? Let's enumerate some caveats of this: > > - crypto below software RAID means double-encryption (wasted CPU), > > - RAID below LVM means you're stuck with the same RAID-profile for all > the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for > system and RAID0 for various system caches (like ccache on software > builder machine) or transient LVM-level snapshots. > > - RAID below filesystem means loosing btrfs-RAID extra functionality, > like recovering data from different mirror when CRC mismatch happens, > > - crypto below LVN means encrypting everything, including data that is > not sensitive - more CPU wasted, And, what is much worse - encrypting everything using the same secret. BIG show-stopper. I would shred such BCP as ineffective and insecure for both, data integrity and confidentiality. > - RAID below LVM means no way to use SSD acceleration of part of the HDD > space using MD write-mostly functionality. -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] inode->i_version rework for v4.16
On Tue, Jan 30, 2018 at 07:05:48AM -0500, Jeff Layton wrote: > > I want to make sure I understand what's actually broken here thoug. Is > it only broken when the two values are more than 2**63 apart, or is > there something else more fundamentally wrong here? The other problem is that returning a s64 (or a u64) is extremely inefficient on a 32-bit platform. So in general, it's better to avoid returning a 64-bit type unless it's really necessary - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: degraded permanent mount option
On Tue, Jan 30, 2018 at 10:05:34 -0500, Austin S. Hemmelgarn wrote: >> Instead, they should move their legs continuously and if the train is > not >> on the station yet, just climb back and retry. > No, that's really not a good analogy given the fact that that check for > the presence of a train takes a normal person milliseconds while the > event being raced against (the train departing) takes minutes. In the OMG... preventing races by "this would always take longer"? Seriously? > You're already looping forever _waiting_ for the volume to appear. How udev is waiting for events, not systemd. Nobody will do some crazy cross-layered shortcuts to overcome other's lazyness. > is that any different from lopping forever trying to _mount_ the volume Yes, because udev doesn't mount anything, ever. Not this binary dude! > instead given that failing to mount the volume is not going to damage > things. Failed premature attempt to mount prevents the system from booting WHEN the devices are ready - this is fatal. System boots randomly on racy conditions. But hey, "the devices will always appear faster, than the init attempt to do the mount"! Have you ever had some hardware RAID controller? Never heard about devices appearing after 5 minutes of warming up? > The issue here is that systemd refuses to implement any method > of actually retrying things that fail during startup.> 1. Such methods are trivial and I've already mentioned them a dozen of times. 2. They should be implemented in btrfs-upstream, not systemd-upstream, but I personally would happily help with writing them here. 3. They require full-circle path of 'allow-degraded' to be passed through btrfs code. >> mounting BEFORE volume is complete is FATAL - since no userspace daemon >> would ever retrigger the mount and the system won't came up. Provide one >> btrfsd volume manager and systemd could probably switch to using it. > And here you've lost any respect I might have had for you. Going personal? So thank you for discussion and good bye. Please refrain from answering me, I'm not going to discuss this any further with you. > **YOU DO NOT NEED A DAEMON TO DO EVERY LAST TASK ON THE SYSTEM** Sorry dude, but I won't repeat for the 5th times all the alternatives. You *all* refuse to step in ANY possible solution mentioned. You *all* except the systemd to do ALL the job, just like other init systems were forced to do, against the good design principles. Good luck having btrfs degraded mount under systemd. > > This is one of the two biggest things I hate about systemd(the journal > is the other one for those who care). The journal has currently *many* drawbacks, but this is not 'by design' but 'by appropriate code missing for now'. The same applies to btrfs, isn't it? > You don't need some special daemon to set the time, Ever heard about NTP? > or to set the hostname, FUD - no such daemon > or to fetch account data, FUD > or even to track who's logged in FUD > As much as it may surprise the systemd developers, people got on just > fine handling setting the system time, setting the hostname, fetching > account info, tracking active users, and any number of myriad other > tasks before systemd decided they needed to have their own special daemon. > Sure, in myriad of different scattered distro-specific files. The only reason systemd stepped in for some of there is that nobody else could introduce and force Linux-wide consensus. And if anyone would succeed, there would be some Austins blaming them for 'overtaking good old trashyard into coherent de facto standard.' > In this particular case, you don't need a daemon because the kernel does > the state tracking. Sure, MD doesn't require daemon and LVM doesn't require either. But they do provide some - I know, they are all wrong. -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: degraded permanent mount option
On Mon, Jan 29, 2018 at 21:44:23 -0700, Chris Murphy wrote: > Btrfs is orthogonal to systemd's willingness to wait forever while > making no progress. It doesn't matter what it is, it shouldn't wait > forever. It times out after 90 seconds (by default) and then it fails the mount entirely. > It occurs to me there are such systemd service units specifically for > waiting for example > > systemd-networkd-wait-online.service, systemd-networkd-wait-online - > Wait for network to >come online > > chrony-wait.service - Wait for chrony to synchronize system clock > > NetworkManager has a version of this. I don't see why there can't be a > wait for Btrfs to normally mount, Because mounting degraded btrfs without -o degraded won't WAIT for anything just immediatelly return failed. > just simply try to mount, it fails, wait 10, try again, wait 10 try again. For the last time: No Such Logic In Systemd CORE Every wait/repeat is done using UNITS - as you already noticed itself. And these are plain, regular UNITS. Is there anything that prevents YOU, Chris, from writing these UNITS for btrfs? I know what makes ME stop writing these units - it's lack of feedback from btrfs.ko ioctl handler. Without this I am unable to write UNITS handling fstab mount entries, because the logic would PROBABLY have to be hardcoded inside systemd-fstab-generator. And such logic MUST NOT be hardcoded - this MUST be user-configurable, i.e. made on UNITS level. You might argue that some-distros-SysV units or some Gentoo-OpenRC have support for this and if you want to change anything this is only a few lines of shell code to be altered. But systemd-fstab-generator is compiled binary and so WON'T allow the behaviour to be user-configurable. > And then fail the unit so we end up at a prompt. This can also be easily done, just like emergency-shell spawns when configured. If only btrfs could accept and keep information about volume being allowed for degraded mount. OK, to be honest I _can_ write such rules now, keeping the 'allow-degraded' state somewhere else (in a file for example). But since this is some non-standarized side-channel, such code won't be accepted in systemd upstream, especially because it requires the current udev rule to be slightly changed. -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: degraded permanent mount option
On Mon, Jan 29, 2018 at 14:00:53 -0500, Austin S. Hemmelgarn wrote: > We already do so in the accepted standard manner. If the mount fails > because of a missing device, you get a very specific message in the > kernel log about it, as is the case for most other common errors (for > uncommon ones you usually just get a generic open_ctree error). This is > really the only option too, as the mount() syscall (which the mount > command calls) returns only 0 on success or -1 and an appropriate errno > value on failure, and we can't exactly go about creating a half dozen > new error numbers just for this (well, technically we could, but I very > much doubt that they would be accepted upstream, which defeats the purpose). This is exacly why the separate communication channel being the ioctl is currently used. And I really don't understand why do you fight against expanding this ioctl response. > With what you're proposing for BTRFS however, _everything_ is a > complicated decision, namely: > 1. Do you retry at all? During boot, the answer should usually be yes, > but during normal system operation it should normally be no (because we > should be letting the user handle issues at that point). This is exactly why I propose to introduce ioctl in btrfs.ko that accepts userspace-configured (as per-volume policy) expectations. > 2. How long should you wait before you retry? There is no right answer > here that will work in all cases (I've seen systems which take multiple > minutes for devices to become available on boot), especially considering > those of us who would rather have things fail early. btrfs-last-resort@.timer per analogy to mdadm-last-resort@.timer > 3. If the retry fails, do you retry again? How many times before it > just outright fails? This is going to be system specific policy. On > systems where devices may take a while to come online, the answer is > probably yes and some reasonably large number, while on systems where > devices are known to reliably be online immediately, it makes no sense > to retry more than once or twice. All of this is systemd timer/service job. > 4. If you are going to retry, should you try a degraded mount? Again, > this is going to be system specific policy (regular users would probably > want this to be a yes, while people who care about data integrity over > availability would likely want it to be a no). Just like above - user-configured in systemd timers/services easily. > 5. Assuming you do retry with the degraded mount, how many times should > a normal mount fail before things go degraded? This ties in with 3 and > has the same arguments about variability I gave there. As above. > 6. How many times do you try a degraded mount before just giving up? > Again, similar variability to 3. > 7. Should each attempt try first a regular mount and then a degraded > one, or do you try just normal a couple times and then switch to > degraded, or even start out trying normal and then start alternating? > Any of those patterns has valid arguments both for and against it, so > this again needs to be user configurable policy. > > Altogether, that's a total of 7 policy decisions that should be user > configurable. All of them easy to implement if the btrfs.ko could accept 'allow-degraded' per-volume instruction and return 'try-degraded' in the ioctl. > Having a config file other than /etc/fstab for the mount > command should probably be avoided for sanity reasons (again, BTRFS is a > filesystem, not a volume manager), so they would all have to be handled > through mount options. The kernel will additionally have to understand > that those options need to be ignored (things do try to mount > filesystems without calling a mount helper, most notably the kernel when > it mounts the root filesystem on boot if you're not using an initramfs). > All in all, this type of thing gets out of hand _very_ fast. You need to think about the two separately: 1. tracking STATE - this is remembering 'allow-degraded' option for now, 2. configured POLICY - this is to be handled by init system. -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: degraded permanent mount option
On Mon, Jan 29, 2018 at 08:42:32 -0500, Austin S. Hemmelgarn wrote: >> Yes. They are stupid enough to fail miserably with any more complicated >> setups, like stacking volume managers, crypto layer, network attached >> storage etc. > I think you mean any setup that isn't sensibly layered. No, I mean any setup that wasn't considered by init system authors. Your 'sensibly' is not sensible for me. > BCP for over a > decade has been to put multipathing at the bottom, then crypto, then > software RAID, than LVM, and then whatever filesystem you're using. Really? Let's enumerate some caveats of this: - crypto below software RAID means double-encryption (wasted CPU), - RAID below LVM means you're stuck with the same RAID-profile for all the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for system and RAID0 for various system caches (like ccache on software builder machine) or transient LVM-level snapshots. - RAID below filesystem means loosing btrfs-RAID extra functionality, like recovering data from different mirror when CRC mismatch happens, - crypto below LVN means encrypting everything, including data that is not sensitive - more CPU wasted, - RAID below LVM means no way to use SSD acceleration of part of the HDD space using MD write-mostly functionality. What you present is only some sane default, which doesn't mean it covers all the real-world cases. My recent server is using: - raw partitioning for base volumes, - LVM, - MD on top of some LVs (varying levels), - paritioned SSD cache attached to specific VGs, - crypto on top of selected LV/MD, - btrfs RAID1 on top of non-MDed LVs. > Multipathing has to be the bottom layer for a given node because it > interacts directly with hardware topology which gets obscured by the > other layers. It is the bottom layer, but I might be attached into volumes at virtually any place of the logical topology tree. E.g. bare network drive added as device-mapper mirror target for on-line volume cloning. > Crypto essentially has to be next, otherwise you leak > info about the storage stack. I'm encrypting only the containers that require block-level encryption. Others might have more effective filesystem-level encryption or even be some TrueCrypt/whatever images. > Swapping LVM and software RAID ends up > giving you a setup which is difficult for most people to understand and > therefore is hard to reliably maintain. It's more difficult, as you need to maintain manually two (or more) separate VGs with matching LVs inside. Harder, but more flexible. > Other init systems enforce things being this way because it maintains > people's sanity, not because they have significant difficulty doing > things differently (and in fact, it is _trivial_ to change the ordering > in some of them, OpenRC on Gentoo for example quite literally requires > exactly N-1 lines to change in each of N files when re-ordering N > layers), provided each layer occurs exactly once for a given device and > the relative ordering is the same on all devices. And you know what? The point is: mainaining all of this logic is NOT the job for init system. With systemd you need exactly N-N=0 lines of code to make this work. The appropriate unit files are provided by MD and LVM upstream. And they include fallback mechanism for degrading volumes. > Given my own experience with systemd, it has exactly the same constraint > on relative ordering. I've tried to run split setups with LVM and > dm-crypt where one device had dm-crypt as the bottom layer and the other > had it as the top layer, and things locked up during boot on _every_ > generalized init system I tried. Hard to tell without access to the failing system, but this MIGHT have been: - old/missing/broken-by-distro-maintainers-who-know-better LVM rules, - old/bugged systemd, possibly with broken/old cryptsetup rules. >> It's quite obvious who's the culprit: every single remaining filesystem >> manages to mount under systemd without problems. They just expose >> informations about their state. > No, they don't (except ZFS). They don't expose informations (as there are none), but they DO mount. > There is no 'state' to expose for anything but BTRFS (and ZFS) Does ZFS expose it's state or not? > except possibly if the filesystem needs checked or > not. You're conflating filesystems and volume management. btrfs is a filesystem, device manager and volume manager. I might add DEVICE to a btrfs-thingy. I might mount the same btrfs-thingy selecting different VOLUME (subVOL=something_other) > The alternative way of putting what you just said is: > Every single remaining filesystem manages to mount under systemd without > problems, because it doesn't try to treat them as a block layer. Or: every other volume manager exposes separate block devices. Anyway - however we put this into words, it is btrfs that behaves differently. >> The 'needless complication', as you named it, usually should be the
Re: degraded permanent mount option
On 2018-01-30 08:46, Tomasz Pala wrote: On Mon, Jan 29, 2018 at 08:05:42 -0500, Austin S. Hemmelgarn wrote: Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF THIS_. It's functionally no different than prefacing an attempt to send a signal to a process by checking if the process exists, or trying to see if some other process is using a file that might be locked by Seriously, there is a race condition on train stations. People check if the train has stopped and opened the door before they move their legs to get in, but the train might be already gone - so this is pointless. Instead, they should move their legs continuously and if the train is > not on the station yet, just climb back and retry. No, that's really not a good analogy given the fact that that check for the presence of a train takes a normal person milliseconds while the event being raced against (the train departing) takes minutes. In the case being discussed, the check takes milliseconds and the event being raced against also takes milliseconds. The scale here is drastically different.> See the difference? I hope now you know what is the race condition. It is the condition, where CONSEQUENCES are fatal. Yes, the consequences of the condition being discussed functionally are fatal (you completely fail to mount the volume), because systemd doesn't retry mounting the root filesystem, it just breaks, which is absolutely at odds with the whole 'just works' mentality I always hear from the systemd fanboys and developers. You're already looping forever _waiting_ for the volume to appear. How is that any different from lopping forever trying to _mount_ the volume instead given that failing to mount the volume is not going to damage things. The issue here is that systemd refuses to implement any method of actually retrying things that fail during startup.> mounting BEFORE volume is complete is FATAL - since no userspace daemon would ever retrigger the mount and the system won't came up. Provide one btrfsd volume manager and systemd could probably switch to using it. And here you've lost any respect I might have had for you. **YOU DO NOT NEED A DAEMON TO DO EVERY LAST TASK ON THE SYSTEM** Period, end of story. This is one of the two biggest things I hate about systemd (the journal is the other one for those who care). You don't need some special daemon to set the time, or to set the hostname, or to fetch account data, or even to track who's logged in (though I understand that the last one is not systemd's fault originally). As much as it may surprise the systemd developers, people got on just fine handling setting the system time, setting the hostname, fetching account info, tracking active users, and any number of myriad other tasks before systemd decided they needed to have their own special daemon. In this particular case, you don't need a daemon because the kernel does the state tracking. It only checks that state completely though _when you ask it to mount the filesystem_ because it requires doing 99% of the work of mounting the filesystem (quite literally, you're doing pretty much everything short of actually hooking things up in the VFS layer). We are not a case like MD where there's just a tiny bit of metadata to parse to check what the state is supposed to be. Imagine if LVM required you to unconditionally activate all the LV's in a VG when you activate the VG and what logic would be required to validate the VG then, and you're pretty close to what's needed to check state for a BTRFS volume (translating LV's to chunks and the VG to the filesystem as a whole). There is no point in trying to parse that data every time a new device shows up, it's a waste of time (at a minimum, you're almost doubling the amount of time it takes to mount a volume if you are doing this each time a device shows up), energy, and resources in general. mounting AFTER volume is complete is FINE - and if the "pseudo-race" happens and volume disappears, then this was either some operator action, so the umount SHOULD happen, or we are facing some MALFUNCION, which is fatal itself, not by being a "race condition". Short of catastrophic failure, the _volume_ doesn't disappear, a component device does, and that is where the problem lies, especially given that the ioctl only tracks that each component device has been seen, not that all are present at the moment the ioctl is invoked. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Streamline btrfs_delalloc_reserve_metadata initial operations
On 30.01.2018 16:34, David Sterba wrote: > On Fri, Jan 12, 2018 at 04:21:05PM +0200, Nikolay Borisov wrote: >> @@ -6062,19 +6062,19 @@ int btrfs_delalloc_reserve_metadata(struct >> btrfs_inode *inode, u64 num_bytes) >> * If we have a transaction open (can happen if we call truncate_block >> * from truncate), then we need FLUSH_LIMIT so we don't deadlock. >> */ >> + >> if (btrfs_is_free_space_inode(inode)) { >> flush = BTRFS_RESERVE_NO_FLUSH; >> delalloc_lock = false; >> -} else if (current->journal_info) { >> -flush = BTRFS_RESERVE_FLUSH_LIMIT; >> -} >> +} else { >> +if (current->journal_info) >> +flush = BTRFS_RESERVE_FLUSH_LIMIT; >> >> -if (flush != BTRFS_RESERVE_NO_FLUSH && >> -btrfs_transaction_in_commit(fs_info)) >> -schedule_timeout(1); >> +if (btrfs_transaction_in_commit(fs_info)) >> +schedule_timeout(1); >> >> -if (delalloc_lock) >> mutex_lock(>delalloc_mutex); >> +} > > Squeezing the condition branches makes the code more readable, I have > only one objection and it's the mutex_lock. It IMHO looks better when > it's a separate branch as it pairs with the unlock: > > if (delalloc_lock) > mutex_lock(...); > > ... > > if (delalloc_lock) > mutex_unlock(...); > > In your version it's implied by the first if that checks > btrfs_is_free_space_inode and delalloc_lock is hidden there. > My line of thought when developing the patch was that delalloc is another level of indirection and. What I wanted to achieve in the end is to make it clear that delalloc_mutex really depends on whether we are reserving for the freespace inode or not. >> >> num_bytes = ALIGN(num_bytes, fs_info->sectorsize); -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Streamline btrfs_delalloc_reserve_metadata initial operations
On Fri, Jan 12, 2018 at 04:21:05PM +0200, Nikolay Borisov wrote: > @@ -6062,19 +6062,19 @@ int btrfs_delalloc_reserve_metadata(struct > btrfs_inode *inode, u64 num_bytes) >* If we have a transaction open (can happen if we call truncate_block >* from truncate), then we need FLUSH_LIMIT so we don't deadlock. >*/ > + > if (btrfs_is_free_space_inode(inode)) { > flush = BTRFS_RESERVE_NO_FLUSH; > delalloc_lock = false; > - } else if (current->journal_info) { > - flush = BTRFS_RESERVE_FLUSH_LIMIT; > - } > + } else { > + if (current->journal_info) > + flush = BTRFS_RESERVE_FLUSH_LIMIT; > > - if (flush != BTRFS_RESERVE_NO_FLUSH && > - btrfs_transaction_in_commit(fs_info)) > - schedule_timeout(1); > + if (btrfs_transaction_in_commit(fs_info)) > + schedule_timeout(1); > > - if (delalloc_lock) > mutex_lock(>delalloc_mutex); > + } Squeezing the condition branches makes the code more readable, I have only one objection and it's the mutex_lock. It IMHO looks better when it's a separate branch as it pairs with the unlock: if (delalloc_lock) mutex_lock(...); ... if (delalloc_lock) mutex_unlock(...); In your version it's implied by the first if that checks btrfs_is_free_space_inode and delalloc_lock is hidden there. > > num_bytes = ALIGN(num_bytes, fs_info->sectorsize); -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RESEND PATCH] btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster
Essentially duplicate the error handling from the above block which handles the !PageUptodate(page) case and additionally clear EXTENT_BOUNDARY. Signed-off-by: Nikolay BorisovReviewed-by: Josef Bacik --- Put description of the intended changes. fs/btrfs/relocation.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index f0c3f00e97cb..8b2a31cef5cf 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -3268,12 +3268,25 @@ static int relocate_file_extent_cluster(struct inode *inode, nr++; } - btrfs_set_extent_delalloc(inode, page_start, page_end, 0, NULL, - 0); + ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0, + NULL, 0); + if (ret) { + unlock_page(page); + put_page(page); + btrfs_delalloc_release_metadata(BTRFS_I(inode), +PAGE_SIZE); + btrfs_delalloc_release_extents(BTRFS_I(inode), + PAGE_SIZE); + + clear_extent_bits(_I(inode)->io_tree, + page_start, page_end, + EXTENT_LOCKED | EXTENT_BOUNDARY); + goto out; + + } set_page_dirty(page); - unlock_extent(_I(inode)->io_tree, - page_start, page_end); + unlock_extent(_I(inode)->io_tree, page_start, page_end); unlock_page(page); put_page(page); -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] btrfs: fix device order consistency
On Mon, Jan 22, 2018 at 02:49:36PM -0800, Anand Jain wrote: > By maintaining the device order consistency it makes reproducing > the problems related to missing chunk in the degraded mode much more > consistent. So fix this by sorting the devices by devid within the > kernel. So that we know which device is assigned to the struct > fs_info::latest_bdev when all the devices are having and same > SB generation. > > Signed-off-by: Anand Jain> --- > fs/btrfs/volumes.c | 16 > 1 file changed, 16 insertions(+) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 03f2685a5018..98e41d286283 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > #include > #include "ctree.h" > #include "extent_map.h" > @@ -1107,6 +1108,20 @@ static int __btrfs_open_devices(struct > btrfs_fs_devices *fs_devices, > return ret; > } > > +static int device_sort(void *priv, struct list_head *a, struct list_head *b) I'll rename that to devid_cmp as it's the comparator and not really a sorting function. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/2] fix device orders consistency
On Mon, Jan 22, 2018 at 02:49:35PM -0800, Anand Jain wrote: > Could you pls consider this for 4.16 ? Sorry, the patches are not fixes and I'll probably not add them to 4.16 and they missed the first pull. But I will add them to misc-next so they still help debugging and any fix that you come up with can be added to 4.16-rc independently. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Use schedule_timeout_interruptible
On Wed, Jan 24, 2018 at 10:38:51AM -0500, Josef Bacik wrote: > On Tue, Jan 23, 2018 at 02:46:53PM +0200, Nikolay Borisov wrote: > > Instead of manually fiddling with the state of the task > > (RUNNING->INTERRUPTIBLE->RUNNING) again just use > > schedule_timeout_interruptible > > which adjusts the task state as needed. No functional changes. > > > > Signed-off-by: Nikolay Borisov> > Reviewed-by: Josef Bacik Added to next, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: tree-checker: Replace root parameter with fs_info
On Fri, Jan 26, 2018 at 09:31:06AM -0500, Josef Bacik wrote: > On Thu, Jan 25, 2018 at 02:56:18PM +0800, Qu Wenruo wrote: > > When inspecting the error message with real corruption, the "root=%llu" > > always shows "1" (root tree), instead of correct owner. > > > > The problem is that we are getting @root from page->mapping->host, which > > points the same btree inode, so we will always get the same root. > > > > This makes the root owner output meaningless, and harder to port > > tree-checker to btrfs-progs. > > > > So get rid of the false and meaningless @root parameter and replace it > > with @fs_info. > > To get the owner, we can only rely on btrfs_header_owner() now. > > > > Signed-off-by: Qu Wenruo> > Reviewed-by: Josef Bacik Added to next, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RESEND PATCH] btrfs: Fix UAF when cleaning up fs_devs with a single stale device
Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced btrfs_free_stale_device which iterates the device lists for all registered btrfs filesystems and deletes those devices which aren't mounted. In a btrfs_devices structure has only 1 device attached to it and it is unused then btrfs_free_stale_devices will proceed to also free the btrfs_fs_devices struct itself. Currently this leads to a UAF since list_for_each_entry will try to perform a check on the already- freed memory to see if it has to terminated the loop. The fix is to use 'break' when we know we are freeing the current fs_devs. Fixes: 4fde46f0cc71 ("Btrfs: free the stale device") Signed-off-by: Nikolay Borisov--- Changed the subject to make it more descriptive fs/btrfs/volumes.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f7147740b68e..c3ab55336ee0 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -645,6 +645,7 @@ static void btrfs_free_stale_devices(const char *path, btrfs_sysfs_remove_fsid(fs_devs); list_del(_devs->list); free_fs_devices(fs_devs); + break; } else { fs_devs->num_devices--; list_del(>dev_list); -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] btrfs: Ignore errors from btrfs_qgroup_trace_extent_post
On Mon, Jan 29, 2018 at 10:06:52PM +0800, Qu Wenruo wrote: > > > On 2018年01月29日 21:53, Nikolay Borisov wrote: > > Running generic/019 with qgroups on the scratch device enabled is > > almost guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's > > supposed to trigger only on -ENOMEM, in reality, however, it's possible > > to get -EIO from btrfs_qgroup_trace_extent_post. This function just > > finds the roots of the extent being tracked and sets the qrecord->old_roots > > list. If this operation fails nothing critical happens except the > > quota accounting can be considered wrong. In such case just set the > > INCONSISTENT flag for the quota and print a warning, rather than > > killing off the system. Additionally, it's possible to trigger a BUG_ON > > in btrfs_truncate_inode_items as well. > > > > Signed-off-by: Nikolay Borisov> > Reviewed-by: Qu Wenruo Added to next, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: Add chunk allocation ENOSPC debug message for enospc_debug mount option
On Mon, Jan 22, 2018 at 10:11:44AM +0200, Nikolay Borisov wrote: > > @@ -4731,6 +4740,12 @@ static int __btrfs_alloc_chunk(struct > > btrfs_trans_handle *trans, > > > > if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) { > > ret = -ENOSPC; > > + if (btrfs_test_opt(info, ENOSPC_DEBUG)) { > > + btrfs_debug(info, > > + "%s: not enough devices with free space: have=%d minimal > > required=%d", > > nit: s/minimal/minimum > But there is no point in resending just for that, I guess David could > fix it while merging. Yes, string updated and added to next, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: degraded permanent mount option
On Mon, Jan 29, 2018 at 08:05:42 -0500, Austin S. Hemmelgarn wrote: > Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF > THIS_. It's functionally no different than prefacing an attempt to send > a signal to a process by checking if the process exists, or trying to > see if some other process is using a file that might be locked by Seriously, there is a race condition on train stations. People check if the train has stopped and opened the door before they move their legs to get in, but the train might be already gone - so this is pointless. Instead, they should move their legs continuously and if the train is not on the station yet, just climb back and retry. See the difference? I hope now you know what is the race condition. It is the condition, where CONSEQUENCES are fatal. mounting BEFORE volume is complete is FATAL - since no userspace daemon would ever retrigger the mount and the system won't came up. Provide one btrfsd volume manager and systemd could probably switch to using it. mounting AFTER volume is complete is FINE - and if the "pseudo-race" happens and volume disappears, then this was either some operator action, so the umount SHOULD happen, or we are facing some MALFUNCION, which is fatal itself, not by being a "race condition". -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: degraded permanent mount option
On 2018-01-29 16:54, waxhead wrote: Austin S. Hemmelgarn wrote: On 2018-01-29 12:58, Andrei Borzenkov wrote: 29.01.2018 14:24, Adam Borowski пишет: ... So any event (the user's request) has already happened. A rc system, of which systemd is one, knows whether we reached the "want root filesystem" or "want secondary filesystems" stage. Once you're there, you can issue the mount() call and let the kernel do the work. It is a btrfs choice to not expose compound device as separate one (like every other device manager does) Btrfs is not a device manager, it's a filesystem. it is a btrfs drawback that doesn't provice anything else except for this IOCTL with it's logic How can it provide you with something it doesn't yet have? If you want the information, call mount(). And as others in this thread have mentioned, what, pray tell, would you want to know "would a mount succeed?" for if you don't want to mount? it is a btrfs drawback that there is nothing to push assembling into "OK, going degraded" state The way to do so is to timeout, then retry with -o degraded. That's possible way to solve it. This likely requires support from mount.btrfs (or btrfs.ko) to return proper indication that filesystem is incomplete so caller can decide whether to retry or to try degraded mount. We already do so in the accepted standard manner. If the mount fails because of a missing device, you get a very specific message in the kernel log about it, as is the case for most other common errors (for uncommon ones you usually just get a generic open_ctree error). This is really the only option too, as the mount() syscall (which the mount command calls) returns only 0 on success or -1 and an appropriate errno value on failure, and we can't exactly go about creating a half dozen new error numbers just for this (well, technically we could, but I very much doubt that they would be accepted upstream, which defeats the purpose). Or may be mount.btrfs should implement this logic internally. This would really be the most simple way to make it acceptable to the other side by not needing to accept anything :) And would also be another layering violation which would require a proliferation of extra mount options to control the mount command itself and adjust the timeout handling. This has been done before with mount.nfs, but for slightly different reasons (primarily to allow nested NFS mounts, since the local directory that the filesystem is being mounted on not being present is treated like a mount timeout), and it had near zero control. It works there because they push the complicated policy decisions to userspace (namely, there is no support for retrying with different options or trying a different server). I just felt like commenting a bit on this from a regular users point of view. Remember that at some point BTRFS will probably be the default filesystem for the average penguin. BTRFS big selling point is redundance and a guarantee that whatever you write is the same that you will read sometime later. Many users will probably build their BTRFS system on a redundant array of storage devices. As long as there are sufficient (not necessarily all) storage devices present they expect their system to come up and work. If the system is not able to come up in a fully operative state it must at least be able to limp until the issue is fixed. Starting a argument about what init system is the most sane or most shiny is not helping. The truth is that systemd is not going away sometime soon and one might as well try to become friends if nothing else for the sake of having things working which should be a common goal regardless of the religion. FWIW, I don't care that it's systemd in this case, I care that people are arguing for the forced use of a coding anti-pattern that ends up being covered as bad practice in first year computer science courses (no, seriously, every professional programmer I've asked about this had time-of-check-time-of-use race conditions covered in one of their first-year CS classes) or the enforcement of an event-based model that really doesn't make any sense for this (OK, it makes a little sense for handling of devices reappearing, but systemd doesn't need to be involved in that beyond telling the kernel that the device reappeared, except that that's udev's job). I personally think the degraded mount option is a mistake as this assumes that a lightly degraded system is not able to work which is false. If the system can mount to some working state then it should mount regardless if it is fully operative or not. If the array is in a bad state you need to learn about it by issuing a command or something. The same goes for a MD array (and yes, I am aware of the block layer vs filesystem thing here). The problem with this is that right now, it is not safe to run a BTRFS volume degraded and writable, but for an even remotely usable system
Re: [PATCH] Btrfs: fix unexpected cow in run_delalloc_nocow
On Thu, Jan 25, 2018 at 11:02:49AM -0700, Liu Bo wrote: > Fstests generic/475 provides a way to fail metadata reads while > checking if checksum exists for the inode inside run_delalloc_nocow(), > and csum_exist_in_range() interprets error (-EIO) as inode having > checksum and makes its caller enters the cow path. > > In case of free space inode, this ends up with a warning in > cow_file_range(). > > The same problem applies for btrfs_cross_ref_exist() since it may also > read metadata in between. > > With this, run_delalloc_nocow() bails out when errors occur at the two > places. > > cc:v2.6.28+ > Fixes: 17d217fe970d ("Btrfs: fix nodatasum handling in balancing code") > Signed-off-by: Liu Bo This and the other fixes sent in this batch have been added to next and I'm going to push them to 4.16 either during the merge window still or at rc1 time. Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] inode->i_version rework for v4.16
On Mon, 2018-01-29 at 13:50 -0800, Linus Torvalds wrote: > On Mon, Jan 29, 2018 at 4:26 AM, Jeff Laytonwrote: > > > > This pile of patches is a rework of the inode->i_version field. We have > > traditionally incremented that field on every inode data or metadata > > change. Typically this increment needs to be logged on disk even when > > nothing else has changed, which is rather expensive. > > Hmm. I have pulled this, but it is really really broken in one place, > to the degree that I always went "no, I won't pull this garbage". > > But the breakage is potential, not actual, and can be fixed trivially, > so I'll let it slide - but I do require it to be fixed. And I require > people to *think* about it. > > So what's to horribly horribly wrong? > > The inode_cmp_iversion{+raw}() functions are pure and utter crap. > > Why? > > You say that they return 0/negative/positive, but they do so in a > completely broken manner. They return that ternary value as the > sequence number difference in a 's64', which means that if you > actually care about that ternary value, and do the *sane* thing that > the kernel-doc of the function implies is the right thing, you would > do > > int cmp = inode_cmp_iversion(inode, old); > > if (cmp < 0 ... > > and as a result you get code that looks sane, but that doesn't > actually *WORK* right. > My intent here was to have this handle wraparound using the same sort of method that the time_before/time_after macros do. Obviously, I didn't document that well. I want to make sure I understand what's actually broken here thoug. Is it only broken when the two values are more than 2**63 apart, or is there something else more fundamentally wrong here? > To make it even worse, it will actually work in practice by accident > in 99.9% of all cases, so now you have > > (a) subtly buggy code > (b) that looks fine > (c) and that works in testing > > which is just about the worst possible case for any code. The > interface is simply garbage that encourages bugs. > > And the bug wouldn't be in the user, the bug would be in this code you > just sent me. The interface is simply wrong. > > So this absolutely needs to be fixed. I see two fixes: > > - just return a boolean. That's all that any current user actually > wants, so the ternary value seems pointless. > > - make it return an 'int', and not just any int, but -1/0/1. That way > there is no worry about uses, and if somebody *really* cares about the > ternary value, they can now use a "switch" statement to get it > (alternatively, make it return an enum, but whatever). > > That "ternary" function that has 18446744069414584320 incorrect return > values really is unacceptable. > I think I'll just make it return a boolean value like you suggested first. I'll send a patch to fix it once I've done some basic testing with it. Many thanks, -- Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/2] btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info
Signed-off-by: Qu WenruoReviewed-by: Anand Jain Reviewed-by: Nikolay Borisov --- v2: Added tags. --- fs/btrfs/volumes.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d818b1f9c625..215e85e22c8e 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4581,7 +4581,7 @@ static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type) btrfs_set_fs_incompat(info, RAID56); } -#define BTRFS_MAX_DEVS(r) ((BTRFS_MAX_ITEM_SIZE(r->fs_info)\ +#define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info) \ - sizeof(struct btrfs_chunk)) \ / sizeof(struct btrfs_stripe) + 1) @@ -4638,7 +4638,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, max_stripe_size = SZ_1G; max_chunk_size = 10 * max_stripe_size; if (!devs_max) - devs_max = BTRFS_MAX_DEVS(info->chunk_root); + devs_max = BTRFS_MAX_DEVS(info); } else if (type & BTRFS_BLOCK_GROUP_METADATA) { /* for larger filesystems, use larger metadata chunks */ if (fs_devices->total_rw_bytes > 50ULL * SZ_1G) @@ -4647,7 +4647,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, max_stripe_size = SZ_256M; max_chunk_size = max_stripe_size; if (!devs_max) - devs_max = BTRFS_MAX_DEVS(info->chunk_root); + devs_max = BTRFS_MAX_DEVS(info); } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) { max_stripe_size = SZ_32M; max_chunk_size = 2 * max_stripe_size; -- 2.16.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()
Function __get_raid_index() is used to convert block group flags into raid index, which can be used to get various info directly from btrfs_raid_array[]. Refactor this function a little: 1) Rename to btrfs_bg_flags_to_raid_index() Double underline prefix is normally for internal functions, while the function is used by both extent-tree and volumes. Although the name is a little longer, but it should explain its usage quite well. 2) Move it to volumes.h and make it static inline Just several if-else branches, really no need to define it as a normal function. This also makes later code re-use between kernel and btrfs-progs easier. 3) Remove function get_block_group_index() Really no need to do such a simple thing as an exported function. Signed-off-by: Qu WenruoReviewed-by: Anand Jain Reviewed-by: Nikolay Borisov --- v2: Remove get_block_group_index() in this patch. Add extra comment for the usage of btrfs_bg_flags_to_raid_index() --- fs/btrfs/ctree.h | 1 - fs/btrfs/extent-tree.c | 41 ++--- fs/btrfs/volumes.c | 2 +- fs/btrfs/volumes.h | 22 ++ 4 files changed, 33 insertions(+), 33 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 13c260b525a1..27249240fa3e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2629,7 +2629,6 @@ struct btrfs_block_group_cache *btrfs_lookup_block_group( u64 bytenr); void btrfs_get_block_group(struct btrfs_block_group_cache *cache); void btrfs_put_block_group(struct btrfs_block_group_cache *cache); -int get_block_group_index(struct btrfs_block_group_cache *cache); struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 parent, u64 root_objectid, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 2f4328511ac8..2693c8617817 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7346,29 +7346,6 @@ wait_block_group_cache_done(struct btrfs_block_group_cache *cache) return ret; } -int __get_raid_index(u64 flags) -{ - if (flags & BTRFS_BLOCK_GROUP_RAID10) - return BTRFS_RAID_RAID10; - else if (flags & BTRFS_BLOCK_GROUP_RAID1) - return BTRFS_RAID_RAID1; - else if (flags & BTRFS_BLOCK_GROUP_DUP) - return BTRFS_RAID_DUP; - else if (flags & BTRFS_BLOCK_GROUP_RAID0) - return BTRFS_RAID_RAID0; - else if (flags & BTRFS_BLOCK_GROUP_RAID5) - return BTRFS_RAID_RAID5; - else if (flags & BTRFS_BLOCK_GROUP_RAID6) - return BTRFS_RAID_RAID6; - - return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */ -} - -int get_block_group_index(struct btrfs_block_group_cache *cache) -{ - return __get_raid_index(cache->flags); -} - static const char *btrfs_raid_type_names[BTRFS_NR_RAID_TYPES] = { [BTRFS_RAID_RAID10] = "raid10", [BTRFS_RAID_RAID1] = "raid1", @@ -7483,7 +7460,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, u64 empty_cluster = 0; struct btrfs_space_info *space_info; int loop = 0; - int index = __get_raid_index(flags); + int index = btrfs_bg_flags_to_raid_index(flags); bool failed_cluster_refill = false; bool failed_alloc = false; bool use_cluster = true; @@ -7569,7 +7546,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, btrfs_put_block_group(block_group); up_read(_info->groups_sem); } else { - index = get_block_group_index(block_group); + index = btrfs_bg_flags_to_raid_index( + block_group->flags); btrfs_lock_block_group(block_group, delalloc); goto have_block_group; } @@ -7579,7 +7557,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, } search: have_caching_bg = false; - if (index == 0 || index == __get_raid_index(flags)) + if (index == 0 || index == btrfs_bg_flags_to_raid_index(flags)) full_search = true; down_read(_info->groups_sem); list_for_each_entry(block_group, _info->block_groups[index], @@ -7837,7 +7815,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, loop: failed_cluster_refill = false; failed_alloc = false; - BUG_ON(index != get_block_group_index(block_group)); + BUG_ON(btrfs_bg_flags_to_raid_index(block_group->flags) != + index);
Re: [PATCH 1/3] btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()
On 2018年01月30日 16:16, Nikolay Borisov wrote: > > > On 30.01.2018 09:40, Qu Wenruo wrote: >> Function __get_raid_index() is used to convert block group flags into >> raid index, which can be used to get various info directly from >> btrfs_raid_array[]. >> >> Refactor this function a little: >> >> 1) Rename to btrfs_bg_flags_to_raid_index() >>Double underline prefix is normally for internal functions, while the >>function is used by both extent-tree and volumes. >> >>Although the name is a little longer, but it should explain its usage >>quite well. >> >> 2) Move it to volumes.h and make it static inline >>Just several if-else branches, really no need to define it as a normal >>function. >> >>This also makes later code re-use between kernel and btrfs-progs >>easier. >> >> Signed-off-by: Qu Wenruo> > Generally this looks good so you can add my : > > Reviewed-by: Nikolay Borisov > > However I wonder whether we should strive to add proper kernel docs to > function whenever we can (without going out of the way to do so). Kernel docs here seems a little over killed here. IMHO kernel docs are used for things like API, state machine or design overview. Here the raid type index is mostly internal used as a quicker and unified way to access btrfs_raid_array[]. Extra comment (maybe 1 or 2 lines) will be added in next version about the usage, and hopes it should work. > I.e > you are modifying the function now so might as well add them and > explicitly state it's used to convert blockgroup flags to raid levels. > The end goal is to make it easier for people who are looking for the > first time at the source to have easier time. Yep, making the source easier to read is always good. Thanks, Qu > >> --- >> fs/btrfs/extent-tree.c | 26 -- >> fs/btrfs/volumes.c | 2 +- >> fs/btrfs/volumes.h | 18 ++ >> 3 files changed, 23 insertions(+), 23 deletions(-) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 2f4328511ac8..e9c31b567a9c 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -7346,27 +7346,9 @@ wait_block_group_cache_done(struct >> btrfs_block_group_cache *cache) >> return ret; >> } >> >> -int __get_raid_index(u64 flags) >> -{ >> -if (flags & BTRFS_BLOCK_GROUP_RAID10) >> -return BTRFS_RAID_RAID10; >> -else if (flags & BTRFS_BLOCK_GROUP_RAID1) >> -return BTRFS_RAID_RAID1; >> -else if (flags & BTRFS_BLOCK_GROUP_DUP) >> -return BTRFS_RAID_DUP; >> -else if (flags & BTRFS_BLOCK_GROUP_RAID0) >> -return BTRFS_RAID_RAID0; >> -else if (flags & BTRFS_BLOCK_GROUP_RAID5) >> -return BTRFS_RAID_RAID5; >> -else if (flags & BTRFS_BLOCK_GROUP_RAID6) >> -return BTRFS_RAID_RAID6; >> - >> -return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */ >> -} >> - >> int get_block_group_index(struct btrfs_block_group_cache *cache) >> { >> -return __get_raid_index(cache->flags); >> +return btrfs_bg_flags_to_raid_index(cache->flags); >> } >> >> static const char *btrfs_raid_type_names[BTRFS_NR_RAID_TYPES] = { >> @@ -7483,7 +7465,7 @@ static noinline int find_free_extent(struct >> btrfs_fs_info *fs_info, >> u64 empty_cluster = 0; >> struct btrfs_space_info *space_info; >> int loop = 0; >> -int index = __get_raid_index(flags); >> +int index = btrfs_bg_flags_to_raid_index(flags); >> bool failed_cluster_refill = false; >> bool failed_alloc = false; >> bool use_cluster = true; >> @@ -7579,7 +7561,7 @@ static noinline int find_free_extent(struct >> btrfs_fs_info *fs_info, >> } >> search: >> have_caching_bg = false; >> -if (index == 0 || index == __get_raid_index(flags)) >> +if (index == 0 || index == btrfs_bg_flags_to_raid_index(flags)) >> full_search = true; >> down_read(_info->groups_sem); >> list_for_each_entry(block_group, _info->block_groups[index], >> @@ -9643,7 +9625,7 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, >> u64 bytenr) >> */ >> target = get_restripe_target(fs_info, block_group->flags); >> if (target) { >> -index = __get_raid_index(extended_to_chunk(target)); >> +index = btrfs_bg_flags_to_raid_index(extended_to_chunk(target)); >> } else { >> /* >> * this is just a balance, so if we were marked as full >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index a25684287501..d818b1f9c625 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -4625,7 +4625,7 @@ static int __btrfs_alloc_chunk(struct >> btrfs_trans_handle *trans, >> if (list_empty(_devices->alloc_list)) >> return -ENOSPC; >> >> -index = __get_raid_index(type); >> +index = btrfs_bg_flags_to_raid_index(type); >> >> sub_stripes =
Re: [PATCH 2/2] btrfs: drop optimal argument from find_live_mirror()
On 01/30/2018 05:12 PM, Nikolay Borisov wrote: On 30.01.2018 08:28, Anand Jain wrote: Drop optimal argument from the function find_live_mirror() as we can deduce it in the function itself. Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 9c9d987838c2..a61715677b67 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5253,10 +5253,11 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) static int find_live_mirror(struct btrfs_fs_info *fs_info, struct map_lookup *map, int first, - int optimal, int dev_replace_is_ongoing) + int dev_replace_is_ongoing) { int i; int num; + int optimal; int tolerance; struct btrfs_device *srcdev; @@ -5268,6 +5269,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info, else num = map->num_stripes; + optimal = first + current->pid % num; + if (dev_replace_is_ongoing && fs_info->dev_replace.cont_reading_from_srcdev_mode == BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID) @@ -5821,7 +5824,6 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, stripe_index = mirror_num - 1; else { stripe_index = find_live_mirror(fs_info, map, 0, - current->pid % map->num_stripes, dev_replace_is_ongoing); mirror_num = stripe_index + 1; } @@ -5849,8 +5851,6 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int old_stripe_index = stripe_index; stripe_index = find_live_mirror(fs_info, map, stripe_index, - stripe_index + - current->pid % map->sub_stripes, The value of optimal here (BTRFS_BLOCK_GROUP_RAID10) is different than the value of optimal in BTRFS_BLOCK_GROUP_RAID1 case. So you need to put optimal in the conditional branch you added in the previous patch. In raid 1 case: optimal = current->pid % num; In raid 10: optimal = first + current->pid % num; to current semantics First/stripe_index is 0 for RAID1. So I purposely kept it like this for easy code flow. Thanks, Anand dev_replace_is_ongoing); mirror_num = stripe_index - old_stripe_index + 1; } -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] btrfs: Unexport get_block_group_index()
On 2018年01月30日 16:18, Nikolay Borisov wrote: > > > On 30.01.2018 09:40, Qu Wenruo wrote: >> That function is only used inside extent-tree.c. >> >> Signed-off-by: Qu Wenruo>> --- >> fs/btrfs/ctree.h | 1 - >> fs/btrfs/extent-tree.c | 3 ++- >> 2 files changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 13c260b525a1..27249240fa3e 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -2629,7 +2629,6 @@ struct btrfs_block_group_cache >> *btrfs_lookup_block_group( >> u64 bytenr); >> void btrfs_get_block_group(struct btrfs_block_group_cache *cache); >> void btrfs_put_block_group(struct btrfs_block_group_cache *cache); >> -int get_block_group_index(struct btrfs_block_group_cache *cache); >> struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle >> *trans, >> struct btrfs_root *root, >> u64 parent, u64 root_objectid, >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index e9c31b567a9c..6e1128aa29d6 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -7346,7 +7346,8 @@ wait_block_group_cache_done(struct >> btrfs_block_group_cache *cache) >> return ret; >> } >> >> -int get_block_group_index(struct btrfs_block_group_cache *cache) >> +static enum btrfs_raid_types >> +get_block_group_index(struct btrfs_block_group_cache *cache) >> { >> return btrfs_bg_flags_to_raid_index(cache->flags); >> } > > I'd rather you removed this function and used bg_flags_to_raid_index > directly. Otherwise we have yet another level of indirection with a > rather wrong name - "block_group_index" ? Right, also pointed out by Anand, I'll just remove it. Thanks, Qu > >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > signature.asc Description: OpenPGP digital signature
Re: [PATCH 2/2] btrfs: drop optimal argument from find_live_mirror()
On 30.01.2018 08:28, Anand Jain wrote: > Drop optimal argument from the function find_live_mirror() > as we can deduce it in the function itself. > > Signed-off-by: Anand Jain> --- > fs/btrfs/volumes.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 9c9d987838c2..a61715677b67 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -5253,10 +5253,11 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info > *fs_info, u64 logical, u64 len) > > static int find_live_mirror(struct btrfs_fs_info *fs_info, > struct map_lookup *map, int first, > - int optimal, int dev_replace_is_ongoing) > + int dev_replace_is_ongoing) > { > int i; > int num; > + int optimal; > int tolerance; > struct btrfs_device *srcdev; > > @@ -5268,6 +5269,8 @@ static int find_live_mirror(struct btrfs_fs_info > *fs_info, > else > num = map->num_stripes; > > + optimal = first + current->pid % num; > + > if (dev_replace_is_ongoing && > fs_info->dev_replace.cont_reading_from_srcdev_mode == >BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID) > @@ -5821,7 +5824,6 @@ static int __btrfs_map_block(struct btrfs_fs_info > *fs_info, > stripe_index = mirror_num - 1; > else { > stripe_index = find_live_mirror(fs_info, map, 0, > - current->pid % map->num_stripes, > dev_replace_is_ongoing); > mirror_num = stripe_index + 1; > } > @@ -5849,8 +5851,6 @@ static int __btrfs_map_block(struct btrfs_fs_info > *fs_info, > int old_stripe_index = stripe_index; > stripe_index = find_live_mirror(fs_info, map, > stripe_index, > - stripe_index + > - current->pid % map->sub_stripes, The value of optimal here (BTRFS_BLOCK_GROUP_RAID10) is different than the value of optimal in BTRFS_BLOCK_GROUP_RAID1 case. So you need to put optimal in the conditional branch you added in the previous patch. In raid 1 case: optimal = current->pid % num; In raid 10: optimal = first + current->pid % num; to current semantics > dev_replace_is_ongoing); > mirror_num = stripe_index - old_stripe_index + 1; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info
On 30.01.2018 09:40, Qu Wenruo wrote: > Signed-off-by: Qu WenruoReviewed-by: Nikolay Borisov > --- > fs/btrfs/volumes.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index d818b1f9c625..215e85e22c8e 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -4581,7 +4581,7 @@ static void check_raid56_incompat_flag(struct > btrfs_fs_info *info, u64 type) > btrfs_set_fs_incompat(info, RAID56); > } > > -#define BTRFS_MAX_DEVS(r) ((BTRFS_MAX_ITEM_SIZE(r->fs_info) \ > +#define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info) \ > - sizeof(struct btrfs_chunk)) \ > / sizeof(struct btrfs_stripe) + 1) > > @@ -4638,7 +4638,7 @@ static int __btrfs_alloc_chunk(struct > btrfs_trans_handle *trans, > max_stripe_size = SZ_1G; > max_chunk_size = 10 * max_stripe_size; > if (!devs_max) > - devs_max = BTRFS_MAX_DEVS(info->chunk_root); > + devs_max = BTRFS_MAX_DEVS(info); > } else if (type & BTRFS_BLOCK_GROUP_METADATA) { > /* for larger filesystems, use larger metadata chunks */ > if (fs_devices->total_rw_bytes > 50ULL * SZ_1G) > @@ -4647,7 +4647,7 @@ static int __btrfs_alloc_chunk(struct > btrfs_trans_handle *trans, > max_stripe_size = SZ_256M; > max_chunk_size = max_stripe_size; > if (!devs_max) > - devs_max = BTRFS_MAX_DEVS(info->chunk_root); > + devs_max = BTRFS_MAX_DEVS(info); > } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) { > max_stripe_size = SZ_32M; > max_chunk_size = 2 * max_stripe_size; > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] btrfs: Unexport get_block_group_index()
On 30.01.2018 09:40, Qu Wenruo wrote: > That function is only used inside extent-tree.c. > > Signed-off-by: Qu Wenruo> --- > fs/btrfs/ctree.h | 1 - > fs/btrfs/extent-tree.c | 3 ++- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 13c260b525a1..27249240fa3e 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -2629,7 +2629,6 @@ struct btrfs_block_group_cache > *btrfs_lookup_block_group( >u64 bytenr); > void btrfs_get_block_group(struct btrfs_block_group_cache *cache); > void btrfs_put_block_group(struct btrfs_block_group_cache *cache); > -int get_block_group_index(struct btrfs_block_group_cache *cache); > struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle > *trans, >struct btrfs_root *root, >u64 parent, u64 root_objectid, > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index e9c31b567a9c..6e1128aa29d6 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -7346,7 +7346,8 @@ wait_block_group_cache_done(struct > btrfs_block_group_cache *cache) > return ret; > } > > -int get_block_group_index(struct btrfs_block_group_cache *cache) > +static enum btrfs_raid_types > +get_block_group_index(struct btrfs_block_group_cache *cache) > { > return btrfs_bg_flags_to_raid_index(cache->flags); > } I'd rather you removed this function and used bg_flags_to_raid_index directly. Otherwise we have yet another level of indirection with a rather wrong name - "block_group_index" ? > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info
On 01/30/2018 03:40 PM, Qu Wenruo wrote: Signed-off-by: Qu WenruoReviewed-by: Anand Jain Thanks, Anand --- fs/btrfs/volumes.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d818b1f9c625..215e85e22c8e 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4581,7 +4581,7 @@ static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type) btrfs_set_fs_incompat(info, RAID56); } -#define BTRFS_MAX_DEVS(r) ((BTRFS_MAX_ITEM_SIZE(r->fs_info) \ +#define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info) \ - sizeof(struct btrfs_chunk)) \ / sizeof(struct btrfs_stripe) + 1) @@ -4638,7 +4638,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, max_stripe_size = SZ_1G; max_chunk_size = 10 * max_stripe_size; if (!devs_max) - devs_max = BTRFS_MAX_DEVS(info->chunk_root); + devs_max = BTRFS_MAX_DEVS(info); } else if (type & BTRFS_BLOCK_GROUP_METADATA) { /* for larger filesystems, use larger metadata chunks */ if (fs_devices->total_rw_bytes > 50ULL * SZ_1G) @@ -4647,7 +4647,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, max_stripe_size = SZ_256M; max_chunk_size = max_stripe_size; if (!devs_max) - devs_max = BTRFS_MAX_DEVS(info->chunk_root); + devs_max = BTRFS_MAX_DEVS(info); } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) { max_stripe_size = SZ_32M; max_chunk_size = 2 * max_stripe_size; -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()
On 30.01.2018 09:40, Qu Wenruo wrote: > Function __get_raid_index() is used to convert block group flags into > raid index, which can be used to get various info directly from > btrfs_raid_array[]. > > Refactor this function a little: > > 1) Rename to btrfs_bg_flags_to_raid_index() >Double underline prefix is normally for internal functions, while the >function is used by both extent-tree and volumes. > >Although the name is a little longer, but it should explain its usage >quite well. > > 2) Move it to volumes.h and make it static inline >Just several if-else branches, really no need to define it as a normal >function. > >This also makes later code re-use between kernel and btrfs-progs >easier. > > Signed-off-by: Qu WenruoGenerally this looks good so you can add my : Reviewed-by: Nikolay Borisov However I wonder whether we should strive to add proper kernel docs to function whenever we can (without going out of the way to do so). I.e you are modifying the function now so might as well add them and explicitly state it's used to convert blockgroup flags to raid levels. The end goal is to make it easier for people who are looking for the first time at the source to have easier time. > --- > fs/btrfs/extent-tree.c | 26 -- > fs/btrfs/volumes.c | 2 +- > fs/btrfs/volumes.h | 18 ++ > 3 files changed, 23 insertions(+), 23 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 2f4328511ac8..e9c31b567a9c 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -7346,27 +7346,9 @@ wait_block_group_cache_done(struct > btrfs_block_group_cache *cache) > return ret; > } > > -int __get_raid_index(u64 flags) > -{ > - if (flags & BTRFS_BLOCK_GROUP_RAID10) > - return BTRFS_RAID_RAID10; > - else if (flags & BTRFS_BLOCK_GROUP_RAID1) > - return BTRFS_RAID_RAID1; > - else if (flags & BTRFS_BLOCK_GROUP_DUP) > - return BTRFS_RAID_DUP; > - else if (flags & BTRFS_BLOCK_GROUP_RAID0) > - return BTRFS_RAID_RAID0; > - else if (flags & BTRFS_BLOCK_GROUP_RAID5) > - return BTRFS_RAID_RAID5; > - else if (flags & BTRFS_BLOCK_GROUP_RAID6) > - return BTRFS_RAID_RAID6; > - > - return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */ > -} > - > int get_block_group_index(struct btrfs_block_group_cache *cache) > { > - return __get_raid_index(cache->flags); > + return btrfs_bg_flags_to_raid_index(cache->flags); > } > > static const char *btrfs_raid_type_names[BTRFS_NR_RAID_TYPES] = { > @@ -7483,7 +7465,7 @@ static noinline int find_free_extent(struct > btrfs_fs_info *fs_info, > u64 empty_cluster = 0; > struct btrfs_space_info *space_info; > int loop = 0; > - int index = __get_raid_index(flags); > + int index = btrfs_bg_flags_to_raid_index(flags); > bool failed_cluster_refill = false; > bool failed_alloc = false; > bool use_cluster = true; > @@ -7579,7 +7561,7 @@ static noinline int find_free_extent(struct > btrfs_fs_info *fs_info, > } > search: > have_caching_bg = false; > - if (index == 0 || index == __get_raid_index(flags)) > + if (index == 0 || index == btrfs_bg_flags_to_raid_index(flags)) > full_search = true; > down_read(_info->groups_sem); > list_for_each_entry(block_group, _info->block_groups[index], > @@ -9643,7 +9625,7 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, > u64 bytenr) >*/ > target = get_restripe_target(fs_info, block_group->flags); > if (target) { > - index = __get_raid_index(extended_to_chunk(target)); > + index = btrfs_bg_flags_to_raid_index(extended_to_chunk(target)); > } else { > /* >* this is just a balance, so if we were marked as full > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index a25684287501..d818b1f9c625 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -4625,7 +4625,7 @@ static int __btrfs_alloc_chunk(struct > btrfs_trans_handle *trans, > if (list_empty(_devices->alloc_list)) > return -ENOSPC; > > - index = __get_raid_index(type); > + index = btrfs_bg_flags_to_raid_index(type); > > sub_stripes = btrfs_raid_array[index].sub_stripes; > dev_stripes = btrfs_raid_array[index].dev_stripes; > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h > index ff15208344a7..413e07e44b42 100644 > --- a/fs/btrfs/volumes.h > +++ b/fs/btrfs/volumes.h > @@ -533,6 +533,24 @@ static inline void btrfs_dev_stat_reset(struct > btrfs_device *dev, > btrfs_dev_stat_set(dev, index, 0); > } > > +static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags) > +{ > + if (flags & BTRFS_BLOCK_GROUP_RAID10) > + return BTRFS_RAID_RAID10; > + else if
Re: [PATCH 1/3] btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()
On 01/30/2018 03:40 PM, Qu Wenruo wrote: Function __get_raid_index() is used to convert block group flags into raid index, which can be used to get various info directly from btrfs_raid_array[]. Refactor this function a little: 1) Rename to btrfs_bg_flags_to_raid_index() Double underline prefix is normally for internal functions, while the function is used by both extent-tree and volumes. Although the name is a little longer, but it should explain its usage quite well. 2) Move it to volumes.h and make it static inline Just several if-else branches, really no need to define it as a normal function. This also makes later code re-use between kernel and btrfs-progs easier. Signed-off-by: Qu WenruoReviewed-by: Anand Jain Thanks, Anand --- fs/btrfs/extent-tree.c | 26 -- fs/btrfs/volumes.c | 2 +- fs/btrfs/volumes.h | 18 ++ 3 files changed, 23 insertions(+), 23 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 2f4328511ac8..e9c31b567a9c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7346,27 +7346,9 @@ wait_block_group_cache_done(struct btrfs_block_group_cache *cache) return ret; } -int __get_raid_index(u64 flags) -{ - if (flags & BTRFS_BLOCK_GROUP_RAID10) - return BTRFS_RAID_RAID10; - else if (flags & BTRFS_BLOCK_GROUP_RAID1) - return BTRFS_RAID_RAID1; - else if (flags & BTRFS_BLOCK_GROUP_DUP) - return BTRFS_RAID_DUP; - else if (flags & BTRFS_BLOCK_GROUP_RAID0) - return BTRFS_RAID_RAID0; - else if (flags & BTRFS_BLOCK_GROUP_RAID5) - return BTRFS_RAID_RAID5; - else if (flags & BTRFS_BLOCK_GROUP_RAID6) - return BTRFS_RAID_RAID6; - - return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */ -} - int get_block_group_index(struct btrfs_block_group_cache *cache) { - return __get_raid_index(cache->flags); + return btrfs_bg_flags_to_raid_index(cache->flags); } static const char *btrfs_raid_type_names[BTRFS_NR_RAID_TYPES] = { @@ -7483,7 +7465,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, u64 empty_cluster = 0; struct btrfs_space_info *space_info; int loop = 0; - int index = __get_raid_index(flags); + int index = btrfs_bg_flags_to_raid_index(flags); bool failed_cluster_refill = false; bool failed_alloc = false; bool use_cluster = true; @@ -7579,7 +7561,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, } search: have_caching_bg = false; - if (index == 0 || index == __get_raid_index(flags)) + if (index == 0 || index == btrfs_bg_flags_to_raid_index(flags)) full_search = true; down_read(_info->groups_sem); list_for_each_entry(block_group, _info->block_groups[index], @@ -9643,7 +9625,7 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr) */ target = get_restripe_target(fs_info, block_group->flags); if (target) { - index = __get_raid_index(extended_to_chunk(target)); + index = btrfs_bg_flags_to_raid_index(extended_to_chunk(target)); } else { /* * this is just a balance, so if we were marked as full diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index a25684287501..d818b1f9c625 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4625,7 +4625,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, if (list_empty(_devices->alloc_list)) return -ENOSPC; - index = __get_raid_index(type); + index = btrfs_bg_flags_to_raid_index(type); sub_stripes = btrfs_raid_array[index].sub_stripes; dev_stripes = btrfs_raid_array[index].dev_stripes; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index ff15208344a7..413e07e44b42 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -533,6 +533,24 @@ static inline void btrfs_dev_stat_reset(struct btrfs_device *dev, btrfs_dev_stat_set(dev, index, 0); } +static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags) +{ + if (flags & BTRFS_BLOCK_GROUP_RAID10) + return BTRFS_RAID_RAID10; + else if (flags & BTRFS_BLOCK_GROUP_RAID1) + return BTRFS_RAID_RAID1; + else if (flags & BTRFS_BLOCK_GROUP_DUP) + return BTRFS_RAID_DUP; + else if (flags & BTRFS_BLOCK_GROUP_RAID0) + return BTRFS_RAID_RAID0; + else if (flags & BTRFS_BLOCK_GROUP_RAID5) + return BTRFS_RAID_RAID5; + else if (flags & BTRFS_BLOCK_GROUP_RAID6) + return BTRFS_RAID_RAID6; + + return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */ +} + void
Re: [PATCH 2/3] btrfs: Unexport get_block_group_index()
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e9c31b567a9c..6e1128aa29d6 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7346,7 +7346,8 @@ wait_block_group_cache_done(struct btrfs_block_group_cache *cache) return ret; } -int get_block_group_index(struct btrfs_block_group_cache *cache) +static enum btrfs_raid_types +get_block_group_index(struct btrfs_block_group_cache *cache) { return btrfs_bg_flags_to_raid_index(cache->flags); } This changed should rather be part of 1/3. Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
umount btrfs hang
Hi all, While trying umount a single btrfs volume i get stuck and get a call kernel trace. My kernel is 4.14.13 (archlinux). Unable to umount, tried to killed command, no way... what should i do ? Thanks. # btrfs fi df /mnt/vol3 Data, single: total=25.97GiB, used=6.84GiB System, DUP: total=8.00MiB, used=4.00KiB System, single: total=4.00MiB, used=0.00B Metadata, DUP: total=1.00GiB, used=111.44MiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=16.00MiB, used=0.00B - Jan 30 08:57:58 daneel.nbux.org kernel: INFO: task kworker/u16:5:4967 blocked for more than 120 seconds. Jan 30 08:57:58 daneel.nbux.org kernel: Tainted: G C 4.14.13-1-ARCH #1 Jan 30 08:57:58 daneel.nbux.org kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 30 08:57:58 daneel.nbux.org kernel: kworker/u16:5 D0 4967 2 0x8000 Jan 30 08:57:58 daneel.nbux.org kernel: Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: Call Trace: Jan 30 08:57:58 daneel.nbux.org kernel: ? __schedule+0x290/0x890 Jan 30 08:57:58 daneel.nbux.org kernel: schedule+0x2f/0x90 Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_tree_read_lock+0xb6/0x100 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: ? wait_woken+0x80/0x80 Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_read_lock_root_node+0x2f/0x40 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_search_slot+0x703/0x9f0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_lookup_inode+0x3a/0xc0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: ? kmem_cache_alloc+0x153/0x1a0 Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_iget+0x10e/0x6f0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: __lookup_free_space_inode+0x106/0x160 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: lookup_free_space_inode+0x63/0xe0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: load_free_space_cache+0x66/0x180 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: ? kmem_cache_alloc_trace+0x161/0x1b0 Jan 30 08:57:58 daneel.nbux.org kernel: cache_block_group+0x1c6/0x3e0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: ? wait_woken+0x80/0x80 Jan 30 08:57:58 daneel.nbux.org kernel: find_free_extent+0x63e/0xfb0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_reserve_extent+0x9b/0x180 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_alloc_tree_block+0x123/0x4c0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: __btrfs_cow_block+0x121/0x5b0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_cow_block+0xca/0x1c0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_search_slot+0x321/0x9f0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_del_csums+0x107/0x3c0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: __btrfs_free_extent.isra.61+0x49b/0xd60 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: __btrfs_run_delayed_refs+0x725/0x12f0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_run_delayed_refs+0x68/0x240 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: delayed_ref_async_start+0x8d/0xa0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_worker_helper+0x70/0x330 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: process_one_work+0x1db/0x410 Jan 30 08:57:58 daneel.nbux.org kernel: worker_thread+0x2b/0x3d0 Jan 30 08:57:58 daneel.nbux.org kernel: ? process_one_work+0x410/0x410 Jan 30 08:57:58 daneel.nbux.org kernel: kthread+0x118/0x130 Jan 30 08:57:58 daneel.nbux.org kernel: ? kthread_create_on_node+0x70/0x70 Jan 30 08:57:58 daneel.nbux.org kernel: ? SyS_exit+0x13/0x20 Jan 30 08:57:58 daneel.nbux.org kernel: ret_from_fork+0x1f/0x30 Jan 30 08:57:58 daneel.nbux.org kernel: INFO: task kworker/u16:11:13568 blocked for more than 120 seconds. Jan 30 08:57:58 daneel.nbux.org kernel: Tainted: G C 4.14.13-1-ARCH #1 Jan 30 08:57:58 daneel.nbux.org kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 30 08:57:58 daneel.nbux.org kernel: kworker/u16:11 D0 13568 2 0x8000 Jan 30 08:57:58 daneel.nbux.org kernel: Workqueue: btrfs-freespace-write btrfs_freespace_write_helper [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: Call Trace: Jan 30 08:57:58 daneel.nbux.org kernel: ? __schedule+0x290/0x890 Jan 30 08:57:58 daneel.nbux.org kernel: schedule+0x2f/0x90 Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_tree_lock+0xd1/0x1f0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: ? wait_woken+0x80/0x80 Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_search_slot+0x6e7/0x9f0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_mark_extent_written+0xb1/0xbb0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: ? lock_extent_bits+0x54/0x1b0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: ? join_transaction+0x10c/0x3f0 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: ? start_transaction+0x9e/0x410 [btrfs] Jan 30 08:57:58 daneel.nbux.org kernel: btrfs_finish_ordered_io+0x591/0x810 [btrfs] Jan 30 08:57:58