Re: [PATCH 0/2] Policy to balance read across mirrored devices

2018-01-30 Thread Peter Becker
A little question about mount -o read_mirror_policy=.

How would this work with RAID1 over 3 or 4 HDD's?
In particular, if the desired block is not available on device .
Could i repeat this option like the device-option to specify a
order/priority like this:

mount -o read_mirror_policy= 1,read_mirror_policy= 3

2018-01-30 7:30 GMT+01:00 Anand Jain :
> In case of RAID1 and RAID10 devices are mirror-ed, a read IO can
> pick any device for reading. This choice of picking a device for
> reading should be configurable. In short not one policy would
> satisfy all types of workload and configs.
>
> So before we add more policies, this patch-set makes existing
> $pid policy configurable from the mount option and adds a devid
> based read_mirror device policy. And keeps $pid based policy as
> the default option for now. So this mount option helps to try out
> different read mirror load balances.
>
> Further we can add more policies on top of it, for example..
>
>   mount -o read_mirror_policy=pid (current, default) [1]
>   mount -o read_mirror_policy= [2]
>   mount -o read_mirror_policy=lba (illustration only) [3]
>   mount -o read_mirror_policy=ssd (illustration only) [4]
>   mount -o read_mirror_policy=io (illustration only) [5]
>   mount -o read_mirror_policy= (illustration only) [6]
>
> [1]
>  Current PID based read mirror balance.
>
> [2]
>  Set the devid of the device which should be used for read. That
>  means all the normal read will go to that particular device only.
>  This also helps testing and gives a better control for the test
>  scripts including mount context reads.
>
> [3]
>  In case of SAN storage, some storage prefers that host access the
>  LUN based on the LBA so that there won't be duplications of
>  caching on the storage.
>
> [4]
>  In case of mix of SSD and HD we may want to use SSD as the primary
>  read device.
>
> [5]
>  If storage caching is not the bottleneck but the transport layer
>  is then read load should be tuned based on the IO load.
>
> [6]
>  Or a combination of any of above as per the priority.
>  Timofey should consider to base his patch on top of this.
> Btrfs: enchanse raid1/10 balance heuristic
>
> This patch set is on top of the preparatory patch set:
>   [PATCH 0/2] Preparatory to add read_mirror mount option
>
>
> Anand Jain (2):
>   btrfs: add mount option read_mirror_policy
>   btrfs: add read_mirror_policy parameter devid
>
>  fs/btrfs/ctree.h   |  2 ++
>  fs/btrfs/super.c   | 31 +++
>  fs/btrfs/volumes.c | 18 +-
>  fs/btrfs/volumes.h |  7 +++
>  4 files changed, 57 insertions(+), 1 deletion(-)
>
> --
> 2.7.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: volumes: Cleanup stripe size calculation

2018-01-30 Thread Nikolay Borisov


On 31.01.2018 08:16, Qu Wenruo wrote:
> Cleanup the following things:
> 1) open coded SZ_16M round up
> 2) use min() to replace open-coded size comparison
> 3) code style
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/volumes.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index cb0a8d27661b..90ad716d2b50 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4761,18 +4761,17 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>* and compare that answer with the max chunk size
>*/
>   if (stripe_size * data_stripes > max_chunk_size) {
> - u64 mask = (1ULL << 24) - 1;
> -
>   stripe_size = div_u64(max_chunk_size, data_stripes);
>  
>   /* bump the answer up to a 16MB boundary */
> - stripe_size = (stripe_size + mask) & ~mask;
> + stripe_size = round_up(stripe_size, SZ_16M);
>  
> - /* but don't go higher than the limits we found
> + /*
> +  * but don't go higher than the limits we found
>* while searching for free extents
>*/
> - if (stripe_size > devices_info[ndevs-1].max_avail)
> - stripe_size = devices_info[ndevs-1].max_avail;
> + stripe_size = min(devices_info[ndevs - 1].max_avail,
> +   stripe_size);
>   }
>  
>   stripe_size = div_u64(stripe_size, dev_stripes);
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH] btrfs: volumes: Remove the meaningless condition of minimal nr_devs when allocating a chunk

2018-01-30 Thread Nikolay Borisov


On 31.01.2018 07:56, Qu Wenruo wrote:
> When checking the minimal nr_devs, there is one dead and meaningless
> condition:
> 
> if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
> 
> 
> This condition is meaningless, @devs_increment has nothing to do with
> @sub_stripes.
> 
> In fact, in btrfs_raid_array[], profile with sub_stripes larger than 1
> (RAID10) already has the @devs_increment set to 2.
> So no need to multiple it by @sub_stripes.
> 
> And above condition is also dead.
> For RAID10, @devs_increment * @sub_stripes equals 4, which is also the
> @devs_min of RAID10.
> For other profiles, @sub_stripes is always 1, and since @ndevs is
> rounded down to @devs_increment, the condition will always be true.
> 
> Remove the meaningless condition to make later reader wander less.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

Quick question : What exactly is a substripe? Stripe is essentially how
many contiguous portions of disk are necessary to satisfy the profile,
right? So for raid1 we write 1 copy of the data per device (hence
dev_stripes = 1). For DUP we have 2 copies of the data on the same disk
hence dev_stripes 2. How does sub_stripes fit in the grand scheme of things?
> ---
>  fs/btrfs/volumes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 215e85e22c8e..cb0a8d27661b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4729,7 +4729,7 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   /* round down to number of usable stripes */
>   ndevs = round_down(ndevs, devs_increment);
>  
> - if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
> + if (ndevs < devs_min) {
>   ret = -ENOSPC;
>   goto error;
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/2] fix device orders consistency

2018-01-30 Thread Nikolay Borisov


On 31.01.2018 04:28, Anand Jain wrote:
> v3->v2:
>  rename device_sort() to device_cmp().
> v1->v2:
>  No code change. Change log updated to include the type
>  of problem that this consistency would help. And
>  I don't see patch 2/2 in the ML. So trying to resend.
> 
> By maintaining the device order (some) consistency it makes reproducing
> the missing chunk related problems more consistent. (More fixes of this
> sort is coming up).
> 
> Anand Jain (2):
>   btrfs: fix device order consistency
>   btrfs: fix alloc device order consistency
> 
>  fs/btrfs/volumes.c | 18 +-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 

For the whole series:

Reviewed-by: Nikolay Borisov 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: volumes: Cleanup stripe size calculation

2018-01-30 Thread Qu Wenruo
Cleanup the following things:
1) open coded SZ_16M round up
2) use min() to replace open-coded size comparison
3) code style

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/volumes.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index cb0a8d27661b..90ad716d2b50 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4761,18 +4761,17 @@ static int __btrfs_alloc_chunk(struct 
btrfs_trans_handle *trans,
 * and compare that answer with the max chunk size
 */
if (stripe_size * data_stripes > max_chunk_size) {
-   u64 mask = (1ULL << 24) - 1;
-
stripe_size = div_u64(max_chunk_size, data_stripes);
 
/* bump the answer up to a 16MB boundary */
-   stripe_size = (stripe_size + mask) & ~mask;
+   stripe_size = round_up(stripe_size, SZ_16M);
 
-   /* but don't go higher than the limits we found
+   /*
+* but don't go higher than the limits we found
 * while searching for free extents
 */
-   if (stripe_size > devices_info[ndevs-1].max_avail)
-   stripe_size = devices_info[ndevs-1].max_avail;
+   stripe_size = min(devices_info[ndevs - 1].max_avail,
+ stripe_size);
}
 
stripe_size = div_u64(stripe_size, dev_stripes);
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: volumes: Remove the meaningless condition of minimal nr_devs when allocating a chunk

2018-01-30 Thread Qu Wenruo
When checking the minimal nr_devs, there is one dead and meaningless
condition:

if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {


This condition is meaningless, @devs_increment has nothing to do with
@sub_stripes.

In fact, in btrfs_raid_array[], profile with sub_stripes larger than 1
(RAID10) already has the @devs_increment set to 2.
So no need to multiple it by @sub_stripes.

And above condition is also dead.
For RAID10, @devs_increment * @sub_stripes equals 4, which is also the
@devs_min of RAID10.
For other profiles, @sub_stripes is always 1, and since @ndevs is
rounded down to @devs_increment, the condition will always be true.

Remove the meaningless condition to make later reader wander less.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 215e85e22c8e..cb0a8d27661b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4729,7 +4729,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
/* round down to number of usable stripes */
ndevs = round_down(ndevs, devs_increment);
 
-   if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
+   if (ndevs < devs_min) {
ret = -ENOSPC;
goto error;
}
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 2/2] btrfs: fix alloc device order consistency

2018-01-30 Thread Anand Jain
Add opened device to the tail of dev_alloc_list instead of head, so that
it maintains the same order as dev_list.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0109f370ad5b..9c3b4ff7f505 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -708,7 +708,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
*fs_devices,
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state) &&
device->devid != BTRFS_DEV_REPLACE_DEVID) {
fs_devices->rw_devices++;
-   list_add(>dev_alloc_list, _devices->alloc_list);
+   list_add_tail(>dev_alloc_list, _devices->alloc_list);
}
brelse(bh);
 
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix null pointer dereference when replacing missing device

2018-01-30 Thread Anand Jain



On 01/31/2018 02:40 AM, fdman...@kernel.org wrote:

From: Filipe Manana 

When we are replacing a missing device we mount the filesystem with the
degraded mode option in which case we are allowed to have a btrfs device
structure without a backing device member (its bdev member is NULL) and
therefore we can't dereference that member. Commit 38b5f68e9811
("btrfs: drop btrfs_device::can_discard to query directly") started to
dereference that member when discarding extents, resulting in a null
pointer dereference:

  [ 3145.322257] BTRFS warning (device sdf): devid 2 uuid 
4d922414-58eb-4880-8fed-9c3840f6c5d5 is missing
  [ 3145.364116] BTRFS info (device sdf): dev_replace from  
(devid 2) to /dev/sdg started
  [ 3145.413489] BUG: unable to handle kernel NULL pointer dereference at 
00e0
  [ 3145.415085] IP: btrfs_discard_extent+0x6a/0xf8 [btrfs]
  [ 3145.415085] PGD 0 P4D 0
  [ 3145.415085] Oops:  [#1] PREEMPT SMP PTI
  [ 3145.415085] Modules linked in: ppdev ghash_clmulni_intel pcbc aesni_intel 
aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse parport_pc serio_raw 
i2c_piix4 i2
  [ 3145.415085] CPU: 0 PID: 11989 Comm: btrfs Tainted: GW
4.15.0-rc9-btrfs-next-55+ #1
  [ 3145.415085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
  [ 3145.415085] RIP: 0010:btrfs_discard_extent+0x6a/0xf8 [btrfs]
  [ 3145.415085] RSP: 0018:c90004813c60 EFLAGS: 00010293
  [ 3145.415085] RAX: 88020d39cc00 RBX: 88020c4ea2a0 RCX: 
0002
  [ 3145.415085] RDX:  RSI: 88020c4ea240 RDI: 

  [ 3145.415085] RBP:  R08: 4000 R09: 

  [ 3145.415085] R10: c90004813ae8 R11:  R12: 

  [ 3145.415085] R13: 88020c418000 R14:  R15: 

  [ 3145.415085] FS:  7f565681f8c0() GS:88023fc0() 
knlGS:
  [ 3145.415085] CS:  0010 DS:  ES:  CR0: 80050033
  [ 3145.415085] CR2: 00e0 CR3: 00020d208006 CR4: 
001606f0
  [ 3145.415085] Call Trace:
  [ 3145.415085]  btrfs_finish_extent_commit+0x9a/0x1be [btrfs]
  [ 3145.415085]  btrfs_commit_transaction+0x649/0x7a0 [btrfs]
  [ 3145.415085]  ? start_transaction+0x2b0/0x3b3 [btrfs]
  [ 3145.415085]  btrfs_dev_replace_start+0x274/0x30c [btrfs]
  [ 3145.415085]  btrfs_dev_replace_by_ioctl+0x45/0x59 [btrfs]
  [ 3145.415085]  btrfs_ioctl+0x1a91/0x1d62 [btrfs]
  [ 3145.415085]  ? lock_acquire+0x16a/0x1af
  [ 3145.415085]  ? vfs_ioctl+0x1b/0x28
  [ 3145.415085]  ? trace_hardirqs_on_caller+0x14c/0x1a6
  [ 3145.415085]  vfs_ioctl+0x1b/0x28
  [ 3145.415085]  do_vfs_ioctl+0x5a9/0x5e0
  [ 3145.415085]  ? _raw_spin_unlock_irq+0x34/0x46
  [ 3145.415085]  ? entry_SYSCALL_64_fastpath+0x5/0x8b
  [ 3145.415085]  ? trace_hardirqs_on_caller+0x14c/0x1a6
  [ 3145.415085]  SyS_ioctl+0x52/0x76
  [ 3145.415085]  entry_SYSCALL_64_fastpath+0x1e/0x8b
  [ 3145.415085] RIP: 0033:0x7f56558b3c47
  [ 3145.415085] RSP: 002b:7ffdcfac4c58 EFLAGS: 0202
  [ 3145.415085] Code: be 02 00 00 00 4c 89 ef e8 b9 e7 03 00 85 c0 89 c5 75 75 48 8b 
44 24 08 45 31 f6 48 8d 58 60 eb 52 48 8b 03 48 8b b8 a0 00 00 00 <48> 8b 87 e0 
00
  [ 3145.415085] RIP: btrfs_discard_extent+0x6a/0xf8 [btrfs] RSP: 
c90004813c60
  [ 3145.415085] CR2: 00e0
  [ 3145.458185] ---[ end trace 06302e7ac31902bf ]---

This is trivially reproduced by running the test btrfs/027 from fstests
like this:

   $ MOUNT_OPTIONS="-o discard" ./check btrfs/027

Fix this by skipping devices without a backing device before attempting
to discard.

Fixes: 38b5f68e9811 ("btrfs: drop btrfs_device::can_discard to query directly")
Signed-off-by: Filipe Manana 



 oh no, I missed the context of degraded when writing this patch.
 Thanks for the fix.

 Reviewed-by: Anand Jain 

Thanks, Anand


---
  fs/btrfs/extent-tree.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9d220b276c8f..d59ee24645e3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2147,6 +2147,10 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, 
u64 bytenr,
u64 bytes;
struct request_queue *req_q;
  
+			if (!stripe->dev->bdev) {

+   ASSERT(btrfs_test_opt(fs_info, DEGRADED));
+   continue;
+   }
req_q = bdev_get_queue(stripe->dev->bdev);
if (!blk_queue_discard(req_q))
continue;


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 7/7] btrfs-progs: Cleanup use of root in leaf_data_end

2018-01-30 Thread Gu Jinxiang
In function leaf_data_end, root is just used to get fs_info,
so change the parameter of this function from btrfs_root to
btrfs_fs_info.
And also make it consistent with kernel.

Changelog:
v3->v2:
Add const to parameter leaf of function btrfs_item_offset_nr
to keep type consistent with leaf_data_end.

Signed-off-by: Gu Jinxiang 
Reviewed-by: Qu Wenruo 
---
 ctree.c | 32 +---
 ctree.h |  2 +-
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/ctree.c b/ctree.c
index 4efc96f..45b368c 100644
--- a/ctree.c
+++ b/ctree.c
@@ -408,12 +408,12 @@ static int btrfs_comp_keys(struct btrfs_disk_key *disk, 
struct btrfs_key *k2)
  * this returns the address of the start of the last item,
  * which is the stop of the leaf data stack
  */
-static inline unsigned int leaf_data_end(struct btrfs_root *root,
-struct extent_buffer *leaf)
+static inline unsigned int leaf_data_end(const struct btrfs_fs_info *fs_info,
+const struct extent_buffer *leaf)
 {
u32 nr = btrfs_header_nritems(leaf);
if (nr == 0)
-   return BTRFS_LEAF_DATA_SIZE(root->fs_info);
+   return BTRFS_LEAF_DATA_SIZE(fs_info);
return btrfs_item_offset_nr(leaf, nr - 1);
 }
 
@@ -1735,10 +1735,10 @@ static int push_leaf_right(struct btrfs_trans_handle 
*trans, struct btrfs_root
right_nritems = btrfs_header_nritems(right);
 
push_space = btrfs_item_end_nr(left, left_nritems - push_items);
-   push_space -= leaf_data_end(root, left);
+   push_space -= leaf_data_end(fs_info, left);
 
/* make room in the right data area */
-   data_end = leaf_data_end(root, right);
+   data_end = leaf_data_end(fs_info, right);
memmove_extent_buffer(right,
  btrfs_leaf_data(right) + data_end - push_space,
  btrfs_leaf_data(right) + data_end,
@@ -1747,7 +1747,7 @@ static int push_leaf_right(struct btrfs_trans_handle 
*trans, struct btrfs_root
/* copy from the left data area */
copy_extent_buffer(right, left, btrfs_leaf_data(right) +
 BTRFS_LEAF_DATA_SIZE(root->fs_info) - push_space,
-btrfs_leaf_data(left) + leaf_data_end(root, left),
+btrfs_leaf_data(left) + leaf_data_end(fs_info, left),
 push_space);
 
memmove_extent_buffer(right, btrfs_item_nr_offset(push_items),
@@ -1885,7 +1885,7 @@ static int push_leaf_left(struct btrfs_trans_handle 
*trans, struct btrfs_root
 btrfs_item_offset_nr(right, push_items -1);
 
copy_extent_buffer(left, right, btrfs_leaf_data(left) +
-leaf_data_end(root, left) - push_space,
+leaf_data_end(fs_info, left) - push_space,
 btrfs_leaf_data(right) +
 btrfs_item_offset_nr(right, push_items - 1),
 push_space);
@@ -1912,12 +1912,13 @@ static int push_leaf_left(struct btrfs_trans_handle 
*trans, struct btrfs_root
 
if (push_items < right_nritems) {
push_space = btrfs_item_offset_nr(right, push_items - 1) -
- leaf_data_end(root, right);
+ leaf_data_end(fs_info, right);
memmove_extent_buffer(right, btrfs_leaf_data(right) +
  BTRFS_LEAF_DATA_SIZE(root->fs_info) -
  push_space,
  btrfs_leaf_data(right) +
- leaf_data_end(root, right), push_space);
+ leaf_data_end(fs_info, right),
+ push_space);
 
memmove_extent_buffer(right, btrfs_item_nr_offset(0),
  btrfs_item_nr_offset(push_items),
@@ -1976,7 +1977,8 @@ static noinline int copy_for_split(struct 
btrfs_trans_handle *trans,
 
nritems = nritems - mid;
btrfs_set_header_nritems(right, nritems);
-   data_copy_size = btrfs_item_end_nr(l, mid) - leaf_data_end(root, l);
+   data_copy_size = btrfs_item_end_nr(l, mid) -
+   leaf_data_end(root->fs_info, l);
 
copy_extent_buffer(right, l, btrfs_item_nr_offset(0),
   btrfs_item_nr_offset(mid),
@@ -1986,7 +1988,7 @@ static noinline int copy_for_split(struct 
btrfs_trans_handle *trans,
 btrfs_leaf_data(right) +
 BTRFS_LEAF_DATA_SIZE(root->fs_info) -
 data_copy_size, btrfs_leaf_data(l) +
-leaf_data_end(root, l), data_copy_size);
+leaf_data_end(root->fs_info, l), data_copy_size);
 
rt_data_off = BTRFS_LEAF_DATA_SIZE(root->fs_info) -
  btrfs_item_end_nr(l, mid);
@@ -2324,7 

[PATCH v3 3/7] btrfs-progs: Sync code with kernel for BTRFS_MAX_INLINE_DATA_SIZE

2018-01-30 Thread Gu Jinxiang
Do a cleanup. Also make it consistent with kernel.
Use fs_info instead of root for BTRFS_MAX_INLINE_DATA_SIZE, since
maybe in some situation we do not know root, but just know fs_info.

Changelog:
v2->v1:
Change macro to inline function to be consistent with kernel.
And change the function body to match kernel.
v3->v2:
Change the title of this commit to make it more exact.

Signed-off-by: Gu Jinxiang 
Reviewed-by: Qu Wenruo 
---
 convert/source-ext2.c |  2 +-
 convert/source-reiserfs.c |  2 +-
 ctree.h   | 17 ++---
 mkfs/main.c   |  4 ++--
 4 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/convert/source-ext2.c b/convert/source-ext2.c
index e5c2a94..f5ecd8c 100644
--- a/convert/source-ext2.c
+++ b/convert/source-ext2.c
@@ -309,7 +309,7 @@ static int ext2_create_file_extents(struct 
btrfs_trans_handle *trans,
goto fail;
if ((convert_flags & CONVERT_FLAG_INLINE_DATA) && data.first_block == 0
&& data.num_blocks > 0
-   && inode_size <= BTRFS_MAX_INLINE_DATA_SIZE(root)) {
+   && inode_size <= BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info)) {
u64 num_bytes = data.num_blocks * sectorsize;
u64 disk_bytenr = data.disk_block * sectorsize;
u64 nbytes;
diff --git a/convert/source-reiserfs.c b/convert/source-reiserfs.c
index e3582bd..39d6f07 100644
--- a/convert/source-reiserfs.c
+++ b/convert/source-reiserfs.c
@@ -376,7 +376,7 @@ static int reiserfs_convert_tail(struct btrfs_trans_handle 
*trans,
u64 isize;
int ret;
 
-   if (length >= BTRFS_MAX_INLINE_DATA_SIZE(root))
+   if (length >= BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info))
return convert_direct(trans, root, objectid, inode, body,
  length, offset, convert_flags);
 
diff --git a/ctree.h b/ctree.h
index f3964f1..7be48e2 100644
--- a/ctree.h
+++ b/ctree.h
@@ -359,9 +359,6 @@ struct btrfs_header {
 #define __BTRFS_LEAF_DATA_SIZE(bs) ((bs) - sizeof(struct btrfs_header))
 #define BTRFS_LEAF_DATA_SIZE(fs_info) \
(__BTRFS_LEAF_DATA_SIZE(fs_info->nodesize))
-#define BTRFS_MAX_INLINE_DATA_SIZE(r) (BTRFS_LEAF_DATA_SIZE(r->fs_info) - \
-   sizeof(struct btrfs_item) - \
-   sizeof(struct btrfs_file_extent_item))
 #define BTRFS_MAX_XATTR_SIZE(r)(BTRFS_LEAF_DATA_SIZE(r->fs_info) - \
 sizeof(struct btrfs_item) -\
 sizeof(struct btrfs_dir_item))
@@ -1188,10 +1185,24 @@ struct btrfs_root {
struct rb_node rb_node;
 };
 
+static inline u32 BTRFS_MAX_ITEM_SIZE(const struct btrfs_fs_info *info)
+{
+   return BTRFS_LEAF_DATA_SIZE(info) - sizeof(struct btrfs_item);
+}
+
 static inline u32 BTRFS_NODEPTRS_PER_BLOCK(const struct btrfs_fs_info *info)
 {
return BTRFS_LEAF_DATA_SIZE(info) / sizeof(struct btrfs_key_ptr);
 }
+
+#define BTRFS_FILE_EXTENT_INLINE_DATA_START\
+   (offsetof(struct btrfs_file_extent_item, disk_bytenr))
+static inline u32 BTRFS_MAX_INLINE_DATA_SIZE(const struct btrfs_fs_info *info)
+{
+   return BTRFS_MAX_ITEM_SIZE(info) -
+   BTRFS_FILE_EXTENT_INLINE_DATA_START;
+}
+
 /*
  * inode items have the data typically returned from stat and store other
  * info about object characteristics.  There is one for every file and dir in
diff --git a/mkfs/main.c b/mkfs/main.c
index d817ad8..a301efc 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -498,7 +498,7 @@ static int fill_inode_item(struct btrfs_trans_handle *trans,
}
if (S_ISREG(src->st_mode)) {
btrfs_set_stack_inode_size(dst, (u64)src->st_size);
-   if (src->st_size <= BTRFS_MAX_INLINE_DATA_SIZE(root))
+   if (src->st_size <= BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info))
btrfs_set_stack_inode_nbytes(dst, src->st_size);
else {
blocks = src->st_size / sectorsize;
@@ -686,7 +686,7 @@ static int add_file_items(struct btrfs_trans_handle *trans,
if (st->st_size % sectorsize)
blocks += 1;
 
-   if (st->st_size <= BTRFS_MAX_INLINE_DATA_SIZE(root)) {
+   if (st->st_size <= BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info)) {
char *buffer = malloc(st->st_size);
 
if (!buffer) {
-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 2/7] btrfs-progs: Use fs_info instead of root for BTRFS_NODEPTRS_PER_BLOCK

2018-01-30 Thread Gu Jinxiang
Do a cleanup. Also make it consistent with kernel.
Use fs_info instead of root for BTRFS_NODEPTRS_PER_BLOCK, since
maybe in some situation we do not know root, but just know fs_info.

Changelog:
v2->v1:
To be consistent with kernel, change macro to inline function.

Signed-off-by: Gu Jinxiang 
Reviewed-by: Qu Wenruo 
---
 cmds-check.c |  4 ++--
 ctree.c  | 18 +-
 ctree.h  |  7 ---
 print-tree.c |  2 +-
 quick-test.c |  2 +-
 5 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 8d54564..5495dbe 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2517,7 +2517,7 @@ static void account_bytes(struct btrfs_root *root, struct 
btrfs_path *path,
if (level == 0) {
btree_space_waste += btrfs_leaf_free_space(root, eb);
} else {
-   free_nrs = (BTRFS_NODEPTRS_PER_BLOCK(root) -
+   free_nrs = (BTRFS_NODEPTRS_PER_BLOCK(root->fs_info) -
btrfs_header_nritems(eb));
btree_space_waste += free_nrs * sizeof(struct btrfs_key_ptr);
}
@@ -9475,7 +9475,7 @@ static int run_next_block(struct btrfs_root *root,
add_pending(pending, seen, ptr, size);
}
}
-   btree_space_waste += (BTRFS_NODEPTRS_PER_BLOCK(root) -
+   btree_space_waste += (BTRFS_NODEPTRS_PER_BLOCK(fs_info) -
  nritems) * sizeof(struct btrfs_key_ptr);
}
total_btree_bytes += buf->len;
diff --git a/ctree.c b/ctree.c
index 6d130f5..4efc96f 100644
--- a/ctree.c
+++ b/ctree.c
@@ -427,7 +427,7 @@ btrfs_check_node(struct btrfs_root *root, struct 
btrfs_disk_key *parent_key,
u32 nritems = btrfs_header_nritems(buf);
enum btrfs_tree_block_status ret = BTRFS_TREE_BLOCK_INVALID_NRITEMS;
 
-   if (nritems == 0 || nritems > BTRFS_NODEPTRS_PER_BLOCK(root))
+   if (nritems == 0 || nritems > BTRFS_NODEPTRS_PER_BLOCK(root->fs_info))
goto fail;
 
ret = BTRFS_TREE_BLOCK_INVALID_PARENT_KEY;
@@ -714,7 +714,7 @@ static int balance_level(struct btrfs_trans_handle *trans,
return ret;
}
if (btrfs_header_nritems(mid) >
-   BTRFS_NODEPTRS_PER_BLOCK(root) / 4)
+   BTRFS_NODEPTRS_PER_BLOCK(fs_info) / 4)
return 0;
 
left = read_node_slot(fs_info, parent, pslot - 1);
@@ -882,7 +882,7 @@ static int noinline push_nodes_for_insert(struct 
btrfs_trans_handle *trans,
if (extent_buffer_uptodate(left)) {
u32 left_nr;
left_nr = btrfs_header_nritems(left);
-   if (left_nr >= BTRFS_NODEPTRS_PER_BLOCK(root) - 1) {
+   if (left_nr >= BTRFS_NODEPTRS_PER_BLOCK(fs_info) - 1) {
wret = 1;
} else {
ret = btrfs_cow_block(trans, root, left, parent,
@@ -925,7 +925,7 @@ static int noinline push_nodes_for_insert(struct 
btrfs_trans_handle *trans,
if (extent_buffer_uptodate(right)) {
u32 right_nr;
right_nr = btrfs_header_nritems(right);
-   if (right_nr >= BTRFS_NODEPTRS_PER_BLOCK(root) - 1) {
+   if (right_nr >= BTRFS_NODEPTRS_PER_BLOCK(root->fs_info) - 1) {
wret = 1;
} else {
ret = btrfs_cow_block(trans, root, right,
@@ -1144,7 +1144,7 @@ again:
p->slots[level] = slot;
if ((p->search_for_split || ins_len > 0) &&
btrfs_header_nritems(b) >=
-   BTRFS_NODEPTRS_PER_BLOCK(root) - 3) {
+   BTRFS_NODEPTRS_PER_BLOCK(fs_info) - 3) {
int sret = split_node(trans, root, p, level);
BUG_ON(sret > 0);
if (sret)
@@ -1290,7 +1290,7 @@ static int push_node_left(struct btrfs_trans_handle 
*trans,
 
src_nritems = btrfs_header_nritems(src);
dst_nritems = btrfs_header_nritems(dst);
-   push_items = BTRFS_NODEPTRS_PER_BLOCK(root) - dst_nritems;
+   push_items = BTRFS_NODEPTRS_PER_BLOCK(root->fs_info) - dst_nritems;
WARN_ON(btrfs_header_generation(src) != trans->transid);
WARN_ON(btrfs_header_generation(dst) != trans->transid);
 
@@ -1360,7 +1360,7 @@ static int balance_node_right(struct btrfs_trans_handle 
*trans,
 
src_nritems = btrfs_header_nritems(src);
dst_nritems = btrfs_header_nritems(dst);
-   push_items = BTRFS_NODEPTRS_PER_BLOCK(root) - dst_nritems;
+   push_items = BTRFS_NODEPTRS_PER_BLOCK(root->fs_info) - dst_nritems;
if (push_items <= 0) {
return 1;
}
@@ -1488,7 +1488,7 @@ static int insert_ptr(struct btrfs_trans_handle *trans, 
struct btrfs_root
nritems = btrfs_header_nritems(lower);

[PATCH v3 1/7] btrfs-progs: Use fs_info instead of root for BTRFS_LEAF_DATA_SIZE

2018-01-30 Thread Gu Jinxiang
Do a cleanup. Also make it consistent with kernel.
Use fs_info instead of root for BTRFS_LEAF_DATA_SIZE, since
maybe in some situation we do not know root, but just know fs_info.

Signed-off-by: Gu Jinxiang 
Reviewed-by: Qu Wenruo 
---
 cmds-check.c  |  6 +++---
 convert/source-ext2.c |  2 +-
 convert/source-reiserfs.c |  4 ++--
 ctree.c   | 49 +++
 ctree.h   | 10 ++
 file-item.c   |  2 +-
 volumes.c |  2 +-
 7 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 84803a2..8d54564 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -7234,9 +7234,9 @@ again:
unsigned int shift = 0, offset;
 
if (i == 0 && btrfs_item_end_nr(buf, i) !=
-   BTRFS_LEAF_DATA_SIZE(root)) {
+   BTRFS_LEAF_DATA_SIZE(root->fs_info)) {
if (btrfs_item_end_nr(buf, i) >
-   BTRFS_LEAF_DATA_SIZE(root)) {
+   BTRFS_LEAF_DATA_SIZE(root->fs_info)) {
ret = delete_bogus_item(root, path, buf, i);
if (!ret)
goto again;
@@ -7245,7 +7245,7 @@ again:
ret = -EIO;
break;
}
-   shift = BTRFS_LEAF_DATA_SIZE(root) -
+   shift = BTRFS_LEAF_DATA_SIZE(root->fs_info) -
btrfs_item_end_nr(buf, i);
} else if (i > 0 && btrfs_item_end_nr(buf, i) !=
   btrfs_item_offset_nr(buf, i - 1)) {
diff --git a/convert/source-ext2.c b/convert/source-ext2.c
index e927721..e5c2a94 100644
--- a/convert/source-ext2.c
+++ b/convert/source-ext2.c
@@ -520,7 +520,7 @@ static int ext2_copy_single_xattr(struct btrfs_trans_handle 
*trans,
}
strncpy(namebuf, xattr_prefix_table[name_index], XATTR_NAME_MAX);
strncat(namebuf, EXT2_EXT_ATTR_NAME(entry), entry->e_name_len);
-   if (name_len + datalen > BTRFS_LEAF_DATA_SIZE(root) -
+   if (name_len + datalen > BTRFS_LEAF_DATA_SIZE(root->fs_info) -
sizeof(struct btrfs_item) - sizeof(struct btrfs_dir_item)) {
fprintf(stderr, "skip large xattr on inode %Lu name %.*s\n",
objectid - INO_OFFSET, name_len, namebuf);
diff --git a/convert/source-reiserfs.c b/convert/source-reiserfs.c
index be79d8e..e3582bd 100644
--- a/convert/source-reiserfs.c
+++ b/convert/source-reiserfs.c
@@ -676,7 +676,7 @@ static int reiserfs_xattr_indirect_fn(reiserfs_filsys_t fs, 
u64 position,
size_t alloc = min(position + num_blocks * fs->fs_blocksize, size);
char *body;
 
-   if (size > BTRFS_LEAF_DATA_SIZE(xa_data->root) -
+   if (size > BTRFS_LEAF_DATA_SIZE(xa_data->root->fs_info) -
sizeof(struct btrfs_item) - sizeof(struct btrfs_dir_item)) {
fprintf(stderr, "skip large xattr on objectid %llu name %.*s\n",
xa_data->target_oid, (int)xa_data->namelen,
@@ -714,7 +714,7 @@ static int reiserfs_xattr_direct_fn(reiserfs_filsys_t fs, 
__u64 position,
struct reiserfs_xattr_data *xa_data = data;
char *newbody;
 
-   if (size > BTRFS_LEAF_DATA_SIZE(xa_data->root) -
+   if (size > BTRFS_LEAF_DATA_SIZE(xa_data->root->fs_info) -
sizeof(struct btrfs_item) - sizeof(struct btrfs_dir_item)) {
fprintf(stderr, "skip large xattr on objectid %llu name %.*s\n",
xa_data->target_oid, (int)xa_data->namelen,
diff --git a/ctree.c b/ctree.c
index 4fc33b1..6d130f5 100644
--- a/ctree.c
+++ b/ctree.c
@@ -413,7 +413,7 @@ static inline unsigned int leaf_data_end(struct btrfs_root 
*root,
 {
u32 nr = btrfs_header_nritems(leaf);
if (nr == 0)
-   return BTRFS_LEAF_DATA_SIZE(root);
+   return BTRFS_LEAF_DATA_SIZE(root->fs_info);
return btrfs_item_offset_nr(leaf, nr - 1);
 }
 
@@ -515,24 +515,26 @@ btrfs_check_leaf(struct btrfs_root *root, struct 
btrfs_disk_key *parent_key,
goto fail;
}
if (i == 0 && btrfs_item_end_nr(buf, i) !=
-   BTRFS_LEAF_DATA_SIZE(root)) {
+   BTRFS_LEAF_DATA_SIZE(root->fs_info)) {
ret = BTRFS_TREE_BLOCK_INVALID_OFFSETS;
fprintf(stderr, "bad item end %u wanted %u\n",
btrfs_item_end_nr(buf, i),
-   (unsigned)BTRFS_LEAF_DATA_SIZE(root));
+   (unsigned)BTRFS_LEAF_DATA_SIZE(root->fs_info));
goto fail;
}
}
 
for (i = 0; i < nritems; i++) {
-   if (btrfs_item_end_nr(buf, i) > 

[PATCH v3 5/7] btrfs-progs: do clean up for redundancy value assignment

2018-01-30 Thread Gu Jinxiang
Although skinny_metadata's type is int, its value just can be 0/1.
And if condition be true only when skinny_metadata equals 1,
so in if's executive part, set skinny_metadata to 1 is redundancy.
Remove it.

Signed-off-by: Gu Jinxiang 
Reviewed-by: Qu Wenruo 
---
 extent-tree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/extent-tree.c b/extent-tree.c
index 055582c..9201c56 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1005,7 +1005,6 @@ static int lookup_inline_extent_backref(struct 
btrfs_trans_handle *trans,
extra_size = -1;
 
if (owner < BTRFS_FIRST_FREE_OBJECTID && skinny_metadata) {
-   skinny_metadata = 1;
key.type = BTRFS_METADATA_ITEM_KEY;
key.offset = owner;
} else if (skinny_metadata) {
-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 6/7] btrfs-progs: remove no longer be used btrfs_alloc_extent

2018-01-30 Thread Gu Jinxiang
Do a cleanup.
Function btrfs_alloc_extent is no longer be used.
So let's remove it.

Signed-off-by: Gu Jinxiang 
Reviewed-by: Qu Wenruo 
---
 ctree.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/ctree.h b/ctree.h
index 679bbc9..41b14b5 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2501,12 +2501,6 @@ struct extent_buffer *btrfs_alloc_free_block(struct 
btrfs_trans_handle *trans,
u32 blocksize, u64 root_objectid,
struct btrfs_disk_key *key, int level,
u64 hint, u64 empty_size);
-int btrfs_alloc_extent(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root,
-  u64 num_bytes, u64 parent,
-  u64 root_objectid, u64 ref_generation,
-  u64 owner, u64 empty_size, u64 hint_byte,
-  u64 search_end, struct btrfs_key *ins, int data);
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 struct btrfs_root *root, u64 bytenr,
 u64 offset, int metadata, u64 *refs, u64 *flags);
-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 0/7] btrfs-progs: Do some clean up to be consistent with kernel

2018-01-30 Thread Gu Jinxiang
Patch 1~4 and 7: clean up use of btrfs_root.
Beside it, syncs code with kernel.
Patch 5: remove redundancy value assignment.
Patch 6: remove no longer be used function define.

Changelog:
v2->v1:
Patch 2,4: To be consistent with kernel, change macro to inline function.
Patch 3:
Change macro to inline function to be consistent with kernel.
And change the function body to match kernel.
v3->v2: 
Patch 3: Change the title to make it more exact.
Patch 7: 
Add const to parameter leaf of function btrfs_item_offset_nr
to keep type consistent with leaf_data_end.

Gu Jinxiang (7):
  btrfs-progs: Use fs_info instead of root for BTRFS_LEAF_DATA_SIZE
  btrfs-progs: Use fs_info instead of root for BTRFS_NODEPTRS_PER_BLOCK
  btrfs-progs: Sync code with kernel for BTRFS_MAX_INLINE_DATA_SIZE
  btrfs-progs: Use fs_info instead of root for BTRFS_MAX_XATTR_SIZE
  btrfs-progs: do clean up for redundancy value assignment
  btrfs-progs: remove no longer be used btrfs_alloc_extent
  btrfs-progs: Cleanup use of root in leaf_data_end

 cmds-check.c  | 10 ++---
 convert/source-ext2.c |  4 +-
 convert/source-reiserfs.c |  6 +--
 ctree.c   | 97 ++-
 ctree.h   | 47 +--
 dir-item.c|  3 +-
 extent-tree.c |  1 -
 file-item.c   |  2 +-
 mkfs/main.c   |  4 +-
 print-tree.c  |  2 +-
 quick-test.c  |  2 +-
 volumes.c |  2 +-
 12 files changed, 99 insertions(+), 81 deletions(-)

-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 4/7] btrfs-progs: Use fs_info instead of root for BTRFS_MAX_XATTR_SIZE

2018-01-30 Thread Gu Jinxiang
Do a cleanup. Also make it consistent with kernel.
Use fs_info instead of root for BTRFS_MAX_XATTR_SIZE, since
maybe in some situation we do not know root, but just know fs_info.

Changelog:
v2->v1:
To be consistent with kernel, change macro to inline function.

Signed-off-by: Gu Jinxiang 
Reviewed-by: Qu Wenruo 
---
 ctree.h| 9 +
 dir-item.c | 3 ++-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/ctree.h b/ctree.h
index 7be48e2..679bbc9 100644
--- a/ctree.h
+++ b/ctree.h
@@ -359,10 +359,6 @@ struct btrfs_header {
 #define __BTRFS_LEAF_DATA_SIZE(bs) ((bs) - sizeof(struct btrfs_header))
 #define BTRFS_LEAF_DATA_SIZE(fs_info) \
(__BTRFS_LEAF_DATA_SIZE(fs_info->nodesize))
-#define BTRFS_MAX_XATTR_SIZE(r)(BTRFS_LEAF_DATA_SIZE(r->fs_info) - \
-sizeof(struct btrfs_item) -\
-sizeof(struct btrfs_dir_item))
-
 
 /*
  * this is a very generous portion of the super block, giving us
@@ -1203,6 +1199,11 @@ static inline u32 BTRFS_MAX_INLINE_DATA_SIZE(const 
struct btrfs_fs_info *info)
BTRFS_FILE_EXTENT_INLINE_DATA_START;
 }
 
+static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
+{
+   return BTRFS_MAX_ITEM_SIZE(info) - sizeof(struct btrfs_dir_item);
+}
+
 /*
  * inode items have the data typically returned from stat and store other
  * info about object characteristics.  There is one for every file and dir in
diff --git a/dir-item.c b/dir-item.c
index 462546c..0b7250c 100644
--- a/dir-item.c
+++ b/dir-item.c
@@ -311,7 +311,8 @@ static int verify_dir_item(struct btrfs_root *root,
 
/* BTRFS_MAX_XATTR_SIZE is the same for all dir items */
if ((btrfs_dir_data_len(leaf, dir_item) +
-btrfs_dir_name_len(leaf, dir_item)) > BTRFS_MAX_XATTR_SIZE(root)) {
+btrfs_dir_name_len(leaf, dir_item)) >
+   BTRFS_MAX_XATTR_SIZE(root->fs_info)) {
fprintf(stderr, "invalid dir item name + data len: %u + %u\n",
   (unsigned)btrfs_dir_name_len(leaf, dir_item),
   (unsigned)btrfs_dir_data_len(leaf, dir_item));
-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: fsck-tests: Rename tree-reloc-tree test number

2018-01-30 Thread Qu Wenruo
There are 2 fsck tests with the same number 027:
tree-reloc-tree
bad-extent-inline-ref-type

And we also have a hole in 015, so just rename tree-reloc-tree to 015,
to get rid of the duplicated test number and fill in the hole.

Signed-off-by: Qu Wenruo 
---
 .../{027-tree-reloc-tree => 015-tree-reloc-tree}/test.sh|   0
 .../tree_reloc_for_data_reloc.img.xz| Bin
 .../tree_reloc_for_fs_tree.img.xz   | Bin
 3 files changed, 0 insertions(+), 0 deletions(-)
 rename tests/fsck-tests/{027-tree-reloc-tree => 015-tree-reloc-tree}/test.sh 
(100%)
 rename tests/fsck-tests/{027-tree-reloc-tree => 
015-tree-reloc-tree}/tree_reloc_for_data_reloc.img.xz (100%)
 rename tests/fsck-tests/{027-tree-reloc-tree => 
015-tree-reloc-tree}/tree_reloc_for_fs_tree.img.xz (100%)

diff --git a/tests/fsck-tests/027-tree-reloc-tree/test.sh 
b/tests/fsck-tests/015-tree-reloc-tree/test.sh
similarity index 100%
rename from tests/fsck-tests/027-tree-reloc-tree/test.sh
rename to tests/fsck-tests/015-tree-reloc-tree/test.sh
diff --git 
a/tests/fsck-tests/027-tree-reloc-tree/tree_reloc_for_data_reloc.img.xz 
b/tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_data_reloc.img.xz
similarity index 100%
rename from 
tests/fsck-tests/027-tree-reloc-tree/tree_reloc_for_data_reloc.img.xz
rename to tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_data_reloc.img.xz
diff --git a/tests/fsck-tests/027-tree-reloc-tree/tree_reloc_for_fs_tree.img.xz 
b/tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_fs_tree.img.xz
similarity index 100%
rename from tests/fsck-tests/027-tree-reloc-tree/tree_reloc_for_fs_tree.img.xz
rename to tests/fsck-tests/015-tree-reloc-tree/tree_reloc_for_fs_tree.img.xz
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 0/2] fix device orders consistency

2018-01-30 Thread Anand Jain
v3->v2:
 rename device_sort() to device_cmp().
v1->v2:
 No code change. Change log updated to include the type
 of problem that this consistency would help. And
 I don't see patch 2/2 in the ML. So trying to resend.

By maintaining the device order (some) consistency it makes reproducing
the missing chunk related problems more consistent. (More fixes of this
sort is coming up).

Anand Jain (2):
  btrfs: fix device order consistency
  btrfs: fix alloc device order consistency

 fs/btrfs/volumes.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

-- 
2.7.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 1/2] btrfs: fix device order consistency

2018-01-30 Thread Anand Jain
By maintaining the device order consistency it makes reproducing
the problems related to missing chunk in the degraded mode much more
consistent. So fix this by sorting the devices by devid within the
kernel. So that we know which device is assigned to the struct
fs_info::latest_bdev when all the devices are having and same
SB generation.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b5036bd69e6a..0109f370ad5b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -1102,6 +1103,20 @@ static int __btrfs_open_devices(struct btrfs_fs_devices 
*fs_devices,
return ret;
 }
 
+static int device_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+   struct btrfs_device *dev1, *dev2;
+
+   dev1 = list_entry(a, struct btrfs_device, dev_list);
+   dev2 = list_entry(b, struct btrfs_device, dev_list);
+
+   if (dev1->devid < dev2->devid)
+   return -1;
+   else if (dev1->devid > dev2->devid)
+   return 1;
+   return 0;
+}
+
 int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
   fmode_t flags, void *holder)
 {
@@ -1112,6 +1127,7 @@ int btrfs_open_devices(struct btrfs_fs_devices 
*fs_devices,
fs_devices->opened++;
ret = 0;
} else {
+   list_sort(NULL, _devices->devices, device_cmp);
ret = __btrfs_open_devices(fs_devices, flags, holder);
}
mutex_unlock(_mutex);
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] btrfs: fix device order consistency

2018-01-30 Thread Anand Jain


  
+static int device_sort(void *priv, struct list_head *a, struct list_head *b)


I'll rename that to devid_cmp as it's the comparator and not really a
sorting function.


 Ok. I will make that change in v3.

Thanks,
Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: print error if primary super block write fails

2018-01-30 Thread Qu Wenruo


On 2018年01月31日 06:33, Howard McLauchlan wrote:
> Presently, failing a primary super block write but succeeding in at
> least one super block write in general will appear to users as if
> nothing important went wrong. However, upon unmounting and re-mounting,
> the file system will be in a rolled back state. This was discovered
> with a BCC program that uses bpf_override_return() to fail super block
> writes.
> 
> This patch outputs an error clarifying that the primary super block
> write has failed, so users can expect potentially erroneous behaviour.
> It also forces wait_dev_supers() to return an error to its caller if
> the primary super block write fails.
> 
> Signed-off-by: Howard McLauchlan 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
> V2: Added devid to output, removed unnecessary fs_info parameter
> 
>  fs/btrfs/disk-io.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5da18ebc9222..6d98f2f21d5f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3298,6 +3298,7 @@ static int wait_dev_supers(struct btrfs_device *device, 
> int max_mirrors)
>   struct buffer_head *bh;
>   int i;
>   int errors = 0;
> + bool primary_failed = false;
>   u64 bytenr;
>  
>   if (max_mirrors == 0)
> @@ -3314,11 +3315,14 @@ static int wait_dev_supers(struct btrfs_device 
> *device, int max_mirrors)
> BTRFS_SUPER_INFO_SIZE);
>   if (!bh) {
>   errors++;
> + primary_failed = (i == 0) || primary_failed;
>   continue;
>   }
>   wait_on_buffer(bh);
> - if (!buffer_uptodate(bh))
> + if (!buffer_uptodate(bh)) {
>   errors++;
> + primary_failed = (i == 0) || primary_failed;
> + }
>  
>   /* drop our reference */
>   brelse(bh);
> @@ -3327,6 +3331,13 @@ static int wait_dev_supers(struct btrfs_device 
> *device, int max_mirrors)
>   brelse(bh);
>   }
>  
> + /* log error, force error return */
> + if (primary_failed) {
> + btrfs_err(device->fs_info, "error writing primary super block 
> to device %llu",
> +   device->devid);
> + return -1;
> + }
> +
>   return errors < i ? 0 : -1;
>  }
>  
> 



signature.asc
Description: OpenPGP digital signature


[PATCH v2] btrfs: print error if primary super block write fails

2018-01-30 Thread Howard McLauchlan
Presently, failing a primary super block write but succeeding in at
least one super block write in general will appear to users as if
nothing important went wrong. However, upon unmounting and re-mounting,
the file system will be in a rolled back state. This was discovered
with a BCC program that uses bpf_override_return() to fail super block
writes.

This patch outputs an error clarifying that the primary super block
write has failed, so users can expect potentially erroneous behaviour.
It also forces wait_dev_supers() to return an error to its caller if
the primary super block write fails.

Signed-off-by: Howard McLauchlan 
---
V2: Added devid to output, removed unnecessary fs_info parameter

 fs/btrfs/disk-io.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5da18ebc9222..6d98f2f21d5f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3298,6 +3298,7 @@ static int wait_dev_supers(struct btrfs_device *device, 
int max_mirrors)
struct buffer_head *bh;
int i;
int errors = 0;
+   bool primary_failed = false;
u64 bytenr;
 
if (max_mirrors == 0)
@@ -3314,11 +3315,14 @@ static int wait_dev_supers(struct btrfs_device *device, 
int max_mirrors)
  BTRFS_SUPER_INFO_SIZE);
if (!bh) {
errors++;
+   primary_failed = (i == 0) || primary_failed;
continue;
}
wait_on_buffer(bh);
-   if (!buffer_uptodate(bh))
+   if (!buffer_uptodate(bh)) {
errors++;
+   primary_failed = (i == 0) || primary_failed;
+   }
 
/* drop our reference */
brelse(bh);
@@ -3327,6 +3331,13 @@ static int wait_dev_supers(struct btrfs_device *device, 
int max_mirrors)
brelse(bh);
}
 
+   /* log error, force error return */
+   if (primary_failed) {
+   btrfs_err(device->fs_info, "error writing primary super block 
to device %llu",
+ device->devid);
+   return -1;
+   }
+
return errors < i ? 0 : -1;
 }
 
-- 
2.14.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-30 Thread Linus Torvalds
Ack. Should I expect this in a future pull request, or take it directly?

There's no hurry about this, since none of the existing users of that
function actually do anything but test the return value against zero,
and nobody saves it into anything but a "bool" (which has magical
casting properties and does not lose upper bits).

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-30 Thread Austin S. Hemmelgarn

On 2018-01-30 14:50, Tomasz Pala wrote:

On Tue, Jan 30, 2018 at 08:46:32 -0500, Austin S. Hemmelgarn wrote:


I personally think the degraded mount option is a mistake as this
assumes that a lightly degraded system is not able to work which is false.
If the system can mount to some working state then it should mount
regardless if it is fully operative or not. If the array is in a bad
state you need to learn about it by issuing a command or something. The
same goes for a MD array (and yes, I am aware of the block layer vs
filesystem thing here).

The problem with this is that right now, it is not safe to run a BTRFS
volume degraded and writable, but for an even remotely usable system


Mounting read-only is still better than not mounting at all.
Agreed, but what most people who are asking about this are asking for is 
to have the system just run missing a drive.


For example, my emergency.target has limited network access and starts
ssh server so I could recover from this situation remotely.


with pretty much any modern distro, you need your root filesystem to be
writable (or you need to have jumped through the hoops to make sure /var
and /tmp are writable even if / isn't).


Easy to handle by systemd. Not only this, but much more is planned:

http://0pointer.net/blog/projects/stateless.html

It's reasonably easy to handle even in a normal init system.  The issue 
is that most distros don't really support it well.  Arch and Gentoo make 
it trivial, but they let you configure storage however the hell you 
want.  Pretty much everybody else is mostly designed to assume that /var 
is a part of /, they mostly work if it's not, but certain odd things 
cause problems, and you have to go through somewhat unfriendly 
configuration work during install to get a system set up that way (well, 
unfriendly if you're a regular user, it's perfectly fine for a seasoned 
sysadmin).


Also, slightly OT, but has anyone involved in the development described 
in the article you linked every looked beyond the typical Fedora/Debian 
environment for any of the stuff the conclusions section says you're 
trying to achieve?  Just curious, since NixOS can do almost all of it 
with near zero effort except for the vendor data part (NixOS still 
stores it's config in /etc, but it can work with just one or two files), 
and a handful of the other specific items have reasonably easy ways to 
implement them that just aren't widely supported (for example, factory 
resets have at least three options already, OverlayFS (bottom layer is 
your base image, stored in a read-only verified manner, top layer is 
writable for user customization), BTRFS seed devices (similar to an 
overlay, just at the block level), and bootable, self-installing, 
compressed system images).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-30 Thread Jeff Layton
From: Jeff Layton 

As Linus points out:

The inode_cmp_iversion{+raw}() functions are pure and utter crap.

Why?

You say that they return 0/negative/positive, but they do so in a
completely broken manner. They return that ternary value as the
sequence number difference in a 's64', which means that if you
actually care about that ternary value, and do the *sane* thing that
the kernel-doc of the function implies is the right thing, you would
do

int cmp = inode_cmp_iversion(inode, old);
if (cmp < 0 ...

and as a result you get code that looks sane, but that doesn't
actually *WORK* right.

Since none of the callers actually care about the ternary value here,
convert the inode_cmp_iversion{+raw} functions to just return a boolean
value (false for matching, true for non-matching).

This matches the existing use of these functions just fine, and makes it
simple to convert them to return a ternary value in the future if we
grow callers that need it.

With this change we can also reimplement inode_cmp_iversion in a simpler
way using inode_peek_iversion.

Signed-off-by: Jeff Layton 
---
 include/linux/iversion.h | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 858463fca249..3d2fd06495ec 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -309,13 +309,13 @@ inode_query_iversion(struct inode *inode)
  * @inode: inode to check
  * @old: old value to check against its i_version
  *
- * Compare the current raw i_version counter with a previous one. Returns 0 if
- * they are the same or non-zero if they are different.
+ * Compare the current raw i_version counter with a previous one. Returns false
+ * if they are the same or true if they are different.
  */
-static inline s64
+static inline bool
 inode_cmp_iversion_raw(const struct inode *inode, u64 old)
 {
-   return (s64)inode_peek_iversion_raw(inode) - (s64)old;
+   return inode_peek_iversion_raw(inode) != old;
 }
 
 /**
@@ -323,19 +323,15 @@ inode_cmp_iversion_raw(const struct inode *inode, u64 old)
  * @inode: inode to check
  * @old: old value to check against its i_version
  *
- * Compare an i_version counter with a previous one. Returns 0 if they are
- * the same, a positive value if the one in the inode appears newer than @old,
- * and a negative value if @old appears to be newer than the one in the
- * inode.
+ * Compare an i_version counter with a previous one. Returns false if they are
+ * the same, and true if they are different.
  *
  * Note that we don't need to set the QUERIED flag in this case, as the value
  * in the inode is not being recorded for later use.
  */
-
-static inline s64
+static inline bool
 inode_cmp_iversion(const struct inode *inode, u64 old)
 {
-   return (s64)(inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED) -
-  (s64)(old << I_VERSION_QUERIED_SHIFT);
+   return inode_peek_iversion(inode) != old;
 }
 #endif
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs check: backref lost, mismatch with its hash -- can't repair

2018-01-30 Thread Foo Bar
Finally restored by merging the last snapshot with what btrfs restore gave me
-- surprisingly almost the whole bunch of it :-)

Thanks for helping!

Cheers,

   Marco

Qu Wenruo wrote on 2018-01-30 02:24:
> 
> 
> On 2018年01月30日 02:16, ^m'e wrote:
>> Thanks!
>>
>> Got these
>>
>>   # ./btrfs.static inspect dump-super -fFa /dev/sdb3 |grep
>> backup_tree_root: | sort -u
>> backup_tree_root:180410073088gen: 463765level: 1
>> backup_tree_root:180415758336gen: 463766level: 1
>> backup_tree_root:180416364544gen: 463767level: 1
>> backup_tree_root:4194304gen: 463764level: 1
>>
>> but, nada: all have transid failures...
> 
> That's why I call it "small chance"
> 
>>
>> The backup snapshots are OK as per original check.
> 
> Then you should be OK to restore.
> 
> Thanks,
> Qu
> 
>>
>>
>> On Mon, Jan 29, 2018 at 3:09 PM, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年01月29日 22:49, ^m'e wrote:
 On Mon, Jan 29, 2018 at 2:04 PM, Qu Wenruo  wrote:
>
>
> On 2018年01月29日 21:58, ^m'e wrote:
>> Thanks for the advice, Qu!
>>
>> I used the system for a while, did some package upgrades -- writing in
>> the suspect corrupted area. Then tried a btrfs-send to my backup vol,
>> and it failed miserably with a nice kernel oops.
>>
>> So I went for a lowmem repair:
>> 
>> # ./btrfsck.static check --repair --mode=lowmem /dev/sdb3 2>&1 | tee
>> /mnt/custom/rescue/btrfs-recovery/btrfs-repair.BTR-POOL.1.log
>> WARNING: low-memory mode repair support is only partial
>> Fixed 0 roots.
>> checking extents
>> checking free space cache
>> checking fs roots
>> ERROR: failed to add inode 28891726 as orphan item root 257
>> ERROR: root 257 INODE[28891726] is orphan item
>
> At least I need dig the kernel code further to determine if the orphan
> inode handling in btrfs-progs is correct or not.
>
> So there won't be more dirty fix soon.
>
> Hopefully you could get some good backup and restore the system.
>
> At least the problem is limited to a very small range, and it's
> something we could handle easily.
>
> Thanks for all your report,
> Qu
>
>

 Right.

 Meanwhile, could you please suggest the best course of action? btrfs
 rescue or restore?
 I have snapshots of my two subvols (rootfs, home -- now fs-checking
 them  just in case...)
>>>
>>> Don't run --repair any more.
>>> It seems to make the case worse.
>>>
>>> While the RW mount with orphan cleanup abort seems to screwed up the
>>> filesystem.
>>>
>>> In this case, it's pretty hard to recover, but still has small chance.
>>>
>>> Use btrfs inspec dump-super to get backup roots:
>>>
>>> # btrfs inspect dump-super -fFa  |grep backup_tree_root: | sort
>>> | uniq
>>>
>>> And try all the 4 numbers in the following commands:
>>>
>>> # btrfs check --tree-root  
>>>
>>> To see if is there any good one without transid error.
>>>
>>> Thanks,
>>> Qu
>>>



signature.asc
Description: OpenPGP digital signature


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 08:46:32 -0500, Austin S. Hemmelgarn wrote:

>> I personally think the degraded mount option is a mistake as this 
>> assumes that a lightly degraded system is not able to work which is false.
>> If the system can mount to some working state then it should mount 
>> regardless if it is fully operative or not. If the array is in a bad 
>> state you need to learn about it by issuing a command or something. The 
>> same goes for a MD array (and yes, I am aware of the block layer vs 
>> filesystem thing here).
> The problem with this is that right now, it is not safe to run a BTRFS 
> volume degraded and writable, but for an even remotely usable system 

Mounting read-only is still better than not mounting at all.

For example, my emergency.target has limited network access and starts
ssh server so I could recover from this situation remotely.

> with pretty much any modern distro, you need your root filesystem to be 
> writable (or you need to have jumped through the hoops to make sure /var 
> and /tmp are writable even if / isn't).

Easy to handle by systemd. Not only this, but much more is planned:

http://0pointer.net/blog/projects/stateless.html

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-30 Thread Jeff Layton
On Tue, 2018-01-30 at 17:50 +, Trond Myklebust wrote:
> On Tue, 2018-01-30 at 12:31 -0500, Jeff Layton wrote:
> > From: Jeff Layton 
> > 
> > As Linus points out:
> > 
> > The inode_cmp_iversion{+raw}() functions are pure and utter crap.
> > 
> > Why?
> > 
> > You say that they return 0/negative/positive, but they do so in a
> > completely broken manner. They return that ternary value as the
> > sequence number difference in a 's64', which means that if you
> > actually care about that ternary value, and do the *sane* thing
> > that
> > the kernel-doc of the function implies is the right thing, you
> > would
> > do
> > 
> > int cmp = inode_cmp_iversion(inode, old);
> > if (cmp < 0 ...
> > 
> > and as a result you get code that looks sane, but that doesn't
> > actually *WORK* right.
> > 
> > Since none of the callers actually care about the ternary value here,
> > convert the inode_cmp_iversion{+raw} functions to just return a
> > boolean
> > value (false for matching, true for non-matching).
> > 
> > This matches the existing use of these functions just fine, and makes
> > it
> > simple to convert them to return a ternary value in the future if we
> > grow callers that need it.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  include/linux/iversion.h | 20 +---
> >  1 file changed, 9 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 858463fca249..ace32775c5f0 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -309,13 +309,13 @@ inode_query_iversion(struct inode *inode)
> >   * @inode: inode to check
> >   * @old: old value to check against its i_version
> >   *
> > - * Compare the current raw i_version counter with a previous one.
> > Returns 0 if
> > - * they are the same or non-zero if they are different.
> > + * Compare the current raw i_version counter with a previous one.
> > Returns false
> > + * if they are the same or true if they are different.
> >   */
> > -static inline s64
> > +static inline bool
> >  inode_cmp_iversion_raw(const struct inode *inode, u64 old)
> >  {
> > -   return (s64)inode_peek_iversion_raw(inode) - (s64)old;
> > +   return inode_peek_iversion_raw(inode) != old;
> >  }
> >  
> >  /**
> > @@ -323,19 +323,17 @@ inode_cmp_iversion_raw(const struct inode
> > *inode, u64 old)
> >   * @inode: inode to check
> >   * @old: old value to check against its i_version
> >   *
> > - * Compare an i_version counter with a previous one. Returns 0 if
> > they are
> > - * the same, a positive value if the one in the inode appears newer
> > than @old,
> > - * and a negative value if @old appears to be newer than the one in
> > the
> > - * inode.
> > + * Compare an i_version counter with a previous one. Returns false
> > if they are
> > + * the same, and true if they are different.
> >   *
> >   * Note that we don't need to set the QUERIED flag in this case, as
> > the value
> >   * in the inode is not being recorded for later use.
> >   */
> >  
> > -static inline s64
> > +static inline bool
> >  inode_cmp_iversion(const struct inode *inode, u64 old)
> >  {
> > -   return (s64)(inode_peek_iversion_raw(inode) &
> > ~I_VERSION_QUERIED) -
> > -  (s64)(old << I_VERSION_QUERIED_SHIFT);
> > +   return (inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED)
> > !=
> > +   (old << I_VERSION_QUERIED_SHIFT);
> >  }
> 
> Is there any reason why this couldn't just use inode_peek_iversion()
> instead of having to both mask the output from
> inode_peek_iversion_raw() and shift 'old'?

None at all. I'll send a v2.
-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
Just one final word, as all was already said:

On Tue, Jan 30, 2018 at 11:30:31 -0500, Austin S. Hemmelgarn wrote:

>> In other words, is it:
>> - the systemd that threats btrfs WORSE than distributed filesystems, OR
>> - btrfs that requires from systemd to be threaded BETTER than other fss?
> Or maybe it's both?  I'm more than willing to admit that what BTRFS does 
> expose currently is crap in terms of usability.  The reason it hasn't 
> changed is that we (that is, the BTRFS people and the systemd people) 
> can't agree on what it should look like.

Hard to agree with someone who refuses to do _anything_.

You can choose to follow whatever, MD, LVM, ZFS, invent something
totally different, write custom daemon or put timeout logic inside the
kernel itself. It doesn't matter. You know the ecosystem - it is the
udev that must be signalled somehow and systemd WILL follow.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 11:30:31 -0500, Austin S. Hemmelgarn wrote:

>> - crypto below software RAID means double-encryption (wasted CPU),
> It also means you leak no information about your storage stack.  If 

JBOD

> you're sufficiently worried about data protection that you're using 
> block-level encryption, you should be thinking _very_ hard about whether 
> or not that's an acceptable risk (and it usually isn't).

Nonsense. Block-level encryption is the last resource protection, your
primary concern is to encrypt at the highest level possible. Anyway,
I don't need to care at all about encryption, one of my customer might.
Just stop extending justification of your tight usage pattern to the
rest of the world.

BTW if YOU are sufficiently worried about data protection you need to
use some hardware solution, like OPAL and completely avoid using
consumer-grade (especially SSD) drives. This also saves CPU cycles,
but let's not discuss here the gory details.

If you can't imagine people have different requirements than you, then
this is your mental problem, go solve it somewhere else.

>> - RAID below LVM means you're stuck with the same RAID-profile for all
>>the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
>>system and RAID0 for various system caches (like ccache on software
>>builder machine) or transient LVM-level snapshots.
> Then you skip MD and do the RAID work in LVM with DM-RAID (which 
> technically _is_ MD, just with a different frontend).

1. how is write-mostly handled by LVM-initiated RAID1?
2. how can one split LVM RAID1 to separate volumes in case of bit-rot
situation that requires manual intervention to recover specific copy of
a data (just like btrfs checksumming does automatically in raid1 mode)?

>> - RAID below filesystem means loosing btrfs-RAID extra functionality,
>>like recovering data from different mirror when CRC mismatch happens,
> That depends on your choice of RAID and the exact configuration of the 

There is no data checksumming in MD-RAID, there is no voting in MD-RAID.
There is FEC mode in dm-verity.

> storage stack.  As long as you expose two RAID devices, BTRFS 
> replication works just fine on top of them.

Taking up 4 times the space? Or going crazy with 2*MD-RAID0?

>> - crypto below LVN means encrypting everything, including data that is
>>not sensitive - more CPU wasted,
> Encrypting only sensitive data is never a good idea unless you can prove 

Encrypting the sensitive data _AT_or_ABOVE_ the filesystem level is
crucial for any really sensitive data.

> with certainty that you will keep it properly segregated, and even then 
> it's still a pretty bad idea because it makes it obvious exactly where 
> the information you consider sensitive is stored.

ROTFL

Do you really think this would make breaking the XTS easier, than in
would be if the _entire_ drive would be encrypted using THE SAME secret?
With attacker having access to the plain texts _AND_ ciphers?

Wow... - stop doing the crypto, seriously. You do this wrong.

Do you think that my customer or cooperative would happily share HIS
secret with mine, just because we're running on the same server?

Have you ever heard about zero-knowledge databases?
Can you imagine, that some might want to do the decryption remotely,
when he doesn't trust me as the owner of the machine?

How is that me KNOWING, that their data is encrypted, eases the attack?

>> - RAID below LVM means no way to use SSD acceleration of part of the HDD
>>space using MD write-mostly functionality.
> Again, just use LVM's DM-RAID and throw in DM-cache.  Also, there were 

Obviously you've never used write-mostly, as you're apparently not aware
about the difference in maintenance burden.

> some patches just posted for BTRFS that indirectly allow for this 
> (specifically, they let you change the read-selection algorithm, with 
> the option of specifying to preferentially read from a specific device).

When they will be available in LTS kernel, some will definitely use it
and create even more complicated stacks.

>> It is the bottom layer, but I might be attached into volumes at virtually
>> any place of the logical topology tree. E.g. bare network drive added as
>> device-mapper mirror target for on-line volume cloning.
> And you seriously think that that's going to be a persistent setup? 

Persistent setups are archeology in IT.

> One-shot stuff like that is almost never an issue unless your init 
> system is absolutely brain-dead _and_ you need it working as it was 
> immediately (and a live-clone of a device doesn't if you're doing it right).

Brain-dead is a state of mind, when you reject usage scenarios that you
completely don't understand, hopefully due to the small experience only.

>> The point is: mainaining all of this logic is NOT the job for init system.
>> With systemd you need exactly N-N=0 lines of code to make this work.
> So, I find it very hard to believe that systemd requires absolutely 

[PATCH] Btrfs: fix null pointer dereference when replacing missing device

2018-01-30 Thread fdmanana
From: Filipe Manana 

When we are replacing a missing device we mount the filesystem with the
degraded mode option in which case we are allowed to have a btrfs device
structure without a backing device member (its bdev member is NULL) and
therefore we can't dereference that member. Commit 38b5f68e9811
("btrfs: drop btrfs_device::can_discard to query directly") started to
dereference that member when discarding extents, resulting in a null
pointer dereference:

 [ 3145.322257] BTRFS warning (device sdf): devid 2 uuid 
4d922414-58eb-4880-8fed-9c3840f6c5d5 is missing
 [ 3145.364116] BTRFS info (device sdf): dev_replace from  (devid 
2) to /dev/sdg started
 [ 3145.413489] BUG: unable to handle kernel NULL pointer dereference at 
00e0
 [ 3145.415085] IP: btrfs_discard_extent+0x6a/0xf8 [btrfs]
 [ 3145.415085] PGD 0 P4D 0
 [ 3145.415085] Oops:  [#1] PREEMPT SMP PTI
 [ 3145.415085] Modules linked in: ppdev ghash_clmulni_intel pcbc aesni_intel 
aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse parport_pc serio_raw 
i2c_piix4 i2
 [ 3145.415085] CPU: 0 PID: 11989 Comm: btrfs Tainted: GW
4.15.0-rc9-btrfs-next-55+ #1
 [ 3145.415085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
 [ 3145.415085] RIP: 0010:btrfs_discard_extent+0x6a/0xf8 [btrfs]
 [ 3145.415085] RSP: 0018:c90004813c60 EFLAGS: 00010293
 [ 3145.415085] RAX: 88020d39cc00 RBX: 88020c4ea2a0 RCX: 
0002
 [ 3145.415085] RDX:  RSI: 88020c4ea240 RDI: 

 [ 3145.415085] RBP:  R08: 4000 R09: 

 [ 3145.415085] R10: c90004813ae8 R11:  R12: 

 [ 3145.415085] R13: 88020c418000 R14:  R15: 

 [ 3145.415085] FS:  7f565681f8c0() GS:88023fc0() 
knlGS:
 [ 3145.415085] CS:  0010 DS:  ES:  CR0: 80050033
 [ 3145.415085] CR2: 00e0 CR3: 00020d208006 CR4: 
001606f0
 [ 3145.415085] Call Trace:
 [ 3145.415085]  btrfs_finish_extent_commit+0x9a/0x1be [btrfs]
 [ 3145.415085]  btrfs_commit_transaction+0x649/0x7a0 [btrfs]
 [ 3145.415085]  ? start_transaction+0x2b0/0x3b3 [btrfs]
 [ 3145.415085]  btrfs_dev_replace_start+0x274/0x30c [btrfs]
 [ 3145.415085]  btrfs_dev_replace_by_ioctl+0x45/0x59 [btrfs]
 [ 3145.415085]  btrfs_ioctl+0x1a91/0x1d62 [btrfs]
 [ 3145.415085]  ? lock_acquire+0x16a/0x1af
 [ 3145.415085]  ? vfs_ioctl+0x1b/0x28
 [ 3145.415085]  ? trace_hardirqs_on_caller+0x14c/0x1a6
 [ 3145.415085]  vfs_ioctl+0x1b/0x28
 [ 3145.415085]  do_vfs_ioctl+0x5a9/0x5e0
 [ 3145.415085]  ? _raw_spin_unlock_irq+0x34/0x46
 [ 3145.415085]  ? entry_SYSCALL_64_fastpath+0x5/0x8b
 [ 3145.415085]  ? trace_hardirqs_on_caller+0x14c/0x1a6
 [ 3145.415085]  SyS_ioctl+0x52/0x76
 [ 3145.415085]  entry_SYSCALL_64_fastpath+0x1e/0x8b
 [ 3145.415085] RIP: 0033:0x7f56558b3c47
 [ 3145.415085] RSP: 002b:7ffdcfac4c58 EFLAGS: 0202
 [ 3145.415085] Code: be 02 00 00 00 4c 89 ef e8 b9 e7 03 00 85 c0 89 c5 75 75 
48 8b 44 24 08 45 31 f6 48 8d 58 60 eb 52 48 8b 03 48 8b b8 a0 00 00 00 <48> 8b 
87 e0 00
 [ 3145.415085] RIP: btrfs_discard_extent+0x6a/0xf8 [btrfs] RSP: 
c90004813c60
 [ 3145.415085] CR2: 00e0
 [ 3145.458185] ---[ end trace 06302e7ac31902bf ]---

This is trivially reproduced by running the test btrfs/027 from fstests
like this:

  $ MOUNT_OPTIONS="-o discard" ./check btrfs/027

Fix this by skipping devices without a backing device before attempting
to discard.

Fixes: 38b5f68e9811 ("btrfs: drop btrfs_device::can_discard to query directly")
Signed-off-by: Filipe Manana 
---
 fs/btrfs/extent-tree.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9d220b276c8f..d59ee24645e3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2147,6 +2147,10 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, 
u64 bytenr,
u64 bytes;
struct request_queue *req_q;
 
+   if (!stripe->dev->bdev) {
+   ASSERT(btrfs_test_opt(fs_info, DEGRADED));
+   continue;
+   }
req_q = bdev_get_queue(stripe->dev->bdev);
if (!blk_queue_discard(req_q))
continue;
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-30 Thread Trond Myklebust
On Tue, 2018-01-30 at 12:31 -0500, Jeff Layton wrote:
> From: Jeff Layton 
> 
> As Linus points out:
> 
> The inode_cmp_iversion{+raw}() functions are pure and utter crap.
> 
> Why?
> 
> You say that they return 0/negative/positive, but they do so in a
> completely broken manner. They return that ternary value as the
> sequence number difference in a 's64', which means that if you
> actually care about that ternary value, and do the *sane* thing
> that
> the kernel-doc of the function implies is the right thing, you
> would
> do
> 
> int cmp = inode_cmp_iversion(inode, old);
> if (cmp < 0 ...
> 
> and as a result you get code that looks sane, but that doesn't
> actually *WORK* right.
> 
> Since none of the callers actually care about the ternary value here,
> convert the inode_cmp_iversion{+raw} functions to just return a
> boolean
> value (false for matching, true for non-matching).
> 
> This matches the existing use of these functions just fine, and makes
> it
> simple to convert them to return a ternary value in the future if we
> grow callers that need it.
> 
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/iversion.h | 20 +---
>  1 file changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 858463fca249..ace32775c5f0 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -309,13 +309,13 @@ inode_query_iversion(struct inode *inode)
>   * @inode: inode to check
>   * @old: old value to check against its i_version
>   *
> - * Compare the current raw i_version counter with a previous one.
> Returns 0 if
> - * they are the same or non-zero if they are different.
> + * Compare the current raw i_version counter with a previous one.
> Returns false
> + * if they are the same or true if they are different.
>   */
> -static inline s64
> +static inline bool
>  inode_cmp_iversion_raw(const struct inode *inode, u64 old)
>  {
> - return (s64)inode_peek_iversion_raw(inode) - (s64)old;
> + return inode_peek_iversion_raw(inode) != old;
>  }
>  
>  /**
> @@ -323,19 +323,17 @@ inode_cmp_iversion_raw(const struct inode
> *inode, u64 old)
>   * @inode: inode to check
>   * @old: old value to check against its i_version
>   *
> - * Compare an i_version counter with a previous one. Returns 0 if
> they are
> - * the same, a positive value if the one in the inode appears newer
> than @old,
> - * and a negative value if @old appears to be newer than the one in
> the
> - * inode.
> + * Compare an i_version counter with a previous one. Returns false
> if they are
> + * the same, and true if they are different.
>   *
>   * Note that we don't need to set the QUERIED flag in this case, as
> the value
>   * in the inode is not being recorded for later use.
>   */
>  
> -static inline s64
> +static inline bool
>  inode_cmp_iversion(const struct inode *inode, u64 old)
>  {
> - return (s64)(inode_peek_iversion_raw(inode) &
> ~I_VERSION_QUERIED) -
> -(s64)(old << I_VERSION_QUERIED_SHIFT);
> + return (inode_peek_iversion_raw(inode) & ~I_VERSION_QUERIED)
> !=
> + (old << I_VERSION_QUERIED_SHIFT);
>  }

Is there any reason why this couldn't just use inode_peek_iversion()
instead of having to both mask the output from
inode_peek_iversion_raw() and shift 'old'?

>  #endif
-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com


Re: [GIT PULL] inode->i_version rework for v4.16

2018-01-30 Thread Linus Torvalds
On Tue, Jan 30, 2018 at 4:05 AM, Jeff Layton  wrote:
>
> My intent here was to have this handle wraparound using the same sort of
> method that the time_before/time_after macros do. Obviously, I didn't
> document that well.

Oh, the intent was clear. The implementation was wrong.

Note that "time_before()" returns a _boolean_.

So yes, the time comparison functions do a 64-bit subtraction, exactly
like yours do. BUT THEY DO NOT RETURN THAT DIFFERENCE. They test the
sign in 64 bits and return that boolean single-bit value.

> I want to make sure I understand what's actually broken here thoug. Is
> it only broken when the two values are more than 2**63 apart, or is
> there something else more fundamentally wrong here?

There's something fundamentally wrong. The _intent_ is to allow
numbers up to 2**63 apart, but if somebody does that

 int cmp = inode_cmp_iversion(inode, old);

 if (cmp < 0 ...

then it actually only ever tests numbers 2**31 apart, because the high
32 bits will have been thrown away when the 64-bit difference is cast
to "int".

And what used to be a sign bit (in 64 bits) no longer exists, and the
above tests the *new* sign bit that is bit #31, not #63.

So you are a factor of four billion off.

And while 2**63 is a big number and perfectly ok for a filesystem
event difference, a difference of 2**31 is a "this can actually
happen".

Yes, even 2**31 is still a big difference, and it will take a very
unusual situation, and quite a long time to trigger: something that
does a thousand filesystem ops per second will not see the problem for
24 days. So you'll never see it in a test. But 24 days happens in real
life..

That's why you need to do the comparison against zero inside the
actual helper functions like the time comparisons do. Because if you
return the 64-bit difference, it will be trivially lost, and the code
will _look_ right, but not work right.

The fact that you didn't even seem to see the problem in my example
calling sequence just proves that point.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-30 Thread Austin S. Hemmelgarn

On 2018-01-30 10:09, Tomasz Pala wrote:

On Mon, Jan 29, 2018 at 08:42:32 -0500, Austin S. Hemmelgarn wrote:


Yes. They are stupid enough to fail miserably with any more complicated
setups, like stacking volume managers, crypto layer, network attached
storage etc.

I think you mean any setup that isn't sensibly layered.


No, I mean any setup that wasn't considered by init system authors.
Your 'sensibly' is not sensible for me.


BCP for over a
decade has been to put multipathing at the bottom, then crypto, then
software RAID, than LVM, and then whatever filesystem you're using.


Really? Let's enumerate some caveats of this:

- crypto below software RAID means double-encryption (wasted CPU),
It also means you leak no information about your storage stack.  If 
you're sufficiently worried about data protection that you're using 
block-level encryption, you should be thinking _very_ hard about whether 
or not that's an acceptable risk (and it usually isn't).


- RAID below LVM means you're stuck with the same RAID-profile for all
   the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
   system and RAID0 for various system caches (like ccache on software
   builder machine) or transient LVM-level snapshots.
Then you skip MD and do the RAID work in LVM with DM-RAID (which 
technically _is_ MD, just with a different frontend).


- RAID below filesystem means loosing btrfs-RAID extra functionality,
   like recovering data from different mirror when CRC mismatch happens,
That depends on your choice of RAID and the exact configuration of the 
storage stack.  As long as you expose two RAID devices, BTRFS 
replication works just fine on top of them.


- crypto below LVN means encrypting everything, including data that is
   not sensitive - more CPU wasted,
Encrypting only sensitive data is never a good idea unless you can prove 
with certainty that you will keep it properly segregated, and even then 
it's still a pretty bad idea because it makes it obvious exactly where 
the information you consider sensitive is stored.


- RAID below LVM means no way to use SSD acceleration of part of the HDD
   space using MD write-mostly functionality.
Again, just use LVM's DM-RAID and throw in DM-cache.  Also, there were 
some patches just posted for BTRFS that indirectly allow for this 
(specifically, they let you change the read-selection algorithm, with 
the option of specifying to preferentially read from a specific device).


What you present is only some sane default, which doesn't mean it covers
all the real-world cases.

My recent server is using:
- raw partitioning for base volumes,
- LVM,
- MD on top of some LVs (varying levels),
- paritioned SSD cache attached to specific VGs,
- crypto on top of selected LV/MD,
- btrfs RAID1 on top of non-MDed LVs.


Multipathing has to be the bottom layer for a given node because it
interacts directly with hardware topology which gets obscured by the
other layers.


It is the bottom layer, but I might be attached into volumes at virtually
any place of the logical topology tree. E.g. bare network drive added as
device-mapper mirror target for on-line volume cloning.
And you seriously think that that's going to be a persistent setup? 
One-shot stuff like that is almost never an issue unless your init 
system is absolutely brain-dead _and_ you need it working as it was 
immediately (and a live-clone of a device doesn't if you're doing it right).



Crypto essentially has to be next, otherwise you leak
info about the storage stack.


I'm encrypting only the containers that require block-level encryption.
Others might have more effective filesystem-level encryption or even be
some TrueCrypt/whatever images.
Again, you're leaking information by doing so.  At a minimum, you're 
leaking info about where the data you consider sensitive is stored, and 
that's not counting volume names (exposed by LVM), container 
configuration (possibly exposed depending on how your container stack 
handles it), and other storage stack configuration info (exposed by the 
metadata of the various layers and possibly by files in /etc if you 
don't have your root filesystem encrypted).



Swapping LVM and software RAID ends up
giving you a setup which is difficult for most people to understand and
therefore is hard to reliably maintain.


It's more difficult, as you need to maintain manually two (or more) separate 
VGs with
matching LVs inside. Harder, but more flexible.
And could also be trivially simplified by eliminating MD and using LVM's 
native support for DM-RAID, which provides essentially the exact same 
functionality because DM-RAID is largely just a DM fronted for MD.



Other init systems enforce things being this way because it maintains
people's sanity, not because they have significant difficulty doing
things differently (and in fact, it is _trivial_ to change the ordering
in some of them, OpenRC on Gentoo for example quite literally requires
exactly N-1 lines to change in each of N 

Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 16:09:50 +0100, Tomasz Pala wrote:

>> BCP for over a 
>> decade has been to put multipathing at the bottom, then crypto, then 
>> software RAID, than LVM, and then whatever filesystem you're using. 
> 
> Really? Let's enumerate some caveats of this:
> 
> - crypto below software RAID means double-encryption (wasted CPU),
> 
> - RAID below LVM means you're stuck with the same RAID-profile for all
>   the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
>   system and RAID0 for various system caches (like ccache on software
>   builder machine) or transient LVM-level snapshots.
> 
> - RAID below filesystem means loosing btrfs-RAID extra functionality,
>   like recovering data from different mirror when CRC mismatch happens,
> 
> - crypto below LVN means encrypting everything, including data that is
>   not sensitive - more CPU wasted,

And, what is much worse - encrypting everything using the same secret.
BIG show-stopper.

I would shred such BCP as ineffective and insecure for both, data
integrity and confidentiality.

> - RAID below LVM means no way to use SSD acceleration of part of the HDD
>   space using MD write-mostly functionality.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] inode->i_version rework for v4.16

2018-01-30 Thread Theodore Ts'o
On Tue, Jan 30, 2018 at 07:05:48AM -0500, Jeff Layton wrote:
> 
> I want to make sure I understand what's actually broken here thoug. Is
> it only broken when the two values are more than 2**63 apart, or is
> there something else more fundamentally wrong here?

The other problem is that returning a s64 (or a u64) is extremely
inefficient on a 32-bit platform.  So in general, it's better to avoid
returning a 64-bit type unless it's really necessary

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 10:05:34 -0500, Austin S. Hemmelgarn wrote:

>> Instead, they should move their legs continuously and if the train is > not 
>> on the station yet, just climb back and retry.
> No, that's really not a good analogy given the fact that that check for 
> the presence of a train takes a normal person milliseconds while the 
> event being raced against (the train departing) takes minutes.  In the 

OMG... preventing races by "this would always take longer"? Seriously?

> You're already looping forever _waiting_ for the volume to appear.  How 

udev is waiting for events, not systemd. Nobody will do some crazy
cross-layered shortcuts to overcome other's lazyness.

> is that any different from lopping forever trying to _mount_ the volume 

Yes, because udev doesn't mount anything, ever. Not this binary dude!

> instead given that failing to mount the volume is not going to damage 
> things. 

Failed premature attempt to mount prevents the system from booting WHEN
the devices are ready - this is fatal. System boots randomly on racy
conditions.

But hey, "the devices will always appear faster, than the init attempt
to do the mount"!

Have you ever had some hardware RAID controller? Never heard about
devices appearing after 5 minutes of warming up?

> The issue here is that systemd refuses to implement any method 
> of actually retrying things that fail during startup.>

1. Such methods are trivial and I've already mentioned them a dozen of times.
2. They should be implemented in btrfs-upstream, not systemd-upstream,
   but I personally would happily help with writing them here.
3. They require full-circle path of 'allow-degraded' to be passed
   through btrfs code.

>> mounting BEFORE volume is complete is FATAL - since no userspace daemon
>> would ever retrigger the mount and the system won't came up. Provide one
>> btrfsd volume manager and systemd could probably switch to using it.
> And here you've lost any respect I might have had for you.

Going personal? So thank you for discussion and good bye.

Please refrain from answering me, I'm not going to discuss this any
further with you.

> **YOU DO NOT NEED A DAEMON TO DO EVERY LAST TASK ON THE SYSTEM**

Sorry dude, but I won't repeat for the 5th times all the alternatives.

You *all* refuse to step in ANY possible solution mentioned.
You *all* except the systemd to do ALL the job, just like other init
systems were forced to do, against the good design principles.

Good luck having btrfs degraded mount under systemd.

> 
> This is one of the two biggest things I hate about systemd(the journal 
> is the other one for those who care).

The journal has currently *many* drawbacks, but this is not 'by design'
but 'by appropriate code missing for now'. The same applies to btrfs,
isn't it?

> You don't need some special daemon to set the time,

Ever heard about NTP?

> or to set the hostname,

FUD - no such daemon

> or to fetch account data,

FUD

> or even to track who's logged in

FUD

> As much as it may surprise the systemd developers, people got on just 
> fine handling setting the system time, setting the hostname, fetching 
> account info, tracking active users, and any number of myriad other 
> tasks before systemd decided they needed to have their own special daemon.
> 

Sure, in myriad of different scattered distro-specific files. The only
reason systemd stepped in for some of there is that nobody else could
introduce and force Linux-wide consensus. And if anyone would succeed,
there would be some Austins blaming them for 'overtaking good old
trashyard into coherent de facto standard.'

> In this particular case, you don't need a daemon because the kernel does 
> the state tracking. 

Sure, MD doesn't require daemon and LVM doesn't require either. But they
do provide some - I know, they are all wrong.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 21:44:23 -0700, Chris Murphy wrote:

> Btrfs is orthogonal to systemd's willingness to wait forever while
> making no progress. It doesn't matter what it is, it shouldn't wait
> forever.

It times out after 90 seconds (by default) and then it fails the mount
entirely.

> It occurs to me there are such systemd service units specifically for
> waiting for example
> 
> systemd-networkd-wait-online.service, systemd-networkd-wait-online -
> Wait for network to
>come online
> 
>  chrony-wait.service - Wait for chrony to synchronize system clock
> 
> NetworkManager has a version of this. I don't see why there can't be a
> wait for Btrfs to normally mount,

Because mounting degraded btrfs without -o degraded won't WAIT for
anything just immediatelly return failed.

> just simply try to mount, it fails, wait 10, try again, wait 10 try again.

For the last time:

No
Such
Logic
In
Systemd
CORE

Every wait/repeat is done using UNITS - as you already noticed
itself. And these are plain, regular UNITS.

Is there anything that prevents YOU, Chris, from writing these UNITS for
btrfs?

I know what makes ME stop writing these units - it's lack of feedback
from btrfs.ko ioctl handler. Without this I am unable to write UNITS
handling fstab mount entries, because the logic would PROBABLY have to
be hardcoded inside systemd-fstab-generator.

And such logic MUST NOT be hardcoded - this MUST be user-configurable,
i.e. made on UNITS level.

You might argue that some-distros-SysV units or some Gentoo-OpenRC have
support for this and if you want to change anything this is only a few
lines of shell code to be altered. But systemd-fstab-generator is
compiled binary and so WON'T allow the behaviour to be user-configurable.

> And then fail the unit so we end up at a prompt.

This can also be easily done, just like emergency-shell spawns
when configured. If only btrfs could accept and keep information about
volume being allowed for degraded mount.


OK, to be honest I _can_ write such rules now, keeping the
'allow-degraded' state somewhere else (in a file for example).

But since this is some non-standarized side-channel, such code won't
be accepted in systemd upstream, especially because it requires the
current udev rule to be slightly changed.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 14:00:53 -0500, Austin S. Hemmelgarn wrote:

> We already do so in the accepted standard manner.  If the mount fails 
> because of a missing device, you get a very specific message in the 
> kernel log about it, as is the case for most other common errors (for 
> uncommon ones you usually just get a generic open_ctree error).  This is 
> really the only option too, as the mount() syscall (which the mount 
> command calls) returns only 0 on success or -1 and an appropriate errno 
> value on failure, and we can't exactly go about creating a half dozen 
> new error numbers just for this (well, technically we could, but I very 
> much doubt that they would be accepted upstream, which defeats the purpose).

This is exacly why the separate communication channel being the ioctl is
currently used. And I really don't understand why do you fight against
expanding this ioctl response.

> With what you're proposing for BTRFS however, _everything_ is a 
> complicated decision, namely:
> 1. Do you retry at all?  During boot, the answer should usually be yes, 
> but during normal system operation it should normally be no (because we 
> should be letting the user handle issues at that point).

This is exactly why I propose to introduce ioctl in btrfs.ko that
accepts userspace-configured (as per-volume policy) expectations.

> 2. How long should you wait before you retry?  There is no right answer 
> here that will work in all cases (I've seen systems which take multiple 
> minutes for devices to become available on boot), especially considering 
> those of us who would rather have things fail early.

btrfs-last-resort@.timer per analogy to mdadm-last-resort@.timer

> 3. If the retry fails, do you retry again?  How many times before it 
> just outright fails?  This is going to be system specific policy.  On 
> systems where devices may take a while to come online, the answer is 
> probably yes and some reasonably large number, while on systems where 
> devices are known to reliably be online immediately, it makes no sense 
> to retry more than once or twice.

All of this is systemd timer/service job.

> 4. If you are going to retry, should you try a degraded mount?  Again, 
> this is going to be system specific policy (regular users would probably 
> want this to be a yes, while people who care about data integrity over 
> availability would likely want it to be a no).

Just like above - user-configured in systemd timers/services easily.

> 5. Assuming you do retry with the degraded mount, how many times should 
> a normal mount fail before things go degraded?  This ties in with 3 and 
> has the same arguments about variability I gave there.

As above.

> 6. How many times do you try a degraded mount before just giving up? 
> Again, similar variability to 3.
> 7. Should each attempt try first a regular mount and then a degraded 
> one, or do you try just normal a couple times and then switch to 
> degraded, or even start out trying normal and then start alternating? 
> Any of those patterns has valid arguments both for and against it, so 
> this again needs to be user configurable policy.
> 
> Altogether, that's a total of 7 policy decisions that should be user 
> configurable. 

All of them easy to implement if the btrfs.ko could accept
'allow-degraded' per-volume instruction and return 'try-degraded' in the
ioctl.

> Having a config file other than /etc/fstab for the mount 
> command should probably be avoided for sanity reasons (again, BTRFS is a 
> filesystem, not a volume manager), so they would all have to be handled 
> through mount options.  The kernel will additionally have to understand 
> that those options need to be ignored (things do try to mount 
> filesystems without calling a mount helper, most notably the kernel when 
> it mounts the root filesystem on boot if you're not using an initramfs). 
>   All in all, this type of thing gets out of hand _very_ fast.

You need to think about the two separately:
1. tracking STATE - this is remembering 'allow-degraded' option for now,
2. configured POLICY - this is to be handled by init system.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 08:42:32 -0500, Austin S. Hemmelgarn wrote:

>> Yes. They are stupid enough to fail miserably with any more complicated
>> setups, like stacking volume managers, crypto layer, network attached
>> storage etc.
> I think you mean any setup that isn't sensibly layered.

No, I mean any setup that wasn't considered by init system authors.
Your 'sensibly' is not sensible for me.

> BCP for over a 
> decade has been to put multipathing at the bottom, then crypto, then 
> software RAID, than LVM, and then whatever filesystem you're using. 

Really? Let's enumerate some caveats of this:

- crypto below software RAID means double-encryption (wasted CPU),

- RAID below LVM means you're stuck with the same RAID-profile for all
  the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
  system and RAID0 for various system caches (like ccache on software
  builder machine) or transient LVM-level snapshots.

- RAID below filesystem means loosing btrfs-RAID extra functionality,
  like recovering data from different mirror when CRC mismatch happens,

- crypto below LVN means encrypting everything, including data that is
  not sensitive - more CPU wasted,

- RAID below LVM means no way to use SSD acceleration of part of the HDD
  space using MD write-mostly functionality.

What you present is only some sane default, which doesn't mean it covers
all the real-world cases.

My recent server is using:
- raw partitioning for base volumes,
- LVM,
- MD on top of some LVs (varying levels),
- paritioned SSD cache attached to specific VGs,
- crypto on top of selected LV/MD,
- btrfs RAID1 on top of non-MDed LVs.

> Multipathing has to be the bottom layer for a given node because it 
> interacts directly with hardware topology which gets obscured by the 
> other layers.

It is the bottom layer, but I might be attached into volumes at virtually
any place of the logical topology tree. E.g. bare network drive added as
device-mapper mirror target for on-line volume cloning.

> Crypto essentially has to be next, otherwise you leak
> info about the storage stack.

I'm encrypting only the containers that require block-level encryption.
Others might have more effective filesystem-level encryption or even be
some TrueCrypt/whatever images.

> Swapping LVM and software RAID ends up 
> giving you a setup which is difficult for most people to understand and 
> therefore is hard to reliably maintain.

It's more difficult, as you need to maintain manually two (or more) separate 
VGs with
matching LVs inside. Harder, but more flexible.

> Other init systems enforce things being this way because it maintains 
> people's sanity, not because they have significant difficulty doing 
> things differently (and in fact, it is _trivial_ to change the ordering 
> in some of them, OpenRC on Gentoo for example quite literally requires 
> exactly N-1 lines to change in each of N files when re-ordering N 
> layers), provided each layer occurs exactly once for a given device and 
> the relative ordering is the same on all devices.  And you know what? 

The point is: mainaining all of this logic is NOT the job for init system.
With systemd you need exactly N-N=0 lines of code to make this work.

The appropriate unit files are provided by MD and LVM upstream.
And they include fallback mechanism for degrading volumes.

> Given my own experience with systemd, it has exactly the same constraint 
> on relative ordering.  I've tried to run split setups with LVM and 
> dm-crypt where one device had dm-crypt as the bottom layer and the other 
> had it as the top layer, and things locked up during boot on _every_ 
> generalized init system I tried.

Hard to tell without access to the failing system, but this MIGHT have been:

- old/missing/broken-by-distro-maintainers-who-know-better LVM rules,
- old/bugged systemd, possibly with broken/old cryptsetup rules.

>> It's quite obvious who's the culprit: every single remaining filesystem
>> manages to mount under systemd without problems. They just expose
>> informations about their state.
> No, they don't (except ZFS).

They don't expose informations (as there are none), but they DO mount.

> There is no 'state' to expose for anything but BTRFS (and ZFS)

Does ZFS expose it's state or not?

> except possibly if the filesystem needs checked or 
> not.  You're conflating filesystems and volume management.

btrfs is a filesystem, device manager and volume manager.
I might add DEVICE to a btrfs-thingy.
I might mount the same btrfs-thingy selecting different VOLUME 
(subVOL=something_other)

> The alternative way of putting what you just said is:
> Every single remaining filesystem manages to mount under systemd without 
> problems, because it doesn't try to treat them as a block layer.

Or: every other volume manager exposes separate block devices.

Anyway - however we put this into words, it is btrfs that behaves differently.

>> The 'needless complication', as you named it, usually should be the 

Re: degraded permanent mount option

2018-01-30 Thread Austin S. Hemmelgarn

On 2018-01-30 08:46, Tomasz Pala wrote:

On Mon, Jan 29, 2018 at 08:05:42 -0500, Austin S. Hemmelgarn wrote:


Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF
THIS_.  It's functionally no different than prefacing an attempt to send
a signal to a process by checking if the process exists, or trying to
see if some other process is using a file that might be locked by


Seriously, there is a race condition on train stations. People check if
the train has stopped and opened the door before they move their legs to
get in, but the train might be already gone - so this is pointless.

Instead, they should move their legs continuously and if the train is > not on 
the station yet, just climb back and retry.
No, that's really not a good analogy given the fact that that check for 
the presence of a train takes a normal person milliseconds while the 
event being raced against (the train departing) takes minutes.  In the 
case being discussed, the check takes milliseconds and the event being 
raced against also takes milliseconds.  The scale here is drastically 
different.>

See the difference? I hope now you know what is the race condition.
It is the condition, where CONSEQUENCES are fatal.
Yes, the consequences of the condition being discussed functionally are 
fatal (you completely fail to mount the volume), because systemd doesn't 
retry mounting the root filesystem, it just breaks, which is absolutely 
at odds with the whole 'just works' mentality I always hear from the 
systemd fanboys and developers.


You're already looping forever _waiting_ for the volume to appear.  How 
is that any different from lopping forever trying to _mount_ the volume 
instead given that failing to mount the volume is not going to damage 
things.  The issue here is that systemd refuses to implement any method 
of actually retrying things that fail during startup.>

mounting BEFORE volume is complete is FATAL - since no userspace daemon
would ever retrigger the mount and the system won't came up. Provide one
btrfsd volume manager and systemd could probably switch to using it.

And here you've lost any respect I might have had for you.

**YOU DO NOT NEED A DAEMON TO DO EVERY LAST TASK ON THE SYSTEM**

Period, end of story.


This is one of the two biggest things I hate about systemd (the journal 
is the other one for those who care).  You don't need some special 
daemon to set the time, or to set the hostname, or to fetch account 
data, or even to track who's logged in (though I understand that the 
last one is not systemd's fault originally).


As much as it may surprise the systemd developers, people got on just 
fine handling setting the system time, setting the hostname, fetching 
account info, tracking active users, and any number of myriad other 
tasks before systemd decided they needed to have their own special daemon.



In this particular case, you don't need a daemon because the kernel does 
the state tracking.  It only checks that state completely though _when 
you ask it to mount the filesystem_ because it requires doing 99% of the 
work of mounting the filesystem (quite literally, you're doing pretty 
much everything short of actually hooking things up in the VFS layer). 
We are not a case like MD where there's just a tiny bit of metadata to 
parse to check what the state is supposed to be.  Imagine if LVM 
required you to unconditionally activate all the LV's in a VG when you 
activate the VG and what logic would be required to validate the VG 
then, and you're pretty close to what's needed to check state for a 
BTRFS volume (translating LV's to chunks and the VG to the filesystem as 
a whole).  There is no point in trying to parse that data every time a 
new device shows up, it's a waste of time (at a minimum, you're almost 
doubling the amount of time it takes to mount a volume if you are doing 
this each time a device shows up), energy, and resources in general.


mounting AFTER volume is complete is FINE - and if the "pseudo-race" happens
and volume disappears, then this was either some operator action, so the
umount SHOULD happen, or we are facing some MALFUNCION, which is fatal
itself, not by being a "race condition".
Short of catastrophic failure, the _volume_ doesn't disappear, a 
component device does, and that is where the problem lies, especially 
given that the ioctl only tracks that each component device has been 
seen, not that all are present at the moment the ioctl is invoked.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Streamline btrfs_delalloc_reserve_metadata initial operations

2018-01-30 Thread Nikolay Borisov


On 30.01.2018 16:34, David Sterba wrote:
> On Fri, Jan 12, 2018 at 04:21:05PM +0200, Nikolay Borisov wrote:
>> @@ -6062,19 +6062,19 @@ int btrfs_delalloc_reserve_metadata(struct 
>> btrfs_inode *inode, u64 num_bytes)
>>   * If we have a transaction open (can happen if we call truncate_block
>>   * from truncate), then we need FLUSH_LIMIT so we don't deadlock.
>>   */
>> +
>>  if (btrfs_is_free_space_inode(inode)) {
>>  flush = BTRFS_RESERVE_NO_FLUSH;
>>  delalloc_lock = false;
>> -} else if (current->journal_info) {
>> -flush = BTRFS_RESERVE_FLUSH_LIMIT;
>> -}
>> +} else {
>> +if (current->journal_info)
>> +flush = BTRFS_RESERVE_FLUSH_LIMIT;
>>  
>> -if (flush != BTRFS_RESERVE_NO_FLUSH &&
>> -btrfs_transaction_in_commit(fs_info))
>> -schedule_timeout(1);
>> +if (btrfs_transaction_in_commit(fs_info))
>> +schedule_timeout(1);
>>  
>> -if (delalloc_lock)
>>  mutex_lock(>delalloc_mutex);
>> +}
> 
> Squeezing the condition branches makes the code more readable, I have
> only one objection and it's the mutex_lock. It IMHO looks better when
> it's a separate branch as it pairs with the unlock:
> 
> if (delalloc_lock)
>   mutex_lock(...);
> 
> ...
> 
> if (delalloc_lock)
>   mutex_unlock(...);
> 
> In your version it's implied by the first if that checks
> btrfs_is_free_space_inode and delalloc_lock is hidden there.
> 

My line of thought when developing the patch was that delalloc is
another level of indirection and. What I wanted to achieve in the end is
to make it clear that delalloc_mutex really depends on whether we are
reserving for the freespace inode or not.

>>  
>>  num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Streamline btrfs_delalloc_reserve_metadata initial operations

2018-01-30 Thread David Sterba
On Fri, Jan 12, 2018 at 04:21:05PM +0200, Nikolay Borisov wrote:
> @@ -6062,19 +6062,19 @@ int btrfs_delalloc_reserve_metadata(struct 
> btrfs_inode *inode, u64 num_bytes)
>* If we have a transaction open (can happen if we call truncate_block
>* from truncate), then we need FLUSH_LIMIT so we don't deadlock.
>*/
> +
>   if (btrfs_is_free_space_inode(inode)) {
>   flush = BTRFS_RESERVE_NO_FLUSH;
>   delalloc_lock = false;
> - } else if (current->journal_info) {
> - flush = BTRFS_RESERVE_FLUSH_LIMIT;
> - }
> + } else {
> + if (current->journal_info)
> + flush = BTRFS_RESERVE_FLUSH_LIMIT;
>  
> - if (flush != BTRFS_RESERVE_NO_FLUSH &&
> - btrfs_transaction_in_commit(fs_info))
> - schedule_timeout(1);
> + if (btrfs_transaction_in_commit(fs_info))
> + schedule_timeout(1);
>  
> - if (delalloc_lock)
>   mutex_lock(>delalloc_mutex);
> + }

Squeezing the condition branches makes the code more readable, I have
only one objection and it's the mutex_lock. It IMHO looks better when
it's a separate branch as it pairs with the unlock:

if (delalloc_lock)
mutex_lock(...);

...

if (delalloc_lock)
mutex_unlock(...);

In your version it's implied by the first if that checks
btrfs_is_free_space_inode and delalloc_lock is hidden there.

>  
>   num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RESEND PATCH] btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster

2018-01-30 Thread Nikolay Borisov
Essentially duplicate the error handling from the above block which
handles the !PageUptodate(page) case and additionally clear
EXTENT_BOUNDARY.

Signed-off-by: Nikolay Borisov 
Reviewed-by: Josef Bacik 
---

Put description of the intended changes. 

 fs/btrfs/relocation.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index f0c3f00e97cb..8b2a31cef5cf 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3268,12 +3268,25 @@ static int relocate_file_extent_cluster(struct inode 
*inode,
nr++;
}
 
-   btrfs_set_extent_delalloc(inode, page_start, page_end, 0, NULL,
- 0);
+   ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
+   NULL, 0);
+   if (ret) {
+   unlock_page(page);
+   put_page(page);
+   btrfs_delalloc_release_metadata(BTRFS_I(inode),
+PAGE_SIZE);
+   btrfs_delalloc_release_extents(BTRFS_I(inode),
+  PAGE_SIZE);
+
+   clear_extent_bits(_I(inode)->io_tree,
+ page_start, page_end,
+ EXTENT_LOCKED | EXTENT_BOUNDARY);
+   goto out;
+
+   }
set_page_dirty(page);
 
-   unlock_extent(_I(inode)->io_tree,
- page_start, page_end);
+   unlock_extent(_I(inode)->io_tree, page_start, page_end);
unlock_page(page);
put_page(page);
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] btrfs: fix device order consistency

2018-01-30 Thread David Sterba
On Mon, Jan 22, 2018 at 02:49:36PM -0800, Anand Jain wrote:
> By maintaining the device order consistency it makes reproducing
> the problems related to missing chunk in the degraded mode much more
> consistent. So fix this by sorting the devices by devid within the
> kernel. So that we know which device is assigned to the struct
> fs_info::latest_bdev when all the devices are having and same
> SB generation.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/volumes.c | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 03f2685a5018..98e41d286283 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "ctree.h"
>  #include "extent_map.h"
> @@ -1107,6 +1108,20 @@ static int __btrfs_open_devices(struct 
> btrfs_fs_devices *fs_devices,
>   return ret;
>  }
>  
> +static int device_sort(void *priv, struct list_head *a, struct list_head *b)

I'll rename that to devid_cmp as it's the comparator and not really a
sorting function.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/2] fix device orders consistency

2018-01-30 Thread David Sterba
On Mon, Jan 22, 2018 at 02:49:35PM -0800, Anand Jain wrote:
>  Could you pls consider this for 4.16 ?

Sorry, the patches are not fixes and I'll probably not add them to 4.16
and they missed the first pull. But I will add them to misc-next so they
still help debugging and any fix that you come up with can be added to
4.16-rc independently.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Use schedule_timeout_interruptible

2018-01-30 Thread David Sterba
On Wed, Jan 24, 2018 at 10:38:51AM -0500, Josef Bacik wrote:
> On Tue, Jan 23, 2018 at 02:46:53PM +0200, Nikolay Borisov wrote:
> > Instead of manually fiddling with the state of the task
> > (RUNNING->INTERRUPTIBLE->RUNNING) again just use 
> > schedule_timeout_interruptible
> > which adjusts the task state as needed. No functional changes.
> > 
> > Signed-off-by: Nikolay Borisov 
> 
> Reviewed-by: Josef Bacik 

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: tree-checker: Replace root parameter with fs_info

2018-01-30 Thread David Sterba
On Fri, Jan 26, 2018 at 09:31:06AM -0500, Josef Bacik wrote:
> On Thu, Jan 25, 2018 at 02:56:18PM +0800, Qu Wenruo wrote:
> > When inspecting the error message with real corruption, the "root=%llu"
> > always shows "1" (root tree), instead of correct owner.
> > 
> > The problem is that we are getting @root from page->mapping->host, which
> > points the same btree inode, so we will always get the same root.
> > 
> > This makes the root owner output meaningless, and harder to port
> > tree-checker to btrfs-progs.
> > 
> > So get rid of the false and meaningless @root parameter and replace it
> > with @fs_info.
> > To get the owner, we can only rely on btrfs_header_owner() now.
> > 
> > Signed-off-by: Qu Wenruo 
> 
> Reviewed-by: Josef Bacik 

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RESEND PATCH] btrfs: Fix UAF when cleaning up fs_devs with a single stale device

2018-01-30 Thread Nikolay Borisov
Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced
btrfs_free_stale_device which iterates the device lists for all
registered btrfs filesystems and deletes those devices which aren't
mounted. In a btrfs_devices structure has only 1 device attached to it
and it is unused then btrfs_free_stale_devices will proceed to also
free the btrfs_fs_devices struct itself. Currently this leads to a UAF
since list_for_each_entry will try to perform a check on the already-
freed memory to see if it has to terminated the loop.

The fix is to use 'break' when we know we are freeing the current
fs_devs.

Fixes: 4fde46f0cc71 ("Btrfs: free the stale device")
Signed-off-by: Nikolay Borisov 
---

Changed the subject to make it more descriptive 

 fs/btrfs/volumes.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f7147740b68e..c3ab55336ee0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -645,6 +645,7 @@ static void btrfs_free_stale_devices(const char *path,
btrfs_sysfs_remove_fsid(fs_devs);
list_del(_devs->list);
free_fs_devices(fs_devs);
+   break;
} else {
fs_devs->num_devices--;
list_del(>dev_list);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: Ignore errors from btrfs_qgroup_trace_extent_post

2018-01-30 Thread David Sterba
On Mon, Jan 29, 2018 at 10:06:52PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年01月29日 21:53, Nikolay Borisov wrote:
> > Running generic/019 with qgroups on the scratch device enabled is
> > almost guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's
> > supposed to trigger only on -ENOMEM, in reality, however, it's possible
> > to get -EIO from btrfs_qgroup_trace_extent_post. This function just
> > finds the roots of the extent being tracked and sets the qrecord->old_roots
> > list. If this operation fails nothing critical happens except the
> > quota accounting can be considered wrong. In such case just set the
> > INCONSISTENT flag for the quota and print a warning, rather than
> > killing off the system. Additionally, it's possible to trigger a BUG_ON
> > in btrfs_truncate_inode_items as well.
> > 
> > Signed-off-by: Nikolay Borisov 
> 
> Reviewed-by: Qu Wenruo 

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: Add chunk allocation ENOSPC debug message for enospc_debug mount option

2018-01-30 Thread David Sterba
On Mon, Jan 22, 2018 at 10:11:44AM +0200, Nikolay Borisov wrote:
> > @@ -4731,6 +4740,12 @@ static int __btrfs_alloc_chunk(struct 
> > btrfs_trans_handle *trans,
> >  
> > if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
> > ret = -ENOSPC;
> > +   if (btrfs_test_opt(info, ENOSPC_DEBUG)) {
> > +   btrfs_debug(info,
> > +   "%s: not enough devices with free space: have=%d minimal 
> > required=%d",
> 
> nit: s/minimal/minimum
> But there is no point in resending just for that, I guess David could
> fix it while merging.

Yes, string updated and added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 08:05:42 -0500, Austin S. Hemmelgarn wrote:

> Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF 
> THIS_.  It's functionally no different than prefacing an attempt to send 
> a signal to a process by checking if the process exists, or trying to 
> see if some other process is using a file that might be locked by 

Seriously, there is a race condition on train stations. People check if
the train has stopped and opened the door before they move their legs to
get in, but the train might be already gone - so this is pointless.

Instead, they should move their legs continuously and if the train is
not on the station yet, just climb back and retry.


See the difference? I hope now you know what is the race condition.
It is the condition, where CONSEQUENCES are fatal.


mounting BEFORE volume is complete is FATAL - since no userspace daemon
would ever retrigger the mount and the system won't came up. Provide one
btrfsd volume manager and systemd could probably switch to using it.

mounting AFTER volume is complete is FINE - and if the "pseudo-race" happens
and volume disappears, then this was either some operator action, so the
umount SHOULD happen, or we are facing some MALFUNCION, which is fatal
itself, not by being a "race condition".

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-30 Thread Austin S. Hemmelgarn

On 2018-01-29 16:54, waxhead wrote:



Austin S. Hemmelgarn wrote:

On 2018-01-29 12:58, Andrei Borzenkov wrote:

29.01.2018 14:24, Adam Borowski пишет:
...


So any event (the user's request) has already happened.  A rc 
system, of
which systemd is one, knows whether we reached the "want root 
filesystem" or
"want secondary filesystems" stage.  Once you're there, you can 
issue the

mount() call and let the kernel do the work.

It is a btrfs choice to not expose compound device as separate one 
(like

every other device manager does)


Btrfs is not a device manager, it's a filesystem.

it is a btrfs drawback that doesn't provice anything else except 
for this

IOCTL with it's logic


How can it provide you with something it doesn't yet have?  If you 
want the
information, call mount().  And as others in this thread have 
mentioned,
what, pray tell, would you want to know "would a mount succeed?" for 
if you

don't want to mount?

it is a btrfs drawback that there is nothing to push assembling 
into "OK,

going degraded" state


The way to do so is to timeout, then retry with -o degraded.



That's possible way to solve it. This likely requires support from
mount.btrfs (or btrfs.ko) to return proper indication that filesystem is
incomplete so caller can decide whether to retry or to try degraded 
mount.
We already do so in the accepted standard manner.  If the mount fails 
because of a missing device, you get a very specific message in the 
kernel log about it, as is the case for most other common errors (for 
uncommon ones you usually just get a generic open_ctree error).  This 
is really the only option too, as the mount() syscall (which the mount 
command calls) returns only 0 on success or -1 and an appropriate 
errno value on failure, and we can't exactly go about creating a half 
dozen new error numbers just for this (well, technically we could, but 
I very much doubt that they would be accepted upstream, which defeats 
the purpose).


Or may be mount.btrfs should implement this logic internally. This would
really be the most simple way to make it acceptable to the other side by
not needing to accept anything :)
And would also be another layering violation which would require a 
proliferation of extra mount options to control the mount command 
itself and adjust the timeout handling.


This has been done before with mount.nfs, but for slightly different 
reasons (primarily to allow nested NFS mounts, since the local 
directory that the filesystem is being mounted on not being present is 
treated like a mount timeout), and it had near zero control.  It works 
there because they push the complicated policy decisions to userspace 
(namely, there is no support for retrying with different options or 
trying a different server).


I just felt like commenting a bit on this from a regular users point of 
view.


Remember that at some point BTRFS will probably be the default 
filesystem for the average penguin.
BTRFS big selling point is redundance and a guarantee that whatever you 
write is the same that you will read sometime later.


Many users will probably build their BTRFS system on a redundant array 
of storage devices. As long as there are sufficient (not necessarily 
all) storage devices present they expect their system to come up and 
work. If the system is not able to come up in a fully operative state it 
must at least be able to limp until the issue is fixed.


Starting a argument about what init system is the most sane or most 
shiny is not helping. The truth is that systemd is not going away 
sometime soon and one might as well try to become friends if nothing 
else for the sake of having things working which should be a common goal 
regardless of the religion.
FWIW, I don't care that it's systemd in this case, I care that people 
are arguing for the forced use of a coding anti-pattern that ends up 
being covered as bad practice in first year computer science courses 
(no, seriously, every professional programmer I've asked about this had 
time-of-check-time-of-use race conditions covered in one of their 
first-year CS classes) or the enforcement of an event-based model that 
really doesn't make any sense for this (OK, it makes a little sense for 
handling of devices reappearing, but systemd doesn't need to be involved 
in that beyond telling the kernel that the device reappeared, except 
that that's udev's job).


I personally think the degraded mount option is a mistake as this 
assumes that a lightly degraded system is not able to work which is false.
If the system can mount to some working state then it should mount 
regardless if it is fully operative or not. If the array is in a bad 
state you need to learn about it by issuing a command or something. The 
same goes for a MD array (and yes, I am aware of the block layer vs 
filesystem thing here).
The problem with this is that right now, it is not safe to run a BTRFS 
volume degraded and writable, but for an even remotely usable system 

Re: [PATCH] Btrfs: fix unexpected cow in run_delalloc_nocow

2018-01-30 Thread David Sterba
On Thu, Jan 25, 2018 at 11:02:49AM -0700, Liu Bo wrote:
> Fstests generic/475 provides a way to fail metadata reads while
> checking if checksum exists for the inode inside run_delalloc_nocow(),
> and csum_exist_in_range() interprets error (-EIO) as inode having
> checksum and makes its caller enters the cow path.
> 
> In case of free space inode, this ends up with a warning in
> cow_file_range().
> 
> The same problem applies for btrfs_cross_ref_exist() since it may also
> read metadata in between.
> 
> With this, run_delalloc_nocow() bails out when errors occur at the two
> places.
> 
> cc:  v2.6.28+
> Fixes: 17d217fe970d ("Btrfs: fix nodatasum handling in balancing code")
> Signed-off-by: Liu Bo 

This and the other fixes sent in this batch have been added to next and
I'm going to push them to 4.16 either during the merge window still or
at rc1 time. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] inode->i_version rework for v4.16

2018-01-30 Thread Jeff Layton
On Mon, 2018-01-29 at 13:50 -0800, Linus Torvalds wrote:
> On Mon, Jan 29, 2018 at 4:26 AM, Jeff Layton  wrote:
> > 
> > This pile of patches is a rework of the inode->i_version field. We have
> > traditionally incremented that field on every inode data or metadata
> > change. Typically this increment needs to be logged on disk even when
> > nothing else has changed, which is rather expensive.
> 
> Hmm. I have pulled this, but it is really really broken in one place,
> to the degree that I always went "no, I won't pull this garbage".
> 
> But the breakage is potential, not actual, and can be fixed trivially,
> so I'll let it slide - but I do require it to be fixed. And I require
> people to *think* about it.
> 
> So what's to horribly horribly wrong?
> 
> The inode_cmp_iversion{+raw}() functions are pure and utter crap.
> 
> Why?
> 
> You say that they return 0/negative/positive, but they do so in a
> completely broken manner. They return that ternary value as the
> sequence number difference in a 's64', which means that if you
> actually care about that ternary value, and do the *sane* thing that
> the kernel-doc of the function implies is the right thing, you would
> do
> 
> int cmp = inode_cmp_iversion(inode, old);
> 
> if (cmp < 0 ...
> 
> and as a result you get code that looks sane, but that doesn't
> actually *WORK* right.
> 

My intent here was to have this handle wraparound using the same sort of
method that the time_before/time_after macros do. Obviously, I didn't
document that well.

I want to make sure I understand what's actually broken here thoug. Is
it only broken when the two values are more than 2**63 apart, or is
there something else more fundamentally wrong here?

> To make it even worse, it will actually work in practice by accident
> in 99.9% of all cases, so now you have
> 
>  (a) subtly buggy code
>  (b) that looks fine
>  (c) and that works in testing
> 
> which is just about the worst possible case for any code. The
> interface is simply garbage that encourages bugs.
> 
> And the bug wouldn't be in the user, the bug would be in this code you
> just sent me. The interface is simply wrong.
> 
> So this absolutely needs to be fixed. I see two fixes:
> 
>  - just return a boolean. That's all that any current user actually
> wants, so the ternary value seems pointless.
> 
>  - make it return an 'int', and not just any int, but -1/0/1. That way
> there is no worry about uses, and if somebody *really* cares about the
> ternary value, they can now use a "switch" statement to get it
> (alternatively, make it return an enum, but whatever).
> 
> That "ternary" function that has 18446744069414584320 incorrect return
> values really is unacceptable.
> 

I think I'll just make it return a boolean value like you suggested
first. I'll send a patch to fix it once I've done some basic testing
with it.

Many thanks,
--
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info

2018-01-30 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
Reviewed-by: Anand Jain 
Reviewed-by: Nikolay Borisov 
---
v2:
  Added tags.
---
 fs/btrfs/volumes.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d818b1f9c625..215e85e22c8e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4581,7 +4581,7 @@ static void check_raid56_incompat_flag(struct 
btrfs_fs_info *info, u64 type)
btrfs_set_fs_incompat(info, RAID56);
 }
 
-#define BTRFS_MAX_DEVS(r) ((BTRFS_MAX_ITEM_SIZE(r->fs_info)\
+#define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info)   \
- sizeof(struct btrfs_chunk))   \
/ sizeof(struct btrfs_stripe) + 1)
 
@@ -4638,7 +4638,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
max_stripe_size = SZ_1G;
max_chunk_size = 10 * max_stripe_size;
if (!devs_max)
-   devs_max = BTRFS_MAX_DEVS(info->chunk_root);
+   devs_max = BTRFS_MAX_DEVS(info);
} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
/* for larger filesystems, use larger metadata chunks */
if (fs_devices->total_rw_bytes > 50ULL * SZ_1G)
@@ -4647,7 +4647,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
max_stripe_size = SZ_256M;
max_chunk_size = max_stripe_size;
if (!devs_max)
-   devs_max = BTRFS_MAX_DEVS(info->chunk_root);
+   devs_max = BTRFS_MAX_DEVS(info);
} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
max_stripe_size = SZ_32M;
max_chunk_size = 2 * max_stripe_size;
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()

2018-01-30 Thread Qu Wenruo
Function __get_raid_index() is used to convert block group flags into
raid index, which can be used to get various info directly from
btrfs_raid_array[].

Refactor this function a little:

1) Rename to btrfs_bg_flags_to_raid_index()
   Double underline prefix is normally for internal functions, while the
   function is used by both extent-tree and volumes.

   Although the name is a little longer, but it should explain its usage
   quite well.

2) Move it to volumes.h and make it static inline
   Just several if-else branches, really no need to define it as a normal
   function.

   This also makes later code re-use between kernel and btrfs-progs
   easier.

3) Remove function get_block_group_index()
   Really no need to do such a simple thing as an exported function.

Signed-off-by: Qu Wenruo 
Reviewed-by: Anand Jain 
Reviewed-by: Nikolay Borisov 
---
v2:
  Remove get_block_group_index() in this patch.
  Add extra comment for the usage of btrfs_bg_flags_to_raid_index()
---
 fs/btrfs/ctree.h   |  1 -
 fs/btrfs/extent-tree.c | 41 ++---
 fs/btrfs/volumes.c |  2 +-
 fs/btrfs/volumes.h | 22 ++
 4 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 13c260b525a1..27249240fa3e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2629,7 +2629,6 @@ struct btrfs_block_group_cache *btrfs_lookup_block_group(
 u64 bytenr);
 void btrfs_get_block_group(struct btrfs_block_group_cache *cache);
 void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
-int get_block_group_index(struct btrfs_block_group_cache *cache);
 struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 parent, u64 root_objectid,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2f4328511ac8..2693c8617817 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7346,29 +7346,6 @@ wait_block_group_cache_done(struct 
btrfs_block_group_cache *cache)
return ret;
 }
 
-int __get_raid_index(u64 flags)
-{
-   if (flags & BTRFS_BLOCK_GROUP_RAID10)
-   return BTRFS_RAID_RAID10;
-   else if (flags & BTRFS_BLOCK_GROUP_RAID1)
-   return BTRFS_RAID_RAID1;
-   else if (flags & BTRFS_BLOCK_GROUP_DUP)
-   return BTRFS_RAID_DUP;
-   else if (flags & BTRFS_BLOCK_GROUP_RAID0)
-   return BTRFS_RAID_RAID0;
-   else if (flags & BTRFS_BLOCK_GROUP_RAID5)
-   return BTRFS_RAID_RAID5;
-   else if (flags & BTRFS_BLOCK_GROUP_RAID6)
-   return BTRFS_RAID_RAID6;
-
-   return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
-}
-
-int get_block_group_index(struct btrfs_block_group_cache *cache)
-{
-   return __get_raid_index(cache->flags);
-}
-
 static const char *btrfs_raid_type_names[BTRFS_NR_RAID_TYPES] = {
[BTRFS_RAID_RAID10] = "raid10",
[BTRFS_RAID_RAID1]  = "raid1",
@@ -7483,7 +7460,7 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
u64 empty_cluster = 0;
struct btrfs_space_info *space_info;
int loop = 0;
-   int index = __get_raid_index(flags);
+   int index = btrfs_bg_flags_to_raid_index(flags);
bool failed_cluster_refill = false;
bool failed_alloc = false;
bool use_cluster = true;
@@ -7569,7 +7546,8 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
btrfs_put_block_group(block_group);
up_read(_info->groups_sem);
} else {
-   index = get_block_group_index(block_group);
+   index = btrfs_bg_flags_to_raid_index(
+   block_group->flags);
btrfs_lock_block_group(block_group, delalloc);
goto have_block_group;
}
@@ -7579,7 +7557,7 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
}
 search:
have_caching_bg = false;
-   if (index == 0 || index == __get_raid_index(flags))
+   if (index == 0 || index == btrfs_bg_flags_to_raid_index(flags))
full_search = true;
down_read(_info->groups_sem);
list_for_each_entry(block_group, _info->block_groups[index],
@@ -7837,7 +7815,8 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
 loop:
failed_cluster_refill = false;
failed_alloc = false;
-   BUG_ON(index != get_block_group_index(block_group));
+   BUG_ON(btrfs_bg_flags_to_raid_index(block_group->flags) !=
+  index);
 

Re: [PATCH 1/3] btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()

2018-01-30 Thread Qu Wenruo


On 2018年01月30日 16:16, Nikolay Borisov wrote:
> 
> 
> On 30.01.2018 09:40, Qu Wenruo wrote:
>> Function __get_raid_index() is used to convert block group flags into
>> raid index, which can be used to get various info directly from
>> btrfs_raid_array[].
>>
>> Refactor this function a little:
>>
>> 1) Rename to btrfs_bg_flags_to_raid_index()
>>Double underline prefix is normally for internal functions, while the
>>function is used by both extent-tree and volumes.
>>
>>Although the name is a little longer, but it should explain its usage
>>quite well.
>>
>> 2) Move it to volumes.h and make it static inline
>>Just several if-else branches, really no need to define it as a normal
>>function.
>>
>>This also makes later code re-use between kernel and btrfs-progs
>>easier.
>>
>> Signed-off-by: Qu Wenruo 
> 
> Generally this looks good so you can add my :
> 
> Reviewed-by: Nikolay Borisov 
> 
> However I wonder whether we should strive to add proper kernel docs to
> function whenever we can (without going out of the way to do so).

Kernel docs here seems a little over killed here.
IMHO kernel docs are used for things like API, state machine or design
overview.

Here the raid type index is mostly internal used as a quicker and
unified way to access btrfs_raid_array[].

Extra comment (maybe 1 or 2 lines) will be added in next version about
the usage, and hopes it should work.

> I.e
> you are modifying the function now so might as well add them and
> explicitly state it's used to convert blockgroup flags to raid levels.
> The end goal is to make it easier for people who are looking for the
> first time at the source to have easier time.

Yep, making the source easier to read is always good.

Thanks,
Qu

> 
>> ---
>>  fs/btrfs/extent-tree.c | 26 --
>>  fs/btrfs/volumes.c |  2 +-
>>  fs/btrfs/volumes.h | 18 ++
>>  3 files changed, 23 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 2f4328511ac8..e9c31b567a9c 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -7346,27 +7346,9 @@ wait_block_group_cache_done(struct 
>> btrfs_block_group_cache *cache)
>>  return ret;
>>  }
>>  
>> -int __get_raid_index(u64 flags)
>> -{
>> -if (flags & BTRFS_BLOCK_GROUP_RAID10)
>> -return BTRFS_RAID_RAID10;
>> -else if (flags & BTRFS_BLOCK_GROUP_RAID1)
>> -return BTRFS_RAID_RAID1;
>> -else if (flags & BTRFS_BLOCK_GROUP_DUP)
>> -return BTRFS_RAID_DUP;
>> -else if (flags & BTRFS_BLOCK_GROUP_RAID0)
>> -return BTRFS_RAID_RAID0;
>> -else if (flags & BTRFS_BLOCK_GROUP_RAID5)
>> -return BTRFS_RAID_RAID5;
>> -else if (flags & BTRFS_BLOCK_GROUP_RAID6)
>> -return BTRFS_RAID_RAID6;
>> -
>> -return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
>> -}
>> -
>>  int get_block_group_index(struct btrfs_block_group_cache *cache)
>>  {
>> -return __get_raid_index(cache->flags);
>> +return btrfs_bg_flags_to_raid_index(cache->flags);
>>  }
>>  
>>  static const char *btrfs_raid_type_names[BTRFS_NR_RAID_TYPES] = {
>> @@ -7483,7 +7465,7 @@ static noinline int find_free_extent(struct 
>> btrfs_fs_info *fs_info,
>>  u64 empty_cluster = 0;
>>  struct btrfs_space_info *space_info;
>>  int loop = 0;
>> -int index = __get_raid_index(flags);
>> +int index = btrfs_bg_flags_to_raid_index(flags);
>>  bool failed_cluster_refill = false;
>>  bool failed_alloc = false;
>>  bool use_cluster = true;
>> @@ -7579,7 +7561,7 @@ static noinline int find_free_extent(struct 
>> btrfs_fs_info *fs_info,
>>  }
>>  search:
>>  have_caching_bg = false;
>> -if (index == 0 || index == __get_raid_index(flags))
>> +if (index == 0 || index == btrfs_bg_flags_to_raid_index(flags))
>>  full_search = true;
>>  down_read(_info->groups_sem);
>>  list_for_each_entry(block_group, _info->block_groups[index],
>> @@ -9643,7 +9625,7 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, 
>> u64 bytenr)
>>   */
>>  target = get_restripe_target(fs_info, block_group->flags);
>>  if (target) {
>> -index = __get_raid_index(extended_to_chunk(target));
>> +index = btrfs_bg_flags_to_raid_index(extended_to_chunk(target));
>>  } else {
>>  /*
>>   * this is just a balance, so if we were marked as full
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index a25684287501..d818b1f9c625 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -4625,7 +4625,7 @@ static int __btrfs_alloc_chunk(struct 
>> btrfs_trans_handle *trans,
>>  if (list_empty(_devices->alloc_list))
>>  return -ENOSPC;
>>  
>> -index = __get_raid_index(type);
>> +index = btrfs_bg_flags_to_raid_index(type);
>>  
>>  sub_stripes = 

Re: [PATCH 2/2] btrfs: drop optimal argument from find_live_mirror()

2018-01-30 Thread Anand Jain



On 01/30/2018 05:12 PM, Nikolay Borisov wrote:



On 30.01.2018 08:28, Anand Jain wrote:

Drop optimal argument from the function find_live_mirror()
as we can deduce it in the function itself.

Signed-off-by: Anand Jain 
---
  fs/btrfs/volumes.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9c9d987838c2..a61715677b67 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5253,10 +5253,11 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
*fs_info, u64 logical, u64 len)
  
  static int find_live_mirror(struct btrfs_fs_info *fs_info,

struct map_lookup *map, int first,
-   int optimal, int dev_replace_is_ongoing)
+   int dev_replace_is_ongoing)
  {
int i;
int num;
+   int optimal;
int tolerance;
struct btrfs_device *srcdev;
  
@@ -5268,6 +5269,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,

else
num = map->num_stripes;
  
+	optimal = first + current->pid % num;

+
if (dev_replace_is_ongoing &&
fs_info->dev_replace.cont_reading_from_srcdev_mode ==
 BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
@@ -5821,7 +5824,6 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info,
stripe_index = mirror_num - 1;
else {
stripe_index = find_live_mirror(fs_info, map, 0,
-   current->pid % map->num_stripes,
dev_replace_is_ongoing);
mirror_num = stripe_index + 1;
}
@@ -5849,8 +5851,6 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info,
int old_stripe_index = stripe_index;
stripe_index = find_live_mirror(fs_info, map,
  stripe_index,
- stripe_index +
- current->pid % map->sub_stripes,


The value of optimal here (BTRFS_BLOCK_GROUP_RAID10) is different than
the value of optimal in BTRFS_BLOCK_GROUP_RAID1 case. So you need to put
optimal in the conditional branch you added in the previous patch. In
raid 1 case:

optimal = current->pid % num;

In raid 10:

optimal = first + current->pid % num;

to current semantics


 First/stripe_index is 0 for RAID1. So I purposely kept it like this 
for easy code flow.


Thanks, Anand


  dev_replace_is_ongoing);
mirror_num = stripe_index - old_stripe_index + 1;
}


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] btrfs: Unexport get_block_group_index()

2018-01-30 Thread Qu Wenruo


On 2018年01月30日 16:18, Nikolay Borisov wrote:
> 
> 
> On 30.01.2018 09:40, Qu Wenruo wrote:
>> That function is only used inside extent-tree.c.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/ctree.h   | 1 -
>>  fs/btrfs/extent-tree.c | 3 ++-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 13c260b525a1..27249240fa3e 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -2629,7 +2629,6 @@ struct btrfs_block_group_cache 
>> *btrfs_lookup_block_group(
>>   u64 bytenr);
>>  void btrfs_get_block_group(struct btrfs_block_group_cache *cache);
>>  void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
>> -int get_block_group_index(struct btrfs_block_group_cache *cache);
>>  struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle 
>> *trans,
>>   struct btrfs_root *root,
>>   u64 parent, u64 root_objectid,
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index e9c31b567a9c..6e1128aa29d6 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -7346,7 +7346,8 @@ wait_block_group_cache_done(struct 
>> btrfs_block_group_cache *cache)
>>  return ret;
>>  }
>>  
>> -int get_block_group_index(struct btrfs_block_group_cache *cache)
>> +static enum btrfs_raid_types
>> +get_block_group_index(struct btrfs_block_group_cache *cache)
>>  {
>>  return btrfs_bg_flags_to_raid_index(cache->flags);
>>  }
> 
> I'd rather you removed this function and used bg_flags_to_raid_index
> directly. Otherwise we have yet another level of indirection with a
> rather wrong name - "block_group_index" ?

Right, also pointed out by Anand, I'll just remove it.

Thanks,
Qu

> 
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 2/2] btrfs: drop optimal argument from find_live_mirror()

2018-01-30 Thread Nikolay Borisov


On 30.01.2018 08:28, Anand Jain wrote:
> Drop optimal argument from the function find_live_mirror()
> as we can deduce it in the function itself.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/volumes.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9c9d987838c2..a61715677b67 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5253,10 +5253,11 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
> *fs_info, u64 logical, u64 len)
>  
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>   struct map_lookup *map, int first,
> - int optimal, int dev_replace_is_ongoing)
> + int dev_replace_is_ongoing)
>  {
>   int i;
>   int num;
> + int optimal;
>   int tolerance;
>   struct btrfs_device *srcdev;
>  
> @@ -5268,6 +5269,8 @@ static int find_live_mirror(struct btrfs_fs_info 
> *fs_info,
>   else
>   num = map->num_stripes;
>  
> + optimal = first + current->pid % num;
> +
>   if (dev_replace_is_ongoing &&
>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
>BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
> @@ -5821,7 +5824,6 @@ static int __btrfs_map_block(struct btrfs_fs_info 
> *fs_info,
>   stripe_index = mirror_num - 1;
>   else {
>   stripe_index = find_live_mirror(fs_info, map, 0,
> - current->pid % map->num_stripes,
>   dev_replace_is_ongoing);
>   mirror_num = stripe_index + 1;
>   }
> @@ -5849,8 +5851,6 @@ static int __btrfs_map_block(struct btrfs_fs_info 
> *fs_info,
>   int old_stripe_index = stripe_index;
>   stripe_index = find_live_mirror(fs_info, map,
> stripe_index,
> -   stripe_index +
> -   current->pid % map->sub_stripes,

The value of optimal here (BTRFS_BLOCK_GROUP_RAID10) is different than
the value of optimal in BTRFS_BLOCK_GROUP_RAID1 case. So you need to put
optimal in the conditional branch you added in the previous patch. In
raid 1 case:

optimal = current->pid % num;

In raid 10:

optimal = first + current->pid % num;

to current semantics

> dev_replace_is_ongoing);
>   mirror_num = stripe_index - old_stripe_index + 1;
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info

2018-01-30 Thread Nikolay Borisov


On 30.01.2018 09:40, Qu Wenruo wrote:
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 


> ---
>  fs/btrfs/volumes.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index d818b1f9c625..215e85e22c8e 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4581,7 +4581,7 @@ static void check_raid56_incompat_flag(struct 
> btrfs_fs_info *info, u64 type)
>   btrfs_set_fs_incompat(info, RAID56);
>  }
>  
> -#define BTRFS_MAX_DEVS(r) ((BTRFS_MAX_ITEM_SIZE(r->fs_info)  \
> +#define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info) \
>   - sizeof(struct btrfs_chunk))   \
>   / sizeof(struct btrfs_stripe) + 1)
>  
> @@ -4638,7 +4638,7 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   max_stripe_size = SZ_1G;
>   max_chunk_size = 10 * max_stripe_size;
>   if (!devs_max)
> - devs_max = BTRFS_MAX_DEVS(info->chunk_root);
> + devs_max = BTRFS_MAX_DEVS(info);
>   } else if (type & BTRFS_BLOCK_GROUP_METADATA) {
>   /* for larger filesystems, use larger metadata chunks */
>   if (fs_devices->total_rw_bytes > 50ULL * SZ_1G)
> @@ -4647,7 +4647,7 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   max_stripe_size = SZ_256M;
>   max_chunk_size = max_stripe_size;
>   if (!devs_max)
> - devs_max = BTRFS_MAX_DEVS(info->chunk_root);
> + devs_max = BTRFS_MAX_DEVS(info);
>   } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
>   max_stripe_size = SZ_32M;
>   max_chunk_size = 2 * max_stripe_size;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] btrfs: Unexport get_block_group_index()

2018-01-30 Thread Nikolay Borisov


On 30.01.2018 09:40, Qu Wenruo wrote:
> That function is only used inside extent-tree.c.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/ctree.h   | 1 -
>  fs/btrfs/extent-tree.c | 3 ++-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 13c260b525a1..27249240fa3e 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2629,7 +2629,6 @@ struct btrfs_block_group_cache 
> *btrfs_lookup_block_group(
>u64 bytenr);
>  void btrfs_get_block_group(struct btrfs_block_group_cache *cache);
>  void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
> -int get_block_group_index(struct btrfs_block_group_cache *cache);
>  struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle 
> *trans,
>struct btrfs_root *root,
>u64 parent, u64 root_objectid,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e9c31b567a9c..6e1128aa29d6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7346,7 +7346,8 @@ wait_block_group_cache_done(struct 
> btrfs_block_group_cache *cache)
>   return ret;
>  }
>  
> -int get_block_group_index(struct btrfs_block_group_cache *cache)
> +static enum btrfs_raid_types
> +get_block_group_index(struct btrfs_block_group_cache *cache)
>  {
>   return btrfs_bg_flags_to_raid_index(cache->flags);
>  }

I'd rather you removed this function and used bg_flags_to_raid_index
directly. Otherwise we have yet another level of indirection with a
rather wrong name - "block_group_index" ?

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info

2018-01-30 Thread Anand Jain



On 01/30/2018 03:40 PM, Qu Wenruo wrote:

Signed-off-by: Qu Wenruo 


Reviewed-by: Anand Jain 

Thanks, Anand


---
  fs/btrfs/volumes.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d818b1f9c625..215e85e22c8e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4581,7 +4581,7 @@ static void check_raid56_incompat_flag(struct 
btrfs_fs_info *info, u64 type)
btrfs_set_fs_incompat(info, RAID56);
  }
  
-#define BTRFS_MAX_DEVS(r) ((BTRFS_MAX_ITEM_SIZE(r->fs_info)		\

+#define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info)   \
- sizeof(struct btrfs_chunk))   \
/ sizeof(struct btrfs_stripe) + 1)
  
@@ -4638,7 +4638,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,

max_stripe_size = SZ_1G;
max_chunk_size = 10 * max_stripe_size;
if (!devs_max)
-   devs_max = BTRFS_MAX_DEVS(info->chunk_root);
+   devs_max = BTRFS_MAX_DEVS(info);
} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
/* for larger filesystems, use larger metadata chunks */
if (fs_devices->total_rw_bytes > 50ULL * SZ_1G)
@@ -4647,7 +4647,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
max_stripe_size = SZ_256M;
max_chunk_size = max_stripe_size;
if (!devs_max)
-   devs_max = BTRFS_MAX_DEVS(info->chunk_root);
+   devs_max = BTRFS_MAX_DEVS(info);
} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
max_stripe_size = SZ_32M;
max_chunk_size = 2 * max_stripe_size;


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()

2018-01-30 Thread Nikolay Borisov


On 30.01.2018 09:40, Qu Wenruo wrote:
> Function __get_raid_index() is used to convert block group flags into
> raid index, which can be used to get various info directly from
> btrfs_raid_array[].
> 
> Refactor this function a little:
> 
> 1) Rename to btrfs_bg_flags_to_raid_index()
>Double underline prefix is normally for internal functions, while the
>function is used by both extent-tree and volumes.
> 
>Although the name is a little longer, but it should explain its usage
>quite well.
> 
> 2) Move it to volumes.h and make it static inline
>Just several if-else branches, really no need to define it as a normal
>function.
> 
>This also makes later code re-use between kernel and btrfs-progs
>easier.
> 
> Signed-off-by: Qu Wenruo 

Generally this looks good so you can add my :

Reviewed-by: Nikolay Borisov 

However I wonder whether we should strive to add proper kernel docs to
function whenever we can (without going out of the way to do so). I.e
you are modifying the function now so might as well add them and
explicitly state it's used to convert blockgroup flags to raid levels.
The end goal is to make it easier for people who are looking for the
first time at the source to have easier time.

> ---
>  fs/btrfs/extent-tree.c | 26 --
>  fs/btrfs/volumes.c |  2 +-
>  fs/btrfs/volumes.h | 18 ++
>  3 files changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2f4328511ac8..e9c31b567a9c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7346,27 +7346,9 @@ wait_block_group_cache_done(struct 
> btrfs_block_group_cache *cache)
>   return ret;
>  }
>  
> -int __get_raid_index(u64 flags)
> -{
> - if (flags & BTRFS_BLOCK_GROUP_RAID10)
> - return BTRFS_RAID_RAID10;
> - else if (flags & BTRFS_BLOCK_GROUP_RAID1)
> - return BTRFS_RAID_RAID1;
> - else if (flags & BTRFS_BLOCK_GROUP_DUP)
> - return BTRFS_RAID_DUP;
> - else if (flags & BTRFS_BLOCK_GROUP_RAID0)
> - return BTRFS_RAID_RAID0;
> - else if (flags & BTRFS_BLOCK_GROUP_RAID5)
> - return BTRFS_RAID_RAID5;
> - else if (flags & BTRFS_BLOCK_GROUP_RAID6)
> - return BTRFS_RAID_RAID6;
> -
> - return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
> -}
> -
>  int get_block_group_index(struct btrfs_block_group_cache *cache)
>  {
> - return __get_raid_index(cache->flags);
> + return btrfs_bg_flags_to_raid_index(cache->flags);
>  }
>  
>  static const char *btrfs_raid_type_names[BTRFS_NR_RAID_TYPES] = {
> @@ -7483,7 +7465,7 @@ static noinline int find_free_extent(struct 
> btrfs_fs_info *fs_info,
>   u64 empty_cluster = 0;
>   struct btrfs_space_info *space_info;
>   int loop = 0;
> - int index = __get_raid_index(flags);
> + int index = btrfs_bg_flags_to_raid_index(flags);
>   bool failed_cluster_refill = false;
>   bool failed_alloc = false;
>   bool use_cluster = true;
> @@ -7579,7 +7561,7 @@ static noinline int find_free_extent(struct 
> btrfs_fs_info *fs_info,
>   }
>  search:
>   have_caching_bg = false;
> - if (index == 0 || index == __get_raid_index(flags))
> + if (index == 0 || index == btrfs_bg_flags_to_raid_index(flags))
>   full_search = true;
>   down_read(_info->groups_sem);
>   list_for_each_entry(block_group, _info->block_groups[index],
> @@ -9643,7 +9625,7 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, 
> u64 bytenr)
>*/
>   target = get_restripe_target(fs_info, block_group->flags);
>   if (target) {
> - index = __get_raid_index(extended_to_chunk(target));
> + index = btrfs_bg_flags_to_raid_index(extended_to_chunk(target));
>   } else {
>   /*
>* this is just a balance, so if we were marked as full
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a25684287501..d818b1f9c625 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4625,7 +4625,7 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   if (list_empty(_devices->alloc_list))
>   return -ENOSPC;
>  
> - index = __get_raid_index(type);
> + index = btrfs_bg_flags_to_raid_index(type);
>  
>   sub_stripes = btrfs_raid_array[index].sub_stripes;
>   dev_stripes = btrfs_raid_array[index].dev_stripes;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ff15208344a7..413e07e44b42 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -533,6 +533,24 @@ static inline void btrfs_dev_stat_reset(struct 
> btrfs_device *dev,
>   btrfs_dev_stat_set(dev, index, 0);
>  }
>  
> +static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
> +{
> + if (flags & BTRFS_BLOCK_GROUP_RAID10)
> + return BTRFS_RAID_RAID10;
> + else if 

Re: [PATCH 1/3] btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()

2018-01-30 Thread Anand Jain



On 01/30/2018 03:40 PM, Qu Wenruo wrote:

Function __get_raid_index() is used to convert block group flags into
raid index, which can be used to get various info directly from
btrfs_raid_array[].

Refactor this function a little:

1) Rename to btrfs_bg_flags_to_raid_index()
Double underline prefix is normally for internal functions, while the
function is used by both extent-tree and volumes.

Although the name is a little longer, but it should explain its usage
quite well.

2) Move it to volumes.h and make it static inline
Just several if-else branches, really no need to define it as a normal
function.

This also makes later code re-use between kernel and btrfs-progs
easier.

Signed-off-by: Qu Wenruo 


Reviewed-by: Anand Jain 

Thanks, Anand



---
  fs/btrfs/extent-tree.c | 26 --
  fs/btrfs/volumes.c |  2 +-
  fs/btrfs/volumes.h | 18 ++
  3 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2f4328511ac8..e9c31b567a9c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7346,27 +7346,9 @@ wait_block_group_cache_done(struct 
btrfs_block_group_cache *cache)
return ret;
  }
  
-int __get_raid_index(u64 flags)

-{
-   if (flags & BTRFS_BLOCK_GROUP_RAID10)
-   return BTRFS_RAID_RAID10;
-   else if (flags & BTRFS_BLOCK_GROUP_RAID1)
-   return BTRFS_RAID_RAID1;
-   else if (flags & BTRFS_BLOCK_GROUP_DUP)
-   return BTRFS_RAID_DUP;
-   else if (flags & BTRFS_BLOCK_GROUP_RAID0)
-   return BTRFS_RAID_RAID0;
-   else if (flags & BTRFS_BLOCK_GROUP_RAID5)
-   return BTRFS_RAID_RAID5;
-   else if (flags & BTRFS_BLOCK_GROUP_RAID6)
-   return BTRFS_RAID_RAID6;
-
-   return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
-}
-
  int get_block_group_index(struct btrfs_block_group_cache *cache)
  {
-   return __get_raid_index(cache->flags);
+   return btrfs_bg_flags_to_raid_index(cache->flags);
  }
  
  static const char *btrfs_raid_type_names[BTRFS_NR_RAID_TYPES] = {

@@ -7483,7 +7465,7 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
u64 empty_cluster = 0;
struct btrfs_space_info *space_info;
int loop = 0;
-   int index = __get_raid_index(flags);
+   int index = btrfs_bg_flags_to_raid_index(flags);
bool failed_cluster_refill = false;
bool failed_alloc = false;
bool use_cluster = true;
@@ -7579,7 +7561,7 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
}
  search:
have_caching_bg = false;
-   if (index == 0 || index == __get_raid_index(flags))
+   if (index == 0 || index == btrfs_bg_flags_to_raid_index(flags))
full_search = true;
down_read(_info->groups_sem);
list_for_each_entry(block_group, _info->block_groups[index],
@@ -9643,7 +9625,7 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 
bytenr)
 */
target = get_restripe_target(fs_info, block_group->flags);
if (target) {
-   index = __get_raid_index(extended_to_chunk(target));
+   index = btrfs_bg_flags_to_raid_index(extended_to_chunk(target));
} else {
/*
 * this is just a balance, so if we were marked as full
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a25684287501..d818b1f9c625 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4625,7 +4625,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
if (list_empty(_devices->alloc_list))
return -ENOSPC;
  
-	index = __get_raid_index(type);

+   index = btrfs_bg_flags_to_raid_index(type);
  
  	sub_stripes = btrfs_raid_array[index].sub_stripes;

dev_stripes = btrfs_raid_array[index].dev_stripes;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ff15208344a7..413e07e44b42 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -533,6 +533,24 @@ static inline void btrfs_dev_stat_reset(struct 
btrfs_device *dev,
btrfs_dev_stat_set(dev, index, 0);
  }
  
+static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)

+{
+   if (flags & BTRFS_BLOCK_GROUP_RAID10)
+   return BTRFS_RAID_RAID10;
+   else if (flags & BTRFS_BLOCK_GROUP_RAID1)
+   return BTRFS_RAID_RAID1;
+   else if (flags & BTRFS_BLOCK_GROUP_DUP)
+   return BTRFS_RAID_DUP;
+   else if (flags & BTRFS_BLOCK_GROUP_RAID0)
+   return BTRFS_RAID_RAID0;
+   else if (flags & BTRFS_BLOCK_GROUP_RAID5)
+   return BTRFS_RAID_RAID5;
+   else if (flags & BTRFS_BLOCK_GROUP_RAID6)
+   return BTRFS_RAID_RAID6;
+
+   return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
+}
+
  void 

Re: [PATCH 2/3] btrfs: Unexport get_block_group_index()

2018-01-30 Thread Anand Jain





diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e9c31b567a9c..6e1128aa29d6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7346,7 +7346,8 @@ wait_block_group_cache_done(struct 
btrfs_block_group_cache *cache)
return ret;
  }
  
-int get_block_group_index(struct btrfs_block_group_cache *cache)

+static enum btrfs_raid_types
+get_block_group_index(struct btrfs_block_group_cache *cache)
  {
return btrfs_bg_flags_to_raid_index(cache->flags);
  }


 This changed should rather be part of 1/3.

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


umount btrfs hang

2018-01-30 Thread Christophe Yayon
Hi all,

While trying umount a single btrfs volume i get stuck and get a call kernel 
trace.
My kernel is 4.14.13 (archlinux).

Unable to umount, tried to killed command, no way...

what should i do ? Thanks.

# btrfs fi df /mnt/vol3
Data, single: total=25.97GiB, used=6.84GiB
System, DUP: total=8.00MiB, used=4.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=111.44MiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

-
Jan 30 08:57:58 daneel.nbux.org kernel: INFO: task kworker/u16:5:4967 blocked 
for more than 120 seconds.
Jan 30 08:57:58 daneel.nbux.org kernel:   Tainted: G C  
4.14.13-1-ARCH #1
Jan 30 08:57:58 daneel.nbux.org kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 30 08:57:58 daneel.nbux.org kernel: kworker/u16:5   D0  4967  2 
0x8000
Jan 30 08:57:58 daneel.nbux.org kernel: Workqueue: btrfs-extent-refs 
btrfs_extent_refs_helper [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel: Call Trace:
Jan 30 08:57:58 daneel.nbux.org kernel:  ? __schedule+0x290/0x890
Jan 30 08:57:58 daneel.nbux.org kernel:  schedule+0x2f/0x90
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_tree_read_lock+0xb6/0x100 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  ? wait_woken+0x80/0x80
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_read_lock_root_node+0x2f/0x40 
[btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_search_slot+0x703/0x9f0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  ? kmem_cache_alloc+0x153/0x1a0
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_iget+0x10e/0x6f0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  __lookup_free_space_inode+0x106/0x160 
[btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  lookup_free_space_inode+0x63/0xe0 
[btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  load_free_space_cache+0x66/0x180 
[btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  ? kmem_cache_alloc_trace+0x161/0x1b0
Jan 30 08:57:58 daneel.nbux.org kernel:  cache_block_group+0x1c6/0x3e0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  ? wait_woken+0x80/0x80
Jan 30 08:57:58 daneel.nbux.org kernel:  find_free_extent+0x63e/0xfb0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_reserve_extent+0x9b/0x180 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_alloc_tree_block+0x123/0x4c0 
[btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  __btrfs_cow_block+0x121/0x5b0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_cow_block+0xca/0x1c0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_search_slot+0x321/0x9f0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_del_csums+0x107/0x3c0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  
__btrfs_free_extent.isra.61+0x49b/0xd60 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  __btrfs_run_delayed_refs+0x725/0x12f0 
[btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_run_delayed_refs+0x68/0x240 
[btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  delayed_ref_async_start+0x8d/0xa0 
[btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_worker_helper+0x70/0x330 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  process_one_work+0x1db/0x410
Jan 30 08:57:58 daneel.nbux.org kernel:  worker_thread+0x2b/0x3d0
Jan 30 08:57:58 daneel.nbux.org kernel:  ? process_one_work+0x410/0x410
Jan 30 08:57:58 daneel.nbux.org kernel:  kthread+0x118/0x130
Jan 30 08:57:58 daneel.nbux.org kernel:  ? kthread_create_on_node+0x70/0x70
Jan 30 08:57:58 daneel.nbux.org kernel:  ? SyS_exit+0x13/0x20
Jan 30 08:57:58 daneel.nbux.org kernel:  ret_from_fork+0x1f/0x30
Jan 30 08:57:58 daneel.nbux.org kernel: INFO: task kworker/u16:11:13568 blocked 
for more than 120 seconds.
Jan 30 08:57:58 daneel.nbux.org kernel:   Tainted: G C  
4.14.13-1-ARCH #1
Jan 30 08:57:58 daneel.nbux.org kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 30 08:57:58 daneel.nbux.org kernel: kworker/u16:11  D0 13568  2 
0x8000
Jan 30 08:57:58 daneel.nbux.org kernel: Workqueue: btrfs-freespace-write 
btrfs_freespace_write_helper [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel: Call Trace:
Jan 30 08:57:58 daneel.nbux.org kernel:  ? __schedule+0x290/0x890
Jan 30 08:57:58 daneel.nbux.org kernel:  schedule+0x2f/0x90
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_tree_lock+0xd1/0x1f0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  ? wait_woken+0x80/0x80
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_search_slot+0x6e7/0x9f0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_mark_extent_written+0xb1/0xbb0 
[btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  ? lock_extent_bits+0x54/0x1b0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  ? join_transaction+0x10c/0x3f0 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  ? start_transaction+0x9e/0x410 [btrfs]
Jan 30 08:57:58 daneel.nbux.org kernel:  btrfs_finish_ordered_io+0x591/0x810 
[btrfs]
Jan 30 08:57:58