Re: Cannot balance, ENOSPC errors 4.14.2 vanilla kernel

2017-12-20 Thread Adam Bahe
Alright, I have rebuilt kernel 4.14.8 and added the line of code you
gave me. The kernel is installed and I have a full balance running.
Right off the bat one thing I noticed is that the last time I ran a
full balance, balance status showed something like "14 out of about
200 chunks balanced". I thought that was interesting that it was only
trying to balance 200 chunks. With your change, a full balance status
right now shows "34 out of 9770 chunks balanced". It has been running
for about 10 minutes now. But sometimes it took awhile to cause the
filesystem to go read only. So we shall wait and see. A full balance
across 21 devices and 120TB raw or thereabouts will take some time.


On Wed, Dec 20, 2017 at 4:13 PM, Adam Bahe  wrote:
> Yeah I had a hunch that it was something to do with the 2TB disks.
> I've been slowly trying to replace them. But they're the remnants of
> my old storage system so it has been slow going. When I get some time
> I will try and compile the kernel with your patch. Thanks!
>
> On Wed, Dec 20, 2017 at 12:20 AM, Qu Wenruo  wrote:
>> Also, if you're OK to compile kernel, would you please try the following
>> diff to help us to further enhance btrfs chunk allocator?
>>
>> The chunk allocator itself is designed to handle your case, so it should
>> pick up the remaining devices and allocate new RAID10 chunk with the
>> unallocated space.
>>
>> But the truth is not the case.
>> I'm wondering if it's the devs_increment and substripes calculation
>> causing the problem.
>>
>> If you could help testing btrfs with the diff applied, dmesg would
>> contain extra info when you try to do device remove.
>>
>> Thanks,
>> Qu
>>
>> 
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 49810b70afd3..851ff13f5c29 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -4732,6 +4732,8 @@ static int __btrfs_alloc_chunk(struct
>> btrfs_trans_handle *trans,
>>
>> if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
>> ret = -ENOSPC;
>> +   pr_info("ndevs=%d dev_increment=%d sub_stripes=%d
>> devs_min=%d\n",
>> +   ndevs, devs_increment, sub_stripes, devs_min);
>> goto error;
>> }
>>
>>
>> On 2017年12月20日 14:11, Qu Wenruo wrote:
>>>
>>>
>>> On 2017年12月20日 13:00, Adam Bahe wrote:
 I'm using raid10.
>>>
>>> Pretty much the same.
>>>
>>> Raid10 is RAID1 first then RAID0.
>>>
>>> For your 20 disks (well, quite amazing) layout, btrfs will try to
>>> allocate using all disks for RAID10.
>>>
>>> Any unfortunately, devid 7, 9, 12, 13, 14, 15 are already full.
>>>
>>> Normally btrfs should exclude them in chunk allocation, but I think
>>> there is some small unaligned unallocated space making btrfs to choose
>>> them for allocation.
>>>
>>> And causing no new chunk could be allocated.
>>>
>>> And considering the size of your fs, common method like adding temporary
>>> small USB disk to allow convert doesn't work in your case.
>>>
>>>
>>> You could try convert the profile from RAID10 to RAID1, which will
>>> always use 2 disk to allocate space, so it would be OK to allocate new
>>> chunks to start your convert.
>>>
>>> And final suggestion, don't use any profile with stripe with so many
>>> uneven disks.
>>>
>>> Thanks,
>>> Qu
>>>

 On Tue, Dec 19, 2017 at 10:51 PM, Qu Wenruo  wrote:
>
>
> On 2017年12月20日 10:51, Adam Bahe wrote:
>> Forgot to add, I should have plenty of space:
>>
>> Label: 'nas'  uuid: 4fcd5725-b6c6-4d8a-9860-f2fc5474cbcb
>> Total devices 20 FS bytes used 26.69TiB
>> devid1 size 3.64TiB used 3.28TiB path /dev/sdm
>> devid2 size 3.64TiB used 3.28TiB path /dev/sde
>> devid3 size 7.28TiB used 3.45TiB path /dev/sdt
>> devid4 size 9.03TiB used 2.51TiB path /dev/sdo
>> devid5 size 7.28TiB used 3.45TiB path /dev/sdi
>> devid6 size 7.28TiB used 3.45TiB path /dev/sdd
>> devid7 size 1.82TiB used 1.82TiB path /dev/sdp
>> devid9 size 1.82TiB used 1.82TiB path /dev/sdw
>> devid   10 size 1.82TiB used 1.82TiB path /dev/sdk
>> devid   11 size 9.03TiB used 1.82TiB path /dev/sdy
>> devid   12 size 1.82TiB used 1.82TiB path /dev/sdg
>> devid   13 size 1.82TiB used 1.82TiB path /dev/sdl
>> devid   14 size 1.82TiB used 1.82TiB path /dev/sdr
>> devid   15 size 1.82TiB used 1.82TiB path /dev/sdf
>> devid   16 size 5.46TiB used 3.45TiB path /dev/sds
>> devid   17 size 9.10TiB used 3.45TiB path /dev/sdn
>> devid   18 size 9.10TiB used 3.45TiB path /dev/sdh
>> devid   19 size 9.10TiB used 3.45TiB path /dev/sdc
>> devid   20 size 9.10TiB used 3.45TiB path /dev/sdu
>> devid   21 size 3.64TiB used 2.19TiB path 

Re: [PATCH v2 06/17] btrfs-progs: lowmem check: introduce force_cow_in_new_chunk()

2017-12-20 Thread Qu Wenruo


On 2017年12月21日 15:09, Su Yue wrote:
> 
> 
> On 12/21/2017 02:51 PM, Qu Wenruo wrote:
>>
>>
>> On 2017年12月20日 16:37, Qu Wenruo wrote:
>>>
>>>
>>> On 2017年12月20日 16:21, Su Yue wrote:


 On 12/20/2017 01:41 PM, Qu Wenruo wrote:
>
>
> On 2017年12月20日 12:57, Su Yue wrote:
>> Introduce create_chunk_and_block_block_group() to allocate new chunk
>> and corresponding block group.
>>
>> The new function force_cow_in_new_chunk() first allocates new chunk
>> and records its start.
>> Then it modifies all metadata block groups cached and full.
>> Finally it marks the new block group uncached and unfree.
>> In the next CoW, extents states will be updated automatically by
>> cache_block_group().
>>
>> Suggested-by: Qu Wenruo 
>> Signed-off-by: Su Yue 
>> ---
>>    cmds-check.c | 80
>> 
>>    1 file changed, 80 insertions(+)
>>
>> diff --git a/cmds-check.c b/cmds-check.c
>> index d98d4bda7357..311c8a9f45e8 100644
>> --- a/cmds-check.c
>> +++ b/cmds-check.c
>> @@ -10911,6 +10911,86 @@ out:
>>    return ret;
>>    }
>>    +static int create_chunk_and_block_group(struct btrfs_fs_info
>> *fs_info,
>> +    u64 flags, u64 *start, u64 *nbytes)
>> +{
>> +    struct btrfs_trans_handle *trans;
>> +    struct btrfs_root *root = fs_info->extent_root;
>> +    int ret;
>> +
>> +    if ((flags & BTRFS_BLOCK_GROUP_TYPE_MASK) == 0)
>> +    return -EINVAL;
>> +
>> +    trans = btrfs_start_transaction(root, 1);
>> +    if (IS_ERR(trans)) {
>> +    ret = PTR_ERR(trans);
>> +    error("error starting transaction %s", strerror(-ret));
>> +    return ret;
>> +    }
>> +    ret = btrfs_alloc_chunk(trans, fs_info, start, nbytes, flags);
>> +    if (ret) {
>> +    error("fail to allocate new chunk %s", strerror(-ret));
>> +    goto out;
>> +    }
>> +    ret = btrfs_make_block_group(trans, fs_info, 0, flags,
>> + BTRFS_FIRST_CHUNK_TREE_OBJECTID, *start, *nbytes);
>> +    if (ret) {
>> +    error("fail to make block group for chunk %llu %llu %s",
>> +  *start, *nbytes, strerror(-ret));
>> +    goto out;
>> +    }
>> +out:
>> +    btrfs_commit_transaction(trans, root);
>> +    return ret;
>> +}
>> +
>> +static int force_cow_in_new_chunk(struct btrfs_fs_info *fs_info)
>> +{
>> +    struct btrfs_block_group_cache *bg;
>> +    u64 start;
>> +    u64 nbytes;
>> +    u64 alloc_profile;
>> +    u64 flags;
>> +    int ret;
>> +
>> +    alloc_profile = (fs_info->avail_metadata_alloc_bits &
>> + fs_info->metadata_alloc_profile);
>> +    flags = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
>> +    if (btrfs_fs_incompat(fs_info, MIXED_GROUPS))
>> +    flags |= BTRFS_BLOCK_GROUP_DATA;
>> +
>> +    ret = create_chunk_and_block_group(fs_info, flags, ,
>> );
>
> Why bother allocating a new chunk by yourself?
>
> Just mark all block groups full and that's all.
>
> Any later tree block allocation like btrfs_search_slot() with @cow = 1
> will trigger chunk allocation automatically.
>

 Tried to let it happen but BUG_ON with -ENOSPC.
>>>
>>> Then fix it.
>>>
>>> It's not a normal behavior in this case.
>>>
>>> Thanks,
>>> Qu
>>
>> And I think the fix to allow btrfs_reserve_extent() to allocate new
>> chunk will solve a lot of strange BUG_ON().
>>
>> it just occurred to me that, a lot of use cases relies on the assumption
>> that btrfs_reserve_extent() will try to allocate new chunks.
>>
>> Especially for case like convert, certain btrfs check --repair, some
>> rescue tools.
>>
> Sorry for the previous wrong format mail.
> 
> Yes, it has many dependency so I considered to do chunk allocation
> manually in the patchset. If fix is not good enough, many funtions
> of btrfs-progs will behave abnormal.
> Since you ask it, I will go to fix it.

Manually allocation in advance has its advantage, like we can determine
if there is enough space for new chunk instead of checking every return
value with ENOSPC.


However in current case, your metadata usage is limited to the new chunk
only.
If there extent tree has quite a lot of problem, and the chunk allocated
is small (if using single profile and small fs), it can easily hit
ENOSPC again, since btrfs doesn't allocate new chunk for later metadata
write.

So here, we still need to allow btrfs allocate new meta chunk, even we
pre-allocate one meta chunk in advance.

Thanks,
Qu
> 
> Thanks,
> Su
> 
>> So this would be a quite nice start point for such fix.
>>
>> Thanks,
>> Qu
>>
>>>
 Do you mean do_chunk_alloc() in btrfs_reserve_extent() which is called
 while doing CoW?> In 

Re: [PATCH v2 06/17] btrfs-progs: lowmem check: introduce force_cow_in_new_chunk()

2017-12-20 Thread Su Yue



On 12/21/2017 02:51 PM, Qu Wenruo wrote:



On 2017年12月20日 16:37, Qu Wenruo wrote:



On 2017年12月20日 16:21, Su Yue wrote:



On 12/20/2017 01:41 PM, Qu Wenruo wrote:



On 2017年12月20日 12:57, Su Yue wrote:

Introduce create_chunk_and_block_block_group() to allocate new chunk
and corresponding block group.

The new function force_cow_in_new_chunk() first allocates new chunk
and records its start.
Then it modifies all metadata block groups cached and full.
Finally it marks the new block group uncached and unfree.
In the next CoW, extents states will be updated automatically by
cache_block_group().

Suggested-by: Qu Wenruo 
Signed-off-by: Su Yue 
---
   cmds-check.c | 80

   1 file changed, 80 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index d98d4bda7357..311c8a9f45e8 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -10911,6 +10911,86 @@ out:
   return ret;
   }
   +static int create_chunk_and_block_group(struct btrfs_fs_info
*fs_info,
+    u64 flags, u64 *start, u64 *nbytes)
+{
+    struct btrfs_trans_handle *trans;
+    struct btrfs_root *root = fs_info->extent_root;
+    int ret;
+
+    if ((flags & BTRFS_BLOCK_GROUP_TYPE_MASK) == 0)
+    return -EINVAL;
+
+    trans = btrfs_start_transaction(root, 1);
+    if (IS_ERR(trans)) {
+    ret = PTR_ERR(trans);
+    error("error starting transaction %s", strerror(-ret));
+    return ret;
+    }
+    ret = btrfs_alloc_chunk(trans, fs_info, start, nbytes, flags);
+    if (ret) {
+    error("fail to allocate new chunk %s", strerror(-ret));
+    goto out;
+    }
+    ret = btrfs_make_block_group(trans, fs_info, 0, flags,
+ BTRFS_FIRST_CHUNK_TREE_OBJECTID, *start, *nbytes);
+    if (ret) {
+    error("fail to make block group for chunk %llu %llu %s",
+  *start, *nbytes, strerror(-ret));
+    goto out;
+    }
+out:
+    btrfs_commit_transaction(trans, root);
+    return ret;
+}
+
+static int force_cow_in_new_chunk(struct btrfs_fs_info *fs_info)
+{
+    struct btrfs_block_group_cache *bg;
+    u64 start;
+    u64 nbytes;
+    u64 alloc_profile;
+    u64 flags;
+    int ret;
+
+    alloc_profile = (fs_info->avail_metadata_alloc_bits &
+ fs_info->metadata_alloc_profile);
+    flags = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
+    if (btrfs_fs_incompat(fs_info, MIXED_GROUPS))
+    flags |= BTRFS_BLOCK_GROUP_DATA;
+
+    ret = create_chunk_and_block_group(fs_info, flags, ,
);


Why bother allocating a new chunk by yourself?

Just mark all block groups full and that's all.

Any later tree block allocation like btrfs_search_slot() with @cow = 1
will trigger chunk allocation automatically.



Tried to let it happen but BUG_ON with -ENOSPC.


Then fix it.

It's not a normal behavior in this case.

Thanks,
Qu


And I think the fix to allow btrfs_reserve_extent() to allocate new
chunk will solve a lot of strange BUG_ON().

it just occurred to me that, a lot of use cases relies on the assumption
that btrfs_reserve_extent() will try to allocate new chunks.

Especially for case like convert, certain btrfs check --repair, some
rescue tools.


Sorry for the previous wrong format mail.

Yes, it has many dependency so I considered to do chunk allocation
manually in the patchset. If fix is not good enough, many funtions
of btrfs-progs will behave abnormal.
Since you ask it, I will go to fix it.

Thanks,
Su


So this would be a quite nice start point for such fix.

Thanks,
Qu




Do you mean do_chunk_alloc() in btrfs_reserve_extent() which is called
while doing CoW?> In progs, allocation of new chunk during CoW depends 
@root->ref_cows.
However, @root->ref_cows should be set only if @root is root of a fs
trees.

Thanks,
Su


Thanks,
Qu


+    if (!ret)
+    printf("Create new chunk %llu %llu\n", start, nbytes);
+    else
+    goto err;
+
+    flags = BTRFS_BLOCK_GROUP_METADATA;
+    /* Mark all metadata block groups cached and full in free space*/
+    ret = modify_block_groups_cache(fs_info, flags, 1);
+    if (ret)
+    goto clear_bg_cache;
+
+    bg = btrfs_lookup_block_group(fs_info, start);
+    if (!bg) {
+    ret = -ENOENT;
+    error("fail to look up block group %llu %llu", start, nbytes);
+    goto clear_bg_cache;
+    }
+
+    /* Clear block group cache just allocated */
+    ret = modify_block_group_cache(fs_info, bg, 0);
+    if (ret)
+    goto clear_bg_cache;
+
+    return 0;
+
+clear_bg_cache:
+    modify_block_groups_cache(fs_info, flags, 0);
+err:
+    return ret;
+}
+
   static int check_extent_refs(struct btrfs_root *root,
    struct cache_tree *extent_cache)
   {






--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html







--
To unsubscribe from this list: send the line 

Re: [PATCH v2 06/17] btrfs-progs: lowmem check: introduce force_cow_in_new_chunk()

2017-12-20 Thread Su Yue



On 12/21/2017 02:51 PM, Qu Wenruo wrote:



On 2017年12月20日 16:37, Qu Wenruo wrote:



On 2017年12月20日 16:21, Su Yue wrote:



On 12/20/2017 01:41 PM, Qu Wenruo wrote:



On 2017年12月20日 12:57, Su Yue wrote:

Introduce create_chunk_and_block_block_group() to allocate new chunk
and corresponding block group.

The new function force_cow_in_new_chunk() first allocates new chunk
and records its start.
Then it modifies all metadata block groups cached and full.
Finally it marks the new block group uncached and unfree.
In the next CoW, extents states will be updated automatically by
cache_block_group().

Suggested-by: Qu Wenruo 
Signed-off-by: Su Yue 
---
   cmds-check.c | 80

   1 file changed, 80 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index d98d4bda7357..311c8a9f45e8 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -10911,6 +10911,86 @@ out:
   return ret;
   }
   +static int create_chunk_and_block_group(struct btrfs_fs_info
*fs_info,
+    u64 flags, u64 *start, u64 *nbytes)
+{
+    struct btrfs_trans_handle *trans;
+    struct btrfs_root *root = fs_info->extent_root;
+    int ret;
+
+    if ((flags & BTRFS_BLOCK_GROUP_TYPE_MASK) == 0)
+    return -EINVAL;
+
+    trans = btrfs_start_transaction(root, 1);
+    if (IS_ERR(trans)) {
+    ret = PTR_ERR(trans);
+    error("error starting transaction %s", strerror(-ret));
+    return ret;
+    }
+    ret = btrfs_alloc_chunk(trans, fs_info, start, nbytes, flags);
+    if (ret) {
+    error("fail to allocate new chunk %s", strerror(-ret));
+    goto out;
+    }
+    ret = btrfs_make_block_group(trans, fs_info, 0, flags,
+ BTRFS_FIRST_CHUNK_TREE_OBJECTID, *start, *nbytes);
+    if (ret) {
+    error("fail to make block group for chunk %llu %llu %s",
+  *start, *nbytes, strerror(-ret));
+    goto out;
+    }
+out:
+    btrfs_commit_transaction(trans, root);
+    return ret;
+}
+
+static int force_cow_in_new_chunk(struct btrfs_fs_info *fs_info)
+{
+    struct btrfs_block_group_cache *bg;
+    u64 start;
+    u64 nbytes;
+    u64 alloc_profile;
+    u64 flags;
+    int ret;
+
+    alloc_profile = (fs_info->avail_metadata_alloc_bits &
+ fs_info->metadata_alloc_profile);
+    flags = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
+    if (btrfs_fs_incompat(fs_info, MIXED_GROUPS))
+    flags |= BTRFS_BLOCK_GROUP_DATA;
+
+    ret = create_chunk_and_block_group(fs_info, flags, ,
);


Why bother allocating a new chunk by yourself?

Just mark all block groups full and that's all.

Any later tree block allocation like btrfs_search_slot() with @cow = 1
will trigger chunk allocation automatically.



Tried to let it happen but BUG_ON with -ENOSPC.


Then fix it.

It's not a normal behavior in this case.

Thanks,
Qu


And I think the fix to allow btrfs_reserve_extent() to allocate new
chunk will solve a lot of strange BUG_ON().

it just occurred to me that, a lot of use cases relies on the assumption
that btrfs_reserve_extent() will try to allocate new chunks.
Yes, it has many dependency so I considered to do chunk allocation

manually in the patchset. If fix is not good enough, many funtions
of btrfs-progs will behave abnormal.
Since you ask it, I will go to fix it.

Thanks,
Su

Especially for case like convert, certain btrfs check --repair, some
rescue tools.

So this would be a quite nice start point for such fix.

Thanks,
Qu




Do you mean do_chunk_alloc() in btrfs_reserve_extent() which is called
while doing CoW?> In progs, allocation of new chunk during CoW depends 
@root->ref_cows.
However, @root->ref_cows should be set only if @root is root of a fs
trees.

Thanks,
Su


Thanks,
Qu


+    if (!ret)
+    printf("Create new chunk %llu %llu\n", start, nbytes);
+    else
+    goto err;
+
+    flags = BTRFS_BLOCK_GROUP_METADATA;
+    /* Mark all metadata block groups cached and full in free space*/
+    ret = modify_block_groups_cache(fs_info, flags, 1);
+    if (ret)
+    goto clear_bg_cache;
+
+    bg = btrfs_lookup_block_group(fs_info, start);
+    if (!bg) {
+    ret = -ENOENT;
+    error("fail to look up block group %llu %llu", start, nbytes);
+    goto clear_bg_cache;
+    }
+
+    /* Clear block group cache just allocated */
+    ret = modify_block_group_cache(fs_info, bg, 0);
+    if (ret)
+    goto clear_bg_cache;
+
+    return 0;
+
+clear_bg_cache:
+    modify_block_groups_cache(fs_info, flags, 0);
+err:
+    return ret;
+}
+
   static int check_extent_refs(struct btrfs_root *root,
    struct cache_tree *extent_cache)
   {






--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html







--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a 

Re: [PATCH v2 06/17] btrfs-progs: lowmem check: introduce force_cow_in_new_chunk()

2017-12-20 Thread Qu Wenruo


On 2017年12月20日 16:37, Qu Wenruo wrote:
> 
> 
> On 2017年12月20日 16:21, Su Yue wrote:
>>
>>
>> On 12/20/2017 01:41 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2017年12月20日 12:57, Su Yue wrote:
 Introduce create_chunk_and_block_block_group() to allocate new chunk
 and corresponding block group.

 The new function force_cow_in_new_chunk() first allocates new chunk
 and records its start.
 Then it modifies all metadata block groups cached and full.
 Finally it marks the new block group uncached and unfree.
 In the next CoW, extents states will be updated automatically by
 cache_block_group().

 Suggested-by: Qu Wenruo 
 Signed-off-by: Su Yue 
 ---
   cmds-check.c | 80
 
   1 file changed, 80 insertions(+)

 diff --git a/cmds-check.c b/cmds-check.c
 index d98d4bda7357..311c8a9f45e8 100644
 --- a/cmds-check.c
 +++ b/cmds-check.c
 @@ -10911,6 +10911,86 @@ out:
   return ret;
   }
   +static int create_chunk_and_block_group(struct btrfs_fs_info
 *fs_info,
 +    u64 flags, u64 *start, u64 *nbytes)
 +{
 +    struct btrfs_trans_handle *trans;
 +    struct btrfs_root *root = fs_info->extent_root;
 +    int ret;
 +
 +    if ((flags & BTRFS_BLOCK_GROUP_TYPE_MASK) == 0)
 +    return -EINVAL;
 +
 +    trans = btrfs_start_transaction(root, 1);
 +    if (IS_ERR(trans)) {
 +    ret = PTR_ERR(trans);
 +    error("error starting transaction %s", strerror(-ret));
 +    return ret;
 +    }
 +    ret = btrfs_alloc_chunk(trans, fs_info, start, nbytes, flags);
 +    if (ret) {
 +    error("fail to allocate new chunk %s", strerror(-ret));
 +    goto out;
 +    }
 +    ret = btrfs_make_block_group(trans, fs_info, 0, flags,
 + BTRFS_FIRST_CHUNK_TREE_OBJECTID, *start, *nbytes);
 +    if (ret) {
 +    error("fail to make block group for chunk %llu %llu %s",
 +  *start, *nbytes, strerror(-ret));
 +    goto out;
 +    }
 +out:
 +    btrfs_commit_transaction(trans, root);
 +    return ret;
 +}
 +
 +static int force_cow_in_new_chunk(struct btrfs_fs_info *fs_info)
 +{
 +    struct btrfs_block_group_cache *bg;
 +    u64 start;
 +    u64 nbytes;
 +    u64 alloc_profile;
 +    u64 flags;
 +    int ret;
 +
 +    alloc_profile = (fs_info->avail_metadata_alloc_bits &
 + fs_info->metadata_alloc_profile);
 +    flags = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
 +    if (btrfs_fs_incompat(fs_info, MIXED_GROUPS))
 +    flags |= BTRFS_BLOCK_GROUP_DATA;
 +
 +    ret = create_chunk_and_block_group(fs_info, flags, ,
 );
>>>
>>> Why bother allocating a new chunk by yourself?
>>>
>>> Just mark all block groups full and that's all.
>>>
>>> Any later tree block allocation like btrfs_search_slot() with @cow = 1
>>> will trigger chunk allocation automatically.
>>>
>>
>> Tried to let it happen but BUG_ON with -ENOSPC.
> 
> Then fix it.
> 
> It's not a normal behavior in this case.
> 
> Thanks,
> Qu

And I think the fix to allow btrfs_reserve_extent() to allocate new
chunk will solve a lot of strange BUG_ON().

it just occurred to me that, a lot of use cases relies on the assumption
that btrfs_reserve_extent() will try to allocate new chunks.

Especially for case like convert, certain btrfs check --repair, some
rescue tools.

So this would be a quite nice start point for such fix.

Thanks,
Qu

> 
>> Do you mean do_chunk_alloc() in btrfs_reserve_extent() which is called
>> while doing CoW?> In progs, allocation of new chunk during CoW depends 
>> @root->ref_cows.
>> However, @root->ref_cows should be set only if @root is root of a fs
>> trees.
>>
>> Thanks,
>> Su
>>
>>> Thanks,
>>> Qu
>>>
 +    if (!ret)
 +    printf("Create new chunk %llu %llu\n", start, nbytes);
 +    else
 +    goto err;
 +
 +    flags = BTRFS_BLOCK_GROUP_METADATA;
 +    /* Mark all metadata block groups cached and full in free space*/
 +    ret = modify_block_groups_cache(fs_info, flags, 1);
 +    if (ret)
 +    goto clear_bg_cache;
 +
 +    bg = btrfs_lookup_block_group(fs_info, start);
 +    if (!bg) {
 +    ret = -ENOENT;
 +    error("fail to look up block group %llu %llu", start, nbytes);
 +    goto clear_bg_cache;
 +    }
 +
 +    /* Clear block group cache just allocated */
 +    ret = modify_block_group_cache(fs_info, bg, 0);
 +    if (ret)
 +    goto clear_bg_cache;
 +
 +    return 0;
 +
 +clear_bg_cache:
 +    modify_block_groups_cache(fs_info, flags, 0);
 +err:
 +    return ret;
 +}
 +
   static int check_extent_refs(struct btrfs_root *root,
 

Re: kernel hangs during balance

2017-12-20 Thread Duncan
Holger Hoffstätte posted on Wed, 20 Dec 2017 20:58:14 +0100 as excerpted:

> On 12/20/17 20:02, Chris Murphy wrote:
>> I don't know if it's the sending MUA or the list server, but the line
>> wrapping makes this much harder to follow. I suggest putting it in a
>> text file and attaching the text file. It's definitely not on the
>> receiving side, I see it here also:
>> https://www.spinics.net/lists/linux-btrfs/msg72872.html
> 
> You can see enough to suggest that blk-mq is hanging, which is
> "unsurprising" (being kind here) with such an old kernel. Rich, build
> your kernel with CONFIG_SCSI_MQ_DEFAULT=n or boot with
> scsi_mod.use_blk_mq=n as kernel prarameter.

Well, the kernel is 4.4 "el repo".  4.4 /is/ an LTS (and elrepo is AFAIK 
a red hat backports repo, not sure how official, but useful for people on 
red hat), but it's now the second-back LTS, with 4.9 and the new 4.14 
being newer LTS series.

The thing is that this is the btrfs list, and we're development and thus 
rather forward focused here.  As such, we normally want at /least/ the 
second newest (first back) lts series kernel for best chance at 
reasonable support.  While I understand people who want to stick with LTS 
being reluctant to go with the /newest/ LTS before even a single current 
release has passed, making that LTS still a bit new as such things go, 
certainly the one-back LTS, 4.9, should be reasonable.

So yes, tho 4.4 is at least an LTS, for purposes of btrfs and this list, 
it really is rather old now, and an upgrade, presumably to 4.9 in keeping 
with the LTS theme, would be recommended.  If the issue can be confirmed 
on the current and LTS 4.14, so much the better, but certainly, 4.9 is a 
recommended upgrade, and a bug there would still be in the concern for a 
fix range, while 4.4... really is just out of the focus range for this 
list, tho various longer focus distros will of course still provide 
support for it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq

2017-12-20 Thread jianchao.wang
Hi tejun

On 12/16/2017 08:07 PM, Tejun Heo wrote:
> After the recent updates to use generation number and state based
> synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
> to avoid firing the same timeout multiple times.
> 
> Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
> RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
> times.  This removes atomic bitops from hot paths too.
> 
> v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
> 
> v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
> 
> Signed-off-by: Tejun Heo 
> Cc: "jianchao.wang" 
> ---
>  block/blk-mq.c | 18 --
>  block/blk-timeout.c|  1 +
>  include/linux/blkdev.h |  2 ++
>  3 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 88baa82..47e722b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -607,14 +607,12 @@ void blk_mq_complete_request(struct request *rq)
>*/
>   if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
>   rcu_read_lock();
> - if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
> - !blk_mark_rq_complete(rq))
> + if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   __blk_mq_complete_request(rq);
>   rcu_read_unlock();
>   } else {
>   srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> - if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
> - !blk_mark_rq_complete(rq))
> + if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   __blk_mq_complete_request(rq);
>   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
>   }
It's worrying that even though the blk_mark_rq_complete() here is intended to 
synchronize with
timeout path, but it indeed give the blk_mq_complete_request() the capability 
to exclude with 
itself. Maybe this capability should be reserved.

Thanks
Jianchao
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexpected raid1 behaviour

2017-12-20 Thread Chris Murphy
On Wed, Dec 20, 2017 at 1:14 PM, Austin S. Hemmelgarn
 wrote:
> On 2017-12-20 15:07, Chris Murphy wrote:

>> There is an irony here:
>>
>> YaST doesn't have Btrfs raid1 or raid10 options; and also won't do
>> encrypted root with Btrfs either because YaST enforces LVM to do LUKS
>> encryption for some weird reason; and it also enforces NOT putting
>> Btrfs on LVM.
>
> The 'LUKS must use LVM' thing is likely historical.  The BCP for using LUKS
> is that it's at the bottom level (so you leak absolutely nothing about how
> your storage stack is structured), and if that's the case you need something
> on top to support separate filesystems, which up until BTRFS came around has
> solely been LVM.

*shrug* Anaconda has supported plain partition LUKS without
device-mapper for ext3/4 and XFS since forever, even before the
rewrite.


>> Meanwhile, Fedora/Red Hat's Anaconda installer has supported both of
>> these use cases for something like 5 years (does support Btrfs raid1
>> and raid10 layouts; and also supports Btrfs directly on dmcrypt
>> without LVM) - with the caveat that it enforces /boot to be on ext4.
>
> And this caveat is because for some reason Fedora has chosen not to
> integrate BTRFS support into their version of GRUB.

No. The Fedora patchset for upstream GRUB doesn't remove Btrfs
support. However, they don't use grub-mkconfig to rewrite the grub.cfg
when a new kernel is installed. Instead, they use an unrelated project
called grubby, which modifies the existing grub.cfg (and also supports
most all other configs like syslinux/extlinux, yaboot, uboot, lilo,
and others). And grubby gets confused [1] if the grub.cfg is on a
subvolume (other than ID 5). If the grub.cfg is in the ID 5 subvolume,
in a normal directory structure, it works fine.

Chris Murphy



[1] Gory details

The central part of the confusion appears to be this sequence of
comments in this insanely long bug:
https://bugzilla.redhat.com/show_bug.cgi?id=864198#c3
https://bugzilla.redhat.com/show_bug.cgi?id=864198#c5
https://bugzilla.redhat.com/show_bug.cgi?id=864198#c6
https://bugzilla.redhat.com/show_bug.cgi?id=864198#c7

The comments from Gene Czarcinski (now deceased, that's how old this
bug is) try to negotiate understanding the problem and he had a fix
but it didn't meet some upstream grubby requirement, and so the patch
wasn't accepted. Grubby is sufficiently messy that near as I can tell
no other distribution uses it, and no one really cares to maintain it
until something in RHEL breaks and then *that* gets attention.

Upstream bug
https://github.com/rhboot/grubby/issues/22
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cannot balance, ENOSPC errors 4.14.2 vanilla kernel

2017-12-20 Thread Adam Bahe
Yeah I had a hunch that it was something to do with the 2TB disks.
I've been slowly trying to replace them. But they're the remnants of
my old storage system so it has been slow going. When I get some time
I will try and compile the kernel with your patch. Thanks!

On Wed, Dec 20, 2017 at 12:20 AM, Qu Wenruo  wrote:
> Also, if you're OK to compile kernel, would you please try the following
> diff to help us to further enhance btrfs chunk allocator?
>
> The chunk allocator itself is designed to handle your case, so it should
> pick up the remaining devices and allocate new RAID10 chunk with the
> unallocated space.
>
> But the truth is not the case.
> I'm wondering if it's the devs_increment and substripes calculation
> causing the problem.
>
> If you could help testing btrfs with the diff applied, dmesg would
> contain extra info when you try to do device remove.
>
> Thanks,
> Qu
>
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 49810b70afd3..851ff13f5c29 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4732,6 +4732,8 @@ static int __btrfs_alloc_chunk(struct
> btrfs_trans_handle *trans,
>
> if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
> ret = -ENOSPC;
> +   pr_info("ndevs=%d dev_increment=%d sub_stripes=%d
> devs_min=%d\n",
> +   ndevs, devs_increment, sub_stripes, devs_min);
> goto error;
> }
>
>
> On 2017年12月20日 14:11, Qu Wenruo wrote:
>>
>>
>> On 2017年12月20日 13:00, Adam Bahe wrote:
>>> I'm using raid10.
>>
>> Pretty much the same.
>>
>> Raid10 is RAID1 first then RAID0.
>>
>> For your 20 disks (well, quite amazing) layout, btrfs will try to
>> allocate using all disks for RAID10.
>>
>> Any unfortunately, devid 7, 9, 12, 13, 14, 15 are already full.
>>
>> Normally btrfs should exclude them in chunk allocation, but I think
>> there is some small unaligned unallocated space making btrfs to choose
>> them for allocation.
>>
>> And causing no new chunk could be allocated.
>>
>> And considering the size of your fs, common method like adding temporary
>> small USB disk to allow convert doesn't work in your case.
>>
>>
>> You could try convert the profile from RAID10 to RAID1, which will
>> always use 2 disk to allocate space, so it would be OK to allocate new
>> chunks to start your convert.
>>
>> And final suggestion, don't use any profile with stripe with so many
>> uneven disks.
>>
>> Thanks,
>> Qu
>>
>>>
>>> On Tue, Dec 19, 2017 at 10:51 PM, Qu Wenruo  wrote:


 On 2017年12月20日 10:51, Adam Bahe wrote:
> Forgot to add, I should have plenty of space:
>
> Label: 'nas'  uuid: 4fcd5725-b6c6-4d8a-9860-f2fc5474cbcb
> Total devices 20 FS bytes used 26.69TiB
> devid1 size 3.64TiB used 3.28TiB path /dev/sdm
> devid2 size 3.64TiB used 3.28TiB path /dev/sde
> devid3 size 7.28TiB used 3.45TiB path /dev/sdt
> devid4 size 9.03TiB used 2.51TiB path /dev/sdo
> devid5 size 7.28TiB used 3.45TiB path /dev/sdi
> devid6 size 7.28TiB used 3.45TiB path /dev/sdd
> devid7 size 1.82TiB used 1.82TiB path /dev/sdp
> devid9 size 1.82TiB used 1.82TiB path /dev/sdw
> devid   10 size 1.82TiB used 1.82TiB path /dev/sdk
> devid   11 size 9.03TiB used 1.82TiB path /dev/sdy
> devid   12 size 1.82TiB used 1.82TiB path /dev/sdg
> devid   13 size 1.82TiB used 1.82TiB path /dev/sdl
> devid   14 size 1.82TiB used 1.82TiB path /dev/sdr
> devid   15 size 1.82TiB used 1.82TiB path /dev/sdf
> devid   16 size 5.46TiB used 3.45TiB path /dev/sds
> devid   17 size 9.10TiB used 3.45TiB path /dev/sdn
> devid   18 size 9.10TiB used 3.45TiB path /dev/sdh
> devid   19 size 9.10TiB used 3.45TiB path /dev/sdc
> devid   20 size 9.10TiB used 3.45TiB path /dev/sdu
> devid   21 size 3.64TiB used 2.19TiB path /dev/sdj

 Depends on your profile.

 If using RAID0/5/6, the extra unallocated space means nothing if the
 smallest disk is used up.

 Thanks,
 Qu

>
> On Tue, Dec 19, 2017 at 8:47 PM, Adam Bahe  wrote:
>> I have been having ENOSPC errors on any btrfs device delete, btrfs
>> balance, btrfs device add actions for awhile now. How do I fix this? I
>> need to be able to remove devices and balance my filesystem again.
>>
>> [Tue Dec 19 15:25:26 2017] BTRFS info (device sdc): relocating block
>> group 190774812082176 flags system|raid10
>> [Tue Dec 19 15:25:26 2017] BTRFS: Transaction aborted (error -27)
>> [Tue Dec 19 15:25:26 2017] [ cut here ]
>> [Tue Dec 19 15:25:26 2017] WARNING: CPU: 1 PID: 17036 at
>> fs/btrfs/extent-tree.c:10151
>> 

Re: kernel hangs during balance

2017-12-20 Thread Rich Rauenzahn
I switched to the LT kernel because of this issue.  I was running
mainline and thought that LT would get me stability.  I can switch
back to LT while we RCA.

At the risk of changing two things, I could add that
(scsi_mod.use_blk_mq=n) to my boot and also switch back to ML.  I do
notice that disk IO is high at the time of the hang -- I've assumed it
is btrfs balance, but if I read the logs right, it may have finished
before the hang.  Adding more logging.

I've also put the kernel logs into a pastebin:  https://pastebin.com/80qRXA9c

...wish I could get the watchdog to force a dump/trace.  Will look at that, too.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Btrfs allow compression on NoDataCow files? (AFAIK Not, but it does)

2017-12-20 Thread Timofey Titovets
How reproduce:
touch test_file
chattr +C test_file
dd if=/dev/zero of=test_file bs=1M count=1
btrfs fi def -vrczlib test_file
filefrag -v test_file

test_file
Filesystem type is: 9123683e
File size of test_file is 1048576 (256 blocks of 4096 bytes)
ext: logical_offset:physical_offset: length:   expected: flags:
  0:0..  31:   72917050..  72917081: 32: encoded
  1:   32..  63:   72917118..  72917149: 32:   72917082: encoded
  2:   64..  95:   72919494..  72919525: 32:   72917150: encoded
  3:   96.. 127:   72927576..  72927607: 32:   72919526: encoded
  4:  128.. 159:   72943261..  72943292: 32:   72927608: encoded
  5:  160.. 191:   72944929..  72944960: 32:   72943293: encoded
  6:  192.. 223:   72944952..  72944983: 32:   72944961: encoded
  7:  224.. 255:   72967084..  72967115: 32:   72944984:
last,encoded,eof
test_file: 8 extents found

I can't found at now, where that error happen in code,
but it's reproducible on Linux 4.14.8

Thanks.

-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARN_ON in __writeback_inodes_sb_nr

2017-12-20 Thread Nikolay Borisov


On 20.12.2017 21:16, David Ahern wrote:
> I am still seeing this problem on 4.15.0-rc3

This is a known problem and the offending commit is this one
ce8ea7cc6eb3 ("btrfs: don't call btrfs_start_delalloc_roots in
flushoncommit") and this manifests only if you have mounted with
flushoncommit mount option. A fix is unlikely to land for 4.15
> 
> 
> On 11/17/17 12:55 PM, David Ahern wrote:
>> I see a backtrace booting 4.14+ on a mellanox switch. The trace is due
>> to the WARN_ON in __writeback_inodes_sb_nr.
>>
>> [   40.958590] WARNING: CPU: 0 PID: 183 at
>> /home/dsa/kernel-2.git/fs/fs-writeback.c:2339
>> __writeback_inodes_sb_nr+0x8a/0x90
>> [   40.958593] Modules linked in: ebtable_filter ebtables ip6table_raw
>> ip6table_mangle ip6table_filter ip6_tables iTCO_wdt iTCO_vendor_support
>> coretemp x86_pkg_temp_thermal kvm_intel kvm irqbypass lpc_ich mfd_core
>> battery mei_me mei shpchp lm75 regmap_i2c pmbus pmbus_core i2c_dev
>> i2c_mux k10temp mpls_iptunnel mpls_router ip_tunnel at24 tun
>> br_netfilter bonding loop autofs4 dm_mod dax crc32c_intel i2c_i801
>> i2c_core thermal mlxsw_spectrum psample parman bridge stp llc mlxsw_pci
>> mlxsw_core mlxfw devlink e1000e hwmon
>> [   40.958664] CPU: 0 PID: 183 Comm: btrfs-transacti Not tainted 4.14.0+ #38
>> [   40.958666] Hardware name: Mellanox Technologies Ltd. Mellanox
>> switch/Mellanox switch, BIOS 4.6.5 05/21/2015
>> [   40.958669] task: 8803db266540 task.stack: c9344000
>> [   40.958675] RIP: 0010:__writeback_inodes_sb_nr+0x8a/0x90
>> [   40.958677] RSP: 0018:c9347dc8 EFLAGS: 00010246
>> [   40.958681] RAX:  RBX: 88040a641800 RCX:
>> 
>> [   40.958683] RDX: 0002 RSI: 45a7 RDI:
>> c9347e10
>> [   40.958685] RBP: c9347e20 R08: 0003 R09:
>> c9347dd0
>> [   40.958687] R10: 8803db287000 R11:  R12:
>> c9347dcc
>> [   40.958689] R13: 8804015e7a00 R14: 8803dade2bc8 R15:
>> 880400a68000
>> [   40.958692] FS:  () GS:88041dc0()
>> knlGS:
>> [   40.958695] CS:  0010 DS:  ES:  CR0: 80050033
>> [   40.958697] CR2: 5602d78aad50 CR3: 01e09002 CR4:
>> 001606f0
>> [   40.958699] Call Trace:
>> [   40.958710]  writeback_inodes_sb+0x27/0x30
>> [   40.958719]  btrfs_commit_transaction+0x7b2/0x8e0
>> [   40.958723]  ? start_transaction+0x9e/0x450
>> [   40.958728]  transaction_kthread+0x177/0x1b0
>> [   40.958735]  kthread+0x11d/0x150
>> [   40.958740]  ? btrfs_cleanup_transaction+0x4f0/0x4f0
>> [   40.958744]  ? kthread_associate_blkcg+0xb0/0xb0
>> [   40.958751]  ret_from_fork+0x24/0x30
>> [   40.958754] Code: 8b 42 70 48 85 c0 74 23 4c 89 ce 48 89 df 41 0f b6
>> d3 e8 fa fc ff ff 4c 89 e6 48 89 df e8 8f de ff ff 48 83 c4 48 5b 41 5c
>> 5d c3 <0f> ff eb d9 66 90 0f 1f 44 00 00 55 31 c9 48 89 e5 e8 60 ff ff
>> [   40.958826] ---[ end trace defabeb7afdfd414 ]---
>>
>>
>> Tree is DaveM's net-next, but it was recently merged with Linus' tree at:
>>
>> commit 6363b3f3ac5be096d08c8c504128befa0c033529
>> Merge: 1b6115fbe3b3 6297fabd93f9
>> Author: Linus Torvalds 
>> Date:   Wed Nov 15 15:12:28 2017 -0800
>>
>> Merge tag 'ipmi-for-4.15' of git://github.com/cminyard/linux-ipmi
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexpected raid1 behaviour

2017-12-20 Thread Austin S. Hemmelgarn

On 2017-12-20 15:07, Chris Murphy wrote:

On Wed, Dec 20, 2017 at 1:02 PM, Chris Murphy  wrote:

On Wed, Dec 20, 2017 at 9:53 AM, Andrei Borzenkov  wrote:

19.12.2017 22:47, Chris Murphy пишет:




BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
this distro to research every component used?


As far as I'm aware, only Btrfs single device stuff is "supported".
The multiple device stuff is definitely not supported on openSUSE, but
I have no idea to what degree they support it with enterprise license,
no doubt that support must come with caveats.



I was rather surprised seeing RAID1 and RAID10 listed as supported in
SLES 12.x release notes, especially as there is no support for
multi-device btrfs in YaST and hence no way to even install on such
filesystem.


Haha. OK well I'm at a loss then. And they use systemd which is going
to run into the udev rule that prevents systemd from even attempting
to mount rootfs if one or more devices are missing. So I don't know
how it really gets supported. At the dracut prompt, manually mount
using -o degraded to /sysroot and then exit? I guess?



There is an irony here:

YaST doesn't have Btrfs raid1 or raid10 options; and also won't do
encrypted root with Btrfs either because YaST enforces LVM to do LUKS
encryption for some weird reason; and it also enforces NOT putting
Btrfs on LVM.
The 'LUKS must use LVM' thing is likely historical.  The BCP for using 
LUKS is that it's at the bottom level (so you leak absolutely nothing 
about how your storage stack is structured), and if that's the case you 
need something on top to support separate filesystems, which up until 
BTRFS came around has solely been LVM.


The 'No BTRFS on LVM' thing is likely for sanity reasons.  Using BTRFS 
on SuSE means allocating /boot and swap, and the entire rest of the disk 
is BTRFS.  They only support a single PV or a single BTRFS volume at the 
bottom level per-disk for /.


Meanwhile, Fedora/Red Hat's Anaconda installer has supported both of
these use cases for something like 5 years (does support Btrfs raid1
and raid10 layouts; and also supports Btrfs directly on dmcrypt
without LVM) - with the caveat that it enforces /boot to be on ext4.
And this caveat is because for some reason Fedora has chosen not to 
integrate BTRFS support into their version of GRUB.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexpected raid1 behaviour

2017-12-20 Thread Chris Murphy
On Wed, Dec 20, 2017 at 1:02 PM, Chris Murphy  wrote:
> On Wed, Dec 20, 2017 at 9:53 AM, Andrei Borzenkov  wrote:
>> 19.12.2017 22:47, Chris Murphy пишет:
>>>

 BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
 this distro to research every component used?
>>>
>>> As far as I'm aware, only Btrfs single device stuff is "supported".
>>> The multiple device stuff is definitely not supported on openSUSE, but
>>> I have no idea to what degree they support it with enterprise license,
>>> no doubt that support must come with caveats.
>>>
>>
>> I was rather surprised seeing RAID1 and RAID10 listed as supported in
>> SLES 12.x release notes, especially as there is no support for
>> multi-device btrfs in YaST and hence no way to even install on such
>> filesystem.
>
> Haha. OK well I'm at a loss then. And they use systemd which is going
> to run into the udev rule that prevents systemd from even attempting
> to mount rootfs if one or more devices are missing. So I don't know
> how it really gets supported. At the dracut prompt, manually mount
> using -o degraded to /sysroot and then exit? I guess?


There is an irony here:

YaST doesn't have Btrfs raid1 or raid10 options; and also won't do
encrypted root with Btrfs either because YaST enforces LVM to do LUKS
encryption for some weird reason; and it also enforces NOT putting
Btrfs on LVM.

Meanwhile, Fedora/Red Hat's Anaconda installer has supported both of
these use cases for something like 5 years (does support Btrfs raid1
and raid10 layouts; and also supports Btrfs directly on dmcrypt
without LVM) - with the caveat that it enforces /boot to be on ext4.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexpected raid1 behaviour

2017-12-20 Thread Chris Murphy
On Wed, Dec 20, 2017 at 9:53 AM, Andrei Borzenkov  wrote:
> 19.12.2017 22:47, Chris Murphy пишет:
>>
>>>
>>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
>>> this distro to research every component used?
>>
>> As far as I'm aware, only Btrfs single device stuff is "supported".
>> The multiple device stuff is definitely not supported on openSUSE, but
>> I have no idea to what degree they support it with enterprise license,
>> no doubt that support must come with caveats.
>>
>
> I was rather surprised seeing RAID1 and RAID10 listed as supported in
> SLES 12.x release notes, especially as there is no support for
> multi-device btrfs in YaST and hence no way to even install on such
> filesystem.

Haha. OK well I'm at a loss then. And they use systemd which is going
to run into the udev rule that prevents systemd from even attempting
to mount rootfs if one or more devices are missing. So I don't know
how it really gets supported. At the dracut prompt, manually mount
using -o degraded to /sysroot and then exit? I guess?


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel hangs during balance

2017-12-20 Thread Holger Hoffstätte
On 12/20/17 20:02, Chris Murphy wrote:
> I don't know if it's the sending MUA or the list server, but the line
> wrapping makes this much harder to follow. I suggest putting it in a
> text file and attaching the text file. It's definitely not on the
> receiving side, I see it here also:
> https://www.spinics.net/lists/linux-btrfs/msg72872.html

You can see enough to suggest that blk-mq is hanging, which is
"unsurprising" (being kind here) with such an old kernel.
Rich, build your kernel with CONFIG_SCSI_MQ_DEFAULT=n or
boot with scsi_mod.use_blk_mq=n as kernel prarameter.

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexpected raid1 behaviour

2017-12-20 Thread Chris Murphy
On Wed, Dec 20, 2017 at 1:34 AM, Tomasz Pala  wrote:
> On Tue, Dec 19, 2017 at 16:59:39 -0700, Chris Murphy wrote:
>
>>> Sth like this? I got such problem a few months ago, my solution was
>>> accepted upstream:
>>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
>>
>> I can't parse this commit. In particular I can't tell how long it
>> waits, or what triggers the end to waiting.
>
> The point is - it doesn't wait at all. Instead, every 'ready' btrfs
> device triggers event on all the pending devices. Consider 3-device
> filesystem consisting of /dev/sd[abd] with /dev/sdc being different,
> standalone btrfs:
>
> /dev/sda -> 'not ready'
> /dev/sdb -> 'not ready'
> /dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 
> 'not ready'
> /dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 
> 'ready'
>
> This way all the parts of a volume are marked as ready, so systemd won't
> refuse mounting using legacy device nodes like /dev/sda.
>
>
> This particular solution depends on kernel returning 'btrfs ready',
> which would obviously not work for degraded arrays unless the btrfs.ko
> handles some 'missing' or 'mount_degraded' kernel cmdline options
> _before_ actually _trying_ to mount it with -o degraded.


The thing that is valuing a Btrfs's "readiness" is udev. The kernel
doesn't care, it still instantiates a volume UUID. And if you pass -o
degraded mount to a non-ready Btrfs volume, the kernel code will try
to mount that volume in degraded mode (assuming it passes tests for
the minimum number of devices, can find all the supers it needs, and
bootstrap the chunk tree, etc)


If the udev rule were smarter, it could infer "non-ready" Btrfs volume
to mean it should wait (and complaining might be nice so we know why
it's waiting) for some period of time, and then if it's still not
ready to try to mount with -o degraded. I don't know where teaching
system about degraded attempts belongs, whether the udev rule can tell
systemd to add that mount option if a volume is still not ready, of if
systemd needs hard coded understanding of this mount option for Btrfs.
There is no risk of using -o degraded on a Btrfs volume if it's
missing too many devices, such a degraded mount will simply fail.



> After such timeout, I'd like to tell the kernel: "no more devices, give
> me all the remaining btrfs volumes in degraded mode if possible". By
> "give me btrfs vulumes" I mean "mark them as 'ready'" so the udev could
> fire it's rules. And if there would be anything for udev to distinguish
> 'ready' from 'ready-degraded' one could easily compose some notification
> scripting on top of it, including sending e-mail to sysadmin.

I think the linguistics of "btrfs devices ready" is confusing because
what we really care about is whether the volume/array can be mounted
normally (not degraded). The BTRFS_IOC_DEVICES_READY ioctl is pointed
to any one of the volume's devices, and you get a pass/fail. If it
passes (ready), all other devices are present. If it fails (not
ready), one or more devices are missing. It's not necessary to hit
every device with this ioctl to understand what's going on.

If the question can be answered with: ready, ready-degraded - It's
highly likely that you always get read-degraded as the answer for all
btrfs multiple device volumes. So if udev were to get read-degraded
will it still wait to see if the state goes to ready? How long does it
wait? Seems like it still should wait 90 seconds. In which case it's
going to try to mount with -o degraded.

So I see zero advantage and multiple disadvantages to having the
kernel do a degradedness test well before the mount will be attempted.
I think this is asking for a race condition.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARN_ON in __writeback_inodes_sb_nr

2017-12-20 Thread David Ahern
I am still seeing this problem on 4.15.0-rc3


On 11/17/17 12:55 PM, David Ahern wrote:
> I see a backtrace booting 4.14+ on a mellanox switch. The trace is due
> to the WARN_ON in __writeback_inodes_sb_nr.
> 
> [   40.958590] WARNING: CPU: 0 PID: 183 at
> /home/dsa/kernel-2.git/fs/fs-writeback.c:2339
> __writeback_inodes_sb_nr+0x8a/0x90
> [   40.958593] Modules linked in: ebtable_filter ebtables ip6table_raw
> ip6table_mangle ip6table_filter ip6_tables iTCO_wdt iTCO_vendor_support
> coretemp x86_pkg_temp_thermal kvm_intel kvm irqbypass lpc_ich mfd_core
> battery mei_me mei shpchp lm75 regmap_i2c pmbus pmbus_core i2c_dev
> i2c_mux k10temp mpls_iptunnel mpls_router ip_tunnel at24 tun
> br_netfilter bonding loop autofs4 dm_mod dax crc32c_intel i2c_i801
> i2c_core thermal mlxsw_spectrum psample parman bridge stp llc mlxsw_pci
> mlxsw_core mlxfw devlink e1000e hwmon
> [   40.958664] CPU: 0 PID: 183 Comm: btrfs-transacti Not tainted 4.14.0+ #38
> [   40.958666] Hardware name: Mellanox Technologies Ltd. Mellanox
> switch/Mellanox switch, BIOS 4.6.5 05/21/2015
> [   40.958669] task: 8803db266540 task.stack: c9344000
> [   40.958675] RIP: 0010:__writeback_inodes_sb_nr+0x8a/0x90
> [   40.958677] RSP: 0018:c9347dc8 EFLAGS: 00010246
> [   40.958681] RAX:  RBX: 88040a641800 RCX:
> 
> [   40.958683] RDX: 0002 RSI: 45a7 RDI:
> c9347e10
> [   40.958685] RBP: c9347e20 R08: 0003 R09:
> c9347dd0
> [   40.958687] R10: 8803db287000 R11:  R12:
> c9347dcc
> [   40.958689] R13: 8804015e7a00 R14: 8803dade2bc8 R15:
> 880400a68000
> [   40.958692] FS:  () GS:88041dc0()
> knlGS:
> [   40.958695] CS:  0010 DS:  ES:  CR0: 80050033
> [   40.958697] CR2: 5602d78aad50 CR3: 01e09002 CR4:
> 001606f0
> [   40.958699] Call Trace:
> [   40.958710]  writeback_inodes_sb+0x27/0x30
> [   40.958719]  btrfs_commit_transaction+0x7b2/0x8e0
> [   40.958723]  ? start_transaction+0x9e/0x450
> [   40.958728]  transaction_kthread+0x177/0x1b0
> [   40.958735]  kthread+0x11d/0x150
> [   40.958740]  ? btrfs_cleanup_transaction+0x4f0/0x4f0
> [   40.958744]  ? kthread_associate_blkcg+0xb0/0xb0
> [   40.958751]  ret_from_fork+0x24/0x30
> [   40.958754] Code: 8b 42 70 48 85 c0 74 23 4c 89 ce 48 89 df 41 0f b6
> d3 e8 fa fc ff ff 4c 89 e6 48 89 df e8 8f de ff ff 48 83 c4 48 5b 41 5c
> 5d c3 <0f> ff eb d9 66 90 0f 1f 44 00 00 55 31 c9 48 89 e5 e8 60 ff ff
> [   40.958826] ---[ end trace defabeb7afdfd414 ]---
> 
> 
> Tree is DaveM's net-next, but it was recently merged with Linus' tree at:
> 
> commit 6363b3f3ac5be096d08c8c504128befa0c033529
> Merge: 1b6115fbe3b3 6297fabd93f9
> Author: Linus Torvalds 
> Date:   Wed Nov 15 15:12:28 2017 -0800
> 
> Merge tag 'ipmi-for-4.15' of git://github.com/cminyard/linux-ipmi
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel hangs during balance

2017-12-20 Thread Chris Murphy
I don't know if it's the sending MUA or the list server, but the line
wrapping makes this much harder to follow. I suggest putting it in a
text file and attaching the text file. It's definitely not on the
receiving side, I see it here also:
https://www.spinics.net/lists/linux-btrfs/msg72872.html

Also anytime you get a hang, it's best to issue sysrq+w which dumps
extra information into kernel messages that the developers sometimes
find useful.


Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 1/5] add infrastructure for tagging functions as error injectable

2017-12-20 Thread Alexei Starovoitov

On 12/20/17 3:00 AM, Masami Hiramatsu wrote:

On Fri, 15 Dec 2017 14:12:52 -0500
Josef Bacik  wrote:


From: Josef Bacik 

Using BPF we can override kprob'ed functions and return arbitrary
values.  Obviously this can be a bit unsafe, so make this feature opt-in
for functions.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
order to give BPF access to that function for error injection purposes.

Signed-off-by: Josef Bacik 
Acked-by: Ingo Molnar 
---
 include/asm-generic/vmlinux.lds.h |  10 +++
 include/linux/bpf.h   |  11 +++
 include/linux/kprobes.h   |   1 +
 include/linux/module.h|   5 ++
 kernel/kprobes.c  | 163 ++
 kernel/module.c   |   6 +-
 6 files changed, 195 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index ee8b707d9fa9..a2e8582d094a 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -136,6 +136,15 @@
 #define KPROBE_BLACKLIST()
 #endif

+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define ERROR_INJECT_LIST(). = ALIGN(8);   
\
+   
VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;   \
+   KEEP(*(_kprobe_error_inject_list))  
\
+   VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) 
= .;
+#else
+#define ERROR_INJECT_LIST()
+#endif
+
 #ifdef CONFIG_EVENT_TRACING
 #define FTRACE_EVENTS(). = ALIGN(8);   
\
VMLINUX_SYMBOL(__start_ftrace_events) = .;  \
@@ -564,6 +573,7 @@
FTRACE_EVENTS() \
TRACE_SYSCALLS()\
KPROBE_BLACKLIST()  \
+   ERROR_INJECT_LIST() \
MEM_DISCARD(init.rodata)\
CLK_OF_TABLES() \
RESERVEDMEM_OF_TABLES() \
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e55e4255a210..7f4d2a953173 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -576,4 +576,15 @@ extern const struct bpf_func_proto 
bpf_sock_map_update_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);

+#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define BPF_ALLOW_ERROR_INJECTION(fname)   \
+static unsigned long __used\
+   __attribute__((__section__("_kprobe_error_inject_list"))) \
+   _eil_addr_##fname = (unsigned long)fname;
+#else
+#define BPF_ALLOW_ERROR_INJECTION(fname)
+#endif
+#endif


This part shows this feature belongs to bpf, if it is a part of kprobes,
it should be defined in include/asm-generic/kprobes.h as NOKPROBE_SYMBOL
does.

Why this is defined in BPF, but list is under kprobes?


because Ingo specifically requested that macro that marks the function
will be in bpf.h, so any .c file that starts adding such marks will
include that file instead of pulling stuff from kprobe.



So there is no direct relationship with kprobe.
For example, kprobe user modules can OVERRIDE any functions.
And there is no generic error injection code in the kernel
except for the bpf currently.


_currently_ is key word.


Of course, I can accept this code if you accept that I make a
generic error injection code on ftrace without BPF.


what stops other pieces of kernel to use the same technique?
The bpf verifier coupled together with opt-in
per-function marks via BPF_ALLOW_ERROR_INJECTION
give _safe_ way to do error injection.

I can imagine how you can hack kprobe text based interface to
use the same technique, but I suggest to wait and see how we
build on it in bpf land before replicating things in
pure kprobe land.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexpected raid1 behaviour

2017-12-20 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 20 Dec 2017 08:33:03 -0500 as
excerpted:

>> The obvious answer is: do it via kernel command line, just like mdadm
>> does:
>> rootflags=device=/dev/sda,device=/dev/sdb
>> rootflags=device=/dev/sda,device=missing
>> rootflags=device=/dev/sda,device=/dev/sdb,degraded
>> 
>> If only btrfs.ko recognized this, kernel would be able to assemble
>> multivolume btrfs itself. Not only this would allow automated degraded
>> mounts, it would also allow using initrd-less kernels on such volumes.
> Last I checked, the 'device=' options work on upstream kernels just
> fine, though I've never tried the degraded option.  Of course, I'm also
> not using systemd, so it may be some interaction with systemd that's
> causing them to not work (and yes, I understand that I'm inclined to
> blame systemd most of the time based on significant past experience with
> systemd creating issues that never existed before).

Has the bug where rootflags=device=/dev/sda1,device=/dev/sdb1 failed, 
been fixed?  Last I knew (which was ancient history in btrfs terms, but 
I've not seen mention of a patch for it in all that time either), device= 
on the userspace commandline worked, and device= on the kernel commandline 
worked if there was just one device, but it would fail for more than one 
device.  Mounting degraded (on a pair-device raid1) would then of course 
work, since it would just use the one device=, but that's simply 
dangerous for routine use regardless of whether it actually assembled or 
not, thus effectively forcing an initr* for multi-device btrfs root in 
ordered to get it mounted properly.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 1/5] add infrastructure for tagging functions as error injectable

2017-12-20 Thread Alexei Starovoitov

On 12/19/17 11:13 PM, Masami Hiramatsu wrote:

On Tue, 19 Dec 2017 18:14:17 -0800
Alexei Starovoitov  wrote:


On 12/18/17 10:29 PM, Masami Hiramatsu wrote:


+#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE


BTW, CONFIG_BPF_KPROBE_OVERRIDE is also confusable name.
Since this feature override a function to just return with
some return value (as far as I understand, or would you
also plan to modify execution path inside a function?),
I think it should be better CONFIG_BPF_FUNCTION_OVERRIDE or
CONFIG_BPF_EXECUTION_OVERRIDE.


I don't think such renaming makes sense.
The feature is overriding kprobe by changing how kprobe returns.
It doesn't override BPF_FUNCTION or BPF_EXECUTION.


No, I meant this is BPF's feature which override FUNCTION, so
BPF is a kind of namespace. (Is that only for a function entry
because it can not tweak stackframe at this morment?)


The kernel enters and exists bpf program as normal.


Yeah, but that bpf program modifies instruction pointer, am I correct?


no. bpf side is asking kprobe side to modify it.
bpf cannot do such things as modifying IP or any other register
directly.




Indeed, BPF is based on kprobes, but it seems you are limiting it
with ftrace (function-call trace) (I'm not sure the reason why),
so using "kprobes" for this feature seems strange for me.


do you have an idea how kprobe override can happen when kprobe
placed in the middle of the function?


For example, if you know a basic block in the function, maybe
you can skip a block or something like that. But nowadays
it is somewhat hard because optimizer mixed it up.


still missing how that can work...

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: qgroups: remove unused label

2017-12-20 Thread Arnd Bergmann
The recent revert left one unused label behind:

fs/btrfs/qgroup.c: In function 'qgroup_reserve':
fs/btrfs/qgroup.c:2432:1: error: label 'retry' defined but not used 
[-Werror=unused-label]

Let's remove it, too.

Fixes: b283738ab0ad ("Revert "btrfs: qgroups: Retry after commit on getting 
EDQUOT"")
Signed-off-by: Arnd Bergmann 
---
 fs/btrfs/qgroup.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index de3b96f1005b..2b89848e767d 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2429,7 +2429,6 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
num_bytes, bool enforce,
capable(CAP_SYS_RESOURCE))
enforce = false;
 
-retry:
spin_lock(_info->qgroup_lock);
quota_root = fs_info->quota_root;
if (!quota_root)
-- 
2.9.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexpected raid1 behaviour

2017-12-20 Thread Austin S. Hemmelgarn

On 2017-12-20 11:53, Andrei Borzenkov wrote:

19.12.2017 22:47, Chris Murphy пишет:




BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
this distro to research every component used?


As far as I'm aware, only Btrfs single device stuff is "supported".
The multiple device stuff is definitely not supported on openSUSE, but
I have no idea to what degree they support it with enterprise license,
no doubt that support must come with caveats.



I was rather surprised seeing RAID1 and RAID10 listed as supported in
SLES 12.x release notes, especially as there is no support for
multi-device btrfs in YaST and hence no way to even install on such
filesystem.
That's the beauty of it all though, you don't need to install on such a 
setup directly like you would need to with hardware RAID, you can 
install in single-device mode and then convert the system on-line to use 
multiple devices, and that will (usually) be faster than a direct 
install if you're using replication (unless you're using RAID10 and have 
a _lot_ of disks).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexpected raid1 behaviour

2017-12-20 Thread Andrei Borzenkov
19.12.2017 22:47, Chris Murphy пишет:
> 
>>
>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
>> this distro to research every component used?
> 
> As far as I'm aware, only Btrfs single device stuff is "supported".
> The multiple device stuff is definitely not supported on openSUSE, but
> I have no idea to what degree they support it with enterprise license,
> no doubt that support must come with caveats.
> 

I was rather surprised seeing RAID1 and RAID10 listed as supported in
SLES 12.x release notes, especially as there is no support for
multi-device btrfs in YaST and hence no way to even install on such
filesystem.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-20 Thread Jan Kara
On Wed 20-12-17 09:03:06, Jeff Layton wrote:
> On Tue, 2017-12-19 at 09:07 +1100, Dave Chinner wrote:
> > On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> > > [PATCH] SQUASH: add memory barriers around i_version accesses
> > 
> > Why explicit memory barriers rather than annotating the operations
> > with the required semantics and getting the barriers the arch
> > requires automatically?  I suspect this should be using
> > atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
> > the atomic_cmpxchg needs to have release semantics to match the
> > acquire semantics needed for the load of the current value.
> > 
> > From include/linux/atomics.h:
> > 
> >  * For compound atomics performing both a load and a store, ACQUIRE
> >  * semantics apply only to the load and RELEASE semantics only to the
> >  * store portion of the operation. Note that a failed cmpxchg_acquire
> >  * does -not- imply any memory ordering constraints.
> > 
> > Memory barriers hurt my brain. :/
> > 
> > At minimum, shouldn't the atomic op specific barriers be used rather
> > than full memory barriers? i.e:
> > 
> 
> They hurt my brain too. Yes, definitely atomic-specific barriers should
> be used here instead, since this is an atomic64_t now.
> 
> After going over the docs again...my understanding has always been that
> you primarily need memory barriers to order accesses to different areas
> of memory.

That is correct.

> As Jan and I were discussing in the other thread, i_version is not
> synchronized with anything else. In this code, we're only dealing with a
> single 64-bit word. I don't think there are any races there wrt the API
> itself.

There are not but it is like saying that lock implementation is correct
because the lock state does not get corrupted ;). Who cares about protected
data...

> The "legacy" inode_inc_iversion() however _does_ have implied memory
> barriers from the i_lock. There could be some subtle de-facto ordering
> there, so I think we probably do want some barriers in here if only to
> preserve that. It's not likely to cost much, and may save us tracking
> down some fiddly bugs.
> 
> What about this patch? Note that I've only added barriers to
> inode_maybe_inc_iversion. I don't see that we need it for the other
> functions, but please do tell me if I'm wrong there:
> 
> 8<---
> 
> [PATCH] SQUASH: add memory barriers around i_version accesses
> 
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/iversion.h | 53 
> +---
>  1 file changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..02187a3bec3b 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
>   atomic64_set(>i_version, val);
>  }
>  
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a 
> self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> + return atomic64_read(>i_version);
> +}
> +
>  /**
>   * inode_set_iversion - set i_version to a particular value
>   * @inode: inode to set
> @@ -152,7 +169,16 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
>   u64 cur, old, new;
>  
> - cur = (u64)atomic64_read(>i_version);
> + /*
> +  * The i_version field is not strictly ordered with any other inode
> +  * information, but the legacy inode_inc_iversion code used a spinlock
> +  * to serialize increments.
> +  *
> +  * This code adds full memory barriers to ensure that any de-facto
> +  * ordering with other info is preserved.
> +  */
> + smp_mb__before_atomic();

This should be just smp_mb(). __before_atomic() pairs with atomic
operations like atomic_inc(). atomic_read() is completely unordered
operation (happens to be plain memory read on x86) and so __before_atomic()
is not enough.

> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is clear then we needn't do anything */
>   if (!force && !(cur & I_VERSION_QUERIED))
> @@ -162,8 +188,10 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>   new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
>  
>   old = atomic64_cmpxchg(>i_version, cur, new);
> - if (likely(old == cur))
> + if (likely(old == cur)) {
> + smp_mb__after_atomic();

I don't think you need this. Cmpxchg is guaranteed to be full memory
barrier - 

Re: kernel hangs during balance

2017-12-20 Thread Rich Rauenzahn
Ok, caught the hung tasks last night.  I don't *think* this is
related, because I pretty sure this isn't happening on the same
filesystem, but I do have a loopback swap on one btrfs drive.

The hang might have occurred after the btrfs balance was finished 
which is confusing.  I'm adding timestamps to my next btrfs job.


Dec 20 01:14:50 tendo [64680.942928] INFO: task systemd:1 blocked for
more than 120 seconds.
Dec 20 01:14:50 tendo [64680.945844]   Not tainted
4.4.106-1.el7.elrepo.x86_64 #1
Dec 20 01:14:50 tendo [64680.948900] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 01:14:50 tendo [64680.951987]  88040d6730d8
Dec 20 01:14:50 tendo 81a81500
Dec 20 01:14:50 tendo 88040d5f8000
Dec 20 01:14:50 tendo 88040d674000
Dec 20 01:14:50 tendo
Dec 20 01:14:50 tendo [64680.955107]  
Dec 20 01:14:50 tendo 7fff
Dec 20 01:14:50 tendo 8800bd19a648
Dec 20 01:14:50 tendo e8c03980
Dec 20 01:14:50 tendo
Dec 20 01:14:50 tendo [64680.957871]  88040d6730f0
Dec 20 01:14:50 tendo 81700435
Dec 20 01:14:50 tendo 88041fa16cc0
Dec 20 01:14:50 tendo 88040d6731a0
Dec 20 01:14:50 tendo
Dec 20 01:14:50 tendo [64680.960618] Call Trace:
Dec 20 01:14:50 tendo [64680.963342]  [] schedule+0x35/0x80
Dec 20 01:14:50 tendo [64680.966052]  []
schedule_timeout+0x237/0x2d0
Dec 20 01:14:50 tendo [64680.968490]  []
io_schedule_timeout+0xa6/0x110
Dec 20 01:14:50 tendo [64680.970900]  [] bt_get+0x14c/0x1d0
Dec 20 01:14:50 tendo [64680.973211]  [] ?
prepare_to_wait_event+0xf0/0xf0
Dec 20 01:14:50 tendo [64680.975383]  []
blk_mq_get_tag+0xb0/0xe0
Dec 20 01:14:50 tendo [64680.977540]  []
__blk_mq_alloc_request+0x1b/0x1f0
Dec 20 01:14:50 tendo [64680.979708]  []
blk_mq_map_request+0x19e/0x210
Dec 20 01:14:50 tendo [64680.981871]  []
blk_sq_make_request+0xab/0x3d0
Dec 20 01:14:50 tendo [64680.984004]  [] ?
generic_make_request_checks+0x29f/0x4f0
Dec 20 01:14:50 tendo [64680.986122]  [] ?
mempool_alloc+0x6e/0x170
Dec 20 01:14:50 tendo [64680.988207]  []
generic_make_request+0x113/0x2c0
Dec 20 01:14:50 tendo [64680.990137]  [] submit_bio+0x77/0x150
Dec 20 01:14:50 tendo [64680.992041]  [] ?
__test_set_page_writeback+0x149/0x1c0
Dec 20 01:14:50 tendo [64680.993949]  []
__swap_writepage+0x20d/0x250
Dec 20 01:14:50 tendo [64680.995853]  [] ?
__frontswap_store+0x8c/0x120
Dec 20 01:14:50 tendo [64680.997720]  []
swap_writepage+0x39/0x70
Dec 20 01:14:50 tendo [64680.999595]  []
pageout.isra.43+0x170/0x290
Dec 20 01:14:50 tendo [64681.001291]  []
shrink_page_list+0x353/0x780
Dec 20 01:14:50 tendo [64681.002975]  []
shrink_inactive_list+0x217/0x530
Dec 20 01:14:50 tendo [64681.004652]  []
shrink_lruvec+0x588/0x740
Dec 20 01:14:50 tendo [64681.006311]  []
shrink_zone+0xeb/0x2e0
Dec 20 01:14:50 tendo [64681.007967]  []
do_try_to_free_pages+0x164/0x430
Dec 20 01:14:50 tendo [64681.009617]  [] ?
throttle_direct_reclaim+0x14e/0x240
Dec 20 01:14:50 tendo [64681.011141]  []
try_to_free_pages+0xd5/0x190
Dec 20 01:14:51 tendo [64681.012659]  []
__alloc_pages_slowpath.constprop.89+0x3a0/0x6e8
Dec 20 01:14:51 tendo [64681.014191]  [] ?
sched_clock_cpu+0x72/0xa0
Dec 20 01:14:51 tendo [64681.015701]  []
__alloc_pages_nodemask+0x253/0x260
Dec 20 01:14:51 tendo [64681.017215]  [] ?
__kmalloc_node_track_caller+0x249/0x2b0
Dec 20 01:14:51 tendo [64681.018392]  []
alloc_pages_vma+0xa5/0x220
Dec 20 01:14:51 tendo [64681.019296]  []
handle_pte_fault+0xd21/0x1490
Dec 20 01:14:51 tendo [64681.020195]  [] ?
legitimize_path.isra.23+0x2e/0x60
Dec 20 01:14:51 tendo [64681.021100]  [] ? put_dec+0x1a/0x80
Dec 20 01:14:51 tendo [64681.021992]  [] ?
number.isra.13+0x2dd/0x310
Dec 20 01:14:51 tendo [64681.022883]  []
handle_mm_fault+0x250/0x540
Dec 20 01:14:51 tendo [64681.023757]  []
__do_page_fault+0x188/0x3f0
Dec 20 01:14:51 tendo [64681.024626]  []
do_page_fault+0x30/0x80
Dec 20 01:14:51 tendo [64681.025486]  [] ?
kernfs_path_locked+0x37/0x80
Dec 20 01:14:51 tendo [64681.026350]  [] page_fault+0x28/0x30
Dec 20 01:14:51 tendo [64681.027201]  [] ?
copy_user_enhanced_fast_string+0xe/0x20
Dec 20 01:14:51 tendo [64681.028034]  [] ?
seq_read+0x295/0x390
Dec 20 01:14:51 tendo [64681.028832]  [] __vfs_read+0x37/0x100
Dec 20 01:14:51 tendo [64681.029618]  [] ?
security_file_permission+0xa3/0xc0
Dec 20 01:14:51 tendo [64681.030342]  [] ?
rw_verify_area+0x52/0xd0
Dec 20 01:14:51 tendo [64681.031055]  [] vfs_read+0x7f/0x130
Dec 20 01:14:51 tendo [64681.031761]  [] SyS_read+0x55/0xc0
Dec 20 01:14:51 tendo [64681.032441]  []
entry_SYSCALL_64_fastpath+0x12/0x71
Dec 20 01:14:51 tendo [64681.033119] INFO: task khugepaged:68 blocked
for more than 120 seconds.
Dec 20 01:14:51 tendo [64681.033815]   Not tainted
4.4.106-1.el7.elrepo.x86_64 #1
Dec 20 01:14:51 tendo [64681.034518] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 01:14:51 tendo [64681.035238]  88040cf4f318
Dec 20 01:14:51 tendo 88040d6c2c80
Dec 20 01:14:51 tendo 88040cf5

Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-20 Thread Jan Kara
On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > > > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > > This is just one of those things that's going to be slightly shitty.  
> > > > It's the
> > > > same for memory reclaim, all of those places use pages so we just take
> > > > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close 
> > > > enough.
> > > 
> > > Ok, so that isn't exactly easy to deal with, because all our
> > > metadata writeback is based on log sequence number targets (i.e. how
> > > far to push the tail of the log towards the current head). We've
> > > actually got no idea how pages/bytes actually map to a LSN target
> > > because while we might account a full buffer as dirty for memory
> > > reclaim purposes (up to 64k in size), we might have only logged 128
> > > bytes of it.
> > > 
> > > i.e. if we are asked to push 2MB of metadata and we treat that as
> > > 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> > > logged several tens of megabytes of dirty metadata in that LSN
> > > range and have to flush it all. OTOH, if the buffers are fully
> > > logged, then that same target might only flush 1.5MB of metadata
> > > once all the log overhead is taken into account.
> > > 
> > > So there's a fairly large disconnect between the "flush N bytes of
> > > metadata" API and the "push to a target LSN" that XFS uses for
> > > flushing metadata in aged order. I'm betting that extN and otehr
> > > filesystems might have similar mismatches with their journal
> > > flushing...
> > 
> > Well, for ext4 it isn't as bad since we do full block logging only. So if
> > we are asked to flush N pages, we can easily translate that to number of fs
> > blocks and flush that many from the oldest transaction.
> > 
> > Couldn't XFS just track how much it has cleaned (from reclaim perspective)
> > when pushing items from AIL (which is what I suppose XFS would do in
> > response to metadata writeback request) and just stop pushing when it has
> > cleaned as much as it was asked to?
> 
> If only it were that simple :/
> 
> To start with, flushing the dirty objects (such as inodes) to their
> backing buffers do not mean the the object is clean once the
> writeback completes. XFS has decoupled in-memory objects with
> logical object logging rather than logging physical buffers, and
> so can be modified and dirtied while the inode buffer
> is being written back. Hence if we just count things like "buffer
> size written" it's not actually a correct account of the amount of
> dirty metadata we've cleaned. If we don't get that right, it'll
> result in accounting errors and incorrect behaviour.
> 
> The bigger problem, however, is that we have no channel to return
> flush information from the AIL pushing to whatever caller asked for
> the push. Pushing metadata is completely decoupled from every other
> subsystem. i.e. the caller asked the xfsaild to push to a specific
> LSN (e.g. to free up a certain amount of log space for new
> transactions), and *nothing* has any idea of how much metadata we'll
> need to write to push the tail of the log to that LSN.
> 
> It's also completely asynchronous - there's no mechanism for waiting
> on a push to a specific LSN. Anything that needs a specific amount
> of log space to be available waits in ordered ticket queues on the
> log tail moving forwards. The only interfaces that have access to
> the log tail ticket waiting is the transaction reservation
> subsystem, which cannot be used during metadata writeback because
> that's a guaranteed deadlock vector
> 
> Saying "just account for bytes written" assumes directly connected,
> synchronous dispatch metadata writeback infrastructure which we
> simply don't have in XFS. "just clean this many bytes" doesn't
> really fit at all because we have no way of referencing that to the
> distance we need to push the tail of the log. An interface that
> tells us "clean this percentage of dirty metadata" is much more
> useful because we can map that easily to a log sequence number
> based push target

OK, understood.

> > > IOWs, treating metadata like it's one great big data inode doesn't
> > > seem to me to be the right abstraction to use for this - in most
> > > fileystems it's a bunch of objects with a complex dependency tree
> > > and unknown write ordering, not an inode full of data that can be
> > > sequentially written.
> > > 
> > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > sync based operations. That way different filesystems can ignore the
> > > parts they don't need simply by not implementing those operations,
> > > and the writeback 

Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-20 Thread Jeff Layton
On Tue, 2017-12-19 at 09:07 +1100, Dave Chinner wrote:
> On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> > [PATCH] SQUASH: add memory barriers around i_version accesses
> 
> Why explicit memory barriers rather than annotating the operations
> with the required semantics and getting the barriers the arch
> requires automatically?  I suspect this should be using
> atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
> the atomic_cmpxchg needs to have release semantics to match the
> acquire semantics needed for the load of the current value.
> 
> From include/linux/atomics.h:
> 
>  * For compound atomics performing both a load and a store, ACQUIRE
>  * semantics apply only to the load and RELEASE semantics only to the
>  * store portion of the operation. Note that a failed cmpxchg_acquire
>  * does -not- imply any memory ordering constraints.
> 
> Memory barriers hurt my brain. :/
> 
> At minimum, shouldn't the atomic op specific barriers be used rather
> than full memory barriers? i.e:
> 

They hurt my brain too. Yes, definitely atomic-specific barriers should
be used here instead, since this is an atomic64_t now.

After going over the docs again...my understanding has always been that
you primarily need memory barriers to order accesses to different areas
of memory.

As Jan and I were discussing in the other thread, i_version is not
synchronized with anything else. In this code, we're only dealing with a
single 64-bit word. I don't think there are any races there wrt the API
itself.

The "legacy" inode_inc_iversion() however _does_ have implied memory
barriers from the i_lock. There could be some subtle de-facto ordering
there, so I think we probably do want some barriers in here if only to
preserve that. It's not likely to cost much, and may save us tracking
down some fiddly bugs.

What about this patch? Note that I've only added barriers to
inode_maybe_inc_iversion. I don't see that we need it for the other
functions, but please do tell me if I'm wrong there:

8<---

[PATCH] SQUASH: add memory barriers around i_version accesses

Signed-off-by: Jeff Layton 
---
 include/linux/iversion.h | 53 +---
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index a9fbf99709df..02187a3bec3b 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
atomic64_set(>i_version, val);
 }
 
+/**
+ * inode_peek_iversion_raw - grab a "raw" iversion value
+ * @inode: inode from which i_version should be read
+ *
+ * Grab a "raw" inode->i_version value and return it. The i_version is not
+ * flagged or converted in any way. This is mostly used to access a 
self-managed
+ * i_version.
+ *
+ * With those filesystems, we want to treat the i_version as an entirely
+ * opaque value.
+ */
+static inline u64
+inode_peek_iversion_raw(const struct inode *inode)
+{
+   return atomic64_read(>i_version);
+}
+
 /**
  * inode_set_iversion - set i_version to a particular value
  * @inode: inode to set
@@ -152,7 +169,16 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
u64 cur, old, new;
 
-   cur = (u64)atomic64_read(>i_version);
+   /*
+* The i_version field is not strictly ordered with any other inode
+* information, but the legacy inode_inc_iversion code used a spinlock
+* to serialize increments.
+*
+* This code adds full memory barriers to ensure that any de-facto
+* ordering with other info is preserved.
+*/
+   smp_mb__before_atomic();
+   cur = inode_peek_iversion_raw(inode);
for (;;) {
/* If flag is clear then we needn't do anything */
if (!force && !(cur & I_VERSION_QUERIED))
@@ -162,8 +188,10 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
 
old = atomic64_cmpxchg(>i_version, cur, new);
-   if (likely(old == cur))
+   if (likely(old == cur)) {
+   smp_mb__after_atomic();
break;
+   }
cur = old;
}
return true;
@@ -183,23 +211,6 @@ inode_inc_iversion(struct inode *inode)
inode_maybe_inc_iversion(inode, true);
 }
 
-/**
- * inode_peek_iversion_raw - grab a "raw" iversion value
- * @inode: inode from which i_version should be read
- *
- * Grab a "raw" inode->i_version value and return it. The i_version is not
- * flagged or converted in any way. This is mostly used to access a 
self-managed
- * i_version.
- *
- * With those filesystems, we want to treat the i_version as an entirely
- * opaque value.
- */
-static inline u64
-inode_peek_iversion_raw(const struct inode *inode)
-{
-   

Re: Unexpected raid1 behaviour

2017-12-20 Thread Austin S. Hemmelgarn

On 2017-12-19 17:23, Tomasz Pala wrote:

On Tue, Dec 19, 2017 at 15:47:03 -0500, Austin S. Hemmelgarn wrote:


Sth like this? I got such problem a few months ago, my solution was
accepted upstream:
https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f

Rationale is in referred ticket, udev would not support any more btrfs
logic, so unless btrfs handles this itself on kernel level (daemon?),
that is all that can be done.

Or maybe systemd can quit trying to treat BTRFS like a volume manager
(which it isn't) and just try to mount the requested filesystem with the
requested options?


Tried that before ("just mount my filesystem, stupid"), it is a no-go.
The problem source is not within systemd treating BTRFS differently, but
in btrfs kernel logic that it uses. Just to show it:

1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan 
tool),
3. try
mount /dev/sda /test - fails
mount /dev/sdb /test - works
4. reboot again and try in reversed order
mount /dev/sdb /test - fails
mount /dev/sda /test - works

THIS readiness is exposed via udev to systemd. And it must be used for
multi-layer setups to work (consider stacked LUKS, LVM, MD, iSCSI, FC etc).
Except BTRFS _IS NOT MULTIPLE LAYERS_.  It's one layer at the filesystem 
layer, and handles the other 'layers' internally.


In short: until *something* scans all the btrfs components, so the
kernel makes it ready, systemd won't even try to mount it.
Which is the problem here.  Systemd needs to treat BTRFS differently, 
even if the ioctl it's using gets 'fixed', currently it's treating it 
like LVM or MD, when it needs to be treated as just a filesystem with an 
extra wait condition prior to mount (and needs to trust that the user 
knows what they are doing when they mount something by hand).  The IOCTL 
systemd is using was poorly named, what it really does is say that the 
FS is ready to mount normally (that is, without needing 'device=' or 
'degraded' mount options).  Aside from this being problematic with 
degraded volumes, it's got an inherent TOCTOU race condition (so do the 
checks with all the other block layers you mentioned FWIW).  If systemd 
would just treat BTRFS like a filesystem instead of a volume manager, 
and try to mount the volume with the specified options (after waiting 
for udev to report that it's done scanning everything) instead of asking 
the kernel if it's ready, none of this would be an issue.


Put slightly differently:  I use OpenRC and sysv init.  I have a script 
that runs right after udev starts and directly scans all fixed disks for 
BTRFS signatures, and that's _all_ that I need to do to get multi-device 
BTRFS working properly with the standard local filesystem mount script 
in Gentoo.  I don't have to deal with any of this crap that systemd 
users do because Gentoo's OpenRC script for mounting local filesystems 
treats BTRFS like any other filesystem, and (sensibly) assumes that if 
the call to mount succeeds, things are ready and working.



Then you would just be able to specify 'degraded' in
your mount options, and you don't have to care that the kernel refuses
to mount degraded filesystems without being explicitly asked to.


Exactly. But since LP refused to try mounting despite kernel "not-ready"
state - it is the kernel that must emit 'ready'. So the
question is: how can I make kernel to mark degraded array as "ready"?
You can't, because the DEVICE_READY IOCTL is coded to mark the volume 
ready when all component devices are ready.  IOW, it's there to say 
'this mount will work without needing -o degraded or specifying any 
devices in the mount options'.


The issue is the interaction here, not the kernel behavior by itself, 
since the kernel behavior produces no issues whatsoever for other init 
systems (though I will acknowledge that the ioctl itself is really only 
used by systemd, but I contend that that's because everything else is 
sensible enough to understand that the ioctl is functionally useless and 
just avoid it).


The obvious answer is: do it via kernel command line, just like mdadm
does:
rootflags=device=/dev/sda,device=/dev/sdb
rootflags=device=/dev/sda,device=missing
rootflags=device=/dev/sda,device=/dev/sdb,degraded

If only btrfs.ko recognized this, kernel would be able to assemble
multivolume btrfs itself. Not only this would allow automated degraded
mounts, it would also allow using initrd-less kernels on such volumes.
Last I checked, the 'device=' options work on upstream kernels just 
fine, though I've never tried the degraded option.  Of course, I'm also 
not using systemd, so it may be some interaction with systemd that's 
causing them to not work (and yes, I understand that I'm inclined to 
blame systemd most of the time based on significant past experience with 
systemd creating issues that never existed before).



It doesn't have to be default, might be kernel compile-time knob, module

Re: Unexpected raid1 behaviour

2017-12-20 Thread Austin S. Hemmelgarn

On 2017-12-19 18:53, Chris Murphy wrote:

On Tue, Dec 19, 2017 at 1:11 PM, Austin S. Hemmelgarn
 wrote:

On 2017-12-19 12:56, Tomasz Pala wrote:



BTRFS lacks all of these - there are major functional changes in current
kernels and it reaches far beyond LTS. All the knowledge YOU have here,
on this maillist, should be 'engraved' into btrfs-progs, as there are
people still using kernels with serious malfunctions. btrfs-progs could
easily check kernel version and print appropriate warning - consider
this a "software quirks".


Except the systems running on those ancient kernel versions are not
necessarily using a recent version of btrfs-progs.


Indeed it is much more common to find old user space tools, for
whatever reason, compared to the kernel version.
Most distros have infrastructure in place to handle quick updates to the 
kernel, and tend to keep the kernel up to date to fix hardware issues 
that affect people who may not be using BTRFS.


In contrast, btrfs-progs updates generally aren't high priority, because 
they benefit a much smaller user base (unless you're SuSE).


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexpected raid1 behaviour

2017-12-20 Thread Austin S. Hemmelgarn

On 2017-12-19 16:58, Tomasz Pala wrote:

On Tue, Dec 19, 2017 at 15:11:22 -0500, Austin S. Hemmelgarn wrote:


Except the systems running on those ancient kernel versions are not
necessarily using a recent version of btrfs-progs.


Still much easier to update a userspace tools than kernel (consider
binary drivers for various hardware).

OK, let's look at this objectively:

Current version of btrfs-progs is 4.14, released last month, and current 
kernel is 4.14.8 (or a 4.15 RC release).


In various distributions:
* Arch Linux:
btrfs-progs version is 4.14-2
kernel version is 4.14.6-1
* Alpine Linux:
btrfs-progs version is 4.10.2-r0
kernel version is 4.9.32-0
* Debian Sid:
btrfs-progs version is 4.13.3-1
kernel version is 4.14.0-1
* Debian 9.3:
btrfs-progs version is 4.7.3-1
kernel version is 4.9.0-4
* Fedora 27:
btrfs-progs version is 4.11.3-3
kernel version is 4.14.6-300
* Gentoo ~amd64 (equivalent of Debian Sid or Fedora Rawhide):
btrfs-progs version is 4.14
kernel version is 4.14.7
* Gentoo stable:
btrfs-progs version is 4.10.2
kernel version is 4.14.7
* Manjaro (a somewhat popular Arch Linux derivative):
btrfs-progs version is 4.14-1
kernel version is 4.11.12-1-rt16
* OpenSUSE Leap 42.3:
btrfs-progs version is 4.5.3+20160729
kernel version is 4.4.103-36
* OpenSUSE Tumbleweed:
btrfs-progs version is 4.13.3
kernel version is 4.14.6-1
* Ubuntu 17.10:
btrfs-progs version is 4.12-1
kernel version is 4.13.0-19
* Ubuntu 16.04.3:
btrfs-progs version is 4.4-1ubuntu1
kernel version is 4.4.0-104

Based on this, it looks like Alpine, Manjaro, and OpenSUSE Leap are the 
only distros for which it was easier to upgrade the userspace than the 
kernel, and Alpine and Manjaro are the only two that it even makes sense 
for that to be the case given that they use GRSecurity and RT patches 
respectively.


The fact is that most people use whatever version their distro packages, 
and don't install software themselves through other means, so for most 
people, it is easier to upgrade the kernel.


Even as a 'power user' using Gentoo (where it's really easy to install 
stuff from external sources because you have all the development tools 
pre-installed), I almost never pull anything that's beyond the main 
repositories or the small handful of user repositories that I've got 
enabled, and that's only for stuff I can't get in a repository.



So in other words, spend the time to write up code for btrfs-progs that
will then be run by a significant minority of users because people using
old kernels usually use old userspace, and people using new kernels
won't have to care, instead of working on other bugs that are still
affecting people?


I am aware of the dillema and the answer is: that depends.
Depends on expected usefulness of such infrastructure regarding _future_
changes and possible bugs.
In case of stable/mature/frozen projects this doesn't make much sense,
as the possible incompatibilities would be very rare.
Wheter this makes sense for btrfs? I don't know - it's not mature, but if the 
quirk rate
would be too high to track appropriate kernel versions it might be
really better to officially state "DO USE 4.14+ kernel, REALLY".

This might be accomplished very easy - when releasing new btrfs-progs
check currently available LTS kernel and use it as a base reference for
warning.

After all, "giving users a hurt me button is not ethical programming."

Scaring users needlessly is also not ethical programming.  As an example:

4.9 is the current LTS release (4.9.71 as of right now).  Dozens of bugs 
have been fixed since then.  If we were to start doing as you propose, 
then we'd be spitting out potentially bogus warnings for everything up 
through current kernels.



Now, if the current kernels won't toggle degraded RAID1 as ro, can I
safely add "degraded" to the mount options? My primary concern is the
machine UPTIME. I care less about the data, as they are backed up to
some remote location and loosing day or week of changes is acceptable,
brain-split as well, while every hour of downtime costs me a real money.

In which case you shouldn't be relying on _ANY_ kind of RAID by itself,
let alone BTRFS.  If you care that much about uptime, you should be
investing in a HA setup and going from there.  If downtime costs you


I got this handled and don't use btrfs there - the question remains:
in a situation as described above, is it safe now to add "degraded"?

To rephrase the question: can degraded RAID1 run permanently as rw
without some *internal* damage?
Not on kernels that don't have the patch that's been mentioned a couple 
of times in this thread, with the caveat that 'internal damage' means 
that it won't mount on such kernels after the first time (but will mount 
on newer kernels that have been patched).



Anyway, users shouldn't 

Re: [PATCH v10 1/5] add infrastructure for tagging functions as error injectable

2017-12-20 Thread Masami Hiramatsu
On Fri, 15 Dec 2017 14:12:52 -0500
Josef Bacik  wrote:

> From: Josef Bacik 
> 
> Using BPF we can override kprob'ed functions and return arbitrary
> values.  Obviously this can be a bit unsafe, so make this feature opt-in
> for functions.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
> order to give BPF access to that function for error injection purposes.
> 
> Signed-off-by: Josef Bacik 
> Acked-by: Ingo Molnar 
> ---
>  include/asm-generic/vmlinux.lds.h |  10 +++
>  include/linux/bpf.h   |  11 +++
>  include/linux/kprobes.h   |   1 +
>  include/linux/module.h|   5 ++
>  kernel/kprobes.c  | 163 
> ++
>  kernel/module.c   |   6 +-
>  6 files changed, 195 insertions(+), 1 deletion(-)
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index ee8b707d9fa9..a2e8582d094a 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -136,6 +136,15 @@
>  #define KPROBE_BLACKLIST()
>  #endif
>  
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE
> +#define ERROR_INJECT_LIST()  . = ALIGN(8);   
> \
> + 
> VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;   \
> + KEEP(*(_kprobe_error_inject_list))  
> \
> + VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) 
> = .;
> +#else
> +#define ERROR_INJECT_LIST()
> +#endif
> +
>  #ifdef CONFIG_EVENT_TRACING
>  #define FTRACE_EVENTS()  . = ALIGN(8);   
> \
>   VMLINUX_SYMBOL(__start_ftrace_events) = .;  \
> @@ -564,6 +573,7 @@
>   FTRACE_EVENTS() \
>   TRACE_SYSCALLS()\
>   KPROBE_BLACKLIST()  \
> + ERROR_INJECT_LIST() \
>   MEM_DISCARD(init.rodata)\
>   CLK_OF_TABLES() \
>   RESERVEDMEM_OF_TABLES() \
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e55e4255a210..7f4d2a953173 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -576,4 +576,15 @@ extern const struct bpf_func_proto 
> bpf_sock_map_update_proto;
>  void bpf_user_rnd_init_once(void);
>  u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>  
> +#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE
> +#define BPF_ALLOW_ERROR_INJECTION(fname) \
> +static unsigned long __used  \
> + __attribute__((__section__("_kprobe_error_inject_list")))   \
> + _eil_addr_##fname = (unsigned long)fname;
> +#else
> +#define BPF_ALLOW_ERROR_INJECTION(fname)
> +#endif
> +#endif

This part shows this feature belongs to bpf, if it is a part of kprobes,
it should be defined in include/asm-generic/kprobes.h as NOKPROBE_SYMBOL
does.

Why this is defined in BPF, but list is under kprobes?



> +
>  #endif /* _LINUX_BPF_H */
> diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> index 9440a2fc8893..963fd364f3d6 100644
> --- a/include/linux/kprobes.h
> +++ b/include/linux/kprobes.h
> @@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long 
> offset);
>  extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, 
> unsigned long offset);
>  
>  extern bool within_kprobe_blacklist(unsigned long addr);
> +extern bool within_kprobe_error_injection_list(unsigned long addr);
>  
>  struct kprobe_insn_cache {
>   struct mutex mutex;
> diff --git a/include/linux/module.h b/include/linux/module.h
> index c69b49abe877..548fa09fa806 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -475,6 +475,11 @@ struct module {
>   ctor_fn_t *ctors;
>   unsigned int num_ctors;
>  #endif
> +
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE
> + unsigned int num_kprobe_ei_funcs;
> + unsigned long *kprobe_ei_funcs;
> +#endif
>  } cacheline_aligned __randomize_layout;
>  #ifndef MODULE_ARCH_INIT
>  #define MODULE_ARCH_INIT {}
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index da2ccf142358..b4aab48ad258 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -83,6 +83,16 @@ static raw_spinlock_t *kretprobe_table_lock_ptr(unsigned 
> long hash)
>   return &(kretprobe_table_locks[hash].lock);
>  }
>  
> +/* List of symbols that can be overriden for error injection. */
> +static LIST_HEAD(kprobe_error_injection_list);
> +static DEFINE_MUTEX(kprobe_ei_mutex);
> +struct kprobe_ei_entry {
> + struct list_head list;
> + unsigned long start_addr;
> +  

Re: [PATCH v10 1/5] add infrastructure for tagging functions as error injectable

2017-12-20 Thread Daniel Borkmann
On 12/20/2017 08:13 AM, Masami Hiramatsu wrote:
> On Tue, 19 Dec 2017 18:14:17 -0800
> Alexei Starovoitov  wrote:
[...]
>> Please make your suggestion as patches based on top of bpf-next.
> 
> bpf-next seems already pick this series. Would you mean I revert it and
> write new patch?

No, please submit as follow-ups instead, thanks Masami!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] btrfs: handle volume split brain scenario

2017-12-20 Thread Nikolay Borisov


On 20.12.2017 10:04, Anand Jain wrote:
> In raid configs RAID1/RAID5/RAID6 it's possible to have some devices
> missing which would render btrfs to be mounted in degraded state but
> still be operational. In those cases it's possible (albeit highly
> undesirable) that the degraded and missing parts of the filesystem are
> mounted independently. When writes occur such split-brain scenarios
   ^
   there should be "in" before
the "such" otherwise the sentence doesn't make much sense. I missed it
when writing it the first time. Don't resend just for this but in case
you make v3 don't forget to include it. Otherwise it can be fixed during
merge.

> (caused by intentional user action) then one of the sides of the RAID
> config will have to be blown away when bringing it back to the
> consistent state.
> 
> Handle split-brain volumes by setting a new flag
> BTRFS_SUPER_FLAG_DEGRADED if the device is mounted degraded. So we
> could detect and fail the mount if all the disks contains this flag.
> 
> To reassemble a split-brain volume first mount the good disk and then
> scan in the device on which new writes can be ignored, (it needs patch
> btrfs: handle dynamically reappearing missing device)
Unfortunately you have posted multiple patches and later patches that
depend on earlier ones are not part of the same series. I suggest you
batch all dependent patches in one series and resubmit it like that.
Then information about how to possibly work with array can be included
in the cover letter so that people know what to expect from subsequent
patches. As it stands currently it's somewhat messy reviewing your code.

> 
> Warning:  A raid1 root device, in split brain condition, would fail
> to bootup to protect the arbitrary loss of data.
> 
> Signed-off-by: Anand Jain 
> ---
> On top of misc-next kdave.
> v2:
>  Improve commit log.
>  Rename to BTRFS_SUPER_FLAG_DEGRADED.
>  Rename variables to fs_devices and device.
>  In open_ctree() check for split-brain after btrfs_read_chunk_tree()
> 
>  fs/btrfs/disk-io.c  | 55 
> -
>  include/uapi/linux/btrfs_tree.h |  1 +
>  2 files changed, 55 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b302db90598c..e87924b7145b 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -61,7 +61,8 @@
>BTRFS_HEADER_FLAG_RELOC |\
>BTRFS_SUPER_FLAG_ERROR |\
>BTRFS_SUPER_FLAG_SEEDING |\
> -  BTRFS_SUPER_FLAG_METADUMP)
> +  BTRFS_SUPER_FLAG_METADUMP|\
> +  BTRFS_SUPER_FLAG_DEGRADED)
>  
>  static const struct extent_io_ops btree_extent_io_ops;
>  static void end_workqueue_fn(struct btrfs_work *work);
> @@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct btrfs_fs_info 
> *fs_info)
>   return 0;
>  }
>  
> +bool volume_has_split_brain(struct btrfs_fs_info *fs_info)
> +{
> + unsigned long devs_moved_on = 0;
> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> + struct list_head *head = _devices->devices;
> + struct btrfs_device *device;
> +
> +again:
> + list_for_each_entry(device, head, dev_list) {
> + struct buffer_head *bh;
> + struct btrfs_super_block *sb;
> +
> + if (!device->devid)
> + continue;
> +
> + bh = btrfs_read_dev_super(device->bdev);
> + if (IS_ERR(bh))
> + continue;
> +
> + sb = (struct btrfs_super_block *)bh->b_data;
> + if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_DEGRADED)
> + devs_moved_on++;
> + brelse(bh);
> + }
> +
> + fs_devices = fs_devices->seed;
> + if (fs_devices) {
> + head = _devices->devices;
> + goto again;
> + }
> +
> + if (devs_moved_on == fs_info->fs_devices->total_devices)
> + return true;
> + else
> + return false;
> +}
> +
>  int open_ctree(struct super_block *sb,
>  struct btrfs_fs_devices *fs_devices,
>  char *options)
> @@ -2765,6 +2803,21 @@ int open_ctree(struct super_block *sb,
>   goto fail_tree_roots;
>   }
>  
> + if (fs_info->fs_devices->missing_devices) {
> + btrfs_set_super_flags(fs_info->super_copy,
> +   fs_info->super_copy->flags |
> +   BTRFS_SUPER_FLAG_DEGRADED);
> + } else if (fs_info->super_copy->flags & BTRFS_SUPER_FLAG_DEGRADED) {
> + if (volume_has_split_brain(fs_info)) {
> + btrfs_err(fs_info,
> +   "Detected 'degraded' flag on all devices");
> + goto fail_tree_roots;
> + }

Re: [PATCH v2 2/2] btrfs: handle volume split brain condition for dynamically reappearing device

2017-12-20 Thread Nikolay Borisov


On 20.12.2017 10:04, Anand Jain wrote:
> When the missing device reappears and joins the RAID group, and if there
> are no more missing device at the volume level, then reset the
> BTRFS_SUPER_FLAG_VOL_MOVED_ON flag.

You should rename the flag here as well. Also I believe the changelog
can be simplified:

When the last missing device in a RAID group joins we know the split
brain situation is resolved and we reset the SUPER_FLAG_DEGRADED.

> 
> This patch is on top of the patch [1] in the ML.
> [1] btrfs: handle dynamically reappearing missing device
> 
> Signed-off-by: Anand Jain 
> ---
> On top of misc-next kdave.
> v2:
>  Rename to BTRFS_SUPER_FLAG_DEGRADED.
> 
>  fs/btrfs/volumes.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 65d10f38dd99..32571f4fa72b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -858,6 +858,10 @@ static noinline int device_list_add(const char *path,
>  
>   fs_devices->missing_devices--;
>   clear_bit(BTRFS_DEV_STATE_MISSING, >dev_state);
> + if (!fs_devices->missing_devices)
> + btrfs_set_super_flags(fs_info->super_copy,
> +   
> fs_info->super_copy->flags &
> +   
> ~BTRFS_SUPER_FLAG_DEGRADED);
>  
>   if (test_bit(BTRFS_DEV_STATE_WRITEABLE,
>>dev_state) &&
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexpected raid1 behaviour

2017-12-20 Thread Tomasz Pala
Errata:

On Wed, Dec 20, 2017 at 09:34:48 +0100, Tomasz Pala wrote:

> /dev/sda -> 'not ready'
> /dev/sdb -> 'not ready'
> /dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 
> 'not ready'
> /dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 
> 'ready'

The last line should start with /dev/sdd.

> After such timeout, I'd like to tell the kernel: "no more devices, give
> me all the remaining btrfs volumes in degraded mode if possible". By

Actually "if possible" means both:
- if technically possible (i.e. required data is available, like half of RAID1),
- AND if allowed for specific volume as there might be different policies.

For example - one might allow rootfs to be started in degraded-rw mode in
order for the system to boot up, /home in degraded read-only for the
users to have access to their files and do not mount /srv degraded at all.
The failed mount can be non-critical with 'nofail' fstab flag.

> "give me btrfs vulumes" I mean "mark them as 'ready'" so the udev could
> fire it's rules. And if there would be anything for udev to distinguish
> 'ready' from 'ready-degraded' one could easily compose some notification
> scripting on top of it, including sending e-mail to sysadmin.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 06/17] btrfs-progs: lowmem check: introduce force_cow_in_new_chunk()

2017-12-20 Thread Qu Wenruo


On 2017年12月20日 16:21, Su Yue wrote:
> 
> 
> On 12/20/2017 01:41 PM, Qu Wenruo wrote:
>>
>>
>> On 2017年12月20日 12:57, Su Yue wrote:
>>> Introduce create_chunk_and_block_block_group() to allocate new chunk
>>> and corresponding block group.
>>>
>>> The new function force_cow_in_new_chunk() first allocates new chunk
>>> and records its start.
>>> Then it modifies all metadata block groups cached and full.
>>> Finally it marks the new block group uncached and unfree.
>>> In the next CoW, extents states will be updated automatically by
>>> cache_block_group().
>>>
>>> Suggested-by: Qu Wenruo 
>>> Signed-off-by: Su Yue 
>>> ---
>>>   cmds-check.c | 80
>>> 
>>>   1 file changed, 80 insertions(+)
>>>
>>> diff --git a/cmds-check.c b/cmds-check.c
>>> index d98d4bda7357..311c8a9f45e8 100644
>>> --- a/cmds-check.c
>>> +++ b/cmds-check.c
>>> @@ -10911,6 +10911,86 @@ out:
>>>   return ret;
>>>   }
>>>   +static int create_chunk_and_block_group(struct btrfs_fs_info
>>> *fs_info,
>>> +    u64 flags, u64 *start, u64 *nbytes)
>>> +{
>>> +    struct btrfs_trans_handle *trans;
>>> +    struct btrfs_root *root = fs_info->extent_root;
>>> +    int ret;
>>> +
>>> +    if ((flags & BTRFS_BLOCK_GROUP_TYPE_MASK) == 0)
>>> +    return -EINVAL;
>>> +
>>> +    trans = btrfs_start_transaction(root, 1);
>>> +    if (IS_ERR(trans)) {
>>> +    ret = PTR_ERR(trans);
>>> +    error("error starting transaction %s", strerror(-ret));
>>> +    return ret;
>>> +    }
>>> +    ret = btrfs_alloc_chunk(trans, fs_info, start, nbytes, flags);
>>> +    if (ret) {
>>> +    error("fail to allocate new chunk %s", strerror(-ret));
>>> +    goto out;
>>> +    }
>>> +    ret = btrfs_make_block_group(trans, fs_info, 0, flags,
>>> + BTRFS_FIRST_CHUNK_TREE_OBJECTID, *start, *nbytes);
>>> +    if (ret) {
>>> +    error("fail to make block group for chunk %llu %llu %s",
>>> +  *start, *nbytes, strerror(-ret));
>>> +    goto out;
>>> +    }
>>> +out:
>>> +    btrfs_commit_transaction(trans, root);
>>> +    return ret;
>>> +}
>>> +
>>> +static int force_cow_in_new_chunk(struct btrfs_fs_info *fs_info)
>>> +{
>>> +    struct btrfs_block_group_cache *bg;
>>> +    u64 start;
>>> +    u64 nbytes;
>>> +    u64 alloc_profile;
>>> +    u64 flags;
>>> +    int ret;
>>> +
>>> +    alloc_profile = (fs_info->avail_metadata_alloc_bits &
>>> + fs_info->metadata_alloc_profile);
>>> +    flags = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
>>> +    if (btrfs_fs_incompat(fs_info, MIXED_GROUPS))
>>> +    flags |= BTRFS_BLOCK_GROUP_DATA;
>>> +
>>> +    ret = create_chunk_and_block_group(fs_info, flags, ,
>>> );
>>
>> Why bother allocating a new chunk by yourself?
>>
>> Just mark all block groups full and that's all.
>>
>> Any later tree block allocation like btrfs_search_slot() with @cow = 1
>> will trigger chunk allocation automatically.
>>
> 
> Tried to let it happen but BUG_ON with -ENOSPC.

Then fix it.

It's not a normal behavior in this case.

Thanks,
Qu

> Do you mean do_chunk_alloc() in btrfs_reserve_extent() which is called
> while doing CoW?> In progs, allocation of new chunk during CoW depends 
> @root->ref_cows.
> However, @root->ref_cows should be set only if @root is root of a fs
> trees.
> 
> Thanks,
> Su
> 
>> Thanks,
>> Qu
>>
>>> +    if (!ret)
>>> +    printf("Create new chunk %llu %llu\n", start, nbytes);
>>> +    else
>>> +    goto err;
>>> +
>>> +    flags = BTRFS_BLOCK_GROUP_METADATA;
>>> +    /* Mark all metadata block groups cached and full in free space*/
>>> +    ret = modify_block_groups_cache(fs_info, flags, 1);
>>> +    if (ret)
>>> +    goto clear_bg_cache;
>>> +
>>> +    bg = btrfs_lookup_block_group(fs_info, start);
>>> +    if (!bg) {
>>> +    ret = -ENOENT;
>>> +    error("fail to look up block group %llu %llu", start, nbytes);
>>> +    goto clear_bg_cache;
>>> +    }
>>> +
>>> +    /* Clear block group cache just allocated */
>>> +    ret = modify_block_group_cache(fs_info, bg, 0);
>>> +    if (ret)
>>> +    goto clear_bg_cache;
>>> +
>>> +    return 0;
>>> +
>>> +clear_bg_cache:
>>> +    modify_block_groups_cache(fs_info, flags, 0);
>>> +err:
>>> +    return ret;
>>> +}
>>> +
>>>   static int check_extent_refs(struct btrfs_root *root,
>>>    struct cache_tree *extent_cache)
>>>   {
>>>
>>
> 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



signature.asc
Description: OpenPGP digital signature


Re: Unexpected raid1 behaviour

2017-12-20 Thread Tomasz Pala
On Tue, Dec 19, 2017 at 16:59:39 -0700, Chris Murphy wrote:

>> Sth like this? I got such problem a few months ago, my solution was
>> accepted upstream:
>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
> 
> I can't parse this commit. In particular I can't tell how long it
> waits, or what triggers the end to waiting.

The point is - it doesn't wait at all. Instead, every 'ready' btrfs
device triggers event on all the pending devices. Consider 3-device
filesystem consisting of /dev/sd[abd] with /dev/sdc being different,
standalone btrfs:

/dev/sda -> 'not ready'
/dev/sdb -> 'not ready'
/dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 'not 
ready'
/dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 
'ready'

This way all the parts of a volume are marked as ready, so systemd won't
refuse mounting using legacy device nodes like /dev/sda.


This particular solution depends on kernel returning 'btrfs ready',
which would obviously not work for degraded arrays unless the btrfs.ko
handles some 'missing' or 'mount_degraded' kernel cmdline options
_before_ actually _trying_ to mount it with -o degraded.

And there is a logical problem with this - _which_ array components
should be ignored? Consider:

volume1: /dev/sda /dev/sdb
volume2: /dev/sdc /dev/sdd-broken

If /dev/sdd is missing from the system, it would never be scanned, so
/dev/sdc would be pending. It cannot be assembled just in time of
scanning alone, because the same would happen with /dev/sda and there
would be desync with /dev/sdb, which IS available - a few moments later.

This is the place for the timeout you've mentioned - there should be
*some* decent timeout allowing all the devices to show up (udev waits
for 90 seconds by default or x-systemd.device-timeout=N from fstab).

After such timeout, I'd like to tell the kernel: "no more devices, give
me all the remaining btrfs volumes in degraded mode if possible". By
"give me btrfs vulumes" I mean "mark them as 'ready'" so the udev could
fire it's rules. And if there would be anything for udev to distinguish
'ready' from 'ready-degraded' one could easily compose some notification
scripting on top of it, including sending e-mail to sysadmin.

Is there anything that would make the kernel do the above?

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: optimize code converge mutex unlock

2017-12-20 Thread Nikolay Borisov


On 20.12.2017 08:42, Anand Jain wrote:
> No functional change rearrange the mutex_unlock.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/transaction.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 6348573e26a7..04f07144b45c 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -2265,16 +2265,13 @@ int btrfs_commit_transaction(struct 
> btrfs_trans_handle *trans)
>   }
>  
>   ret = write_all_supers(fs_info, 0);
> - if (ret) {
> - mutex_unlock(_info->tree_log_mutex);
> - goto scrub_continue;
> - }
> -
>   /*
>* the super is written, we can safely allow the tree-loggers
>* to go about their business
>*/
>   mutex_unlock(_info->tree_log_mutex);
> + if (ret)
> + goto scrub_continue;
>  
>   btrfs_finish_extent_commit(trans, fs_info);
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 06/17] btrfs-progs: lowmem check: introduce force_cow_in_new_chunk()

2017-12-20 Thread Su Yue



On 12/20/2017 01:41 PM, Qu Wenruo wrote:



On 2017年12月20日 12:57, Su Yue wrote:

Introduce create_chunk_and_block_block_group() to allocate new chunk
and corresponding block group.

The new function force_cow_in_new_chunk() first allocates new chunk
and records its start.
Then it modifies all metadata block groups cached and full.
Finally it marks the new block group uncached and unfree.
In the next CoW, extents states will be updated automatically by
cache_block_group().

Suggested-by: Qu Wenruo 
Signed-off-by: Su Yue 
---
  cmds-check.c | 80 
  1 file changed, 80 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index d98d4bda7357..311c8a9f45e8 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -10911,6 +10911,86 @@ out:
return ret;
  }
  
+static int create_chunk_and_block_group(struct btrfs_fs_info *fs_info,

+   u64 flags, u64 *start, u64 *nbytes)
+{
+   struct btrfs_trans_handle *trans;
+   struct btrfs_root *root = fs_info->extent_root;
+   int ret;
+
+   if ((flags & BTRFS_BLOCK_GROUP_TYPE_MASK) == 0)
+   return -EINVAL;
+
+   trans = btrfs_start_transaction(root, 1);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   error("error starting transaction %s", strerror(-ret));
+   return ret;
+   }
+   ret = btrfs_alloc_chunk(trans, fs_info, start, nbytes, flags);
+   if (ret) {
+   error("fail to allocate new chunk %s", strerror(-ret));
+   goto out;
+   }
+   ret = btrfs_make_block_group(trans, fs_info, 0, flags,
+BTRFS_FIRST_CHUNK_TREE_OBJECTID, *start, *nbytes);
+   if (ret) {
+   error("fail to make block group for chunk %llu %llu %s",
+ *start, *nbytes, strerror(-ret));
+   goto out;
+   }
+out:
+   btrfs_commit_transaction(trans, root);
+   return ret;
+}
+
+static int force_cow_in_new_chunk(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_block_group_cache *bg;
+   u64 start;
+   u64 nbytes;
+   u64 alloc_profile;
+   u64 flags;
+   int ret;
+
+   alloc_profile = (fs_info->avail_metadata_alloc_bits &
+fs_info->metadata_alloc_profile);
+   flags = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
+   if (btrfs_fs_incompat(fs_info, MIXED_GROUPS))
+   flags |= BTRFS_BLOCK_GROUP_DATA;
+
+   ret = create_chunk_and_block_group(fs_info, flags, , );


Why bother allocating a new chunk by yourself?

Just mark all block groups full and that's all.

Any later tree block allocation like btrfs_search_slot() with @cow = 1
will trigger chunk allocation automatically.



Tried to let it happen but BUG_ON with -ENOSPC.
Do you mean do_chunk_alloc() in btrfs_reserve_extent() which is called 
while doing CoW?

In progs, allocation of new chunk during CoW depends @root->ref_cows.
However, @root->ref_cows should be set only if @root is root of a fs
trees.

Thanks,
Su


Thanks,
Qu


+   if (!ret)
+   printf("Create new chunk %llu %llu\n", start, nbytes);
+   else
+   goto err;
+
+   flags = BTRFS_BLOCK_GROUP_METADATA;
+   /* Mark all metadata block groups cached and full in free space*/
+   ret = modify_block_groups_cache(fs_info, flags, 1);
+   if (ret)
+   goto clear_bg_cache;
+
+   bg = btrfs_lookup_block_group(fs_info, start);
+   if (!bg) {
+   ret = -ENOENT;
+   error("fail to look up block group %llu %llu", start, nbytes);
+   goto clear_bg_cache;
+   }
+
+   /* Clear block group cache just allocated */
+   ret = modify_block_group_cache(fs_info, bg, 0);
+   if (ret)
+   goto clear_bg_cache;
+
+   return 0;
+
+clear_bg_cache:
+   modify_block_groups_cache(fs_info, flags, 0);
+err:
+   return ret;
+}
+
  static int check_extent_refs(struct btrfs_root *root,
 struct cache_tree *extent_cache)
  {






--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] btrfs: handle volume split brain scenario

2017-12-20 Thread Anand Jain
In raid configs RAID1/RAID5/RAID6 it's possible to have some devices
missing which would render btrfs to be mounted in degraded state but
still be operational. In those cases it's possible (albeit highly
undesirable) that the degraded and missing parts of the filesystem are
mounted independently. When writes occur such split-brain scenarios
(caused by intentional user action) then one of the sides of the RAID
config will have to be blown away when bringing it back to the
consistent state.

Handle split-brain volumes by setting a new flag
BTRFS_SUPER_FLAG_DEGRADED if the device is mounted degraded. So we
could detect and fail the mount if all the disks contains this flag.

To reassemble a split-brain volume first mount the good disk and then
scan in the device on which new writes can be ignored, (it needs patch
btrfs: handle dynamically reappearing missing device)

Warning:  A raid1 root device, in split brain condition, would fail
to bootup to protect the arbitrary loss of data.

Signed-off-by: Anand Jain 
---
On top of misc-next kdave.
v2:
 Improve commit log.
 Rename to BTRFS_SUPER_FLAG_DEGRADED.
 Rename variables to fs_devices and device.
 In open_ctree() check for split-brain after btrfs_read_chunk_tree()

 fs/btrfs/disk-io.c  | 55 -
 include/uapi/linux/btrfs_tree.h |  1 +
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b302db90598c..e87924b7145b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -61,7 +61,8 @@
 BTRFS_HEADER_FLAG_RELOC |\
 BTRFS_SUPER_FLAG_ERROR |\
 BTRFS_SUPER_FLAG_SEEDING |\
-BTRFS_SUPER_FLAG_METADUMP)
+BTRFS_SUPER_FLAG_METADUMP|\
+BTRFS_SUPER_FLAG_DEGRADED)
 
 static const struct extent_io_ops btree_extent_io_ops;
 static void end_workqueue_fn(struct btrfs_work *work);
@@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct btrfs_fs_info 
*fs_info)
return 0;
 }
 
+bool volume_has_split_brain(struct btrfs_fs_info *fs_info)
+{
+   unsigned long devs_moved_on = 0;
+   struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+   struct list_head *head = _devices->devices;
+   struct btrfs_device *device;
+
+again:
+   list_for_each_entry(device, head, dev_list) {
+   struct buffer_head *bh;
+   struct btrfs_super_block *sb;
+
+   if (!device->devid)
+   continue;
+
+   bh = btrfs_read_dev_super(device->bdev);
+   if (IS_ERR(bh))
+   continue;
+
+   sb = (struct btrfs_super_block *)bh->b_data;
+   if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_DEGRADED)
+   devs_moved_on++;
+   brelse(bh);
+   }
+
+   fs_devices = fs_devices->seed;
+   if (fs_devices) {
+   head = _devices->devices;
+   goto again;
+   }
+
+   if (devs_moved_on == fs_info->fs_devices->total_devices)
+   return true;
+   else
+   return false;
+}
+
 int open_ctree(struct super_block *sb,
   struct btrfs_fs_devices *fs_devices,
   char *options)
@@ -2765,6 +2803,21 @@ int open_ctree(struct super_block *sb,
goto fail_tree_roots;
}
 
+   if (fs_info->fs_devices->missing_devices) {
+   btrfs_set_super_flags(fs_info->super_copy,
+ fs_info->super_copy->flags |
+ BTRFS_SUPER_FLAG_DEGRADED);
+   } else if (fs_info->super_copy->flags & BTRFS_SUPER_FLAG_DEGRADED) {
+   if (volume_has_split_brain(fs_info)) {
+   btrfs_err(fs_info,
+ "Detected 'degraded' flag on all devices");
+   goto fail_tree_roots;
+   }
+   btrfs_set_super_flags(fs_info->super_copy,
+ fs_info->super_copy->flags &
+ ~BTRFS_SUPER_FLAG_DEGRADED);
+   }
+
/*
 * keep the device that is marked to be the target device for the
 * dev_replace procedure
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 33e814ef992f..c08b9b89e285 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2057,8 +2057,13 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const 
char *device_path,
device->fs_devices->num_devices--;
device->fs_devices->total_devices--;
 
-   if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state))
+   if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) {
device->fs_devices->missing_devices--;
+   if (!device->fs_devices->missing_devices)
+   

[PATCH v2 2/2] btrfs: handle volume split brain condition for dynamically reappearing device

2017-12-20 Thread Anand Jain
When the missing device reappears and joins the RAID group, and if there
are no more missing device at the volume level, then reset the
BTRFS_SUPER_FLAG_VOL_MOVED_ON flag.

This patch is on top of the patch [1] in the ML.
[1] btrfs: handle dynamically reappearing missing device

Signed-off-by: Anand Jain 
---
On top of misc-next kdave.
v2:
 Rename to BTRFS_SUPER_FLAG_DEGRADED.

 fs/btrfs/volumes.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 65d10f38dd99..32571f4fa72b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -858,6 +858,10 @@ static noinline int device_list_add(const char *path,
 
fs_devices->missing_devices--;
clear_bit(BTRFS_DEV_STATE_MISSING, >dev_state);
+   if (!fs_devices->missing_devices)
+   btrfs_set_super_flags(fs_info->super_copy,
+ 
fs_info->super_copy->flags &
+ 
~BTRFS_SUPER_FLAG_DEGRADED);
 
if (test_bit(BTRFS_DEV_STATE_WRITEABLE,
 >dev_state) &&
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: qgroup: Deprecate the ability to manually inherit rfer/excl numbers

2017-12-20 Thread Nikolay Borisov


On 20.12.2017 07:13, Qu Wenruo wrote:
> btrfs_qgroup_inherit structure has two members, num_ref_copies and
> num_excl_copies, to info btrfs kernel modules to inherit (copy)
> rfer/excl numbers at snapshot/subvolume creation time.
> 
> Since qgroup number is already hard to maintain for multi-level qgroup
> scenario, allowing user to manually manipulate qgroup inherit is quite
> easy to screw up qgroup numbers.
> 
> Although btrfs-progs supports such inheritance specification, the
> options are hidden from user and not documented.
> So there is no need to allow user to manually specify inheritance in
> kernel.
> 
> Reported-by: Omar Sandoval 
> Reported-by: Nikolay Borisov 
> Signed-off-by: Qu Wenruo 

Overall I'm ok with the patch but I'd like to hear the opinion of other
developer about whether we can rename the fields.

> ---
> changelog:
> v2:
>   Don't return -ENOTTY, but just ignoring value set in
>   num_ref/excl_copies, suggested by Nikolay.
>   Don't modify the UAPI members, but add comment to deprecate them,
>   suggested by Omar.
>   And add Omar as first reporter.
> ---
>  fs/btrfs/qgroup.c  | 54 
> +-
>  include/uapi/linux/btrfs.h |  4 ++--
>  2 files changed, 17 insertions(+), 41 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 168fd03ca3ac..42c6b35d8d7f 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2158,9 +2158,22 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
> *trans,
>   }
>  
>   if (inherit) {
> + /*
> +  * num_excl/rfer_copies indicate how many qgroup pairs needs
> +  * to be manually inherited (copy rfer or excl from src
> +  * qgroup to dst, along with qgroup relationship added)
> +  *
> +  * Allowing user to manipulate inheritance can easily cause
> +  * problem in multi-level qgroup scenario.
> +  * And the ioctl interface is hidden in btrfs-progs for a long
> +  * time, deprecate them should not be a big problem.
> +  *
> +  * Here we just ignore any value set in num_excl/rfer_copies,
> +  * and only handles qgroup relationship specified by
> +  * @num_qgroups.
> +  */
>   i_qgroups = (u64 *)(inherit + 1);
> - nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
> -2 * inherit->num_excl_copies;
> + nums = inherit->num_qgroups;
>   for (i = 0; i < nums; ++i) {
>   srcgroup = find_qgroup_rb(fs_info, *i_qgroups);
>  
> @@ -2286,43 +2299,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
> *trans,
>   ++i_qgroups;
>   }
>  
> - for (i = 0; i <  inherit->num_ref_copies; ++i, i_qgroups += 2) {
> - struct btrfs_qgroup *src;
> - struct btrfs_qgroup *dst;
> -
> - if (!i_qgroups[0] || !i_qgroups[1])
> - continue;
> -
> - src = find_qgroup_rb(fs_info, i_qgroups[0]);
> - dst = find_qgroup_rb(fs_info, i_qgroups[1]);
> -
> - if (!src || !dst) {
> - ret = -EINVAL;
> - goto unlock;
> - }
> -
> - dst->rfer = src->rfer - level_size;
> - dst->rfer_cmpr = src->rfer_cmpr - level_size;
> - }
> - for (i = 0; i <  inherit->num_excl_copies; ++i, i_qgroups += 2) {
> - struct btrfs_qgroup *src;
> - struct btrfs_qgroup *dst;
> -
> - if (!i_qgroups[0] || !i_qgroups[1])
> - continue;
> -
> - src = find_qgroup_rb(fs_info, i_qgroups[0]);
> - dst = find_qgroup_rb(fs_info, i_qgroups[1]);
> -
> - if (!src || !dst) {
> - ret = -EINVAL;
> - goto unlock;
> - }
> -
> - dst->excl = src->excl + level_size;
> - dst->excl_cmpr = src->excl_cmpr + level_size;
> - }
> -
>  unlock:
>   spin_unlock(_info->qgroup_lock);
>  out:
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index ce615b75e855..a10342d491bc 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -80,8 +80,8 @@ struct btrfs_qgroup_limit {
>  struct btrfs_qgroup_inherit {
>   __u64   flags;
>   __u64   num_qgroups;
> - __u64   num_ref_copies;
> - __u64   num_excl_copies;
> + __u64   num_ref_copies;  /* DEPRECATED, will just be ignored */
> + __u64   num_excl_copies; /* DEPRECATED, will just be ignored */
>   struct btrfs_qgroup_limit lim;
>   __u64   qgroups[0];
>  };
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html