Re: [PATCH 11/13] btrfs-progs: check: Delete file extent item with unaligned extent backref

2018-11-06 Thread Qu Wenruo



On 2018/11/7 at 3:04 PM, Su Yanjun wrote:
> 
> 
> On 11/7/2018 2:38 PM, Qu Wenruo wrote:
>>
>> On 2018/11/7 at 2:21 PM, Su Yanjun wrote:
>>>
>>> On 10/24/2018 8:45 AM, Qu Wenruo wrote:
 On 2018/10/23 at 5:41 PM, Su Yue wrote:
> From: Su Yanjun 
>
> In original mode, if some file extent item has unaligned extent
> backref,
> fixup_extent_refs can't repair it. This patch will check extent
> alignment
> then delete file extent with unaligned extent backref.
 This looks a little strange to me.

 You mean, an unaligned FILE EXTENT has an unaligned EXTENT_ITEM?

 Then why not just delete the EXTENT_ITEM directly? No need to go back
 checking if it has a corresponding EXTENT_DATA since unaligned one is
 definitely corrupted.

 For corrupted EXTENT_DATA, it should get deleted when we check fs tree.

 This would save you a lot of code.

 Thanks,
 Qu
>>> The situation is that the file extent has wrong extent backref, actually
>>> it doesn't exist.
>> Did you mean extent EXTENT_ITEM key's objectid is unaligned?
>>
>> Would you please give an example on this case? Like:
>> (<ino> EXTENT_DATA <file offset>)
>>     disk bytenr <XXX> disk len <YYY>
>>
>> And its backref like:
>> (<XXX> EXTENT_ITEM <YYY>)
>>
>> And then mark where the number is incorrect.
>>
>> Thanks,
>> Qu
> 
> As in /btrfs-progs/tests/fsck-tests/001-bad-file-extent-bytenr case:
> 
> item 7 key (257 EXTENT_DATA 0) itemoff 3453 itemsize 53
> 
>     generation 6 type 1 (regular)
> 
>     extent data disk byte 755944791 nr 1048576
>     ^
> 
>     extent data offset 0 nr 1048576 ram 1048576
> 
>     extent compression 0 (none)

Then there is no "unaligned extent backref".

It's just an unaligned disk bytenr of a file extent.
Nothing to do with backref.
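For context, the offending value can be spotted with a plain alignment test
on the file extent's disk bytenr. A minimal sketch of such a check in
btrfs-progs style follows; the helper name is made up and this is not the
actual code of the patch under discussion, just an illustration:

	/*
	 * Illustration only: report a regular/prealloc file extent whose
	 * disk bytenr is not aligned to the sector size, e.g. 755944791
	 * in the dump quoted above.
	 */
	static int file_extent_bytenr_aligned(struct btrfs_fs_info *fs_info,
					      struct extent_buffer *leaf, int slot)
	{
		struct btrfs_file_extent_item *fi;
		u64 disk_bytenr;

		fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
		if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_INLINE)
			return 1;	/* inline extents carry no disk bytenr */

		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
		/* a hole (bytenr 0) is fine; anything else must be aligned */
		return disk_bytenr == 0 ||
		       IS_ALIGNED(disk_bytenr, fs_info->sectorsize);
	}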

Please update the commit message to avoid such confusing wording, and
include the above info, which is pretty easy to understand.

Thanks,
Qu

> 
> Thanks,
> 
> Su
> 
>>> Thanks,
>>> Su
>>>
>>>
>>>
>>>
>>>
>>
> 
> 
> 


Re: [PATCH 11/13] btrfs-progs: check: Delete file extent item with unaligned extent backref

2018-11-06 Thread Su Yanjun




On 11/7/2018 2:38 PM, Qu Wenruo wrote:


On 2018/11/7 at 2:21 PM, Su Yanjun wrote:


On 10/24/2018 8:45 AM, Qu Wenruo wrote:

On 2018/10/23 at 5:41 PM, Su Yue wrote:

From: Su Yanjun 

In original mode, if some file extent item has unaligned extent backref,
fixup_extent_refs can't repair it. This patch will check extent
alignment
then delete file extent with unaligned extent backref.

This looks a little strange to me.

You mean, an unaligned FILE EXTENT has an unaligned EXTENT_ITEM?

Then why not just delete the EXTENT_ITEM directly? No need to go back
checking if it has a corresponding EXTENT_DATA since unaligned one is
definitely corrupted.

For corrupted EXTENT_DATA, it should get deleted when we check fs tree.

This would save you a lot of code.

Thanks,
Qu

The situation is that the file extent has wrong extent backref, actually
it doesn't exist.

Did you mean extent EXTENT_ITEM key's objectid is unaligned?

Would you please give an example on this case? Like:
(<ino> EXTENT_DATA <file offset>)
disk bytenr <XXX> disk len <YYY>

And its backref like:
(<XXX> EXTENT_ITEM <YYY>)

And then mark where the number is incorrect.

Thanks,
Qu


As in /btrfs-progs/tests/fsck-tests/001-bad-file-extent-bytenr case:

item 7 key (257 EXTENT_DATA 0) itemoff 3453 itemsize 53

    generation 6 type 1 (regular)

    extent data disk byte 755944791 nr 1048576
    ^

    extent data offset 0 nr 1048576 ram 1048576

    extent compression 0 (none)

Thanks,

Su


Thanks,
Re: [PATCH 10/13] btrfs-progs: check: fix bug in find_possible_backrefs

2018-11-06 Thread Qu Wenruo



On 2018/11/7 at 2:28 PM, Su Yanjun wrote:
> 
> 
> On 10/24/2018 8:34 AM, Qu Wenruo wrote:
>>
>> On 2018/10/23 at 5:41 PM, Su Yue wrote:
>>> From: Su Yanjun 
>>>
>>> It may cost more time to search all extent data of the corresponding files,
>>> but it should not influence the total speed too much because only corrupted
>>> extent items are involved.
>> Sorry, I didn't really get the point of the modification from the commit
>> message.
>>
>> Would you please explain the purpose of this patch first?
>>
>> Thanks,
>> Qu
> 
> In find_possible_backrefs() function:
> 
> key.objectid = dback->owner;
> key.type = BTRFS_EXTENT_DATA_KEY;
> key.offset = dback->offset;//
> ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> 
> 'dback->offset' is not the offset within the file (the EXTENT_DATA key
> offset), so we'll find the wrong file extent.
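For context, a rough sketch of the relation in question, using the real
btrfs-progs accessors but a made-up helper name (this is not code from the
patch itself):

	/*
	 * Sketch only: for a data backref, the recorded offset is the file
	 * offset (the EXTENT_DATA key offset) minus the offset into the
	 * extent, so it generally differs from the key offset that the old
	 * lookup used.
	 */
	static u64 expected_data_backref_offset(struct extent_buffer *leaf,
						int slot,
						const struct btrfs_key *key)
	{
		struct btrfs_file_extent_item *fi;

		fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
		return key->offset - btrfs_file_extent_offset(leaf, fi);
	}

The replacement helper in the patch below walks the EXTENT_DATA items and
compares exactly this value against dback->offset.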

Now this makes sense.

The original commit message doesn't touch this part, and focused on the
side effect, not the problem itself.

Please update the commit message.

Thanks,
Qu

> 
> Thanks,
> 
> Su
> 
>>> Signed-off-by: Su Yanjun 
>>> ---
>>>   check/main.c | 110 ++-
>>>   1 file changed, 92 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/check/main.c b/check/main.c
>>> index bd1f322e0f12..90d9fd570287 100644
>>> --- a/check/main.c
>>> +++ b/check/main.c
>>> @@ -7015,6 +7015,89 @@ out:
>>>   return ret ? ret : nr_del;
>>>   }
>>>   +/*
>>> + * Based extent backref item, we find all file extent items in the
>>> fs tree. By
>>> + * the info we can rebuild the extent backref item
>>> + */
>>> +static int __find_possible_backrefs(struct btrfs_root *root,
>>> +    u64 owner, u64 offset, u64 bytenr, u64 *refs_ret,
>>> +    u64 *bytes_ret)
>>> +{
>>> +    int ret = 0;
>>> +    struct btrfs_path path;
>>> +    struct btrfs_key key;
>>> +    struct btrfs_key found_key;
>>> +    struct btrfs_file_extent_item *fi;
>>> +    struct extent_buffer *leaf;
>>> +    u64 backref_offset, disk_bytenr;
>>> +    int slot;
>>> +
>>> +    btrfs_init_path(&path);
>>> +
>>> +    key.objectid = owner;
>>> +    key.type = BTRFS_INODE_ITEM_KEY;
>>> +    key.offset = 0;
>>> +
>>> +    ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
>>> +    if (ret > 0)
>>> +    ret = -ENOENT;
>>> +    if (ret) {
>>> +    btrfs_release_path(&path);
>>> +    return ret;
>>> +    }
>>> +
>>> +    btrfs_release_path(&path);
>>> +
>>> +    key.objectid = owner;
>>> +    key.type = BTRFS_EXTENT_DATA_KEY;
>>> +    key.offset = 0;
>>> +
>>> +    ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
>>> +    if (ret < 0) {
>>> +    btrfs_release_path(&path);
>>> +    return ret;
>>> +    }
>>> +
>>> +    while (1) {
>>> +    leaf = path.nodes[0];
>>> +    slot = path.slots[0];
>>> +
>>> +    if (slot >= btrfs_header_nritems(leaf)) {
>>> +    ret = btrfs_next_leaf(root, &path);
>>> +    if (ret) {
>>> +    if (ret > 0)
>>> +    ret = 0;
>>> +    break;
>>> +    }
>>> +
>>> +    leaf = path.nodes[0];
>>> +    slot = path.slots[0];
>>> +    }
>>> +
>>> +    btrfs_item_key_to_cpu(leaf, &found_key, slot);
>>> +    if ((found_key.objectid != owner) ||
>>> +    (found_key.type != BTRFS_EXTENT_DATA_KEY))
>>> +    break;
>>> +
>>> +    fi = btrfs_item_ptr(leaf, slot,
>>> +    struct btrfs_file_extent_item);
>>> +
>>> +    backref_offset = found_key.offset -
>>> +    btrfs_file_extent_offset(leaf, fi);
>>> +    disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
>>> +    *bytes_ret = btrfs_file_extent_disk_num_bytes(leaf,
>>> +    fi);
>>> +    if ((disk_bytenr == bytenr) &&
>>> +    (backref_offset == offset)) {
>>> +    (*refs_ret)++;
>>> +    }
>>> +    path.slots[0]++;
>>> +    }
>>> +
>>> +    btrfs_release_path(&path);
>>> +    return ret;
>>> +}
>>> +
>>>   static int find_possible_backrefs(struct btrfs_fs_info *info,
>>>     struct btrfs_path *path,
>>>     struct cache_tree *extent_cache,
>>> @@ -7024,9 +7107,9 @@ static int find_possible_backrefs(struct
>>> btrfs_fs_info *info,
>>>   struct extent_backref *back, *tmp;
>>>   struct data_backref *dback;
>>>   struct cache_extent *cache;
>>> -    struct btrfs_file_extent_item *fi;
>>>   struct btrfs_key key;
>>>   u64 bytenr, bytes;
>>> +    u64 refs;
>>>   int ret;
>>>     rbtree_postorder_for_each_entry_safe(back, tmp,
>>> @@ -7054,24 +7137,15 @@ static int find_possible_backrefs(struct
>>> btrfs_fs_info *info,
>>>   if (IS_ERR(root))
>>>   return PTR_ERR(root);
>>>   -    key.objectid = dback->owner;
>>> -    key.type = BTRFS_EXTENT_DATA_KEY;
>>> -    key.offset = dback->offset;
>>> -    ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>>> -    if (ret) {
>>> -    btrfs_release_path(path);
>>> -    if (ret < 0)
>>> -    return 

Re: [PATCH 11/13] btrfs-progs: check: Delete file extent item with unaligned extent backref

2018-11-06 Thread Qu Wenruo



On 2018/11/7 at 2:21 PM, Su Yanjun wrote:
> 
> 
> On 10/24/2018 8:45 AM, Qu Wenruo wrote:
>>
>>> On 2018/10/23 at 5:41 PM, Su Yue wrote:
>>> From: Su Yanjun 
>>>
>>> In original mode, if some file extent item has unaligned extent backref,
>>> fixup_extent_refs can't repair it. This patch will check extent
>>> alignment
>>> then delete file extent with unaligned extent backref.
>> This looks a little strange to me.
>>
>> You mean, an unaligned FILE EXTENT has an unaligned EXTENT_ITEM?
>>
>> Then why not just delete the EXTENT_ITEM directly? No need to go back
>> checking if it has a corresponding EXTENT_DATA since unaligned one is
>> definitely corrupted.
>>
>> For corrupted EXTENT_DATA, it should get deleted when we check fs tree.
>>
>> This would save you a lot of code.
>>
>> Thanks,
>> Qu
> The situation is that the file extent has wrong extent backref, actually
> it doesn't exist.

Did you mean extent EXTENT_ITEM key's objectid is unaligned?

Would you please give an example on this case? Like:
(<ino> EXTENT_DATA <file offset>)
   disk bytenr <XXX> disk len <YYY>

And its backref like:
(<XXX> EXTENT_ITEM <YYY>)

And then mark where the number is incorrect.

Thanks,
Qu

> 
> Thanks,
> Su
> 
> 
> 
> 
> 


Re: [PATCH 10/13] btrfs-progs: check: fix bug in find_possible_backrefs

2018-11-06 Thread Su Yanjun




On 10/24/2018 8:34 AM, Qu Wenruo wrote:


On 2018/10/23 at 5:41 PM, Su Yue wrote:

From: Su Yanjun 

It may cost more time to search all extent data of the corresponding files,
but it should not influence the total speed too much because only corrupted
extent items are involved.

Sorry, I didn't really get the point of the modification from the commit
message.

Would you please explain the purpose of this patch first?

Thanks,
Qu


In find_possible_backrefs() function:

key.objectid = dback->owner;
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = dback->offset;//
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);

'dback->offset' is not the offset within the file (the EXTENT_DATA key offset), so we'll find the wrong file extent.

Thanks,

Su


Signed-off-by: Su Yanjun 
---
  check/main.c | 110 ++-
  1 file changed, 92 insertions(+), 18 deletions(-)

diff --git a/check/main.c b/check/main.c
index bd1f322e0f12..90d9fd570287 100644
--- a/check/main.c
+++ b/check/main.c
@@ -7015,6 +7015,89 @@ out:
return ret ? ret : nr_del;
  }
  
+/*

+ * Based extent backref item, we find all file extent items in the fs tree. By
+ * the info we can rebuild the extent backref item
+ */
+static int __find_possible_backrefs(struct btrfs_root *root,
+   u64 owner, u64 offset, u64 bytenr, u64 *refs_ret,
+   u64 *bytes_ret)
+{
+   int ret = 0;
+   struct btrfs_path path;
+   struct btrfs_key key;
+   struct btrfs_key found_key;
+   struct btrfs_file_extent_item *fi;
+   struct extent_buffer *leaf;
+   u64 backref_offset, disk_bytenr;
+   int slot;
+
+   btrfs_init_path(&path);
+
+   key.objectid = owner;
+   key.type = BTRFS_INODE_ITEM_KEY;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
+   if (ret > 0)
+   ret = -ENOENT;
+   if (ret) {
+   btrfs_release_path(&path);
+   return ret;
+   }
+
+   btrfs_release_path(&path);
+
+   key.objectid = owner;
+   key.type = BTRFS_EXTENT_DATA_KEY;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
+   if (ret < 0) {
+   btrfs_release_path(&path);
+   return ret;
+   }
+
+   while (1) {
+   leaf = path.nodes[0];
+   slot = path.slots[0];
+
+   if (slot >= btrfs_header_nritems(leaf)) {
+   ret = btrfs_next_leaf(root, &path);
+   if (ret) {
+   if (ret > 0)
+   ret = 0;
+   break;
+   }
+
+   leaf = path.nodes[0];
+   slot = path.slots[0];
+   }
+
+   btrfs_item_key_to_cpu(leaf, &found_key, slot);
+   if ((found_key.objectid != owner) ||
+   (found_key.type != BTRFS_EXTENT_DATA_KEY))
+   break;
+
+   fi = btrfs_item_ptr(leaf, slot,
+   struct btrfs_file_extent_item);
+
+   backref_offset = found_key.offset -
+   btrfs_file_extent_offset(leaf, fi);
+   disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+   *bytes_ret = btrfs_file_extent_disk_num_bytes(leaf,
+   fi);
+   if ((disk_bytenr == bytenr) &&
+   (backref_offset == offset)) {
+   (*refs_ret)++;
+   }
+   path.slots[0]++;
+   }
+
+   btrfs_release_path(&path);
+   return ret;
+}
+
  static int find_possible_backrefs(struct btrfs_fs_info *info,
  struct btrfs_path *path,
  struct cache_tree *extent_cache,
@@ -7024,9 +7107,9 @@ static int find_possible_backrefs(struct btrfs_fs_info 
*info,
struct extent_backref *back, *tmp;
struct data_backref *dback;
struct cache_extent *cache;
-   struct btrfs_file_extent_item *fi;
struct btrfs_key key;
u64 bytenr, bytes;
+   u64 refs;
int ret;
  
  	rbtree_postorder_for_each_entry_safe(back, tmp,

@@ -7054,24 +7137,15 @@ static int find_possible_backrefs(struct btrfs_fs_info 
*info,
if (IS_ERR(root))
return PTR_ERR(root);
  
-		key.objectid = dback->owner;

-   key.type = BTRFS_EXTENT_DATA_KEY;
-   key.offset = dback->offset;
-   ret = btrfs_search_slot(NULL, root, , path, 0, 0);
-   if (ret) {
-   btrfs_release_path(path);
-   if (ret < 0)
-   return ret;
-   /* Didn't find it, we can carry on */
-   ret = 0;
+   refs = 0;
+   bytes = 0;
+   ret = __find_possible_backrefs(root, 

Re: [PATCH 11/13] btrfs-progs: check: Delete file extent item with unaligned extent backref

2018-11-06 Thread Su Yanjun




On 10/24/2018 8:45 AM, Qu Wenruo wrote:


On 2018/10/23 at 5:41 PM, Su Yue wrote:

From: Su Yanjun 

In original mode, if some file extent item has unaligned extent backref,
fixup_extent_refs can't repair it. This patch will check extent alignment
then delete file extent with unaligned extent backref.

This looks a little strange to me.

You mean, an unaligned FILE EXTENT has an unaligned EXTENT_ITEM?

Then why not just delete the EXTENT_ITEM directly? No need to go back
checking if it has a corresponding EXTENT_DATA since unaligned one is
definitely corrupted.

For corrupted EXTENT_DATA, it should get deleted when we check fs tree.

This would save you a lot of code.

Thanks,
Qu
The situation is that the file extent has wrong extent backref, actually 
it doesn't exist.


Thanks,
Su







Re: [PATCH] btrfs: Always try all copies when reading extent buffers

2018-11-06 Thread Qu Wenruo



On 2018/11/7 at 12:07 AM, Nikolay Borisov wrote:
> 
> 
>> On 6.11.18 at 16:40, Nikolay Borisov wrote:
>> When a metadata read is served the endio routine btree_readpage_end_io_hook
>> is called which eventually runs the tree-checker. If tree-checker fails
>> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
>> leads to btree_read_extent_buffer_pages wrongly assuming that all
>> available copies of this extent buffer are wrong and failing prematurely.
>> Fix this by modifying btree_read_extent_buffer_pages to read all copies of
>> the data.
>>
>> This failure was exhibited in xfstests btrfs/124 which would
>> spuriously fail its balance operations. The reason was that when balance
>> was run following re-introduction of the missing raid1 disk
>> __btrfs_map_block would map the read request to stripe 0, which
>> corresponded to devid 2 (the disk which is being removed in the test):
>>
>> item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 
>> itemsize 112
>>  length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>>  io_align 65536 io_width 65536 sector_size 4096
>>  num_stripes 2 sub_stripes 1
>>  stripe 0 devid 2 offset 2156920832
>>  dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>>  stripe 1 devid 1 offset 3553624064
>>  dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
>>
>> This caused read requests for a checksum item to be routed to the
>> stale disk which triggered the aforementioned logic involving
>> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
>> the balance operation.
>>
>> Signed-off-by: Nikolay Borisov 
>> Suggested-by: Qu Wenruo 
>> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
>> ---
>>  fs/btrfs/disk-io.c | 11 +--
>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 00ee5e37e989..279c6dbcc736 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct 
>> btrfs_fs_info *fs_info,
>>  int mirror_num = 0;
>>  int failed_mirror = 0;
>>  
>> -clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>>  while (1) {
>> +clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>> mirror_num);
>>  if (!ret) {
> 
> Qu, 
> 
> Do you think it makes sense to do refactoring like below in
>  a follow up patch: 
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 279c6dbcc736..9891e13a2b6f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -482,16 +482,11 @@ static int btree_read_extent_buffer_pages(struct 
> btrfs_fs_info *fs_info,
> clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
> ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>mirror_num);
> -   if (!ret) {
> -   if (verify_parent_transid(io_tree, eb,
> -  parent_transid, 0))
> -   ret = -EIO;
> -   else if (verify_level_key(fs_info, eb, level,
> - first_key, parent_transid))
> -   ret = -EUCLEAN;
> -   else
> +   if (!ret &&
> +   !verify_parent_transid(io_tree, eb, parent_transid, 0) &&
> +   !verify_level_key(fs_info, eb, level, first_key,
> +parent_transid))
> break;
> -   }
> 
> 
> since the ret value doesn't really have any meaning, or perhaps the
> verify_level_key and ret = -EUCLEAN could be retained, as well as the
> if (ret == -EUCLEAN) break logic ?

Yes, that's a valid cleanup.

Thanks,
Qu

> 
>> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct 
>> btrfs_fs_info *fs_info,
>>  break;
>>  }
>>  
>> -/*
>> - * This buffer's crc is fine, but its contents are corrupted, so
>> - * there is no reason to read the other copies, they won't be
>> - * any less wrong.
>> - */
>> -if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
>> -ret == -EUCLEAN)
>> -break;
>> -
>>  num_copies = btrfs_num_copies(fs_info,
>>eb->start, eb->len);
>>  if (num_copies == 1)
>>


Re: [PATCH] btrfs: Always try all copies when reading extent buffers

2018-11-06 Thread Qu Wenruo



On 2018/11/6 at 11:14 PM, Nikolay Borisov wrote:
> 
> 
>> On 6.11.18 at 16:53, Qu Wenruo wrote:
>>
>>
>> On 2018/11/6 at 10:40 PM, Nikolay Borisov wrote:
>>> When a metadata read is served the endio routine btree_readpage_end_io_hook
>>> is called which eventually runs the tree-checker. If tree-checker fails
>>> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
>>> leads to btree_read_extent_buffer_pages wrongly assuming that all
>>> available copies of this extent buffer are wrong and failing prematurely.
>>> Fix this by modifying btree_read_extent_buffer_pages to read all copies of
>>> the data.
>>>
>>> This failure was exhibited in xfstests btrfs/124 which would
>>> spuriously fail its balance operations. The reason was that when balance
>>> was run following re-introduction of the missing raid1 disk
>>> __btrfs_map_block would map the read request to stripe 0, which
>>> corresponded to devid 2 (the disk which is being removed in the test):
>>>
>>> item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 
>>> itemsize 112
>>> length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>>> io_align 65536 io_width 65536 sector_size 4096
>>> num_stripes 2 sub_stripes 1
>>> stripe 0 devid 2 offset 2156920832
>>> dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>>> stripe 1 devid 1 offset 3553624064
>>> dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
>>>
>>> This caused read requests for a checksum item to be routed to the
>>> stale disk which triggered the aforementioned logic involving
>>> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
>>> the balance operation.
>>>
>>> Signed-off-by: Nikolay Borisov 
>>> Suggested-by: Qu Wenruo 
>>
>> Reviewed-by: Qu Wenruo 
>>
>> However there is still a tiny piece missing.
>>
>> Tree checker is done after some basic checks, including:
>> 1) bytenr
>> 2) level
>> 3) fsid
>> 4) csum
>>
>> 1~2) can be easily skipped just by pure luck.
>>
>> But 3) and especially 4) are not that easy to hit.
>> Not to mention meeting both 3) and 4), since csum range covers fsid.
>>
>> So I must say you're a really super lucky guy!
> 
> s/lucky/unlucky/ :)
> 
> I wonder if reads just land in some random memory on the stale disk
> hence it's really a matter of what's "there" so that reads fail with
> random reasons?

Even for random memory, you're still too lucky to hit it.

Thanks,
Qu

> 
>>
>> Thanks,
>> Qu
>>
>>
>>> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
>>> ---
>>>  fs/btrfs/disk-io.c | 11 +--
>>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 00ee5e37e989..279c6dbcc736 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct 
>>> btrfs_fs_info *fs_info,
>>> int mirror_num = 0;
>>> int failed_mirror = 0;
>>>  
>>> -   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>> io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>>> while (1) {
>>> +   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>> ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>>>mirror_num);
>>> if (!ret) {
>>> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct 
>>> btrfs_fs_info *fs_info,
>>> break;
>>> }
>>>  
>>> -   /*
>>> -* This buffer's crc is fine, but its contents are corrupted, so
>>> -* there is no reason to read the other copies, they won't be
>>> -* any less wrong.
>>> -*/
>>> -   if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
>>> -   ret == -EUCLEAN)
>>> -   break;
>>> -
>>> num_copies = btrfs_num_copies(fs_info,
>>>   eb->start, eb->len);
>>> if (num_copies == 1)
>>>
>>


Re: [PATCH] btrfs: Always try all copies when reading extent buffers

2018-11-06 Thread Nikolay Borisov



On 6.11.18 at 16:40, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this by modifying btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibited in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
> item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 
> itemsize 112
>   length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>   io_align 65536 io_width 65536 sector_size 4096
>   num_stripes 2 sub_stripes 1
>   stripe 0 devid 2 offset 2156920832
>   dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>   stripe 1 devid 1 offset 3553624064
>   dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Signed-off-by: Nikolay Borisov 
> Suggested-by: Qu Wenruo 
> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> ---
>  fs/btrfs/disk-io.c | 11 +--
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 00ee5e37e989..279c6dbcc736 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct 
> btrfs_fs_info *fs_info,
>   int mirror_num = 0;
>   int failed_mirror = 0;
>  
> - clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>   io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>   while (1) {
> + clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>   ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>  mirror_num);
>   if (!ret) {

Qu, 

Do you think it makes sense to do refactoring like below in
 a follow up patch: 

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 279c6dbcc736..9891e13a2b6f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -482,16 +482,11 @@ static int btree_read_extent_buffer_pages(struct 
btrfs_fs_info *fs_info,
clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
   mirror_num);
-   if (!ret) {
-   if (verify_parent_transid(io_tree, eb,
-  parent_transid, 0))
-   ret = -EIO;
-   else if (verify_level_key(fs_info, eb, level,
- first_key, parent_transid))
-   ret = -EUCLEAN;
-   else
+   if (!ret &&
+   !verify_parent_transid(io_tree, eb, parent_transid, 0) &&
+   !verify_level_key(fs_info, eb, level, first_key,
+parent_transid))
break;
-   }


since the ret value doesn't really have any meaning, or perhaps the
verify_level_key and ret = -EUCLEAN could be retained, as well as the
if (ret == -EUCLEAN) break logic ?

> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct 
> btrfs_fs_info *fs_info,
>   break;
>   }
>  
> - /*
> -  * This buffer's crc is fine, but its contents are corrupted, so
> -  * there is no reason to read the other copies, they won't be
> -  * any less wrong.
> -  */
> - if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
> - ret == -EUCLEAN)
> - break;
> -
>   num_copies = btrfs_num_copies(fs_info,
> eb->start, eb->len);
>   if (num_copies == 1)
> 


[PATCH] btrfs: fix computation of max fs size for multiple device fs tests

2018-11-06 Thread fdmanana
From: Filipe Manana 

We were sorting numerical values with the 'sort' tool without telling it
that we are sorting numbers, giving us unexpected ordering. So just pass
the '-n' option to the 'sort' tool.

Example:

$ echo -e "11\n9\n20" | sort
11
20
9

$ echo -e "11\n9\n20" | sort -n
9
11
20

Signed-off-by: Filipe Manana 
---
 tests/btrfs/124 | 2 +-
 tests/btrfs/125 | 2 +-
 tests/btrfs/154 | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tests/btrfs/124 b/tests/btrfs/124
index ce3ad6aa..a52c65f6 100755
--- a/tests/btrfs/124
+++ b/tests/btrfs/124
@@ -61,7 +61,7 @@ dev2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'`
 dev1_sz=`blockdev --getsize64 $dev1`
 dev2_sz=`blockdev --getsize64 $dev2`
 # get min of both
-max_fs_sz=`echo -e "$dev1_sz\n$dev2_sz" | sort | head -1`
+max_fs_sz=`echo -e "$dev1_sz\n$dev2_sz" | sort -n | head -1`
 # Need disks with more than 2G.
 if [ $max_fs_sz -lt 20 ]; then
_scratch_dev_pool_put
diff --git a/tests/btrfs/125 b/tests/btrfs/125
index e38de264..5ac68b67 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -68,7 +68,7 @@ dev2_sz=`blockdev --getsize64 $dev2`
 dev3_sz=`blockdev --getsize64 $dev3`
 
 # get min of both.
-max_fs_sz=`echo -e "$dev1_sz\n$dev2_sz\n$dev3_sz" | sort | head -1`
+max_fs_sz=`echo -e "$dev1_sz\n$dev2_sz\n$dev3_sz" | sort -n | head -1`
 # Need disks with more than 2G
 if [ $max_fs_sz -lt 20 ]; then
_scratch_dev_pool_put
diff --git a/tests/btrfs/154 b/tests/btrfs/154
index 99ea232a..cd6c688f 100755
--- a/tests/btrfs/154
+++ b/tests/btrfs/154
@@ -51,7 +51,7 @@ DEV1_SZ=`blockdev --getsize64 $DEV1`
 DEV2_SZ=`blockdev --getsize64 $DEV2`
 
 # get min
-MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort | head -1`
+MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort -n | head -1`
 # Need disks with more than 2G
 if [ $MAX_FS_SZ -lt 20 ]; then
_scratch_dev_pool_put
-- 
2.11.0



Re: [PATCH] btrfs: Always try all copies when reading extent buffers

2018-11-06 Thread Nikolay Borisov



On 6.11.18 at 16:53, Qu Wenruo wrote:
> 
> 
>> On 2018/11/6 at 10:40 PM, Nikolay Borisov wrote:
>> When a metadata read is served the endio routine btree_readpage_end_io_hook
>> is called which eventually runs the tree-checker. If tree-checker fails
>> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
>> leads to btree_read_extent_buffer_pages wrongly assuming that all
>> available copies of this extent buffer are wrong and failing prematurely.
>> Fix this by modifying btree_read_extent_buffer_pages to read all copies of
>> the data.
>>
>> This failure was exhibited in xfstests btrfs/124 which would
>> spuriously fail its balance operations. The reason was that when balance
>> was run following re-introduction of the missing raid1 disk
>> __btrfs_map_block would map the read request to stripe 0, which
>> corresponded to devid 2 (the disk which is being removed in the test):
>>
>> item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 
>> itemsize 112
>>  length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>>  io_align 65536 io_width 65536 sector_size 4096
>>  num_stripes 2 sub_stripes 1
>>  stripe 0 devid 2 offset 2156920832
>>  dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>>  stripe 1 devid 1 offset 3553624064
>>  dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
>>
>> This caused read requests for a checksum item to be routed to the
>> stale disk which triggered the aforementioned logic involving
>> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
>> the balance operation.
>>
>> Signed-off-by: Nikolay Borisov 
>> Suggested-by: Qu Wenruo 
> 
> Reviewed-by: Qu Wenruo 
> 
> However there is still a tiny piece missing.
> 
> Tree checker is done after some basic checks, including:
> 1) bytenr
> 2) level
> 3) fsid
> 4) csum
> 
> 1~2) can be easily skipped just by pure luck.
> 
> But 3) and especially 4) are not that easy to hit.
> Not to mention meeting both 3) and 4), since csum range covers fsid.
> 
> So I must say you're a really super lucky guy!

s/lucky/unlucky/ :)

I wonder if reads just land in some random memory on the stale disk
hence it's really a matter of what's "there" so that reads fail with
random reasons?

> 
> Thanks,
> Qu
> 
> 
>> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
>> ---
>>  fs/btrfs/disk-io.c | 11 +--
>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 00ee5e37e989..279c6dbcc736 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct 
>> btrfs_fs_info *fs_info,
>>  int mirror_num = 0;
>>  int failed_mirror = 0;
>>  
>> -clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>>  while (1) {
>> +clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>> mirror_num);
>>  if (!ret) {
>> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct 
>> btrfs_fs_info *fs_info,
>>  break;
>>  }
>>  
>> -/*
>> - * This buffer's crc is fine, but its contents are corrupted, so
>> - * there is no reason to read the other copies, they won't be
>> - * any less wrong.
>> - */
>> -if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
>> -ret == -EUCLEAN)
>> -break;
>> -
>>  num_copies = btrfs_num_copies(fs_info,
>>eb->start, eb->len);
>>  if (num_copies == 1)
>>
> 


Re: [PATCH] btrfs: Always try all copies when reading extent buffers

2018-11-06 Thread Qu Wenruo



On 2018/11/6 at 10:40 PM, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this by modifying btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibited in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
> item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 
> itemsize 112
>   length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>   io_align 65536 io_width 65536 sector_size 4096
>   num_stripes 2 sub_stripes 1
>   stripe 0 devid 2 offset 2156920832
>   dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>   stripe 1 devid 1 offset 3553624064
>   dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Signed-off-by: Nikolay Borisov 
> Suggested-by: Qu Wenruo 

Reviewed-by: Qu Wenruo 

However there is still a tiny piece missing.

Tree checker is done after some basic checks, including:
1) bytenr
2) level
3) fsid
4) csum

1~2) can be easily skipped just by pure luck.

But 3) and especially 4) are not that easy to hit.
Not to mention meeting both 3) and 4), since csum range covers fsid.
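For reference, a simplified paraphrase of that ordering as it sits in the
read endio path of disk-io.c around this kernel version; the wrapper name
is invented and this is an illustration, not verbatim kernel code:

	/* paraphrase only: the checks that run before the tree-checker */
	static int basic_read_checks(struct btrfs_fs_info *fs_info,
				     struct extent_buffer *eb)
	{
		if (btrfs_header_bytenr(eb) != eb->start)	/* 1) bytenr */
			return -EIO;
		if (check_tree_block_fsid(fs_info, eb))		/* 3) fsid */
			return -EIO;
		if (btrfs_header_level(eb) >= BTRFS_MAX_LEVEL)	/* 2) level */
			return -EIO;
		if (csum_tree_block(fs_info, eb, 1))		/* 4) csum */
			return -EIO;
		/* only after all of the above do the tree-checker helpers run */
		return 0;
	}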

So I must say you're a really super lucky guy!

Thanks,
Qu


> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> ---
>  fs/btrfs/disk-io.c | 11 +--
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 00ee5e37e989..279c6dbcc736 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct 
> btrfs_fs_info *fs_info,
>   int mirror_num = 0;
>   int failed_mirror = 0;
>  
> - clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>   io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>   while (1) {
> + clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>   ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>  mirror_num);
>   if (!ret) {
> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct 
> btrfs_fs_info *fs_info,
>   break;
>   }
>  
> - /*
> -  * This buffer's crc is fine, but its contents are corrupted, so
> -  * there is no reason to read the other copies, they won't be
> -  * any less wrong.
> -  */
> - if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
> - ret == -EUCLEAN)
> - break;
> -
>   num_copies = btrfs_num_copies(fs_info,
> eb->start, eb->len);
>   if (num_copies == 1)
> 


[PATCH] btrfs: Always try all copies when reading extent buffers

2018-11-06 Thread Nikolay Borisov
When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this by modifying btree_read_extent_buffer_pages to read all copies of
the data.

This failure was exhibited in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):

item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 
112
length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
io_align 65536 io_width 65536 sector_size 4096
num_stripes 2 sub_stripes 1
stripe 0 devid 2 offset 2156920832
dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
stripe 1 devid 1 offset 3553624064
dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca

This caused read requests for a checksum item to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.

Signed-off-by: Nikolay Borisov 
Suggested-by: Qu Wenruo 
Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
---
 fs/btrfs/disk-io.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 00ee5e37e989..279c6dbcc736 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct 
btrfs_fs_info *fs_info,
int mirror_num = 0;
int failed_mirror = 0;
 
-   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
while (1) {
+   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
   mirror_num);
if (!ret) {
@@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct 
btrfs_fs_info *fs_info,
break;
}
 
-   /*
-* This buffer's crc is fine, but its contents are corrupted, so
-* there is no reason to read the other copies, they won't be
-* any less wrong.
-*/
-   if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
-   ret == -EUCLEAN)
-   break;
-
num_copies = btrfs_num_copies(fs_info,
  eb->start, eb->len);
if (num_copies == 1)
-- 
2.17.1



Re: [PATCH 0/7] eb reference count cleanups

2018-11-06 Thread David Sterba
On Wed, Aug 15, 2018 at 06:26:50PM +0300, Nikolay Borisov wrote:
> Here is a series which simplifies the way eb are used in 
> EXTENT_BUFFER_UNMAPPED
> context. The end goal was to remove the special "if we have ref count of 2 
> and 
> EXTENT_BUFFER_UNMAPPED flag then act as if this is the last ref and free the 
> buffer" case. To enable this the first 6 patches modify call sites which 
> needlessly bump the reference count.
> 
> Patch 1 & 2 remove some btree locking when we are operating on unmapped extent
> buffers. Each patch's changelog explains why this is safe to do . 
> 
> Patch 3,4,5 and 6 remove redundant calls to extent_buffer_get since they 
> don't 
> really add any value. In all 3 cases having a reference count of 1 is 
> sufficient 
> for the eb to be freed via btrfs_release_path.
> 
> Patch 7 removes the special handling of EXTENT_BUFFER_UNMAPPED flag in 
> free_extent_buffer. Also adjust the selftest code to account for this change 
> by calling one extra time free_extent_buffer. Also document which references 
> are being dropped. All in all this shouldn't have any functional bearing. 
> 
> This was tested with multiple full xfstest runs as well as unloading the 
> btrfs 
> module after each one to trigger the leak check and ensure no eb's are 
> leaked. 
> I've also run it through btrfs' selftests multiple times with no problems. 
> 
> With this set applied EXTENT_BUFFER_UNMAPPED seems to be relevant only for 
> selftest which leads me to believe it can be removed altogether. I will 
> investigate this next but in the meantime this series should be good to go.

Besides the 8/7 patch, the rest was in for-next for a long time so I'm
merging that to misc-next, targeting 4.21. I'll do one last pass
through fstests with the full set and then update and push the branch
so there might be some delay before it appears in the public repo.
Thanks for the cleanup.


Re: [PATCH 7/7] btrfs: test balance and resize with an active swap file

2018-11-06 Thread David Sterba
On Fri, Nov 02, 2018 at 02:29:42PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Make sure we don't shrink the device past an active swap file, but allow
> shrinking otherwise, as well as growing and balance.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  tests/btrfs/177 | 64 +
>  tests/btrfs/177.out |  6 +
>  tests/btrfs/group   |  1 +
>  3 files changed, 71 insertions(+)
>  create mode 100755 tests/btrfs/177
>  create mode 100644 tests/btrfs/177.out
> 
> diff --git a/tests/btrfs/177 b/tests/btrfs/177
> new file mode 100755
> index ..12dad8fc
> --- /dev/null
> +++ b/tests/btrfs/177
> @@ -0,0 +1,64 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 Facebook.  All Rights Reserved.
> +#
> +# FS QA Test 177
> +#
> +# Test relocation (balance and resize) with an active swap file.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +. ./common/rc
> +. ./common/filter
> +. ./common/btrfs
> +
> +rm -f $seqres.full
> +
> +# Modify as appropriate.
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch_swapfile
> +
> +swapfile="$SCRATCH_MNT/swap"
> +
> +# First, create a 1GB filesystem and fill it up.
> +_scratch_mkfs_sized $((1024 * 1024 * 1024)) >> $seqres.full 2>&1
> +_scratch_mount
> +dd if=/dev/zero of="$SCRATCH_MNT/fill" bs=1024k >> $seqres.full 2>&1

Can this be replaced by fallocate? Though it's just 1GiB of data, it
still takes some seconds on rotational storage.
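
Something along these lines might do it (an untested sketch; the size would
need tuning since the test relies on the filesystem being close to full, and
$XFS_IO_PROG is the usual fstests xfs_io wrapper):

	# untested sketch: preallocate the filler instead of writing zeroes with dd
	$XFS_IO_PROG -f -c "falloc 0 900m" "$SCRATCH_MNT/fill" >> $seqres.full 2>&1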


Re: [PATCH 0/7] fstests: test Btrfs swapfile support

2018-11-06 Thread David Sterba
On Mon, Nov 05, 2018 at 12:09:31AM +0800, Eryu Guan wrote:
> On Fri, Nov 02, 2018 at 02:29:35PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > This series fixes a couple of generic swapfile tests and adds some
> > Btrfs-specific swapfile tests. Btrfs swapfile support is scheduled for
> > 4.21 [1].
> > 
> > 1: https://www.spinics.net/lists/linux-btrfs/msg83454.html
> > 
> > Thanks!
> 
> Thanks for the fixes and new tests!
> 
> > 
> > Omar Sandoval (7):
> >   generic/{472,496,497}: fix $seeqres typo
> >   generic/{472,496}: fix swap file creation on Btrfs
> 
> I've merged above two patches, they're two obvious bug fixes.
> 
> >   btrfs: test swap file activation restrictions
> >   btrfs: test invalid operations on a swap file
> >   btrfs: test swap files on multiple devices
> >   btrfs: test device add/remove/replace with an active swap file
> >   btrfs: test balance and resize with an active swap file
> 
> These tests look fine to me, but it'd be really great if btrfs folks
> could help review above tests and provide Reviewed-by tags.

All look good to me,

Reviewed-by: David Sterba 

A few nits I saw:

- the command names should not be shortened, ie. 'btrfs subvolume
  snapshot' instead of 'btrfs subvol snap'
- the test description (eg. 3/7 and 4/7) could mention which case is
  tested, eg. swapfile with compression or COW or snapshot
- 4/7 has typo 'nowcow' in a comment


Re: [PATCH] Btrfs: incremental send, fix infinite loop when apply children dir moves

2018-11-06 Thread robbieko

Hi,

I can reproduce the infinite loop; the following describes the
reason with an example.


Example:
tree --inodes parent/ send/
parent/
`-- [261]  261
`-- [271]  271
`-- [266]  266
|-- [259]  259
|-- [260]  260
|   `-- [267]  267
|-- [264]  264
|   `-- [258]  258
|   `-- [257]  257
|-- [265]  265
|-- [268]  268
|-- [269]  269
|   `-- [262]  262
|-- [270]  270
|-- [272]  272
|   |-- [263]  263
|   `-- [275]  275
`-- [274]  274
`-- [273]  273
send/
`-- [275]  275
`-- [274]  274
`-- [273]  273
`-- [262]  262
`-- [269]  269
`-- [258]  258
`-- [271]  271
`-- [268]  268
`-- [267]  267
`-- [270]  270
|-- [259]  259
|   `-- [265]  265
`-- [272]  272
`-- [257]  257
|-- [260]  260
`-- [264]  264
`-- [263]  263
`-- [261]  261
`-- [266]  266



1. While processing inode 257, we delay its rename operation because inode 272
has not been renamed (since 272 > 257, that is, beyond the current progress).


2. And so on (inodes 258-274); we get the following waiting relationships:

257 -> (wait for) 272
258 -> 269
259 -> 270
260 -> 272
261 -> 263
262 -> 274
263 -> 264
264 -> 257
265 -> 270
266 -> 263
267 -> 268
268 -> 271
269 -> 262
270 -> 271
271 -> 258
272 -> 274
274 -> 275

3. While processing inode 275, we rename ./261/271/272/275 to ./275,
and now we start processing the waiting subdirectories in
apply_children_dir_moves.


4. We first initialize the stack as an empty list, and then we add 274
to the stack because 274 is waiting for 275 to complete.
Each time, we take the first object in the stack and process it.

5. So we can observe the change of the objects in the stack.
loop:
  round  1. 274
         2. 262 -> 272
         3. 272 -> 269
         4. 269 -> 257 -> 260
         5. 257 -> 260 -> 258
         6. 260 -> 258 -> 264
         7. 258 -> 264
         8. 264 -> 271
         9. 271 -> 263
        10. 263 -> 268 -> 270
        11. 268 -> 270 -> 261 -> 266
        12. 270 -> 261 -> 266 -> 267
        13. 261 -> 266 -> 267 -> 259 -> 265  (since 270 has a path loop, we add 270 waiting for 267)
        14. 266 -> 267 -> 259 -> 265
        15. 267 -> 266 -> 259 -> 265  (since 266 has a path loop, we add 266 waiting for 270, but we don't add it to the stack)
        16. 266 -> 259 -> 265 -> 270
        17. 266 -> 259 -> 265 -> 270  (since 266 has a path loop, we add 266 waiting for 270, but we don't add it to the stack)
        18. 266 -> 259 -> 265 -> 270  (since 266 has a path loop, we add 266 waiting for 270, but we don't add it to the stack)
        19. 266 -> 259 -> 265 -> 270  (since 266 has a path loop, we add 266 waiting for 270, but we don't add it to the stack)
         ... infinite loop

6. In round 13, while processing 270, we delay its rename because 270 has
a path loop with 267, and then we add 259 and 265 to the stack, but they
are not removed from the pending_dir_moves rb_tree.


7. In round 15, while processing 266, we delay its rename because 266 has
a path loop with 270. So we look for entries whose parent_ino equals 270
in pending_dir_moves, and we find ino 259 because it was not removed from
pending_dir_moves.
Then we create a new pending_dir_move for ino 266 and link it to the one
for ino 259; because ino 259 is currently in the stack, the new entry for
ino 266 is also indirectly added to the stack, placed between 267 and 259.


So we fix this problem by removing the node from pending_dir_moves,
which avoids adding a new pending_dir_move to the stack list.
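
A minimal sketch of that shape (the struct and field names here are guesses
based on fs/btrfs/send.c, and this is an illustration of the idea, not the
actual patch):

	/*
	 * Illustration only: when an entry is pulled onto the local stack in
	 * apply_children_dir_moves(), also unlink it from the rb-tree so a
	 * later add_pending_dir_move() cannot find it again and indirectly
	 * re-queue an entry that is already on the stack.
	 */
	static void pull_onto_stack(struct send_ctx *sctx,
				    struct pending_dir_move *pm,
				    struct list_head *stack)
	{
		rb_erase(&pm->node, &sctx->pending_dir_moves);
		RB_CLEAR_NODE(&pm->node);
		list_add_tail(&pm->list, stack);
	}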

Qu Wenruo wrote on 2018-11-05 22:35:

On 2018/11/5 at 7:11 PM, Filipe Manana wrote:

On Mon, Nov 5, 2018 at 4:10 AM robbieko  wrote:


Filipe Manana 於 2018-10-30 19:36 寫到:
On Tue, Oct 30, 2018 at 7:00 AM robbieko  
wrote:


From: Robbie Ko 

In apply_children_dir_moves, we first create an empty list (stack),
then we get an entry from pending_dir_moves and add it to the 
stack,

but we didn't delete the entry from rb_tree.

So, in add_pending_dir_move, we create a new entry and then use the
parent_ino in the current rb_tree to find the corresponding entry,
and if so, add the new entry to the corresponding list.

However, the entry 

Re: btrfs partition is broken, cannot restore anything

2018-11-06 Thread Attila Vangel
Hi Qu,

I created an issue for this feature request, so that it is not lost:
https://github.com/kdave/btrfs-progs/issues/153

Thanks for the help again! Now unsubscribing from this list.

Regards,
Attila

On Tue, Nov 6, 2018 at 9:23 AM Qu Wenruo  wrote:
> On 2018/11/6 at 4:17 PM, Attila Vangel wrote:
[snip]
> > So I propose that "btrfs restore" should always print a summary line at
> > the end to make it more user friendly (e.g. for the first time users).
> > Something like this:
> >
> > x files restored in y directories[, total size of files is z <human
> > readable unit e.g. GB>]. Use -v command line option for more
> > information.
> >
> > The part in square bracket is optional / nice to have, might be useful
> > e.g. to estimate the required free space in the target filesystem.
>
> Nice idea. Maybe we could enhance btrfs-restore in that direction.
>
> Thanks,
> Qu
[snip]


Re: [PATCH v9 0/6] Btrfs: implement swap file support

2018-11-06 Thread David Sterba
On Thu, Sep 27, 2018 at 11:17:32AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> This series implements swap file support for Btrfs.
> 
> Changes from v8 [1]:
> 
> - Fixed a bug in btrfs_swap_activate() which would cause us to miss some
>   file extents if they were merged into one extent map entry.
> - Fixed build for !CONFIG_SWAP.
> - Changed all error messages to KERN_WARN.
> - Unindented long error messages.
> 
> I've Cc'd Jon and Al on patch 3 this time, so hopefully we can get an
> ack for that one, too.
> 
> Thanks!
> 
> 1: https://www.spinics.net/lists/linux-btrfs/msg82267.html
> 
> Omar Sandoval (6):
>   mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
>   mm: export add_swap_extent()
>   vfs: update swap_{,de}activate documentation
>   Btrfs: prevent ioctls from interfering with a swap file
>   Btrfs: rename get_chunk_map() and make it non-static
>   Btrfs: support swap files

fstest generic/472 reports an assertion failure. This is on the updated fstests
git (70c4067285b0bc076), though it should not matter:

[16597.002190] assertion failed: IS_ALIGNED(start, fs_info->sectorsize) && 
IS_ALIGNED(end + 1, fs_info->sectorsize), file: fs/btrfs/file-item.c, line: 319
[16597.016154] [ cut here ]
[16597.020911] kernel BUG at fs/btrfs/ctree.h:3448!
[16597.025692] invalid opcode:  [#1] PREEMPT SMP
[16597.030540] CPU: 1 PID: 2216 Comm: swapon Tainted: GW 
4.19.0-1.ge195904-vanilla+ #343
[16597.030542] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
[16597.030617] RIP: 0010:btrfs_lookup_csums_range+0x3ee/0x6e0 [btrfs]
[16597.030623] RSP: 0018:959a8210fc00 EFLAGS: 00010286
[16597.030626] RAX: 008b RBX: 959a8210fc40 RCX: 
[16597.030628] RDX:  RSI: 89c766fd60a8 RDI: 89c766fd60a8
[16597.030630] RBP: 00f1 R08: 0001 R09: 
[16597.030631] R10:  R11:  R12: 
[16597.030633] R13: 89c74f47c000 R14:  R15: 3e4f
[16597.030635] FS:  7f3d80a18700() GS:89c766e0() 
knlGS:
[16597.030637] CS:  0010 DS:  ES:  CR0: 80050033
[16597.030639] CR2: 7f3d7fa733b0 CR3: 00021e0b5000 CR4: 06e0
[16597.030640] Call Trace:
[16597.030657]  ? __mutex_unlock_slowpath+0x4b/0x2b0
[16597.142625]  ? btrfs_cross_ref_exist+0x6d/0xa0 [btrfs]
[16597.142674]  csum_exist_in_range+0x43/0xc2 [btrfs]
[16597.152938]  can_nocow_extent+0x442/0x4a0 [btrfs]
[16597.152993]  btrfs_swap_activate+0x19f/0x680 [btrfs]
[16597.162947]  __do_sys_swapon+0xdf8/0x1550
[16597.162965]  do_syscall_64+0x5c/0x1b0
[16597.162971]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[16597.162974] RIP: 0033:0x7f3d7fb1f067
[16597.198875] RSP: 002b:7ffcf9c49808 EFLAGS: 0202 ORIG_RAX: 
00a7
[16597.198877] RAX: ffda RBX:  RCX: 7f3d7fb1f067
[16597.198879] RDX: 7ffcf9c49910 RSI:  RDI: 7ffcf9c4a8f0
[16597.198884] RBP:  R08: 7f3d7fdd8c40 R09: 7f3d8081dfc0
[16597.228520] R10: 1cc0c4fd R11: 0202 R12: 004006a0
[16597.228522] R13: 7ffcf9c498f0 R14:  R15: 
[16597.311530] ---[ end trace f4dc8cabefc3e36d ]---
[16597.316318] RIP: 0010:btrfs_lookup_csums_range+0x3ee/0x6e0 [btrfs]
[16597.341828] RSP: 0018:959a8210fc00 EFLAGS: 00010286
[16597.341834] RAX: 008b RBX: 959a8210fc40 RCX: 
[16597.354588] RDX:  RSI: 89c766fd60a8 RDI: 89c766fd60a8
[16597.354590] RBP: 00f1 R08: 0001 R09: 
[16597.354592] R10:  R11:  R12: 
[16597.354593] R13: 89c74f47c000 R14:  R15: 3e4f
[16597.354595] FS:  7f3d80a18700() GS:89c766e0() 
knlGS:
[16597.354596] CS:  0010 DS:  ES:  CR0: 80050033
[16597.354598] CR2: 7f3d7fa733b0 CR3: 00021e0b5000 CR4: 06e0


Re: btrfs partition is broken, cannot restore anything

2018-11-06 Thread Qu Wenruo


On 2018/11/6 at 4:17 PM, Attila Vangel wrote:
> Hi Qu,
> 
> Thanks again for the help.
> In my case:
> 
> $ sudo btrfs check /dev/nvme0n1p2
> Opening filesystem to check...
> checksum verify failed on 18811453440 found E4E3BDB6 wanted 
> checksum verify failed on 18811453440 found E4E3BDB6 wanted 
> bad tree block 18811453440, bytenr mismatch, want=18811453440, have=0
> ERROR: cannot open file system

As expected, since the extent tree is corrupted, btrfs check can't help much.

This is definitely something we could enhance, although it will take a
long time for btrfs-progs to handle such a case, and then some super hard
kernel mount option for the kernel to do a pure RO mount.

> 
> (same with --mode=lowmem).
> 
> However "btrfs restore" seems to work! (did not have time for the
> actual run yet)
> 
> What confused me about "btrfs restore" is when I ran a dry run, it
> just output about a dozen errors about some files, and I thought
> that's it, no files can be restored.
> Today I tried it with other options, especially -v, which showed me that the
> majority of the files can be restored. I consider myself an
> intermediate GNU plus Linux user, yet given the stressful situation
> and being a first time user (btrfs n00b) I missed experimenting with
> the -v option.
> 
> So I propose that "btrfs restore" should always print a summary line at
> the end to make it more user friendly (e.g. for the first time users).
> Something like this:
> 
> x files restored in y directories[, total size of files is z <human
> readable unit e.g. GB>]. Use -v command line option for more
> information.
> 
> The part in square bracket is optional / nice to have, might be useful
> e.g. to estimate the required free space in the target filesystem.

Nice idea. Maybe we could enhance btrfs-restore in that direction.

Thanks,
Qu

> 
> Cheers,
> Attila
> 
> On Tue, Nov 6, 2018 at 2:28 AM Qu Wenruo  wrote:
>>
>>
>>
>> On 2018/11/6 at 2:01 AM, Attila Vangel wrote:
>>> Hi,
>>>
>>> TL;DR: I want to save data from my unmountable btrfs partition.
>>> I saw some commands in another thread "Salvage files from broken btrfs".
>>> I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO,
>>> btrfs-progs 4.17.1-1) to execute these commands.
>>>
>>> $ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt
>>> mount: /mnt: wrong fs type, bad option, bad superblock on
>>> /dev/nvme0n1p2, missing codepage or helper program, or other error.
>>>
>>> Corresponding lines from dmesg:
>>>
>>> [ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount 
>>> time
>>> [ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled
>>> [ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents
>>> [ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start,
>>> want 18811453440 have 0
>>> [ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: 
>>> -5
>>> [ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed
>>
>> Extent tree corrupted.
>>
>> If it's the only problem, btrfs-restore should be able to salvage data.
>>
>>>
>>> $ sudo btrfs-find-root /dev/nvme0n1p2
>>
>> No, that's not what you need.
>>
>>> Superblock thinks the generation is 220524
>>> Superblock thinks the level is 1
>>> Found tree root at 25018368 gen 220524 level 1
>>> Well block 4243456(gen: 220520 level: 1) seems good, but
>>> generation/level doesn't match, want gen: 220524 level: 1
>>> Well block 5259264(gen: 220519 level: 1) seems good, but
>>> generation/level doesn't match, want gen: 220524 level: 1
>>> Well block 4866048(gen: 220518 level: 0) seems good, but
>>> generation/level doesn't match, want gen: 220524 level: 1
>>>
>>> $ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2
>>> superblock: bytenr=65536, device=/dev/nvme0n1p2
>> [snip]
>>>
>>> If I understood correctly, somehow it is possible to use this data to
>>> parametrize btrfs restore to save the files from the partition.
>>
>> None of the output is really helpful.
>>
>> In your case, your extent tree is corrupted, thus kernel will refuse to
>> mount (even RO).
>>
>> You should run "btrfs check" on the fs to see if btrfs can check fs tree.
>> If not, then go directly to "btrfs restore".
>>
>> Thanks,
>> Qu
>>
>>> Could you please help how to do it in this case? I am not familiar
>>> with these technical terms in the outputs.
>>> Thanks in advance!
>>>
>>> Cheers,
>>> Attila
>>>
>>> On Thu, Nov 1, 2018 at 8:40 PM Attila Vangel  
>>> wrote:

 Hi,

 Somehow my btrfs partition got broken. I use Arch, so my kernel is
 quite new (4.18.x).
 I don't remember exactly the sequence of events. At some point it was
 accessible in read-only, but unfortunately I did not take backup
 immediately. dmesg log from that time:

 [ 62.602388] BTRFS warning (device nvme0n1p2): block group
 103923318784 has wrong amount of free space
 [ 62.602390] BTRFS warning (device nvme0n1p2): failed to load free
 space cache for block group 103923318784, 

Re: btrfs partition is broken, cannot restore anything

2018-11-06 Thread Attila Vangel
Hi Qu,

Thanks again for the help.
In my case:

$ sudo btrfs check /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 18811453440 found E4E3BDB6 wanted 
checksum verify failed on 18811453440 found E4E3BDB6 wanted 
bad tree block 18811453440, bytenr mismatch, want=18811453440, have=0
ERROR: cannot open file system

(same with --mode=lowmem).

However "btrfs restore" seems to work! (did not have time for the
actual run yet)

What confused me about "btrfs restore" is when I ran a dry run, it
just output about a dozen errors about some files, and I thought
that's it, no files can be restored.
Today I tried it with other options, especially -v, which showed me that the
majority of the files can be restored. I consider myself an
intermediate GNU plus Linux user, yet given the stressful situation
and being a first time user (btrfs n00b) I missed experimenting with
the -v option.
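
For anyone else hitting this, the invocation that worked here boils down to
roughly the following (a sketch; double-check the option names against
"btrfs restore --help" for your btrfs-progs version):

  # dry run first, verbose and ignoring per-file errors, to see what is recoverable
  sudo btrfs restore -D -v -i /dev/nvme0n1p2 /mnt/recovery
  # then the real run into a filesystem with enough free space
  sudo btrfs restore -v -i /dev/nvme0n1p2 /mnt/recovery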

So I propose that "btrfs restore" should always print a summary line at
the end to make it more user friendly (e.g. for the first time users).
Something like this:

x files restored in y directories[, total size of files is z <human
readable unit e.g. GB>]. Use -v command line option for more
information.

The part in square bracket is optional / nice to have, might be useful
e.g. to estimate the required free space in the target filesystem.
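
To make the proposal concrete, the summary could be as simple as something
like the following at the end of cmds-restore.c (the function and counter
names are hypothetical; nothing like this exists in btrfs-progs today, and
the counters would have to be bumped in the existing restore loops):

  #include <stdio.h>

  /* hypothetical summary printer for btrfs restore */
  static void print_restore_summary(unsigned long long files,
                                    unsigned long long dirs,
                                    unsigned long long bytes)
  {
          printf("Restored %llu files in %llu directories, %llu bytes total.\n",
                 files, dirs, bytes);
          printf("Use the -v command line option for the list of restored paths.\n");
  }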

Cheers,
Attila

On Tue, Nov 6, 2018 at 2:28 AM Qu Wenruo  wrote:
>
>
>
> On 2018/11/6 at 2:01 AM, Attila Vangel wrote:
> > Hi,
> >
> > TL;DR: I want to save data from my unmountable btrfs partition.
> > I saw some commands in another thread "Salvage files from broken btrfs".
> > I use the most recent Manjaro live (kernel: 4.19.0-3-MANJARO,
> > btrfs-progs 4.17.1-1) to execute these commands.
> >
> > $ sudo mount -o ro,nologreplay /dev/nvme0n1p2 /mnt
> > mount: /mnt: wrong fs type, bad option, bad superblock on
> > /dev/nvme0n1p2, missing codepage or helper program, or other error.
> >
> > Corresponding lines from dmesg:
> >
> > [ 1517.772302] BTRFS info (device nvme0n1p2): disabling log replay at mount 
> > time
> > [ 1517.772307] BTRFS info (device nvme0n1p2): disk space caching is enabled
> > [ 1517.772310] BTRFS info (device nvme0n1p2): has skinny extents
> > [ 1517.793414] BTRFS error (device nvme0n1p2): bad tree block start,
> > want 18811453440 have 0
> > [ 1517.793430] BTRFS error (device nvme0n1p2): failed to read block groups: 
> > -5
> > [ 1517.808619] BTRFS error (device nvme0n1p2): open_ctree failed
>
> Extent tree corrupted.
>
> If it's the only problem, btrfs-restore should be able to salvage data.
>
> >
> > $ sudo btrfs-find-root /dev/nvme0n1p2
>
> No, that's not what you need.
>
> > Superblock thinks the generation is 220524
> > Superblock thinks the level is 1
> > Found tree root at 25018368 gen 220524 level 1
> > Well block 4243456(gen: 220520 level: 1) seems good, but
> > generation/level doesn't match, want gen: 220524 level: 1
> > Well block 5259264(gen: 220519 level: 1) seems good, but
> > generation/level doesn't match, want gen: 220524 level: 1
> > Well block 4866048(gen: 220518 level: 0) seems good, but
> > generation/level doesn't match, want gen: 220524 level: 1
> >
> > $ sudo btrfs ins dump-super -Ffa /dev/nvme0n1p2
> > superblock: bytenr=65536, device=/dev/nvme0n1p2
> [snip]
> >
> > If I understood correctly, somehow it is possible to use this data to
> > parametrize btrfs restore to save the files from the partition.
>
> None of the output is really helpful.
>
> In your case, your extent tree is corrupted, thus kernel will refuse to
> mount (even RO).
>
> You should run "btrfs check" on the fs to see if btrfs can check fs tree.
> If not, then go directly to "btrfs restore".
>
> Thanks,
> Qu
>
> > Could you please help how to do it in this case? I am not familiar
> > with these technical terms in the outputs.
> > Thanks in advance!
> >
> > Cheers,
> > Attila
> >
> > On Thu, Nov 1, 2018 at 8:40 PM Attila Vangel  
> > wrote:
> >>
> >> Hi,
> >>
> >> Somehow my btrfs partition got broken. I use Arch, so my kernel is
> >> quite new (4.18.x).
> >> I don't remember exactly the sequence of events. At some point it was
> >> accessible in read-only, but unfortunately I did not take backup
> >> immediately. dmesg log from that time:
> >>
> >> [ 62.602388] BTRFS warning (device nvme0n1p2): block group
> >> 103923318784 has wrong amount of free space
> >> [ 62.602390] BTRFS warning (device nvme0n1p2): failed to load free
> >> space cache for block group 103923318784, rebuilding it now
> >> [ 108.039188] BTRFS error (device nvme0n1p2): bad tree block start 0 
> >> 18812026880
> >> [ 108.039227] BTRFS: error (device nvme0n1p2) in
> >> __btrfs_free_extent:7010: errno=-5 IO failure
> >> [ 108.039241] BTRFS info (device nvme0n1p2): forced readonly
> >> [ 108.039250] BTRFS: error (device nvme0n1p2) in
> >> btrfs_run_delayed_refs:3076: errno=-5 IO failure
> >>
> >> At the next reboot it failed to mount. Problem may have been that at
> >> some point I booted to another distro with older kernel