Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items

2017-11-27 Thread Su Yue



On 11/28/2017 12:05 PM, Qu Wenruo wrote:



On 2017年11月28日 10:38, Su Yue wrote:

Thanks for review.

On 11/27/2017 06:37 PM, Qu Wenruo wrote:



On 2017年11月27日 11:13, Su Yue wrote:

Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression which
screws up extent allocator") removes pin_metadata_blocks() from
lowmem repair.
So we have to find another way to exclude extents which should be
occupied by tree blocks.


First thing first, any tree block which is referred by will not be freed.

Unless some special case, like extent tree initialization which clears
the whole extent tree so we can't check if one tree block should be
freed using extent tree, there is no need to explicitly pin existing
tree blocks.

Their creation/freeing is already handled well.


If I understand the logic of extents correctly:

The extents in free space cache are handled in the way like
cache_block_group() in extent-tree.c.
It traverses all extents items then marks all holes free in free space
cache.

Yes, in a normal situation, extents are already handled well.
But original and lowmem check try to repair corrupted extent tree.

Suppose a situation:
1) Let an extent item of one tree block is corrupted or missed.
2) The correct extent in free space cache will be marked as free.
3) Next CoW may allocate the "used" extents from free space
    and insert an extent item.
4) Lowmem repair increases refs of the extent item and
    causes a wrong extent item.
OR
3) Lowmem repair inserts the extent item firstly.
4) CoW may allocate the "used" extents from free space.
    BUG_ON failure of inserting the extent item.


OK, this explains things, but still not optimal.

Such behavior should not happen if nothing is exposed.
Pinning down all tree blocks is never a light operation and should be
avoided if possible.



Agreed.


It would be nice to do it when starting a new transaction to modify the fs.



Now, there is only one transaction while checking chunks and extents
because of previous wrong call of pin_metadata_blocks().
It's meaningless to put it at start of new transaction now.

I will split the transaction into many minors. Then lowmem repair will
speed up if no error in chunks and extents. But it is just a little
optimization.




Please explain the reason why you need to pin extents first.


What I do in the patch is like origin mode's.
Pining extents first ensures every extents(corrupted and uncorrupted)
will not be rellocated.


Only extent tree reinit will pin down all metadata block in original
mode while normal repair won't.

So you're still doing something which is not done in original mode,
which either needs to be explained in detail or fix original mode first.



You are right. I did miss details of how original mode frees extent
records.

After come thinking, pining extents is still necessary in lowmem
repair.

Original mode has extent cache to record all extents and free normal
extents while traversing tree blocks.
After traversal, only corrupted extent records exist in the cache.
At this moment, it pins them and start transactions to repair.

Lowmem mode does not store anything like extent records. It begins
transactions(in further patch) in traversal. We can not guarantee
extents allocated from free space is occupied by one tree block which
has not been traversed.

Although pining extents is heavy and painfully work, it seems
inevitable. Otherwise, I have no better idea than pining all
extents.

Thanks,
Su



Thanks,
Su


Thanks,
Qu



Modify pin_down_tree_blocks() only for code reuse.
So behavior of pin_metadata_blocks() which works with option
'init-extent-tree' is not influenced.

Introduce exclude_blocks_and_extent_items() to mark extents of all tree
blocks dirty in fs_info->excluded_extents. It also calls
exclude_extent_items() to exclude extent_items in extent tree.

Signed-off-by: Su Yue 
---
   cmds-check.c | 122
++-
   1 file changed, 112 insertions(+), 10 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index c7b570bef9c3..f39285c5a3c2 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -13351,6 +13351,7 @@ out:
   return err;
   }
   +static int exclude_blocks_and_extent_items(struct btrfs_fs_info
*fs_info);
   /*
    * Low memory usage version check_chunks_and_extents.
    */
@@ -13363,12 +13364,22 @@ static int
check_chunks_and_extents_v2(struct btrfs_fs_info *fs_info)
   struct btrfs_root *root1;
   struct btrfs_root *root;
   struct btrfs_root *cur_root;
+    struct extent_io_tree excluded_extents;
   int err =;
   int ret;
     root =s_info->fs_root;
     if (repair) {
+    extent_io_tree_init(_extents);
+    fs_info->excluded_extents =excluded_extents;
+    ret =xclude_blocks_and_extent_items(fs_info);
+    if (ret) {
+    error("failed to exclude tree blocks and extent items");
+    return ret;
+    }
+    reset_cached_block_groups(fs_info);

[PATCH v2.1 2/2] btrfs: Ensure btrfs_trim_fs can trim the whole fs

2017-11-27 Thread Qu Wenruo
[BUG]
fstrim on some btrfs only trims the unallocated space, not trimming any
space in existing block groups.

[CAUSE]
Before fstrim_range passed to btrfs_trim_fs(), it get truncated to
range [0, super->total_bytes).
So later btrfs_trim_fs() will only be able to trim block groups in range
[0, super->total_bytes).

While for btrfs, any bytenr aligned to sector size is valid, since btrfs use
its logical address space, there is nothing limiting the location where
we put block groups.

For btrfs with routine balance, it's quite easy to relocate all
block groups and bytenr of block groups will start beyond super->total_bytes.

In that case, btrfs will not trim existing block groups.

[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
can get the unmodified range, which is normally set to [0, U64_MAX].

Reported-by: Chris Murphy 
Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim 
ioctl")
Cc:  # v4.0+
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Locate the root cause in btrfs_ioctl_fitrim(), remove the truncation
  so we can still allow user to trim custom range.
v2.1:
  Include the missing change in btrfs_trim_fs() and update the commit
  message to reflect this.
---
 fs/btrfs/extent-tree.c |  9 +
 fs/btrfs/ioctl.c   | 11 +++
 2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f830aa91ac3d..f74958e11008 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10972,14 +10972,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
struct fstrim_range *range)
int dev_ret = 0;
int ret = 0;
 
-   /*
-* try to trim all FS space, our block group may start from non-zero.
-*/
-   if (range->len == total_bytes)
-   cache = btrfs_lookup_first_block_group(fs_info, range->start);
-   else
-   cache = btrfs_lookup_block_group(fs_info, range->start);
-
+   cache = btrfs_lookup_first_block_group(fs_info, range->start);
for (; cache; cache = next_block_group(fs_info, cache)) {
if (cache->key.objectid >= (range->start + range->len)) {
btrfs_put_block_group(cache);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fd172a93d11a..017fda31400d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -365,7 +365,6 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, 
void __user *arg)
struct fstrim_range range;
u64 minlen = ULLONG_MAX;
u64 num_devices = 0;
-   u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
int ret;
 
if (!capable(CAP_SYS_ADMIN))
@@ -389,11 +388,15 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, 
void __user *arg)
return -EOPNOTSUPP;
if (copy_from_user(, arg, sizeof(range)))
return -EFAULT;
-   if (range.start > total_bytes ||
-   range.len < fs_info->sb->s_blocksize)
+
+   /*
+* NOTE: Don't truncate the range using super->total_bytes.
+* Bytenr of btrfs block group is in btrfs logical address space,
+* which can be any sector size aligned bytenr in [0, U64_MAX].
+*/
+   if (range.len < fs_info->sb->s_blocksize)
return -EINVAL;
 
-   range.len = min(range.len, total_bytes - range.start);
range.minlen = max(range.minlen, minlen);
ret = btrfs_trim_fs(fs_info, );
if (ret < 0)
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RESEND 1/2] btrfs: Enhance btrfs_trim_fs function to handle error better

2017-11-27 Thread Qu Wenruo
Function btrfs_trim_fs() doesn't handle errors in a consistent way, if
error happens when trimming existing block groups, it will skip the
remaining blocks and continue to trim unallocated space for each device.

And the return value will only reflect the final error from device
trimming.

This patch will fix such behavior by:

1) Recording first error from block group or device trimming
   So return value will also reflect any error found when trimming.
   Make developer more aware of the problem.

2) Outputting btrfs warning message for each trimming failure
   Any error for block group or device trimming will cause btrfs warning
   kernel message.

3) Continuing trimming if we can
   If we failed to trim one block group or device, we could still try
   next block group or device.

Such behavior can avoid confusion for case like failure to trim the
first block group and then only unallocated space is trimmed.

Reported-by: Chris Murphy 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 59 --
 1 file changed, 43 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 673ac4e01dd0..f830aa91ac3d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10948,6 +10948,16 @@ static int btrfs_trim_free_extents(struct btrfs_device 
*device,
return ret;
 }
 
+/*
+ * Trim the whole fs, by:
+ * 1) Trimming free space in each block group
+ * 2) Trimming unallocated space in each device
+ *
+ * Will try to continue trimming even if we failed to trim one block group or
+ * device.
+ * The return value will be the error return value of the first error.
+ * Or 0 if nothing wrong happened.
+ */
 int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
 {
struct btrfs_block_group_cache *cache = NULL;
@@ -10958,6 +10968,8 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct 
fstrim_range *range)
u64 end;
u64 trimmed = 0;
u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
+   int bg_ret = 0;
+   int dev_ret = 0;
int ret = 0;
 
/*
@@ -10968,7 +10980,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct 
fstrim_range *range)
else
cache = btrfs_lookup_block_group(fs_info, range->start);
 
-   while (cache) {
+   for (; cache; cache = next_block_group(fs_info, cache)) {
if (cache->key.objectid >= (range->start + range->len)) {
btrfs_put_block_group(cache);
break;
@@ -10982,29 +10994,36 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
struct fstrim_range *range)
if (!block_group_cache_done(cache)) {
ret = cache_block_group(cache, 0);
if (ret) {
-   btrfs_put_block_group(cache);
-   break;
+   btrfs_warn_rl(fs_info,
+   "failed to cache block group %llu ret %d",
+  cache->key.objectid, ret);
+   if (!bg_ret)
+   bg_ret = ret;
+   continue;
}
ret = wait_block_group_cache_done(cache);
if (ret) {
-   btrfs_put_block_group(cache);
-   break;
+   btrfs_warn_rl(fs_info,
+   "failed to wait cache for block group %llu ret %d",
+  cache->key.objectid, ret);
+   if (!bg_ret)
+   bg_ret = ret;
+   continue;
}
}
-   ret = btrfs_trim_block_group(cache,
-_trimmed,
-start,
-end,
-range->minlen);
+   ret = btrfs_trim_block_group(cache, _trimmed,
+   start, end, range->minlen);
 
trimmed += group_trimmed;
if (ret) {
-   btrfs_put_block_group(cache);
-   break;
+   btrfs_warn_rl(fs_info,
+   "failed to trim block group %llu ret %d",
+  cache->key.objectid, ret);
+   if (!bg_ret)
+ 

[PATCH v2 2/2] btrfs: Ensure btrfs_trim_fs can trim the whole fs

2017-11-27 Thread Qu Wenruo
[BUG]
fstrim on some btrfs only trims the unallocated space, not trimming any
space in existing block groups.

[CAUSE]
Before fstrim_range passed to btrfs_trim_fs(), it get truncated to
range [0, super->total_bytes).
So later btrfs_trim_fs() will only be able to trim block groups in range
[0, super->total_bytes).

While for btrfs, any bytenr aligned to sector size is valid, since btrfs use
its logical address space, there is nothing limiting the location where
we put block groups.

For btrfs with routine balance, it's quite easy to relocate all
block groups and bytenr of block groups will start beyond super->total_bytes.

In that case, btrfs will not trim existing block groups.

[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
can get the unmodified range, which is normally set to [0, U64_MAX].

Reported-by: Chris Murphy 
Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim 
ioctl")
Cc:  # v4.0+
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Locate the root cause in btrfs_ioctl_fitrim(), remove the truncation
  so we can still allow user to trim custom range.
---
 fs/btrfs/ioctl.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fd172a93d11a..017fda31400d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -365,7 +365,6 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, 
void __user *arg)
struct fstrim_range range;
u64 minlen = ULLONG_MAX;
u64 num_devices = 0;
-   u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
int ret;
 
if (!capable(CAP_SYS_ADMIN))
@@ -389,11 +388,15 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, 
void __user *arg)
return -EOPNOTSUPP;
if (copy_from_user(, arg, sizeof(range)))
return -EFAULT;
-   if (range.start > total_bytes ||
-   range.len < fs_info->sb->s_blocksize)
+
+   /*
+* NOTE: Don't truncate the range using super->total_bytes.
+* Bytenr of btrfs block group is in btrfs logical address space,
+* which can be any sector size aligned bytenr in [0, U64_MAX].
+*/
+   if (range.len < fs_info->sb->s_blocksize)
return -EINVAL;
 
-   range.len = min(range.len, total_bytes - range.start);
range.minlen = max(range.minlen, minlen);
ret = btrfs_trim_fs(fs_info, );
if (ret < 0)
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Nikolay Borisov


On 28.11.2017 09:03, Qu Wenruo wrote:
> 
> 
> On 2017年11月28日 14:59, Nikolay Borisov wrote:
>>
>>
>> On 28.11.2017 02:37, Qu Wenruo wrote:
>>>
>>>
>>> On 2017年11月27日 23:29, Nikolay Borisov wrote:


 On 27.11.2017 15:23, Adam Borowski wrote:
> Hi!
> On 4.15-rc1, I get the following failure:
>
> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
> extent, have 134 expect 281474976710677
>
> Repeatable every boot attempt.  4.14 and earlier boot fine; btrfs check
> (progs 4.13.3) doesn't find any badness either.

 The reason why you hit this on 4.15 is because this error is coming from 
 the tree checker. 
 So how big is the file which inode 35691 represents? 
>>>
>>> The problem is, the reporter is using CONFIG_BTRFS_FS_CHECK_INTEGRITY,
>>> which will call btrfs_check_leaf() every time btrfs_mark_buffer_dirty().
>>>
>>> However just as explained in the fixing patch:
>>> https://patchwork.kernel.org/patch/10047489/
>>>
>>> The timing of call btrfs_mark_buffer_dirty() is not right, one of the
>>> most obvious place is for EXTENT_DATA.
>>
>> Right, that was one of the things which I suspected. So in this case,
>> David, do we want to push those fixes to 4.15-rc in some of the upcoming
>> pull reqs or are we gonna have FS_INTEGRITY left broken for a release
>> (imo not a good thing) ?
> 
> BTW, compression level wild memory access fix is even more important.
> 
> Just try to mount with "-o compress" then kernel will crash.

That one is already queued in David's misc-4.15 branch ergo it will go
in a pull req this release.

> 
> Thanks,
> Qu
> 
>>
>>
>>>
>>> So please apply the above patch to see if it solves the problem.
>>>
>>> Thanks,
>>> Qu

 So the check which fails is: 

 if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START + 
  
 btrfs_file_extent_ram_bytes(leaf, fi)) {   
  
 file_extent_err(root, leaf, slot,  
  
 "invalid ram_bytes for uncompressed inline extent, have %u expect 
 %llu",
 item_size, 
 BTRFS_FILE_EXTENT_INLINE_DATA_START +
 btrfs_file_extent_ram_bytes(leaf, fi));
  
 return -EUCLEAN;   
  
 }

 BTRFS_FILE_EXTENT_INLINE_DATA_START is 21 based on pahole's output: 

 struct btrfs_file_extent_item {
__le64 generation;   /* 0 8 */
__le64 ram_bytes;/* 8 8 */
__u8   compression;  /*16 1 */
__u8   encryption;   /*17 1 */
__le16 other_encoding;   /*18 2 */
__u8   type; /*20 1 */
__le64 disk_bytenr;  /*21 8 */
__le64 disk_num_bytes;   /*29 8 */
__le64 offset;   /*37 8 */
__le64 num_bytes;/*45 8 */

/* size: 53, cachelines: 1, members: 10 */
/* last cacheline: 53 bytes */
 };

 The inline extent size is 113 so 21 + 113 should is 134 which equals to 
 what we expect. However, 
 the printing code uses btrfs_file_extent_inline_len and the tree-checker 
 code uses directly the 
 ram bytes. And this function does the correct thing according to whether 
 the file is compressed or not. 
 So Qu, perhaps the code needs to be changed or is this a genuine bug ? 

>
> [   11.347451] BTRFS info (device sda1): use lzo compression
> [   11.352914] BTRFS info (device sda1): using free space tree
> [] Activating lvm and md swap...[ ok done.
> [] Checking file systems...fsck from util-l
> [   11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 
> block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes 
> for uncompressed inline extent, have 134 expect 281474976710677
> inux 2.30.2
> [   11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 
> 103 free space 4350
> [   11.687767]item 0 key (35663 12 32909) itemoff 16263 itemsize 20
> [   11.695021]item 1 key (35663 108 0) itemoff 15751 itemsize 512
> [   11.701274]inline extent data size 509
>  ok
> [   11.705704]item 2 key (35664 1 0) itemoff 15591 itemsize 160
> [   11.713034]inode generation 1292 size 509 mode 100644
> [   11.718811]item 3 key (35664 12 32909) itemoff 15571 itemsize 20
> [   11.725113]

Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Qu Wenruo


On 2017年11月28日 14:59, Nikolay Borisov wrote:
> 
> 
> On 28.11.2017 02:37, Qu Wenruo wrote:
>>
>>
>> On 2017年11月27日 23:29, Nikolay Borisov wrote:
>>>
>>>
>>> On 27.11.2017 15:23, Adam Borowski wrote:
 Hi!
 On 4.15-rc1, I get the following failure:

 BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
 slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
 extent, have 134 expect 281474976710677

 Repeatable every boot attempt.  4.14 and earlier boot fine; btrfs check
 (progs 4.13.3) doesn't find any badness either.
>>>
>>> The reason why you hit this on 4.15 is because this error is coming from 
>>> the tree checker. 
>>> So how big is the file which inode 35691 represents? 
>>
>> The problem is, the reporter is using CONFIG_BTRFS_FS_CHECK_INTEGRITY,
>> which will call btrfs_check_leaf() every time btrfs_mark_buffer_dirty().
>>
>> However just as explained in the fixing patch:
>> https://patchwork.kernel.org/patch/10047489/
>>
>> The timing of call btrfs_mark_buffer_dirty() is not right, one of the
>> most obvious place is for EXTENT_DATA.
> 
> Right, that was one of the things which I suspected. So in this case,
> David, do we want to push those fixes to 4.15-rc in some of the upcoming
> pull reqs or are we gonna have FS_INTEGRITY left broken for a release
> (imo not a good thing) ?

BTW, compression level wild memory access fix is even more important.

Just try to mount with "-o compress" then kernel will crash.

Thanks,
Qu

> 
> 
>>
>> So please apply the above patch to see if it solves the problem.
>>
>> Thanks,
>> Qu
>>>
>>> So the check which fails is: 
>>>
>>> if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START +  
>>> btrfs_file_extent_ram_bytes(leaf, fi)) {
>>> 
>>> file_extent_err(root, leaf, slot,   
>>> 
>>> "invalid ram_bytes for uncompressed inline extent, have %u expect 
>>> %llu",
>>> item_size, 
>>> BTRFS_FILE_EXTENT_INLINE_DATA_START +
>>> btrfs_file_extent_ram_bytes(leaf, fi)); 
>>> 
>>> return -EUCLEAN;
>>> 
>>> }
>>>
>>> BTRFS_FILE_EXTENT_INLINE_DATA_START is 21 based on pahole's output: 
>>>
>>> struct btrfs_file_extent_item {
>>> __le64 generation;   /* 0 8 */
>>> __le64 ram_bytes;/* 8 8 */
>>> __u8   compression;  /*16 1 */
>>> __u8   encryption;   /*17 1 */
>>> __le16 other_encoding;   /*18 2 */
>>> __u8   type; /*20 1 */
>>> __le64 disk_bytenr;  /*21 8 */
>>> __le64 disk_num_bytes;   /*29 8 */
>>> __le64 offset;   /*37 8 */
>>> __le64 num_bytes;/*45 8 */
>>>
>>> /* size: 53, cachelines: 1, members: 10 */
>>> /* last cacheline: 53 bytes */
>>> };
>>>
>>> The inline extent size is 113 so 21 + 113 should is 134 which equals to 
>>> what we expect. However, 
>>> the printing code uses btrfs_file_extent_inline_len and the tree-checker 
>>> code uses directly the 
>>> ram bytes. And this function does the correct thing according to whether 
>>> the file is compressed or not. 
>>> So Qu, perhaps the code needs to be changed or is this a genuine bug ? 
>>>

 [   11.347451] BTRFS info (device sda1): use lzo compression
 [   11.352914] BTRFS info (device sda1): using free space tree
 [] Activating lvm and md swap...[ ok done.
 [] Checking file systems...fsck from util-l
 [   11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 
 block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for 
 uncompressed inline extent, have 134 expect 281474976710677
 inux 2.30.2
 [   11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 
 free space 4350
 [   11.687767] item 0 key (35663 12 32909) itemoff 16263 itemsize 20
 [   11.695021] item 1 key (35663 108 0) itemoff 15751 itemsize 512
 [   11.701274] inline extent data size 509
  ok
 [   11.705704] item 2 key (35664 1 0) itemoff 15591 itemsize 160
 [   11.713034] inode generation 1292 size 509 mode 100644
 [   11.718811] item 3 key (35664 12 32909) itemoff 15571 itemsize 20
 [   11.725113] item 4 key (35664 108 0) itemoff 15059 itemsize 512
 [   11.732168] inline extent data size 509
 done.
 [   11.736280] item 5 key (35665 1 0) itemoff 14899 itemsize 160
 [   11.742681] inode generation 1292 

Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Nikolay Borisov


On 28.11.2017 02:37, Qu Wenruo wrote:
> 
> 
> On 2017年11月27日 23:29, Nikolay Borisov wrote:
>>
>>
>> On 27.11.2017 15:23, Adam Borowski wrote:
>>> Hi!
>>> On 4.15-rc1, I get the following failure:
>>>
>>> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
>>> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
>>> extent, have 134 expect 281474976710677
>>>
>>> Repeatable every boot attempt.  4.14 and earlier boot fine; btrfs check
>>> (progs 4.13.3) doesn't find any badness either.
>>
>> The reason why you hit this on 4.15 is because this error is coming from the 
>> tree checker. 
>> So how big is the file which inode 35691 represents? 
> 
> The problem is, the reporter is using CONFIG_BTRFS_FS_CHECK_INTEGRITY,
> which will call btrfs_check_leaf() every time btrfs_mark_buffer_dirty().
> 
> However just as explained in the fixing patch:
> https://patchwork.kernel.org/patch/10047489/
> 
> The timing of call btrfs_mark_buffer_dirty() is not right, one of the
> most obvious place is for EXTENT_DATA.

Right, that was one of the things which I suspected. So in this case,
David, do we want to push those fixes to 4.15-rc in some of the upcoming
pull reqs or are we gonna have FS_INTEGRITY left broken for a release
(imo not a good thing) ?


> 
> So please apply the above patch to see if it solves the problem.
> 
> Thanks,
> Qu
>>
>> So the check which fails is: 
>>
>> if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START +  
>> btrfs_file_extent_ram_bytes(leaf, fi)) { 
>>
>> file_extent_err(root, leaf, slot,
>>
>> "invalid ram_bytes for uncompressed inline extent, have %u expect 
>> %llu",
>> item_size, 
>> BTRFS_FILE_EXTENT_INLINE_DATA_START +
>> btrfs_file_extent_ram_bytes(leaf, fi));  
>>
>> return -EUCLEAN; 
>>
>> }
>>
>> BTRFS_FILE_EXTENT_INLINE_DATA_START is 21 based on pahole's output: 
>>
>> struct btrfs_file_extent_item {
>>  __le64 generation;   /* 0 8 */
>>  __le64 ram_bytes;/* 8 8 */
>>  __u8   compression;  /*16 1 */
>>  __u8   encryption;   /*17 1 */
>>  __le16 other_encoding;   /*18 2 */
>>  __u8   type; /*20 1 */
>>  __le64 disk_bytenr;  /*21 8 */
>>  __le64 disk_num_bytes;   /*29 8 */
>>  __le64 offset;   /*37 8 */
>>  __le64 num_bytes;/*45 8 */
>>
>>  /* size: 53, cachelines: 1, members: 10 */
>>  /* last cacheline: 53 bytes */
>> };
>>
>> The inline extent size is 113 so 21 + 113 should is 134 which equals to what 
>> we expect. However, 
>> the printing code uses btrfs_file_extent_inline_len and the tree-checker 
>> code uses directly the 
>> ram bytes. And this function does the correct thing according to whether the 
>> file is compressed or not. 
>> So Qu, perhaps the code needs to be changed or is this a genuine bug ? 
>>
>>>
>>> [   11.347451] BTRFS info (device sda1): use lzo compression
>>> [   11.352914] BTRFS info (device sda1): using free space tree
>>> [] Activating lvm and md swap...[ ok done.
>>> [] Checking file systems...fsck from util-l
>>> [   11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 
>>> block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for 
>>> uncompressed inline extent, have 134 expect 281474976710677
>>> inux 2.30.2
>>> [   11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 
>>> free space 4350
>>> [   11.687767]  item 0 key (35663 12 32909) itemoff 16263 itemsize 20
>>> [   11.695021]  item 1 key (35663 108 0) itemoff 15751 itemsize 512
>>> [   11.701274]  inline extent data size 509
>>>  ok
>>> [   11.705704]  item 2 key (35664 1 0) itemoff 15591 itemsize 160
>>> [   11.713034]  inode generation 1292 size 509 mode 100644
>>> [   11.718811]  item 3 key (35664 12 32909) itemoff 15571 itemsize 20
>>> [   11.725113]  item 4 key (35664 108 0) itemoff 15059 itemsize 512
>>> [   11.732168]  inline extent data size 509
>>> done.
>>> [   11.736280]  item 5 key (35665 1 0) itemoff 14899 itemsize 160
>>> [   11.742681]  inode generation 1292 size 457 mode 100644
>>> [   11.748275]  item 6 key (35665 12 32909) itemoff 14879 itemsize 20
>>> [   11.754584]  item 7 key (35665 108 0) itemoff 14411 itemsize 468
>>> [   11.760674]  inline extent data size 457
>>> [   11.764780]  item 8 key (35666 1 0) 

Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items

2017-11-27 Thread Qu Wenruo


On 2017年11月28日 10:38, Su Yue wrote:
> Thanks for review.
> 
> On 11/27/2017 06:37 PM, Qu Wenruo wrote:
>>
>>
>> On 2017年11月27日 11:13, Su Yue wrote:
>>> Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression which
>>> screws up extent allocator") removes pin_metadata_blocks() from
>>> lowmem repair.
>>> So we have to find another way to exclude extents which should be
>>> occupied by tree blocks.
>>
>> First thing first, any tree block which is referred by will not be freed.
>>
>> Unless some special case, like extent tree initialization which clears
>> the whole extent tree so we can't check if one tree block should be
>> freed using extent tree, there is no need to explicitly pin existing
>> tree blocks.
>>
>> Their creation/freeing is already handled well.
>>
> If I understand the logic of extents correctly:
> 
> The extents in free space cache are handled in the way like
> cache_block_group() in extent-tree.c.
> It traverses all extents items then marks all holes free in free space
> cache.
> 
> Yes, in a normal situation, extents are already handled well.
> But original and lowmem check try to repair corrupted extent tree.
> 
> Suppose a situation:
> 1) Let an extent item of one tree block is corrupted or missed.
> 2) The correct extent in free space cache will be marked as free.
> 3) Next CoW may allocate the "used" extents from free space
>    and insert an extent item.
> 4) Lowmem repair increases refs of the extent item and
>    causes a wrong extent item.
> OR
> 3) Lowmem repair inserts the extent item firstly.
> 4) CoW may allocate the "used" extents from free space.
>    BUG_ON failure of inserting the extent item.

OK, this explains things, but still not optimal.

Such behavior should not happen if nothing is exposed.
Pinning down all tree blocks is never a light operation and should be
avoided if possible.

It would be nice to do it when starting a new transaction to modify the fs.

> 
>> Please explain the reason why you need to pin extents first.
>>
> What I do in the patch is like origin mode's.
> Pining extents first ensures every extents(corrupted and uncorrupted)
> will not be rellocated.

Only extent tree reinit will pin down all metadata block in original
mode while normal repair won't.

So you're still doing something which is not done in original mode,
which either needs to be explained in detail or fix original mode first.

> 
> Thanks,
> Su
> 
>> Thanks,
>> Qu
>>
>>>
>>> Modify pin_down_tree_blocks() only for code reuse.
>>> So behavior of pin_metadata_blocks() which works with option
>>> 'init-extent-tree' is not influenced.
>>>
>>> Introduce exclude_blocks_and_extent_items() to mark extents of all tree
>>> blocks dirty in fs_info->excluded_extents. It also calls
>>> exclude_extent_items() to exclude extent_items in extent tree.
>>>
>>> Signed-off-by: Su Yue 
>>> ---
>>>   cmds-check.c | 122
>>> ++-
>>>   1 file changed, 112 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/cmds-check.c b/cmds-check.c
>>> index c7b570bef9c3..f39285c5a3c2 100644
>>> --- a/cmds-check.c
>>> +++ b/cmds-check.c
>>> @@ -13351,6 +13351,7 @@ out:
>>>   return err;
>>>   }
>>>   +static int exclude_blocks_and_extent_items(struct btrfs_fs_info
>>> *fs_info);
>>>   /*
>>>    * Low memory usage version check_chunks_and_extents.
>>>    */
>>> @@ -13363,12 +13364,22 @@ static int
>>> check_chunks_and_extents_v2(struct btrfs_fs_info *fs_info)
>>>   struct btrfs_root *root1;
>>>   struct btrfs_root *root;
>>>   struct btrfs_root *cur_root;
>>> +    struct extent_io_tree excluded_extents;
>>>   int err =;
>>>   int ret;
>>>     root =s_info->fs_root;
>>>     if (repair) {
>>> +    extent_io_tree_init(_extents);
>>> +    fs_info->excluded_extents =excluded_extents;
>>> +    ret =xclude_blocks_and_extent_items(fs_info);
>>> +    if (ret) {
>>> +    error("failed to exclude tree blocks and extent items");
>>> +    return ret;
>>> +    }
>>> +    reset_cached_block_groups(fs_info);
>>> +
>>>   trans =trfs_start_transaction(fs_info->extent_root, 1);
>>>   if (IS_ERR(trans)) {
>>>   error("failed to start transaction before check");
>>> @@ -13437,6 +13448,8 @@ out:
>>>   err |=et;
>>>   else
>>>   err &=BG_ACCOUNTING_ERROR;
>>> +    extent_io_tree_cleanup(_extents);
>>> +    fs_info->excluded_extents =ULL;
>>>   }
>>>     if (trans)
>>> @@ -13534,40 +13547,106 @@ init:
>>>   return 0;
>>>   }
>>>   -static int pin_down_tree_blocks(struct btrfs_fs_info *fs_info,
>>> -    struct extent_buffer *eb, int tree_root)
>>> +static int exclude_extent_items(struct btrfs_fs_info *fs_info,
>>> +    struct extent_io_tree *tree)
>>> +{
>>> +    struct btrfs_root *root =s_info->extent_root;
>>> +    struct btrfs_key key;
>>> +    struct btrfs_path path;
>>> +    struct 

[PATCH] btrfs: add helper for device path or missing

2017-11-27 Thread Anand Jain
This patch creates a helper function to get either the rcu device path
or missing.

Signed-off-by: Anand Jain 
---
 fs/btrfs/dev-replace.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 7c655f9a7a50..4b6ceb38cb5f 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -304,6 +304,11 @@ void btrfs_after_dev_replace_commit(struct btrfs_fs_info 
*fs_info)
dev_replace->cursor_left_last_write_of_item;
 }
 
+static inline char *dev_missing_or_rcu_str(struct btrfs_device *device)
+{
+   return device->missing ? "" : rcu_str_deref(device->name);
+}
+
 int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
const char *tgtdev_name, u64 srcdevid, const char *srcdev_name,
int read_src)
@@ -363,8 +368,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 
btrfs_info_in_rcu(fs_info,
  "dev_replace from %s (devid %llu) to %s started",
- src_device->missing ? "" :
-   rcu_str_deref(src_device->name),
+ dev_missing_or_rcu_str(src_device),
  src_device->devid,
  rcu_str_deref(tgt_device->name));
 
@@ -538,8 +542,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info 
*fs_info,
} else {
btrfs_err_in_rcu(fs_info,
 "btrfs_scrub_dev(%s, %llu, %s) failed %d",
-src_device->missing ? "" :
-rcu_str_deref(src_device->name),
+dev_missing_or_rcu_str(src_device),
 src_device->devid,
 rcu_str_deref(tgt_device->name), scrub_ret);
btrfs_dev_replace_unlock(dev_replace, 1);
@@ -557,8 +560,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info 
*fs_info,
 
btrfs_info_in_rcu(fs_info,
  "dev_replace from %s (devid %llu) to %s finished",
- src_device->missing ? "" :
- rcu_str_deref(src_device->name),
+ dev_missing_or_rcu_str(src_device),
  src_device->devid,
  rcu_str_deref(tgt_device->name));
tgt_device->is_tgtdev_for_dev_replace = 0;
@@ -814,12 +816,10 @@ static int btrfs_dev_replace_kthread(void *data)
progress = btrfs_dev_replace_progress(fs_info);
progress = div_u64(progress, 10);
btrfs_info_in_rcu(fs_info,
-   "continuing dev_replace from %s (devid %llu) to %s @%u%%",
-   dev_replace->srcdev->missing ? ""
-   : rcu_str_deref(dev_replace->srcdev->name),
+   "continuing dev_replace from %s (devid %llu) to target %s 
@%u%%",
+   dev_missing_or_rcu_str(dev_replace->srcdev),
dev_replace->srcdev->devid,
-   dev_replace->tgtdev ? rcu_str_deref(dev_replace->tgtdev->name)
-   : "",
+   dev_missing_or_rcu_str(dev_replace->tgtdev),
(unsigned int)progress);
 
btrfs_dev_replace_continue_on_mount(fs_info);
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Adam Borowski
On Tue, Nov 28, 2017 at 08:51:07AM +0800, Qu Wenruo wrote:
> On 2017年11月27日 22:22, David Sterba wrote:
> > On Mon, Nov 27, 2017 at 02:23:49PM +0100, Adam Borowski wrote:
> >> On 4.15-rc1, I get the following failure:
> >>
> >> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
> >> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
> >> extent, have 134 expect 281474976710677
> > 
> > By a quick look at suspiciously large number
> > 
> > hex(281474976710677) = 0x10015
> > 
> > may be a bitflip, but 0x15 does not match 134, so there could be
> > something else involved in the corruption.
> 
> That's a known bug, fixed by that patch which is not merged yet.
> 
> https://patchwork.kernel.org/patch/10047489/

This helped, thanks!


喵!
-- 
⢀⣴⠾⠻⢶⣦⠀ Mozilla's Hippocritical Oath: "Keep trackers off your trail"
⣾⠁⢰⠒⠀⣿⡁ blah blah evading "tracking technology" blah blah
⢿⡄⠘⠷⠚⠋⠀ "https://click.e.mozilla.org/?qs=e7bb0dcf14b1013fca3820...;
⠈⠳⣄ (same for all links)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Qu Wenruo


On 2017年11月27日 22:22, David Sterba wrote:
> On Mon, Nov 27, 2017 at 02:23:49PM +0100, Adam Borowski wrote:
>> Hi!
>> On 4.15-rc1, I get the following failure:
>>
>> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
>> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
>> extent, have 134 expect 281474976710677
> 
> By a quick look at suspiciously large number
> 
> hex(281474976710677) = 0x10015
> 
> may be a bitflip, but 0x15 does not match 134, so there could be
> something else involved in the corruption.

That's a known bug, fixed by that patch which is not merged yet.

https://patchwork.kernel.org/patch/10047489/

Thanks,
Qu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v4 1/4] btrfs-progs: mkfs: Enhance minimal device size calculation to fix mkfs failure on small file

2017-11-27 Thread Qu Wenruo


On 2017年11月28日 07:21, David Sterba wrote:
> On Wed, Nov 01, 2017 at 09:30:40AM +0800, Qu Wenruo wrote:
>> +u64 btrfs_min_dev_size(u32 nodesize, int mixed, u64 meta_profile,
>> +   u64 data_profile)
>> +{
>> +u64 reserved = 0;
>> +u64 meta_size;
>> +u64 data_size;
>> +
>> +if (mixed)
>> +return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE +
>> +btrfs_min_global_blk_rsv_size(nodesize));
>> +
>> +/*
>> + * minimal size calculation is complex due to several factors:
>> + * 1) Temporary chunk resue
>> + *If specified chunk profile is SINGLE, we can resue
>> + *temporary chunks, no need to alloc new chunks.
>> + *
>> + * 2) Different minimal chunk size for different profile
>> + *For initial sys chunk, chunk size is fixed to 4M.
>> + *For single profile, minimal chunk size is 8M for all.
>> + *For other profiles, minimal chunk and stripe size
>> + *differs from 8M to 64M.
>> + *
>> + * To calculate it a little easier, here we assume we don't
>> + * reuse any temporary chunk, and calculate the size all
>> + * by ourselves.
>> + *
>> + * Temporary chunks sizes are always fixed:
>> + * One initial sys chunk, one SINGLE meta, and one SINGLE data.
>> + * The latter two are all 8M, accroding to @calc_size of
>> + * btrfs_alloc_chunk().
>> + */
>> +reserved += BTRFS_MKFS_SYSTEM_GROUP_SIZE + SZ_8M * 2;
>> +
>> +/*
>> + * For real chunks, we need to refer different sizes:
>> + * For SINGLE, it's still fixed to 8M (@calc_size).
>> + * For other profiles, refer to max(@min_stripe_size, @calc_size).
>> + *
>> + * And use the stripe size to calculate its physical used space.
>> + */
>> +if (meta_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK)
>> +meta_size = SZ_8M + SZ_32M;
>> +else
>> +meta_size = SZ_8M + SZ_8M;
>> +/* Only metadata put 2 stripes into one disk */
> 
> 'mkfs.btrfs -d dup' also works, so I think this should be handled here
> the same way as metadata, right?

Just a few lines under.
> 
>> +if (meta_profile & BTRFS_BLOCK_GROUP_DUP)
>> +meta_size *= 2;
>> +reserved += meta_size;
>> +
>> +if (data_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK)
>> +data_size = SZ_64M;
>> +else
>> +data_size = SZ_8M;
>> +if (data_profile & BTRFS_BLOCK_GROUP_DUP)
>> +data_size *= 2;
Here data DUP is also handled.

So the problem is about the word "Only".

Do I need to re-submit the whole patchset or just a separate patch to
update the comment?

Thanks,
Qu

>> +reserved += data_size;
>> +return reserved;
>> +}
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature


Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Qu Wenruo


On 2017年11月27日 23:29, Nikolay Borisov wrote:
> 
> 
> On 27.11.2017 15:23, Adam Borowski wrote:
>> Hi!
>> On 4.15-rc1, I get the following failure:
>>
>> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
>> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
>> extent, have 134 expect 281474976710677
>>
>> Repeatable every boot attempt.  4.14 and earlier boot fine; btrfs check
>> (progs 4.13.3) doesn't find any badness either.
> 
> The reason why you hit this on 4.15 is because this error is coming from the 
> tree checker. 
> So how big is the file which inode 35691 represents? 

The problem is, the reporter is using CONFIG_BTRFS_FS_CHECK_INTEGRITY,
which will call btrfs_check_leaf() every time btrfs_mark_buffer_dirty().

However just as explained in the fixing patch:
https://patchwork.kernel.org/patch/10047489/

The timing of call btrfs_mark_buffer_dirty() is not right, one of the
most obvious place is for EXTENT_DATA.

So please apply the above patch to see if it solves the problem.

Thanks,
Qu
> 
> So the check which fails is: 
> 
> if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START +  
> btrfs_file_extent_ram_bytes(leaf, fi)) {  
>   
> file_extent_err(root, leaf, slot, 
>   
> "invalid ram_bytes for uncompressed inline extent, have %u expect 
> %llu",
> item_size, 
> BTRFS_FILE_EXTENT_INLINE_DATA_START +
> btrfs_file_extent_ram_bytes(leaf, fi));   
>   
> return -EUCLEAN;  
>   
> }
> 
> BTRFS_FILE_EXTENT_INLINE_DATA_START is 21 based on pahole's output: 
> 
> struct btrfs_file_extent_item {
>   __le64 generation;   /* 0 8 */
>   __le64 ram_bytes;/* 8 8 */
>   __u8   compression;  /*16 1 */
>   __u8   encryption;   /*17 1 */
>   __le16 other_encoding;   /*18 2 */
>   __u8   type; /*20 1 */
>   __le64 disk_bytenr;  /*21 8 */
>   __le64 disk_num_bytes;   /*29 8 */
>   __le64 offset;   /*37 8 */
>   __le64 num_bytes;/*45 8 */
> 
>   /* size: 53, cachelines: 1, members: 10 */
>   /* last cacheline: 53 bytes */
> };
> 
> The inline extent size is 113 so 21 + 113 should is 134 which equals to what 
> we expect. However, 
> the printing code uses btrfs_file_extent_inline_len and the tree-checker code 
> uses directly the 
> ram bytes. And this function does the correct thing according to whether the 
> file is compressed or not. 
> So Qu, perhaps the code needs to be changed or is this a genuine bug ? 
> 
>>
>> [   11.347451] BTRFS info (device sda1): use lzo compression
>> [   11.352914] BTRFS info (device sda1): using free space tree
>> [] Activating lvm and md swap...[ ok done.
>> [] Checking file systems...fsck from util-l
>> [   11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 
>> block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for 
>> uncompressed inline extent, have 134 expect 281474976710677
>> inux 2.30.2
>> [   11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 
>> free space 4350
>> [   11.687767]   item 0 key (35663 12 32909) itemoff 16263 itemsize 20
>> [   11.695021]   item 1 key (35663 108 0) itemoff 15751 itemsize 512
>> [   11.701274]   inline extent data size 509
>>  ok
>> [   11.705704]   item 2 key (35664 1 0) itemoff 15591 itemsize 160
>> [   11.713034]   inode generation 1292 size 509 mode 100644
>> [   11.718811]   item 3 key (35664 12 32909) itemoff 15571 itemsize 20
>> [   11.725113]   item 4 key (35664 108 0) itemoff 15059 itemsize 512
>> [   11.732168]   inline extent data size 509
>> done.
>> [   11.736280]   item 5 key (35665 1 0) itemoff 14899 itemsize 160
>> [   11.742681]   inode generation 1292 size 457 mode 100644
>> [   11.748275]   item 6 key (35665 12 32909) itemoff 14879 itemsize 20
>> [   11.754584]   item 7 key (35665 108 0) itemoff 14411 itemsize 468
>> [   11.760674]   inline extent data size 457
>> [   11.764780]   item 8 key (35666 1 0) itemoff 14251 itemsize 160
>> [   11.770711]   inode generation 1292 size 533 mode 100644
>> []
>> [   11.776145]   item 9 key (35666 12 32909) itemoff 14231 itemsize 20
>> Cleaning up temp
>> [   11.783069]   item 10 key (35666 108 0) itemoff 13697 itemsize 534
>> orary files...
>> [   11.790666]   inline extent data size 533
>> [   

Re: [PATCH v4 3/4] btrfs-progs: test/common: Enhance prepare_test_dev to reset device size

2017-11-27 Thread David Sterba
On Wed, Nov 01, 2017 at 09:30:42AM +0800, Qu Wenruo wrote:
> So prepare_test_dev() can be called several times in one test case, to
> test different device sizes.
> 
> Signed-off-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 2/4] btrfs-progs: test/common: Introduce run_mustfail_stdout

2017-11-27 Thread David Sterba
On Wed, Nov 01, 2017 at 09:30:41AM +0800, Qu Wenruo wrote:
> For later test case which needs info from stderr.
> 
> Signed-off-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 1/4] btrfs-progs: mkfs: Enhance minimal device size calculation to fix mkfs failure on small file

2017-11-27 Thread David Sterba
On Wed, Nov 01, 2017 at 09:30:40AM +0800, Qu Wenruo wrote:
> +u64 btrfs_min_dev_size(u32 nodesize, int mixed, u64 meta_profile,
> +u64 data_profile)
> +{
> + u64 reserved = 0;
> + u64 meta_size;
> + u64 data_size;
> +
> + if (mixed)
> + return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE +
> + btrfs_min_global_blk_rsv_size(nodesize));
> +
> + /*
> +  * minimal size calculation is complex due to several factors:
> +  * 1) Temporary chunk resue
> +  *If specified chunk profile is SINGLE, we can resue
> +  *temporary chunks, no need to alloc new chunks.
> +  *
> +  * 2) Different minimal chunk size for different profile
> +  *For initial sys chunk, chunk size is fixed to 4M.
> +  *For single profile, minimal chunk size is 8M for all.
> +  *For other profiles, minimal chunk and stripe size
> +  *differs from 8M to 64M.
> +  *
> +  * To calculate it a little easier, here we assume we don't
> +  * reuse any temporary chunk, and calculate the size all
> +  * by ourselves.
> +  *
> +  * Temporary chunks sizes are always fixed:
> +  * One initial sys chunk, one SINGLE meta, and one SINGLE data.
> +  * The latter two are all 8M, accroding to @calc_size of
> +  * btrfs_alloc_chunk().
> +  */
> + reserved += BTRFS_MKFS_SYSTEM_GROUP_SIZE + SZ_8M * 2;
> +
> + /*
> +  * For real chunks, we need to refer different sizes:
> +  * For SINGLE, it's still fixed to 8M (@calc_size).
> +  * For other profiles, refer to max(@min_stripe_size, @calc_size).
> +  *
> +  * And use the stripe size to calculate its physical used space.
> +  */
> + if (meta_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK)
> + meta_size = SZ_8M + SZ_32M;
> + else
> + meta_size = SZ_8M + SZ_8M;
> + /* Only metadata put 2 stripes into one disk */

'mkfs.btrfs -d dup' also works, so I think this should be handled here
the same way as metadata, right?

> + if (meta_profile & BTRFS_BLOCK_GROUP_DUP)
> + meta_size *= 2;
> + reserved += meta_size;
> +
> + if (data_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK)
> + data_size = SZ_64M;
> + else
> + data_size = SZ_8M;
> + if (data_profile & BTRFS_BLOCK_GROUP_DUP)
> + data_size *= 2;
> + reserved += data_size;
> + return reserved;
> +}
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: mkfs: check the status of file at mkfs

2017-11-27 Thread David Sterba
On Fri, Nov 24, 2017 at 02:21:15PM +0900, Misono, Tomohiro wrote:
> Currently, only the status of block devices is checked at mkfs,
> but we should also check for regular files whether they are already
> formatted or mounted to prevent overwrite accidentally.
> 
> Device status is checked by test_dev_for_mkfs().
> The part which is not related to block device is split from this
> and used for both block device and regular file.
> 
> Signed-off-by: Tomohiro Misono 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs-progs: fi defrag: clean up duplicate code if find errors

2017-11-27 Thread David Sterba
On Tue, Nov 07, 2017 at 10:24:30AM +0800, Su Yue wrote:
> In function cmd_filesystem_defrag(), lines of code for error handling
> are duplicate and hard to expand in further.
> 
> Create a jump label for errors.
> 
> Signed-off-by: Su Yue 
> ---
>  cmds-filesystem.c | 46 +-
>  1 file changed, 25 insertions(+), 21 deletions(-)
> 
> diff --git a/cmds-filesystem.c b/cmds-filesystem.c
> index 7728430f16a1..0893a44f28fe 100644
> --- a/cmds-filesystem.c
> +++ b/cmds-filesystem.c
> @@ -1029,29 +1029,27 @@ static int cmd_filesystem_defrag(int argc, char 
> **argv)
>   if (fd < 0) {
>   error("cannot open %s: %s", argv[i],
>   strerror(errno));
> - defrag_global_errors++;
> - close_file_or_dir(fd, dirstream);
> - continue;
> + ret = fd;

The fd will be -1, we might need to save the last error, though it's not
directly used (yet).

> + goto next;
>   }
> - if (fstat(fd, )) {
> +
> + ret = fstat(fd, );
> + if (ret) {
>   error("failed to stat %s: %s",
>   argv[i], strerror(errno));
> - defrag_global_errors++;
> - close_file_or_dir(fd, dirstream);
> - continue;
> + goto next;

Similar here.

>   }
>   if (!(S_ISDIR(st.st_mode) || S_ISREG(st.st_mode))) {
>   error("%s is not a directory or a regular file",
>   argv[i]);
> - defrag_global_errors++;
> - close_file_or_dir(fd, dirstream);
> - continue;
> + ret = -EINVAL;

So it's consistent with that one.

> + goto next;
>   }
>   if (recursive && S_ISDIR(st.st_mode)) {
>   ret = nftw(argv[i], defrag_callback, 10,
>   FTW_MOUNT | FTW_PHYS);
>   if (ret == ENOTTY)
> - exit(1);
> + break;

Please split this change, it changes the logic and should be properly
described in the changelog.

Otherwise ok.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/3] btrfs-progs: test/fsck: Introduce test images containing tree reloc tree

2017-11-27 Thread David Sterba
On Fri, Nov 10, 2017 at 09:34:17AM +0800, Qu Wenruo wrote:
> Tree reloc tree is a special tree with very short life spawn.
> It acts as a special snapshot for any tree, with related nodes/leaves or
> EXTENT_DATA modified to point to new position.
> 
> Considering the short life spawn and its specialness, it should be quite
> reasonable to keep them as both corner case for fsck and educational
> dump for anyone interested in relocation.
> 
> Signed-off-by: Qu Wenruo 

1-3 applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: lowmem check: Reword an unclear error message about file extent gap

2017-11-27 Thread David Sterba
On Fri, Nov 10, 2017 at 05:47:10PM +0800, Lu Fengqi wrote:
> This error occurs when no_holes is not set, but there is a gap
> before the file extent.
> 
> Signed-off-by: Lu Fengqi 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 09/11] Btrfs: kill the btree_inode

2017-11-27 Thread David Sterba
On Wed, Nov 22, 2017 at 04:16:04PM -0500, Josef Bacik wrote:
> From: Josef Bacik 
> @@ -4802,8 +4885,8 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct 
> extent_buffer *src)
>   return new;
>  }
>  
> -struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info 
> *fs_info,
> -   u64 start, unsigned long len)
> +struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_eb_info 
> *eb_info,
> + u64 start, unsigned long len)

The __alloc_dummy_extent_buffer takes the length parameter because it's
used in tests that need to pass different values.
I've removed nodesize from alloc_dummy_extent_buffer and the callchain
because we know that it's always going to be fs_info->nodesize.
Reintroducing it does not look like a good idea.

>  {
>   struct extent_buffer *eb;
>   unsigned long num_pages;

> @@ -160,13 +162,25 @@ struct extent_state {
>  #endif
>  };
>  
> +struct btrfs_eb_info {
> + struct btrfs_fs_info *fs_info;
> + struct extent_io_tree io_tree;
> + struct extent_io_tree io_failure_tree;
> +
> + /* Extent buffer radix tree */
> + spinlock_t buffer_lock;
> + struct radix_tree_root buffer_radix;
> + struct list_lru lru_list;
> + pgoff_t writeback_index;
> +};
> +
>  #define INLINE_EXTENT_BUFFER_PAGES 16
>  #define MAX_INLINE_EXTENT_BUFFER_SIZE (INLINE_EXTENT_BUFFER_PAGES * 
> PAGE_SIZE)
>  struct extent_buffer {
>   u64 start;
>   unsigned long len;
>   unsigned long bflags;
> - struct btrfs_fs_info *fs_info;
> + struct btrfs_eb_info *eb_info;

This single change increases the patch size just because all the
callers need to be updated. I suggest to keep fs_info in extent_buffer,
we're not going to lose much in terms of memory:

currently there are 14 eb objects in a 4k slab page, with the additional
fs_info it's still 14,

280 * 14 = 3920, unused 176 bytes
288 * 14 = 4032, unused 64 bytes

>   spinlock_t refs_lock;
>   atomic_t refs;
>   atomic_t io_pages;
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Remove redundant FLAG_VACANCY

2017-11-27 Thread David Sterba
On Thu, Nov 23, 2017 at 10:51:43AM +0200, Nikolay Borisov wrote:
> Commit 9036c10208e1 ("Btrfs: update hole handling v2") added the FLAG_VACANCY
> to denote holes, however there was already a consistent way of flagging
> extents which represent hole - ->block_start = EXTENT_MAP_HOLE. And also
> the only place where this flag is checked is in the fiemap code, but the
> block_start value is also checked and every other place in the filesystem
> detects holes by using block_start value's. So remove the extra flag. This
> survived a full xfstest run
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: extent-tree: Make btrfs_inode_rsv_refill function static

2017-11-27 Thread David Sterba
On Fri, Nov 17, 2017 at 03:14:19PM +0800, Qu Wenruo wrote:
> This function is no longer used outside of extent-tree.c.
> Make it static.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: fix list_add corruption and soft lockups in fsync

2017-11-27 Thread David Sterba
On Tue, Nov 21, 2017 at 02:35:40PM -0700, Liu Bo wrote:
> Xfstests btrfs/146 revealed this corruption,
> 
> [   58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async 
> page read
> [   58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 
> 1, rd 0, flush 0, corrupt 0, gen 0
> [   58.152403] list_add corruption. prev->next should be next 
> (88005e6775d8), but was c9000189be88. (prev=c9000189be88).
> [   58.153518] [ cut here ]
> [   58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 
> __list_add_valid+0x169/0x1f0
> ...
> [   58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
> ...
> [   58.161956] Call Trace:
> [   58.162264]  btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
> [   58.163583]  btrfs_log_dentry_safe+0x60/0x80 [btrfs]
> [   58.164003]  btrfs_sync_file+0x4c2/0x6f0 [btrfs]
> [   58.164393]  vfs_fsync_range+0x5f/0xd0
> [   58.164898]  do_fsync+0x5a/0x90
> [   58.165170]  SyS_fsync+0x10/0x20
> [   58.165395]  entry_SYSCALL_64_fastpath+0x1f/0xbe
> ...
> 
> It turns out that we could record btrfs_log_ctx:io_err in
> log_one_extents when IO fails, but make log_one_extents() return '0'
> instead of -EIO, so the IO error is not acknowledged by the callers,
> i.e.  btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list
> from list head 'root->log_ctxs'.  Since btrfs_log_ctx is allocated
> from stack memory, it'd get freed with a object alive on the
> list. then a future list_add will throw the above warning.
> 
> This returns the correct error in the above case.
> 
> Jeff also reported this while testing against his fsync error
> patch set[1].
> 
> [1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
> "btrfs list corruption and soft lockups while testing writeback error 
> handling"
> 
> Fixes: 8407f553268a4611f254 ("Btrfs: fix data corruption after fast fsync and 
> writeback error")
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: dump_tree: remove superfluous _TREE

2017-11-27 Thread David Sterba
On Sat, Nov 25, 2017 at 09:03:26AM +0800, Qu Wenruo wrote:
> On 2017年11月25日 04:26, Hans van Kranenburg wrote:
> > -# btrfs inspect-internal dump-tree -t fs /dev/block/device
> > ERROR: unrecognized tree id: fs
> > 
> > Without this fix I can't dump-tree fs, but I can dump-tree fs_tree and
> > also fs_tree_tree, which is a bit silly.
> > ---
> >  cmds-inspect-dump-tree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
> > index 4e72c08a..df44bb63 100644
> > --- a/cmds-inspect-dump-tree.c
> > +++ b/cmds-inspect-dump-tree.c
> > @@ -143,7 +143,7 @@ static u64 treeid_from_string(const char *str, const 
> > char **end)
> > { "CHUNK", BTRFS_CHUNK_TREE_OBJECTID },
> > { "DEVICE", BTRFS_DEV_TREE_OBJECTID },
> > { "DEV", BTRFS_DEV_TREE_OBJECTID },
> > -   { "FS_TREE", BTRFS_FS_TREE_OBJECTID },
> > +   { "FS", BTRFS_FS_TREE_OBJECTID },
> 
> Considering no other name has the _tree suffix, this looks good to me.
> 
> Reviewed-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Nikolay Borisov


On 27.11.2017 15:23, Adam Borowski wrote:
> Hi!
> On 4.15-rc1, I get the following failure:
> 
> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
> extent, have 134 expect 281474976710677
> 
> Repeatable every boot attempt.  4.14 and earlier boot fine; btrfs check
> (progs 4.13.3) doesn't find any badness either.

The reason why you hit this on 4.15 is because this error is coming from the 
tree checker. 
So how big is the file which inode 35691 represents? 

So the check which fails is: 

if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START +  
btrfs_file_extent_ram_bytes(leaf, fi)) {
file_extent_err(root, leaf, slot,   
"invalid ram_bytes for uncompressed inline extent, have %u expect %llu",
item_size, BTRFS_FILE_EXTENT_INLINE_DATA_START +
btrfs_file_extent_ram_bytes(leaf, fi)); 
return -EUCLEAN;
}

BTRFS_FILE_EXTENT_INLINE_DATA_START is 21 based on pahole's output: 

struct btrfs_file_extent_item {
__le64 generation;   /* 0 8 */
__le64 ram_bytes;/* 8 8 */
__u8   compression;  /*16 1 */
__u8   encryption;   /*17 1 */
__le16 other_encoding;   /*18 2 */
__u8   type; /*20 1 */
__le64 disk_bytenr;  /*21 8 */
__le64 disk_num_bytes;   /*29 8 */
__le64 offset;   /*37 8 */
__le64 num_bytes;/*45 8 */

/* size: 53, cachelines: 1, members: 10 */
/* last cacheline: 53 bytes */
};

The inline extent size is 113 so 21 + 113 should is 134 which equals to what we 
expect. However, 
the printing code uses btrfs_file_extent_inline_len and the tree-checker code 
uses directly the 
ram bytes. And this function does the correct thing according to whether the 
file is compressed or not. 
So Qu, perhaps the code needs to be changed or is this a genuine bug ? 

> 
> [   11.347451] BTRFS info (device sda1): use lzo compression
> [   11.352914] BTRFS info (device sda1): using free space tree
> [] Activating lvm and md swap...[ ok done.
> [] Checking file systems...fsck from util-l
> [   11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 
> block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for 
> uncompressed inline extent, have 134 expect 281474976710677
> inux 2.30.2
> [   11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 
> free space 4350
> [   11.687767]item 0 key (35663 12 32909) itemoff 16263 itemsize 20
> [   11.695021]item 1 key (35663 108 0) itemoff 15751 itemsize 512
> [   11.701274]inline extent data size 509
>  ok
> [   11.705704]item 2 key (35664 1 0) itemoff 15591 itemsize 160
> [   11.713034]inode generation 1292 size 509 mode 100644
> [   11.718811]item 3 key (35664 12 32909) itemoff 15571 itemsize 20
> [   11.725113]item 4 key (35664 108 0) itemoff 15059 itemsize 512
> [   11.732168]inline extent data size 509
> done.
> [   11.736280]item 5 key (35665 1 0) itemoff 14899 itemsize 160
> [   11.742681]inode generation 1292 size 457 mode 100644
> [   11.748275]item 6 key (35665 12 32909) itemoff 14879 itemsize 20
> [   11.754584]item 7 key (35665 108 0) itemoff 14411 itemsize 468
> [   11.760674]inline extent data size 457
> [   11.764780]item 8 key (35666 1 0) itemoff 14251 itemsize 160
> [   11.770711]inode generation 1292 size 533 mode 100644
> []
> [   11.776145]item 9 key (35666 12 32909) itemoff 14231 itemsize 20
> Cleaning up temp
> [   11.783069]item 10 key (35666 108 0) itemoff 13697 itemsize 534
> orary files...
> [   11.790666]inline extent data size 533
> [   11.795980]item 11 key (35668 1 0) itemoff 13537 itemsize 160
> [   11.801989]inode generation 1292 size 319 mode 100644
>  /tmp
> [   11.807413]item 12 key (35668 12 32909) itemoff 13517 itemsize 20
> [   11.814250]item 13 key (35668 108 0) itemoff 13247 itemsize 270
> [   11.820512]inline extent data size 319
> [   11.825539]item 14 key (35669 1 0) itemoff 13087 itemsize 160
> [   11.831577]inode generation 1292 size 375 mode 100644
> [   11.837149]item 15 key (35669 12 32909) itemoff 

Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread David Sterba
On Mon, Nov 27, 2017 at 02:23:49PM +0100, Adam Borowski wrote:
> Hi!
> On 4.15-rc1, I get the following failure:
> 
> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
> extent, have 134 expect 281474976710677

By a quick look at suspiciously large number

hex(281474976710677) = 0x10015

may be a bitflip, but 0x15 does not match 134, so there could be
something else involved in the corruption.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: ignore return from btrfs_open_one_device()

2017-11-27 Thread Anand Jain
Test case btrfs-progs test-misc/012 can recreate the same fsid with
different number of struct btrfs_fs_devices::total_devices. And the
previous device which is in the kernel device list is stale now. But
as we don't clean the kernel device list (unless same device is scanned
with different fsid), so in the mount context it really ended up
reading the device to find zeroed SB. And thus return the fail to mount.

The long term fix for this should be to refresh the kernel device list.

Though the patch
  btrfs: factor __btrfs_open_devices() to create btrfs_open_one_device()
did tried to skip the failed open, but it forgot to reset the ret value
or not to assign, thus error went up the stack in the mount context.

Signed-off-by: Anand Jain 
Fixes:
  btrfs: factor __btrfs_open_devices() to create btrfs_open_one_device()
---
Hi David,

 My bad. The above patch introduced a regression. Can you kindly squash
 this patch to it.

Thanks, Anand

 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5bd73edc2602..a3fa2dc39881 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1047,8 +1047,7 @@ static int __btrfs_open_devices(struct btrfs_fs_devices 
*fs_devices,
 
list_for_each_entry(device, head, dev_list) {
/* Just open everything we can; ignore failures here */
-   ret = btrfs_open_one_device(fs_devices, device, flags, holder);
-   if (ret)
+   if (btrfs_open_one_device(fs_devices, device, flags, holder))
continue;
 
if (!latest_dev ||
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How about adding an ioctl to convert a directory to a subvolume?

2017-11-27 Thread Austin S. Hemmelgarn

On 2017-11-27 08:17, Qu Wenruo wrote:



On 2017年11月27日 21:02, Austin S. Hemmelgarn wrote:

On 2017-11-27 05:13, Qu Wenruo wrote:



On 2017年11月27日 17:41, Lu Fengqi wrote:

Hi all,

As we all know, under certain circumstances, it is more appropriate to
create some subvolumes rather than keep everything in the same
subvolume. As the condition of demand change, the user may need to
convert a previous directory to a subvolume. For this reason,how about
adding an ioctl to convert a directory to a subvolume?


The idea seems interesting.

However in my opinion, this can be done quite easily in (mostly) user
space, thanks to btrfs support of relink.

The method from Hugo or Chris is quite good, maybe it can be enhanced a
little.

Use the following layout as an example:

root_subv
|- subvolume_1
|  |- dir_1
|  |  |- file_1
|  |  |- file_2
|  |- dir_2
| |- file_3
|- subvolume_2

If we want to convert dir_1 into subvolume, we can do it like:

1) Create a temporary readonly snapshot of parent subvolume containing
     the desired dir
     # btrfs sub snapshot -r root_subv/subvolume_1 \
   root_subv/tmp_snapshot_1

2) Create a new subvolume, as destination.
     # btrfs sub create root_subv/tmp_dest/

3) Copy the content and sync the fs
     Use of reflink is necessary.
     # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \
   root_subv/tmp_dest
     # btrfs sync root_subv/tmp_dest

4) Delete temporary readonly snapshot
     # btrfs subvolume delete root_subv/tmp_snapshot_1

5) Remove the source dir
     # rm -rf root_subv/subvolume_1/dir_1

5) Create a final destination snapshot of "root_subv/temporary_dest"
     # btrfs subvolume snapshot root_subv/tmp_dest \
   root_subv/subvolume_1/dir_1

6) Remove the temporary destination
     # btrfs subvolume delete root_subv/tmp_dest


The main challenge is in step 3).
In fact above method can only handle normal dir/files.
If there is another subvolume inside the desired dir, current "cp -r" is
a bad idea.
We need to skip subvolume dir, and create snapshot for it.

But it's quite easy to write a user space program to handle it.
Maybe using "find" command can already handle it well.

Anyway, doing it in user space is already possible and much easier than
doing it in kernel.



Users can convert by the scripts mentioned in this
thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is
it easier to use the off-the-shelf btrfs subcommand?


If you just want to integrate the functionality into btrfs-progs, maybe
it's possible.

But if you insist in providing a new ioctl for this, I highly doubt if
the extra hassle is worthy.



After an initial consideration, our implementation is broadly divided
into the following steps:
1. Freeze the filesystem or set the subvolume above the source directory
to read-only;


Not really need to freeze the whole fs.
Just create a readonly snapshot of the parent subvolume which contains
the dir.
That's how snapshot is designed for.


2. Perform a pre-check, for example, check if a cross-device link
creation during the conversion;


This can be done in-the-fly.
As the check is so easy (only needs to check if the inode number is 256).
We only need a mid-order iteration of the source dir (in temporary
snapshot), and for normal file, use reflink.
For subvolume dir, create a snapshot for it.

And for such iteration, a python script less than 100 lines would be
sufficient.

On that note, see the function convert_dir_to_subv() in:
https://github.com/Ferroin/btrfs-subv-backup/blob/master/btrfs-subv-backup.py


For an example of how to do it in Python (albeit with some extra code to
handle the case of not having the reflink module from PyPI, and without
anything to prevent the source from being modified).

It would still be nice to be able to do this atomically though, or at
least get cross-rename support in BTRFS, which would allow the final
rename to replace the source with a subvolume to be atomic (assuming of
course you could cross-rename a directory and subvolume).


The problem behind cross-rename is, btrfs doesn't follow the
one-inode-one-tree organization used by most filesystems.

This prevents inode from being referred outside of its subvolume.


And since btrfs uses one-subvolume-one-tree solution, which greatly
simplify the snapshot implementation, it's pretty hard or almost
impossible to do real rename-across-subvolume.
I seriously doubt that that matters in almost all real-world use cases. 
Everything I've seen that uses cross-rename does it with a temporary 
file in the same directory as the target file, using it to avoid the 
non-atomic nature of creating a backup and replacing a file without 
needing extra I/O (yes, reflinks help here, but still aren't perfect).


Just supporting it within a subvolume and returning whatever errno gets 
returned for trying to call rename(2) across filesystem boundaries 
should be more than sufficient for most use cases, even if it doesn't 
work with what I had 

splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Adam Borowski
Hi!
On 4.15-rc1, I get the following failure:

BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
extent, have 134 expect 281474976710677

Repeatable every boot attempt.  4.14 and earlier boot fine; btrfs check
(progs 4.13.3) doesn't find any badness either.

[   11.347451] BTRFS info (device sda1): use lzo compression
[   11.352914] BTRFS info (device sda1): using free space tree
[] Activating lvm and md swap...[ ok done.
[] Checking file systems...fsck from util-l
[   11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 
block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for 
uncompressed inline extent, have 134 expect 281474976710677
inux 2.30.2
[   11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 free 
space 4350
[   11.687767]  item 0 key (35663 12 32909) itemoff 16263 itemsize 20
[   11.695021]  item 1 key (35663 108 0) itemoff 15751 itemsize 512
[   11.701274]  inline extent data size 509
 ok
[   11.705704]  item 2 key (35664 1 0) itemoff 15591 itemsize 160
[   11.713034]  inode generation 1292 size 509 mode 100644
[   11.718811]  item 3 key (35664 12 32909) itemoff 15571 itemsize 20
[   11.725113]  item 4 key (35664 108 0) itemoff 15059 itemsize 512
[   11.732168]  inline extent data size 509
done.
[   11.736280]  item 5 key (35665 1 0) itemoff 14899 itemsize 160
[   11.742681]  inode generation 1292 size 457 mode 100644
[   11.748275]  item 6 key (35665 12 32909) itemoff 14879 itemsize 20
[   11.754584]  item 7 key (35665 108 0) itemoff 14411 itemsize 468
[   11.760674]  inline extent data size 457
[   11.764780]  item 8 key (35666 1 0) itemoff 14251 itemsize 160
[   11.770711]  inode generation 1292 size 533 mode 100644
[]
[   11.776145]  item 9 key (35666 12 32909) itemoff 14231 itemsize 20
Cleaning up temp
[   11.783069]  item 10 key (35666 108 0) itemoff 13697 itemsize 534
orary files...
[   11.790666]  inline extent data size 533
[   11.795980]  item 11 key (35668 1 0) itemoff 13537 itemsize 160
[   11.801989]  inode generation 1292 size 319 mode 100644
 /tmp
[   11.807413]  item 12 key (35668 12 32909) itemoff 13517 itemsize 20
[   11.814250]  item 13 key (35668 108 0) itemoff 13247 itemsize 270
[   11.820512]  inline extent data size 319
[   11.825539]  item 14 key (35669 1 0) itemoff 13087 itemsize 160
[   11.831577]  inode generation 1292 size 375 mode 100644
[   11.837149]  item 15 key (35669 12 32909) itemoff 13067 itemsize 20
[   11.843873]  item 16 key (35669 108 0) itemoff 12783 itemsize 284
 ok 
[   11.850098]  inline extent data size 375
[   11.855579]  item 17 key (35670 1 0) itemoff 12623 itemsize 160
[   11.862861]  inode generation 1292 size 168 mode 100644
.
[   11.869194]  item 18 key (35670 12 33512) itemoff 12588 itemsize 35
[   11.876467]  item 19 key (35670 108 0) itemoff 12399 itemsize 189
[   11.883564]  inline extent data size 168
[   11.888551]  item 20 key (35676 1 0) itemoff 12239 itemsize 160
[   11.895421]  inode generation 1292 size 512 mode 100600
[   11.901719]  item 21 key (35676 12 32911) itemoff 12218 itemsize 21
[   11.909045]  item 22 key (35676 108 0) itemoff 11685 itemsize 533
[   11.916136]  inline extent data size 512
[   11.921125]  item 23 key (35685 1 0) itemoff 11525 itemsize 160
[   11.928047]  inode generation 1292 size 32128 mode 100644
[   11.934553]  item 24 key (35685 12 32783) itemoff 11508 itemsize 17
[] Mounting 
[   11.941874]  item 25 key (35685 108 0) itemoff 11455 itemsize 53
local filesystem
[   11.949377]  extent data disk bytenr 3757990555648 nr 4096
s...
[   11.956471]  extent data offset 0 nr 4096 ram 4096
[   11.962383]  item 26 key (35685 108 4096) itemoff 11402 itemsize 53
[   11.969704]  extent data disk bytenr 3755041128448 nr 4096
[   11.976324]  extent data offset 4096 nr 24576 ram 32768
[   11.982686]  item 27 key (35685 108 28672) itemoff 11349 itemsize 53
[   11.990140]  extent data disk bytenr 3749090922496 nr 4096
[   11.996786]  extent data offset 0 nr 4096 ram 4096
[   12.002732]  item 28 key (35686 1 0) itemoff 11189 itemsize 160
[   12.009755]  inode generation 1292 size 5023 mode 100644
[   12.016204]  item 29 key (35686 12 32783) itemoff 11165 itemsize 24
[   12.023576]  item 30 key (35686 108 0) itemoff 2 itemsize 53
[   12.030665]  extent data disk bytenr 3651995594752 nr 4096
[   12.037298]  extent data offset 0 nr 8192 ram 8192
[   12.043184]  item 31 key (35687 1 0) itemoff 10952 itemsize 160
[   12.050181]  inode generation 1292 size 293168 mode 100664
[   12.056793]  item 32 key (35687 12 32783) itemoff 10935 itemsize 17
[   12.064082]  item 33 key (35687 108 0) itemoff 10882 itemsize 53
[   12.071154]  extent data disk bytenr 

Re: How about adding an ioctl to convert a directory to a subvolume?

2017-11-27 Thread Qu Wenruo


On 2017年11月27日 21:02, Austin S. Hemmelgarn wrote:
> On 2017-11-27 05:13, Qu Wenruo wrote:
>>
>>
>> On 2017年11月27日 17:41, Lu Fengqi wrote:
>>> Hi all,
>>>
>>> As we all know, under certain circumstances, it is more appropriate to
>>> create some subvolumes rather than keep everything in the same
>>> subvolume. As the condition of demand change, the user may need to
>>> convert a previous directory to a subvolume. For this reason,how about
>>> adding an ioctl to convert a directory to a subvolume?
>>
>> The idea seems interesting.
>>
>> However in my opinion, this can be done quite easily in (mostly) user
>> space, thanks to btrfs support of relink.
>>
>> The method from Hugo or Chris is quite good, maybe it can be enhanced a
>> little.
>>
>> Use the following layout as an example:
>>
>> root_subv
>> |- subvolume_1
>> |  |- dir_1
>> |  |  |- file_1
>> |  |  |- file_2
>> |  |- dir_2
>> | |- file_3
>> |- subvolume_2
>>
>> If we want to convert dir_1 into subvolume, we can do it like:
>>
>> 1) Create a temporary readonly snapshot of parent subvolume containing
>>     the desired dir
>>     # btrfs sub snapshot -r root_subv/subvolume_1 \
>>   root_subv/tmp_snapshot_1
>>
>> 2) Create a new subvolume, as destination.
>>     # btrfs sub create root_subv/tmp_dest/
>>
>> 3) Copy the content and sync the fs
>>     Use of reflink is necessary.
>>     # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \
>>   root_subv/tmp_dest
>>     # btrfs sync root_subv/tmp_dest
>>
>> 4) Delete temporary readonly snapshot
>>     # btrfs subvolume delete root_subv/tmp_snapshot_1
>>
>> 5) Remove the source dir
>>     # rm -rf root_subv/subvolume_1/dir_1
>>
>> 5) Create a final destination snapshot of "root_subv/temporary_dest"
>>     # btrfs subvolume snapshot root_subv/tmp_dest \
>>   root_subv/subvolume_1/dir_1
>>
>> 6) Remove the temporary destination
>>     # btrfs subvolume delete root_subv/tmp_dest
>>
>>
>> The main challenge is in step 3).
>> In fact above method can only handle normal dir/files.
>> If there is another subvolume inside the desired dir, current "cp -r" is
>> a bad idea.
>> We need to skip subvolume dir, and create snapshot for it.
>>
>> But it's quite easy to write a user space program to handle it.
>> Maybe using "find" command can already handle it well.
>>
>> Anyway, doing it in user space is already possible and much easier than
>> doing it in kernel.
>>
>>>
>>> Users can convert by the scripts mentioned in this
>>> thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is
>>> it easier to use the off-the-shelf btrfs subcommand?
>>
>> If you just want to integrate the functionality into btrfs-progs, maybe
>> it's possible.
>>
>> But if you insist in providing a new ioctl for this, I highly doubt if
>> the extra hassle is worthy.
>>
>>>
>>> After an initial consideration, our implementation is broadly divided
>>> into the following steps:
>>> 1. Freeze the filesystem or set the subvolume above the source directory
>>> to read-only;
>>
>> Not really need to freeze the whole fs.
>> Just create a readonly snapshot of the parent subvolume which contains
>> the dir.
>> That's how snapshot is designed for.
>>
>>> 2. Perform a pre-check, for example, check if a cross-device link
>>> creation during the conversion;
>>
>> This can be done in-the-fly.
>> As the check is so easy (only needs to check if the inode number is 256).
>> We only need a mid-order iteration of the source dir (in temporary
>> snapshot), and for normal file, use reflink.
>> For subvolume dir, create a snapshot for it.
>>
>> And for such iteration, a python script less than 100 lines would be
>> sufficient.
> On that note, see the function convert_dir_to_subv() in:
> https://github.com/Ferroin/btrfs-subv-backup/blob/master/btrfs-subv-backup.py
> 
> 
> For an example of how to do it in Python (albeit with some extra code to
> handle the case of not having the reflink module from PyPI, and without
> anything to prevent the source from being modified).
> 
> It would still be nice to be able to do this atomically though, or at
> least get cross-rename support in BTRFS, which would allow the final
> rename to replace the source with a subvolume to be atomic (assuming of
> course you could cross-rename a directory and subvolume).

The problem behind cross-rename is, btrfs doesn't follow the
one-inode-one-tree organization used by most filesystems.

This prevents inode from being referred outside of its subvolume.


And since btrfs uses one-subvolume-one-tree solution, which greatly
simplify the snapshot implementation, it's pretty hard or almost
impossible to do real rename-across-subvolume.

But at least we can reflink, reducing huge amount of data IO, making us
only need to handle inode creation/link.

(Although such one-subvolume-one-tree also makes metadata concurrency
very low, further slowing down the metadata operation)

Thanks,
Qu

>>
>> Thanks,
>> Qu
>>
>>> 3. Perform conversion, such as creating a new 

Re: FAQ / encryption / error handling?

2017-11-27 Thread Austin S. Hemmelgarn

On 2017-11-27 05:06, Dmitrii Tcvetkov wrote:

On Mon, 27 Nov 2017 09:06:12 +0100
Daniel Pocock  wrote:


Hi all,

The FAQ has a couple of sections on encryption (general and dm-crypt)

One thing that isn't explained there: if you create multiple encrypted
volumes (e.g. using dm-crypt) and use Btrfs to combine them into
RAID1, how does error recovery work when a read operation returns
corrupted data?

Without encryption, reading from one disk would give a checksum
mismatch and Btrfs would read from the other disk to (hopefully) get
a good copy of the data.

With this encryption scenario, the failure would potentially be
detected in the decryption layer code and instead of returning bad
data to Btrfs, it would return some error code. In that case, will
Btrfs attempt to read from the other volume and allow the application
to proceed as if nothing was wrong?

Regards,

Daniel


Default (aes-xts-plain64) dm-crypt setup can't verify integrity
of encrypted block and in case of silent corruption will decrypt it to
garbage which btrfs will catch. In case of AEAD encryption
(dm-crypt plus dm-integrity) it can verify integrity itself but I'm not
sure right now which exact error it returns to upper layer as I didn't
used it yet.
The exact error shouldn't matter, provided that BTRFS perceives it as a 
read error from the 'device' (in reality the virtual DM device). 
Provided that condition is met, the error is handled pretty much the 
same regardless of the exact error code.


I use btrfs raid1 on top of LVM on top of dm-crypt devices and
it handled bad blocks on physical devices normally (there was a burst of
about 900 reallocates on one device which btrfs caught and fixed).Same here, and I've also tested it on top of dm-integrity, where BTRFS 
will correctly handle errors passed up from dm-integrity failing to 
verify blocks.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How about adding an ioctl to convert a directory to a subvolume?

2017-11-27 Thread Austin S. Hemmelgarn

On 2017-11-27 05:13, Qu Wenruo wrote:



On 2017年11月27日 17:41, Lu Fengqi wrote:

Hi all,

As we all know, under certain circumstances, it is more appropriate to
create some subvolumes rather than keep everything in the same
subvolume. As the condition of demand change, the user may need to
convert a previous directory to a subvolume. For this reason,how about
adding an ioctl to convert a directory to a subvolume?


The idea seems interesting.

However in my opinion, this can be done quite easily in (mostly) user
space, thanks to btrfs support of relink.

The method from Hugo or Chris is quite good, maybe it can be enhanced a
little.

Use the following layout as an example:

root_subv
|- subvolume_1
|  |- dir_1
|  |  |- file_1
|  |  |- file_2
|  |- dir_2
| |- file_3
|- subvolume_2

If we want to convert dir_1 into subvolume, we can do it like:

1) Create a temporary readonly snapshot of parent subvolume containing
the desired dir
# btrfs sub snapshot -r root_subv/subvolume_1 \
  root_subv/tmp_snapshot_1

2) Create a new subvolume, as destination.
# btrfs sub create root_subv/tmp_dest/

3) Copy the content and sync the fs
Use of reflink is necessary.
# cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \
  root_subv/tmp_dest
# btrfs sync root_subv/tmp_dest

4) Delete temporary readonly snapshot
# btrfs subvolume delete root_subv/tmp_snapshot_1

5) Remove the source dir
# rm -rf root_subv/subvolume_1/dir_1

5) Create a final destination snapshot of "root_subv/temporary_dest"
# btrfs subvolume snapshot root_subv/tmp_dest \
  root_subv/subvolume_1/dir_1

6) Remove the temporary destination
# btrfs subvolume delete root_subv/tmp_dest


The main challenge is in step 3).
In fact above method can only handle normal dir/files.
If there is another subvolume inside the desired dir, current "cp -r" is
a bad idea.
We need to skip subvolume dir, and create snapshot for it.

But it's quite easy to write a user space program to handle it.
Maybe using "find" command can already handle it well.

Anyway, doing it in user space is already possible and much easier than
doing it in kernel.



Users can convert by the scripts mentioned in this
thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is
it easier to use the off-the-shelf btrfs subcommand?


If you just want to integrate the functionality into btrfs-progs, maybe
it's possible.

But if you insist in providing a new ioctl for this, I highly doubt if
the extra hassle is worthy.



After an initial consideration, our implementation is broadly divided
into the following steps:
1. Freeze the filesystem or set the subvolume above the source directory
to read-only;


Not really need to freeze the whole fs.
Just create a readonly snapshot of the parent subvolume which contains
the dir.
That's how snapshot is designed for.


2. Perform a pre-check, for example, check if a cross-device link
creation during the conversion;


This can be done in-the-fly.
As the check is so easy (only needs to check if the inode number is 256).
We only need a mid-order iteration of the source dir (in temporary
snapshot), and for normal file, use reflink.
For subvolume dir, create a snapshot for it.

And for such iteration, a python script less than 100 lines would be
sufficient.

On that note, see the function convert_dir_to_subv() in:
https://github.com/Ferroin/btrfs-subv-backup/blob/master/btrfs-subv-backup.py

For an example of how to do it in Python (albeit with some extra code to 
handle the case of not having the reflink module from PyPI, and without 
anything to prevent the source from being modified).


It would still be nice to be able to do this atomically though, or at 
least get cross-rename support in BTRFS, which would allow the final 
rename to replace the source with a subvolume to be atomic (assuming of 
course you could cross-rename a directory and subvolume).


Thanks,
Qu


3. Perform conversion, such as creating a new subvolume and moving the
contents of the source directory;
4. Thaw the filesystem or restore the subvolume writable property.

In fact, I am not so sure whether this use of freeze is appropriate
because the source directory the user needs to convert may be located
at / or /home and this pre-check and conversion process may take a long
time, which can lead to some shell and graphical application suspended.

Please give your comments if any.





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] btrfs: factor __btrfs_open_devices() to create btrfs_open_one_device()

2017-11-27 Thread Anand Jain


Hi Lakshmipathi,

 Oops I can see the same. I am trying to narrow down.

Thanks, Anand

On 11/27/2017 02:47 PM, Lakshmipathi.G wrote:

Hi Anand,

With this patch applied, btrfs-progs/misc-test/021 error out. Is this
same for you?

Without this patch: https://asciinema.org/a/RJmE5469mHlL3S1BIOCifWVn6
With this patch:https://asciinema.org/a/1h5UX6DIFNsvvMXgLo4GiEgdE

thanks!

Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org


On Thu, Nov 9, 2017 at 9:15 PM, Anand Jain  wrote:

No functional changes, create btrfs_open_one_device() from
__btrfs_open_devices(). This is a preparatory work to add dynamic
device scan.

Signed-off-by: Anand Jain 
---
  fs/btrfs/volumes.c | 126 +
  1 file changed, 69 insertions(+), 57 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0857b580014d..d24e966ee29f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -601,6 +601,73 @@ void btrfs_free_stale_device(struct btrfs_device *cur_dev)
 }
  }

+static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
+   struct btrfs_device *device, fmode_t flags,
+   void *holder)
+{
+   struct request_queue *q;
+   struct block_device *bdev;
+   struct buffer_head *bh;
+   struct btrfs_super_block *disk_super;
+   u64 devid;
+   int ret;
+
+   if (device->bdev)
+   return -EINVAL;
+   if (!device->name)
+   return -EINVAL;
+
+   ret = btrfs_get_bdev_and_sb(device->name->str, flags, holder, 1,
+   , );
+   if (ret)
+   return ret;
+
+   disk_super = (struct btrfs_super_block *)bh->b_data;
+   devid = btrfs_stack_device_id(_super->dev_item);
+   if (devid != device->devid)
+   goto error_brelse;
+
+   if (memcmp(device->uuid, disk_super->dev_item.uuid,
+  BTRFS_UUID_SIZE))
+   goto error_brelse;
+
+   device->generation = btrfs_super_generation(disk_super);
+
+   if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) {
+   device->writeable = 0;
+   fs_devices->seeding = 1;
+   } else {
+   device->writeable = !bdev_read_only(bdev);
+   }
+
+   q = bdev_get_queue(bdev);
+   if (blk_queue_discard(q))
+   device->can_discard = 1;
+   if (!blk_queue_nonrot(q))
+   fs_devices->rotating = 1;
+
+   device->bdev = bdev;
+   device->in_fs_metadata = 0;
+   device->mode = flags;
+
+   fs_devices->open_devices++;
+   if (device->writeable &&
+   device->devid != BTRFS_DEV_REPLACE_DEVID) {
+   fs_devices->rw_devices++;
+   list_add(>dev_alloc_list,
+_devices->alloc_list);
+   }
+   brelse(bh);
+
+   return 0;
+
+error_brelse:
+   brelse(bh);
+   blkdev_put(bdev, flags);
+
+   return -EINVAL;
+}
+
  /*
   * Add new device to list of registered devices
   *
@@ -978,69 +1045,14 @@ static int __btrfs_open_devices(struct btrfs_fs_devices 
*fs_devices,
 flags |= FMODE_EXCL;

 list_for_each_entry(device, head, dev_list) {
-   struct request_queue *q;
-   struct block_device *bdev;
-   struct buffer_head *bh;
-   struct btrfs_super_block *disk_super;
-   u64 devid;
-
-   if (device->bdev)
-   continue;
-   if (!device->name)
-   continue;
-
 /* Just open everything we can; ignore failures here */
-   if (btrfs_get_bdev_and_sb(device->name->str, flags, holder, 1,
-   , ))
+   ret = btrfs_open_one_device(fs_devices, device, flags, holder);
+   if (ret)
 continue;

-   disk_super = (struct btrfs_super_block *)bh->b_data;
-   devid = btrfs_stack_device_id(_super->dev_item);
-   if (devid != device->devid)
-   goto error_brelse;
-
-   if (memcmp(device->uuid, disk_super->dev_item.uuid,
-  BTRFS_UUID_SIZE))
-   goto error_brelse;
-
-   device->generation = btrfs_super_generation(disk_super);
-
-   if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) {
-   device->writeable = 0;
-   fs_devices->seeding = 1;
-   } else {
-   device->writeable = !bdev_read_only(bdev);
-   }
-
-   q = bdev_get_queue(bdev);
-   if (blk_queue_discard(q))
-   device->can_discard = 1;
-   if (!blk_queue_nonrot(q))
-   fs_devices->rotating = 1;
-
-   device->bdev = 

Re: [PATCH 3/3] btrfs-progs: check: exclude extents of tree blocks and extent items

2017-11-27 Thread Qu Wenruo


On 2017年11月27日 11:13, Su Yue wrote:
> Commit d17d6663c99c ("btrfs-progs: lowmem check: Fix regression which
> screws up extent allocator") removes pin_metadata_blocks() from
> lowmem repair.
> So we have to find another way to exclude extents which should be
> occupied by tree blocks.

First thing first, any tree block which is referred by will not be freed.

Unless some special case, like extent tree initialization which clears
the whole extent tree so we can't check if one tree block should be
freed using extent tree, there is no need to explicitly pin existing
tree blocks.

Their creation/freeing is already handled well.

Please explain the reason why you need to pin extents first.

Thanks,
Qu

> 
> Modify pin_down_tree_blocks() only for code reuse.
> So behavior of pin_metadata_blocks() which works with option
> 'init-extent-tree' is not influenced.
> 
> Introduce exclude_blocks_and_extent_items() to mark extents of all tree
> blocks dirty in fs_info->excluded_extents. It also calls
> exclude_extent_items() to exclude extent_items in extent tree.
> 
> Signed-off-by: Su Yue 
> ---
>  cmds-check.c | 122 
> ++-
>  1 file changed, 112 insertions(+), 10 deletions(-)
> 
> diff --git a/cmds-check.c b/cmds-check.c
> index c7b570bef9c3..f39285c5a3c2 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -13351,6 +13351,7 @@ out:
>   return err;
>  }
>  
> +static int exclude_blocks_and_extent_items(struct btrfs_fs_info *fs_info);
>  /*
>   * Low memory usage version check_chunks_and_extents.
>   */
> @@ -13363,12 +13364,22 @@ static int check_chunks_and_extents_v2(struct 
> btrfs_fs_info *fs_info)
>   struct btrfs_root *root1;
>   struct btrfs_root *root;
>   struct btrfs_root *cur_root;
> + struct extent_io_tree excluded_extents;
>   int err = 0;
>   int ret;
>  
>   root = fs_info->fs_root;
>  
>   if (repair) {
> + extent_io_tree_init(_extents);
> + fs_info->excluded_extents = _extents;
> + ret = exclude_blocks_and_extent_items(fs_info);
> + if (ret) {
> + error("failed to exclude tree blocks and extent items");
> + return ret;
> + }
> + reset_cached_block_groups(fs_info);
> +
>   trans = btrfs_start_transaction(fs_info->extent_root, 1);
>   if (IS_ERR(trans)) {
>   error("failed to start transaction before check");
> @@ -13437,6 +13448,8 @@ out:
>   err |= ret;
>   else
>   err &= ~BG_ACCOUNTING_ERROR;
> + extent_io_tree_cleanup(_extents);
> + fs_info->excluded_extents = NULL;
>   }
>  
>   if (trans)
> @@ -13534,40 +13547,106 @@ init:
>   return 0;
>  }
>  
> -static int pin_down_tree_blocks(struct btrfs_fs_info *fs_info,
> - struct extent_buffer *eb, int tree_root)
> +static int exclude_extent_items(struct btrfs_fs_info *fs_info,
> + struct extent_io_tree *tree)
> +{
> + struct btrfs_root *root = fs_info->extent_root;
> + struct btrfs_key key;
> + struct btrfs_path path;
> + struct extent_buffer *eb;
> + int slot;
> + int ret;
> + u64 start;
> + u64 end;
> +
> + ASSERT(tree);
> + btrfs_init_path();
> + key.objectid = 0;
> + key.type = 0;
> + key.offset = 0;
> +
> + ret = btrfs_search_slot(NULL, root, , , 0, 0);
> + if (ret < 0)
> + goto out;
> +
> + while (1) {
> + eb = path.nodes[0];
> + slot = path.slots[0];
> + btrfs_item_key_to_cpu(eb, , slot);
> + if (key.type != BTRFS_EXTENT_ITEM_KEY &&
> + key.type != BTRFS_METADATA_ITEM_KEY)
> + goto next;
> + start = key.objectid;
> + if (key.type == BTRFS_EXTENT_ITEM_KEY)
> + end = start + key.offset;
> + else
> + end = start + fs_info->nodesize;
> +
> + set_extent_dirty(tree, start, end - 1);
> +next:
> + ret = btrfs_next_item(root, );
> + if (ret > 0) {
> + ret = 0;
> + goto out;
> + }
> + if (ret < 0)
> + goto out;
> + }
> +out:
> + if (ret)
> + error("failed to exclude extent items");
> + btrfs_release_path();
> + return ret;
> +}
> +
> +static int traverse_tree_blocks(struct btrfs_fs_info *fs_info,
> + struct extent_buffer *eb, int tree_root,
> + int pin)
>  {
>   struct extent_buffer *tmp;
>   struct btrfs_root_item *ri;
>   struct btrfs_key key;
> + struct extent_io_tree *tree;
>   u64 bytenr;
>   int level = btrfs_header_level(eb);
>   int nritems;
>   int ret;
>   int i;
> +  

Re: [PATCH 2/3] btrfs-progs: check: record returned errors after walk_down_tree_v2()

2017-11-27 Thread Qu Wenruo


On 2017年11月27日 11:13, Su Yue wrote:
> In lowmem mode with '--repair', check_chunks_and_extents_v2()
> will fix accounting in block groups and clear the error
> bit BG_ACCOUNTING_ERROR.
> However, return value of check_btrfs_root() is 0 either 1 instead of
> error bits.
> 
> If extent tree is on error, lowmem repair always prints error and
> returns nonzero value even the filesystem is fine after repair.
> 
> So let @err contains bits after walk_down_tree_v2().
> 
> Signed-off-by: Su Yue 
> ---
>  cmds-check.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/cmds-check.c b/cmds-check.c
> index 3a72f8e3877d..c7b570bef9c3 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -6607,7 +6607,7 @@ static int check_btrfs_root(struct btrfs_trans_handle 
> *trans,
>   ret = walk_down_tree_v2(trans, root, , , ,
>   ext_ref, check_all);
>  
> - err |= !!ret;
> + err |= ret;

Since ret from walk_down_tree_v2() can be plus, minus and 0, using bit
OR here is quite confusing.

For example, if ret is -1, it will makes all bits of err to 1.

Please don't mix bit operation with possible minus value.
(And feel free to fix similar problems)

Thanks,
Qu

>  
>   /* if ret is negative, walk shall stop */
>   if (ret < 0) {
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/3] btrfs-progs: check: release path in repair_extent_data_item()

2017-11-27 Thread Qu Wenruo


On 2017年11月27日 11:13, Su Yue wrote:
> In repair_extent_data_item(), path is not be released if some
> errors occurs which causes extent buffer leak.
> 
> So release path in end of the function.
> 
> Signed-off-by: Su Yue 
> ---
>  cmds-check.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/cmds-check.c b/cmds-check.c
> index e746ee7b281d..3a72f8e3877d 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -11974,7 +11974,6 @@ static int repair_extent_data_item(struct 
> btrfs_trans_handle *trans,
>   btrfs_mark_buffer_dirty(eb);
>   ret = btrfs_update_block_group(trans, extent_root, disk_bytenr,
>  num_bytes, 1, 0);
> - btrfs_release_path();

Here btrfs_release_path() should not be removed.

Lines below this, btrfs_inc_extent_ref() is called.
In which we will alloc a new path and modify extent tree.

Thanks to the fact that there is no tree locking implemented in
btrfs-progs, it won't cause dead lock, but it's a good habit to make
path allocation deadlock free.

So please keep the path release here.

(And if you find some code holding conflicting path on the same tree,
feel free to fix them, even it won't cause real problem in btrfs right now)

>   }
>  
>   if (nrefs->full_backref[0])
> @@ -11998,6 +11997,7 @@ static int repair_extent_data_item(struct 
> btrfs_trans_handle *trans,
>  
>   err &= ~BACKREF_MISSING;
>  out:
> + btrfs_release_path();

On the other hand, an empty path can be released as many times as you wish.
So double releasing the path in out branch is not a problem.

Thanks,
Q

>   if (ret)
>   error("can't repair root %llu extent data item[%llu %llu]",
> root->objectid, disk_bytenr, num_bytes);
> 



signature.asc
Description: OpenPGP digital signature


Re: How about adding an ioctl to convert a directory to a subvolume?

2017-11-27 Thread Qu Wenruo


On 2017年11月27日 17:41, Lu Fengqi wrote:
> Hi all,
> 
> As we all know, under certain circumstances, it is more appropriate to
> create some subvolumes rather than keep everything in the same
> subvolume. As the condition of demand change, the user may need to
> convert a previous directory to a subvolume. For this reason,how about
> adding an ioctl to convert a directory to a subvolume?

The idea seems interesting.

However in my opinion, this can be done quite easily in (mostly) user
space, thanks to btrfs support of relink.

The method from Hugo or Chris is quite good, maybe it can be enhanced a
little.

Use the following layout as an example:

root_subv
|- subvolume_1
|  |- dir_1
|  |  |- file_1
|  |  |- file_2
|  |- dir_2
| |- file_3
|- subvolume_2

If we want to convert dir_1 into subvolume, we can do it like:

1) Create a temporary readonly snapshot of parent subvolume containing
   the desired dir
   # btrfs sub snapshot -r root_subv/subvolume_1 \
 root_subv/tmp_snapshot_1

2) Create a new subvolume, as destination.
   # btrfs sub create root_subv/tmp_dest/

3) Copy the content and sync the fs
   Use of reflink is necessary.
   # cp -r --reflink=always root_subv/tmp_snapshot_1/dir_1 \
 root_subv/tmp_dest
   # btrfs sync root_subv/tmp_dest

4) Delete temporary readonly snapshot
   # btrfs subvolume delete root_subv/tmp_snapshot_1

5) Remove the source dir
   # rm -rf root_subv/subvolume_1/dir_1

5) Create a final destination snapshot of "root_subv/temporary_dest"
   # btrfs subvolume snapshot root_subv/tmp_dest \
 root_subv/subvolume_1/dir_1

6) Remove the temporary destination
   # btrfs subvolume delete root_subv/tmp_dest


The main challenge is in step 3).
In fact above method can only handle normal dir/files.
If there is another subvolume inside the desired dir, current "cp -r" is
a bad idea.
We need to skip subvolume dir, and create snapshot for it.

But it's quite easy to write a user space program to handle it.
Maybe using "find" command can already handle it well.

Anyway, doing it in user space is already possible and much easier than
doing it in kernel.

> 
> Users can convert by the scripts mentioned in this
> thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is
> it easier to use the off-the-shelf btrfs subcommand?

If you just want to integrate the functionality into btrfs-progs, maybe
it's possible.

But if you insist in providing a new ioctl for this, I highly doubt if
the extra hassle is worthy.

> 
> After an initial consideration, our implementation is broadly divided
> into the following steps:
> 1. Freeze the filesystem or set the subvolume above the source directory
> to read-only;

Not really need to freeze the whole fs.
Just create a readonly snapshot of the parent subvolume which contains
the dir.
That's how snapshot is designed for.

> 2. Perform a pre-check, for example, check if a cross-device link
> creation during the conversion;

This can be done in-the-fly.
As the check is so easy (only needs to check if the inode number is 256).
We only need a mid-order iteration of the source dir (in temporary
snapshot), and for normal file, use reflink.
For subvolume dir, create a snapshot for it.

And for such iteration, a python script less than 100 lines would be
sufficient.

Thanks,
Qu

> 3. Perform conversion, such as creating a new subvolume and moving the
> contents of the source directory;
> 4. Thaw the filesystem or restore the subvolume writable property.
> 
> In fact, I am not so sure whether this use of freeze is appropriate
> because the source directory the user needs to convert may be located
> at / or /home and this pre-check and conversion process may take a long
> time, which can lead to some shell and graphical application suspended.
> 
> Please give your comments if any.
> 



signature.asc
Description: OpenPGP digital signature


Re: FAQ / encryption / error handling?

2017-11-27 Thread Dmitrii Tcvetkov
On Mon, 27 Nov 2017 09:06:12 +0100
Daniel Pocock  wrote:

> Hi all,
> 
> The FAQ has a couple of sections on encryption (general and dm-crypt)
> 
> One thing that isn't explained there: if you create multiple encrypted
> volumes (e.g. using dm-crypt) and use Btrfs to combine them into
> RAID1, how does error recovery work when a read operation returns
> corrupted data?
> 
> Without encryption, reading from one disk would give a checksum
> mismatch and Btrfs would read from the other disk to (hopefully) get
> a good copy of the data.
> 
> With this encryption scenario, the failure would potentially be
> detected in the decryption layer code and instead of returning bad
> data to Btrfs, it would return some error code. In that case, will
> Btrfs attempt to read from the other volume and allow the application
> to proceed as if nothing was wrong?
> 
> Regards,
> 
> Daniel

Default (aes-xts-plain64) dm-crypt setup can't verify integrity
of encrypted block and in case of silent corruption will decrypt it to
garbage which btrfs will catch. In case of AEAD encryption
(dm-crypt plus dm-integrity) it can verify integrity itself but I'm not
sure right now which exact error it returns to upper layer as I didn't
used it yet.

I use btrfs raid1 on top of LVM on top of dm-crypt devices and
it handled bad blocks on physical devices normally (there was a burst of
about 900 reallocates on one device which btrfs caught and fixed).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


How about adding an ioctl to convert a directory to a subvolume?

2017-11-27 Thread Lu Fengqi
Hi all,

As we all know, under certain circumstances, it is more appropriate to
create some subvolumes rather than keep everything in the same
subvolume. As the condition of demand change, the user may need to
convert a previous directory to a subvolume. For this reason,how about
adding an ioctl to convert a directory to a subvolume?

Users can convert by the scripts mentioned in this
thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is
it easier to use the off-the-shelf btrfs subcommand?

After an initial consideration, our implementation is broadly divided
into the following steps:
1. Freeze the filesystem or set the subvolume above the source directory
to read-only;
2. Perform a pre-check, for example, check if a cross-device link
creation during the conversion;
3. Perform conversion, such as creating a new subvolume and moving the
contents of the source directory;
4. Thaw the filesystem or restore the subvolume writable property.

In fact, I am not so sure whether this use of freeze is appropriate
because the source directory the user needs to convert may be located
at / or /home and this pre-check and conversion process may take a long
time, which can lead to some shell and graphical application suspended.

Please give your comments if any.

-- 
Thanks,
Lu


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


FAQ / encryption / error handling?

2017-11-27 Thread Daniel Pocock

Hi all,

The FAQ has a couple of sections on encryption (general and dm-crypt)

One thing that isn't explained there: if you create multiple encrypted
volumes (e.g. using dm-crypt) and use Btrfs to combine them into RAID1,
how does error recovery work when a read operation returns corrupted data?

Without encryption, reading from one disk would give a checksum mismatch
and Btrfs would read from the other disk to (hopefully) get a good copy
of the data.

With this encryption scenario, the failure would potentially be detected
in the decryption layer code and instead of returning bad data to Btrfs,
it would return some error code.  In that case, will Btrfs attempt to
read from the other volume and allow the application to proceed as if
nothing was wrong?

Regards,

Daniel

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html