Re: [PATCH 1/2] btrfs-progs: check: Skip data csum verification for metadata dump

2018-04-02 Thread Qu Wenruo


On 2018年04月03日 13:51, Nikolay Borisov wrote:
> 
> 
> On  3.04.2018 08:39, Qu Wenruo wrote:
>> For metadata dump (fs restored by btrfs-image), since no data is
>> restored check sum verification will definitely report error.
>>
>> Add such check in check_csums() and prompt for user.
>>
>> Issue: #103
>> Signed-off-by: Qu Wenruo 
> 
> Reviewed-by: Nikolay Borisov 
> 
>> ---
>>  check/main.c | 13 -
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/check/main.c b/check/main.c
>> index 891a6d797756..15c94543946a 100644
>> --- a/check/main.c
>> +++ b/check/main.c
>> @@ -5601,6 +5601,7 @@ static int check_csums(struct btrfs_root *root)
>>  int ret;
>>  u64 data_len;
>>  unsigned long leaf_offset;
>> +bool verify_csum = !!check_data_csum;
> 
> nit: Why don't you convert check_data_csum to bool in this patch.

Next thing to do.

> My
> recent testing in !! usage showed that it's generating quite a number of
> additional instructions.

That's pretty interesting.

Would you please shared some more info about this?

Thanks,
Qu

> Not that it will have big impact in this case
> though.
> 
>>  
>>  root = root->fs_info->csum_root;
>>  if (!extent_buffer_uptodate(root->node)) {
>> @@ -5623,6 +5624,16 @@ static int check_csums(struct btrfs_root *root)
>>  path.slots[0]--;
>>  ret = 0;
>>  
>> +/*
>> + * For metadata dump (btrfs-image) all data is wiped so verifying data
>> + * csum is meaningless and will always report csum error.
>> + */
>> +if (check_data_csum && (btrfs_super_flags(root->fs_info->super_copy) &
>> +(BTRFS_SUPER_FLAG_METADUMP | BTRFS_SUPER_FLAG_METADUMP_V2))) {
>> +printf("skip data csum verification for metadata dump\n");
>> +verify_csum = false;
>> +}
>> +
>>  while (1) {
>>  if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
>>  ret = btrfs_next_leaf(root, );
>> @@ -5644,7 +5655,7 @@ static int check_csums(struct btrfs_root *root)
>>  
>>  data_len = (btrfs_item_size_nr(leaf, path.slots[0]) /
>>csum_size) * root->fs_info->sectorsize;
>> -if (!check_data_csum)
>> +if (!verify_csum)
>>  goto skip_csum_check;
>>  leaf_offset = btrfs_item_ptr_offset(leaf, path.slots[0]);
>>  ret = check_extent_csums(root, key.offset, data_len,
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs-progs: check: Skip data csum verification for metadata dump

2018-04-02 Thread Nikolay Borisov


On  3.04.2018 08:39, Qu Wenruo wrote:
> For metadata dump (fs restored by btrfs-image), since no data is
> restored check sum verification will definitely report error.
> 
> Add such check in check_csums() and prompt for user.
> 
> Issue: #103
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  check/main.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/check/main.c b/check/main.c
> index 891a6d797756..15c94543946a 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -5601,6 +5601,7 @@ static int check_csums(struct btrfs_root *root)
>   int ret;
>   u64 data_len;
>   unsigned long leaf_offset;
> + bool verify_csum = !!check_data_csum;

nit: Why don't you convert check_data_csum to bool in this patch. My
recent testing in !! usage showed that it's generating quite a number of
additional instructions. Not that it will have big impact in this case
though.

>  
>   root = root->fs_info->csum_root;
>   if (!extent_buffer_uptodate(root->node)) {
> @@ -5623,6 +5624,16 @@ static int check_csums(struct btrfs_root *root)
>   path.slots[0]--;
>   ret = 0;
>  
> + /*
> +  * For metadata dump (btrfs-image) all data is wiped so verifying data
> +  * csum is meaningless and will always report csum error.
> +  */
> + if (check_data_csum && (btrfs_super_flags(root->fs_info->super_copy) &
> + (BTRFS_SUPER_FLAG_METADUMP | BTRFS_SUPER_FLAG_METADUMP_V2))) {
> + printf("skip data csum verification for metadata dump\n");
> + verify_csum = false;
> + }
> +
>   while (1) {
>   if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
>   ret = btrfs_next_leaf(root, );
> @@ -5644,7 +5655,7 @@ static int check_csums(struct btrfs_root *root)
>  
>   data_len = (btrfs_item_size_nr(leaf, path.slots[0]) /
> csum_size) * root->fs_info->sectorsize;
> - if (!check_data_csum)
> + if (!verify_csum)
>   goto skip_csum_check;
>   leaf_offset = btrfs_item_ptr_offset(leaf, path.slots[0]);
>   ret = check_extent_csums(root, key.offset, data_len,
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] btrfs: Validate child tree block's level and first key

2018-04-02 Thread Qu Wenruo


On 2018年04月02日 18:47, Qu Wenruo wrote:
> 
> 
> On 2018年03月28日 23:49, David Sterba wrote:
>> On Tue, Mar 27, 2018 at 08:44:19PM +0800, Qu Wenruo wrote:
>>> We have several reports about node pointer points to incorrect child
>>> tree blocks, which could have even wrong owner and level but still with
>>> valid generation and checksum.
>>>
>>> Although btrfs check could handle it and print error message like:
>>> leaf parent key incorrect 60670574592
>>>
>>> Kernel doesn't have enough check on this type of corruption correctly.
>>> At least add such check to read_tree_block() and btrfs_read_buffer(),
>>> where we need two new parameters @level and @first_key to verify the
>>> child tree block.
>>>
>>> The new @level check is mandatory and all call sites are already
>>> modified to extract expected level from its call chain.
>>>
>>> While @first_key is optional, the following call sites are skipping such
>>> check:
>>> 1) Root node/leaf
>>>As ROOT_ITEM doesn't contain the first key, skip @first_key check.
>>> 2) Direct backref
>>>Only parent bytenr and level is known and we need to resolve the key
>>>all by ourselves, skip @first_key check.
>>>
>>> Another note of this verification is, it needs extra info from nodeptr
>>> or ROOT_ITEM, so it can't fit into current tree-checker framework, which
>>> is limited to node/leaf boundary.
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>> changelog:
>>> v2:
>>>   Make @level check mandatory, suggesed by Jeff and Nikolay.
>>>   Change parameter order as @level is now mandatory, put it in front of
>>>   @first_key.
>>>   Change verify_parent_level() to verify_key_level() to avoid confusion
>>>   on the @level parameter.
>>>   Add btrfs_error() output for CONFIG_BTRFS_DEBUG to help debugging.
>>
>> That's much better overall, thanks. Adding it to next.
> 
> Nikolay reported a case where @first_key check seems to cause false alert.
> (Although my xfstests check hasn't exposed it yet)
> 
> Please discard this patch since it has the possibility to cause false
> alert for btrfs core functionality.

It turns out to be a racing related case in btree operations like
push_leaf_right().

So it indeed brings some clue here.
(Although still pretty hard to reproduce)

Thanks,
Qu

> 
> Thanks,
> Qu
> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 



signature.asc
Description: OpenPGP digital signature


[PATCH 2/2] btrfs-progs: tests/fsck: Add test case to check if btrfs check can skip data csum verfication for metadata dump

2018-04-02 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 .../031-metadatadump-check-data-csum/test.sh   | 32 ++
 1 file changed, 32 insertions(+)
 create mode 100755 tests/fsck-tests/031-metadatadump-check-data-csum/test.sh

diff --git a/tests/fsck-tests/031-metadatadump-check-data-csum/test.sh 
b/tests/fsck-tests/031-metadatadump-check-data-csum/test.sh
new file mode 100755
index ..9f14c4122f4b
--- /dev/null
+++ b/tests/fsck-tests/031-metadatadump-check-data-csum/test.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+# To check if "btrfs check" can detect metadata dump (restored by btrfs-iamge)
+# and ignore --check-data-csum option
+
+source "$TEST_TOP/common"
+
+check_prereq btrfs
+check_prereq mkfs.btrfs
+check_prereq btrfs-image
+setup_root_helper
+prepare_test_dev
+ 
+tmp=$(mktemp --tmpdir btrfs-progs-fsck.XXX)
+
+run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -f "$TEST_DEV"
+run_check_mount_test_dev
+
+run_check $SUDO_HELPER dd if=/dev/urandom of="$TEST_MNT/file" bs=4k count=16
+run_check_umount_test_dev
+
+run_check $SUDO_HELPER "$TOP/btrfs-image" "$TEST_DEV" "$tmp.image"
+
+# use prepare_test_dev() to wipe all existing data on $TEST_DEV
+# so there is no way that restored image could have mathcing data csum
+prepare_test_dev
+
+run_check $SUDO_HELPER "$TOP/btrfs-image" -r "$tmp.image" "$TEST_DEV"
+
+# Should not report any error
+run_check "$TOP/btrfs" check --check-data-csum "$TEST_DEV"
+
+rm -rf -- "$tmp*"
-- 
2.16.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs-progs: check: Skip data csum verification for metadata dump

2018-04-02 Thread Qu Wenruo
For metadata dump (fs restored by btrfs-image), since no data is
restored check sum verification will definitely report error.

Add such check in check_csums() and prompt for user.

Issue: #103
Signed-off-by: Qu Wenruo 
---
 check/main.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/check/main.c b/check/main.c
index 891a6d797756..15c94543946a 100644
--- a/check/main.c
+++ b/check/main.c
@@ -5601,6 +5601,7 @@ static int check_csums(struct btrfs_root *root)
int ret;
u64 data_len;
unsigned long leaf_offset;
+   bool verify_csum = !!check_data_csum;
 
root = root->fs_info->csum_root;
if (!extent_buffer_uptodate(root->node)) {
@@ -5623,6 +5624,16 @@ static int check_csums(struct btrfs_root *root)
path.slots[0]--;
ret = 0;
 
+   /*
+* For metadata dump (btrfs-image) all data is wiped so verifying data
+* csum is meaningless and will always report csum error.
+*/
+   if (check_data_csum && (btrfs_super_flags(root->fs_info->super_copy) &
+   (BTRFS_SUPER_FLAG_METADUMP | BTRFS_SUPER_FLAG_METADUMP_V2))) {
+   printf("skip data csum verification for metadata dump\n");
+   verify_csum = false;
+   }
+
while (1) {
if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
ret = btrfs_next_leaf(root, );
@@ -5644,7 +5655,7 @@ static int check_csums(struct btrfs_root *root)
 
data_len = (btrfs_item_size_nr(leaf, path.slots[0]) /
  csum_size) * root->fs_info->sectorsize;
-   if (!check_data_csum)
+   if (!verify_csum)
goto skip_csum_check;
leaf_offset = btrfs_item_ptr_offset(leaf, path.slots[0]);
ret = check_extent_csums(root, key.offset, data_len,
-- 
2.16.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [wiki] Please clarify how to check whether barriers are properly implemented in hardware

2018-04-02 Thread Adam Borowski
On Mon, Apr 02, 2018 at 10:07:01PM +, Hugo Mills wrote:
> On Mon, Apr 02, 2018 at 06:03:00PM -0400, Fedja Beader wrote:
> > Is there some testing utility for this? Is there a way to extract this/tell 
> > with a high enough certainty from datasheets/other material before purchase?
> 
>Given that not implementing barriers is basically a bug in the
> hardware [for SATA or SAS], I don't think anyone's going to specify
> anything other than "fully suppors barriers" in their datasheets.
> 
>I don't know of a testing tool. It may not be obvious that barriers
> aren't being honoured without doing things like power-failure testing.

And you'd need to do a lot of power-cycling during writes, with various
write patterns -- as unless you have a case of "let's lie about barriers to
make benchmarks better than the competition" where barriers are consistently
absent, it might be a genuine bug in a well-meaning controller that at least
tries but sometimes fails to.  The intentional case is usually easy to
detect -- but just wait go get volkswagenized. :/


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ ... what's the frequency of that 5V DC?
⠈⠳⣄
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of RAID5/6

2018-04-02 Thread Zygo Blaxell
On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> > On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > > I thought that a possible solution is to create BG with different
> > number of data disks. E.g. supposing to have a raid 6 system with 6
> > disks, where 2 are parity disk; we should allocate 3 BG
> > > 
> > > BG #1: 1 data disk, 2 parity disks
> > > BG #2: 2 data disks, 2 parity disks,
> > > BG #3: 4 data disks, 2 parity disks
> > > 
> > > For simplicity, the disk-stripe length is assumed = 4K.
> > > 
> > > So If you have a write with a length of 4 KB, this should be placed
> > in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> > should be placed in in BG#2, then in BG#1.
> > > 
> > > This would avoid space wasting, even if the fragmentation will
> > increase (but shall the fragmentation matters with the modern solid
> > state disks ?).
> 
> I don't really see why this would increase fragmentation or waste space.

Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
remaining 2 blocks).  It also flips the usual order of "determine size
of extent, then allocate space for it" which might require major surgery
on the btrfs allocator to implement.

If we round that write up to 8 blocks (so we can put both pieces in
BG #3), it degenerates into the "pretend partially filled RAID stripes
are completely full" case, something like what ssd_spread already does.
That trades less file fragmentation for more free space fragmentation.

> The extent size is determined before allocation anyway, all that changes
> in this proposal is where those small extents ultimately land on the disk.
> 
> If anything, it might _reduce_ fragmentation since everything in BG #1
> and BG #2 will be of uniform size.
> 
> It does solve write hole (one transaction per RAID stripe).
> 
> > Also, you're still going to be wasting space, it's just that less space will
> > be wasted, and it will be wasted at the chunk level instead of the block
> > level, which opens up a whole new set of issues to deal with, most
> > significantly that it becomes functionally impossible without brute-force
> > search techniques to determine when you will hit the common-case of -ENOSPC
> > due to being unable to allocate a new chunk.
> 
> Hopefully the allocator only keeps one of each size of small block groups
> around at a time.  The allocator can take significant short cuts because
> the size of every extent in the small block groups is known (they are
> all the same size by definition).
> 
> When a small block group fills up, the next one should occupy the
> most-empty subset of disks--which is the opposite of the usual RAID5/6
> allocation policy.  This will probably lead to "interesting" imbalances
> since there are now two allocators on the filesystem with different goals
> (though it is no worse than -draid5 -mraid1, and I had no problems with
> free space when I was running that).
> 
> There will be an increase in the amount of allocated but not usable space,
> though, because now the amount of free space depends on how much data
> is batched up before fsync() or sync().  Probably best to just not count
> any space in the small block groups as 'free' in statvfs terms at all.
> 
> There are a lot of variables implied there.  Without running some
> simulations I have no idea if this is a good idea or not.
> 
> > > Time to time, a re-balance should be performed to empty the BG #1,
> > and #2. Otherwise a new BG should be allocated.
> 
> That shouldn't be _necessary_ (the filesystem should just allocate
> whatever BGs it needs), though it will improve storage efficiency if it
> is done.
> 
> > > The cost should be comparable to the logging/journaling (each
> > data shorter than a full-stripe, has to be written two times); the
> > implementation should be quite easy, because already NOW btrfs support
> > BG with different set of disks.
> 




signature.asc
Description: PGP signature


Re: [PATCH] bitmap: fix memset optimization on big-endian systems

2018-04-02 Thread Linus Torvalds
On Mon, Apr 2, 2018 at 4:37 PM, Linus Torvalds
 wrote:
>
> We should probably have at least switched it to "unsigned long int"

I meant just "unsigned int", of course.

Right now we occasionally have a silly 64-bit field just for a couple of flags.

Of course, we do have cases where 64-bit architectures happily end up
using more than 32 bits too, so the "unsigned long" is occasionally
useful. But it's rare enough that it probably wasn't the right thing
to do.

Water under the bridge. Part of it is due to another historical
accident: the alpha port was one of the early ports, and it didn't do
atomic byte accesses at all.

So we had multiple issues that all conspired to "unsigned long arrays
are the natural for atomic bit operations" even though they have this
really annoying byte order issue.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] bitmap: fix memset optimization on big-endian systems

2018-04-02 Thread Linus Torvalds
On Mon, Apr 2, 2018 at 3:58 PM, Omar Sandoval  wrote:
>
> Commit 2a98dc028f91 introduced an optimization to bitmap_{set,clear}()
> which uses memset() when the start and length are constants aligned to a
> byte. This is wrong on big-endian systems;

Ugh, yes.

In retrospect, I do wish I had just made the bitmap types be
byte-based, but we had strong reasons for those "unsigned long" array
semantics.

The traditional problem - and the reason why it is byte-order
dependent - was that we often mix bitmap operations with "unsigned
long flags" style operations.

We should probably have at least switched it to "unsigned long int"
with the whole 64-bit transition, but never did even that, so the
bitmap format is not just byte order dependent, but dependent on the
size of "long".

I guess the "unsigned long flag" issue still exists in several places,
and we're stuck with it, probably forever. Think page flags, but also
various networking flags etc.

You'd *think* they use bitmap operations consistently, but they
definitely mix it with direct accesses to the flags field, eg the page
flags are usually done using the PG_xyz bit numbers, but occasionally
you have code that checks multiple independent bits at once, doing
things like

  #define PAGE_FLAGS_PRIVATE  \
  (1UL << PG_private | 1UL << PG_private_2)

  static inline int page_has_private(struct page *page)
  {
  return !!(page->flags & PAGE_FLAGS_PRIVATE);
  }

so the bits really are _exposed_ as being encoded as bits in an unsigned long.

Your patch is obviously correct, and we just need to make sure people
*really* understand that bitmaps are arrays of unsigned long, and byte
(and bit) order is a real thing.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] bitmap: fix memset optimization on big-endian systems

2018-04-02 Thread Omar Sandoval
On Mon, Apr 02, 2018 at 03:58:31PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Commit 2a98dc028f91 introduced an optimization to bitmap_{set,clear}()
> which uses memset() when the start and length are constants aligned to a
> byte. This is wrong on big-endian systems; our bitmaps are arrays of
> unsigned long, so bit n is not at byte n / 8 in memory. This was caught
> by the Btrfs selftests, but the bitmap selftests also fail when run on a
> big-endian machine.
> 
> We can still use memset if the start and length are aligned to an
> unsigned long, so do that on big-endian. The same problem applies to the
> memcmp in bitmap_equal(), so fix it there, too.
> 
> Fixes: 2a98dc028f91 ("include/linux/bitmap.h: turn bitmap_set and 
> bitmap_clear into memset when possible")
> Fixes: 2c6deb01525a ("bitmap: use memcmp optimisation in more situations")
> Cc: sta...@kernel.org

This should be sta...@vger.kernel.org, of course
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] bitmap: fix memset optimization on big-endian systems

2018-04-02 Thread Omar Sandoval
From: Omar Sandoval 

Commit 2a98dc028f91 introduced an optimization to bitmap_{set,clear}()
which uses memset() when the start and length are constants aligned to a
byte. This is wrong on big-endian systems; our bitmaps are arrays of
unsigned long, so bit n is not at byte n / 8 in memory. This was caught
by the Btrfs selftests, but the bitmap selftests also fail when run on a
big-endian machine.

We can still use memset if the start and length are aligned to an
unsigned long, so do that on big-endian. The same problem applies to the
memcmp in bitmap_equal(), so fix it there, too.

Fixes: 2a98dc028f91 ("include/linux/bitmap.h: turn bitmap_set and bitmap_clear 
into memset when possible")
Fixes: 2c6deb01525a ("bitmap: use memcmp optimisation in more situations")
Cc: sta...@kernel.org
Reported-by: "Erhard F." 
Signed-off-by: Omar Sandoval 
---
 include/linux/bitmap.h | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 5f11fbdc27f8..1ee46f492267 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -302,12 +302,20 @@ static inline void bitmap_complement(unsigned long *dst, 
const unsigned long *sr
__bitmap_complement(dst, src, nbits);
 }
 
+#ifdef __LITTLE_ENDIAN
+#define BITMAP_MEM_ALIGNMENT 8
+#else
+#define BITMAP_MEM_ALIGNMENT (8 * sizeof(unsigned long))
+#endif
+#define BITMAP_MEM_MASK (BITMAP_MEM_ALIGNMENT - 1)
+
 static inline int bitmap_equal(const unsigned long *src1,
const unsigned long *src2, unsigned int nbits)
 {
if (small_const_nbits(nbits))
return !((*src1 ^ *src2) & BITMAP_LAST_WORD_MASK(nbits));
-   if (__builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+   if (__builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+   IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
return !memcmp(src1, src2, nbits / 8);
return __bitmap_equal(src1, src2, nbits);
 }
@@ -358,8 +366,10 @@ static __always_inline void bitmap_set(unsigned long *map, 
unsigned int start,
 {
if (__builtin_constant_p(nbits) && nbits == 1)
__set_bit(start, map);
-   else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
-__builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+   else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
+IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
+__builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
memset((char *)map + start / 8, 0xff, nbits / 8);
else
__bitmap_set(map, start, nbits);
@@ -370,8 +380,10 @@ static __always_inline void bitmap_clear(unsigned long 
*map, unsigned int start,
 {
if (__builtin_constant_p(nbits) && nbits == 1)
__clear_bit(start, map);
-   else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
-__builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+   else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
+IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
+__builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
memset((char *)map + start / 8, 0, nbits / 8);
else
__bitmap_clear(map, start, nbits);
-- 
2.16.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of RAID5/6

2018-04-02 Thread Zygo Blaxell
On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
> > [...]
> > > It is possible to combine writes from a single transaction into full
> > > RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> > > Any partially-filled stripe is effectively read-only and the space within
> > > it is inaccessible until all data within the stripe is overwritten,
> > > deleted, or relocated by balance.
> > > 
> > > btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
> > > update, but that has a significant write magnification effect (and before
> > > kernel 4.14, non-trivial CPU load as well).
> > > 
> > > btrfs could also just allocate the full stripe to an extent, but emit
> > > only extent ref items for the blocks that are in use.  No fragmentation
> > > but lots of extra disk space used.  Also doesn't quite work the same
> > > way for metadata pages.
> > > 
> > > If btrfs adopted the ZFS approach, the extent allocator and all higher
> > > layers of the filesystem would have to know about--and skip over--the
> > > parity blocks embedded inside extents.  Making this change would mean
> > > that some btrfs RAID profiles start interacting with stuff like balance
> > > and compression which they currently do not.  It would create a new
> > > block group type and require an incompatible on-disk format change for
> > > both reads and writes.
> > 
> > I thought that a possible solution is to create BG with different
> number of data disks. E.g. supposing to have a raid 6 system with 6
> disks, where 2 are parity disk; we should allocate 3 BG
> > 
> > BG #1: 1 data disk, 2 parity disks
> > BG #2: 2 data disks, 2 parity disks,
> > BG #3: 4 data disks, 2 parity disks
> > 
> > For simplicity, the disk-stripe length is assumed = 4K.
> > 
> > So If you have a write with a length of 4 KB, this should be placed
> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> should be placed in in BG#2, then in BG#1.
> > 
> > This would avoid space wasting, even if the fragmentation will
> increase (but shall the fragmentation matters with the modern solid
> state disks ?).

I don't really see why this would increase fragmentation or waste space.
The extent size is determined before allocation anyway, all that changes
in this proposal is where those small extents ultimately land on the disk.

If anything, it might _reduce_ fragmentation since everything in BG #1
and BG #2 will be of uniform size.

It does solve write hole (one transaction per RAID stripe).

> Also, you're still going to be wasting space, it's just that less space will
> be wasted, and it will be wasted at the chunk level instead of the block
> level, which opens up a whole new set of issues to deal with, most
> significantly that it becomes functionally impossible without brute-force
> search techniques to determine when you will hit the common-case of -ENOSPC
> due to being unable to allocate a new chunk.

Hopefully the allocator only keeps one of each size of small block groups
around at a time.  The allocator can take significant short cuts because
the size of every extent in the small block groups is known (they are
all the same size by definition).

When a small block group fills up, the next one should occupy the
most-empty subset of disks--which is the opposite of the usual RAID5/6
allocation policy.  This will probably lead to "interesting" imbalances
since there are now two allocators on the filesystem with different goals
(though it is no worse than -draid5 -mraid1, and I had no problems with
free space when I was running that).

There will be an increase in the amount of allocated but not usable space,
though, because now the amount of free space depends on how much data
is batched up before fsync() or sync().  Probably best to just not count
any space in the small block groups as 'free' in statvfs terms at all.

There are a lot of variables implied there.  Without running some
simulations I have no idea if this is a good idea or not.

> > Time to time, a re-balance should be performed to empty the BG #1,
> and #2. Otherwise a new BG should be allocated.

That shouldn't be _necessary_ (the filesystem should just allocate
whatever BGs it needs), though it will improve storage efficiency if it
is done.

> > The cost should be comparable to the logging/journaling (each
> data shorter than a full-stripe, has to be written two times); the
> implementation should be quite easy, because already NOW btrfs support
> BG with different set of disks.



signature.asc
Description: PGP signature


Re: [wiki] Please clarify how to check whether barriers are properly implemented in hardware

2018-04-02 Thread Hugo Mills
On Mon, Apr 02, 2018 at 06:03:00PM -0400, Fedja Beader wrote:
> Is there some testing utility for this? Is there a way to extract this/tell 
> with a high enough certainty from datasheets/other material before purchase?

   Given that not implementing barriers is basically a bug in the
hardware [for SATA or SAS], I don't think anyone's going to specify
anything other than "fully suppors barriers" in their datasheets.

   I don't know of a testing tool. It may not be obvious that barriers
aren't being honoured without doing things like power-failure testing.

   Hugo.

> https://btrfs.wiki.kernel.org/index.php/FAQ#How_does_this_happen.3F

-- 
Hugo Mills | "Damn and blast British Telecom!" said Dirk,
hugo@... carfax.org.uk | the words coming easily from force of habit.
http://carfax.org.uk/  |Douglas Adams,
PGP: E2AB1DE4  |   Dirk Gently's Holistic Detective Agency


signature.asc
Description: Digital signature


[wiki] Please clarify how to check whether barriers are properly implemented in hardware

2018-04-02 Thread Fedja Beader
Is there some testing utility for this? Is there a way to extract this/tell 
with a high enough certainty from datasheets/other material before purchase?

https://btrfs.wiki.kernel.org/index.php/FAQ#How_does_this_happen.3F
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix NULL pointer dereference in log_dir_items

2018-04-02 Thread Nikolay Borisov


On  2.04.2018 20:51, Liu Bo wrote:
> On Sun, Apr 1, 2018 at 3:03 AM, Nikolay Borisov  wrote:
>>
>>
>> On 31.03.2018 01:11, Liu Bo wrote:
>>> 0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
>>> returned, path->nodes[0] could be NULL, log_dir_items lacks such a
>>> check for <0 and we may run into a null pointer dereference panic.
>>>
>>> Signed-off-by: Liu Bo 
>> Reviewed-by: Nikolay Borisov 
>>
>> This bug has been present ever since 2.6.29 (e02119d5a7b4 ("Btrfs: Add a
>> write ahead tree log to optimize synchronous operations"))
>>  so this needs a stable tag.
> 
> OK, git describe e02119d5a7b4 doesn't show a tag though.

I just used git tag --contains 
> 
> thanks,
> liubo
>>
>>> ---
>>>  fs/btrfs/tree-log.c | 7 +--
>>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>>> index 4344577..4ee9431 100644
>>> --- a/fs/btrfs/tree-log.c
>>> +++ b/fs/btrfs/tree-log.c
>>> @@ -3518,8 +3518,11 @@ static noinline int log_dir_items(struct 
>>> btrfs_trans_handle *trans,
>>>* from this directory and from this transaction
>>>*/
>>>   ret = btrfs_next_leaf(root, path);
>>> - if (ret == 1) {
>>> - last_offset = (u64)-1;
>>> + if (ret) {
>>> + if (ret == 1)
>>> + last_offset = (u64)-1;
>>> + else
>>> + err = ret;
>>>   goto done;
>>>   }
>>>   btrfs_item_key_to_cpu(path->nodes[0], , path->slots[0]);
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: bail out on error during replay_dir_deletes

2018-04-02 Thread Liu Bo
If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
to bail out, otherwise @ret would be forced to be 0 after 'break;' and
the caller won't be aware of it.

Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous 
operations")
Reviewed-by: Nikolay Borisov 
Signed-off-by: Liu Bo 
---
v2: Add Fixes tag and reviewed-by.~

 fs/btrfs/tree-log.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 4ee9431..11e2c26 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2356,8 +2356,10 @@ static noinline int replay_dir_deletes(struct 
btrfs_trans_handle *trans,
nritems = btrfs_header_nritems(path->nodes[0]);
if (path->slots[0] >= nritems) {
ret = btrfs_next_leaf(root, path);
-   if (ret)
+   if (ret == 1)
break;
+   else if (ret < 0)
+   goto out;
}
btrfs_item_key_to_cpu(path->nodes[0], _key,
  path->slots[0]);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: fix NULL pointer dereference in log_dir_items

2018-04-02 Thread Liu Bo
0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
returned, path->nodes[0] could be NULL, log_dir_items lacks such a
check for <0 and we may run into a null pointer dereference panic.

Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous 
operations")
Reviewed-by: Nikolay Borisov 
Signed-off-by: Liu Bo 
---
v2: Add Fixes tag and reviewed-by.

 fs/btrfs/tree-log.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 4344577..4ee9431 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3518,8 +3518,11 @@ static noinline int log_dir_items(struct 
btrfs_trans_handle *trans,
 * from this directory and from this transaction
 */
ret = btrfs_next_leaf(root, path);
-   if (ret == 1) {
-   last_offset = (u64)-1;
+   if (ret) {
+   if (ret == 1)
+   last_offset = (u64)-1;
+   else
+   err = ret;
goto done;
}
btrfs_item_key_to_cpu(path->nodes[0], , path->slots[0]);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix NULL pointer dereference in log_dir_items

2018-04-02 Thread Liu Bo
On Sun, Apr 1, 2018 at 3:03 AM, Nikolay Borisov  wrote:
>
>
> On 31.03.2018 01:11, Liu Bo wrote:
>> 0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
>> returned, path->nodes[0] could be NULL, log_dir_items lacks such a
>> check for <0 and we may run into a null pointer dereference panic.
>>
>> Signed-off-by: Liu Bo 
> Reviewed-by: Nikolay Borisov 
>
> This bug has been present ever since 2.6.29 (e02119d5a7b4 ("Btrfs: Add a
> write ahead tree log to optimize synchronous operations"))
>  so this needs a stable tag.

OK, git describe e02119d5a7b4 doesn't show a tag though.

thanks,
liubo
>
>> ---
>>  fs/btrfs/tree-log.c | 7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index 4344577..4ee9431 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -3518,8 +3518,11 @@ static noinline int log_dir_items(struct 
>> btrfs_trans_handle *trans,
>>* from this directory and from this transaction
>>*/
>>   ret = btrfs_next_leaf(root, path);
>> - if (ret == 1) {
>> - last_offset = (u64)-1;
>> + if (ret) {
>> + if (ret == 1)
>> + last_offset = (u64)-1;
>> + else
>> + err = ret;
>>   goto done;
>>   }
>>   btrfs_item_key_to_cpu(path->nodes[0], , path->slots[0]);
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of RAID5/6

2018-04-02 Thread Goffredo Baroncelli
On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
[...]
> It is possible to combine writes from a single transaction into full
> RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> Any partially-filled stripe is effectively read-only and the space within
> it is inaccessible until all data within the stripe is overwritten,
> deleted, or relocated by balance.
>
> btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
> update, but that has a significant write magnification effect (and before
> kernel 4.14, non-trivial CPU load as well).
> 
> btrfs could also just allocate the full stripe to an extent, but emit
> only extent ref items for the blocks that are in use.  No fragmentation
> but lots of extra disk space used.  Also doesn't quite work the same
> way for metadata pages.
> 
> If btrfs adopted the ZFS approach, the extent allocator and all higher
> layers of the filesystem would have to know about--and skip over--the
> parity blocks embedded inside extents.  Making this change would mean
> that some btrfs RAID profiles start interacting with stuff like balance
> and compression which they currently do not.  It would create a new
> block group type and require an incompatible on-disk format change for
> both reads and writes.

I thought that a possible solution is to create BG with different number of 
data disks. E.g. supposing to have a raid 6 system with 6 disks, where 2 are 
parity disk; we should allocate 3 BG

BG #1: 1 data disk, 2 parity disks
BG #2: 2 data disks, 2 parity disks,
BG #3: 4 data disks, 2 parity disks

For simplicity, the disk-stripe length is assumed = 4K.

So If you have a write with a length of 4 KB, this should be placed in BG#1; if 
you have a write with a length of 4*3KB, the first 8KB, should be placed in in 
BG#2, then in BG#1.

This would avoid space wasting, even if the fragmentation will increase (but 
shall the fragmentation matters with the modern solid state disks ?).

Time to time, a re-balance should be performed to empty the BG #1, and #2. 
Otherwise a new BG should be allocated.

The cost should be comparable to the logging/journaling (each data shorter than 
a full-stripe, has to be written two times); the implementation should be quite 
easy, because already NOW btrfs support BG with different set of disks.

BR 
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fstests: test btrfs fsync after hole punching with no-holes mode

2018-04-02 Thread Filipe Manana
On Thu, Mar 29, 2018 at 7:55 PM, Jayashree Mohan
 wrote:
> Hi Filipe,
>
> I tried running the xfstest above on kernel 4.15.0 and I haven't been
> able to hit the bug. The xfstest passes clean for me. I compared the
> btrfs-debug-tree in my case with the one attached by Eryu, and I see I
> do not have any xattr as he does. However, for every run of the
> xfstest, the extent tree info that I get from btrfs-debug-tree has
> varying number of items in the first leaf node. (I have attached two
> dump files for your reference.)
>
> I also tried changing the offset at which fpunch is performed to match
> the offset of the last extent in the first leaf of the extent tree -
> however the md5 before and after crash was the same.
>
> Could you give more details on this?

You are getting extents smaller then 256Kb, because writeback is being
triggered by the kernel (likely due to memory pressure).
The second version of the test uses direct IO instead to avoid that.
Thanks.

>
> https://friendpaste.com/1wVAz7hG0U5SgYtZavbJhL
> https://friendpaste.com/1wVAz7hG0U5SgYtZavxALg
>
> Thanks,
> Jayashree Mohan
>
>
>
> On Thu, Mar 29, 2018 at 8:45 AM, Filipe Manana  wrote:
>> On Wed, Mar 28, 2018 at 11:33 AM, Eryu Guan  wrote:
>>> On Wed, Mar 28, 2018 at 09:48:17AM +0100, Filipe Manana wrote:
 On Wed, Mar 28, 2018 at 3:17 AM, Eryu Guan  wrote:
 > On Mon, Mar 26, 2018 at 11:59:21PM +0100, fdman...@kernel.org wrote:
 >> From: Filipe Manana 
 >>
 >> Test that when we have the no-holes mode enabled and a specific metadata
 >> layout, if we punch a hole and fsync the file, at replay time the whole
 >> hole was preserved.
 >>
 >> This issue is fixed by the following btrfs patch for the linux kernel:
 >>
 >>   "Btrfs: fix fsync after hole punching when using no-holes feature"
 >
 > I'd expect a test failure with 4.16-rc6 kernel, as the mentioned fix
 > above is not there. But test always passes for me. Did I miss anything?
 > btrfs-progs version is btrfs-progs-4.11.1-3.fc27.

 It should fail on any kernel, with any btrfs-progs version (which
 should be irrelevant).
 Somehow on your system we are not getting the specific metadata layout
 needed to trigger the issue.

 Can you apply the following patch on top of the test and provide the
 result 159.full file?

 https://friendpaste.com/6xAuLeN4xl1AGjO9Qc5I8L

 So that I can see what metadata layout you are getting.
 Thanks!
>>>
>>> Sure, please see attachment.
>>
>> Thanks!
>> So you indeed get a different metadata layout, and that is because you
>> have SELinux enabled which causes some xattr to be added to the test
>> file (foobar):
>>
>> item 6 key (257 XATTR_ITEM 3817753667) itemoff 64932 itemsize 83
>> location key (0 UNKNOWN.0 0) type XATTR
>> transid 7 data_len 37 name_len 16
>> name: security.selinux
>> data unconfined_u:object_r:unlabeled_t:s0
>>
>> I can make the test work with and without selinux enabled (by punching
>> holes on a few extents that are candidates to be at leaf boundaries).
>> Is it worth it? (I assume most people run the tests without selinux)
>>
>> thanks
>>
>>>
>>> Thanks,
>>> Eryu
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] fstests: test btrfs fsync after hole punching with no-holes mode

2018-04-02 Thread fdmanana
From: Filipe Manana 

Test that when we have the no-holes mode enabled and a specific metadata
layout, if we punch a hole and fsync the file, at replay time the whole
hole was preserved.

This issue is fixed by the following btrfs patch for the linux kernel:

  "Btrfs: fix fsync after hole punching when using no-holes feature"

Signed-off-by: Filipe Manana 
---

V2: Made the test work when selinux is enabled, and made it use direct IO
writes to ensure 256K extents.

 tests/btrfs/159 | 115 
 tests/btrfs/159.out |   9 
 tests/btrfs/group   |   1 +
 3 files changed, 125 insertions(+)
 create mode 100755 tests/btrfs/159
 create mode 100644 tests/btrfs/159.out

diff --git a/tests/btrfs/159 b/tests/btrfs/159
new file mode 100755
index ..eb667692
--- /dev/null
+++ b/tests/btrfs/159
@@ -0,0 +1,115 @@
+#! /bin/bash
+# FSQA Test No. 159
+#
+# Test that when we have the no-holes mode enabled and a specific metadata
+# layout, if we punch a hole and fsync the file, at replay time the whole
+# hole was preserved.
+#
+#---
+#
+# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+_require_xfs_io_command "fpunch"
+_require_odirect
+
+rm -f $seqres.full
+
+run_test()
+{
+   local punch_offset=$1
+
+   # We create the filesystem with a node size of 64Kb because we need to
+   # create a specific metadata layout in order to trigger the bug we are
+   # testing. At the moment the node size can not be smaller then the
+   # system's page size, so given that the largest possible page size is
+   # 64Kb and by default the node size is set to the system's page size
+   # value, we explicitly create a filesystem with a 64Kb node size.
+   _scratch_mkfs -O no-holes -n $((64 * 1024)) >>$seqres.full 2>&1
+   _require_metadata_journaling $SCRATCH_DEV
+   _init_flakey
+   _mount_flakey
+
+   # Create our test file with 832 extents of 256Kb each. Before each
+   # extent, there is a 256Kb hole (except for the first extent, which
+   # starts at offset 0). This creates two leafs in the filesystem tree.
+   # We use direct IO to ensure we get exactly 256K extents (with buffered
+   # IO we can get writeback triggered at any time and therefore get
+   # extents smaller than 256K).
+   for ((i = 0; i <= 831; i++)); do
+   local offset=$((i * 2 * 256 * 1024))
+   $XFS_IO_PROG -f -d -c "pwrite -S 0xab -b 256K $offset 256K" \
+   $SCRATCH_MNT/foobar >/dev/null
+   done
+
+   # Make sure everything done so far is durably persisted.
+   sync
+
+   # Now punch a hole that covers part of the extent at offset
+   # "$punch_offset".
+   # We want to punch a hole that starts in the middle of the last extent
+   # item in the first leaf. On a system without selinux enabled that is
+   # the extent that starts at offset 216530944, while on a system with it
+   # enabled it is the extent that starts at offset 216006656 (because
+   # selinux causes a xattr item to be added to our test file).
+   $XFS_IO_PROG -c "fpunch $((punch_offset + 128 * 1024 - 4000)) 256K" \
+-c "fsync" \
+$SCRATCH_MNT/foobar
+
+   echo "File digest before power failure:"
+   md5sum $SCRATCH_MNT/foobar | _filter_scratch
+   # Simulate a power failure and mount the filesystem to check that
+   # replaying the fsync log/journal succeeds and our test file has the
+   # expected content.
+   _flakey_drop_and_remount
+   echo "File digest after power 

Re: Question, will ls -l eventually be able to show subvolumes?

2018-04-02 Thread Austin S. Hemmelgarn

On 2018-03-30 12:38, Adam Borowski wrote:

On Fri, Mar 30, 2018 at 10:42:10AM +0100, Pete wrote:

I've just notice work going on to make rmdir be able to delete
subvolumes.  Is there an intent to allow ls -l to display directories as
subvolumes?


That's entirely up to coreutils guys.


Expanding on Adam's 100% accurate answer:

The primary issue here is that determining what is a subvolume requires 
some non-trivial work.  You either need to call a specific ioctl (and 
make sure you call it in the right location) which last I checked 
requires root privileges to use, or do some significant guess-work that 
happens to fall apart once you throw explicitly mounted subvolumes into 
the mix.


You can take a look at the `get_subvol_list` function in [1] for an 
example of what that guess-work looks like in Python.  That particular 
function relies on some specific internals of BTRFS that really aren't 
technically part of the ABI (namely that subvolumes all have an inode 
number of 256), but it falls apart if the subvolumes are explicitly 
mounted (because if you don't exclude explicit mounts, you may 
accidentally recurse into another BTRFS volume).


There's also a secondary issue in that the concept of a subvolume isn't 
really universal.  BTRFS and ZFS are the only filesystems supported on 
Linux that have such a concept, but ZFS datasets are a very different 
implementation of the concept that doesn't really make sense to mark 
with `ls`.  Given the level of specificity, it just doesn't make sense 
to mark subvolumes as such in `ls`, just like it doesn't make much sense 
to treat NILFS2 checkpoints any differently from read-only directories.



[1] 
https://raw.githubusercontent.com/Ferroin/btrfs-subv-backup/master/btrfs-subv-backup.py

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] btrfs: Validate child tree block's level and first key

2018-04-02 Thread Qu Wenruo


On 2018年03月28日 23:49, David Sterba wrote:
> On Tue, Mar 27, 2018 at 08:44:19PM +0800, Qu Wenruo wrote:
>> We have several reports about node pointer points to incorrect child
>> tree blocks, which could have even wrong owner and level but still with
>> valid generation and checksum.
>>
>> Although btrfs check could handle it and print error message like:
>> leaf parent key incorrect 60670574592
>>
>> Kernel doesn't have enough check on this type of corruption correctly.
>> At least add such check to read_tree_block() and btrfs_read_buffer(),
>> where we need two new parameters @level and @first_key to verify the
>> child tree block.
>>
>> The new @level check is mandatory and all call sites are already
>> modified to extract expected level from its call chain.
>>
>> While @first_key is optional, the following call sites are skipping such
>> check:
>> 1) Root node/leaf
>>As ROOT_ITEM doesn't contain the first key, skip @first_key check.
>> 2) Direct backref
>>Only parent bytenr and level is known and we need to resolve the key
>>all by ourselves, skip @first_key check.
>>
>> Another note of this verification is, it needs extra info from nodeptr
>> or ROOT_ITEM, so it can't fit into current tree-checker framework, which
>> is limited to node/leaf boundary.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>> changelog:
>> v2:
>>   Make @level check mandatory, suggesed by Jeff and Nikolay.
>>   Change parameter order as @level is now mandatory, put it in front of
>>   @first_key.
>>   Change verify_parent_level() to verify_key_level() to avoid confusion
>>   on the @level parameter.
>>   Add btrfs_error() output for CONFIG_BTRFS_DEBUG to help debugging.
> 
> That's much better overall, thanks. Adding it to next.

Nikolay reported a case where @first_key check seems to cause false alert.
(Although my xfstests check hasn't exposed it yet)

Please discard this patch since it has the possibility to cause false
alert for btrfs core functionality.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature


[PATCH] btrfs-progs: Remove duplicate value-get for data_extents_scrubbed

2018-04-02 Thread Gu Jinxiang
Get data_extents_scrubbed value for twice, since there is only
one data_extents_scrubbed in struct btrfs_scrub_progress, remove
the duplicate one.

Signed-off-by: Gu Jinxiang 
---
 cmds-scrub.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/cmds-scrub.c b/cmds-scrub.c
index dabe7d9a..6b909f20 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -589,8 +589,6 @@ again:
/* fall through */
case 5: /* read key/value pair */
ret = 0;
-   _SCRUB_KVREAD(ret, , data_extents_scrubbed, avail, l,
-   [curr]->p);
_SCRUB_KVREAD(ret, , data_extents_scrubbed, avail, l,
[curr]->p);
_SCRUB_KVREAD(ret, , tree_extents_scrubbed, avail, l,
-- 
2.14.3



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: rename btrfs_get_block_group_info and make it static

2018-04-02 Thread Nikolay Borisov


On  2.04.2018 12:24, Su Yue wrote:
> The function btrfs_get_block_group_info() was introduced by the
> commit 5af3e8cce8b7 ("Btrfs: make filesystem read-only when submitting
>  barrier fails") which used it in disk-io.c.
> 
> However, the function is only called in ioctl.c now.
> Its parameter type btrfs_ioctl_space_info* is only for ioctl.
> 
> So, make it static and rename it to be original name
> get_block_group_info.
> 
> No functional change.
> 
> Signed-off-by: Su Yue 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/ctree.h | 2 --
>  fs/btrfs/ioctl.c | 8 
>  2 files changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0eb55825862a..e69ef37efbc1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3253,8 +3253,6 @@ int btrfs_is_empty_uuid(u8 *uuid);
>  int btrfs_defrag_file(struct inode *inode, struct file *file,
> struct btrfs_ioctl_defrag_range_args *range,
> u64 newer_than, unsigned long max_pages);
> -void btrfs_get_block_group_info(struct list_head *groups_list,
> - struct btrfs_ioctl_space_info *space);
>  void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
>  struct btrfs_ioctl_balance_args *bargs);
>  ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 4ba793f25c3a..184001ffe38c 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4047,8 +4047,8 @@ static long btrfs_ioctl_default_subvol(struct file 
> *file, void __user *argp)
>   return ret;
>  }
>  
> -void btrfs_get_block_group_info(struct list_head *groups_list,
> - struct btrfs_ioctl_space_info *space)
> +static void get_block_group_info(struct list_head *groups_list,
> +  struct btrfs_ioctl_space_info *space)
>  {
>   struct btrfs_block_group_cache *block_group;
>  
> @@ -4164,8 +4164,8 @@ static long btrfs_ioctl_space_info(struct btrfs_fs_info 
> *fs_info,
>   down_read(>groups_sem);
>   for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
>   if (!list_empty(>block_groups[c])) {
> - btrfs_get_block_group_info(
> - >block_groups[c], );
> + get_block_group_info(>block_groups[c],
> +  );
>   memcpy(dest, , sizeof(space));
>   dest++;
>   space_args.total_spaces++;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: rename btrfs_get_block_group_info and make it static

2018-04-02 Thread Su Yue
The function btrfs_get_block_group_info() was introduced by the
commit 5af3e8cce8b7 ("Btrfs: make filesystem read-only when submitting
 barrier fails") which used it in disk-io.c.

However, the function is only called in ioctl.c now.
Its parameter type btrfs_ioctl_space_info* is only for ioctl.

So, make it static and rename it to be original name
get_block_group_info.

No functional change.

Signed-off-by: Su Yue 
---
 fs/btrfs/ctree.h | 2 --
 fs/btrfs/ioctl.c | 8 
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0eb55825862a..e69ef37efbc1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3253,8 +3253,6 @@ int btrfs_is_empty_uuid(u8 *uuid);
 int btrfs_defrag_file(struct inode *inode, struct file *file,
  struct btrfs_ioctl_defrag_range_args *range,
  u64 newer_than, unsigned long max_pages);
-void btrfs_get_block_group_info(struct list_head *groups_list,
-   struct btrfs_ioctl_space_info *space);
 void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
   struct btrfs_ioctl_balance_args *bargs);
 ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4ba793f25c3a..184001ffe38c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4047,8 +4047,8 @@ static long btrfs_ioctl_default_subvol(struct file *file, 
void __user *argp)
return ret;
 }
 
-void btrfs_get_block_group_info(struct list_head *groups_list,
-   struct btrfs_ioctl_space_info *space)
+static void get_block_group_info(struct list_head *groups_list,
+struct btrfs_ioctl_space_info *space)
 {
struct btrfs_block_group_cache *block_group;
 
@@ -4164,8 +4164,8 @@ static long btrfs_ioctl_space_info(struct btrfs_fs_info 
*fs_info,
down_read(>groups_sem);
for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
if (!list_empty(>block_groups[c])) {
-   btrfs_get_block_group_info(
-   >block_groups[c], );
+   get_block_group_info(>block_groups[c],
+);
memcpy(dest, , sizeof(space));
dest++;
space_args.total_spaces++;
-- 
2.16.3



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html