Re: Send/receive snapshot from/between backup

2016-11-17 Thread Piotr Pawłow
On 16.11.2016 21:32, René Bühlmann wrote:
>
>> 1. Change received uuid of S1 on SSH to match S1 uuid on USB.
>> [...]
> 2. Full transfer S1 from SSH to SSH'
>
> 3. Incremental transfer S2 from USB to SSH' (S1 as parent)
>
> 4. Incremental transfer S3 from Origin to SSH' (S2 as parent)
>
> [..]
> Steps 3 and 4 worked as well, and surprisingly I did not even have to
> update the "received uuid". They can't be full transfers, as that would
> have taken months with my bandwidth. How can this be?

Hans van Kranenburg wrote in another post that you can send back an
incremental snapshot, so I guess the algorithm is more complicated.

Since S1 on SSH was faked to be received from USB, and S1 on SSH' is
received from S1 on SSH, and SSH is the same host as SSH', so both
subvolumes are visible to the btrfs receive tool, maybe it can follow the
path and see that S1 on SSH' came from USB? I don't know.

I wonder why it even matters for send / receive where a subvolume
comes from. Why not base it only on the content? Do I have a subvolume
that has the same content as the parent used for the incremental send? If
yes, then use it for receive.

If btrfs used a Merkle tree and had strong checksums, then a subvolume
checksum could be used as the content identifier (does it use a Merkle
tree? The Wikipedia article about Merkle trees lists btrfs, but I can't find
any source).

Otherwise there could be a "fake" content identifier: a random "content
uuid" generated when setting a subvolume read-only, and removed when
setting it read/write (a read-only snapshot would preserve it, and an rw
snapshot would remove it). Send / receive would preserve the "content uuid",
and compare the parent's "content uuid" for incremental sends. Wouldn't that be
easier and more flexible than the current "received uuid" scheme? Am I
missing something?
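
(For reference, a small sketch of the pieces involved today, using hypothetical
mount points: the received UUID that receive currently matches parents against,
and the read-only flag whose toggling the proposed "content uuid" would follow.)

    # inspect the UUIDs that send/receive relies on
    btrfs subvolume show /mnt/ssh/S1 | grep -i uuid
    # the read-only flag; under the proposal, flipping it to rw would drop the content uuid
    btrfs property get -ts /mnt/ssh/S1 ro
    btrfs property set -ts /mnt/ssh/S1 ro false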

Regards


Re: [PATCH] fstests: Introduce check for explicit SHARED extent flag reporting

2016-11-17 Thread Eryu Guan
On Thu, Nov 17, 2016 at 10:06:48AM +0800, Qu Wenruo wrote:
> For filesystems that support reflink, some of them (OK, btrfs again) don't
> split the SHARED flag when reporting extents through fiemap.
> 
> For example:
>   0 4K 8K
>/ File1: Extent 0  \
>   /\
>   |<- On disk Extent-->|
>   |/
>   | File2 /
> Extent: 0
> 
> A fs that supports explicit SHARED extent reporting should report a fiemap like:
> File1: 2 extents
> Extent 0-4K: SHARED
> Extent 4-8K:
> File2: 1 extent
> Extent 0-4K: SHARED
> 
> A fs that doesn't support explicit reporting will report a fiemap like:
> File1: 1 extent
> Extent 0-8K: SHARED
> File2: 1 extent
> Extent 0-4K: SHARED
> 
> Test cases like generic/372 that require explicit reporting will cause a
> false alert on btrfs.
> 
> Add a runtime check for that requirement.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  common/reflink| 44 
>  tests/generic/372 |  1 +
>  2 files changed, 45 insertions(+)
> 
> diff --git a/common/reflink b/common/reflink
> index 8b34046..9ada2e8 100644
> --- a/common/reflink
> +++ b/common/reflink
> @@ -78,6 +78,50 @@ _require_scratch_reflink()
>   _scratch_unmount
>  }
>  
> +# this test requires scratch fs to report explicit SHARED flag
> +# e.g.
> +#   0 4K 8K
> +#/ File1: Extent 0  \
> +#   /\
> +#   |<- On disk Extent-->|
> +#   |/
> +#   | File2 /
> +# Extent: 0
> +# Fs supports explicit SHARED extent reporting should report fiemap like:
> +# File1: 2 extents
> +# Extent 0-4K: SHARED
> +# Extent 4-8K:
> +# File2: 1 extents
> +# Extent 0-4K: SHARED
> +#
> +# Fs doesn't support explicit reporting will report fiemap like:
> +# File1: 1 extent
> +# Extent 0-8K: SHARED
> +# File2: 1 extent
> +# Extent 0-4K: SHARED
> +_require_scratch_explicit_shared_extents()
> +{
> + _require_scratch
> + _require_fiemap
> + _require_scratch_reflink
> + _require_xfs_io_command "reflink"
> + local nr_extents
> +
> + _scratch_mkfs > /dev/null
> + _scratch_mount
> +
> + _pwrite_byte 0x61 0 128k $SCRATCH_MNT/file1
> + _reflink_range $SCRATCH_MNT/file1 0 $SCRATCH_MNT/file2 0 64k

This looks fine to me, except that we need to redirect the stdout of the above
two commands to /dev/null, otherwise generic/372 fails on XFS because of the
extra xfs_io output.
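
(A minimal sketch of that fixup, keeping the helpers from the patch and only
adding the redirection:)

    _pwrite_byte 0x61 0 128k $SCRATCH_MNT/file1 > /dev/null
    _reflink_range $SCRATCH_MNT/file1 0 $SCRATCH_MNT/file2 0 64k > /dev/null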

I can fix it at commit time if there are no other comments.

Thanks,
Eryu

> +
> + _scratch_cycle_mount
> +
> + nr_extents=$(_count_extents $SCRATCH_MNT/file1)
> + if [ $nr_extents -eq 1 ]; then
> + _notrun "Explicit SHARED flag reporting not support by 
> filesystem type: $FSTYP"
> + fi
> + _scratch_unmount
> +}
> +
>  # this test requires the test fs support dedupe...
>  _require_test_dedupe()
>  {
> diff --git a/tests/generic/372 b/tests/generic/372
> index 31dff20..51a3eca 100755
> --- a/tests/generic/372
> +++ b/tests/generic/372
> @@ -47,6 +47,7 @@ _supported_os Linux
>  _supported_fs generic
>  _require_scratch_reflink
>  _require_fiemap
> +_require_scratch_explicit_shared_extents
>  
>  echo "Format and mount"
>  _scratch_mkfs > $seqres.full 2>&1
> -- 
> 2.7.4
> 
> 
> 


Re: [PATCH 2/3] btrfs-progs: test: expand size of test device of fsck-tests 013

2016-11-17 Thread Qu Wenruo



On 11/18/2016 01:47 PM, Tsutomu Itoh wrote:

In my test environment, the size of /lib/modules/`uname -r`/* is
larger than 1GB, so the test data cannot be copied to the test device.
Therefore we need to expand the size of the test device.


IMHO the correct fix is to enhance populate_fs() to fill the fs up to a
given size, rather than relying on the unreliable copy.
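
(A rough sketch of what such a fill-to-size helper could look like; the name
and the sizing policy are illustrative, not the real populate_fs():)

    fill_fs_to_size() {
            local mnt=$1 target_mb=$2 i=0
            while [ "$i" -lt "$target_mb" ]; do
                    dd if=/dev/zero of="$mnt/filler.$i" bs=1M count=1 status=none || break
                    i=$((i + 1))
            done
            sync
    }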


Thanks,
Qu



# make test-fsck
[TEST]   fsck-tests.sh
[TEST/fsck]   013-extent-tree-rebuild
failed: cp -aR /lib/modules/4.9.0-rc5/ /test/btrfs-progs/tests/mnt
test failed for case 013-extent-tree-rebuild
Makefile:272: recipe for target 'test-fsck' failed
make: *** [test-fsck] Error 1
#

Signed-off-by: Tsutomu Itoh 
---
 tests/fsck-tests/013-extent-tree-rebuild/test.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/fsck-tests/013-extent-tree-rebuild/test.sh 
b/tests/fsck-tests/013-extent-tree-rebuild/test.sh
index f678e29..92c5d04 100755
--- a/tests/fsck-tests/013-extent-tree-rebuild/test.sh
+++ b/tests/fsck-tests/013-extent-tree-rebuild/test.sh
@@ -7,7 +7,7 @@ check_prereq mkfs.btrfs
 check_prereq btrfs

 setup_root_helper
-prepare_test_dev 1G
+prepare_test_dev 2G

 # test whether fsck can rebuild a corrupted extent tree
 test_extent_tree_rebuild()




Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Qu Wenruo



On 11/18/2016 12:43 PM, Zygo Blaxell wrote:

On Fri, Nov 18, 2016 at 10:42:23AM +0800, Qu Wenruo wrote:



At 11/18/2016 09:56 AM, Hugo Mills wrote:

On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote:



At 11/18/2016 07:13 AM, Zygo Blaxell wrote:

On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:

Fix the so-called famous RAID5/6 scrub error.

Thanks to Goffredo Baroncelli for reporting the bug and bringing it to our
attention.
(Yes, without the Phoronix report on this,
https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
I would never have been aware of it.)


If you're hearing about btrfs RAID5 bugs for the first time through
Phoronix, then your testing coverage is *clearly* inadequate.


I'm not fixing everything, I'm just focusing on the exact bug
reported by Goffredo Baroncelli.

Although it seems that the bug he reported is in fact two bugs:
one is the race condition I'm fixing, the other is that recovery restores
the data correctly but screws up the parity.

I just don't understand why you always want to fix everything in one step.


  Fix the important, fundamental things first, and the others
later. This, from my understanding of Zygo's comments, appears to be
one of the others.

  It's papering over the missing bricks in the wall instead of
chipping out the mortar and putting new bricks in. It may need to be
fixed, but it's not the fundamental "OMG, everything's totally broken"
problem. If anything, it's only a serious problem *because* the other
thing (write hole) is still there.

  It just seems like a piece of mis-prioritised effort.


It seems that we have different standards on priority.


My concern isn't priority.  Easier bugs often get fixed first.  That's
just the way Linux development works.

I am very concerned by articles like this:

http://phoronix.com/scan.php?page=news_item=Btrfs-RAID5-RAID6-Fixed

with headlines like "btrfs RAID5/RAID6 support is finally fixed" when
that's very much not the case.  Only one bug has been removed for the
key use case that makes RAID5 interesting, and it's just the first of
many that still remain in the path of a user trying to recover from a
normal disk failure.

Admittedly this is Michael's (Phoronix's) problem more than Qu's, but
it's important to always be clear and _complete_ when stating bug status
because people quote statements out of context.  When the article quoted
the text

"it's not a timed bomb buried deeply into the RAID5/6 code,
but a race condition in scrub recovery code"

the commenters on Phoronix are clearly interpreting this to mean "famous
RAID5/6 scrub error" had been fixed *and* the issue reported by Goffredo
was the time bomb issue.  It's more accurate to say something like

"Goffredo's issue is not the time bomb buried deeply in the
RAID5/6 code, but a separate issue caused by a race condition
in scrub recovery code"

Reading the Phoronix article, one might imagine RAID5 is now working
as well as RAID1 on btrfs.  To be clear, it's not--although the gap
is now significantly narrower.


For me, if some function can't work reliably in a very basic/minimal
environment, then it's a high-priority bug.

In this case, we have a very minimal setup, with only 128K of data spread
over a 3-device RAID5, a data stripe fully corrupted, and nothing else
interfering.
Scrub can't return the correct csum error count and even causes a false
unrecoverable error, so it's a high-priority thing.



If the problem involves too many steps, like removing devices, degraded mode,
fsstress and some time, then it's not that high a priority unless one pins down
the root cause to, for example, degraded mode itself with a specially sequenced
set of operations.


There are multiple bugs in the stress + remove device case.  Some are
quite easy to isolate.  They range in difficulty from simple BUG_ON
instead of error returns to finally solving the RMW update problem.

Run the test, choose any of the bugs that occur to work on, repeat until
the test stops finding new bugs for a while.  There are currently several
bugs to choose from with various levels of difficulty to fix them, and you
should hit the first level of bugs in a matter of hours if not minutes.

Using this method, you would have discovered Goffredo's bug years ago.
Instead, you only discovered it after Phoronix quoted the conclusion
of an investigation that started because of problems I reported here on
this list when I discovered half a dozen of btrfs's critical RAID5 bugs
from analyzing *one* disk failure event.


Nope, I don't want to touch anything related to degraded mode while any
fundamental function is broken, and the bug Goffredo reported has
nothing to do with any of the steps you mentioned.


How I found the problem has no bearing on the bug fix.
Yeah, I found it through Phoronix or whatever other source; does that have
anything to do with the fix?


You can continue your test or fix or report. But that has nothing 
related to the 

Re: [PATCH] fstests: Introduce check for explicit SHARED extent flag reporting

2016-11-17 Thread Qu Wenruo



On 11/18/2016 01:54 PM, Darrick J. Wong wrote:

On Thu, Nov 17, 2016 at 10:06:48AM +0800, Qu Wenruo wrote:

For filesystems that support reflink, some of them (OK, btrfs again) don't
split the SHARED flag when reporting extents through fiemap.

For example:
  0 4K 8K
   / File1: Extent 0  \
  /\
  |<- On disk Extent-->|
  |/
  | File2 /
Extent: 0

A fs that supports explicit SHARED extent reporting should report a fiemap like:
File1: 2 extents
Extent 0-4K: SHARED
Extent 4-8K:
File2: 1 extent
Extent 0-4K: SHARED

A fs that doesn't support explicit reporting will report a fiemap like:
File1: 1 extent
Extent 0-8K: SHARED
File2: 1 extent
Extent 0-4K: SHARED


How difficult /would/ it be to fix btrfs, anyway?


Not sure.

Since the file extent backrefs are not well designed in btrfs, we must
check the fs tree to determine the correct file offset before splitting
the fiemap result.

And that may cause an extra performance problem.

So I would like to leave it as it is for now.

Anyway, even if btrfs reported the correct result, punching a hole in a
range without the SHARED flag won't free any space...

So the current behavior has its point.
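
(For anyone checking the reported flags by hand, xfs_io can print them; the
path below is hypothetical, and FIEMAP_EXTENT_SHARED shows up as 0x2000 in the
FLAGS column:)

    xfs_io -r -c "fiemap -v" /mnt/scratch/file1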

Thanks,
Qu



--D



Test cases like generic/372 that require explicit reporting will cause a false
alert on btrfs.

Add a runtime check for that requirement.

Signed-off-by: Qu Wenruo 
---
 common/reflink| 44 
 tests/generic/372 |  1 +
 2 files changed, 45 insertions(+)

diff --git a/common/reflink b/common/reflink
index 8b34046..9ada2e8 100644
--- a/common/reflink
+++ b/common/reflink
@@ -78,6 +78,50 @@ _require_scratch_reflink()
_scratch_unmount
 }

+# this test requires scratch fs to report explicit SHARED flag
+# e.g.
+#   0 4K 8K
+#/ File1: Extent 0  \
+#   /\
+#   |<- On disk Extent-->|
+#   |/
+#   | File2 /
+# Extent: 0
+# Fs supports explicit SHARED extent reporting should report fiemap like:
+# File1: 2 extents
+# Extent 0-4K: SHARED
+# Extent 4-8K:
+# File2: 1 extents
+# Extent 0-4K: SHARED
+#
+# Fs doesn't support explicit reporting will report fiemap like:
+# File1: 1 extent
+# Extent 0-8K: SHARED
+# File2: 1 extent
+# Extent 0-4K: SHARED
+_require_scratch_explicit_shared_extents()
+{
+   _require_scratch
+   _require_fiemap
+   _require_scratch_reflink
+   _require_xfs_io_command "reflink"
+   local nr_extents
+
+   _scratch_mkfs > /dev/null
+   _scratch_mount
+
+   _pwrite_byte 0x61 0 128k $SCRATCH_MNT/file1
+   _reflink_range $SCRATCH_MNT/file1 0 $SCRATCH_MNT/file2 0 64k
+
+   _scratch_cycle_mount
+
+   nr_extents=$(_count_extents $SCRATCH_MNT/file1)
+   if [ $nr_extents -eq 1 ]; then
+   _notrun "Explicit SHARED flag reporting not support by filesystem 
type: $FSTYP"
+   fi
+   _scratch_unmount
+}
+
 # this test requires the test fs support dedupe...
 _require_test_dedupe()
 {
diff --git a/tests/generic/372 b/tests/generic/372
index 31dff20..51a3eca 100755
--- a/tests/generic/372
+++ b/tests/generic/372
@@ -47,6 +47,7 @@ _supported_os Linux
 _supported_fs generic
 _require_scratch_reflink
 _require_fiemap
+_require_scratch_explicit_shared_extents

 echo "Format and mount"
 _scratch_mkfs > $seqres.full 2>&1
--
2.7.4





Re: [PATCH] fstests: Introduce check for explicit SHARED extent flag reporting

2016-11-17 Thread Darrick J. Wong
On Thu, Nov 17, 2016 at 10:06:48AM +0800, Qu Wenruo wrote:
> For filesystems that support reflink, some of them (OK, btrfs again) don't
> split the SHARED flag when reporting extents through fiemap.
> 
> For example:
>   0 4K 8K
>/ File1: Extent 0  \
>   /\
>   |<- On disk Extent-->|
>   |/
>   | File2 /
> Extent: 0
> 
> A fs that supports explicit SHARED extent reporting should report a fiemap like:
> File1: 2 extents
> Extent 0-4K: SHARED
> Extent 4-8K:
> File2: 1 extent
> Extent 0-4K: SHARED
> 
> A fs that doesn't support explicit reporting will report a fiemap like:
> File1: 1 extent
> Extent 0-8K: SHARED
> File2: 1 extent
> Extent 0-4K: SHARED

How difficult /would/ it be to fix btrfs, anyway?

--D

> 
> Test cases like generic/372 that require explicit reporting will cause a
> false alert on btrfs.
> 
> Add a runtime check for that requirement.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  common/reflink| 44 
>  tests/generic/372 |  1 +
>  2 files changed, 45 insertions(+)
> 
> diff --git a/common/reflink b/common/reflink
> index 8b34046..9ada2e8 100644
> --- a/common/reflink
> +++ b/common/reflink
> @@ -78,6 +78,50 @@ _require_scratch_reflink()
>   _scratch_unmount
>  }
>  
> +# this test requires scratch fs to report explicit SHARED flag
> +# e.g.
> +#   0 4K 8K
> +#/ File1: Extent 0  \
> +#   /\
> +#   |<- On disk Extent-->|
> +#   |/
> +#   | File2 /
> +# Extent: 0
> +# Fs supports explicit SHARED extent reporting should report fiemap like:
> +# File1: 2 extents
> +# Extent 0-4K: SHARED
> +# Extent 4-8K:
> +# File2: 1 extents
> +# Extent 0-4K: SHARED
> +#
> +# Fs doesn't support explicit reporting will report fiemap like:
> +# File1: 1 extent
> +# Extent 0-8K: SHARED
> +# File2: 1 extent
> +# Extent 0-4K: SHARED
> +_require_scratch_explicit_shared_extents()
> +{
> + _require_scratch
> + _require_fiemap
> + _require_scratch_reflink
> + _require_xfs_io_command "reflink"
> + local nr_extents
> +
> + _scratch_mkfs > /dev/null
> + _scratch_mount
> +
> + _pwrite_byte 0x61 0 128k $SCRATCH_MNT/file1
> + _reflink_range $SCRATCH_MNT/file1 0 $SCRATCH_MNT/file2 0 64k
> +
> + _scratch_cycle_mount
> +
> + nr_extents=$(_count_extents $SCRATCH_MNT/file1)
> + if [ $nr_extents -eq 1 ]; then
> + _notrun "Explicit SHARED flag reporting not support by 
> filesystem type: $FSTYP"
> + fi
> + _scratch_unmount
> +}
> +
>  # this test requires the test fs support dedupe...
>  _require_test_dedupe()
>  {
> diff --git a/tests/generic/372 b/tests/generic/372
> index 31dff20..51a3eca 100755
> --- a/tests/generic/372
> +++ b/tests/generic/372
> @@ -47,6 +47,7 @@ _supported_os Linux
>  _supported_fs generic
>  _require_scratch_reflink
>  _require_fiemap
> +_require_scratch_explicit_shared_extents
>  
>  echo "Format and mount"
>  _scratch_mkfs > $seqres.full 2>&1
> -- 
> 2.7.4
> 
> 
> 


[PATCH 3/3] btrfs-progs: test: fix convert-tests 004 failure

2016-11-17 Thread Tsutomu Itoh
convert-tests 004 failed because the argument to check_all_images
is not specified.

# make test-convert
[TEST]   convert-tests.sh
[TEST/conv]   004-ext2-backup-superblock-ranges
find: '': No such file or directory
#

Signed-off-by: Tsutomu Itoh 
---
 tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh 
b/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh
index c56650b..d764ed8 100755
--- a/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh
+++ b/tests/convert-tests/004-ext2-backup-superblock-ranges/test.sh
@@ -39,4 +39,4 @@ function check_image() {
rm -f $TEST_DEV
 }
 
-check_all_images
+check_all_images `pwd`
-- 
2.9.3


[PATCH 2/3] btrfs-progs: test: expand size of test device of fsck-tests 013

2016-11-17 Thread Tsutomu Itoh
In my test environment, the size of /lib/modules/`uname -r`/* is
larger than 1GB, so the test data cannot be copied to the test device.
Therefore we need to expand the size of the test device.

# make test-fsck
[TEST]   fsck-tests.sh
[TEST/fsck]   013-extent-tree-rebuild
failed: cp -aR /lib/modules/4.9.0-rc5/ /test/btrfs-progs/tests/mnt
test failed for case 013-extent-tree-rebuild
Makefile:272: recipe for target 'test-fsck' failed
make: *** [test-fsck] Error 1
#

Signed-off-by: Tsutomu Itoh 
---
 tests/fsck-tests/013-extent-tree-rebuild/test.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/fsck-tests/013-extent-tree-rebuild/test.sh 
b/tests/fsck-tests/013-extent-tree-rebuild/test.sh
index f678e29..92c5d04 100755
--- a/tests/fsck-tests/013-extent-tree-rebuild/test.sh
+++ b/tests/fsck-tests/013-extent-tree-rebuild/test.sh
@@ -7,7 +7,7 @@ check_prereq mkfs.btrfs
 check_prereq btrfs
 
 setup_root_helper
-prepare_test_dev 1G
+prepare_test_dev 2G
 
 # test whether fsck can rebuild a corrupted extent tree
 test_extent_tree_rebuild()
-- 
2.9.3


[PATCH 1/3] btrfs-progs: test: fix error of test target of Makefile

2016-11-17 Thread Tsutomu Itoh
Add test-cli to the test target of the Makefile.

Signed-off-by: Tsutomu Itoh 
---
 Makefile.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile.in b/Makefile.in
index 4bd3e63..6c78841 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -298,7 +298,7 @@ test-inst: all
$(MAKE) DESTDIR=$$tmpdest install && \
$(RM) -rf -- $$tmpdest
 
-test: test-fsck test-mkfs test-convert test-misc test-fuzz
+test: test-fsck test-mkfs test-convert test-misc test-fuzz test-cli
 
 #
 # NOTE: For static compiles, you need to have all the required libs
-- 
2.9.3


Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-17 Thread Christoph Hellwig
On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> So now we have 3 options:
> 
> a) Apply these patches as-is.
> b) Fix XFS to both return the actual bytes_deduped and cap the length
>for the EOF case. Do the same for Btrfs.
> c) Make XFS consistent with the existing Btrfs behavior.
> 
> I'm starting to lean towards option c after writing this cover letter,
> but I might as well send these out and get a second opinion.

I agree - btrfs was there first, and there is no fatal flaw in the ABI,
so let's adjust XFS to it.
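
(For reference, a minimal sketch of a caller relying on the existing btrfs
convention, where a source length of 0 means "up to EOF"; the paths are
hypothetical and the two files are assumed to already have identical content:)

    # run against the destination file; args: src_file src_offset dst_offset length
    xfs_io -c "dedupe /mnt/btrfs/a 0 0 0" /mnt/btrfs/b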


Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Zygo Blaxell
On Fri, Nov 18, 2016 at 10:42:23AM +0800, Qu Wenruo wrote:
> 
> 
> At 11/18/2016 09:56 AM, Hugo Mills wrote:
> >On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>At 11/18/2016 07:13 AM, Zygo Blaxell wrote:
> >>>On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> Fix the so-called famous RAID5/6 scrub error.
> 
> Thanks to Goffredo Baroncelli for reporting the bug and bringing it to our
> attention.
> (Yes, without the Phoronix report on this,
> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> I would never have been aware of it.)
> >>>
> >>>If you're hearing about btrfs RAID5 bugs for the first time through
> >>>Phoronix, then your testing coverage is *clearly* inadequate.
> >>
> >>I'm not fixing everything, I'm just focusing on the exact one bug
> >>reported by Goffredo Baroncelli.
> >>
> >>Although it seems that, the bug reported by him is in fact two bugs.
> >>One is race condition I'm fixing, another one is that recovery is
> >>recovering data correctly, but screwing up parity.
> >>
> >>I just don't understand why you always want to fix everything in one step.
> >
> >   Fix the important, fundamental things first, and the others
> >later. This, from my understanding of Zygo's comments, appears to be
> >one of the others.
> >
> >   It's papering over the missing bricks in the wall instead of
> >chipping out the mortar and putting new bricks in. It may need to be
> >fixed, but it's not the fundamental "OMG, everything's totally broken"
> >problem. If anything, it's only a serious problem *because* the other
> >thing (write hole) is still there.
> >
> >   It just seems like a piece of mis-prioritised effort.
> 
> It seems that we have different standards on priority.

My concern isn't priority.  Easier bugs often get fixed first.  That's
just the way Linux development works.

I am very concerned by articles like this:

http://phoronix.com/scan.php?page=news_item=Btrfs-RAID5-RAID6-Fixed

with headlines like "btrfs RAID5/RAID6 support is finally fixed" when
that's very much not the case.  Only one bug has been removed for the
key use case that makes RAID5 interesting, and it's just the first of
many that still remain in the path of a user trying to recover from a
normal disk failure.

Admittedly this is Michael's (Phoronix's) problem more than Qu's, but
it's important to always be clear and _complete_ when stating bug status
because people quote statements out of context.  When the article quoted
the text

"it's not a timed bomb buried deeply into the RAID5/6 code,
but a race condition in scrub recovery code"

the commenters on Phoronix are clearly interpreting this to mean "famous
RAID5/6 scrub error" had been fixed *and* the issue reported by Goffredo
was the time bomb issue.  It's more accurate to say something like

"Goffredo's issue is not the time bomb buried deeply in the
RAID5/6 code, but a separate issue caused by a race condition
in scrub recovery code"

Reading the Phoronix article, one might imagine RAID5 is now working
as well as RAID1 on btrfs.  To be clear, it's not--although the gap
is now significantly narrower.

> For me, if some function can't work reliably in a very basic/minimal
> environment, then it's a high-priority bug.
> 
> In this case, we have a very minimal setup, with only 128K of data spread
> over a 3-device RAID5, a data stripe fully corrupted, and nothing else
> interfering.
> Scrub can't return the correct csum error count and even causes a false
> unrecoverable error, so it's a high-priority thing.

> If the problem involves too many steps, like removing devices, degraded mode,
> fsstress and some time, then it's not that high a priority unless one pins down
> the root cause to, for example, degraded mode itself with a specially sequenced
> set of operations.

There are multiple bugs in the stress + remove device case.  Some are
quite easy to isolate.  They range in difficulty from simple BUG_ON
instead of error returns to finally solving the RMW update problem.

Run the test, choose any of the bugs that occur to work on, repeat until
the test stops finding new bugs for a while.  There are currently several
bugs to choose from with various levels of difficulty to fix them, and you
should hit the first level of bugs in a matter of hours if not minutes.

Using this method, you would have discovered Goffredo's bug years ago.
Instead, you only discovered it after Phoronix quoted the conclusion
of an investigation that started because of problems I reported here on
this list when I discovered half a dozen of btrfs's critical RAID5 bugs
from analyzing *one* disk failure event.

> Yes, in fact sometimes our developers fix such bugs at high priority,
> but that's because other parts have no such obvious problems.
> But if we have an obvious and easy-to-reproduce bug, then we should start
> from that.

To be able to use a RAID5 in production it must be possible to recover
from 

Re: [RFC PATCH 1/2] Btrfs: refactor btrfs_extent_same() slightly

2016-11-17 Thread Qu Wenruo



At 11/18/2016 08:07 AM, Omar Sandoval wrote:

From: Omar Sandoval 

This just makes the following patch cleaner.

Looks good to me.

It is just a good refactoring, making the code shorter and easier to read.

And it doesn't affect the len == 0 behavior, so no matter which choice we
make, it can be applied independently.

Reviewed-by: Qu Wenruo 

Thanks,
Qu


Signed-off-by: Omar Sandoval 
---
 fs/btrfs/ioctl.c | 33 -
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7acbd2c..e23f945 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3124,26 +3124,27 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
int ret;
u64 len = olen;
struct cmp_pages cmp;
-   int same_inode = 0;
+   bool same_inode = (src == dst);
u64 same_lock_start = 0;
u64 same_lock_len = 0;

-   if (src == dst)
-   same_inode = 1;
-
if (len == 0)
return 0;

-   if (same_inode) {
+   if (same_inode)
inode_lock(src);
+   else
+   btrfs_double_inode_lock(src, dst);

-   ret = extent_same_check_offsets(src, loff, , olen);
-   if (ret)
-   goto out_unlock;
-   ret = extent_same_check_offsets(src, dst_loff, , olen);
-   if (ret)
-   goto out_unlock;
+   ret = extent_same_check_offsets(src, loff, , olen);
+   if (ret)
+   goto out_unlock;

+   ret = extent_same_check_offsets(dst, dst_loff, , olen);
+   if (ret)
+   goto out_unlock;
+
+   if (same_inode) {
/*
 * Single inode case wants the same checks, except we
 * don't want our length pushed out past i_size as
@@ -3171,16 +3172,6 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,

same_lock_start = min_t(u64, loff, dst_loff);
same_lock_len = max_t(u64, loff, dst_loff) + len - 
same_lock_start;
-   } else {
-   btrfs_double_inode_lock(src, dst);
-
-   ret = extent_same_check_offsets(src, loff, , olen);
-   if (ret)
-   goto out_unlock;
-
-   ret = extent_same_check_offsets(dst, dst_loff, , olen);
-   if (ret)
-   goto out_unlock;
}

/* don't make the dst file partly checksummed */






Re: [PATCH] generic: test concurrent non-overlapping direct I/O on the same extents

2016-11-17 Thread Eryu Guan
On Thu, Nov 17, 2016 at 10:14:12AM -0800, Omar Sandoval wrote:
> On Thu, Nov 17, 2016 at 01:26:11PM +0800, Eryu Guan wrote:
> > On Wed, Nov 16, 2016 at 04:29:34PM -0800, Omar Sandoval wrote:
> > > From: Omar Sandoval 
> > > 
> > > There have been a couple of logic bugs in `btrfs_get_extent()` which
> > > could lead to spurious -EEXIST errors from read or write. This test
> > > exercises those conditions by having two threads race to add an extent
> > > to the extent map.
> > > 
> > > This is fixed by Linux commit 8dff9c853410 ("Btrfs: deal with duplciates
> > > during extent_map insertion in btrfs_get_extent") and the patch "Btrfs:
> > > deal with existing encompassing extent map in btrfs_get_extent()"
> > > (http://marc.info/?l=linux-btrfs=147873402311143=2).
> > > 
> > > Although the bug is Btrfs-specific, nothing about the test is.
> > > 
> > > Signed-off-by: Omar Sandoval 
> > > ---
> > [snip]
> > > +# real QA test starts here
> > > +
> > > +_supported_fs generic
> > > +_supported_os Linux
> > > +_require_test
> > > +_require_xfs_io_command "falloc"
> > > +_require_test_program "dio-interleaved"
> > > +
> > > +extent_size="$(($(stat -f -c '%S' "$TEST_DIR") * 2))"
> > 
> > There's a helper to get fs block size: "get_block_size".
> > 
> > > +num_extents=1024
> > > +testfile="$TEST_DIR/$$-testfile"
> > > +
> > > +truncate -s 0 "$testfile"
> > 
> > I prefer using xfs_io to do the truncate, 
> > 
> > $XFS_IO_PROG -fc "truncate 0" "$testfile"
> > 
> > Because in rare cases truncate(1) may be unavailable, e.g. on RHEL5.
> > Usually it's not a big issue, but xfs_io works all the time; we have a
> > better way, so why not :)
> > 
> > > +for ((off = 0; off < num_extents * extent_size; off += extent_size)); do
> > > + xfs_io -c "falloc $off $extent_size" "$testfile"
> > 
> > Use $XFS_IO_PROG not bare xfs_io.
> > 
> > I can fix all the tiny issues at commit time.
> > 
> > Thanks,
> > Eryu
> 
> Thank you, Eryu, that's all okay with me. I also had a typo here:
> 
> diff --git a/src/dio-interleaved.c b/src/dio-interleaved.c
> index 831a191..6b04c99 100644
> --- a/src/dio-interleaved.c
> +++ b/src/dio-interleaved.c
> @@ -58,7 +58,7 @@ int main(int argc, char **argv)
>   int fd;
>  
>   if (argc != 4) {
> - fprintf(stderr, "usage: %s SECTORSIZE NUM_EXTENTS PATH\n",
> + fprintf(stderr, "usage: %s EXTENT_SIZE NUM_EXTENTS PATH\n",
>   argv[0]);
>   return EXIT_FAILURE;
>   }
> 
> 
> Feel free to fix this at commit time, too, or I can send a v2 if you
> prefer.

Thanks! I'll fix them all.

Eryu
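
(Putting the review suggestions together, the affected lines of the test would
end up looking roughly like this; a sketch only, using the helper names
mentioned above:)

    extent_size=$(( $(get_block_size "$TEST_DIR") * 2 ))
    $XFS_IO_PROG -fc "truncate 0" "$testfile"
    for ((off = 0; off < num_extents * extent_size; off += extent_size)); do
            $XFS_IO_PROG -c "falloc $off $extent_size" "$testfile"
    done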



Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Qu Wenruo



At 11/18/2016 09:56 AM, Hugo Mills wrote:

On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote:



At 11/18/2016 07:13 AM, Zygo Blaxell wrote:

On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:

Fix the so-called famous RAID5/6 scrub error.

Thanks to Goffredo Baroncelli for reporting the bug and bringing it to our
attention.
(Yes, without the Phoronix report on this,
https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
I would never have been aware of it.)


If you're hearing about btrfs RAID5 bugs for the first time through
Phoronix, then your testing coverage is *clearly* inadequate.


I'm not fixing everything, I'm just focusing on the exact bug
reported by Goffredo Baroncelli.

Although it seems that the bug he reported is in fact two bugs:
one is the race condition I'm fixing, the other is that recovery restores
the data correctly but screws up the parity.

I just don't understand why you always want to fix everything in one step.


   Fix the important, fundamental things first, and the others
later. This, from my understanding of Zygo's comments, appears to be
one of the others.

   It's papering over the missing bricks in the wall instead of
chipping out the mortar and putting new bricks in. It may need to be
fixed, but it's not the fundamental "OMG, everything's totally broken"
problem. If anything, it's only a serious problem *because* the other
thing (write hole) is still there.

   It just seems like a piece of mis-prioritised effort.


It seems that we have different standards on priority.

For me, if some function can't work reliably in a very basic/minimal
environment, then it's a high-priority bug.


In this case, we have a very minimal setup, with only 128K of data spread
over a 3-device RAID5, a data stripe fully corrupted, and nothing else
interfering.
Scrub can't return the correct csum error count and even causes a false
unrecoverable error, so it's a high-priority thing.


If the problem involves too many steps, like removing devices, degraded
mode, fsstress and some time, then it's not that high a priority unless one
pins down the root cause to, for example, degraded mode itself with a
specially sequenced set of operations.


Yes, in fact sometimes our developers fix such bugs at high priority,
but that's because other parts have no such obvious problems.
But if we have an obvious and easy-to-reproduce bug, then we should start
from that.


So I don't consider the problem Zygo described to be a high-priority problem.




Fill up a RAID5 array, start a FS stress test, pull a drive out while
that's running, let the FS stress test run for another hour, then try
to replace or delete the missing device.  If there are any crashes,
corruptions, or EIO during any part of this process (assuming all the
remaining disks are healthy), then btrfs RAID5 is still broken, and
you've found another bug to fix.


Then it will be another bug to fix, not the bug I'm fixing.
Unless you can show that whatever that bug is, it is related to this fix, I
don't see how that helps.



The fact that so many problems in btrfs can still be found this way
indicates to me that nobody is doing this basic level of testing
(or if they are, they're not doing anything about the results).


Unlike many of us(including myself) assumed, it's not a timed bomb buried
deeply into the RAID5/6 code, but a race condition in scrub recovery
code.


I don't see how this patch fixes the write hole issue at the core of
btrfs RAID56.  It just makes the thin layer of bugs over that issue a
little thinner.  There's still the metadata RMW update timebomb at the
bottom of the bug pile that can't be fixed by scrub (the filesystem is
unrecoverably damaged when the bomb goes off, so scrub isn't possible).


Not "the core of btrfs RAID56", but "the core of all RAID56".

And, this is just another unrelated problem, I didn't even want to
repeat what I've written.


   That's the one that needs fixing *first*, though. (IMO, YMMV,
Contents may have settled in transit).


And I forgot to mention, we don't even have an idea of how to fix the
generic RAID56 RMW problem, even for mdadm RAID56.


BTW, for mdadm RAID56 they just scrub the whole array after every power
loss; we could just tell the user to do that, and btrfs csums can help
scrub do a better job than mdadm RAID56.


Although we must first make sure that btrfs scrub on RAID56 won't cause false
alerts (this patchset) and won't screw up the parity (which I'll fix soon).


Thanks,
Qu



   Hugo.


Thanks,
Qu




The problem was not found earlier because normal mirror-based profiles aren't
affected by the race, since they are independent of each other.


True.


Although this time the fix doesn't affect the scrub code much, it should
warn us that current scrub code is really hard to maintain.


This last sentence is true.  I found and fixed three BUG_ONs in RAID5
code on the first day I started testing in degraded mode, then hit
the scrub code and had to give up.  It was like a brick wall made out
of 

Re: Btrfs Heatmap - v2 - block group internals!

2016-11-17 Thread Qu Wenruo



At 11/18/2016 03:27 AM, Austin S. Hemmelgarn wrote:

On 2016-11-17 13:51, Hans van Kranenburg wrote:

Hey,

On 11/17/2016 02:27 AM, Qu Wenruo wrote:


At 11/17/2016 04:30 AM, Hans van Kranenburg wrote:

In the last two days I've added the --blockgroup option to btrfs
heatmap
to let it create pictures of block group internals.

Examples and more instructions are to be found in the README at:
https://github.com/knorrie/btrfs-heatmap/blob/master/README.md

To use the new functionality it needs a fairly recent python-btrfs for
the 'skinny' METADATA_ITEM_KEY to be present. Latest python-btrfs
release is v0.3, created yesterday.


Wow, really cool!


Thanks!


I have always dreamed about a visualization tool to represent the chunk and
extent levels of btrfs.

This should really save me from reading the boring decimal numbers from
btrfs-debug-tree.

Although IMHO the full fs output mixes the extent and chunk levels
together, which makes it a little hard to represent the multi-device case,
it's still an awesome tool!


The picture of a full filesystem just appends all devices together into
one big space, and then walks the dev_extent tree and associated
chunk/blockgroup items for the %used/greyscale value.


My fault, I just thought the different greyscale values meant metadata/data
extents. I got it confused with the block group level output.

Then I'm mostly OK.



Just found one small problem.
After specifying --size 16 to output a given block group (a small block
group; I need a large size to make the output visible), it takes a full CPU
and a very, very long time to run.

So long that I don't even want to wait.

I changed the size to 10, and it finished much faster.

Is that expected?



I don't see what displaying a blockgroup-level aggregate usage number
has to do with multi-device, except that the same %usage will appear
another time when using RAID1*.


Although in fact, for profiles like RAID0/5/6/10 it's completely
possible that one dev_extent contains all the data while another
dev_extent is almost empty.
Strictly speaking, at the full fs or device level we should output things at
the dev_extent level, and the greyscale should then represent dev_extent
usage (which is not possible, or at least quite hard, to calculate).


Anyway, the greyscale is mostly OK, just as a good additional output for
the full fs graph.



Although if it could output the fs or a specific device without the grey
scale, I think it would be better.

It would make the dev_extent-level fragmentation much clearer.



When generating a picture of a file system with multiple devices,
boundaries between the separate devices are not visible now.

If someone has a brilliant idea about how to do this without throwing
out actual usage data...


The first thought that comes to mind for me is to make each device be a
different color, and otherwise obey the same intensity mapping
correlating to how much data is there.  For example, if you've got a 3
device FS, the parts of the image that correspond to device 1 would go
from 0x000000 to 0xFF0000, the parts for device 2 could be 0x000000 to
0x00FF00, and the parts for device 3 could be 0x000000 to 0x0000FF. This
is of course not perfect (you can't tell what device each segment of
empty space corresponds to), but would probably cover most use cases.
(for example, with such a scheme, you could look at an image and tell
whether the data is relatively well distributed across all the devices
or you might need to re-balance).


What about linear output separated with lines(or just black)?
Like:
X = Used (White)
O = Unallocated (Gray)
  = Out of dev range (Black)
|-= Separator (Black)
---
| | |X| |X|
| |X|O|O|X|
| |O|O|O|O|
| |X|X|X|X|
|O|X|O|O|X|
---
 D D D D D
 e e e e e
 v v v v v
 1 2 3 4 5

Or multi vertical line to represent one dev:

|O |  |X |  |X |
|X |X |O |O |XX|
|XO|O |O |O |OX|
|XO|X |XX|X |XX|
|OX|X |OX|O |XX|



Thanks,
Qu




Re: root backup-reformat-restore

2016-11-17 Thread Marcus Sundman

On 18.11.2016 02:52, Hugo Mills wrote:

On Fri, Nov 18, 2016 at 02:38:25AM +0200, Marcus Sundman wrote:

The FAQ says that "the best solution for small devices (under about
16 GB) is to reformat the FS with the --mixed option to mkfs.btrfs".

OK. Does anyone have any good suggestions for doing that with an
existing / partition (which has special files and whatnot)?

I assume a backup-restore cycle is needed, but with which program(s)?

  - If you don't have (or don't care about) any snapshots, then
    - if you can mount both the old and the new FS at the same time
       - mv


It's the same partition, so I probably won't have the new FS there at 
the same time.



    - else, if you want to replace the old FS with the new
       - tar to a file elsewhere, and untar later


Can tar really preserve all special files' attributes? (I have separate 
partitions for /boot and /home, but the rest is in /)
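
(For what it's worth, GNU tar can carry most of that metadata if asked; a
rough sketch of the "tar to a file elsewhere, untar later" route, with
illustrative paths and device names:)

    tar --one-file-system --xattrs --acls --numeric-owner -cpf /elsewhere/root.tar -C / .
    # reformat: mkfs.btrfs --mixed <device>, mount it, then restore:
    tar --xattrs --acls --numeric-owner -xpf /elsewhere/root.tar -C /mnt/newroot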


Thanks!


Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Hugo Mills
On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote:
> 
> 
> At 11/18/2016 07:13 AM, Zygo Blaxell wrote:
> >On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> >>Fix the so-called famous RAID5/6 scrub error.
> >>
> >>Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> >>sight.
> >>(Yes, without the Phoronix report on this,
> >>https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> >>I won't ever be aware of it)
> >
> >If you're hearing about btrfs RAID5 bugs for the first time through
> >Phoronix, then your testing coverage is *clearly* inadequate.
> 
> I'm not fixing everything, I'm just focusing on the exact bug
> reported by Goffredo Baroncelli.
> 
> Although it seems that the bug he reported is in fact two bugs:
> one is the race condition I'm fixing, the other is that recovery restores
> the data correctly but screws up the parity.
> 
> I just don't understand why you always want to fix everything in one step.

   Fix the important, fundamental things first, and the others
later. This, from my understanding of Zygo's comments, appears to be
one of the others.

   It's papering over the missing bricks in the wall instead of
chipping out the mortar and putting new bricks in. It may need to be
fixed, but it's not the fundamental "OMG, everything's totally broken"
problem. If anything, it's only a serious problem *because* the other
thing (write hole) is still there.

   It just seems like a piece of mis-prioritised effort.

> >Fill up a RAID5 array, start a FS stress test, pull a drive out while
> >that's running, let the FS stress test run for another hour, then try
> >to replace or delete the missing device.  If there are any crashes,
> >corruptions, or EIO during any part of this process (assuming all the
> >remaining disks are healthy), then btrfs RAID5 is still broken, and
> >you've found another bug to fix.
> 
> Then it will be another bug to fix, not the bug I'm fixing.
> Unless you can show that whatever that bug is, it is related to this fix, I
> don't see how that helps.
> 
> >
> >The fact that so many problems in btrfs can still be found this way
> >indicates to me that nobody is doing this basic level of testing
> >(or if they are, they're not doing anything about the results).
> >
> >>Unlike many of us(including myself) assumed, it's not a timed bomb buried
> >>deeply into the RAID5/6 code, but a race condition in scrub recovery
> >>code.
> >
> >I don't see how this patch fixes the write hole issue at the core of
> >btrfs RAID56.  It just makes the thin layer of bugs over that issue a
> >little thinner.  There's still the metadata RMW update timebomb at the
> >bottom of the bug pile that can't be fixed by scrub (the filesystem is
> >unrecoverably damaged when the bomb goes off, so scrub isn't possible).
> 
> Not "the core of btrfs RAID56", but "the core of all RAID56".
> 
> And, this is just another unrelated problem, I didn't even want to
> repeat what I've written.

   That's the one that needs fixing *first*, though. (IMO, YMMV,
Contents may have settled in transit).

   Hugo.

> Thanks,
> Qu
> 
> >
> >>The problem is not found because normal mirror based profiles aren't
> >>affected by the race, since they are independent with each other.
> >
> >True.
> >
> >>Although this time the fix doesn't affect the scrub code much, it should
> >>warn us that current scrub code is really hard to maintain.
> >
> >This last sentence is true.  I found and fixed three BUG_ONs in RAID5
> >code on the first day I started testing in degraded mode, then hit
> >the scrub code and had to give up.  It was like a brick wall made out
> >of mismatched assumptions and layering inversions, using uninitialized
> >kernel data as mortar (though I suppose the "uninitialized" data symptom
> >might just have been an unprotected memory access).
> >
> >>Abuse of workquque to delay works and the full fs scrub is race prone.
> >>
> >>Xfstest will follow a little later, as we don't have good enough tools
> >>to corrupt data stripes pinpointly.
> >>
> >>Qu Wenruo (2):
> >>  btrfs: scrub: Introduce full stripe lock for RAID56
> >>  btrfs: scrub: Fix RAID56 recovery race condition
> >>
> >> fs/btrfs/ctree.h   |   4 ++
> >> fs/btrfs/extent-tree.c |   3 +
> >> fs/btrfs/scrub.c   | 192 
> >> +
> >> 3 files changed, 199 insertions(+)
> >>

-- 
Hugo Mills | UDP jokes: It's OK if no-one gets them.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Qu Wenruo



At 11/18/2016 07:13 AM, Zygo Blaxell wrote:

On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:

Fix the so-called famous RAID5/6 scrub error.

Thanks to Goffredo Baroncelli for reporting the bug and bringing it to our
attention.
(Yes, without the Phoronix report on this,
https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
I would never have been aware of it.)


If you're hearing about btrfs RAID5 bugs for the first time through
Phoronix, then your testing coverage is *clearly* inadequate.


I'm not fixing everything, I'm just focusing on the exact bug
reported by Goffredo Baroncelli.


Although it seems that the bug he reported is in fact two bugs:
one is the race condition I'm fixing, the other is that recovery restores
the data correctly but screws up the parity.


I just don't understand why you always want to fix everything in one step.



Fill up a RAID5 array, start a FS stress test, pull a drive out while
that's running, let the FS stress test run for another hour, then try
to replace or delete the missing device.  If there are any crashes,
corruptions, or EIO during any part of this process (assuming all the
remaining disks are healthy), then btrfs RAID5 is still broken, and
you've found another bug to fix.


Then it will be another bug to fix, not the bug I'm fixing.
Unless you can show that whatever that bug is, it is related to this fix, I
don't see how that helps.




The fact that so many problems in btrfs can still be found this way
indicates to me that nobody is doing this basic level of testing
(or if they are, they're not doing anything about the results).


Unlike many of us(including myself) assumed, it's not a timed bomb buried
deeply into the RAID5/6 code, but a race condition in scrub recovery
code.


I don't see how this patch fixes the write hole issue at the core of
btrfs RAID56.  It just makes the thin layer of bugs over that issue a
little thinner.  There's still the metadata RMW update timebomb at the
bottom of the bug pile that can't be fixed by scrub (the filesystem is
unrecoverably damaged when the bomb goes off, so scrub isn't possible).


Not "the core of btrfs RAID56", but "the core of all RAID56".

And, this is just another unrelated problem, I didn't even want to 
repeat what I've written.


Thanks,
Qu




The problem was not found earlier because normal mirror-based profiles aren't
affected by the race, since they are independent of each other.


True.


Although this time the fix doesn't affect the scrub code much, it should
warn us that current scrub code is really hard to maintain.


This last sentence is true.  I found and fixed three BUG_ONs in RAID5
code on the first day I started testing in degraded mode, then hit
the scrub code and had to give up.  It was like a brick wall made out
of mismatched assumptions and layering inversions, using uninitialized
kernel data as mortar (though I suppose the "uninitialized" data symptom
might just have been an unprotected memory access).


The abuse of workqueues to delay work items, together with the full-fs scrub,
is race prone.

An xfstest will follow a little later, as we don't have good enough tools
to corrupt data stripes at a precise location.

Qu Wenruo (2):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition

 fs/btrfs/ctree.h   |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/scrub.c   | 192 +
 3 files changed, 199 insertions(+)

--
2.10.2





Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Qu Wenruo



At 11/18/2016 02:09 AM, Goffredo Baroncelli wrote:

Hi Qu,

On 2016-11-15 03:50, Qu Wenruo wrote:

Fix the so-called famous RAID5/6 scrub error.

Thanks to Goffredo Baroncelli for reporting the bug and bringing it to our
attention.
(Yes, without the Phoronix report on this,
https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
I would never have been aware of it.)

Unlike many of us(including myself) assumed, it's not a timed bomb buried
deeply into the RAID5/6 code, but a race condition in scrub recovery
code.

The problem was not found earlier because normal mirror-based profiles aren't
affected by the race, since they are independent of each other.


Unfortunately, even with these patches btrfs still fails to rebuild a raid5
array in my tests.
To perform the tests I created a three-disk raid5 filesystem; then I created a
file

# python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" >out.txt

The filesystem layout is composed of stripes of 128k of data plus 64k of
parity.

The idea is that the first third of the stripe (64k on disk0) is composed of
"adaaa...";
the second third (again 64k, on disk1) is composed of "bd..."; and finally the
parity (disk0^disk1) is stored on the last portion of the stripe.

Doing some calculation it is possible to know where the data is physically 
stored.

I perform 3 tests:
1) I corrupt some bytes of the data stored on disk0; mount the filesystem; run 
scrub; unmount the filesystem; check all the disks if the bytes of the stripe 
are corrected
2) I corrupt some bytes of the data stored on disk1; mount the filesystem; run 
scrub; unmount the filesystem; check all the disks the bytes of the stripe are 
corrected
3) I corrupt some bytes of the parity stored on disk2; mount the filesystem; 
run scrub; unmount the filesystem; check all the disks the bytes of the stripe 
are corrected

Before your patches, all my tests failed (not every time, but very often).
After your patches, test1 and test2 still fail. In my tests test3 succeeded.


Thanks for your report and script.

I compared my script with yours, and found the difference is that I'm
corrupting the whole data stripe (64K, and without the space cache, to make
sure all the data is in the 1st full stripe), while you only corrupt 5
bytes of it.
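
(A rough sketch of that kind of corruption, not Goffredo's attached script; the
device and the physical offset of the stripe are placeholders:)

    # wipe the whole 64K data stripe on one device, then scrub in the foreground
    dd if=/dev/urandom of=/dev/loop0 bs=64K count=1 seek=$((phys_offset / 65536)) conv=notrunc
    mount /dev/loop0 /mnt/raid5 && btrfs scrub start -B /mnt/raid5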


And in that case, btrfs won't report more errors than the expected 1, which
means the race fix is working.


But the new btrfs-progs scrub will report that the recovered data stripes
are all OK, while P/Q is not correct.


This seems to be a new bug, unrelated to the RAID5/6 scrub race (whose
main symptoms are an incorrect csum error count and sometimes even an
unrecoverable error).



I assume it will be easier to fix than the race, but I can be totally 
wrong unless I dig into it further.


Anyway I'll fix it in a new patchset.

Thanks,
Qu



Enclosed is my script which performs the tests (it uses loop devices; please run
in a VM; first you have to uncomment the function make_imgs to create the disk
images).

Let me know if I can help you providing more information.

BR
G.Baroncelli




Although this time the fix doesn't affect the scrub code much, it should
warn us that current scrub code is really hard to maintain.

The abuse of workqueues to delay work items, together with the full-fs scrub,
is race prone.

An xfstest will follow a little later, as we don't have good enough tools
to corrupt data stripes at a precise location.

Qu Wenruo (2):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition

 fs/btrfs/ctree.h   |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/scrub.c   | 192 +
 3 files changed, 199 insertions(+)









raid56_recover.sh
Description: application/shellscript


Re: root backup-reformat-restore

2016-11-17 Thread Hugo Mills
On Fri, Nov 18, 2016 at 02:38:25AM +0200, Marcus Sundman wrote:
> The FAQ says that "the best solution for small devices (under about
> 16 GB) is to reformat the FS with the --mixed option to mkfs.btrfs".
> 
> OK. Does anyone have any good suggestions for doing that with an
> existing / partition (which has special files and whatnot)?
> 
> I assume a backup-restore cycle is needed, but with which program(s)?

 - If you don't have (or don't care about) any snapshots, then
   - if you can mount both the old and the new FS at the same time
     - mv
   - else, if you want to replace the old FS with the new
     - tar to a file elsewhere, and untar later
 - else, if you care about the snapshots
   - if you don't care about continuing btrfs send/receive backups to
     somewhere else
     - btrfs send/receive with -c (*)
   - if you want to continue using btrfs send -p incremental backups after
     the migration
     - cry, because you can't

(*) For subvols S1, S2, S3, ...: btrfs send S1; btrfs send -c S1 S2;
btrfs send -c S1 -c S2 S3; btrfs send -c S1 -c S2 -c S3 S4; etc...
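
In command form that's roughly (sketch; assumes the old and new filesystems
are mounted at /old and /new, and S1..S3 are read-only snapshots):

  btrfs send /old/S1                        | btrfs receive /new
  btrfs send -c /old/S1 /old/S2             | btrfs receive /new
  btrfs send -c /old/S1 -c /old/S2 /old/S3  | btrfs receive /new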

   Hugo.

-- 
Hugo Mills | Great films about cricket: The Umpire Strikes Back
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


root backup-reformat-restore

2016-11-17 Thread Marcus Sundman
The FAQ says that "the best solution for small devices (under about 16 
GB) is to reformat the FS with the --mixed option to mkfs.btrfs".


OK. Does anyone have any good suggestions for doing that with an 
existing / partition (which has special files and whatnot)?


I assume a backup-restore cycle is needed, but with which program(s)?

Thanks!

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 1/2] Btrfs: refactor btrfs_extent_same() slightly

2016-11-17 Thread Omar Sandoval
From: Omar Sandoval 

This just makes the following patch cleaner.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/ioctl.c | 33 -
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7acbd2c..e23f945 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3124,26 +3124,27 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
int ret;
u64 len = olen;
struct cmp_pages cmp;
-   int same_inode = 0;
+   bool same_inode = (src == dst);
u64 same_lock_start = 0;
u64 same_lock_len = 0;
 
-   if (src == dst)
-   same_inode = 1;
-
if (len == 0)
return 0;
 
-   if (same_inode) {
+   if (same_inode)
inode_lock(src);
+   else
+   btrfs_double_inode_lock(src, dst);
 
-   ret = extent_same_check_offsets(src, loff, &len, olen);
-   if (ret)
-   goto out_unlock;
-   ret = extent_same_check_offsets(src, dst_loff, &len, olen);
-   if (ret)
-   goto out_unlock;
+   ret = extent_same_check_offsets(src, loff, &len, olen);
+   if (ret)
+   goto out_unlock;
 
+   ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
+   if (ret)
+   goto out_unlock;
+
+   if (same_inode) {
/*
 * Single inode case wants the same checks, except we
 * don't want our length pushed out past i_size as
@@ -3171,16 +3172,6 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
 
same_lock_start = min_t(u64, loff, dst_loff);
same_lock_len = max_t(u64, loff, dst_loff) + len - 
same_lock_start;
-   } else {
-   btrfs_double_inode_lock(src, dst);
-
-   ret = extent_same_check_offsets(src, loff, &len, olen);
-   if (ret)
-   goto out_unlock;
-
-   ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
-   if (ret)
-   goto out_unlock;
}
 
/* don't make the dst file partly checksummed */
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-17 Thread Omar Sandoval
From: Omar Sandoval 

This is the follow-up to the discussions here [1] and here [2] that
makes Btrfs treat a src_length of 0 to dedupe as "until EOF". The
implementation is straightforward, but there are a few catches that
convinced me to post this as an RFC:

1. We still don't know for sure whether userspace cares about the
   original Btrfs behavior. Darrick and I both seem to think not, but
   hopefully someone will speak up if it matters to them.
2. When doing the implicit EOF, XFS returns 0 for bytes_deduped. I
   copied this in my implementation, but I'm guessing that's a bug
   rather than a feature.
3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
   implicit EOF gets around this in the existing XFS implementation. I
   copied this for the Btrfs implementation.

So now we have 3 options:

a) Apply these patches as-is.
b) Fix XFS to both return the actual bytes_deduped and cap the length
   for the EOF case. Do the same for Btrfs.
c) Make XFS consistent with the existing Btrfs behavior.

I'm starting to lean towards option c after writing this cover letter,
but I might as well send these out and get a second opinion.

Thanks!

1: http://marc.info/?l=linux-btrfs=146828374631829=2
2: http://marc.info/?l=linux-btrfs=147694962912532=2
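
To make the proposed semantics concrete, a rough userspace illustration
(using xfs_io's dedupe command, "dedupe <donor> <donor offset> <dst offset>
<length>"; file names are examples, and this shows the behaviour proposed by
these patches, not what current Btrfs does):

  # length 0: share everything from the donor, up to its EOF
  xfs_io -c "dedupe donor.bin 0 0 0" target.bin
  # roughly equivalent to passing the donor's size explicitly
  # (modulo the 16MB-per-call cap discussed above)
  xfs_io -c "dedupe donor.bin 0 0 $(stat -c %s donor.bin)" target.bin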

Omar Sandoval (2):
  Btrfs: refactor btrfs_extent_same() slightly
  Btrfs: make a source length of 0 imply EOF for dedupe

 fs/btrfs/ioctl.c | 45 -
 1 file changed, 20 insertions(+), 25 deletions(-)

-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 2/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-17 Thread Omar Sandoval
From: Omar Sandoval 

This makes us consistent with both reflink and the XFS implementation of
dedupe.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/ioctl.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e23f945..320fb35 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3121,21 +3121,25 @@ static int extent_same_check_offsets(struct inode 
*inode, u64 off, u64 *plen,
 static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 struct inode *dst, u64 dst_loff)
 {
-   int ret;
-   u64 len = olen;
+   int ret = 0;
+   u64 len;
struct cmp_pages cmp;
bool same_inode = (src == dst);
u64 same_lock_start = 0;
u64 same_lock_len = 0;
 
-   if (len == 0)
-   return 0;
-
if (same_inode)
inode_lock(src);
else
btrfs_double_inode_lock(src, dst);
 
+   if (olen == 0) {
+   olen = i_size_read(src) - loff;
+   if (olen == 0)
+   goto out_unlock;
+   }
+   len = olen;
+
ret = extent_same_check_offsets(src, loff, &len, olen);
if (ret)
goto out_unlock;
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Zygo Blaxell
On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> Fix the so-called famous RAID5/6 scrub error.
> 
> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> sight.
> (Yes, without the Phoronix report on this,
> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> I won't ever be aware of it)

If you're hearing about btrfs RAID5 bugs for the first time through
Phoronix, then your testing coverage is *clearly* inadequate.

Fill up a RAID5 array, start a FS stress test, pull a drive out while
that's running, let the FS stress test run for another hour, then try
to replace or delete the missing device.  If there are any crashes,
corruptions, or EIO during any part of this process (assuming all the
remaining disks are healthy), then btrfs RAID5 is still broken, and
you've found another bug to fix.
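
For concreteness, the procedure I have in mind looks roughly like this
(sketch only; device names, sizes and fsstress options are examples, and
fsstress comes from xfstests -- run this in a throwaway VM):

  mkfs.btrfs -f -d raid5 -m raid5 /dev/sdb /dev/sdc /dev/sdd
  mount /dev/sdb /mnt/test
  mkdir /mnt/test/work
  fsstress -d /mnt/test/work -p 4 -n 1000000 &
  STRESS_PID=$!
  echo 1 > /sys/block/sdd/device/delete   # "pull" a SCSI/SATA drive mid-load
  sleep 3600                              # keep the stress load running
  kill $STRESS_PID; wait
  btrfs replace start -B 3 /dev/sde /mnt/test   # 3 = devid of the lost disk
  # or: btrfs device delete missing /mnt/test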

The fact that so many problems in btrfs can still be found this way
indicates to me that nobody is doing this basic level of testing
(or if they are, they're not doing anything about the results).

> Unlike many of us(including myself) assumed, it's not a timed bomb buried
> deeply into the RAID5/6 code, but a race condition in scrub recovery
> code.

I don't see how this patch fixes the write hole issue at the core of
btrfs RAID56.  It just makes the thin layer of bugs over that issue a
little thinner.  There's still the metadata RMW update timebomb at the
bottom of the bug pile that can't be fixed by scrub (the filesystem is
unrecoverably damaged when the bomb goes off, so scrub isn't possible).

> The problem is not found because normal mirror based profiles aren't
> affected by the race, since they are independent with each other.

True.

> Although this time the fix doesn't affect the scrub code much, it should
> warn us that current scrub code is really hard to maintain.

This last sentence is true.  I found and fixed three BUG_ONs in RAID5
code on the first day I started testing in degraded mode, then hit
the scrub code and had to give up.  It was like a brick wall made out
of mismatched assumptions and layering inversions, using uninitialized
kernel data as mortar (though I suppose the "uninitialized" data symptom
might just have been an unprotected memory access).

> Abuse of workquque to delay works and the full fs scrub is race prone.
> 
> Xfstest will follow a little later, as we don't have good enough tools
> to corrupt data stripes pinpointly.
> 
> Qu Wenruo (2):
>   btrfs: scrub: Introduce full stripe lock for RAID56
>   btrfs: scrub: Fix RAID56 recovery race condition
> 
>  fs/btrfs/ctree.h   |   4 ++
>  fs/btrfs/extent-tree.c |   3 +
>  fs/btrfs/scrub.c   | 192 
> +
>  3 files changed, 199 insertions(+)
> 
> -- 
> 2.10.2
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


signature.asc
Description: Digital signature


btrfs raid1 degraded needs chunk root rebuild

2016-11-17 Thread Vladi Gergov
Any ideas if currently I can still recover data from this volume? I have
tried the new btrfs tools with btrfs rescue chunk-recover without much
success. Any ideas if i can recover any of the data?

- Forwarded message from Chris Mason  -

From: Chris Mason 
Subject: Re: btrfs raid1 degraded does not mount or fsck
To: Vladi Gergov 
Cc: linux-btrfs 
Date: Tue, 16 Nov 2010 16:50:58 -0500
User-Agent: Sup/git

Excerpts from Vladi Gergov's message of 2010-10-29 16:53:42 -0400:
> >>> gypsyops @ /mnt > sudo mount -o degraded /dev/sdc das3/
> Password:
> mount: wrong fs type, bad option, bad superblock on /dev/sdc,
>missing codepage or helper program, or other error
>In some cases useful info is found in syslog - try
>dmesg | tail  or so
>
> [  684.577540] device label das4 devid 2 transid 107954 /dev/sdc
> [  684.595150] btrfs: allowing degraded mounts
> [  684.595594] btrfs: failed to read chunk root on sdb
> [  684.604110] btrfs: open_ctree failed
>
> >>> gypsyops @ /mnt > sudo btrfsck /dev/sdc
> btrfsck: volumes.c:1367: btrfs_read_sys_array: Assertion `!(ret)' failed.

Ok, I dug through this and found the bug responsible for your
unmountable FS.  When we're mounted in degraded mode, and we don't have
enough drives available to do raid1,10, we can use the wrong raid
level for new allocations.

I'm fixing the kernel side so this doesn't happen anymore, but I'll need
to rebuild the chunk tree (and probably a few others) off your good disk to
fix things.

I've got it reproduced here though, so I'll make an fsck that can scan
for the correct trees and fix it for you.

Since you're basically going to be my first external fsck customer, is
there anyway you can do a raw device based backup of the blocks?  This
way if I do mess things up we can repeat the experiment.

-chris




- End forwarded message -

-- 

,-| Vladi
`-| Gergov
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-17 Thread Vlastimil Babka
On 11/16/2016 02:39 PM, E V wrote:
> System panic'd overnight running 4.9rc5 & rsync. Attached a photo of
> the stack trace, and the 38 call traces in a 2 minute window shortly
> before, to the bugzilla case for those not on it's e-mail list:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=186671

The panic screenshot has only the last part, but the end marker says
it's OOM with no killable processes. The DEBUG_VM config thus didn't
trigger anything, and still there's tons of pagecache, mostly clean,
that's not being reclaimed.

Could you now try this?
- enable CONFIG_PAGE_OWNER
- boot with kernel option: page_owner=on
- after the first oom, "cat /sys/kernel/debug/page_owner > file"
- provide the file (compressed, it will be quite large)
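
For reference, collecting that after the first OOM would look roughly like
this (sketch; assumes the kernel was built with CONFIG_PAGE_OWNER, booted
with page_owner=on, and that debugfs is available):

  mount -t debugfs none /sys/kernel/debug 2>/dev/null || true
  cat /sys/kernel/debug/page_owner > /tmp/page_owner.txt
  gzip /tmp/page_owner.txt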

Vlastimil

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs Heatmap - v2 - block group internals!

2016-11-17 Thread Hans van Kranenburg
On 11/17/2016 08:27 PM, Austin S. Hemmelgarn wrote:
> On 2016-11-17 13:51, Hans van Kranenburg wrote:
>>
>> When generating a picture of a file system with multiple devices,
>> boundaries between the separate devices are not visible now.
>>
>> If someone has a brilliant idea about how to do this without throwing
>> out actual usage data...
>>
> The first thought that comes to mind for me is to make each device be a
> different color, and otherwise obey the same intensity mapping
> correlating to how much data is there.  For example, if you've got a 3
> device FS, the parts of the image that correspond to device 1 would go
> from 0x000000 to 0xFF0000, the parts for device 2 could be 0x000000 to
> 0x00FF00, and the parts for device 3 could be 0x000000 to 0x0000FF. This
> is of course not perfect (you can't tell what device each segment of
> empty space corresponds to), but would probably cover most use cases.
> (for example, with such a scheme, you could look at an image and tell
> whether the data is relatively well distributed across all the devices
> or you might need to re-balance).

"most use cases" -> what are those use cases? If you want to know how
much total GiB or TiB is present on all devices, a simple btrfs fi show
does suffice.

Another option is to just write three images, one for each of the
devices. :) Those are more easily compared.

The first idea with color that I had was to use two different colors for
data and metadata. When also using separate colors for devices, it might
all together become a big mess really quickly, or, maybe a beautiful
rainbow.

But, the fun with visualizations of data is that you learn whether they
just work(tm) or don't as soon as you see them. Mathematical or
algorithmic beauty is not always a good recipe for beauty as seen by the
human eye.

So, let's gather a bunch of ideas which we can try out and then observe
the result.

Before doing so, I'm going to restructure the code a bit more so I can
write another script in the same directory, just doing import heatmap
and calling a few functions in there to quickly try stuff, bypassing the
normal cli api.

Also, the png writing is now done by some random png library
that I found, which requires me to build (or copy/resize) an entire
pixel grid in memory, explicitly listing all pixel values, which is a
bit of a memory hog for bigger pictures, so I want to see if something
can be done there also.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-17 Thread Martin Steigerwald
Am Donnerstag, 17. November 2016, 12:05:31 CET schrieb Chris Murphy:
> I think the wiki should be updated to reflect that raid1 and raid10
> are mostly OK. I think it's grossly misleading to consider either as
> green/OK when a single degraded read write mount creates single chunks
> that will then prevent a subsequent degraded read write mount. And
> also the lack of various notifications of device faultiness I think
> make it less than OK also. It's not in the "do not use" category but
> it should be in the middle ground status so users can make informed
> decisions.

I agree – the error reporting, I think, is indeed misleading. Feel free to edit
it.

Ciao,
-- 
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-17 Thread Austin S. Hemmelgarn

On 2016-11-17 15:05, Chris Murphy wrote:

I think the wiki should be updated to reflect that raid1 and raid10
are mostly OK. I think it's grossly misleading to consider either as
green/OK when a single degraded read write mount creates single chunks
that will then prevent a subsequent degraded read write mount. And
also the lack of various notifications of device faultiness I think
make it less than OK also. It's not in the "do not use" category but
it should be in the middle ground status so users can make informed
decisions.


It's worth pointing out also regarding this:
* This is handled sanely in recent kernels (the check got changed from 
per-fs to per-chunk, so you still have a usable FS if all the single 
chunks are only on devices you still have).
* This is only an issue with filesystems with exactly two disks.  If a 
3+ disk raid1 FS goes degraded, you still generate raid1 chunks.
* There are a couple of other cases where raid1 mode falls flat on its
face (lots of I/O errors in a short span of time with compression
enabled can cause a kernel panic, for example).
* raid10 has some other issues of its own (you lose two devices, your
filesystem is dead, which shouldn't be the case 100% of the time (if you
lose different parts of each mirror, BTRFS _should_ be able to recover,
it just doesn't do so right now)).
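
Regarding the single-chunk issue mentioned above, a rough sketch of how to
check for and repair it once all devices are present again (the mount point
is just an example):

  btrfs filesystem df /mnt/data   # look for "Data, single:" / "Metadata, single:"
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/data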


As far as the failed device handling issues, those are a problem with 
BTRFS in general, not just raid1 and raid10, so I wouldn't count those 
against raid1 and raid10.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-17 Thread Chris Murphy
I think the wiki should be updated to reflect that raid1 and raid10
are mostly OK. I think it's grossly misleading to consider either as
green/OK when a single degraded read write mount creates single chunks
that will then prevent a subsequent degraded read write mount. And
also the lack of various notifications of device faultiness I think
make it less than OK also. It's not in the "do not use" category but
it should be in the middle ground status so users can make informed
decisions.


Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs Heatmap - v2 - block group internals!

2016-11-17 Thread Austin S. Hemmelgarn

On 2016-11-17 13:51, Hans van Kranenburg wrote:

Hey,

On 11/17/2016 02:27 AM, Qu Wenruo wrote:


At 11/17/2016 04:30 AM, Hans van Kranenburg wrote:

In the last two days I've added the --blockgroup option to btrfs heatmap
to let it create pictures of block group internals.

Examples and more instructions are to be found in the README at:
https://github.com/knorrie/btrfs-heatmap/blob/master/README.md

To use the new functionality it needs a fairly recent python-btrfs for
the 'skinny' METADATA_ITEM_KEY to be present. Latest python-btrfs
release is v0.3, created yesterday.


Wow, really cool!


Thanks!


I always dream about a visualizing tool to represent the chunk and
extent level of btrfs.

This should really save me from reading the boring dec numbers from
btrfs-debug-tree.

Although IMHO the full fs output is mixing extent and chunk level
together, which makes it a little hard to represent multi-device case,
it's still an awesome tool!


The picture of a full filesystem just appends all devices together into
one big space, and then walks the dev_extent tree and associated
chunk/blockgroup items for the %used/greyscale value.

I don't see what displaying a blockgroup-level aggregate usage number
has to do with multi-device, except that the same %usage will appear
another time when using RAID1*.

When generating a picture of a file system with multiple devices,
boundaries between the separate devices are not visible now.

If someone has a brilliant idea about how to do this without throwing
out actual usage data...

The first thought that comes to mind for me is to make each device be a 
different color, and otherwise obey the same intensity mapping 
correlating to how much data is there.  For example, if you've got a 3 
device FS, the parts of the image that correspond to device 1 would go 
from 0x000000 to 0xFF0000, the parts for device 2 could be 0x000000 to
0x00FF00, and the parts for device 3 could be 0x000000 to 0x0000FF.
This is of course not perfect (you can't tell what device each segment 
of empty space corresponds to), but would probably cover most use cases. 
(for example, with such a scheme, you could look at an image and tell 
whether the data is relatively well distributed across all the devices 
or you might need to re-balance).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: send: fix failure of xfstests btrfs/038

2016-11-17 Thread Omar Sandoval
On Tue, Nov 15, 2016 at 05:16:27PM +0900, Tsutomu Itoh wrote:
> The following patch was imperfect, so xfstests btrfs/038 was failed.
> 
>   6d4fb3d  btrfs-progs: send: fix handling of multiple snapshots (-p option)
> 
> [before]
> | # ./check btrfs/038
> | FSTYP -- btrfs
> | PLATFORM  -- Linux/x86_64 luna 4.9.0-rc5
> | MKFS_OPTIONS  -- /dev/sdb3
> | MOUNT_OPTIONS -- /dev/sdb3 /test6
> |
> | btrfs/038 1s ... [failed, exit status 1] - output mismatch (see 
> /For_RT/xfstests2/results//btrfs/038.out.bad)
> | --- tests/btrfs/038.out 2015-08-04 16:09:38.0 +0900
> | +++ /For_RT/xfstests2/results//btrfs/038.out.bad2016-11-15 
> 13:39:48.589435290 +0900
> | @@ -7,3 +7,5 @@
> |  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> |  wrote 1/1 bytes at offset 30
> |  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> | +failed: '/usr/local/bin/btrfs send -p /test6/mysnap1 -c 
> /test6/clones_snap /test6/mysnap2 -f /tmp/tmp.aLpvAC7lsx/2.snap'
> | +(see /For_RT/xfstests2/results//btrfs/038.full for details)
> | ...
> | (Run 'diff -u tests/btrfs/038.out 
> /For_RT/xfstests2/results//btrfs/038.out.bad'  to see the entire diff)
> | Ran: btrfs/038
> | Failures: btrfs/038
> 
> [after]
> | # ./check btrfs/038
> | FSTYP -- btrfs
> | PLATFORM  -- Linux/x86_64 luna 4.9.0-rc5
> | MKFS_OPTIONS  -- /dev/sdb3
> | MOUNT_OPTIONS -- /dev/sdb3 /test6
> | 
> | btrfs/038 1s ... 1s
> | Ran: btrfs/038
> | Passed all 1 tests
> 
> Signed-off-by: Tsutomu Itoh 

Verified that this fixes both btrfs/038 and btrfs/117.

Tested-by: Omar Sandoval 

-- 
Omar
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs Heatmap - v2 - block group internals!

2016-11-17 Thread Hans van Kranenburg
Hey,

On 11/17/2016 02:27 AM, Qu Wenruo wrote:
> 
> At 11/17/2016 04:30 AM, Hans van Kranenburg wrote:
>> In the last two days I've added the --blockgroup option to btrfs heatmap
>> to let it create pictures of block group internals.
>>
>> Examples and more instructions are to be found in the README at:
>> https://github.com/knorrie/btrfs-heatmap/blob/master/README.md
>>
>> To use the new functionality it needs a fairly recent python-btrfs for
>> the 'skinny' METADATA_ITEM_KEY to be present. Latest python-btrfs
>> release is v0.3, created yesterday.
>>
> Wow, really cool!

Thanks!

> I always dream about a visualizing tool to represent the chunk and
> extent level of btrfs.
> 
> This should really save me from reading the boring dec numbers from
> btrfs-debug-tree.
> 
> Although IMHO the full fs output is mixing extent and chunk level
> together, which makes it a little hard to represent multi-device case,
> it's still an awesome tool!

The picture of a full filesystem just appends all devices together into
one big space, and then walks the dev_extent tree and associated
chunk/blockgroup items for the %used/greyscale value.

I don't see what displaying a blockgroup-level aggregate usage number
has to do with multi-device, except that the same %usage will appear
another time when using RAID1*.

When generating a picture of a file system with multiple devices,
boundaries between the separate devices are not visible now.

If someone has a brilliant idea about how to do this without throwing
out actual usage data...

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Goffredo Baroncelli
Hi Qu,

On 2016-11-15 03:50, Qu Wenruo wrote:
> Fix the so-called famous RAID5/6 scrub error.
> 
> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> sight.
> (Yes, without the Phoronix report on this,
> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> I won't ever be aware of it)
> 
> Unlike many of us(including myself) assumed, it's not a timed bomb buried
> deeply into the RAID5/6 code, but a race condition in scrub recovery
> code.
> 
> The problem is not found because normal mirror based profiles aren't
> affected by the race, since they are independent with each other.

Unfortunately, even with these patches btrfs still fails to rebuild a raid5
array in my tests.
To perform the tests I created a three-disk raid5 filesystem; then I created a
file:

# python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" >out.txt

The filesystem layout consists of stripes of 128k of data plus 64k of
parity.

The idea is that the first third of the stripe (64k on disk0) is composed of
"adaaa...";
the second third (again 64k, on disk1) is composed of "bdbbb..."; and finally the
parity (disk0^disk1) is stored on the last portion of the stripe.

Doing some calculation it is possible to know where the data is physically 
stored.

I perform 3 tests:
1) I corrupt some bytes of the data stored on disk0; mount the filesystem; run
scrub; unmount the filesystem; check on all the disks whether the bytes of the
stripe were corrected
2) I corrupt some bytes of the data stored on disk1; mount the filesystem; run
scrub; unmount the filesystem; check on all the disks whether the bytes of the
stripe were corrected
3) I corrupt some bytes of the parity stored on disk2; mount the filesystem;
run scrub; unmount the filesystem; check on all the disks whether the bytes of
the stripe were corrected

Before your patches, all my tests failed (not every time, but very often).
After your patches, test1 and test2 still fail. In my tests, test3 succeeded.

Enclosed is my script which performs the tests (it uses loop devices; please run it in
a VM; first you have to uncomment the function make_imgs to create the disk
images).

Let me know if I can help you providing more information.

BR
G.Baroncelli


> 
> Although this time the fix doesn't affect the scrub code much, it should
> warn us that current scrub code is really hard to maintain.
> 
> Abuse of workquque to delay works and the full fs scrub is race prone.
> 
> Xfstest will follow a little later, as we don't have good enough tools
> to corrupt data stripes pinpointly.
> 
> Qu Wenruo (2):
>   btrfs: scrub: Introduce full stripe lock for RAID56
>   btrfs: scrub: Fix RAID56 recovery race condition
> 
>  fs/btrfs/ctree.h   |   4 ++
>  fs/btrfs/extent-tree.c |   3 +
>  fs/btrfs/scrub.c   | 192 
> +
>  3 files changed, 199 insertions(+)
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


test-btrfs.sh
Description: Bourne shell script


Re: [PATCH] generic: test concurrent non-overlapping direct I/O on the same extents

2016-11-17 Thread Omar Sandoval
On Thu, Nov 17, 2016 at 01:26:11PM +0800, Eryu Guan wrote:
> On Wed, Nov 16, 2016 at 04:29:34PM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > There have been a couple of logic bugs in `btrfs_get_extent()` which
> > could lead to spurious -EEXIST errors from read or write. This test
> > exercises those conditions by having two threads race to add an extent
> > to the extent map.
> > 
> > This is fixed by Linux commit 8dff9c853410 ("Btrfs: deal with duplciates
> > during extent_map insertion in btrfs_get_extent") and the patch "Btrfs:
> > deal with existing encompassing extent map in btrfs_get_extent()"
> > (http://marc.info/?l=linux-btrfs=147873402311143=2).
> > 
> > Although the bug is Btrfs-specific, nothing about the test is.
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> [snip]
> > +# real QA test starts here
> > +
> > +_supported_fs generic
> > +_supported_os Linux
> > +_require_test
> > +_require_xfs_io_command "falloc"
> > +_require_test_program "dio-interleaved"
> > +
> > +extent_size="$(($(stat -f -c '%S' "$TEST_DIR") * 2))"
> 
> There's a helper to get fs block size: "get_block_size".
> 
> > +num_extents=1024
> > +testfile="$TEST_DIR/$$-testfile"
> > +
> > +truncate -s 0 "$testfile"
> 
> I prefer using xfs_io to do the truncate, 
> 
> $XFS_IO_PROG -fc "truncate 0" "$testfile"
> 
> Because in rare cases truncate(1) may be unavailable, e.g. RHEL5,
> usually it's not a big issue, but xfs_io works all the time, we have a
> better way, so why not :)
> 
> > +for ((off = 0; off < num_extents * extent_size; off += extent_size)); do
> > +   xfs_io -c "falloc $off $extent_size" "$testfile"
> 
> Use $XFS_IO_PROG not bare xfs_io.
> 
> I can fix all the tiny issues at commit time.
> 
> Thanks,
> Eryu

Thank you, Eryu, that's all okay with me. I also had a typo here:

diff --git a/src/dio-interleaved.c b/src/dio-interleaved.c
index 831a191..6b04c99 100644
--- a/src/dio-interleaved.c
+++ b/src/dio-interleaved.c
@@ -58,7 +58,7 @@ int main(int argc, char **argv)
int fd;
 
if (argc != 4) {
-   fprintf(stderr, "usage: %s SECTORSIZE NUM_EXTENTS PATH\n",
+   fprintf(stderr, "usage: %s EXTENT_SIZE NUM_EXTENTS PATH\n",
argv[0]);
return EXIT_FAILURE;
}


Feel free to fix this at commit time, too, or I can send a v2 if you
prefer.

-- 
Omar
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs btree_ctree_super fault

2016-11-17 Thread Chris Mason

On 11/17/2016 12:39 AM, Chris Cui wrote:

We have just encountered the same bug on 4.9.0-rc2.  Any solution now?


kernel BUG at fs/btrfs/ctree.c:3172!
invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 0 PID: 22702 Comm: trinity-c40 Not tainted 4.9.0-rc4-think+ #1
task: 8804ffde37c0 task.stack: c90002188000
RIP: 0010:[]
  [] btrfs_set_item_key_safe+0x179/0x190 [btrfs]
RSP: :c9000218b8a8  EFLAGS: 00010246
RAX:  RBX: 8804fddcf348 RCX: 1000
RDX:  RSI: c9000218b9ce RDI: c9000218b8c7
RBP: c9000218b908 R08: 4000 R09: c9000218b8c8
R10:  R11: 0001 R12: c9000218b8b6
R13: c9000218b9ce R14: 0001 R15: 880480684a88
FS:  7f7c7f998b40() GS:88050780() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 00044f15f000 CR4: 001406f0
DR0: 7f4ce439d000 DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Stack:
 88050143 d305a00a2245 006c0002 0510
 6c0002d3 1000 6427eebb 880480684a88
  8804fddcf348 2000 
Call Trace:
 [] __btrfs_drop_extents+0xb00/0xe30 [btrfs]


We're going to bash on Josef's patch and probably send it with the next 
merge window (queued for stable as well).


https://patchwork.kernel.org/patch/9431679/

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html