Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-11-05 Thread Goffredo Baroncelli
Resent because I didn't see it on the mailing list.

Hi Qu,

On 2016-11-04 03:10, Qu Wenruo wrote:
[...]
> 
> I reproduced your problem and found that it seems to be a race problem.
[...]
[...]> 
> I digged a little further into the case 2) and found:
> a) Kernel is scrubbing correct range
>So the extra csum error is not caused by checking wrong logical
>bytenr range
> 
> b) Kernel is scrubbing some pages twice
>That's the cause of the problem.

Due to time constraints, I was unable to dig further into this issue. I tried to add 
some printk calls to check whether the code processes the data correctly; what I found is:

1) the code seems to process the right data
2) the code seems to produce the correct data (i.e. the code was able to 
rebuild the correct data on the basis of the checksums and the parity)
3) like you, I found that the code processed the same data twice

Unfortunately I was unable to figure out why the code failed to write the right 
data to the platter. Following the write path through the several handlers of 
the bio was a job greater than my capabilities (and my time :-) )

> 
> 
> And unlike what most of us assume, scrubbing the full fs is in fact a very racy thing.

On the basis of your (and my) observation that the code seems to process 
the same data multiple times, this may be an explanation.

> Scrubbing the full fs is split into N scrubbing ioctls, one for each device.
> 
> So for the above script, the kernel is doing *3* scrub jobs.
> For other profiles it may not be a problem, but for RAID5/6 a race can happen 
> easily, like:
> 
> Scrub dev1(P)   |  Scrub dev2(D1)| Scrub dev3(D2)
> ---
> Read out full stripe| Read out D1| Read out D2
> | Check Csum for D1  | Check Csum for D2
> | Csum mismatch (err++)  | Csum matches
> Cal parity  | Read out full stripe   |
> Parity mismatch | Do recovery|
> Check full stripe   |
> D1 csum mismatch (err++)|
> 
> So csum mismatch can be counted twice.
> 
> And since scrubbing a corrupted data stripe can easily race with
> scrubbing the parity, if the timing is bad enough, it can
> lead to unrecoverable csum errors.

Interesting, I wasn't aware that the scrub is done in parallel on the different 
disks. This explains a lot of the strangeness.




 
> On the other hand, if you scrub only the damaged device, no race
> will happen, so cases 2) and 3) will just disappear.

> 
> Would you please try to scrub only one device at a time?

I did it and I can confirm your hypothesis: if I run the scrub process one disk at a 
time, I am unable to reproduce the corruption. Instead, if I run the scrub 
process in parallel on all the disks, I sometimes get a corruption: on average, 
out of every 6 tests I got from 1 to 3 failures.

So the scrub strategy must be different in case of a RAID5/6 chunk: in this 
case the parallel scrub must be avoided and the scrub must be performed on a 
per-stripe basis.
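
In the meantime, a minimal sketch of the workaround (assuming the three loop devices 
and the mnt/ mount point from Goffredo's test setup, and that the filesystem is already 
mounted): scrub the members one at a time, waiting for each to finish, instead of 
letting a whole-filesystem scrub run one job per device in parallel.

---
#!/bin/bash
# Workaround sketch: sequential per-device scrub instead of a parallel
# whole-filesystem scrub (which issues one scrub job per device at once).
mnt=mnt/

for dev in /dev/loop0 /dev/loop1 /dev/loop2; do
    echo "scrubbing $dev"
    sudo btrfs scrub start -B "$dev"    # -B: wait until this device finishes
done
sudo btrfs scrub status "$mnt"
---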


> 
>>
>> 5) I check the disks at the offsets above, to verify that the data/parity is 
>> correct
> 
> You could try the new offline scrub; it can save you a lot of time in finding 
> data/parity corruption.
> https://github.com/adam900710/btrfs-progs/tree/fsck_scrub

I will try it

BR
G.Baroncelli


> 
> And of course, it is more reliable than the kernel scrub (single thread, no extra IO, no 
> race) :)
> 
> Thanks,
> Qu
> 
>>
>> However I found that:
>> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any 
>> corruption, but recomputes the parity (always correctly);
>>
>> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find 
>> the corruption. But I found two main behaviors:
>>
>> 2.a) the kernel repaired the damage, but computed the wrong parity: in place 
>> of the parity, the kernel copied the data of the second disk onto the parity 
>> disk
>>
>> 2.b) the kernel repaired the damage, and rebuild a correct parity
>>
>> I have to point out another strange thing: in dmesg I found two kinds of 
>> messages:
>>
>> msg1)
>> []
>> [ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
>> [ 1021.366949] BTRFS: has skinny extents
>> [ 1021.399208] BTRFS warning (device loop2): checksum error at logical 
>> 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, 
>> length 4096, links 1 (path: out.txt)
>> [ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
>> rd 0, flush 0, corrupt 1, gen 0
>> [ 1021.399291] BTRFS error (device loop2): fixed up error at logical 
>> 142802944 on dev /dev/loop0
>>
>> msg2)
>> [ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
>> [ 1017.435074] BTRFS: has skinny extents
>> [ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, rd 
>> 0, flush 0, corrupt 1, gen 0
>> [ 1017.463403] BTRFS warning (device loop2): checksum error at logical 
>> 14

Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-11-03 Thread Qu Wenruo



At 06/25/2016 08:21 PM, Goffredo Baroncelli wrote:

Hi all,

following the thread "Adventures in btrfs raid5 disk recovery", I investigated 
a bit the BTRFS capability to scrub a corrupted raid5 filesystem. To test it, I first 
found where a file was stored, and then I tried to corrupt the data disks (when unmounted) 
or the parity disk.
The result showed that sometimes the kernel recomputed the parity wrongly.

I tested the following kernel
- 4.6.1
- 4.5.4
and both showed the same behavior.

The test was performed as described below:

1) create a filesystem in raid5 mode (for data and metadata) of 1500MB

truncate -s 500M disk1.img; losetup -f disk1.img
truncate -s 500M disk2.img; losetup -f disk2.img
truncate -s 500M disk3.img; losetup -f disk3.img
sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
sudo mount /dev/loop0 mnt/

2) I created a file with a length of 128kb:

python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
sudo umount mnt/

3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was able to 
find where the file stripe is located:

/dev/loop0: offset=81788928+16*4096(64k, second half of the file: 
'bd.)
/dev/loop1: offset=61865984+16*4096(64k, first half of the file: 
'ad.)
/dev/loop2: offset=61865984+16*4096(64k, parity: 
'\x03\x00\x03\x03\x03.)

4) I tried to corrupt each disk (one disk per test), and then run a scrub:

for example for the disk /dev/loop2:
sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
seek=$((61865984+16*4096)) count=5
sudo mount /dev/loop0 mnt
sudo btrfs scrub start mnt/.


I reproduced your problem and found that it seems to be a race problem.

The problem I found is mostly the same as yours, but with some more details:

The script is like:
---
#!/bin/bash

dev1=/dev/vdb6
dev2=/dev/vdb7
dev3=/dev/vdb8
mnt=/mnt/test

umount $mnt
# First full stripe layout is
# dev1: Parity, dev2: Data Stripe1 dev3: Data Stripe2
mkfs.btrfs $dev1 $dev2 $dev3 -f -m raid5 -d raid5
mount $dev1 $mnt -o nospace_cache
xfs_io -f -c "pwrite 0 128k" $mnt/file1
sync
umount $mnt

dmesg -C
# destroy data stripe 1 (on dev2)
dd if=/dev/urandom of=$dev2 bs=1 count=64k seek=546308096


# btrfs-progs scrub
# Newly introduced function, offline scrub, quite handy here
btrfs check --scrub $dev1

# kernel scrub
mount $dev1 $mnt -o nospace_cache
btrfs scrub start -B $mnt

# Feel free to umount and call offline scrub
# umount $mnt
# btrfs check --scrub $dev1
---

The result is divided into the following types:

1) Kernel scrub reports 16 recoverable csum errors
   Same as offline scrub. All correct.

2) Kernel scrub reports 17 or more recoverable csum errors
   Even though we only corrupted 16 pages, the extra csum errors come out
   of almost nowhere

3) Kernel scrub reports some unrecoverable csum errors
   Totally insane. But quite low probability.

I dug a little further into case 2) and found:
a) Kernel is scrubbing correct range
   So the extra csum error is not caused by checking wrong logical
   bytenr range

b) Kernel is scrubbing some pages twice
   That's the cause of the problem.


And unlike what most of us assume, scrubbing the full fs is in fact a very racy thing.
Scrubbing the full fs is split into N scrubbing ioctls, one for each device.

So for the above script, the kernel is doing *3* scrub jobs.
For other profiles it may not be a problem, but for RAID5/6 a race can 
happen easily, like:


Scrub dev1(P)   |  Scrub dev2(D1)| Scrub dev3(D2)
---
Read out full stripe| Read out D1| Read out D2
| Check Csum for D1  | Check Csum for D2
| Csum mismatch (err++)  | Csum matches
Cal parity  | Read out full stripe   |
Parity mismatch | Do recovery|
Check full stripe   |
D1 csum mismatch (err++)|

So csum mismatch can be counted twice.

And since scrubbing a corrupted data stripe can easily race with 
scrubbing the parity, if the timing is bad enough, it can 
lead to unrecoverable csum errors.


On the other hand, if you scrub only the damaged device, no race 
will happen, so cases 2) and 3) will just disappear.


Would you please try to scrub only one device at a time?



5) I check the disks at the offsets above, to verify that the data/parity is 
correct


You could try the new offline scrub; it can save you a lot of time in 
finding data/parity corruption.

https://github.com/adam900710/btrfs-progs/tree/fsck_scrub

And of course, it is more reliable than the kernel scrub (single thread, no extra 
IO, no race) :)


Thanks,
Qu



However I found that:
1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any corruption, 
but recomputes the parity (always correctly);

2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find the 
corruption. But I found two main beha

Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Qu Wenruo



At 09/22/2016 11:07 AM, Christoph Anton Mitterer wrote:

On Thu, 2016-09-22 at 10:08 +0800, Qu Wenruo wrote:

And I don't see the necessary to csum the parity.
Why csum a csum again?


I'd say simply for the following reason:
Imagine the smallest RAID5: 2x data D1 D2, 1x parity P
If D2 is lost it could be recalculated via D1 and P.

What if only (all) the checksum information for D2 is lost (e.g.
because of further silent data corruption on the blocks of these
csums)?


First things first: csums are metadata, which is normally protected by 
the metadata profile.


So, you are saying the csum tree is corrupted.
Wow, that's protected by RAID5 (assuming you're using RAID5 for both 
metadata and data) and by the tree leaf csum.

If that happens, which means at least 2 devices are already corrupted.

Your data is doomed anyway.


Then we'd only know D1 is valid (which still has working csums).


If that really happens, csum tree leaf is corrupted, and your D1 csum 
will be corrupted too, considering how close they are.


No means to recover anyway.


But we
wouldn't know whether D2 is (because gone) neither whether P is
(because not csummed).
And next imagine silent data corruption in either D2 or P => we cannot
tell which of them is valid, no repair possible... or do I miss
something?


As stated above, your fs is already screwed up in that case.

The biggest problem is that parity is a block-level thing (its bytenr is a 
dev bytenr, not a logical bytenr), while csum is a logical-level thing.


Mixing these layers up will cause more problems and more maintenance effort 
than the benefit is worth.






Just as you expected, it doesn't check parity.
Even for RAID1/DUP, it won't check the backup if it succeeded
reading
the first stripe.

That would IMO be really a highly severe bug... making scrubbing close
to completely useless for multi-device fs.
I mean the whole reason for doing it is to find [silently] corrupted
blocks... in order to be able to do something about it.
If on would only notice if the read actually fails because the one
working block is also gone.. then why having a RAID in the first place?


That's "--check-data-csum", not in-kernel scrub.

Thanks,
Qu




Current implement doesn't really care if it's the data or the copy
corrupted, any data can be read out, then there is no problem.

Except it makes RAID practically useless... => problem


Cheers,
Chris.







Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Chris Murphy
On Wed, Sep 21, 2016 at 9:00 PM, Qu Wenruo  wrote:

>> Both copies are not scrubbed? Oh hell...
>
>
> I was replying to the "--check-data-csum" of btrfsck.
>
> I mean, for --check-data-csum, it doesn't read the backup if the first data
> can be read out without error.
>
> And if the first data is wrong, btrfsck will read backup, and output error
> about wrong csum, but won't return error.

OK I'm convinced both progs and kernel scrubs have problems. I guess
it makes sense to fix progs first, and make sure it's fixed there,
then get it fixed in the kernel.



-- 
Chris Murphy


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Christoph Anton Mitterer
On Thu, 2016-09-22 at 10:08 +0800, Qu Wenruo wrote:
> And I don't see the necessary to csum the parity.
> Why csum a csum again?

I'd say simply for the following reason:
Imagine the smallest RAID5: 2x data D1 D2, 1x parity P
If D2 is lost it could be recalculated via D1 and P.

What if only (all) the checksum information for D2 is lost (e.g.
because of further silent data corruption on the blocks of these
csums)?
Then we'd only know that D1 is valid (it still has working csums). But we
wouldn't know whether D2 is (because its csums are gone), nor whether P is
(because it is not csummed).
And next imagine silent data corruption in either D2 or P => we cannot
tell which of them is valid, no repair possible... or do I miss
something?
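
For reference, the RAID5 parity here is just the byte-wise XOR of the data strips, so 
any single lost strip can be rebuilt from the remaining ones. A toy sketch with made-up 
byte values (which happen to match the '\x03\x00\x03...' parity bytes seen earlier in 
the thread in step 3 of Goffredo's test, since 'a' XOR 'b' = 0x03 and 'd' XOR 'd' = 0x00):

---
# Toy RAID5 parity demo: P = D1 XOR D2; if D2 is lost, rebuild it as D1 XOR P.
d1=$(( 0x61 ))                 # one byte of data strip D1 ('a')
d2=$(( 0x62 ))                 # one byte of data strip D2 ('b')
p=$(( d1 ^ d2 ))               # parity byte

rebuilt_d2=$(( d1 ^ p ))       # recover the "lost" strip from D1 and P
printf 'P=0x%02x  rebuilt D2=0x%02x (expected 0x%02x)\n' "$p" "$rebuilt_d2" "$d2"
---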


> Just as you expected, it doesn't check parity.
> Even for RAID1/DUP, it won't check the backup if it succeeded
> reading 
> the first stripe.
That would IMO be really a highly severe bug... making scrubbing close
to completely useless for multi-device fs.
I mean the whole reason for doing it is to find [silently] corrupted
blocks... in order to be able to do something about it.
If one would only notice when the read actually fails because the one
working block is also gone... then why have RAID in the first place?

> Current implement doesn't really care if it's the data or the copy 
> corrupted, any data can be read out, then there is no problem.
Except it makes RAID practically useless... => problem


Cheers,
Chris.






Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Qu Wenruo



At 09/22/2016 10:44 AM, Chris Murphy wrote:

On Wed, Sep 21, 2016 at 8:08 PM, Qu Wenruo  wrote:



At 09/21/2016 11:13 PM, Chris Murphy wrote:



I understand some things should go in fsck for comparison. But in this
case I don't see how it can help. Parity is not checksummed. The only
way to know if it's wrong is to read all of the data strips, compute
parity, and compare in-memory parity from current read to on-disk
parity.



That's what we plan to do.
And I don't see the necessity of csumming the parity.
Why csum a csum again?


parity!=csum


http://www.spinics.net/lists/linux-btrfs/msg56602.html


I know, it's more than a csum: a normal csum will only tell you that the data is 
corrupted, while parity can be used to recover the data.


But parity needs enough data stripes to recover data.
It sits somewhere between a full backup (RAID1) and a pure csum (hash).

Still, csumming the parity is not worth it, and it would mix up the whole 
block (chunk) layer and the logical layer.







There is already an offline scrub in btrfs
check which doesn't repair, but also I don't know if it checks parity.

   --check-data-csum
   verify checksums of data blocks



Just as you expected, it doesn't check parity.
Even for RAID1/DUP, it won't check the backup if it succeeded reading the
first stripe.


Both copies are not scrubbed? Oh hell...


I was replying to the "--check-data-csum" of btrfsck.

I mean, for --check-data-csum, it doesn't read the backup if the first 
data can be read out without error.


And if the first data is wrong, btrfsck will read backup, and output 
error about wrong csum, but won't return error.





[chris@f24s ~]$ sudo btrfs scrub status /brick2
scrub status for 7fea4120-9581-43cb-ab07-6631757c0b55
scrub started at Tue Sep 20 12:16:18 2016 and finished after 01:46:58
total bytes scrubbed: 955.93GiB with 0 errors

How can this possibly correctly say 956GiB scrubbed if it has not
checked both copies? That message is saying *all* the data, both
copies, were scrubbed. You're saying that message is wrong? It only
scrubbed half that amount?

[chris@f24s ~]$ sudo btrfs fi df /brick2
Data, RAID1: total=478.00GiB, used=477.11GiB
System, RAID1: total=32.00MiB, used=96.00KiB
Metadata, RAID1: total=2.00GiB, used=877.59MiB
GlobalReserve, single: total=304.00MiB, used=0.00B


When that scrub was happening, both drives were being accessed at 100%
throughput.









Current implement doesn't really care if it's the data or the copy
corrupted, any data can be read out, then there is no problem.
The same thing applies to tree blocks.

So the ability to check every stripe/copy is still quite needed for that
option.

And that's what I'm planning to enhance, make --check-data-csum to kernel
scrub equivalent.


OK thanks.






   This expects that the filesystem is otherwise OK, so this is basically an
   offline scrub but does not repair data from spare copies.



Repair can be implemented, but maybe just rewrite the same data into the
same place.
If that's a bad block, then it can't repair further more unless we can
relocate extent to other place.


Any device that's out of reserve sectors and can no longer remap LBA's
on its own, is a drive that needs to be decommissioned. It's a new
feature in just the last year or so that mdadm has a badblocks map so
it can do what the drive won't do, but I'm personally not a fan of
keeping malfunctioning drives in RAID.






Is it possible to put parities into their own tree? They'd be
checksummed there.



Personally speaking, this is quite a bad idea to me.
I prefer to separate different logical layers into their own codes.
Not mixing them together.

Block level things to block level(RAID/Chunk), logical thing to logical
level(tree blocks).


OK.



Current btrfs csum design is already much much better than pure RAID.
Just think of RAID1, while one copy is corrupted, then which copy is correct
then?


Yes.







Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Chris Murphy
On Wed, Sep 21, 2016 at 8:08 PM, Qu Wenruo  wrote:
>
>
> At 09/21/2016 11:13 PM, Chris Murphy wrote:

>> I understand some things should go in fsck for comparison. But in this
>> case I don't see how it can help. Parity is not checksummed. The only
>> way to know if it's wrong is to read all of the data strips, compute
>> parity, and compare in-memory parity from current read to on-disk
>> parity.
>
>
> That's what we plan to do.
> And I don't see the necessary to csum the parity.
> Why csum a csum again?

parity!=csum


http://www.spinics.net/lists/linux-btrfs/msg56602.html



>> There is already an offline scrub in btrfs
>> check which doesn't repair, but also I don't know if it checks parity.
>>
>>--check-data-csum
>>verify checksums of data blocks
>
>
> Just as you expected, it doesn't check parity.
> Even for RAID1/DUP, it won't check the backup if it succeeded reading the
> first stripe.

Both copies are not scrubbed? Oh hell...

[chris@f24s ~]$ sudo btrfs scrub status /brick2
scrub status for 7fea4120-9581-43cb-ab07-6631757c0b55
scrub started at Tue Sep 20 12:16:18 2016 and finished after 01:46:58
total bytes scrubbed: 955.93GiB with 0 errors

How can this possibly correctly say 956GiB scrubbed if it has not
checked both copies? That message is saying *all* the data, both
copies, were scrubbed. You're saying that message is wrong? It only
scrubbed half that amount?

[chris@f24s ~]$ sudo btrfs fi df /brick2
Data, RAID1: total=478.00GiB, used=477.11GiB
System, RAID1: total=32.00MiB, used=96.00KiB
Metadata, RAID1: total=2.00GiB, used=877.59MiB
GlobalReserve, single: total=304.00MiB, used=0.00B


When that scrub was happening, both drives were being accessed at 100%
throughput.







>
> Current implement doesn't really care if it's the data or the copy
> corrupted, any data can be read out, then there is no problem.
> The same thing applies to tree blocks.
>
> So the ability to check every stripe/copy is still quite needed for that
> option.
>
> And that's what I'm planning to enhance, make --check-data-csum to kernel
> scrub equivalent.

OK thanks.


>
>>
>>    This expects that the filesystem is otherwise OK, so this is basically an
>>    offline scrub but does not repair data from spare copies.
>
>
> Repair can be implemented, but maybe just rewrite the same data into the
> same place.
> If that's a bad block, then it can't repair further more unless we can
> relocate extent to other place.

Any device that's out of reserve sectors and can no longer remap LBA's
on its own, is a drive that needs to be decommissioned. It's a new
feature in just the last year or so that mdadm has a badblocks map so
it can do what the drive won't do, but I'm personally not a fan of
keeping malfunctioning drives in RAID.


>
>>
>> Is it possible to put parities into their own tree? They'd be
>> checksummed there.
>
>
> Personally speaking, this is quite a bad idea to me.
> I prefer to separate different logical layers into their own codes.
> Not mixing them together.
>
> Block level things to block level(RAID/Chunk), logical thing to logical
> level(tree blocks).

OK.

>
> Current btrfs csum design is already much much better than pure RAID.
> Just think of RAID1, while one copy is corrupted, then which copy is correct
> then?

Yes.


-- 
Chris Murphy


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Qu Wenruo



At 09/21/2016 11:13 PM, Chris Murphy wrote:

On Wed, Sep 21, 2016 at 3:15 AM, Qu Wenruo  wrote:



At 09/21/2016 03:35 PM, Tomasz Torcz wrote:


On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:


Hi,

For this well-known bug, is there any one fixing it?

It can't be more frustrating finding some one has already worked on it
after
spending days digging.

BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to
implement
btrfsck scrub support, at least we can use btrfsck to fix bad stripes
before
kernel fix.



  Why wouldn't you fix in-kernel code?  Why implement duplicate
functionality
when you can fix the root cause?


We'll fix in-kernel code.

Fsck one is not duplicate, we need a better standard thing to compare with
kernel behavior.

Just like qgroup fix in btrfsck, if kernel can't handle something well, we
do need to fix kernel, but a good off-line fixer won't hurt.
(Btrfs-progs is much easier to implement, and get fast review/merge cycle,
and it can help us to find better solution before screwing kernel up again)


I understand some things should go in fsck for comparison. But in this
case I don't see how it can help. Parity is not checksummed. The only
way to know if it's wrong is to read all of the data strips, compute
parity, and compare in-memory parity from current read to on-disk
parity.


That's what we plan to do.
And I don't see the necessity of csumming the parity.
Why csum a csum again?


It takes a long time, and at least scrub is online, where
btrfsck scrub is not.


At least the btrfsck scrub will work and is easier to implement, while the kernel 
scrub currently doesn't work.


The more important thing is that we can forget all about the complicated 
concurrency of online scrub and focus on the implementation itself in 
user space.

That is easier to implement and easier to maintain.


There is already an offline scrub in btrfs
check which doesn't repair, but also I don't know if it checks parity.

   --check-data-csum
   verify checksums of data blocks


Just as you expected, it doesn't check parity.
Even for RAID1/DUP, it won't check the backup if it succeeded reading 
the first stripe.


The current implementation doesn't really care whether it's the data or the copy 
that is corrupted; as long as any copy can be read out, there is no problem.

The same thing applies to tree blocks.

So the ability to check every stripe/copy is still quite needed for that 
option.


And that's what I'm planning to enhance: make --check-data-csum 
equivalent to the kernel scrub.




   This expects that the filesystem is otherwise OK, so this is basically an
   offline scrub but does not repair data from spare copies.


Repair can be implemented, but it may just rewrite the same data into the 
same place.
If that's a bad block, then it can't repair any further unless we can 
relocate the extent to another place.




Is it possible to put parities into their own tree? They'd be
checksummed there.


Personally speaking, this seems like quite a bad idea to me.
I prefer to keep the different logical layers in their own code,
not mixing them together.

Block-level things belong at the block level (RAID/chunk), logical things at the 
logical level (tree blocks).


The current btrfs csum design is already much, much better than pure RAID.
Just think of RAID1: when one copy is corrupted, which copy is the 
correct one?


Thanks,
Qu


Somehow I think the long term approach is that
partial stripe writes, which apparently are overwrites and not CoW,
need to go away. In particular I wonder what the metadata raid56 write
pattern is, if this usually means a lot of full stripe CoW writes, or
if there are many small metadata RMW changes that makes them partial
stripe writes and not CoW and thus not safe.








Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Chris Murphy
On Wed, Sep 21, 2016 at 3:15 AM, Qu Wenruo  wrote:
>
>
> At 09/21/2016 03:35 PM, Tomasz Torcz wrote:
>>
>> On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:
>>>
>>> Hi,
>>>
>>> For this well-known bug, is there any one fixing it?
>>>
>>> It can't be more frustrating finding some one has already worked on it
>>> after
>>> spending days digging.
>>>
>>> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to
>>> implement
>>> btrfsck scrub support, at least we can use btrfsck to fix bad stripes
>>> before
>>> kernel fix.
>>
>>
>>   Why wouldn't you fix in-kernel code?  Why implement duplicate
>> functionality
>> when you can fix the root cause?
>>
> We'll fix in-kernel code.
>
> Fsck one is not duplicate, we need a better standard thing to compare with
> kernel behavior.
>
> Just like qgroup fix in btrfsck, if kernel can't handle something well, we
> do need to fix kernel, but a good off-line fixer won't hurt.
> (Btrfs-progs is much easier to implement, and get fast review/merge cycle,
> and it can help us to find better solution before screwing kernel up again)

I understand some things should go in fsck for comparison. But in this
case I don't see how it can help. Parity is not checksummed. The only
way to know if it's wrong is to read all of the data strips, compute
parity, and compare in-memory parity from current read to on-disk
parity. It takes a long time, and at least scrub is online, where
btrfsck scrub is not.  There is already an offline scrub in btrfs
check which doesn't repair, but also I don't know if it checks parity.

   --check-data-csum
   verify checksums of data blocks

   This expects that the filesystem is otherwise OK, so this is basically an
   offline scrub but does not repair data from spare copies.

Is it possible to put parities into their own tree? They'd be
checksummed there. Somehow I think the long term approach is that
partial stripe writes, which apparently are overwrites and not CoW,
need to go away. In particular I wonder what the metadata raid56 write
pattern is, if this usually means a lot of full stripe CoW writes, or
if there are many small metadata RMW changes that makes them partial
stripe writes and not CoW and thus not safe.



-- 
Chris Murphy


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Chris Murphy
On Wed, Sep 21, 2016 at 1:28 AM, Qu Wenruo  wrote:
> Hi,
>
> For this well-known bug, is there any one fixing it?
>
> It can't be more frustrating finding some one has already worked on it after
> spending days digging.
>
> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to implement
> btrfsck scrub support, at least we can use btrfsck to fix bad stripes before
> kernel fix.

Well the kernel will fix it if the user just scrubs again. The problem
is the user doesn't know their file system might have bad parity. So I
don't know how implementing an optional check in btrfsck helps. If it's
non-optional that means reading 100% of the volume, not just metadata,
that's not workable for btrfsck. The user just needs to do another
scrub if they suspect they have been hit by this, and if they get no
errors they're OK. If they get an error that something is being fixed,
they might have to do a 2nd scrub to avoid this bug - but I'm not sure
if there's any different error message between a non-parity strip
being fixed compared to parity strip being replaced.
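
A sketch of that workflow (the mount point is just a placeholder), relying on the 
observation above that simply scrubbing again fixes the wrongly recomputed parity:

---
# If the first pass reports errors being fixed, run a second pass so any
# parity that pass 1 may have recomputed wrongly gets verified and repaired.
sudo btrfs scrub start -B /mnt/data
sudo btrfs scrub start -B /mnt/data
sudo btrfs scrub status /mnt/data    # ideally 0 errors after the second pass
---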

The central thing happening in this bug is it requires a degraded full
stripe [1] already exists. That is, a non-parity strip [2] is already
corrupt. What this bug does is it fixes that strip from good parity,
but then wrongly recomputes parity for some reason and writes bad
parity to disk. So it shifts the "degradedness" of the full stripe
from non-parity to parity. There's no actual additional loss of
redundancy, it's just that the messages will say a problem was found
and fixed, which is not entirely true. Non-parity data is fixed, but
now parity is wrong, silently. There is no consequence of this unless
it's raid5 and there's another strip loss in that same stripe.

Uncertain if the bug happens with raid6, or if raid6 extra redundancy
has just masked the problem. Uncertain if the bug happens with
balance, or passively with normal reads. Only scrub has been tested
and it's non-deterministic, maybe happens 1 in 3 or 4 attempts.


[1][2]
I'm using SNIA terms. Strip = stripe element = mdadm chunk = the 64KiB
per device block. Stripe = full stripe = data strips + parity strip
(or 2 strips for raid6).


-- 
Chris Murphy


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Qu Wenruo



At 09/21/2016 03:35 PM, Tomasz Torcz wrote:

On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:

Hi,

For this well-known bug, is there any one fixing it?

It can't be more frustrating finding some one has already worked on it after
spending days digging.

BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to implement
btrfsck scrub support, at least we can use btrfsck to fix bad stripes before
kernel fix.


  Why wouldn't you fix in-kernel code?  Why implement duplicate functionality
when you can fix the root cause?


We'll fix the in-kernel code.

The fsck one is not a duplicate; we need a reliable reference to compare 
the kernel behavior against.


Just like the qgroup fix in btrfsck: if the kernel can't handle something well, 
we do need to fix the kernel, but a good off-line fixer won't hurt.
(Btrfs-progs is much easier to implement in, with a fast review/merge 
cycle, and it can help us find a better solution before screwing the kernel 
up again)


Thanks,
Qu




Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Tomasz Torcz
On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:
> Hi,
> 
> For this well-known bug, is there any one fixing it?
> 
> It can't be more frustrating finding some one has already worked on it after
> spending days digging.
> 
> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to implement
> btrfsck scrub support, at least we can use btrfsck to fix bad stripes before
> kernel fix.

  Why wouldn't you fix in-kernel code?  Why implement duplicate functionality
when you can fix the root cause?

-- 
Tomasz   .. oo o.   oo o. .o   .o o. o. oo o.   ..
Torcz.. .o .o   .o .o oo   oo .o .. .. oo   oo
o.o.o.   .o .. o.   o. o. o.   o. o. oo .. ..   o.



Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-09-21 Thread Qu Wenruo

Hi,

For this well-known bug, is there anyone fixing it?

Nothing would be more frustrating than finding out that someone has already worked 
on it after spending days digging.


BTW, since the kernel scrub is somewhat scrap for raid5/6, I'd like to 
implement btrfsck scrub support, so that at least we can use btrfsck to fix bad 
stripes before the kernel fix.


Thanks,
Qu

At 06/25/2016 08:21 PM, Goffredo Baroncelli wrote:

Hi all,

following the thread "Adventures in btrfs raid5 disk recovery", I investigated 
a bit the BTRFS capability to scrub a corrupted raid5 filesystem. To test it, I first 
found where a file was stored, and then I tried to corrupt the data disks (when unmounted) 
or the parity disk.
The result showed that sometimes the kernel recomputed the parity wrongly.

I tested the following kernel
- 4.6.1
- 4.5.4
and both showed the same behavior.

The test was performed as described below:

1) create a filesystem in raid5 mode (for data and metadata) of 1500MB

truncate -s 500M disk1.img; losetup -f disk1.img
truncate -s 500M disk2.img; losetup -f disk2.img
truncate -s 500M disk3.img; losetup -f disk3.img
sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
sudo mount /dev/loop0 mnt/

2) I created a file with a length of 128kb:

python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
sudo umount mnt/

3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was able to 
find where the file stripe is located:

/dev/loop0: offset=81788928+16*4096(64k, second half of the file: 
'bd.)
/dev/loop1: offset=61865984+16*4096(64k, first half of the file: 
'ad.)
/dev/loop2: offset=61865984+16*4096(64k, parity: 
'\x03\x00\x03\x03\x03.)

4) I tried to corrupt each disk (one disk per test), and then run a scrub:

for example for the disk /dev/loop2:
sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
seek=$((61865984+16*4096)) count=5
sudo mount /dev/loop0 mnt
sudo btrfs scrub start mnt/.

5) I check the disks at the offsets above, to verify that the data/parity is 
correct
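
A minimal sketch of that check, reusing the offsets computed in step 3 (they are 
specific to this particular filesystem instance):

---
# Dump 16 bytes from each device at the known offsets to see what is
# really on the platters after the scrub.
sudo umount mnt/
sudo dd if=/dev/loop1 bs=1 skip=$((61865984+16*4096)) count=16 2>/dev/null | od -A d -t x1   # data1 ('ad'...)
sudo dd if=/dev/loop0 bs=1 skip=$((81788928+16*4096)) count=16 2>/dev/null | od -A d -t x1   # data2 ('bd'...)
sudo dd if=/dev/loop2 bs=1 skip=$((61865984+16*4096)) count=16 2>/dev/null | od -A d -t x1   # parity = data1 XOR data2
---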

However I found that:
1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any corruption, 
but recomputes the parity (always correctly);

2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find the 
corruption. But I found two main behaviors:

2.a) the kernel repaired the damage, but computed the wrong parity: in place of 
the parity, the kernel copied the data of the second disk onto the parity disk

2.b) the kernel repaired the damage, and rebuild a correct parity

I have to point out another strange thing: in dmesg I found two kinds of 
messages:

msg1)
[]
[ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
[ 1021.366949] BTRFS: has skinny extents
[ 1021.399208] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, 
length 4096, links 1 (path: out.txt)
[ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 1, gen 0
[ 1021.399291] BTRFS error (device loop2): fixed up error at logical 
142802944 on dev /dev/loop0

msg2)
[ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
[ 1017.435074] BTRFS: has skinny extents
[ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 1, gen 0
[ 1017.463403] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872,  root 5, inode 257, offset 
65536, length 4096, links 1 (path: out.txt)
[ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 2, gen 0
[ 1017.463467] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, 
length 4096, links 1 (path: out.txt)
[ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 3, gen 0
[ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) 
error at logical 142802944 on dev /dev/loop0
[ 1017.463535] BTRFS error (device loop2): fixed up error at logical 
142802944 on dev /dev/loop0


but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)

Another strangeness is that SCRUB sometime reports
 ERROR: there are uncorrectable errors
and sometime reports
 WARNING: errors detected during scrubbing, corrected

but also these seems UNrelated to the behavior 2.a) or 2.b) or msg1 or msg2


Enclosed you can find the script which I used to trigger the bug. I have to rerun it 
several times to show the problem because it doesn't happen every time. Pay attention 
that the offset and the loop device name are hard coded. You must run the script in the 
same directory where it is: eg "bash test.sh".

Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-08-19 Thread Philip Espunkt
> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>>
>> most of the time, it seems that btrfs-raid5 is not capable to
>> rebuild parity and data. Worse the message returned by scrub is
>> incoherent by the status on the disk. The tests didn't fail every
>> time; this complicate the diagnosis. However my script fails most
>> of the time.

Have you opened a bug ticket at http://bugzilla.kernel.org/?
I think that would be useful for tracking.


Philip


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-07-18 Thread Goffredo Baroncelli
Hi

On 2016-07-16 17:51, Jarkko Lavinen wrote:
> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>> Using "btrfs insp phy" I developed a script to trigger the bug.
> 
> Thank you for the script and all for sharing the raid5 and scrubbing
> issues. I have been using two raid5 arrays and ran scrub occasionally
> without any problems lately and been in false confidence. I converted
> successfully raid5 arrays into raid10 without any glitch.
> 
> I tried to modify the shell script so that instead of corrupting data
> with dd, a simulated bad block is created with device mapper. Modern
> disks are likely to either return the correct data or an error if
> they cannot.


You are right; but doing so we complicate the test case further:
- my tests show what happens when there is a corruption, but the drive behaves 
well
- your tests show what happens when there is a corruption AND the drive has a 
failure

I agree that your simulation is more realistic, but I fear that doing so we make 
the bug hunting more complicated.

> 
> The modified script behaves very much like the original dd version.
> With dd version I see wrong data instead of expected data. 

When you say "I see wrong data", do you mean with 
1) "cat mnt/out.txt" 
or 2) with "dd if=/dev/loop." ?

In the first case I always see good data; in the second case I see wrong data 
but of course no read error.

> With simulated bad block I see no data at all instead of expected data
> since dd quits on read error.
> 
> Jarkko Lavinen
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-07-17 Thread Jarkko Lavinen
On Sat, Jul 16, 2016 at 06:51:11PM +0300, Jarkko Lavinen wrote:
>  The modified script behaves very much like the original dd version.

Not quite. The bad sector simulation works like old hard drives without error 
correction and bad block remapping. This changes the error behaviour.

My script now prints the kernel messages once check_fs fails. The time range of the 
messages is from the adding of the bad sector device to the point when 
check_fs fails.

The parity test, which often passes with Goffredo's script, always fails 
with my bad sector version, and scrub says the error is uncorrectable. In the 
kernel messages there are two buffer I/O read errors but no write error, as if 
scrub quits before writing?

In the data2 test scrub again says the error is uncorrectable but according to 
the kernel messages the bad sector is read 4 times and written twice during the 
scrub. In my bad sector script the data2 is still corrupted and parity ok since 
the bad sector cannot be written and scrub likely quits earlier than in 
Goffredo's script. In his script the data2 gets fixed but the parity gets 
corrupted.

Jarkko Lavinen

$ bash h2.sh
--- test 1: corrupt parity
scrub started on mnt/., fsid 2625e2d0-420c-40b6-befa-97fc18eaed48 (pid=32490)
ERROR: there are uncorrectable errors
*** Wrong data on disk:off /dev/mapper/loop0:61931520 (parity)
Data read ||, expected |0300 0303|

Kernel messages in the test
First Check_fs started
Buffer I/O error on dev dm-0, logical block 15120, async page read
Scrub started
Second Check_fs started
Buffer I/O error on dev dm-0, logical block 15120, async page read

--- test 2: corrupt data2
scrub started on mnt/., fsid 8e506268-16c7-48fa-b176-0a8877f2a7aa (pid=434)
ERROR: there are uncorrectable errors
*** Wrong data on disk:off /dev/mapper/loop2:81854464 (data2)
Data read ||, expected |bdbbb|

Kernel messages in the test
First Check_fs started
Buffer I/O error on dev dm-2, logical block 19984, async page read
Scrub started
BTRFS warning (device dm-0): i/o error at logical 142802944 on dev 
/dev/mapper/loop2, sector 159872, root 5, inode 257, offset 65536, length 4096, 
links 1 (path: out.txt)
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 0, rd 1, flush 0, 
corrupt 0, gen 0
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 1, rd 1, flush 0, 
corrupt 0, gen 0
BTRFS warning (device dm-0): i/o error at logical 142802944 on dev 
/dev/mapper/loop2, sector 159872, root 5, inode 257, offset 65536, length 4096, 
links 1 (path: out.txt)
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 1, rd 2, flush 0, 
corrupt 0, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 142802944 
on dev /dev/mapper/loop2
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 2, flush 0, 
corrupt 0, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 142802944 
on dev /dev/mapper/loop2
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 3, flush 0, 
corrupt 0, gen 0
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 4, flush 0, 
corrupt 0, gen 0
Second Check_fs started
BTRFS info (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 4, flush 0, 
corrupt 0, gen 0
Buffer I/O error on dev dm-2, logical block 19984, async page read

--- test 3: corrupt data1
scrub started on mnt/., fsid f8a4ecca-2475-4e5e-9651-65d9478b56fe (pid=856)
ERROR: there are uncorrectable errors
*** Wrong data on disk:off /dev/mapper/loop1:61931520 (data1)
Data read ||, expected |adaaa|

Kernel messages in the test
First Check_fs started
Buffer I/O error on dev dm-1, logical block 15120, async page read
Scrub started
BTRFS warning (device dm-0): i/o error at logical 142737408 on dev 
/dev/mapper/loop1, sector 120960, root 5, inode 257, offset 0, length 4096, 
links 1 (path: out.txt)
BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 0, rd 1, flush 0, 
corrupt 0, gen 0
BTRFS warning (device dm-0): i/o error at logical 142737408 on dev 
/dev/mapper/loop1, sector 120960, root 5, inode 257, offset 0, length 4096, 
links 1 (path: out.txt)
BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 0, rd 2, flush 0, 
corrupt 0, gen 0
BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 2, flush 0, 
corrupt 0, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 142737408 
on dev /dev/mapper/loop1
BTRFS error (device dm-0): unable to fixup (regular) error at logical 142737408 
on dev /dev/mapper/loop1
BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 3, flush 0, 
corrupt 0, gen 0
Second Check_fs started
BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 4, flush 0, 
corrupt 0, gen 0
BTRFS info (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 4, flush 0, 
corrupt 0, gen 0
Buffer I/O error on dev dm-1, logical block 15120, async page read

--- test 4: corrupt data2; read without scrub
*** Wrong data on disk:off /dev/mapper/loop2:81854464 (d

Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-07-16 Thread Jarkko Lavinen
On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
> Using "btrfs insp phy" I developed a script to trigger the bug.

Thank you for the script, and thanks to all for sharing the raid5 and scrubbing issues. I 
have been using two raid5 arrays and ran scrub occasionally without any 
problems lately, which gave me a false sense of confidence. I successfully converted the 
raid5 arrays into raid10 without any glitch.

I tried to modify the shell script so that instead of corrupting data with dd, 
a simulated bad block is created with device mapper. Modern disks are likely to 
either return the correct data or an error if they cannot.
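
One way to do that is device mapper's "error" target. A sketch, assuming a 500 MB 
loop image (1024000 512-byte sectors) and the 4 KiB page at byte offset 61931520 
(sector 120960) used in the earlier tests; each table line is "start length target args", 
in sectors:

---
# Build a dm device identical to /dev/loop1 except that one 4 KiB page
# (sectors 120960-120967) always returns I/O errors instead of data.
sudo dmsetup create badloop1 <<'EOF'
0 120960 linear /dev/loop1 0
120960 8 error
120968 903032 linear /dev/loop1 120968
EOF
# Then mount/scrub the filesystem using /dev/mapper/badloop1 in place of
# /dev/loop1 and rerun the same tests against it.
---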

The modified script behaves very much like the original dd version. With dd 
version I see wrong data instead of expected data. With simulated bad block I 
see no data at all instead of expected data since dd quits on read error.

Jarkko Lavinen


h.sh
Description: Bourne shell script


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two

2016-07-15 Thread Andrei Borzenkov
15.07.2016 19:29, Chris Mason wrote:
>
>> However I have to point out that this kind of test is very
>> difficult to do: the file-cache could lead to read an old data, so please
>> suggestion about how flush the cache are good (I do some sync,
>> unmount the filesystem and perform "echo 3 >/proc/sys/vm/drop_caches",
>> but sometime it seems not enough).
> 
> O_DIRECT should handle the cache flushing for you.
> 

There is also BLKFLSBUF ioctl (blockdev --flushbufs on shell level).
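
A sketch combining the cache-dropping steps mentioned in the thread before re-reading 
the raw loop devices:

---
# Flush and drop caches so stale cached pages cannot mask what is on disk.
sync
sudo umount mnt/
for dev in /dev/loop0 /dev/loop1 /dev/loop2; do
    sudo blockdev --flushbufs "$dev"    # BLKFLSBUF ioctl, per device
done
echo 3 | sudo tee /proc/sys/vm/drop_caches
---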


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two

2016-07-15 Thread Goffredo Baroncelli
On 2016-07-15 06:39, Andrei Borzenkov wrote:
> 15.07.2016 00:20, Chris Mason wrote:
>>
>>
>> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>>> Hi All,
>>>
>>> I developed a new btrfs command "btrfs insp phy"[1] to further
>>> investigate this bug [2]. Using "btrfs insp phy" I developed a script
>>> to trigger the bug. The bug is not always triggered, but most of time
>>> yes.
>>>
>>> Basically the script create a raid5 filesystem (using three
>>> loop-device on three file called disk[123].img); on this filesystem 
> 
> Are those devices themselves on btrfs? Just to avoid any sort of
> possible side effects?

Good question. However the files are stored on a ext4 filesystem (but I don't 
know if this is better or worse)

> 
>>> it is create a file. Then using "btrfs insp phy", the physical
>>> placement of the data on the device are computed.
>>>
>>> First the script checks that the data are the right one (for data1,
>>> data2 and parity), then it corrupt the data:
>>>
>>> test1: the parity is corrupted, then scrub is ran. Then the (data1,
>>> data2, parity) data on the disk are checked. This test goes fine all
>>> the times
>>>
>>> test2: data2 is corrupted, then scrub is ran. Then the (data1, data2,
>>> parity) data on the disk are checked. This test fail most of the time:
>>> the data on the disk is not correct; the parity is wrong. Scrub
>>> sometime reports "WARNING: errors detected during scrubbing,
>>> corrected" and sometime reports "ERROR: there are uncorrectable
>>> errors". But this seems unrelated to the fact that the data is
>>> corrupetd or not
>>> test3: like test2, but data1 is corrupted. The result are the same as
>>> above.
>>>
>>>
>>> test4: data2 is corrupted, the the file is read. The system doesn't
>>> return error (the data seems to be fine); but the data2 on the disk is
>>> still corrupted.
>>>
>>>
>>> Note: data1, data2, parity are the disk-element of the raid5 stripe-
>>>
>>> Conclusion:
>>>
>>> most of the time, it seems that btrfs-raid5 is not capable to rebuild
>>> parity and data. Worse the message returned by scrub is incoherent by
>>> the status on the disk. The tests didn't fail every time; this
>>> complicate the diagnosis. However my script fails most of the time.
>>
>> Interesting, thanks for taking the time to write this up.  Is the
>> failure specific to scrub?  Or is parity rebuild in general also failing
>> in this case?
>>
> 
> How do you rebuild parity without scrub as long as all devices appear to
> be present?

I corrupted the data, then I read the file. The data can only be correct on 
the basis of the parity rebuild. Even in this case I found problems.

> 
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two

2016-07-15 Thread Chris Mason



On 07/15/2016 12:28 PM, Goffredo Baroncelli wrote:

On 2016-07-14 23:20, Chris Mason wrote:



On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:

Hi All,

I developed a new btrfs command "btrfs insp phy"[1] to further
investigate this bug [2]. Using "btrfs insp phy" I developed a
script to trigger the bug. The bug is not always triggered, but
most of time yes.

Basically the script create a raid5 filesystem (using three
loop-device on three file called disk[123].img); on this filesystem
it is create a file. Then using "btrfs insp phy", the physical
placement of the data on the device are computed.

First the script checks that the data are the right one (for data1,
data2 and parity), then it corrupt the data:

test1: the parity is corrupted, then scrub is ran. Then the (data1,
data2, parity) data on the disk are checked. This test goes fine
all the times

test2: data2 is corrupted, then scrub is ran. Then the (data1,
data2, parity) data on the disk are checked. This test fail most of
the time: the data on the disk is not correct; the parity is wrong.
Scrub sometime reports "WARNING: errors detected during scrubbing,
corrected" and sometime reports "ERROR: there are uncorrectable
errors". But this seems unrelated to the fact that the data is
corrupetd or not test3: like test2, but data1 is corrupted. The
result are the same as above.


test4: data2 is corrupted, the the file is read. The system doesn't
return error (the data seems to be fine); but the data2 on the disk
is still corrupted.


Note: data1, data2, parity are the disk-element of the raid5
stripe-

Conclusion:

most of the time, it seems that btrfs-raid5 is not capable to
rebuild parity and data. Worse the message returned by scrub is
incoherent by the status on the disk. The tests didn't fail every
time; this complicate the diagnosis. However my script fails most
of the time.


Interesting, thanks for taking the time to write this up.  Is the
failure specific to scrub?  Or is parity rebuild in general also
failing in this case?


Test #4 handles this case: I corrupt the data, and when I read
it the data is good. So parity is used but the data on the platter
are still bad.

However I have to point out that this kind of test is very
difficult to do: the file-cache could lead to read an old data, so please
suggestion about how flush the cache are good (I do some sync,
unmount the filesystem and perform "echo 3 >/proc/sys/vm/drop_caches",
but sometime it seems not enough).


O_DIRECT should handle the cache flushing for you.
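
For example (a sketch, using the test file from the script), dd can read the file with 
O_DIRECT so the data really comes from the devices rather than from cached pages:

---
# O_DIRECT read of the test file: bypasses the page cache, so the RAID5
# read/repair path is exercised against what is actually on the platters.
sudo dd if=mnt/out.txt of=/dev/null bs=4096 iflag=direct
---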

-chris



Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two

2016-07-15 Thread Goffredo Baroncelli
On 2016-07-14 23:20, Chris Mason wrote:
> 
> 
> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>> Hi All,
>> 
>> I developed a new btrfs command "btrfs insp phy"[1] to further
>> investigate this bug [2]. Using "btrfs insp phy" I developed a
>> script to trigger the bug. The bug is not always triggered, but
>> most of time yes.
>> 
>> Basically the script create a raid5 filesystem (using three
>> loop-device on three file called disk[123].img); on this filesystem
>> it is create a file. Then using "btrfs insp phy", the physical
>> placement of the data on the device are computed.
>> 
>> First the script checks that the data are the right one (for data1,
>> data2 and parity), then it corrupt the data:
>> 
>> test1: the parity is corrupted, then scrub is ran. Then the (data1,
>> data2, parity) data on the disk are checked. This test goes fine
>> all the times
>> 
>> test2: data2 is corrupted, then scrub is ran. Then the (data1,
>> data2, parity) data on the disk are checked. This test fail most of
>> the time: the data on the disk is not correct; the parity is wrong.
>> Scrub sometime reports "WARNING: errors detected during scrubbing,
>> corrected" and sometime reports "ERROR: there are uncorrectable
>> errors". But this seems unrelated to the fact that the data is
>> corrupetd or not test3: like test2, but data1 is corrupted. The
>> result are the same as above.
>> 
>> 
>> test4: data2 is corrupted, the the file is read. The system doesn't
>> return error (the data seems to be fine); but the data2 on the disk
>> is still corrupted.
>> 
>> 
>> Note: data1, data2, parity are the disk-element of the raid5
>> stripe-
>> 
>> Conclusion:
>> 
>> most of the time, it seems that btrfs-raid5 is not capable to
>> rebuild parity and data. Worse the message returned by scrub is
>> incoherent by the status on the disk. The tests didn't fail every
>> time; this complicate the diagnosis. However my script fails most
>> of the time.
> 
> Interesting, thanks for taking the time to write this up.  Is the
> failure specific to scrub?  Or is parity rebuild in general also
> failing in this case?

Test #4 handles this case: I corrupt the data, and when I read
it the data is good. So the parity is used, but the data on the platter
is still bad.

However I have to point out that this kind of test is very
difficult to do: the file cache could lead to reading old data, so 
suggestions about how to flush the cache are welcome (I do some sync, 
unmount the filesystem and perform "echo 3 >/proc/sys/vm/drop_caches", 
but sometimes it seems not enough).



> 
> -chris
> 

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two

2016-07-15 Thread Chris Mason



On 07/15/2016 11:10 AM, Andrei Borzenkov wrote:

On 15.07.2016 16:20, Chris Mason wrote:


Interesting, thanks for taking the time to write this up.  Is the
failure specific to scrub?  Or is parity rebuild in general also failing
in this case?



How do you rebuild parity without scrub as long as all devices appear to
be present?


If one block is corrupted, the crcs will fail and the kernel will
rebuild parity when you read the file.  You can also use balance instead
of scrub.



As we have seen recently, btrfs does not compute, store or verify
checksums of RAID56 parity. So if parity is corrupted, the only way to
detect and correct it is to use scrub. Balance may work as a side effect,
because it simply recomputes parity for the data it rewrites, but it will
not fix wrong parity on existing data.


Ah, I misread your question.  Yes, this is definitely where scrub is the
best tool.  But even if we have to add debugging to force parity
recomputation, we should see if the problem is only in scrub or goes deeper.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two

2016-07-15 Thread Andrei Borzenkov
On 15.07.2016 16:20, Chris Mason wrote:
>>>
>>> Interesting, thanks for taking the time to write this up.  Is the
>>> failure specific to scrub?  Or is parity rebuild in general also failing
>>> in this case?
>>>
>>
>> How do you rebuild parity without scrub as long as all devices appear to
>> be present?
> 
> If one block is corrupted, the crcs will fail and the kernel will
> rebuild parity when you read the file.  You can also use balance instead
> of scrub.
> 

As we have seen recently, btrfs does not compute, store or verify
checksums of RAID56 parity. So if parity is corrupted, the only way to
detect and correct it is to use scrub. Balance may work as a side effect,
because it simply recomputes parity for the data it rewrites, but it will
not fix wrong parity on existing data.

I agree that if a data block is corrupted it will be detected, but then
you do not need to recompute parity in the first place.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two

2016-07-15 Thread Chris Mason



On 07/15/2016 12:39 AM, Andrei Borzenkov wrote:

On 15.07.2016 00:20, Chris Mason wrote:



On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:

Hi All,

I developed a new btrfs command "btrfs insp phy"[1] to further
investigate this bug [2]. Using "btrfs insp phy" I developed a script
to trigger the bug. The bug is not always triggered, but most of the
time it is.

Basically the script creates a raid5 filesystem (using three
loop devices on three files called disk[123].img); on this filesystem


Are those devices themselves on btrfs? Just to avoid any sort of
possible side effects?


a file is created. Then, using "btrfs insp phy", the physical
placement of the data on the devices is computed.

First the script checks that the data are correct (for data1,
data2 and parity), then it corrupts the data:

test1: the parity is corrupted, then scrub is run. Then the (data1,
data2, parity) data on the disk are checked. This test passes every
time.

test2: data2 is corrupted, then scrub is run. Then the (data1, data2,
parity) data on the disk are checked. This test fails most of the time:
the data on the disk is not correct; the parity is wrong. Scrub
sometimes reports "WARNING: errors detected during scrubbing,
corrected" and sometimes reports "ERROR: there are uncorrectable
errors". But this seems unrelated to whether the data is
corrupted or not.
test3: like test2, but data1 is corrupted. The results are the same as
above.


test4: data2 is corrupted, then the file is read. The system doesn't
return an error (the data seems to be fine); but the data2 on the disk
is still corrupted.


Note: data1, data2 and parity are the disk elements of the raid5 stripe.

Conclusion:

most of the time, it seems that btrfs raid5 is not capable of rebuilding
parity and data. Worse, the message returned by scrub is inconsistent
with the status on the disk. The tests didn't fail every time; this
complicates the diagnosis. However my script fails most of the time.


Interesting, thanks for taking the time to write this up.  Is the
failure specific to scrub?  Or is parity rebuild in general also failing
in this case?



How do you rebuild parity without scrub as long as all devices appear to
be present?


If one block is corrupted, the crcs will fail and the kernel will 
rebuild parity when you read the file.  You can also use balance instead 
of scrub.
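
As a rough sketch of those two paths, using the mount point and file from
the test script elsewhere in this thread (illustrative only):

sudo mount /dev/loop0 mnt
# 1) read-triggered repair: a csum failure on a data block makes the kernel
#    reconstruct that block from the remaining strips while servicing the read
cat mnt/out.txt > /dev/null
# 2) balance: relocates the block groups, rewriting the data it moves and
#    recomputing parity for the rewritten stripes
sudo btrfs balance start mnt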


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two

2016-07-14 Thread Andrei Borzenkov
On 15.07.2016 00:20, Chris Mason wrote:
> 
> 
> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>> Hi All,
>>
>> I developed a new btrfs command "btrfs insp phy"[1] to further
>> investigate this bug [2]. Using "btrfs insp phy" I developed a script
>> to trigger the bug. The bug is not always triggered, but most of the
>> time it is.
>>
>> Basically the script creates a raid5 filesystem (using three
>> loop devices on three files called disk[123].img); on this filesystem

Are those devices themselves on btrfs? Just to avoid any sort of
possible side effects?

>> a file is created. Then, using "btrfs insp phy", the physical
>> placement of the data on the devices is computed.
>>
>> First the script checks that the data are correct (for data1,
>> data2 and parity), then it corrupts the data:
>>
>> test1: the parity is corrupted, then scrub is run. Then the (data1,
>> data2, parity) data on the disk are checked. This test passes every
>> time.
>>
>> test2: data2 is corrupted, then scrub is run. Then the (data1, data2,
>> parity) data on the disk are checked. This test fails most of the time:
>> the data on the disk is not correct; the parity is wrong. Scrub
>> sometimes reports "WARNING: errors detected during scrubbing,
>> corrected" and sometimes reports "ERROR: there are uncorrectable
>> errors". But this seems unrelated to whether the data is
>> corrupted or not.
>> test3: like test2, but data1 is corrupted. The results are the same as
>> above.
>>
>>
>> test4: data2 is corrupted, then the file is read. The system doesn't
>> return an error (the data seems to be fine); but the data2 on the disk
>> is still corrupted.
>>
>>
>> Note: data1, data2 and parity are the disk elements of the raid5 stripe.
>>
>> Conclusion:
>>
>> most of the time, it seems that btrfs raid5 is not capable of rebuilding
>> parity and data. Worse, the message returned by scrub is inconsistent
>> with the status on the disk. The tests didn't fail every time; this
>> complicates the diagnosis. However my script fails most of the time.
> 
> Interesting, thanks for taking the time to write this up.  Is the
> failure specific to scrub?  Or is parity rebuild in general also failing
> in this case?
> 

How do you rebuild parity without scrub as long as all devices appear to
be present?


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two

2016-07-14 Thread Chris Mason



On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:

Hi All,

I developed a new btrfs command "btrfs insp phy"[1] to further investigate this bug [2]. 
Using "btrfs insp phy" I developed a script to trigger the bug. The bug is not always 
triggered, but most of the time it is.

Basically the script creates a raid5 filesystem (using three loop devices on three files
called disk[123].img); on this filesystem a file is created. Then, using "btrfs
insp phy", the physical placement of the data on the devices is computed.

First the script checks that the data are correct (for data1, data2 and
parity), then it corrupts the data:

test1: the parity is corrupted, then scrub is run. Then the (data1, data2,
parity) data on the disk are checked. This test passes every time.

test2: data2 is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are
checked. This test fails most of the time: the data on the disk is not correct; the parity is wrong.
Scrub sometimes reports "WARNING: errors detected during scrubbing, corrected" and
sometimes reports "ERROR: there are uncorrectable errors". But this seems unrelated to
whether the data is corrupted or not.
test3: like test2, but data1 is corrupted. The results are the same as above.


test4: data2 is corrupted, then the file is read. The system doesn't return an
error (the data seems to be fine); but the data2 on the disk is still corrupted.


Note: data1, data2 and parity are the disk elements of the raid5 stripe.

Conclusion:

most of the time, it seems that btrfs raid5 is not capable of rebuilding parity
and data. Worse, the message returned by scrub is inconsistent with the status on
the disk. The tests didn't fail every time; this complicates the diagnosis.
However my script fails most of the time.


Interesting, thanks for taking the time to write this up.  Is the 
failure specific to scrub?  Or is parity rebuild in general also failing 
in this case?


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two

2016-07-12 Thread Goffredo Baroncelli
rruption failed
exit 100
}
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
sudo mount $loop1 mnt
sudo btrfs scrub start mnt/.
sync; sync
cat mnt/out.txt &>/dev/null || echo "Read FAIL"
sudo umount mnt
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
check_fs || return 1
echo "--- test1: OK"
return 0
}



test_corrupt_data2() {
echo "--- test 2: corrupt data2"
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
sudo dd 2>/dev/null if=/dev/zero of=$data2_dev bs=1 \
seek=$data2_off count=5
check_fs &>/dev/null && {
echo Corruption failed
exit 100
}
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
sudo mount $loop1 mnt
sudo btrfs scrub start mnt/.
sync; sync
cat mnt/out.txt &>/dev/null || echo "Read FAIL"
sudo umount mnt
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
check_fs || return 1
echo "--- test2: OK"
return 0
}

test_corrupt_data1() {
echo "--- test 3: corrupt data1"
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
sudo dd 2>/dev/null if=/dev/zero of=$data1_dev bs=1 \
seek=$data1_off count=5
check_fs &>/dev/null && {
echo Corruption failed
exit 100
}
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
sudo mount $loop1 mnt
sudo btrfs scrub start mnt/.
sync; sync
cat mnt/out.txt &>/dev/null || echo "Read FAIL"
sudo umount mnt
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
check_fs || return 1
echo "--- test3: OK"
return 0
}

test_corrupt_data2_wo_scrub() {
echo "--- test 4: corrupt data2; read without scrub"
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
sudo dd 2>/dev/null if=/dev/zero of=$data2_dev bs=1 \
seek=$data2_off count=5
check_fs &>/dev/null && {
echo Corruption failed
exit 100
}
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
sudo mount $loop1 mnt
cat mnt/out.txt &>/dev/null || echo "Read FAIL"
sudo umount mnt
echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
check_fs || return 1
echo "--- test 4: OK"
return 0
}


for t in test_corrupt_parity test_corrupt_data2 test_corrupt_data1 \
test_corrupt_data2_wo_scrub; do

init_fs &>/dev/null
if ! check_fs &>/dev/null; then 
 echo Integrity test failed
 exit 100
fi

$t
echo

done


-




[1] See email "New btrfs sub command: btrfs inspect physical-find"
[2] See email "[BUG] Btrfs scrub sometime recalculate wrong parity in raid5"



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-27 Thread Duncan
Steven Haigh posted on Mon, 27 Jun 2016 13:21:00 +1000 as excerpted:

> I'd also recommend updates to the ArchLinux wiki - as for some reason I
> always seem to end up there when searching for a certain topic...

Not really btrfs related, but for people using popular search engines, at 
least, this is likely for two reasons:

1) Arch is apparently the most popular distro among reasonably 
technically literate users -- those who will both tend to have good 
technical knowledge and advice on the various real-life issues Linux 
users tend to encounter, and are likely to post it to an easily publicly 
indexable forum.  (And in that regard, wikis are likely to be more 
indexable than (web archives of) mailing lists like this, because that's 
(part of) what wikis /do/ by design, make topics keyword searchable.  
Lists, not so much.)

2) Specifically to the point of being _publicly_ indexable, Arch may have 
a more liberal robots.txt that allows indexers more access, than other 
distros, some of which may limit robot access for performance reasons.


With this combination, arch's wiki is a natural place for searches to 
point.

So agreed, a high priority on getting the raid56 warning up there on the 
arch wiki is a good idea, indeed.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-27 Thread Christoph Anton Mitterer
On Mon, 2016-06-27 at 07:35 +0300, Andrei Borzenkov wrote:
> The problem is that current implementation of RAID56 puts exactly CoW
> data at risk. I.e. writing new (copy of) data may suddenly make old
> (copy of) data inaccessible, even though it had been safely committed
> to
> disk and is now in read-only snapshot.

Sure,... mine was just a general thing to be added.
No checksums => no way to tell which block is valid in case of silent
block errors => no way to recover unless by chance

=> should be included as a warning, especially as userland software
starts to automatically set nodatacow (IIRC systemd does so), thereby
silently breaking functionality (integrity+recoverability) assumed by
the user.


Cheers,
Chris-

smime.p7s
Description: S/MIME cryptographic signature


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Andrei Borzenkov
On 27.06.2016 06:50, Christoph Anton Mitterer wrote:
> 
> What should IMHO be done as well is giving a big fat warning in the
> manpages/etc. that when nodatacow is used RAID recovery cannot produce
> valid data (at least as long as there isn't checksumming implemented
> for nodatacow).

The problem is that current implementation of RAID56 puts exactly CoW
data at risk. I.e. writing new (copy of) data may suddenly make old
(copy of) data inaccessible, even though it had been safely committed to
disk and is now in read-only snapshot.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Christoph Anton Mitterer
On Sun, 2016-06-26 at 15:33 -0700, ronnie sahlberg wrote:
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings
> that they really should not use raid56 right now for new filesystems.
I doubt most end users can be assumed to read the wiki...


> 2, Instead of a --force flag. (Users tend to ignore --force and
> warnings in documentation.)
> Instead ifdef out the options to create raid56 in mkfs.btrfs.
> Developers who want to test can just remove the ifdef and recompile
> the tools anyway.
> But if end-users have to recompile userspace, that really forces the
> point that "you
> really should not use this right now".

Well, if one uses --force or --yes-i-know-what-i-do and actually
doesn't know what one is doing, then such a person is on their own.
People can always shoot themselves if they want to.

Actually I think that the compile-time way is inferior here.
Distros may just always enable raid56 there to allow people to continue
mounting their existing filesystems.



What should IMHO be done as well is giving a big fat warning in the
manpages/etc. that when nodatacow is used RAID recovery cannot produce
valid data (at least as long as there isn't checksumming implemented
for nodatacow).
Probably it should also be documented what btrfs does in such a
situation. E.g. does it just randomly pick a readable block from one of
the copies? Simply error out and consider the file broken? Fill the
blocks in question with zero?

Cheers,
Chris.

smime.p7s
Description: S/MIME cryptographic signature


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Steven Haigh

On 2016-06-27 08:38, Hugo Mills wrote:

On Sun, Jun 26, 2016 at 03:33:08PM -0700, ronnie sahlberg wrote:

On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Could this explain why people have been reporting so many raid56 mode
> cases of btrfs replacing a first drive appearing to succeed just fine,
> but then they go to btrfs replace a second drive, and the array crashes
> as if the first replace didn't work correctly after all, resulting in two
> bad devices once the second replace gets under way, of course bringing
> down the array?
>
> If so, then it looks like we have our answer as to what has been going
> wrong that has been so hard to properly trace and thus to bugfix.
>
> Combine that with the raid4 dedicated parity device behavior you're
> seeing if the writes are all exactly 128 MB, with that possibly
> explaining the super-slow replaces, and this thread may have just given
> us answers to both of those until-now-untraceable issues.
>
> Regardless, what's /very/ clear by now is that raid56 mode as it
> currently exists is more or less fatally flawed, and a full scrap and
> rewrite to an entirely different raid56 mode on-disk format may be
> necessary to fix it.
>
> And what's even clearer is that people /really/ shouldn't be using raid56
> mode for anything but testing with throw-away data, at this point.
> Anything else is simply irresponsible.
>
> Does that mean we need to put a "raid56 mode may eat your babies" level
> warning in the manpage and require a --force to either mkfs.btrfs or
> balance to raid56 mode?  Because that's about where I am on it.

Agree. At this point letting ordinary users create raid56 filesystems
is counterproductive.


I would suggest:

1, a much more strongly worded warning in the wiki. Make sure there
are no misunderstandings
that they really should not use raid56 right now for new filesystems.


   I beefed up the warnings in several places in the wiki a couple of
days ago.


Not to sound rude - but I don't think these go anywhere near far enough.
It needs to be completely obvious that there's a good chance you'll lose
everything. IMHO that's the only way to stop BTRFS from getting
the 'data eater' reputation. It can be revisited and reworded when the
implementation is more tested and stable.


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Steven Haigh

On 2016-06-27 08:33, ronnie sahlberg wrote:

On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:

Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:

Wow. So it sees the data strip corruption, uses good parity on disk to
fix it, writes the fix to disk, recomputes parity for some reason but
does it wrongly, and then overwrites good parity with bad parity?
That's fucked. So in other words, if there are any errors fixed up
during a scrub, you should do a 2nd scrub. The first scrub should make
sure data is correct, and the 2nd scrub should make sure the bug is
papered over by computing correct parity and replacing the bad parity.


I wonder if the same problem happens with balance or if this is just a
bug in scrub code?


Could this explain why people have been reporting so many raid56 mode
cases of btrfs replacing a first drive appearing to succeed just fine,
but then they go to btrfs replace a second drive, and the array crashes
as if the first replace didn't work correctly after all, resulting in two
bad devices once the second replace gets under way, of course bringing
down the array?

If so, then it looks like we have our answer as to what has been going
wrong that has been so hard to properly trace and thus to bugfix.

Combine that with the raid4 dedicated parity device behavior you're
seeing if the writes are all exactly 128 MB, with that possibly
explaining the super-slow replaces, and this thread may have just given
us answers to both of those until-now-untraceable issues.

Regardless, what's /very/ clear by now is that raid56 mode as it
currently exists is more or less fatally flawed, and a full scrap and
rewrite to an entirely different raid56 mode on-disk format may be
necessary to fix it.

And what's even clearer is that people /really/ shouldn't be using raid56
mode for anything but testing with throw-away data, at this point.
Anything else is simply irresponsible.

Does that mean we need to put a "raid56 mode may eat your babies" level
warning in the manpage and require a --force to either mkfs.btrfs or
balance to raid56 mode?  Because that's about where I am on it.


Agree. At this point letting ordinary users create raid56 filesystems
is counterproductive.


+1


I would suggest:

1, a much more strongly worded warning in the wiki. Make sure there
are no misunderstandings
that they really should not use raid56 right now for new filesystems.


I voiced my concern on #btrfs about this - it really should show that 
this may eat your data and is properly experimental. At the moment, it 
looks as if the features are implemented and working as expected. In my 
case with nothing out of the ordinary - I've now got ~3.8Tb free disk 
space. Certainly not ready for *ANY* kind of public use.



2, Instead of a --force flag. (Users tend to ignore --force and
warnings in documentation.)
Instead ifdef out the options to create raid56 in mkfs.btrfs.
Developers who want to test can just remove the ifdef and recompile
the tools anyway.
But if end-users have to recompile userspace, that really forces the
point that "you
really should not use this right now".


I think this is a somewhat good idea - however it should be a warning 
along the lines of:
"BTRFS RAID56 is VERY experimental and is known to corrupt data in 
certain cases. Use at your own risk!


Continue? (y/N):"


3, reach out to the documentation and fora for the major distros and
make sure they update their
documentation accordingly.
I think a lot of end-users, if they try to research something, are
more likely to go to  fora and wiki
than search out an upstream fora.


Another good idea.

I'd also recommend updates to the ArchLinux wiki - as for some reason I 
always seem to end up there when searching for a certain topic...


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Hugo Mills
On Sun, Jun 26, 2016 at 03:33:08PM -0700, ronnie sahlberg wrote:
> On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> > Could this explain why people have been reporting so many raid56 mode
> > cases of btrfs replacing a first drive appearing to succeed just fine,
> > but then they go to btrfs replace a second drive, and the array crashes
> > as if the first replace didn't work correctly after all, resulting in two
> > bad devices once the second replace gets under way, of course bringing
> > down the array?
> >
> > If so, then it looks like we have our answer as to what has been going
> > wrong that has been so hard to properly trace and thus to bugfix.
> >
> > Combine that with the raid4 dedicated parity device behavior you're
> > seeing if the writes are all exactly 128 MB, with that possibly
> > explaining the super-slow replaces, and this thread may have just given
> > us answers to both of those until-now-untraceable issues.
> >
> > Regardless, what's /very/ clear by now is that raid56 mode as it
> > currently exists is more or less fatally flawed, and a full scrap and
> > rewrite to an entirely different raid56 mode on-disk format may be
> > necessary to fix it.
> >
> > And what's even clearer is that people /really/ shouldn't be using raid56
> > mode for anything but testing with throw-away data, at this point.
> > Anything else is simply irresponsible.
> >
> > Does that mean we need to put a "raid56 mode may eat your babies" level
> > warning in the manpage and require a --force to either mkfs.btrfs or
> > balance to raid56 mode?  Because that's about where I am on it.
> 
> Agree. At this point letting ordinary users create raid56 filesystems
> is counterproductive.
> 
> 
> I would suggest:
> 
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings
> that they really should not use raid56 right now for new filesystems.

   I beefed up the warnings in several places in the wiki a couple of
days ago.

   Hugo.

> 2, Instead of a --force flag. (Users tend to ignore --force and
> warnings in documentation.)
> Instead ifdef out the options to create raid56 in mkfs.btrfs.
> Developers who want to test can just remove the ifdef and recompile
> the tools anyway.
> But if end-users have to recompile userspace, that really forces the
> point that "you
> really should not use this right now".
> 
> 3, reach out to the documentation and fora for the major distros and
> make sure they update their
> documentation accordingly.
> I think a lot of end-users, if they try to research something, are
> more likely to go to  fora and wiki
> than search out an upstream fora.

-- 
Hugo Mills | "No! My collection of rare, incurable diseases!
hugo@... carfax.org.uk | Violated!"
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Stimpson J. Cat, The Ren & Stimpy Show


signature.asc
Description: Digital signature


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread ronnie sahlberg
On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:
>
>> Wow. So it sees the data strip corruption, uses good parity on disk to
>> fix it, writes the fix to disk, recomputes parity for some reason but
>> does it wrongly, and then overwrites good parity with bad parity?
>> That's fucked. So in other words, if there are any errors fixed up
>> during a scrub, you should do a 2nd scrub. The first scrub should make
>> sure data is correct, and the 2nd scrub should make sure the bug is
>> papered over by computing correct parity and replacing the bad parity.
>>
>> I wonder if the same problem happens with balance or if this is just a
>> bug in scrub code?
>
> Could this explain why people have been reporting so many raid56 mode
> cases of btrfs replacing a first drive appearing to succeed just fine,
> but then they go to btrfs replace a second drive, and the array crashes
> as if the first replace didn't work correctly after all, resulting in two
> bad devices once the second replace gets under way, of course bringing
> down the array?
>
> If so, then it looks like we have our answer as to what has been going
> wrong that has been so hard to properly trace and thus to bugfix.
>
> Combine that with the raid4 dedicated parity device behavior you're
> seeing if the writes are all exactly 128 MB, with that possibly
> explaining the super-slow replaces, and this thread may have just given
> us answers to both of those until-now-untraceable issues.
>
> Regardless, what's /very/ clear by now is that raid56 mode as it
> currently exists is more or less fatally flawed, and a full scrap and
> rewrite to an entirely different raid56 mode on-disk format may be
> necessary to fix it.
>
> And what's even clearer is that people /really/ shouldn't be using raid56
> mode for anything but testing with throw-away data, at this point.
> Anything else is simply irresponsible.
>
> Does that mean we need to put a "raid56 mode may eat your babies" level
> warning in the manpage and require a --force to either mkfs.btrfs or
> balance to raid56 mode?  Because that's about where I am on it.

Agree. At this point letting ordinary users create raid56 filesystems
is counterproductive.


I would suggest:

1, a much more strongly worded warning in the wiki. Make sure there
are no misunderstandings
that they really should not use raid56 right now for new filesystems.

2, Instead of a --force flag. (Users tend to ignore --force and
warnings in documentation.)
Instead ifdef out the options to create raid56 in mkfs.btrfs.
Developers who want to test can just remove the ifdef and recompile
the tools anyway.
But if end-users have to recompile userspace, that really forces the
point that "you
really should not use this right now".

3, reach out to the documentation and fora for the major distros and
make sure they update their
documentation accordingly.
I think a lot of end-users, if they try to research something, are
more likely to go to  fora and wiki
than search out an upstream fora.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Chris Murphy
On Sun, Jun 26, 2016 at 3:20 AM, Goffredo Baroncelli  wrote:
> On 2016-06-26 00:33, Chris Murphy wrote:
>> On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli
>>  wrote:
>>> On 2016-06-25 19:58, Chris Murphy wrote:
>>> [...]
> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?

 The wrong parity, is it valid for the data strips that includes the
 (intentionally) corrupt data?

 Can parity computation happen before the csum check? Where sometimes you 
 get:

 read data strips > compute parity > check csum fails > read good
 parity from disk > fix up the bad data chunk > write wrong parity
 (based on wrong data)?

 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3

 2371-2383 suggest that there's a parity check, it's not always being
 rewritten to disk if it's already correct. But it doesn't know it's
 not correct, it thinks it's wrong so writes out the wrongly computed
 parity?
>>>
>>> The parity is not valid for either the corrected data or the corrupted data.
>>> It seems that the scrub process copies the contents of disk2 to disk3. That
>>> could happen only if the contents of disk1 are zero.
>>
>> I'm not sure what it takes to hit this exactly. I just tested 3x
>> raid5 with two files, 128KiB "a" and 128KiB "b", so that's a full
>> stripe write for each. I corrupted 64KiB of "a" on devid 1 and 64KiB
>> of "b" on devid 2, did a scrub; the error is detected and corrected,
>> and parity is still correct.
>
> How many times did you try this corruption test? I was unable to trigger the bug
> systematically; on average I got one failure every three tests.

Once.

I just did it a 2nd time and both files' parity is wrong now. So I
did it several more times. Sometimes both files' parity is bad.
Sometimes just one file's parity is bad. Sometimes neither file's
parity is bad.

It's a very bad bug, because it is a form of silent data corruption
and it's induced by Btrfs. And it's apparently hit
non-deterministically. Is this some form of race condition?

Somewhat orthogonal to this: while Btrfs is subject to the
write hole problem where parity can be wrong, this is detected and
warned about. Bad data doesn't propagate up to user space.

This might explain how users are getting hit with corrupt files only
after they have a degraded volume. They did a scrub where some fixups
happen, but behind the scenes parity was possibly corrupted even though
their data was fixed. Later they have a failed device, and the bad
parity is needed, and now there are a bunch of scary checksum errors
with piles of files listed as unrecoverable. And in fact they are
unrecoverable because the bad parity means bad reconstruction, so even
scraping the data off with btrfs restore won't work.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Goffredo Baroncelli
On 2016-06-26 00:33, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli
>  wrote:
>> On 2016-06-25 19:58, Chris Murphy wrote:
>> [...]
 Wow. So it sees the data strip corruption, uses good parity on disk to
 fix it, writes the fix to disk, recomputes parity for some reason but
 does it wrongly, and then overwrites good parity with bad parity?
>>>
>>> The wrong parity, is it valid for the data strips that includes the
>>> (intentionally) corrupt data?
>>>
>>> Can parity computation happen before the csum check? Where sometimes you 
>>> get:
>>>
>>> read data strips > compute parity > check csum fails > read good
>>> parity from disk > fix up the bad data chunk > write wrong parity
>>> (based on wrong data)?
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>>
>>> 2371-2383 suggest that there's a parity check, it's not always being
>>> rewritten to disk if it's already correct. But it doesn't know it's
>>> not correct, it thinks it's wrong so writes out the wrongly computed
>>> parity?
>>
>> The parity is not valid for either the corrected data or the corrupted data.
>> It seems that the scrub process copies the contents of disk2 to disk3. That
>> could happen only if the contents of disk1 are zero.
> 
> I'm not sure what it takes to hit this exactly. I just tested 3x
> raid5 with two files, 128KiB "a" and 128KiB "b", so that's a full
> stripe write for each. I corrupted 64KiB of "a" on devid 1 and 64KiB
> of "b" on devid 2, did a scrub; the error is detected and corrected,
> and parity is still correct.

How many times did you try this corruption test? I was unable to trigger the bug
systematically; on average I got one failure every three tests.


[...]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Duncan
Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:

> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?
> That's fucked. So in other words, if there are any errors fixed up
> during a scrub, you should do a 2nd scrub. The first scrub should make
> sure data is correct, and the 2nd scrub should make sure the bug is
> papered over by computing correct parity and replacing the bad parity.
> 
> I wonder if the same problem happens with balance or if this is just a
> bug in scrub code?

Could this explain why people have been reporting so many raid56 mode 
cases of btrfs replacing a first drive appearing to succeed just fine, 
but then they go to btrfs replace a second drive, and the array crashes 
as if the first replace didn't work correctly after all, resulting in two 
bad devices once the second replace gets under way, of course bringing 
down the array?

If so, then it looks like we have our answer as to what has been going 
wrong that has been so hard to properly trace and thus to bugfix.

Combine that with the raid4 dedicated parity device behavior you're 
seeing if the writes are all exactly 128 MB, with that possibly 
explaining the super-slow replaces, and this thread may have just given 
us answers to both of those until-now-untraceable issues.

Regardless, what's /very/ clear by now is that raid56 mode as it 
currently exists is more or less fatally flawed, and a full scrap and 
rewrite to an entirely different raid56 mode on-disk format may be 
necessary to fix it.

And what's even clearer is that people /really/ shouldn't be using raid56 
mode for anything but testing with throw-away data, at this point.  
Anything else is simply irresponsible.

Does that mean we need to put a "raid56 mode may eat your babies" level 
warning in the manpage and require a --force to either mkfs.btrfs or 
balance to raid56 mode?  Because that's about where I am on it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Chris Murphy
On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli
 wrote:
> On 2016-06-25 19:58, Chris Murphy wrote:
> [...]
>>> Wow. So it sees the data strip corruption, uses good parity on disk to
>>> fix it, writes the fix to disk, recomputes parity for some reason but
>>> does it wrongly, and then overwrites good parity with bad parity?
>>
>> The wrong parity, is it valid for the data strips that includes the
>> (intentionally) corrupt data?
>>
>> Can parity computation happen before the csum check? Where sometimes you get:
>>
>> read data strips > compute parity > check csum fails > read good
>> parity from disk > fix up the bad data chunk > write wrong parity
>> (based on wrong data)?
>>
>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>
>> 2371-2383 suggest that there's a parity check, it's not always being
>> rewritten to disk if it's already correct. But it doesn't know it's
>> not correct, it thinks it's wrong so writes out the wrongly computed
>> parity?
>
> The parity is not valid for either the corrected data or the corrupted data.
> It seems that the scrub process copies the contents of disk2 to disk3. That
> could happen only if the contents of disk1 are zero.

I'm not sure what it takes to hit this exactly. I just tested 3x
raid5 with two files, 128KiB "a" and 128KiB "b", so that's a full
stripe write for each. I corrupted 64KiB of "a" on devid 1 and 64KiB
of "b" on devid 2, did a scrub; the error is detected and corrected,
and parity is still correct.

I also tried to corrupt both parities and scrub, and like you I get no
messages from scrub in user space or kernel but the parity is
corrected.

The fixup is also not cow'd. It is an overwrite, which seems
unproblematic to me at face value. But?

Next I corrupted parities, failed one drive, mounted degraded, and
read in both files. If there is a write hole, I should get back
corrupt data from parity reconstruction blindly being trusted and
wrongly reconstructed.

[root@f24s ~]# cp /mnt/5/* /mnt/1/tmp
cp: error reading '/mnt/5/a128.txt': Input/output error
cp: error reading '/mnt/5/b128.txt': Input/output error

[607594.478720] BTRFS warning (device dm-7): csum failed ino 295 off 0
csum 1940348404 expected csum 650595490
[607594.478818] BTRFS warning (device dm-7): csum failed ino 295 off
4096 csum 463855480 expected csum 650595490
[607594.478869] BTRFS warning (device dm-7): csum failed ino 295 off
8192 csum 3317251692 expected csum 650595490
[607594.479227] BTRFS warning (device dm-7): csum failed ino 295 off
12288 csum 2973611336 expected csum 650595490
[607594.479244] BTRFS warning (device dm-7): csum failed ino 295 off
16384 csum 2556299655 expected csum 650595490
[607594.479254] BTRFS warning (device dm-7): csum failed ino 295 off
20480 csum 1098993191 expected csum 650595490
[607594.479263] BTRFS warning (device dm-7): csum failed ino 295 off
24576 csum 1503293813 expected csum 650595490
[607594.479272] BTRFS warning (device dm-7): csum failed ino 295 off
28672 csum 1538866238 expected csum 650595490
[607594.479282] BTRFS warning (device dm-7): csum failed ino 295 off
36864 csum 2855931166 expected csum 650595490
[607594.479292] BTRFS warning (device dm-7): csum failed ino 295 off
32768 csum 3351364818 expected csum 650595490


So... no write hole? Clearly it must reconstruct from the corrupt
parity, and then check the csum tree for EXTENT_CSUM; the csum doesn't
match, so the bad data fails to propagate upstream. And it doesn't
result in a fixup. Good.
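
A rough sketch of that degraded-read check, transposed to the loop-device
setup used elsewhere in this thread (device names, and the way the "failed"
device is removed, are only an assumption):

sudo umount mnt
sudo losetup -d /dev/loop2              # make one device disappear
sudo mount -o degraded /dev/loop0 mnt   # mount with a device missing
cat mnt/out.txt > /dev/null             # forces reconstruction from parity
dmesg | tail                            # csum warnings here mean bad parity was used
sudo umount mnt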

What happens if I umount, make the missing device visible again, and
mount not degraded?

[607775.394504] BTRFS error (device dm-7): parent transid verify
failed on 18517852160 wanted 143 found 140
[607775.424505] BTRFS info (device dm-7): read error corrected: ino 1
off 18517852160 (dev /dev/mapper/VG-a sector 67584)
[607775.425055] BTRFS info (device dm-7): read error corrected: ino 1
off 18517856256 (dev /dev/mapper/VG-a sector 67592)
[607775.425560] BTRFS info (device dm-7): read error corrected: ino 1
off 18517860352 (dev /dev/mapper/VG-a sector 67600)
[607775.425850] BTRFS info (device dm-7): read error corrected: ino 1
off 18517864448 (dev /dev/mapper/VG-a sector 67608)
[607775.431867] BTRFS error (device dm-7): parent transid verify
failed on 16303439872 wanted 145 found 139
[607775.432973] BTRFS info (device dm-7): read error corrected: ino 1
off 16303439872 (dev /dev/mapper/VG-a sector 4262240)
[607775.433438] BTRFS info (device dm-7): read error corrected: ino 1
off 16303443968 (dev /dev/mapper/VG-a sector 4262248)
[607775.433842] BTRFS info (device dm-7): read error corrected: ino 1
off 16303448064 (dev /dev/mapper/VG-a sector 4262256)
[607775.434220] BTRFS info (device dm-7): read error corrected: ino 1
off 16303452160 (dev /dev/mapper/VG-a sector 4262264)
[607775.434847] BTRFS error (device dm-7): parent transid verify
failed on 16303456256 wanted 145 found 139
[607775.435972] BTRFS info (device dm-7): read error corrected: ino 1
off 163034562

Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Goffredo Baroncelli
On 2016-06-25 19:58, Chris Murphy wrote:
[...]
>> Wow. So it sees the data strip corruption, uses good parity on disk to
>> fix it, writes the fix to disk, recomputes parity for some reason but
>> does it wrongly, and then overwrites good parity with bad parity?
> 
> The wrong parity, is it valid for the data strips that includes the
> (intentionally) corrupt data?
> 
> Can parity computation happen before the csum check? Where sometimes you get:
> 
> read data strips > compute parity > check csum fails > read good
> parity from disk > fix up the bad data chunk > write wrong parity
> (based on wrong data)?
> 
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
> 
> 2371-2383 suggest that there's a parity check, it's not always being
> rewritten to disk if it's already correct. But it doesn't know it's
> not correct, it thinks it's wrong so writes out the wrongly computed
> parity?

The parity is not valid for either the corrected data or the corrupted data. It
seems that the scrub process copies the contents of disk2 to disk3. That could
happen only if the contents of disk1 are zero.

BR
GB


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Chris Murphy
On Sat, Jun 25, 2016 at 11:25 AM, Chris Murphy  wrote:
> On Sat, Jun 25, 2016 at 6:21 AM, Goffredo Baroncelli  
> wrote:
>
>> 5) I check the disks at the offsets above, to verify that the data/parity is 
>> correct
>>
>> However I found that:
>> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
>> corruption, but recomputes the parity (always correctly);
>
> This is mostly good news, that it is fixing bad parity during scrub.
> What's not clear due to the lack of any message is if the scrub is
> always writing out new parity, or only writes it if there's a
> mismatch.
>
>
>> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find
>> the corruption. But I found two main behaviors:
>>
>> 2.a) the kernel repaired the damage, but computed the wrong parity. Where
>> the parity should have been, the kernel copied the data of the second disk
>> onto the parity disk
>
> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?

The wrong parity, is it valid for the data strips that includes the
(intentionally) corrupt data?

Can parity computation happen before the csum check? Where sometimes you get:

read data strips > compute parity > check csum fails > read good
parity from disk > fix up the bad data chunk > write wrong parity
(based on wrong data)?

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3

2371-2383 suggest that there's a parity check, it's not always being
rewritten to disk if it's already correct. But it doesn't know it's
not correct, it thinks it's wrong so writes out the wrongly computed
parity?



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Chris Murphy
On Sat, Jun 25, 2016 at 6:21 AM, Goffredo Baroncelli  wrote:

> 5) I check the disks at the offsets above, to verify that the data/parity is 
> correct
>
> However I found that:
> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
> corruption, but recomputes the parity (always correctly);

This is mostly good news, that it is fixing bad parity during scrub.
What's not clear due to the lack of any message is if the scrub is
always writing out new parity, or only writes it if there's a
mismatch.
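
One way to probe that (a sketch, not something done in this thread) is to
compare the parity device's write counter before and after scrubbing a clean
filesystem; in this small test the full stripe keeps its parity on /dev/loop2,
and field 7 of /sys/block/<dev>/stat is the count of sectors written:

before=$(awk '{print $7}' /sys/block/loop2/stat)
sudo btrfs scrub start -B mnt/.
after=$(awk '{print $7}' /sys/block/loop2/stat)
echo "sectors written to the parity device during scrub: $((after - before))"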


> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find
> the corruption. But I found two main behaviors:
>
> 2.a) the kernel repaired the damage, but computed the wrong parity. Where
> the parity should have been, the kernel copied the data of the second disk
> onto the parity disk

Wow. So it sees the data strip corruption, uses good parity on disk to
fix it, writes the fix to disk, recomputes parity for some reason but
does it wrongly, and then overwrites good parity with bad parity?
That's fucked. So in other words, if there are any errors fixed up
during a scrub, you should do a 2nd scrub. The first scrub should make
sure data is correct, and the 2nd scrub should make sure the bug is
papered over by computing correct parity and replacing the bad parity.
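
In script form, a sketch of that double-scrub workaround (mount point as used
by the test script elsewhere in this thread):

sudo btrfs scrub start -B mnt/.   # first pass: repairs the bad data strips
sudo btrfs scrub start -B mnt/.   # second pass: should recompute and rewrite
                                  # the parity the first pass left wrong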

I wonder if the same problem happens with balance or if this is just a
bug in scrub code?


> but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)
>
> Another strangeness is that SCRUB sometimes reports
>  ERROR: there are uncorrectable errors
> and sometimes reports
>  WARNING: errors detected during scrubbing, corrected
>
> but also these seem UNrelated to the behavior 2.a) or 2.b) or msg1 or msg2

I've seen this also, errors in user space but no kernel messages.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Goffredo Baroncelli
Hi all,

following the thread "Adventures in btrfs raid5 disk recovery", I investigated
a bit the BTRFS capability to scrub a corrupted raid5 filesystem. To test it, I
first found where a file was stored, and then I tried to corrupt the data disks
(while unmounted) or the parity disk.
The results showed that sometimes the kernel recomputes the parity wrongly.

I tested the following kernels:
- 4.6.1
- 4.5.4
and both showed the same behavior.

The test was performed as described below:

1) create a filesystem in raid5 mode (for data and metadata) of 1500MB 

truncate -s 500M disk1.img; losetup -f disk1.img
truncate -s 500M disk2.img; losetup -f disk2.img
truncate -s 500M disk3.img; losetup -f disk3.img
sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
sudo mount /dev/loop0 mnt/

2) I created a file with a length of 128kb:

python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
sudo umount mnt/

3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was able to 
find where the file stripe is located:

/dev/loop0: offset=81788928+16*4096(64k, second half of the file: 
'bd.)
/dev/loop1: offset=61865984+16*4096(64k, first half of the file: 
'ad.)
/dev/loop2: offset=61865984+16*4096(64k, parity: 
'\x03\x00\x03\x03\x03.)

4) I tried to corrupt each disk (one disk per test), and then ran a scrub:

for example for the disk /dev/loop2:
sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
seek=$((61865984+16*4096)) count=5
sudo mount /dev/loop0 mnt
sudo btrfs scrub start mnt/.

5) I checked the disks at the offsets above, to verify that the data/parity is
correct
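
For reference, a minimal sketch of that check for this particular layout
(offsets and devices are the ones from step 3 above; the python helper is
only an illustration):

off1=$((61865984 + 16*4096))   # data1 on /dev/loop1, parity on /dev/loop2
off2=$((81788928 + 16*4096))   # data2 on /dev/loop0
sudo dd 2>/dev/null if=/dev/loop1 of=/tmp/d1.bin bs=1 skip=$off1 count=65536
sudo dd 2>/dev/null if=/dev/loop0 of=/tmp/d2.bin bs=1 skip=$off2 count=65536
sudo dd 2>/dev/null if=/dev/loop2 of=/tmp/p.bin  bs=1 skip=$off1 count=65536
python -c '
d1 = bytearray(open("/tmp/d1.bin", "rb").read())
d2 = bytearray(open("/tmp/d2.bin", "rb").read())
p  = bytearray(open("/tmp/p.bin",  "rb").read())
bad = sum(1 for i in range(len(p)) if (d1[i] ^ d2[i]) != p[i])
print("parity mismatch bytes: %d" % bad)
'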

However I found that:
1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any corruption,
but recomputes the parity (always correctly);

2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find the
corruption. But I found two main behaviors:

2.a) the kernel repaired the damage, but computed the wrong parity. Where the
parity should have been, the kernel copied the data of the second disk onto the
parity disk

2.b) the kernel repaired the damage, and rebuilt a correct parity

I have to point out another strange thing: in dmesg I found two kinds of 
messages:

msg1)
[...]
[ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
[ 1021.366949] BTRFS: has skinny extents
[ 1021.399208] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, 
length 4096, links 1 (path: out.txt)
[ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 1, gen 0
[ 1021.399291] BTRFS error (device loop2): fixed up error at logical 
142802944 on dev /dev/loop0

msg2)
[ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
[ 1017.435074] BTRFS: has skinny extents
[ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 1, gen 0
[ 1017.463403] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872,  root 5, inode 257, offset 
65536, length 4096, links 1 (path: out.txt)
[ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 2, gen 0
[ 1017.463467] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, 
length 4096, links 1 (path: out.txt)
[ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 3, gen 0
[ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) 
error at logical 142802944 on dev /dev/loop0
[ 1017.463535] BTRFS error (device loop2): fixed up error at logical 
142802944 on dev /dev/loop0


but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)

Another strangeness is that SCRUB sometimes reports
 ERROR: there are uncorrectable errors
and sometimes reports
 WARNING: errors detected during scrubbing, corrected

but also these seem UNrelated to the behavior 2.a) or 2.b) or msg1 or msg2


Enclosed you can find the script which I used to trigger the bug. I have to
rerun it several times to show the problem because it doesn't happen every
time. Note that the offsets and the loop device names are hard-coded.
You must run the script from the directory where it lives, e.g. "bash test.sh".
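
Since the failure is intermittent, a loop like the following can be used to
rerun it and keep the output (just a convenience, not part of the attached
script):

for i in $(seq 1 10); do echo "=== run $i ==="; bash test.sh; done 2>&1 | tee runs.log
# a run that misses one of the "--- testN: OK" lines (or prints "Read FAIL")
# is a failed run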

Br
G.Baroncelli


 
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




test.sh
Description: Bourne shell script