Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
Resent because I don't see it in ml

Hi Qu,

On 2016-11-04 03:10, Qu Wenruo wrote:
[...]
> I reproduced your problem and found that it seems to be a race.
[...]
> I dug a little further into case 2) and found:
> a) Kernel is scrubbing the correct range
>    So the extra csum error is not caused by checking the wrong
>    logical bytenr range.
>
> b) Kernel is scrubbing some pages twice
>    That's the cause of the problem.

Due to time constraints, I was unable to dig further into this issue.
I tried to add some printk calls to check whether the code processes
the data correctly; what I found is:

1) the code seems to process the right data
2) the code seems to produce the correct data (i.e. it was able to
   rebuild the correct data on the basis of the checksums and the
   parity)
3) like you, I found that the code processed the same data twice

Unfortunately I was unable to figure out why the code did not write
the right data to the platter. Following the write path through the
several handlers of the bio was a job greater than my capabilities
(and my time :-) )

> And unlike most of us assume, in fact scrubbing the full fs is a very
> racy thing.

On the basis of your (and my) observation that the code seems to
process the same data multiple times, this may be an explanation.

> Scrubbing the full fs is split into N scrubbing ioctls, one for each
> device.
>
> So for the above script, the kernel is doing *3* scrubbing jobs.
> For other profiles it may not be a problem, but for RAID5/6 a race
> can happen easily, like:
>
> Scrub dev1(P)            | Scrub dev2(D1)         | Scrub dev3(D2)
> -------------------------+------------------------+------------------
> Read out full stripe     | Read out D1            | Read out D2
>                          | Check csum for D1      | Check csum for D2
>                          | Csum mismatch (err++)  | Csum matches
> Cal parity               | Read out full stripe   |
> Parity mismatch          | Do recovery            |
> Check full stripe        |                        |
> D1 csum mismatch (err++) |                        |
>
> So a csum mismatch can be counted twice.
> And since scrubbing a corrupted data stripe can easily race with
> scrubbing the parity, if the timing happens in a worse situation, it
> can lead to unrecoverable csum errors.

Interesting, I wasn't aware that the scrub is done in parallel on the
different disks. This explains a lot of the strangeness.

> On the other hand, if you scrub only the damaged device, no race will
> happen, so cases 2) and 3) will just disappear.
>
> Would you please try to scrub only one device at a time?

I did, and I can confirm your hypothesis: if I do the scrub process
one disk at a time, I was unable to reproduce the corruption. Instead,
if I do the scrub process in parallel on all the disks, I sometimes
got a corruption: on average, out of every 6 tests I got from 1 to 3
failures.

So the scrub strategy must be different in the case of a RAID5/6
chunk: in this case the parallel scrub must be avoided; the scrub must
be performed on a per-stripe basis.

>> 5) I check the disks at the offsets above, to verify that the
>> data/parity is correct
>
> You could try the new offline scrub, it can save you a lot of time to
> find data/parity corruption.
> https://github.com/adam900710/btrfs-progs/tree/fsck_scrub

I will try it.

BR
G.Baroncelli

> And of course, it is more reliable than kernel scrub (single thread,
> no extra IO, no race) :)
>
> Thanks,
> Qu
>
>> However I found that:
>> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
>> corruption, but recomputes the parity (always correctly);
>>
>> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able
>> to find the corruption. But I found two main behaviors:
>>
>> 2.a) the kernel repaired the damage, but computed the wrong parity.
>> Where the parity was, the kernel copied the data of the second disk
>> onto the parity disk.
>>
>> 2.b) the kernel repaired the damage, and rebuilt a correct parity
>>
>> I have to point out another strange thing: in dmesg I found two
>> kinds of messages:
>>
>> msg1)
>> []
>> [ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
>> [ 1021.366949] BTRFS: has skinny extents
>> [ 1021.399208] BTRFS warning (device loop2): checksum error at logical
>> 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset
>> 65536, length 4096, links 1 (path: out.txt)
>> [ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0,
>> rd 0, flush 0, corrupt 1, gen 0
>> [ 1021.399291] BTRFS error (device loop2): fixed up error at logical
>> 142802944 on dev /dev/loop0
>>
>> msg2)
>> [ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
>> [ 1017.435074] BTRFS: has skinny extents
>> [ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0,
>> rd 0, flush 0, corrupt 1, gen 0
>> [ 1017.463403] BTRFS warning (device loop2): checksum error at logical
>> 14
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
At 06/25/2016 08:21 PM, Goffredo Baroncelli wrote:
> Hi all,
>
> following the thread "Adventures in btrfs raid5 disk recovery", I
> investigated a bit the BTRFS capability to scrub a corrupted raid5
> filesystem. To test it, I first found where a file was stored, and
> then I tried to corrupt the data disks (when unmounted) or the parity
> disk. The result showed that sometimes the kernel recomputed the
> parity wrongly.
>
> I tested the following kernels
> - 4.6.1
> - 4.5.4
> and both showed the same behavior.
>
> The test was performed as described below:
>
> 1) create a filesystem in raid5 mode (for data and metadata) of
> 1500MB
>
>     truncate -s 500M disk1.img; losetup -f disk1.img
>     truncate -s 500M disk2.img; losetup -f disk2.img
>     truncate -s 500M disk3.img; losetup -f disk3.img
>     sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
>     sudo mount /dev/loop0 mnt/
>
> 2) I created a file with a length of 128kb:
>
>     python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
>     sudo umount mnt/
>
> 3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was
> able to find where the file stripe is located:
>
>     /dev/loop0: offset=81788928+16*4096  (64k, second half of the file: 'bd.)
>     /dev/loop1: offset=61865984+16*4096  (64k, first half of the file: 'ad.)
>     /dev/loop2: offset=61865984+16*4096  (64k, parity: '\x03\x00\x03\x03\x03.)
>
> 4) I tried to corrupt each disk (one disk per test), and then ran a
> scrub: for example for the disk /dev/loop2:
>
>     sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
>         seek=$((61865984+16*4096)) count=5
>     sudo mount /dev/loop0 mnt
>     sudo btrfs scrub start mnt/.

I reproduced your problem and found that it seems to be a race.
The problem I found is mostly the same as yours, but with some more
details:

The script is like:
---
#!/bin/bash
dev1=/dev/vdb6
dev2=/dev/vdb7
dev3=/dev/vdb8
mnt=/mnt/test

umount $mnt

# First full stripe layout is
# dev1: Parity, dev2: Data Stripe1, dev3: Data Stripe2
mkfs.btrfs $dev1 $dev2 $dev3 -f -m raid5 -d raid5
mount $dev1 $mnt -o nospace_cache

xfs_io -f -c "pwrite 0 128k" $mnt/file1
sync
umount $mnt

dmesg -C

# destroy data stripe 1
dd if=/dev/urandom of=$dev2 bs=1 count=64k seek=546308096

# btrfs-progs scrub
# Newly introduced function, offline scrub, quite handy here
btrfs check --scrub $dev1

# kernel scrub
mount $dev1 $mnt -o nospace_cache
btrfs scrub start -B $mnt

# Feel free to umount and call offline scrub
# umount $mnt
# btrfs check --scrub $dev1
---

The result is divided into the following types:

1) Kernel scrub reports 16 recoverable csum errors
   Same as offline scrub. All correct.

2) Kernel scrub reports 17 or more recoverable csum errors
   Even though we only corrupted 16 pages, the extra csum errors come
   out of almost nowhere.

3) Kernel scrub reports some unrecoverable csum errors
   Totally insane. But quite low possibility.

I dug a little further into case 2) and found:
a) Kernel is scrubbing the correct range
   So the extra csum error is not caused by checking the wrong logical
   bytenr range.

b) Kernel is scrubbing some pages twice
   That's the cause of the problem.

And unlike most of us assume, in fact scrubbing the full fs is a very
racy thing.

Scrubbing the full fs is split into N scrubbing ioctls, one for each
device.

So for the above script, the kernel is doing *3* scrubbing jobs.
For other profiles it may not be a problem, but for RAID5/6 a race can
happen easily, like:

Scrub dev1(P)            | Scrub dev2(D1)         | Scrub dev3(D2)
-------------------------+------------------------+------------------
Read out full stripe     | Read out D1            | Read out D2
                         | Check csum for D1      | Check csum for D2
                         | Csum mismatch (err++)  | Csum matches
Cal parity               | Read out full stripe   |
Parity mismatch          | Do recovery            |
Check full stripe        |                        |
D1 csum mismatch (err++) |                        |

So a csum mismatch can be counted twice.

And since scrubbing a corrupted data stripe can easily race with
scrubbing the parity, if the timing happens in a worse situation, it
can lead to unrecoverable csum errors.

On the other hand, if you scrub only the damaged device, no race will
happen, so cases 2) and 3) will just disappear.

Would you please try to scrub only one device at a time?

> 5) I check the disks at the offsets above, to verify that the
> data/parity is correct

You could try the new offline scrub, it can save you a lot of time to
find data/parity corruption.
https://github.com/adam900710/btrfs-progs/tree/fsck_scrub

And of course, it is more reliable than kernel scrub (single thread,
no extra IO, no race) :)

Thanks,
Qu

> However I found that:
> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
> corruption, but recomputes the parity (always correctly);
>
> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to
> find the corruption. But I found two main beha
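The XOR arithmetic behind the race above can be sketched in a few
lines of Python. This is a toy model, not btrfs code: it just shows
why recomputing parity from a stale, still-corrupted data buffer
(as the parity-device scrub can do while the data-device scrub is
repairing) writes wrong parity even though the data itself gets fixed:

```python
def xor(a, b):
    """XOR two equal-length byte strings (RAID5 parity arithmetic)."""
    return bytes(x ^ y for x, y in zip(a, b))

d1_good = b"a" * 8
d2 = b"b" * 8
parity = xor(d1_good, d2)        # correct on-disk parity

d1_corrupt = b"\x00" * 8         # simulated corruption of the D1 strip

# Scrub of dev2 recovers D1 from D2 and P (this part works):
d1_recovered = xor(d2, parity)
assert d1_recovered == d1_good

# A racing scrub of the parity device recomputes parity from the
# stale, still-corrupted D1 buffer and writes it back:
parity_bad = xor(d1_corrupt, d2)
assert parity_bad != parity      # parity is now silently wrong
```

With serialized per-device scrubs the second computation would have
seen the repaired D1, which is why scrubbing one device at a time
avoids the corruption.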
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
At 09/22/2016 11:07 AM, Christoph Anton Mitterer wrote:
> On Thu, 2016-09-22 at 10:08 +0800, Qu Wenruo wrote:
>> And I don't see the necessity to csum the parity.
>> Why csum a csum again?
>
> I'd say simply for the following reason:
>
> Imagine the smallest RAID5: 2x data D1 D2, 1x parity P
>
> If D2 is lost it could be recalculated via D1 and P.
> What if only (all) the checksum information for D2 is lost (e.g.
> because of further silent data corruption on the blocks of these
> csums)?

First thing first, csum is metadata, which is normally protected by
the metadata profile.

So you are saying the csum tree is corrupted. Wow, that's protected by
RAID5 (assuming you're using RAID5 for both metadata and data) and by
the tree leaf csum. If that happens, it means at least 2 devices are
already corrupted. Your data is doomed anyway.

> Then we'd only know D1 is valid (which still has working csums).

If that really happens, a csum tree leaf is corrupted, and your D1
csum will be corrupted too, considering how close they are. No means
to recover anyway.

> But we wouldn't know whether D2 is (because gone) nor whether P is
> (because not csummed).
> And next imagine silent data corruption in either D2 or P => we
> cannot tell which of them is valid, no repair possible... or do I
> miss something?

Stated above, your fs is already screwed up in that case.

The biggest problem is, parity is a block level thing (its bytenr is a
dev bytenr, not a logical bytenr), while csum is a logical level
thing. Mixing these things up would cause more problems and more
maintenance time than the benefit.

>> Just as you expected, it doesn't check parity.
>> Even for RAID1/DUP, it won't check the backup if it succeeded
>> reading the first stripe.
>
> That would IMO be really a highly severe bug... making scrubbing
> close to completely useless for multi-device fs.
> I mean the whole reason for doing it is to find [silently] corrupted
> blocks... in order to be able to do something about it.
> If one would only notice it when the read actually fails because the
> one working block is also gone... then why have a RAID in the first
> place?

That's "--check-data-csum", not in-kernel scrub.

Thanks,
Qu

>> Current implement doesn't really care if it's the data or the copy
>> corrupted; if any data can be read out, then there is no problem.
>
> Except it makes RAID practically useless... => problem
>
> Cheers,
> Chris.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Wed, Sep 21, 2016 at 9:00 PM, Qu Wenruo wrote:
>> Both copies are not scrubbed? Oh hell...
>
> I was replying to the "--check-data-csum" of btrfsck.
>
> I mean, for --check-data-csum, it doesn't read the backup if the
> first data can be read out without error.
>
> And if the first data is wrong, btrfsck will read the backup, and
> output an error about the wrong csum, but won't return an error.

OK, I'm convinced both progs and kernel scrubs have problems. I guess
it makes sense to fix progs first, and make sure it's fixed there,
then get it fixed in the kernel.

--
Chris Murphy
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Thu, 2016-09-22 at 10:08 +0800, Qu Wenruo wrote:
> And I don't see the necessity to csum the parity.
> Why csum a csum again?

I'd say simply for the following reason:

Imagine the smallest RAID5: 2x data D1 D2, 1x parity P

If D2 is lost it could be recalculated via D1 and P.
What if only (all) the checksum information for D2 is lost (e.g.
because of further silent data corruption on the blocks of these
csums)?

Then we'd only know D1 is valid (which still has working csums).
But we wouldn't know whether D2 is (because gone) nor whether P is
(because not csummed).
And next imagine silent data corruption in either D2 or P => we
cannot tell which of them is valid, no repair possible... or do I miss
something?

> Just as you expected, it doesn't check parity.
> Even for RAID1/DUP, it won't check the backup if it succeeded
> reading the first stripe.

That would IMO be really a highly severe bug... making scrubbing close
to completely useless for multi-device fs.
I mean the whole reason for doing it is to find [silently] corrupted
blocks... in order to be able to do something about it.
If one would only notice it when the read actually fails because the
one working block is also gone... then why have a RAID in the first
place?

> Current implement doesn't really care if it's the data or the copy
> corrupted; if any data can be read out, then there is no problem.

Except it makes RAID practically useless... => problem

Cheers,
Chris.
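Christoph's ambiguity argument above can be made concrete with a small
Python sketch (a toy model with made-up byte values, not btrfs code):
if D2's csum is lost and P is never csummed, a parity mismatch gives
two equally self-consistent explanations, so no repair is possible:

```python
def xor(a, b):
    """XOR two equal-length byte strings (RAID5 parity arithmetic)."""
    return bytes(x ^ y for x, y in zip(a, b))

d1 = b"\x11" * 4       # D1 still has a working csum, so it is trusted
d2_seen = b"\x22" * 4  # D2 as read from disk; its csum is lost
p_seen = b"\x99" * 4   # parity as read from disk; never csummed

# The full stripe is inconsistent:
assert xor(d1, d2_seen) != p_seen

# Hypothesis A: P is good, D2 is corrupt -> the "real" D2 would be:
d2_if_p_good = xor(d1, p_seen)
# Hypothesis B: D2 is good, P is corrupt -> the "real" P would be:
p_if_d2_good = xor(d1, d2_seen)

# Both hypotheses are internally consistent; without a csum on D2 or
# on P there is no way to prefer one over the other:
assert xor(d1, d2_if_p_good) == p_seen
assert xor(d1, d2_seen) == p_if_d2_good
```

Qu's counterpoint in the thread is that this scenario already requires
the csum tree itself to be damaged, which on a RAID5 metadata profile
implies multiple failed devices.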
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
At 09/22/2016 10:44 AM, Chris Murphy wrote:
> On Wed, Sep 21, 2016 at 8:08 PM, Qu Wenruo wrote:
>> At 09/21/2016 11:13 PM, Chris Murphy wrote:
>>> I understand some things should go in fsck for comparison. But in
>>> this case I don't see how it can help. Parity is not checksummed.
>>> The only way to know if it's wrong is to read all of the data
>>> strips, compute parity, and compare the in-memory parity from the
>>> current read to the on-disk parity.
>>
>> That's what we plan to do.
>> And I don't see the necessity to csum the parity.
>> Why csum a csum again?
>
> parity!=csum
> http://www.spinics.net/lists/linux-btrfs/msg56602.html

I know, it's more than a csum: a normal csum will only tell you if
data is corrupted, while parity can be used to recover data.
But parity needs enough data stripes to recover data.

It's just in between a full backup (RAID1) and a pure csum (hash).

But a csum for the parity is still not worth it, and would screw up
the block (chunk) and logical layers.

>>> There is already an offline scrub in btrfs check which doesn't
>>> repair, but also I don't know if it checks parity.
>>>
>>>     --check-data-csum
>>>         verify checksums of data blocks
>>
>> Just as you expected, it doesn't check parity.
>> Even for RAID1/DUP, it won't check the backup if it succeeded
>> reading the first stripe.
>
> Both copies are not scrubbed? Oh hell...

I was replying to the "--check-data-csum" of btrfsck.

I mean, for --check-data-csum, it doesn't read the backup if the first
data can be read out without error.

And if the first data is wrong, btrfsck will read the backup, and
output an error about the wrong csum, but won't return an error.

> [chris@f24s ~]$ sudo btrfs scrub status /brick2
> scrub status for 7fea4120-9581-43cb-ab07-6631757c0b55
>     scrub started at Tue Sep 20 12:16:18 2016 and finished after 01:46:58
>     total bytes scrubbed: 955.93GiB with 0 errors
>
> How can this possibly correctly say 956GiB scrubbed if it has not
> checked both copies? That message is saying *all* the data, both
> copies, were scrubbed. You're saying that message is wrong? It only
> scrubbed half that amount?
> [chris@f24s ~]$ sudo btrfs fi df /brick2
> Data, RAID1: total=478.00GiB, used=477.11GiB
> System, RAID1: total=32.00MiB, used=96.00KiB
> Metadata, RAID1: total=2.00GiB, used=877.59MiB
> GlobalReserve, single: total=304.00MiB, used=0.00B
>
> When that scrub was happening, both drives were being accessed at
> 100% throughput.
>
>> Current implement doesn't really care if it's the data or the copy
>> corrupted; if any data can be read out, then there is no problem.
>> The same thing applies to tree blocks.
>>
>> So the ability to check every stripe/copy is still quite needed for
>> that option.
>>
>> And that's what I'm planning to enhance: make --check-data-csum
>> equivalent to kernel scrub.
>
> OK thanks.
>
>>> This expects that the filesystem is otherwise OK, so this is
>>> basically an offline scrub but does not repair data from spare
>>> copies.
>>
>> Repair can be implemented, but maybe just rewrite the same data into
>> the same place.
>> If that's a bad block, then it can't repair any further unless we
>> can relocate the extent to another place.
>
> Any device that's out of reserve sectors and can no longer remap LBAs
> on its own is a drive that needs to be decommissioned. It's a new
> feature in just the last year or so that mdadm has a badblocks map so
> it can do what the drive won't do, but I'm personally not a fan of
> keeping malfunctioning drives in RAID.
>
>>> Is it possible to put parities into their own tree? They'd be
>>> checksummed there.
>>
>> Personally speaking, this is quite a bad idea to me.
>> I prefer to separate different logical layers into their own codes.
>> Not mixing them together.
>>
>> Block level things go to the block level (RAID/Chunk), logical
>> things to the logical level (tree blocks).
>
> OK.
>
>> Current btrfs csum design is already much much better than pure
>> RAID. Just think of RAID1: when one copy is corrupted, which copy is
>> the correct one?
>
> Yes.
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Wed, Sep 21, 2016 at 8:08 PM, Qu Wenruo wrote:
>
> At 09/21/2016 11:13 PM, Chris Murphy wrote:
>> I understand some things should go in fsck for comparison. But in
>> this case I don't see how it can help. Parity is not checksummed.
>> The only way to know if it's wrong is to read all of the data
>> strips, compute parity, and compare the in-memory parity from the
>> current read to the on-disk parity.
>
> That's what we plan to do.
> And I don't see the necessity to csum the parity.
> Why csum a csum again?

parity!=csum
http://www.spinics.net/lists/linux-btrfs/msg56602.html

>> There is already an offline scrub in btrfs check which doesn't
>> repair, but also I don't know if it checks parity.
>>
>>     --check-data-csum
>>         verify checksums of data blocks
>
> Just as you expected, it doesn't check parity.
> Even for RAID1/DUP, it won't check the backup if it succeeded reading
> the first stripe.

Both copies are not scrubbed? Oh hell...

[chris@f24s ~]$ sudo btrfs scrub status /brick2
scrub status for 7fea4120-9581-43cb-ab07-6631757c0b55
    scrub started at Tue Sep 20 12:16:18 2016 and finished after 01:46:58
    total bytes scrubbed: 955.93GiB with 0 errors

How can this possibly correctly say 956GiB scrubbed if it has not
checked both copies? That message is saying *all* the data, both
copies, were scrubbed. You're saying that message is wrong? It only
scrubbed half that amount?

[chris@f24s ~]$ sudo btrfs fi df /brick2
Data, RAID1: total=478.00GiB, used=477.11GiB
System, RAID1: total=32.00MiB, used=96.00KiB
Metadata, RAID1: total=2.00GiB, used=877.59MiB
GlobalReserve, single: total=304.00MiB, used=0.00B

When that scrub was happening, both drives were being accessed at 100%
throughput.

> Current implement doesn't really care if it's the data or the copy
> corrupted; if any data can be read out, then there is no problem.
> The same thing applies to tree blocks.
>
> So the ability to check every stripe/copy is still quite needed for
> that option.
> And that's what I'm planning to enhance: make --check-data-csum
> equivalent to kernel scrub.

OK thanks.

>> This expects that the filesystem is otherwise OK, so this is
>> basically an offline scrub but does not repair data from spare
>> copies.
>
> Repair can be implemented, but maybe just rewrite the same data into
> the same place.
> If that's a bad block, then it can't repair any further unless we can
> relocate the extent to another place.

Any device that's out of reserve sectors and can no longer remap LBAs
on its own is a drive that needs to be decommissioned. It's a new
feature in just the last year or so that mdadm has a badblocks map so
it can do what the drive won't do, but I'm personally not a fan of
keeping malfunctioning drives in RAID.

>> Is it possible to put parities into their own tree? They'd be
>> checksummed there.
>
> Personally speaking, this is quite a bad idea to me.
> I prefer to separate different logical layers into their own codes.
> Not mixing them together.
>
> Block level things go to the block level (RAID/Chunk), logical things
> to the logical level (tree blocks).

OK.

> Current btrfs csum design is already much much better than pure RAID.
> Just think of RAID1: when one copy is corrupted, which copy is the
> correct one?

Yes.

--
Chris Murphy
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
At 09/21/2016 11:13 PM, Chris Murphy wrote:
> On Wed, Sep 21, 2016 at 3:15 AM, Qu Wenruo wrote:
>> At 09/21/2016 03:35 PM, Tomasz Torcz wrote:
>>> On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:
>>>> Hi,
>>>>
>>>> For this well-known bug, is there any one fixing it?
>>>>
>>>> It can't be more frustrating finding some one has already worked
>>>> on it after spending days digging.
>>>>
>>>> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to
>>>> implement btrfsck scrub support; at least we can use btrfsck to
>>>> fix bad stripes before the kernel fix.
>>>
>>> Why wouldn't you fix the in-kernel code? Why implement duplicate
>>> functionality when you can fix the root cause?
>>
>> We'll fix the in-kernel code.
>>
>> The fsck one is not a duplicate; we need a better standard to
>> compare the kernel behavior with.
>>
>> Just like the qgroup fix in btrfsck, if the kernel can't handle
>> something well, we do need to fix the kernel, but a good off-line
>> fixer won't hurt.
>> (Btrfs-progs is much easier to implement in, gets a fast
>> review/merge cycle, and it can help us to find a better solution
>> before screwing the kernel up again)
>
> I understand some things should go in fsck for comparison. But in
> this case I don't see how it can help. Parity is not checksummed. The
> only way to know if it's wrong is to read all of the data strips,
> compute parity, and compare the in-memory parity from the current
> read to the on-disk parity.

That's what we plan to do.

And I don't see the necessity to csum the parity.
Why csum a csum again?

> It takes a long time, and at least scrub is online, where btrfsck
> scrub is not.

At least btrfsck scrub will work and is easier to implement, while
kernel scrub currently doesn't work.

The more important thing is, we can forget all about the complicated
concurrency of online scrub, focusing on the implementation itself in
user-space.
Which is easier to implement and easier to maintain.

> There is already an offline scrub in btrfs check which doesn't
> repair, but also I don't know if it checks parity.
>
>     --check-data-csum
>         verify checksums of data blocks

Just as you expected, it doesn't check parity.
Even for RAID1/DUP, it won't check the backup if it succeeded reading
the first stripe.

Current implement doesn't really care if it's the data or the copy
corrupted; if any data can be read out, then there is no problem.
The same thing applies to tree blocks.

So the ability to check every stripe/copy is still quite needed for
that option.

And that's what I'm planning to enhance: make --check-data-csum
equivalent to kernel scrub.

> This expects that the filesystem is otherwise OK, so this is
> basically an offline scrub but does not repair data from spare
> copies.

Repair can be implemented, but maybe just rewrite the same data into
the same place.
If that's a bad block, then it can't repair any further unless we can
relocate the extent to another place.

> Is it possible to put parities into their own tree? They'd be
> checksummed there.

Personally speaking, this is quite a bad idea to me.
I prefer to separate different logical layers into their own codes.
Not mixing them together.

Block level things go to the block level (RAID/Chunk), logical things
to the logical level (tree blocks).

Current btrfs csum design is already much much better than pure RAID.
Just think of RAID1: when one copy is corrupted, which copy is the
correct one?

Thanks,
Qu

> Somehow I think the long term approach is that partial stripe writes,
> which apparently are overwrites and not CoW, need to go away.
> In particular I wonder what the metadata raid56 write pattern is, if
> this usually means a lot of full stripe CoW writes, or if there are
> many small metadata RMW changes that make them partial stripe writes
> and not CoW and thus not safe.
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Wed, Sep 21, 2016 at 3:15 AM, Qu Wenruo wrote:
>
> At 09/21/2016 03:35 PM, Tomasz Torcz wrote:
>>
>> On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:
>>>
>>> Hi,
>>>
>>> For this well-known bug, is there any one fixing it?
>>>
>>> It can't be more frustrating finding some one has already worked on
>>> it after spending days digging.
>>>
>>> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to
>>> implement btrfsck scrub support; at least we can use btrfsck to fix
>>> bad stripes before the kernel fix.
>>
>> Why wouldn't you fix the in-kernel code? Why implement duplicate
>> functionality when you can fix the root cause?
>
> We'll fix the in-kernel code.
>
> The fsck one is not a duplicate; we need a better standard to compare
> the kernel behavior with.
>
> Just like the qgroup fix in btrfsck, if the kernel can't handle
> something well, we do need to fix the kernel, but a good off-line
> fixer won't hurt.
> (Btrfs-progs is much easier to implement in, gets a fast review/merge
> cycle, and it can help us to find a better solution before screwing
> the kernel up again)

I understand some things should go in fsck for comparison. But in this
case I don't see how it can help. Parity is not checksummed. The only
way to know if it's wrong is to read all of the data strips, compute
parity, and compare the in-memory parity from the current read to the
on-disk parity. It takes a long time, and at least scrub is online,
where btrfsck scrub is not.

There is already an offline scrub in btrfs check which doesn't repair,
but also I don't know if it checks parity.

    --check-data-csum
        verify checksums of data blocks

This expects that the filesystem is otherwise OK, so this is basically
an offline scrub but does not repair data from spare copies.

Is it possible to put parities into their own tree? They'd be
checksummed there.

Somehow I think the long term approach is that partial stripe writes,
which apparently are overwrites and not CoW, need to go away.
In particular I wonder what the metadata raid56 write pattern is, if
this usually means a lot of full stripe CoW writes, or if there are
many small metadata RMW changes that make them partial stripe writes
and not CoW and thus not safe.

--
Chris Murphy
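The parity verification Chris describes (read all data strips, compute
parity, compare with the on-disk parity strip) can be sketched as a
small Python helper. This is an illustrative model with a hypothetical
function name, not the btrfs-progs implementation:

```python
def check_full_stripe(data_strips, parity_strip):
    """Return True if the on-disk RAID5 parity matches the parity
    recomputed by XOR-ing all the data strips together."""
    computed = bytes(len(parity_strip))       # all-zero accumulator
    for strip in data_strips:
        computed = bytes(a ^ b for a, b in zip(computed, strip))
    return computed == parity_strip

# 0x01 ^ 0x03 = 0x02 and 0x02 ^ 0x04 = 0x06, so this stripe is consistent:
assert check_full_stripe([b"\x01\x02", b"\x03\x04"], b"\x02\x06")
# ...and a mismatching parity strip is detected:
assert not check_full_stripe([b"\x01\x02", b"\x03\x04"], b"\x00\x00")
```

Note that a mismatch alone cannot say *which* strip is bad; that is
exactly why the data csums are still needed to decide what to rebuild.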
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Wed, Sep 21, 2016 at 1:28 AM, Qu Wenruo wrote:
> Hi,
>
> For this well-known bug, is there any one fixing it?
>
> It can't be more frustrating finding some one has already worked on
> it after spending days digging.
>
> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to
> implement btrfsck scrub support; at least we can use btrfsck to fix
> bad stripes before the kernel fix.

Well, the kernel will fix it if the user just scrubs again. The
problem is that the user doesn't know their file system might have bad
parity.

So I don't know how implementing an optional check in btrfsck helps.
If it's non-optional, that means reading 100% of the volume, not just
metadata, and that's not workable for btrfsck. The user just needs to
do another scrub if they suspect they have been hit by this, and if
they get no errors they're OK. If they get an error that something is
being fixed, they might have to do a 2nd scrub to avoid this bug - but
I'm not sure if there's any different error message between a
non-parity strip being fixed and a parity strip being replaced.

The central thing happening in this bug is that it requires an already
degraded full stripe [1]. That is, a non-parity strip [2] is already
corrupt. What this bug does is fix that strip from good parity, but
then wrongly recompute parity for some reason and write bad parity to
disk. So it shifts the "degradedness" of the full stripe from
non-parity to parity.

There's no actual additional loss of redundancy; it's just that the
messages will say a problem was found and fixed, which is not entirely
true. Non-parity data is fixed, but now parity is wrong, silently.
There is no consequence of this unless it's raid5 and there's another
strip lost in that same stripe.

Uncertain if the bug happens with raid6, or if raid6's extra
redundancy has just masked the problem. Uncertain if the bug happens
with balance, or passively with normal reads.
Only scrub has been tested, and it's non-deterministic; it happens
maybe 1 in 3 or 4 attempts.

[1][2] I'm using SNIA terms. Strip = stripe element = mdadm chunk =
the 64KiB per-device block. Stripe = full stripe = data strips +
parity strip (or 2 strips for raid6).

--
Chris Murphy
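The "degradedness shift" described above, and why a second scrub
heals it, can be sketched with a toy XOR model in Python (illustrative
byte values; the dict stands in for the three strips on disk):

```python
def xor(a, b):
    """XOR two equal-length byte strings (RAID5 parity arithmetic)."""
    return bytes(x ^ y for x, y in zip(a, b))

d1, d2 = b"a" * 4, b"b" * 4
p = xor(d1, d2)

# Degraded full stripe: the D1 strip is already corrupt on disk.
disk = {"d1": b"\x00" * 4, "d2": d2, "p": p}

# First (buggy) scrub: D1 is correctly rebuilt from D2 and P, but the
# parity written back is wrong (observed behaviour in the report: the
# second disk's data lands on the parity strip).
disk["d1"] = xor(disk["d2"], disk["p"])       # data repaired correctly
disk["p"] = disk["d2"]                        # bogus parity written
assert disk["d1"] == d1 and disk["p"] != p    # degradedness moved to P

# Second scrub: all data csums now verify, so parity is simply
# recomputed from good data, and the stripe is fully healthy again.
disk["p"] = xor(disk["d1"], disk["d2"])
assert disk["p"] == p
```

This matches the observation that another scrub silently fixes the
parity, since scrub recomputes parity whenever the data csums check out.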
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
At 09/21/2016 03:35 PM, Tomasz Torcz wrote:
> On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:
>> Hi,
>>
>> For this well-known bug, is there any one fixing it?
>>
>> It can't be more frustrating finding some one has already worked on
>> it after spending days digging.
>>
>> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to
>> implement btrfsck scrub support; at least we can use btrfsck to fix
>> bad stripes before the kernel fix.
>
> Why wouldn't you fix the in-kernel code? Why implement duplicate
> functionality when you can fix the root cause?

We'll fix the in-kernel code.

The fsck one is not a duplicate; we need a better standard to compare
the kernel behavior with.

Just like the qgroup fix in btrfsck, if the kernel can't handle
something well, we do need to fix the kernel, but a good off-line
fixer won't hurt.
(Btrfs-progs is much easier to implement in, gets a fast review/merge
cycle, and it can help us to find a better solution before screwing
the kernel up again)

Thanks,
Qu
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:
> Hi,
>
> For this well-known bug, is there any one fixing it?
>
> It can't be more frustrating finding some one has already worked on
> it after spending days digging.
>
> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to
> implement btrfsck scrub support; at least we can use btrfsck to fix
> bad stripes before the kernel fix.

Why wouldn't you fix the in-kernel code? Why implement duplicate
functionality when you can fix the root cause?

--
Tomasz Torcz
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
Hi,

For this well-known bug, is there anyone fixing it?

It can't be more frustrating to find that someone has already worked on it after spending days digging.

BTW, since kernel scrub is somewhat broken for raid5/6, I'd like to implement btrfsck scrub support; at least we can use btrfsck to fix bad stripes before the kernel fix arrives.

Thanks,
Qu

At 06/25/2016 08:21 PM, Goffredo Baroncelli wrote:
> Hi all,
>
> Following the thread "Adventures in btrfs raid5 disk recovery", I investigated a bit the BTRFS capability to scrub a corrupted raid5 filesystem. To test it, I first found where a file was stored, and then I tried to corrupt the data disks (when unmounted) or the parity disk. The result showed that sometimes the kernel recomputed the parity wrongly.
>
> I tested the following kernels:
> - 4.6.1
> - 4.5.4
> and both showed the same behavior.
>
> The test was performed as described below:
>
> 1) Create a filesystem in raid5 mode (for data and metadata) of 1500MB:
>
>     truncate -s 500M disk1.img; losetup -f disk1.img
>     truncate -s 500M disk2.img; losetup -f disk2.img
>     truncate -s 500M disk3.img; losetup -f disk3.img
>     sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
>     sudo mount /dev/loop0 mnt/
>
> 2) Create a file with a length of 128kb:
>
>     python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
>     sudo umount mnt/
>
> 3) Looking at the output of 'btrfs-debug-tree /dev/loop0' I was able to find where the file stripe is located:
>
>     /dev/loop0: offset=81788928+16*4096 (64k, second half of the file: 'bd...')
>     /dev/loop1: offset=61865984+16*4096 (64k, first half of the file: 'ad...')
>     /dev/loop2: offset=61865984+16*4096 (64k, parity: '\x03\x00\x03\x03\x03...')
>
> 4) I tried to corrupt each disk (one disk per test), and then ran a scrub; for example for the disk /dev/loop2:
>
>     sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
>         seek=$((61865984+16*4096)) count=5
>     sudo mount /dev/loop0 mnt
>     sudo btrfs scrub start mnt/.
> 5) I checked the disks at the offsets above, to verify that the data/parity is correct.
>
> However I found that:
>
> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any corruption, but recomputes the parity (always correctly);
>
> 2) when I corrupted the other disks (/dev/loop[01]), btrfs was able to find the corruption. But I found two main behaviors:
>
> 2.a) the kernel repaired the damage, but computed the wrong parity. Where the parity should be, the kernel copied the data of the second disk onto the parity disk.
>
> 2.b) the kernel repaired the damage, and rebuilt a correct parity.
>
> I have to point out another strange thing: in dmesg I found two kinds of messages:
>
> msg1)
> [ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
> [ 1021.366949] BTRFS: has skinny extents
> [ 1021.399208] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
> [ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> [ 1021.399291] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0
>
> msg2)
> [ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
> [ 1017.435074] BTRFS: has skinny extents
> [ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> [ 1017.463403] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
> [ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> [ 1017.463467] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
> [ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
> [ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) error at logical 142802944 on dev /dev/loop0
> [ 1017.463535] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0
>
> but these seem to be UNrelated to the kernel behavior 2.a) or 2.b).
>
> Another strangeness is that scrub sometimes reports
>     ERROR: there are uncorrectable errors
> and sometimes reports
>     WARNING: errors detected during scrubbing, corrected
> but these also seem UNrelated to the behavior 2.a) or 2.b), or to msg1 or msg2.
>
> Enclosed you can find the script which I used to trigger the bug. I have to rerun it several times to show the problem, because it doesn't happen every time. Pay attention that the offset and the loop device names are hard coded. You must run the script in the same directory where it is, e.g. "bash test.sh".
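For reference, the absolute byte offsets that the dd corruption step writes to follow from the extent start reported by btrfs-debug-tree plus 16 pages of 4096 bytes. A tiny sketch of the arithmetic (the helper name is illustrative, not part of the posted script):

```shell
# Hypothetical helper: compute the absolute byte offset to corrupt,
# given the extent start printed by btrfs-debug-tree and a page count.
corrupt_offset() { # corrupt_offset BASE PAGES
    echo $(( $1 + $2 * 4096 ))
}

corrupt_offset 61865984 16   # -> 61931520 (data1/parity devices)
corrupt_offset 81788928 16   # -> 81854464 (data2 device)
```

These are exactly the disk:off values that show up later in the thread's test output (61931520 and 81854464).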
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>> most of the time, it seems that btrfs-raid5 is not capable of rebuilding parity and data. Worse, the message returned by scrub is incoherent with the status on the disk. The tests didn't fail every time; this complicates the diagnosis. However my script fails most of the time.

Have you opened a bug ticket at http://bugzilla.kernel.org/? I think that would be useful for tracking.

Philip
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
Hi,

On 2016-07-16 17:51, Jarkko Lavinen wrote:
> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>> Using "btrfs insp phy" I developed a script to trigger the bug.
>
> Thank you for the script, and to all for sharing the raid5 and scrubbing issues. I have been using two raid5 arrays and ran scrub occasionally without any problems lately, and had been in false confidence. I converted the raid5 arrays into raid10 successfully, without any glitch.
>
> I tried to modify the shell script so that instead of corrupting data with dd, a simulated bad block is created with device mapper. Modern disks are likely to either return the correct data or an error if they cannot.

You are right; but doing so we complicate the test case further:
- my tests show what happens when there is a corruption but the drive behaves well;
- your tests show what happens when there is a corruption AND the drive has a failure.

I agree that your simulation is more realistic, but I fear that doing so we complicate the bug hunting.

> The modified script behaves very much like the original dd version. With the dd version I see wrong data instead of the expected data.

When you say "I see wrong data", do you mean 1) with "cat mnt/out.txt", or 2) with "dd if=/dev/loop..."? In the first case I always see good data; in the second case I see wrong data, but of course no read error.

> With the simulated bad block I see no data at all instead of the expected data, since dd quits on the read error.
>
> Jarkko Lavinen

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
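The distinction between the two checks above matters: reading through the file goes through checksum verification and parity rebuild, so it can look clean while the platter is still bad; reading the raw device shows what is actually stored. A minimal sketch of the raw check (the read_raw helper is my illustration, not from the posted scripts):

```shell
# Hypothetical helper: dump COUNT bytes at byte OFFSET of a device or
# file as a hex string, the way the test scripts compare on-platter
# content. Note this bypasses btrfs entirely, so no csum/parity repair
# can mask a corruption.
read_raw() { # read_raw DEV OFFSET COUNT
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | od -An -tx1 | tr -d ' \n'
}
```

On the test filesystem one would run e.g. `read_raw /dev/loop1 61931520 5` and compare against the expected 'adaaa' bytes (hex 6164616161).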
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sat, Jul 16, 2016 at 06:51:11PM +0300, Jarkko Lavinen wrote:
> The modified script behaves very much like the original dd version.

Not quite. The bad sector simulation works like old hard drives without error correction and bad block remapping. This changes the error behaviour.

My script now prints kernel messages once check_fs fails. The time range of the messages is from the adding of the bad sector device to the point when check_fs fails.

The parity test, which often passes with Goffredo's script, always fails with my bad sector version, and scrub says the error is uncorrectable. In the kernel messages there are two buffer IO read errors but no write error, as if scrub quits before writing?

In the data2 test scrub again says the error is uncorrectable, but according to the kernel messages the bad sector is read 4 times and written twice during the scrub. In my bad sector script data2 is still corrupted and the parity is ok, since the bad sector cannot be written and scrub likely quits earlier than in Goffredo's script. In his script data2 gets fixed but the parity gets corrupted.
Jarkko Lavinen

$ bash h2.sh
--- test 1: corrupt parity
scrub started on mnt/., fsid 2625e2d0-420c-40b6-befa-97fc18eaed48 (pid=32490)
ERROR: there are uncorrectable errors
*** Wrong data on disk:off /dev/mapper/loop0:61931520 (parity)
    Data read ||, expected |0300 0303|
Kernel messages in the test
  First Check_fs started
  Buffer I/O error on dev dm-0, logical block 15120, async page read
  Scrub started
  Second Check_fs started
  Buffer I/O error on dev dm-0, logical block 15120, async page read

--- test 2: corrupt data2
scrub started on mnt/., fsid 8e506268-16c7-48fa-b176-0a8877f2a7aa (pid=434)
ERROR: there are uncorrectable errors
*** Wrong data on disk:off /dev/mapper/loop2:81854464 (data2)
    Data read ||, expected |bdbbb|
Kernel messages in the test
  First Check_fs started
  Buffer I/O error on dev dm-2, logical block 19984, async page read
  Scrub started
  BTRFS warning (device dm-0): i/o error at logical 142802944 on dev /dev/mapper/loop2, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
  BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
  BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
  BTRFS warning (device dm-0): i/o error at logical 142802944 on dev /dev/mapper/loop2, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
  BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
  BTRFS error (device dm-0): unable to fixup (regular) error at logical 142802944 on dev /dev/mapper/loop2
  BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 2, flush 0, corrupt 0, gen 0
  BTRFS error (device dm-0): unable to fixup (regular) error at logical 142802944 on dev /dev/mapper/loop2
  BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 3, flush 0, corrupt 0, gen 0
  BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 4, flush 0, corrupt 0, gen 0
  Second Check_fs started
  BTRFS info (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 4, flush 0, corrupt 0, gen 0
  Buffer I/O error on dev dm-2, logical block 19984, async page read

--- test 3: corrupt data1
scrub started on mnt/., fsid f8a4ecca-2475-4e5e-9651-65d9478b56fe (pid=856)
ERROR: there are uncorrectable errors
*** Wrong data on disk:off /dev/mapper/loop1:61931520 (data1)
    Data read ||, expected |adaaa|
Kernel messages in the test
  First Check_fs started
  Buffer I/O error on dev dm-1, logical block 15120, async page read
  Scrub started
  BTRFS warning (device dm-0): i/o error at logical 142737408 on dev /dev/mapper/loop1, sector 120960, root 5, inode 257, offset 0, length 4096, links 1 (path: out.txt)
  BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
  BTRFS warning (device dm-0): i/o error at logical 142737408 on dev /dev/mapper/loop1, sector 120960, root 5, inode 257, offset 0, length 4096, links 1 (path: out.txt)
  BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
  BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
  BTRFS error (device dm-0): unable to fixup (regular) error at logical 142737408 on dev /dev/mapper/loop1
  BTRFS error (device dm-0): unable to fixup (regular) error at logical 142737408 on dev /dev/mapper/loop1
  BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 3, flush 0, corrupt 0, gen 0
  Second Check_fs started
  BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 4, flush 0, corrupt 0, gen 0
  BTRFS info (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 4, flush 0, corrupt 0, gen 0
  Buffer I/O error on dev dm-1, logical block 15120, async page read

--- test 4: corrupt data2; read without scrub
*** Wrong data on disk:off /dev/mapper/loop2:81854464 (d
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
> Using "btrfs insp phy" I developed a script to trigger the bug.

Thank you for the script, and to all for sharing the raid5 and scrubbing issues. I have been using two raid5 arrays and ran scrub occasionally without any problems lately, and had been in false confidence. I converted the raid5 arrays into raid10 successfully, without any glitch.

I tried to modify the shell script so that instead of corrupting data with dd, a simulated bad block is created with device mapper. Modern disks are likely to either return the correct data or an error if they cannot.

The modified script behaves very much like the original dd version. With the dd version I see wrong data instead of the expected data. With the simulated bad block I see no data at all instead of the expected data, since dd quits on the read error.

Jarkko Lavinen

[Attachment: h.sh — Bourne shell script]
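The device-mapper trick can be sketched roughly as follows: wrap the loop device in a dm table that maps everything linearly except a small window mapped to the "error" target, so I/O to that window fails like an unremappable bad sector. The helper name and the 8-sector window size are my assumptions, not taken from Jarkko's h.sh:

```shell
# Hypothetical sketch of the bad-sector simulation: emit a dmsetup table
# that passes DEV through unchanged except for an 8-sector window at
# BAD_SECTOR, which is mapped to the dm "error" target (all I/O fails).
# Sizes and offsets are in 512-byte sectors, as dmsetup expects.
bad_sector_table() { # bad_sector_table DEV SIZE_SECTORS BAD_SECTOR
    local dev=$1 size=$2 bad=$3 span=8
    echo "0 $bad linear $dev 0"
    echo "$bad $span error"
    echo "$(( bad + span )) $(( size - bad - span )) linear $dev $(( bad + span ))"
}

# Usage (requires root; device name is illustrative):
#   bad_sector_table /dev/loop2 1024000 159872 | sudo dmsetup create loop2bad
```

The filesystem is then assembled from /dev/mapper/loop2bad instead of the raw loop device, so reads of the chosen stripe fail instead of returning zeros.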
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
15.07.2016 19:29, Chris Mason wrote:
>> However I have to point out that this kind of test is very difficult to do: the file-cache could lead to reading old data, so suggestions about how to flush the cache are welcome (I do some sync, unmount the filesystem and perform "echo 3 >/proc/sys/vm/drop_caches", but sometimes it seems not enough).
>
> O_DIRECT should handle the cache flushing for you.

There is also the BLKFLSBUF ioctl (blockdev --flushbufs at the shell level).
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
On 2016-07-15 06:39, Andrei Borzenkov wrote:
> 15.07.2016 00:20, Chris Mason wrote:
>> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>>> Hi All,
>>>
>>> I developed a new btrfs command "btrfs insp phy"[1] to further investigate this bug [2]. Using "btrfs insp phy" I developed a script to trigger the bug. The bug is not always triggered, but most of the time it is.
>>>
>>> Basically the script creates a raid5 filesystem (using three loop devices on three files called disk[123].img); on this filesystem
>
> Are those devices themselves on btrfs? Just to avoid any sort of possible side effects?

Good question. However the files are stored on an ext4 filesystem (but I don't know if this is better or worse).

>>> a file is created. Then using "btrfs insp phy", the physical placement of the data on the device is computed.
>>>
>>> First the script checks that the data is the right one (for data1, data2 and parity), then it corrupts the data:
>>>
>>> test1: the parity is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test goes fine all the times.
>>>
>>> test2: data2 is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test fails most of the time: the data on the disk is not correct; the parity is wrong. Scrub sometimes reports "WARNING: errors detected during scrubbing, corrected" and sometimes reports "ERROR: there are uncorrectable errors". But this seems unrelated to whether the data is corrupted or not.
>>>
>>> test3: like test2, but data1 is corrupted. The results are the same as above.
>>>
>>> test4: data2 is corrupted, then the file is read. The system doesn't return an error (the data seems to be fine); but the data2 on the disk is still corrupted.
>>>
>>> Note: data1, data2, parity are the disk elements of the raid5 stripe.
>>>
>>> Conclusion:
>>>
>>> most of the time, it seems that btrfs-raid5 is not capable of rebuilding parity and data. Worse, the message returned by scrub is incoherent with the status on the disk. The tests didn't fail every time; this complicates the diagnosis. However my script fails most of the time.
>>
>> Interesting, thanks for taking the time to write this up. Is the failure specific to scrub? Or is parity rebuild in general also failing in this case?
>
> How do you rebuild parity without scrub as long as all devices appear to be present?

I corrupted the data, then I read the file. The data has to be correct on the basis of the parity. Even in this case I found problems.
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
On 07/15/2016 12:28 PM, Goffredo Baroncelli wrote:
> On 2016-07-14 23:20, Chris Mason wrote:
>> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>>> Hi All,
>>>
>>> I developed a new btrfs command "btrfs insp phy"[1] to further investigate this bug [2]. Using "btrfs insp phy" I developed a script to trigger the bug. The bug is not always triggered, but most of the time it is.
>>>
>>> Basically the script creates a raid5 filesystem (using three loop devices on three files called disk[123].img); on this filesystem a file is created. Then using "btrfs insp phy", the physical placement of the data on the device is computed.
>>>
>>> First the script checks that the data is the right one (for data1, data2 and parity), then it corrupts the data:
>>>
>>> test1: the parity is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test goes fine all the times.
>>>
>>> test2: data2 is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test fails most of the time: the data on the disk is not correct; the parity is wrong. Scrub sometimes reports "WARNING: errors detected during scrubbing, corrected" and sometimes reports "ERROR: there are uncorrectable errors". But this seems unrelated to whether the data is corrupted or not.
>>>
>>> test3: like test2, but data1 is corrupted. The results are the same as above.
>>>
>>> test4: data2 is corrupted, then the file is read. The system doesn't return an error (the data seems to be fine); but the data2 on the disk is still corrupted.
>>>
>>> Note: data1, data2, parity are the disk elements of the raid5 stripe.
>>>
>>> Conclusion:
>>>
>>> most of the time, it seems that btrfs-raid5 is not capable of rebuilding parity and data. Worse, the message returned by scrub is incoherent with the status on the disk. The tests didn't fail every time; this complicates the diagnosis. However my script fails most of the time.
>>
>> Interesting, thanks for taking the time to write this up. Is the failure specific to scrub? Or is parity rebuild in general also failing in this case?
>
> Test #4 handles this case: I corrupt the data, and when I read it the data is good. So parity is used, but the data on the platter is still bad.
>
> However I have to point out that this kind of test is very difficult to do: the file-cache could lead to reading old data, so suggestions about how to flush the cache are welcome (I do some sync, unmount the filesystem and perform "echo 3 >/proc/sys/vm/drop_caches", but sometimes it seems not enough).

O_DIRECT should handle the cache flushing for you.

-chris
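Chris's O_DIRECT suggestion can be sketched as follows; the reread_nocache helper is my illustration, not from the thread, and it falls back to the sync/drop_caches sequence Goffredo describes when the filesystem rejects O_DIRECT:

```shell
# Hypothetical helper: re-read FILE while bypassing the page cache, so a
# test sees what a fresh read would return rather than cached pages.
# dd's iflag=direct opens the file with O_DIRECT; some filesystems
# (e.g. tmpfs) reject O_DIRECT, in which case we fall back to
# sync + drop_caches (the fallback needs root to write drop_caches).
reread_nocache() { # reread_nocache FILE
    local f=$1
    if dd if="$f" of=/dev/null bs=4096 iflag=direct 2>/dev/null; then
        echo "read via O_DIRECT"
    else
        sync
        echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null 2>&1 || true
        cat "$f" >/dev/null
        echo "read after drop_caches"
    fi
}
```

In the test scripts this would replace the plain `cat mnt/out.txt` step, reducing the chance that a stale cached page masks on-disk corruption.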
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
On 2016-07-14 23:20, Chris Mason wrote:
> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>> Hi All,
>>
>> I developed a new btrfs command "btrfs insp phy"[1] to further investigate this bug [2]. Using "btrfs insp phy" I developed a script to trigger the bug. The bug is not always triggered, but most of the time it is.
>>
>> Basically the script creates a raid5 filesystem (using three loop devices on three files called disk[123].img); on this filesystem a file is created. Then using "btrfs insp phy", the physical placement of the data on the device is computed.
>>
>> First the script checks that the data is the right one (for data1, data2 and parity), then it corrupts the data:
>>
>> test1: the parity is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test goes fine all the times.
>>
>> test2: data2 is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test fails most of the time: the data on the disk is not correct; the parity is wrong. Scrub sometimes reports "WARNING: errors detected during scrubbing, corrected" and sometimes reports "ERROR: there are uncorrectable errors". But this seems unrelated to whether the data is corrupted or not.
>>
>> test3: like test2, but data1 is corrupted. The results are the same as above.
>>
>> test4: data2 is corrupted, then the file is read. The system doesn't return an error (the data seems to be fine); but the data2 on the disk is still corrupted.
>>
>> Note: data1, data2, parity are the disk elements of the raid5 stripe.
>>
>> Conclusion:
>>
>> most of the time, it seems that btrfs-raid5 is not capable of rebuilding parity and data. Worse, the message returned by scrub is incoherent with the status on the disk. The tests didn't fail every time; this complicates the diagnosis. However my script fails most of the time.
>
> Interesting, thanks for taking the time to write this up. Is the failure specific to scrub? Or is parity rebuild in general also failing in this case?

Test #4 handles this case: I corrupt the data, and when I read it the data is good. So parity is used, but the data on the platter is still bad.

However I have to point out that this kind of test is very difficult to do: the file-cache could lead to reading old data, so suggestions about how to flush the cache are welcome (I do some sync, unmount the filesystem and perform "echo 3 >/proc/sys/vm/drop_caches", but sometimes it seems not enough).

BR
G.Baroncelli
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
On 07/15/2016 11:10 AM, Andrei Borzenkov wrote:
> 15.07.2016 16:20, Chris Mason wrote:
>>>> Interesting, thanks for taking the time to write this up. Is the failure specific to scrub? Or is parity rebuild in general also failing in this case?
>>>
>>> How do you rebuild parity without scrub as long as all devices appear to be present?
>>
>> If one block is corrupted, the crcs will fail and the kernel will rebuild parity when you read the file. You can also use balance instead of scrub.
>
> As we have seen recently, btrfs does not compute, store, or verify checksums of RAID56 parity. So if parity is corrupted, the only way to detect and correct it is to use scrub. Balance may work by side effect, because it simply recomputes parity for new data, but it will not fix wrong parity on existing data.

Ah, I misread your question. Yes, this is definitely where scrub is the best tool. But even if we have to add debugging to force parity recomputation, we should see if the problem is only in scrub or deeper.

-chris
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
15.07.2016 16:20, Chris Mason wrote:
>>> Interesting, thanks for taking the time to write this up. Is the failure specific to scrub? Or is parity rebuild in general also failing in this case?
>>
>> How do you rebuild parity without scrub as long as all devices appear to be present?
>
> If one block is corrupted, the crcs will fail and the kernel will rebuild parity when you read the file. You can also use balance instead of scrub.

As we have seen recently, btrfs does not compute, store, or verify checksums of RAID56 parity. So if parity is corrupted, the only way to detect and correct it is to use scrub. Balance may work by side effect, because it simply recomputes parity for new data, but it will not fix wrong parity on existing data.

I agree that if a data block is corrupted it will be detected, but then you do not need to recompute parity in the first place.
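Since parity carries no checksum of its own, the only way to validate it is to recompute it from the data stripes; for the two-data-disk layout used throughout this thread, P = D1 XOR D2. A toy sketch of that check (the helper name is mine; real stripes would be compared at the stripe offsets, not whole files):

```shell
# Toy RAID5 parity check for the 3-device layout in this thread:
# each parity byte is simply D1[i] XOR D2[i]. The output hex string can
# be compared against a raw dump of the parity device at the same
# stripe offset.
xor_hex() { # xor_hex FILE1 FILE2 -> hex string of XORed bytes
    paste <(od -An -v -tu1 -w1 "$1") <(od -An -v -tu1 -w1 "$2") |
        while read -r a b; do printf '%02x' $(( a ^ b )); done
    echo
}
```

With the thread's test file the data halves start with 'ad...' and 'bd...', so the parity starts 03 00 03 ..., exactly the '\x03\x00\x03\x03\x03' bytes Goffredo observed on /dev/loop2.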
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
On 07/15/2016 12:39 AM, Andrei Borzenkov wrote:
> 15.07.2016 00:20, Chris Mason wrote:
>> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>>> Hi All,
>>>
>>> I developed a new btrfs command "btrfs insp phy"[1] to further investigate this bug [2]. Using "btrfs insp phy" I developed a script to trigger the bug. The bug is not always triggered, but most of the time it is.
>>>
>>> Basically the script creates a raid5 filesystem (using three loop devices on three files called disk[123].img); on this filesystem
>
> Are those devices themselves on btrfs? Just to avoid any sort of possible side effects?
>
>>> a file is created. Then using "btrfs insp phy", the physical placement of the data on the device is computed.
>>>
>>> First the script checks that the data is the right one (for data1, data2 and parity), then it corrupts the data:
>>>
>>> test1: the parity is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test goes fine all the times.
>>>
>>> test2: data2 is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test fails most of the time: the data on the disk is not correct; the parity is wrong. Scrub sometimes reports "WARNING: errors detected during scrubbing, corrected" and sometimes reports "ERROR: there are uncorrectable errors". But this seems unrelated to whether the data is corrupted or not.
>>>
>>> test3: like test2, but data1 is corrupted. The results are the same as above.
>>>
>>> test4: data2 is corrupted, then the file is read. The system doesn't return an error (the data seems to be fine); but the data2 on the disk is still corrupted.
>>>
>>> Note: data1, data2, parity are the disk elements of the raid5 stripe.
>>>
>>> Conclusion:
>>>
>>> most of the time, it seems that btrfs-raid5 is not capable of rebuilding parity and data. Worse, the message returned by scrub is incoherent with the status on the disk. The tests didn't fail every time; this complicates the diagnosis. However my script fails most of the time.
>>
>> Interesting, thanks for taking the time to write this up. Is the failure specific to scrub? Or is parity rebuild in general also failing in this case?
>
> How do you rebuild parity without scrub as long as all devices appear to be present?

If one block is corrupted, the crcs will fail and the kernel will rebuild parity when you read the file. You can also use balance instead of scrub.

-chris
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
15.07.2016 00:20, Chris Mason wrote:
> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>> Hi All,
>>
>> I developed a new btrfs command "btrfs insp phy"[1] to further investigate this bug [2]. Using "btrfs insp phy" I developed a script to trigger the bug. The bug is not always triggered, but most of the time it is.
>>
>> Basically the script creates a raid5 filesystem (using three loop devices on three files called disk[123].img); on this filesystem

Are those devices themselves on btrfs? Just to avoid any sort of possible side effects?

>> a file is created. Then using "btrfs insp phy", the physical placement of the data on the device is computed.
>>
>> First the script checks that the data is the right one (for data1, data2 and parity), then it corrupts the data:
>>
>> test1: the parity is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test goes fine all the times.
>>
>> test2: data2 is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test fails most of the time: the data on the disk is not correct; the parity is wrong. Scrub sometimes reports "WARNING: errors detected during scrubbing, corrected" and sometimes reports "ERROR: there are uncorrectable errors". But this seems unrelated to whether the data is corrupted or not.
>>
>> test3: like test2, but data1 is corrupted. The results are the same as above.
>>
>> test4: data2 is corrupted, then the file is read. The system doesn't return an error (the data seems to be fine); but the data2 on the disk is still corrupted.
>>
>> Note: data1, data2, parity are the disk elements of the raid5 stripe.
>>
>> Conclusion:
>>
>> most of the time, it seems that btrfs-raid5 is not capable of rebuilding parity and data. Worse, the message returned by scrub is incoherent with the status on the disk. The tests didn't fail every time; this complicates the diagnosis.
>> However my script fails most of the time.
>
> Interesting, thanks for taking the time to write this up. Is the failure specific to scrub? Or is parity rebuild in general also failing in this case?

How do you rebuild parity without scrub as long as all devices appear to be present?
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
> Hi All,
>
> I developed a new btrfs command "btrfs insp phy"[1] to further investigate this bug [2]. Using "btrfs insp phy" I developed a script to trigger the bug. The bug is not always triggered, but most of the time it is.
>
> Basically the script creates a raid5 filesystem (using three loop devices on three files called disk[123].img); on this filesystem a file is created. Then using "btrfs insp phy", the physical placement of the data on the device is computed.
>
> First the script checks that the data is the right one (for data1, data2 and parity), then it corrupts the data:
>
> test1: the parity is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test goes fine all the times.
>
> test2: data2 is corrupted, then scrub is run. Then the (data1, data2, parity) data on the disk are checked. This test fails most of the time: the data on the disk is not correct; the parity is wrong. Scrub sometimes reports "WARNING: errors detected during scrubbing, corrected" and sometimes reports "ERROR: there are uncorrectable errors". But this seems unrelated to whether the data is corrupted or not.
>
> test3: like test2, but data1 is corrupted. The results are the same as above.
>
> test4: data2 is corrupted, then the file is read. The system doesn't return an error (the data seems to be fine); but the data2 on the disk is still corrupted.
>
> Note: data1, data2, parity are the disk elements of the raid5 stripe.
>
> Conclusion:
>
> most of the time, it seems that btrfs-raid5 is not capable of rebuilding parity and data. Worse, the message returned by scrub is incoherent with the status on the disk. The tests didn't fail every time; this complicates the diagnosis. However my script fails most of the time.

Interesting, thanks for taking the time to write this up. Is the failure specific to scrub? Or is parity rebuild in general also failing in this case?
-chris
[BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
        echo Corruption failed
        exit 100
    }
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    sudo mount $loop1 mnt
    sudo btrfs scrub start mnt/.
    sync; sync
    cat mnt/out.txt &>/dev/null || echo "Read FAIL"
    sudo umount mnt
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    check_fs || return 1
    echo "--- test1: OK"
    return 0
}

test_corrupt_data2() {
    echo "--- test 2: corrupt data2"
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    sudo dd 2>/dev/null if=/dev/zero of=$data2_dev bs=1 \
        seek=$data2_off count=5
    check_fs &>/dev/null && {
        echo Corruption failed
        exit 100
    }
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    sudo mount $loop1 mnt
    sudo btrfs scrub start mnt/.
    sync; sync
    cat mnt/out.txt &>/dev/null || echo "Read FAIL"
    sudo umount mnt
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    check_fs || return 1
    echo "--- test2: OK"
    return 0
}

test_corrupt_data1() {
    echo "--- test 3: corrupt data1"
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    sudo dd 2>/dev/null if=/dev/zero of=$data1_dev bs=1 \
        seek=$data1_off count=5
    check_fs &>/dev/null && {
        echo Corruption failed
        exit 100
    }
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    sudo mount $loop1 mnt
    sudo btrfs scrub start mnt/.
    sync; sync
    cat mnt/out.txt &>/dev/null || echo "Read FAIL"
    sudo umount mnt
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    check_fs || return 1
    echo "--- test3: OK"
    return 0
}

test_corrupt_data2_wo_scrub() {
    echo "--- test 4: corrupt data2; read without scrub"
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    sudo dd 2>/dev/null if=/dev/zero of=$data2_dev bs=1 \
        seek=$data2_off count=5
    check_fs &>/dev/null && {
        echo Corruption failed
        exit 100
    }
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    sudo mount $loop1 mnt
    cat mnt/out.txt &>/dev/null || echo "Read FAIL"
    sudo umount mnt
    echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
    check_fs || return 1
    echo "--- test 4: OK"
    return 0
}

for t in test_corrupt_parity test_corrupt_data2 test_corrupt_data1 \
        test_corrupt_data2_wo_scrub; do
    init_fs &>/dev/null
    if ! check_fs &>/dev/null; then
        echo Integrity test failed
        exit 100
    fi
    $t
    echo
done

[1] See email "New btrfs sub command: btrfs inspect physical-find"
[2] See email "[BUG] Btrfs scrub sometime recalculate wrong parity in raid5"

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
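The check_fs helper invoked above is defined earlier in the script (outside this excerpt); conceptually it verifies that each strip on disk holds the expected bytes. A minimal Python sketch of that consistency check, assuming a 3-device btrfs RAID5 stripe where parity is the byte-wise XOR of the two data strips (the strips below are synthetic stand-ins for bytes that would be read from the loop devices at the offsets "btrfs insp phy" reports):

```python
# Sketch only, not the actual test script: in a 3-device btrfs RAID5 stripe,
# the parity strip must equal the byte-wise XOR of the two data strips.

STRIPE_LEN = 64 * 1024  # btrfs RAID5 strip length: 64 KiB per device

def xor_parity(data1: bytes, data2: bytes) -> bytes:
    """Expected parity: byte-wise XOR of the two data strips."""
    return bytes(a ^ b for a, b in zip(data1, data2))

def check_stripe(data1: bytes, data2: bytes, parity: bytes) -> bool:
    """True if the on-disk parity matches what the data strips imply."""
    return xor_parity(data1, data2) == parity

# Synthetic strips standing in for data read from disk[12].img and disk3.img.
data1 = bytes([0xAA]) * STRIPE_LEN
data2 = bytes([0x55]) * STRIPE_LEN
parity = xor_parity(data1, data2)  # every byte is 0xAA ^ 0x55 == 0xFF

print(check_stripe(data1, data2, parity))      # True: stripe is consistent

# Overwrite five bytes of data2 with zeroes, as the script does with dd:
corrupted = bytes(5) + data2[5:]
print(check_stripe(data1, corrupted, parity))  # False: parity no longer matches
```

After scrub runs, the script repeats the same check: a correct repair must restore both the data strip and this XOR relation.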
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
Steven Haigh posted on Mon, 27 Jun 2016 13:21:00 +1000 as excerpted:

> I'd also recommend updates to the ArchLinux wiki - as for some reason I
> always seem to end up there when searching for a certain topic...

Not really btrfs related, but for people using popular search engines, at least, this is likely for two reasons:

1) Arch is apparently the most popular distro among reasonably technically literate users -- those who will both tend to have good technical knowledge and advice on the various real-life issues Linux users tend to encounter, and are likely to post it to an easily publicly indexable forum. (And in that regard, wikis are likely to be more indexable than (web archives of) mailing lists like this, because that's (part of) what wikis /do/ by design: make topics keyword-searchable. Lists, not so much.)

2) Specifically to the point of being _publicly_ indexable, Arch may have a more liberal robots.txt that allows indexers more access than other distros, some of which may limit robot access for performance reasons.

With this combination, Arch's wiki is a natural place for searches to point. So agreed, a high priority on getting the raid56 warning up there on the Arch wiki is a good idea, indeed.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Mon, 2016-06-27 at 07:35 +0300, Andrei Borzenkov wrote:

> The problem is that current implementation of RAID56 puts exactly CoW
> data at risk. I.e. writing new (copy of) data may suddenly make old
> (copy of) data inaccessible, even though it had been safely committed
> to disk and is now in a read-only snapshot.

Sure... mine was just a general point to be added. No checksums => no way to tell which block is valid in case of silent block errors => no way to recover except by chance => should be included as a warning, especially as userland software starts to automatically set nodatacow (IIRC systemd does so), thereby silently breaking functionality (integrity + recoverability) assumed by the user.

Cheers,
Chris
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On 27.06.2016 06:50, Christoph Anton Mitterer wrote:

> What should IMHO be done as well is giving a big fat warning in the
> manpages/etc. that when nodatacow is used RAID recovery cannot produce
> valid data (at least as long as there isn't checksumming implemented
> for nodatacow).

The problem is that the current implementation of RAID56 puts exactly CoW data at risk. I.e. writing new (copy of) data may suddenly make old (copy of) data inaccessible, even though it had been safely committed to disk and is now in a read-only snapshot.
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sun, 2016-06-26 at 15:33 -0700, ronnie sahlberg wrote:

> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings that they really should not use raid56 right
> now for new filesystems.

I doubt most end users can be assumed to read the wiki...

> 2, Instead of a --force flag. (Users tend to ignore --force and
> warnings in documentation.) Instead ifdef out the options to create
> raid56 in mkfs.btrfs. Developers who want to test can just remove the
> ifdef and recompile the tools anyway. But if end-users have to
> recompile userspace, that really forces the point that "you really
> should not use this right now".

Well, if one does --force or --yes-i-know-what-i-do and one actually doesn't, then such a person is on their own. People can always shoot themselves if they want to.

Actually I think that the compile-time way is inferior here. Distros may just always enable raid56 there to allow people to continue mounting their existing filesystems.

What should IMHO be done as well is giving a big fat warning in the manpages/etc. that when nodatacow is used RAID recovery cannot produce valid data (at least as long as there isn't checksumming implemented for nodatacow).

Probably it should also be documented what btrfs does in such a situation. E.g. does it just randomly pick a readable block from one of the copies? Simply error out and consider the file broken? Fill the blocks in question with zero?

Cheers,
Chris.
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On 2016-06-27 08:38, Hugo Mills wrote:
> On Sun, Jun 26, 2016 at 03:33:08PM -0700, ronnie sahlberg wrote:
>> On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>>> Could this explain why people have been reporting so many raid56 mode
>>> cases of btrfs replacing a first drive appearing to succeed just fine,
>>> but then they go to btrfs replace a second drive, and the array crashes
>>> as if the first replace didn't work correctly after all, resulting in two
>>> bad devices once the second replace gets under way, of course bringing
>>> down the array?
>>>
>>> If so, then it looks like we have our answer as to what has been going
>>> wrong that has been so hard to properly trace and thus to bugfix.
>>>
>>> Combine that with the raid4 dedicated parity device behavior you're
>>> seeing if the writes are all exactly 128 MB, with that possibly
>>> explaining the super-slow replaces, and this thread may have just given
>>> us answers to both of those until-now-untraceable issues.
>>>
>>> Regardless, what's /very/ clear by now is that raid56 mode as it
>>> currently exists is more or less fatally flawed, and a full scrap and
>>> rewrite to an entirely different raid56 mode on-disk format may be
>>> necessary to fix it.
>>>
>>> And what's even clearer is that people /really/ shouldn't be using raid56
>>> mode for anything but testing with throw-away data, at this point.
>>> Anything else is simply irresponsible.
>>>
>>> Does that mean we need to put a "raid56 mode may eat your babies" level
>>> warning in the manpage and require a --force to either mkfs.btrfs or
>>> balance to raid56 mode? Because that's about where I am on it.
>>
>> Agree. At this point letting ordinary users create raid56 filesystems
>> is counterproductive.
>>
>> I would suggest:
>>
>> 1, a much more strongly worded warning in the wiki. Make sure there
>> are no misunderstandings that they really should not use raid56 right
>> now for new filesystems.
>
> I beefed up the warnings in several places in the wiki a couple of days
> ago.
Not to sound rude - but I don't think these go anywhere near far enough. It needs to be completely obvious that there's a good chance you'll lose everything. IMHO that's the only way to stop BTRFS from getting the 'data eater' reputation. It can be revisited and reworded when the implementation is more tested and stable.

--
Steven Haigh
Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On 2016-06-27 08:33, ronnie sahlberg wrote:
> On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:
>>
>>> Wow. So it sees the data strip corruption, uses good parity on disk to
>>> fix it, writes the fix to disk, recomputes parity for some reason but
>>> does it wrongly, and then overwrites good parity with bad parity?
>>> That's fucked. So in other words, if there are any errors fixed up
>>> during a scrub, you should do a 2nd scrub. The first scrub should make
>>> sure data is correct, and the 2nd scrub should make sure the bug is
>>> papered over by computing correct parity and replacing the bad parity.
>>>
>>> I wonder if the same problem happens with balance or if this is just a
>>> bug in scrub code?
>>
>> Could this explain why people have been reporting so many raid56 mode
>> cases of btrfs replacing a first drive appearing to succeed just fine,
>> but then they go to btrfs replace a second drive, and the array crashes
>> as if the first replace didn't work correctly after all, resulting in two
>> bad devices once the second replace gets under way, of course bringing
>> down the array?
>>
>> If so, then it looks like we have our answer as to what has been going
>> wrong that has been so hard to properly trace and thus to bugfix.
>>
>> Combine that with the raid4 dedicated parity device behavior you're
>> seeing if the writes are all exactly 128 MB, with that possibly
>> explaining the super-slow replaces, and this thread may have just given
>> us answers to both of those until-now-untraceable issues.
>>
>> Regardless, what's /very/ clear by now is that raid56 mode as it
>> currently exists is more or less fatally flawed, and a full scrap and
>> rewrite to an entirely different raid56 mode on-disk format may be
>> necessary to fix it.
>>
>> And what's even clearer is that people /really/ shouldn't be using raid56
>> mode for anything but testing with throw-away data, at this point.
>> Anything else is simply irresponsible.
>> Does that mean we need to put a "raid56 mode may eat your babies" level
>> warning in the manpage and require a --force to either mkfs.btrfs or
>> balance to raid56 mode? Because that's about where I am on it.
>
> Agree. At this point letting ordinary users create raid56 filesystems
> is counterproductive.

+1

> I would suggest:
>
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings that they really should not use raid56 right
> now for new filesystems.

I voiced my concern on #btrfs about this - it really should show that this may eat your data and is properly experimental. At the moment, it looks as if the features are implemented and working as expected. In my case with nothing out of the ordinary - I've now got ~3.8Tb free disk space. Certainly not ready for *ANY* kind of public use.

> 2, Instead of a --force flag. (Users tend to ignore --force and
> warnings in documentation.) Instead ifdef out the options to create
> raid56 in mkfs.btrfs. Developers who want to test can just remove the
> ifdef and recompile the tools anyway. But if end-users have to
> recompile userspace, that really forces the point that "you really
> should not use this right now".

I think this is a somewhat good idea - however it should be a warning along the lines of:

"BTRFS RAID56 is VERY experimental and is known to corrupt data in certain cases. Use at your own risk! Continue? (y/N):"

> 3, reach out to the documentation and fora for the major distros and
> make sure they update their documentation accordingly. I think a lot
> of end-users, if they try to research something, are more likely to go
> to fora and wiki than search out an upstream fora.

Another good idea. I'd also recommend updates to the ArchLinux wiki - as for some reason I always seem to end up there when searching for a certain topic...
--
Steven Haigh
Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sun, Jun 26, 2016 at 03:33:08PM -0700, ronnie sahlberg wrote:
> On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> > Could this explain why people have been reporting so many raid56 mode
> > cases of btrfs replacing a first drive appearing to succeed just fine,
> > but then they go to btrfs replace a second drive, and the array crashes
> > as if the first replace didn't work correctly after all, resulting in two
> > bad devices once the second replace gets under way, of course bringing
> > down the array?
> >
> > If so, then it looks like we have our answer as to what has been going
> > wrong that has been so hard to properly trace and thus to bugfix.
> >
> > Combine that with the raid4 dedicated parity device behavior you're
> > seeing if the writes are all exactly 128 MB, with that possibly
> > explaining the super-slow replaces, and this thread may have just given
> > us answers to both of those until-now-untraceable issues.
> >
> > Regardless, what's /very/ clear by now is that raid56 mode as it
> > currently exists is more or less fatally flawed, and a full scrap and
> > rewrite to an entirely different raid56 mode on-disk format may be
> > necessary to fix it.
> >
> > And what's even clearer is that people /really/ shouldn't be using raid56
> > mode for anything but testing with throw-away data, at this point.
> > Anything else is simply irresponsible.
> >
> > Does that mean we need to put a "raid56 mode may eat your babies" level
> > warning in the manpage and require a --force to either mkfs.btrfs or
> > balance to raid56 mode? Because that's about where I am on it.
>
> Agree. At this point letting ordinary users create raid56 filesystems
> is counterproductive.
>
> I would suggest:
>
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings that they really should not use raid56 right
> now for new filesystems.

I beefed up the warnings in several places in the wiki a couple of days ago.

Hugo.
> 2, Instead of a --force flag. (Users tend to ignore --force and
> warnings in documentation.)
> Instead ifdef out the options to create raid56 in mkfs.btrfs.
> Developers who want to test can just remove the ifdef and recompile
> the tools anyway.
> But if end-users have to recompile userspace, that really forces the
> point that "you really should not use this right now".
>
> 3, reach out to the documentation and fora for the major distros and
> make sure they update their documentation accordingly.
> I think a lot of end-users, if they try to research something, are
> more likely to go to fora and wiki than search out an upstream fora.

--
Hugo Mills             | "No! My collection of rare, incurable diseases!
hugo@... carfax.org.uk | Violated!"
http://carfax.org.uk/  | PGP: E2AB1DE4 | Stimpson J. Cat, The Ren & Stimpy Show
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:
>
>> Wow. So it sees the data strip corruption, uses good parity on disk to
>> fix it, writes the fix to disk, recomputes parity for some reason but
>> does it wrongly, and then overwrites good parity with bad parity?
>> That's fucked. So in other words, if there are any errors fixed up
>> during a scrub, you should do a 2nd scrub. The first scrub should make
>> sure data is correct, and the 2nd scrub should make sure the bug is
>> papered over by computing correct parity and replacing the bad parity.
>>
>> I wonder if the same problem happens with balance or if this is just a
>> bug in scrub code?
>
> Could this explain why people have been reporting so many raid56 mode
> cases of btrfs replacing a first drive appearing to succeed just fine,
> but then they go to btrfs replace a second drive, and the array crashes
> as if the first replace didn't work correctly after all, resulting in two
> bad devices once the second replace gets under way, of course bringing
> down the array?
>
> If so, then it looks like we have our answer as to what has been going
> wrong that has been so hard to properly trace and thus to bugfix.
>
> Combine that with the raid4 dedicated parity device behavior you're
> seeing if the writes are all exactly 128 MB, with that possibly
> explaining the super-slow replaces, and this thread may have just given
> us answers to both of those until-now-untraceable issues.
>
> Regardless, what's /very/ clear by now is that raid56 mode as it
> currently exists is more or less fatally flawed, and a full scrap and
> rewrite to an entirely different raid56 mode on-disk format may be
> necessary to fix it.
>
> And what's even clearer is that people /really/ shouldn't be using raid56
> mode for anything but testing with throw-away data, at this point.
> Anything else is simply irresponsible.
>
> Does that mean we need to put a "raid56 mode may eat your babies" level
> warning in the manpage and require a --force to either mkfs.btrfs or
> balance to raid56 mode? Because that's about where I am on it.

Agree. At this point letting ordinary users create raid56 filesystems is counterproductive.

I would suggest:

1, a much more strongly worded warning in the wiki. Make sure there are no misunderstandings that they really should not use raid56 right now for new filesystems.

2, Instead of a --force flag. (Users tend to ignore --force and warnings in documentation.) Instead ifdef out the options to create raid56 in mkfs.btrfs. Developers who want to test can just remove the ifdef and recompile the tools anyway. But if end-users have to recompile userspace, that really forces the point that "you really should not use this right now".

3, reach out to the documentation and fora for the major distros and make sure they update their documentation accordingly. I think a lot of end-users, if they try to research something, are more likely to go to fora and wiki than search out an upstream fora.
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sun, Jun 26, 2016 at 3:20 AM, Goffredo Baroncelli wrote:
> On 2016-06-26 00:33, Chris Murphy wrote:
>> On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli wrote:
>>> On 2016-06-25 19:58, Chris Murphy wrote:
>>> [...]
>>>>> Wow. So it sees the data strip corruption, uses good parity on disk to
>>>>> fix it, writes the fix to disk, recomputes parity for some reason but
>>>>> does it wrongly, and then overwrites good parity with bad parity?
>>>>
>>>> The wrong parity, is it valid for the data strips that include the
>>>> (intentionally) corrupt data?
>>>>
>>>> Can parity computation happen before the csum check? Where sometimes you get:
>>>>
>>>> read data strips > compute parity > check csum fails > read good
>>>> parity from disk > fix up the bad data chunk > write wrong parity
>>>> (based on wrong data)?
>>>>
>>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>>>
>>>> Lines 2371-2383 suggest that there's a parity check; it's not always being
>>>> rewritten to disk if it's already correct. But it doesn't know it's
>>>> not correct, it thinks it's wrong, so writes out the wrongly computed
>>>> parity?
>>>
>>> The parity is not valid for both the corrected data and the corrupted data.
>>> It seems that the scrub process copies the contents of disk2 to disk3. It
>>> could happen only if the contents of disk1 are zero.
>>
>> I'm not sure what it takes to hit this exactly. I just tested 3x
>> raid5, with two 128KiB files "a" and "b", so that's a full stripe
>> write for each. I corrupted 64KiB of "a" on devid 1 and 64KiB of "b"
>> on devid 2, did a scrub; the error is detected and corrected, and
>> parity is still correct.
>
> How many times did you try this corruption test? I was unable to
> trigger the bug systematically; on average, every three tests I got
> one failure.

Once. I just did it a 2nd time and both files' parity is wrong now. So I did it several more times. Sometimes both files' parity is bad. Sometimes just one file's parity is bad. Sometimes neither file's parity is bad.
It's a very bad bug, because it is a form of silent data corruption and it's induced by Btrfs. And it's apparently hit non-deterministically. Is this some form of race condition?

Somewhat orthogonal to this is that while Btrfs is subject to the write hole problem, where parity can be wrong, this is detected and warned about. Bad data doesn't propagate up to user space.

This might explain how users are getting hit with corrupt files only after they have a degraded volume. They did a scrub where some fixups happened, but behind the scenes parity was possibly corrupted even though their data was fixed. Later they have a failed device, the bad parity is needed, and now there are a bunch of scary checksum errors with piles of files listed as unrecoverable. And in fact they are unrecoverable, because the bad parity means bad reconstruction, so even scraping the data off with btrfs restore won't work.

--
Chris Murphy
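The failure mode described above comes down to XOR algebra: a lost data strip is rebuilt as D1 = P xor D2, so if scrub has silently written bad parity, a later degraded read rebuilds garbage, and only the checksum tree stops that garbage from reaching user space. A minimal sketch (using zlib's crc32 as a stand-in for the btrfs checksum; the names are illustrative, not btrfs internals):

```python
from zlib import crc32

def rebuild_missing(parity: bytes, surviving: bytes) -> bytes:
    # RAID5 reconstruction of a lost data strip: D1 = P xor D2.
    return bytes(p ^ s for p, s in zip(parity, surviving))

d1, d2 = b"A" * 16, b"B" * 16
good_parity = bytes(a ^ b for a, b in zip(d1, d2))
csum_d1 = crc32(d1)  # stand-in for the checksum btrfs stores for D1

# Healthy case: the device holding D1 is lost; rebuilding from good
# parity reproduces D1 exactly, so the checksum verifies.
print(crc32(rebuild_missing(good_parity, d2)) == csum_d1)  # True

# Scrub-bug scenario: parity was silently rewritten wrongly before the
# device failure. Reconstruction yields garbage, but the stored checksum
# catches it, so the read fails loudly instead of returning bad data.
bad_parity = bytes(b ^ 0xFF for b in good_parity)
print(crc32(rebuild_missing(bad_parity, d2)) == csum_d1)   # False
```

This is why the degraded-mount test above produced I/O errors rather than silently wrong file contents: the data is still lost, but the checksum prevents the bad reconstruction from being trusted.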
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On 2016-06-26 00:33, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli wrote:
>> On 2016-06-25 19:58, Chris Murphy wrote:
>> [...]
>>>> Wow. So it sees the data strip corruption, uses good parity on disk to
>>>> fix it, writes the fix to disk, recomputes parity for some reason but
>>>> does it wrongly, and then overwrites good parity with bad parity?
>>>
>>> The wrong parity, is it valid for the data strips that include the
>>> (intentionally) corrupt data?
>>>
>>> Can parity computation happen before the csum check? Where sometimes you get:
>>>
>>> read data strips > compute parity > check csum fails > read good
>>> parity from disk > fix up the bad data chunk > write wrong parity
>>> (based on wrong data)?
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>>
>>> Lines 2371-2383 suggest that there's a parity check; it's not always being
>>> rewritten to disk if it's already correct. But it doesn't know it's
>>> not correct, it thinks it's wrong, so writes out the wrongly computed
>>> parity?
>>
>> The parity is not valid for both the corrected data and the corrupted data.
>> It seems that the scrub process copies the contents of disk2 to disk3. It
>> could happen only if the contents of disk1 are zero.
>
> I'm not sure what it takes to hit this exactly. I just tested 3x
> raid5, with two 128KiB files "a" and "b", so that's a full stripe
> write for each. I corrupted 64KiB of "a" on devid 1 and 64KiB of "b"
> on devid 2, did a scrub; the error is detected and corrected, and
> parity is still correct.

How many times did you try this corruption test?
I was unable to trigger the bug systematically; on average, every three tests I got one failure.

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:

> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?
> That's fucked. So in other words, if there are any errors fixed up
> during a scrub, you should do a 2nd scrub. The first scrub should make
> sure data is correct, and the 2nd scrub should make sure the bug is
> papered over by computing correct parity and replacing the bad parity.
>
> I wonder if the same problem happens with balance or if this is just a
> bug in scrub code?

Could this explain why people have been reporting so many raid56 mode cases of btrfs replacing a first drive appearing to succeed just fine, but then they go to btrfs replace a second drive, and the array crashes as if the first replace didn't work correctly after all, resulting in two bad devices once the second replace gets under way, of course bringing down the array?

If so, then it looks like we have our answer as to what has been going wrong that has been so hard to properly trace and thus to bugfix.

Combine that with the raid4 dedicated parity device behavior you're seeing if the writes are all exactly 128 MB, with that possibly explaining the super-slow replaces, and this thread may have just given us answers to both of those until-now-untraceable issues.

Regardless, what's /very/ clear by now is that raid56 mode as it currently exists is more or less fatally flawed, and a full scrap and rewrite to an entirely different raid56 mode on-disk format may be necessary to fix it.

And what's even clearer is that people /really/ shouldn't be using raid56 mode for anything but testing with throw-away data, at this point. Anything else is simply irresponsible.
Does that mean we need to put a "raid56 mode may eat your babies" level warning in the manpage and require a --force to either mkfs.btrfs or balance to raid56 mode? Because that's about where I am on it.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli wrote:
> On 2016-06-25 19:58, Chris Murphy wrote:
> [...]
>>> Wow. So it sees the data strip corruption, uses good parity on disk to
>>> fix it, writes the fix to disk, recomputes parity for some reason but
>>> does it wrongly, and then overwrites good parity with bad parity?
>>
>> The wrong parity, is it valid for the data strips that include the
>> (intentionally) corrupt data?
>>
>> Can parity computation happen before the csum check? Where sometimes you get:
>>
>> read data strips > compute parity > check csum fails > read good
>> parity from disk > fix up the bad data chunk > write wrong parity
>> (based on wrong data)?
>>
>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>
>> Lines 2371-2383 suggest that there's a parity check; it's not always being
>> rewritten to disk if it's already correct. But it doesn't know it's
>> not correct, it thinks it's wrong, so writes out the wrongly computed
>> parity?
>
> The parity is not valid for both the corrected data and the corrupted data.
> It seems that the scrub process copies the contents of disk2 to disk3. It
> could happen only if the contents of disk1 are zero.

I'm not sure what it takes to hit this exactly. I just tested 3x raid5, with two 128KiB files "a" and "b", so that's a full stripe write for each. I corrupted 64KiB of "a" on devid 1 and 64KiB of "b" on devid 2, did a scrub; the error is detected and corrected, and parity is still correct.

I also tried to corrupt both parities and scrub, and like you I get no messages from scrub in user space or the kernel, but the parity is corrected. The fixup is also not CoW'd. It is an overwrite, which seems unproblematic to me at face value. But?

Next I corrupted the parities, failed one drive, mounted degraded, and read in both files.
If there is a write hole, I should get back corrupt data from parity reconstruction being blindly trusted and wrongly reconstructed.

[root@f24s ~]# cp /mnt/5/* /mnt/1/tmp
cp: error reading '/mnt/5/a128.txt': Input/output error
cp: error reading '/mnt/5/b128.txt': Input/output error
[607594.478720] BTRFS warning (device dm-7): csum failed ino 295 off 0 csum 1940348404 expected csum 650595490
[607594.478818] BTRFS warning (device dm-7): csum failed ino 295 off 4096 csum 463855480 expected csum 650595490
[607594.478869] BTRFS warning (device dm-7): csum failed ino 295 off 8192 csum 3317251692 expected csum 650595490
[607594.479227] BTRFS warning (device dm-7): csum failed ino 295 off 12288 csum 2973611336 expected csum 650595490
[607594.479244] BTRFS warning (device dm-7): csum failed ino 295 off 16384 csum 2556299655 expected csum 650595490
[607594.479254] BTRFS warning (device dm-7): csum failed ino 295 off 20480 csum 1098993191 expected csum 650595490
[607594.479263] BTRFS warning (device dm-7): csum failed ino 295 off 24576 csum 1503293813 expected csum 650595490
[607594.479272] BTRFS warning (device dm-7): csum failed ino 295 off 28672 csum 1538866238 expected csum 650595490
[607594.479282] BTRFS warning (device dm-7): csum failed ino 295 off 36864 csum 2855931166 expected csum 650595490
[607594.479292] BTRFS warning (device dm-7): csum failed ino 295 off 32768 csum 3351364818 expected csum 650595490

So... no write hole? Clearly it must reconstruct from corrupt parity, and then check the csum tree for EXTENT_CSUM; it doesn't match, so the bad data fails to propagate upstream. And it doesn't result in a fixup. Good.

What happens if I umount, make the missing device visible again, and mount not degraded?
[607775.394504] BTRFS error (device dm-7): parent transid verify failed on 18517852160 wanted 143 found 140
[607775.424505] BTRFS info (device dm-7): read error corrected: ino 1 off 18517852160 (dev /dev/mapper/VG-a sector 67584)
[607775.425055] BTRFS info (device dm-7): read error corrected: ino 1 off 18517856256 (dev /dev/mapper/VG-a sector 67592)
[607775.425560] BTRFS info (device dm-7): read error corrected: ino 1 off 18517860352 (dev /dev/mapper/VG-a sector 67600)
[607775.425850] BTRFS info (device dm-7): read error corrected: ino 1 off 18517864448 (dev /dev/mapper/VG-a sector 67608)
[607775.431867] BTRFS error (device dm-7): parent transid verify failed on 16303439872 wanted 145 found 139
[607775.432973] BTRFS info (device dm-7): read error corrected: ino 1 off 16303439872 (dev /dev/mapper/VG-a sector 4262240)
[607775.433438] BTRFS info (device dm-7): read error corrected: ino 1 off 16303443968 (dev /dev/mapper/VG-a sector 4262248)
[607775.433842] BTRFS info (device dm-7): read error corrected: ino 1 off 16303448064 (dev /dev/mapper/VG-a sector 4262256)
[607775.434220] BTRFS info (device dm-7): read error corrected: ino 1 off 16303452160 (dev /dev/mapper/VG-a sector 4262264)
[607775.434847] BTRFS error (device dm-7): parent transid verify failed on 16303456256 wanted 145 found 139
[607775.435972] BTRFS info (device dm-7): read error corrected: ino 1 off 163034562
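The degraded-read behaviour above (EIO instead of silently wrong data) can be modelled in a few lines: with one device missing, the lost strip is rebuilt as D1 = P xor D2, and a checksum stored elsewhere (as in btrfs's csum tree) decides whether the rebuilt strip can be trusted. This is a toy model, not btrfs code; the strip contents and the use of CRC32 here are illustrative assumptions.

```python
# Toy model of degraded-mode RAID5 reconstruction guarded by a checksum.
import zlib

def xor(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR of two equal-length strips.
    return bytes(x ^ y for x, y in zip(a, b))

d1 = bytes([0x61]) * 16           # the lost data strip
d2 = bytes([0x62]) * 16           # the surviving data strip
good_parity = xor(d1, d2)         # parity as originally written
stored_csum = zlib.crc32(d1)      # csum recorded when d1 was written

# Corrupt the parity, then reconstruct the missing strip from it.
bad_parity = bytes([0xff]) + good_parity[1:]
rebuilt = xor(bad_parity, d2)

# The stored checksum catches the bad reconstruction: the read fails
# instead of returning silently wrong data...
assert zlib.crc32(rebuilt) != stored_csum
# ...while reconstruction from intact parity verifies fine.
assert zlib.crc32(xor(good_parity, d2)) == stored_csum
```

This matches the log above: reconstruction happens, the csum comparison fails, and cp gets an I/O error rather than corrupt file contents.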
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On 2016-06-25 19:58, Chris Murphy wrote:
[...]
>> Wow. So it sees the data strip corruption, uses good parity on disk to
>> fix it, writes the fix to disk, recomputes parity for some reason but
>> does it wrongly, and then overwrites good parity with bad parity?
>
> The wrong parity, is it valid for the data strips that include the
> (intentionally) corrupt data?
>
> Can parity computation happen before the csum check? Where sometimes you get:
>
> read data strips > compute parity > check csum fails > read good
> parity from disk > fix up the bad data chunk > write wrong parity
> (based on wrong data)?
>
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>
> Lines 2371-2383 suggest that there's a parity check, and parity is not
> always rewritten to disk if it's already correct. But it doesn't know
> the computed parity is wrong: it thinks the on-disk copy is stale, so
> it writes out the wrongly computed parity?

The parity is not valid for either the corrected data or the corrupted
data. It seems that the scrub process copies the contents of disk2 to
disk3. That could happen only if the contents of disk1 are zero.

BR
GB
--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
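The deduction in the message above (a parity strip equal to D2 is only valid parity when D1 is all zeroes) follows directly from the two-data-strip RAID5 identity P = D1 xor D2, and can be sanity-checked in a few lines. The strip contents below are made up, not the ones from the test filesystem.

```python
# Sanity check: P == D2 is correct parity only when D1 is all zeroes.
def xor(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR of two equal-length strips.
    return bytes(x ^ y for x, y in zip(a, b))

d2 = b"bd" + b"b" * 14

# Non-zero D1: writing a copy of D2 where the parity belongs is wrong.
d1 = b"ad" + b"a" * 14
assert xor(d1, d2) != d2

# All-zero D1 is the only case where P == D2 holds.
d1_zero = bytes(16)
assert xor(d1_zero, d2) == d2
```

Note that the test corrupts disks with dd from /dev/zero, so a "corrupted" D1 that is (partly) zeroed is exactly the situation in which correct parity and a copy of D2 look alike.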
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sat, Jun 25, 2016 at 11:25 AM, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 6:21 AM, Goffredo Baroncelli wrote:
>
>> 5) I check the disks at the offsets above, to verify that the
>> data/parity is correct
>>
>> However I found that:
>> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
>> corruption, but recomputes the parity (always correctly);
>
> This is mostly good news, that it is fixing bad parity during scrub.
> What's not clear, due to the lack of any message, is whether the scrub
> always writes out new parity, or only writes it if there's a mismatch.
>
>> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to
>> find the corruption. But I found two main behaviors:
>>
>> 2.a) the kernel repaired the damage, but computed the wrong parity.
>> Where it was the parity, the kernel copied the data of the second disk
>> onto the parity disk
>
> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?

The wrong parity, is it valid for the data strips that include the
(intentionally) corrupt data?

Can parity computation happen before the csum check? Where sometimes you get:

read data strips > compute parity > check csum fails > read good
parity from disk > fix up the bad data chunk > write wrong parity
(based on wrong data)?

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3

Lines 2371-2383 suggest that there's a parity check, and parity is not
always rewritten to disk if it's already correct. But it doesn't know
the computed parity is wrong: it thinks the on-disk copy is stale, so it
writes out the wrongly computed parity?

--
Chris Murphy
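The ordering hypothesized above (parity computed from the data strips before the csum-triggered fixup, then written out) can be expressed as a toy model. This is speculation about the bug, not a transcription of the raid56.c code path; all values and names are illustrative.

```python
# Toy model of the hypothesized bad ordering in scrub's repair path.
def xor(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR of two equal-length strips.
    return bytes(x ^ y for x, y in zip(a, b))

d1_good = bytes([0x61]) * 16
d2 = bytes([0x62]) * 16
parity_on_disk = xor(d1_good, d2)   # good parity, as originally written
d1_corrupt = bytes(16)              # data strip corrupted with zeroes

# Hypothesized buggy order: parity is computed from the corrupt strip...
stale_parity = xor(d1_corrupt, d2)
# ...then the csum failure triggers a repair of D1 from on-disk parity...
d1_repaired = xor(parity_on_disk, d2)
assert d1_repaired == d1_good       # the data itself is fixed correctly
# ...but the precomputed (stale) parity no longer matches the fixed data.
assert stale_parity != xor(d1_repaired, d2)

# With an all-zero corrupt strip, the stale parity is exactly a copy of
# D2, which matches the "disk2 copied onto disk3" observation.
assert stale_parity == d2
```

Under this model, the data repair and the "copy of disk2 on the parity disk" symptom come out of a single mis-ordered sequence, which would fit both observations in the thread.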
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sat, Jun 25, 2016 at 6:21 AM, Goffredo Baroncelli wrote:
> 5) I check the disks at the offsets above, to verify that the
> data/parity is correct
>
> However I found that:
> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
> corruption, but recomputes the parity (always correctly);

This is mostly good news, that it is fixing bad parity during scrub.
What's not clear, due to the lack of any message, is whether the scrub
always writes out new parity, or only writes it if there's a mismatch.

> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to
> find the corruption. But I found two main behaviors:
>
> 2.a) the kernel repaired the damage, but computed the wrong parity.
> Where it was the parity, the kernel copied the data of the second disk
> onto the parity disk

Wow. So it sees the data strip corruption, uses good parity on disk to
fix it, writes the fix to disk, recomputes parity for some reason but
does it wrongly, and then overwrites good parity with bad parity?
That's fucked.

So in other words, if any errors are fixed up during a scrub, you
should do a second scrub. The first scrub makes sure the data is
correct, and the second scrub papers over the bug by computing correct
parity and replacing the bad parity.

I wonder if the same problem happens with balance, or if this is just a
bug in the scrub code?

> but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)
>
> Another strangeness is that SCRUB sometimes reports
>   ERROR: there are uncorrectable errors
> and sometimes reports
>   WARNING: errors detected during scrubbing, corrected
> but also these seem UNrelated to the behavior 2.a) or 2.b) or msg1 or
> msg2

I've seen this also: errors in user space but no kernel messages.

--
Chris Murphy
[BUG] Btrfs scrub sometime recalculate wrong parity in raid5
Hi all,

following the thread "Adventures in btrfs raid5 disk recovery", I
investigated a bit BTRFS's ability to scrub a corrupted raid5
filesystem. To test it, I first found where a file was stored, and then
I tried to corrupt the data disks (when unmounted) or the parity disk.
The result showed that sometimes the kernel recomputed the parity
wrongly.

I tested the following kernels:
- 4.6.1
- 4.5.4
and both showed the same behavior.

The test was performed as described below:

1) create a filesystem in raid5 mode (for data and metadata) of 1500MB

truncate -s 500M disk1.img; losetup -f disk1.img
truncate -s 500M disk2.img; losetup -f disk2.img
truncate -s 500M disk3.img; losetup -f disk3.img
sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
sudo mount /dev/loop0 mnt/

2) I created a file with a length of 128kb:

python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
sudo umount mnt/

3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was
able to find where the file stripe is located:

/dev/loop0: offset=81788928+16*4096 (64k, second half of the file: 'bd.)
/dev/loop1: offset=61865984+16*4096 (64k, first half of the file: 'ad.)
/dev/loop2: offset=61865984+16*4096 (64k, parity: '\x03\x00\x03\x03\x03.)

4) I tried to corrupt each disk (one disk per test), and then ran a
scrub; for example, for the disk /dev/loop2:

sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
    seek=$((61865984+16*4096)) count=5
sudo mount /dev/loop0 mnt
sudo btrfs scrub start mnt/.

5) I checked the disks at the offsets above, to verify that the
data/parity is correct.

However I found that:

1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
corruption, but recomputes the parity (always correctly);

2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to
find the corruption. But I found two main behaviors:

2.a) the kernel repaired the damage, but computed the wrong parity.
In place of the parity, the kernel copied the data of the second disk
onto the parity disk.

2.b) the kernel repaired the damage, and rebuilt a correct parity.

I have to point out another strange thing: in dmesg I found two kinds
of messages:

msg1)
[ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
[ 1021.366949] BTRFS: has skinny extents
[ 1021.399208] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
[ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[ 1021.399291] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0

msg2)
[ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
[ 1017.435074] BTRFS: has skinny extents
[ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[ 1017.463403] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
[ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[ 1017.463467] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
[ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) error at logical 142802944 on dev /dev/loop0
[ 1017.463535] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0

but these seem to be UNrelated to the kernel behavior 2.a) or 2.b).

Another strangeness is that SCRUB sometimes reports
  ERROR: there are uncorrectable errors
and sometimes reports
  WARNING: errors detected during scrubbing, corrected
but also these seem UNrelated to the behavior 2.a) or 2.b) or msg1 or
msg2.

Enclosed you can find the script which I used to trigger the bug. I had
to rerun it several times to show the problem, because it doesn't happen
every time. Pay attention that the offset and the loop device name are
hard coded; you must run the script in the same directory where it is,
e.g. "bash test.sh".

Br
G.Baroncelli
--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

Attachment: test.sh (Bourne shell script)
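The dd seek offsets used in step 4 of the report are just "extent start + 16 pages of 4 KiB". A small helper to recompute them (the extent starts are the values from the btrfs-debug-tree output in the mail; the helper name is made up):

```python
# Recompute the byte offsets passed to dd's seek= in the test above.
PAGE = 4096  # btrfs page/sector size used in the test

def strip_offset(extent_start: int, pages: int = 16) -> int:
    # Offset of the corrupted strip: extent start plus 16 pages.
    return extent_start + pages * PAGE

assert strip_offset(81788928) == 81854464  # /dev/loop0, 2nd half of file
assert strip_offset(61865984) == 61931520  # /dev/loop1 data and /dev/loop2 parity
```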