Re: Trying to rescue my data :(
On 26/06/16 12:30, Duncan wrote:
> Steven Haigh posted on Sun, 26 Jun 2016 02:39:23 +1000 as excerpted:
>
>> In every case, it was a flurry of csum error messages, then instant
>> death.
>
> This is very possibly a known bug in btrfs, that occurs even in raid1,
> where a later scrub repairs all csum errors. While in theory btrfs
> raid1 should simply pull from the mirrored copy if its first try fails
> checksum (assuming the second one passes, of course), and it seems to
> do this just fine if there's only an occasional csum error, if it gets
> too many at once, it *does* unfortunately crash, despite the second
> copy being available and being just fine, as later demonstrated by the
> scrub fixing the bad copy from the good one.
>
> I'm used to dealing with that here any time I have a bad shutdown (and
> I'm running live-git kde, which currently has a bug that triggers a
> system crash if I let it idle and shut off the monitors, so I've been
> getting crash shutdowns and having to deal with this unfortunately
> often, recently). Fortunately I keep my root, with all system
> executables, etc, mounted read-only by default, so it's not affected
> and I can /almost/ boot normally after such a crash. The problem is
> /var/log and /home (which has some parts of /var that need to be
> writable symlinked into /home/var, so / can stay read-only). Something
> in the normal after-crash boot triggers enough csum errors there that
> I often crash again.
>
> So I have to boot to emergency mode and manually mount the filesystems
> in question, so nothing's trying to access them until I run the scrub
> and fix the csum errors. Scrub itself doesn't trigger the crash,
> thankfully, and once it has repaired all the csum errors due to
> partial writes on one mirror that either were never made or were
> properly completed on the other mirror, I can exit emergency mode and
> complete the normal boot (to the multi-user default target). As
> there's no more csum errors then, because scrub fixed them all, the
> boot doesn't crash due to too many such errors, and I'm back in
> business.
>
> Tho I believe at least the csum bug that affects me may only trigger
> if compression is (or perhaps has been in the past) enabled. Since I
> run compress=lzo everywhere, that would certainly affect me. It would
> also explain why the bug has remained around for quite some time as
> well, since presumably the devs don't run with compression on enough
> for this to have become a personal itch they needed to scratch, thus
> its remaining untraced and unfixed.
>
> So if you weren't using the compress option, your bug is probably
> different, but either way, the whole thing about too many csum errors
> at once triggering a system crash sure does sound familiar, here.

Yes, I was running the compress=lzo option as well... Maybe herein lies
a common problem?

-- 
Steven Haigh
Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:

> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?
> That's fucked. So in other words, if there are any errors fixed up
> during a scrub, you should do a 2nd scrub. The first scrub should make
> sure data is correct, and the 2nd scrub should make sure the bug is
> papered over by computing correct parity and replacing the bad parity.
>
> I wonder if the same problem happens with balance or if this is just a
> bug in scrub code?

Could this explain why people have been reporting so many raid56-mode
cases where btrfs replace of a first drive appears to succeed just fine,
but then they go to btrfs replace a second drive, and the array crashes
as if the first replace didn't work correctly after all, resulting in
two bad devices once the second replace gets under way, of course
bringing down the array?

If so, then it looks like we have our answer as to what has been going
wrong that has been so hard to properly trace and thus to bugfix.

Combine that with the raid4-like dedicated-parity-device behavior you're
seeing when the writes are all exactly 128 KiB, with that possibly
explaining the super-slow replaces, and this thread may have just given
us answers to both of those until-now-untraceable issues.

Regardless, what's /very/ clear by now is that raid56 mode as it
currently exists is more or less fatally flawed, and a full scrap and
rewrite to an entirely different raid56-mode on-disk format may be
necessary to fix it.

And what's even clearer is that people /really/ shouldn't be using
raid56 mode for anything but testing with throw-away data, at this
point. Anything else is simply irresponsible.

Does that mean we need to put a "raid56 mode may eat your babies" level
warning in the manpage and require a --force to either mkfs.btrfs or
balance to raid56 mode? Because that's about where I am on it.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Trying to rescue my data :(
Steven Haigh posted on Sun, 26 Jun 2016 02:39:23 +1000 as excerpted:

> In every case, it was a flurry of csum error messages, then instant
> death.

This is very possibly a known bug in btrfs, that occurs even in raid1,
where a later scrub repairs all csum errors. While in theory btrfs raid1
should simply pull from the mirrored copy if its first try fails
checksum (assuming the second one passes, of course), and it seems to do
this just fine if there's only an occasional csum error, if it gets too
many at once, it *does* unfortunately crash, despite the second copy
being available and being just fine, as later demonstrated by the scrub
fixing the bad copy from the good one.

I'm used to dealing with that here any time I have a bad shutdown (and
I'm running live-git kde, which currently has a bug that triggers a
system crash if I let it idle and shut off the monitors, so I've been
getting crash shutdowns and having to deal with this unfortunately
often, recently). Fortunately I keep my root, with all system
executables, etc, mounted read-only by default, so it's not affected and
I can /almost/ boot normally after such a crash. The problem is /var/log
and /home (which has some parts of /var that need to be writable
symlinked into /home/var, so / can stay read-only). Something in the
normal after-crash boot triggers enough csum errors there that I often
crash again.

So I have to boot to emergency mode and manually mount the filesystems
in question, so nothing's trying to access them until I run the scrub
and fix the csum errors. Scrub itself doesn't trigger the crash,
thankfully, and once it has repaired all the csum errors due to partial
writes on one mirror that either were never made or were properly
completed on the other mirror, I can exit emergency mode and complete
the normal boot (to the multi-user default target). As there's no more
csum errors then, because scrub fixed them all, the boot doesn't crash
due to too many such errors, and I'm back in business.

Tho I believe at least the csum bug that affects me may only trigger if
compression is (or perhaps has been in the past) enabled. Since I run
compress=lzo everywhere, that would certainly affect me. It would also
explain why the bug has remained around for quite some time as well,
since presumably the devs don't run with compression on enough for this
to have become a personal itch they needed to scratch, thus its
remaining untraced and unfixed.

So if you weren't using the compress option, your bug is probably
different, but either way, the whole thing about too many csum errors at
once triggering a system crash sure does sound familiar, here.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
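The raid1 read path Duncan describes (try one copy, checksum it, fall
back to the mirror only on a mismatch) can be sketched in a few lines.
This is a toy model, not btrfs code: zlib.crc32 stands in for btrfs's
crc32c, and the function name is invented for illustration.

```python
import zlib

def read_with_fallback(copies, expected_csum):
    """Try each mirror in turn; return the first copy whose checksum
    matches, or None when every copy is bad (btrfs would return EIO)."""
    for data in copies:
        if zlib.crc32(data) == expected_csum:
            return data
    return None

good = b"file contents"
bad = b"file cont3nts"          # one flipped byte
csum = zlib.crc32(good)

# First copy corrupt, second intact: the mirror saves the read.
assert read_with_fallback([bad, good], csum) == good
# Both copies corrupt: the read fails.
assert read_with_fallback([bad, bad], csum) is None
```

The bug described in the thread is that this per-read fallback works for
occasional errors but the kernel crashes when too many mismatches arrive
at once, even though the second copy would have satisfied each of them.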
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli wrote:
> On 2016-06-25 19:58, Chris Murphy wrote:
> [...]
>>> Wow. So it sees the data strip corruption, uses good parity on disk
>>> to fix it, writes the fix to disk, recomputes parity for some reason
>>> but does it wrongly, and then overwrites good parity with bad parity?
>>
>> The wrong parity, is it valid for the data strips that include the
>> (intentionally) corrupt data?
>>
>> Can parity computation happen before the csum check? Where sometimes
>> you get:
>>
>> read data strips > compute parity > check csum fails > read good
>> parity from disk > fix up the bad data chunk > write wrong parity
>> (based on wrong data)?
>>
>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>
>> Lines 2371-2383 suggest that there's a parity check, and parity is
>> not always rewritten to disk if it's already correct. But it doesn't
>> know it's not correct, it thinks it's wrong, so writes out the
>> wrongly computed parity?
>
> The parity is not valid for either the corrected data or the corrupted
> data. It seems that the scrub process copies the contents of disk2 to
> disk3. It could happen only if the contents of disk1 are zero.

I'm not sure what it takes to hit this exactly. I just tested 3x raid5,
with two files, 128KiB "a" and 128KiB "b", so that's a full stripe write
for each. I corrupted 64KiB of "a" on devid 1 and 64KiB of "b" on devid
2, did a scrub; the error is detected and corrected, and parity is still
correct.

I also tried to corrupt both parities and scrub, and like you I get no
messages from scrub in user space or kernel, but the parity is
corrected. The fixup is also not cow'd. It is an overwrite, which seems
unproblematic to me at face value. But?

Next I corrupted parities, failed one drive, mounted degraded, and read
in both files. If there is a write hole, I should get back corrupt data
from parity reconstruction blindly being trusted and wrongly
reconstructed.

[root@f24s ~]# cp /mnt/5/* /mnt/1/tmp
cp: error reading '/mnt/5/a128.txt': Input/output error
cp: error reading '/mnt/5/b128.txt': Input/output error

[607594.478720] BTRFS warning (device dm-7): csum failed ino 295 off 0 csum 1940348404 expected csum 650595490
[607594.478818] BTRFS warning (device dm-7): csum failed ino 295 off 4096 csum 463855480 expected csum 650595490
[607594.478869] BTRFS warning (device dm-7): csum failed ino 295 off 8192 csum 3317251692 expected csum 650595490
[607594.479227] BTRFS warning (device dm-7): csum failed ino 295 off 12288 csum 2973611336 expected csum 650595490
[607594.479244] BTRFS warning (device dm-7): csum failed ino 295 off 16384 csum 2556299655 expected csum 650595490
[607594.479254] BTRFS warning (device dm-7): csum failed ino 295 off 20480 csum 1098993191 expected csum 650595490
[607594.479263] BTRFS warning (device dm-7): csum failed ino 295 off 24576 csum 1503293813 expected csum 650595490
[607594.479272] BTRFS warning (device dm-7): csum failed ino 295 off 28672 csum 1538866238 expected csum 650595490
[607594.479282] BTRFS warning (device dm-7): csum failed ino 295 off 36864 csum 2855931166 expected csum 650595490
[607594.479292] BTRFS warning (device dm-7): csum failed ino 295 off 32768 csum 3351364818 expected csum 650595490

So... no write hole? Clearly it must reconstruct from corrupt parity,
and then check the csum tree for EXTENT_CSUM; that doesn't match, so it
fails the read rather than propagating bad data upstream. And it doesn't
result in a fixup. Good.

What happens if I umount, make the missing device visible again, and
mount not degraded?

[607775.394504] BTRFS error (device dm-7): parent transid verify failed on 18517852160 wanted 143 found 140
[607775.424505] BTRFS info (device dm-7): read error corrected: ino 1 off 18517852160 (dev /dev/mapper/VG-a sector 67584)
[607775.425055] BTRFS info (device dm-7): read error corrected: ino 1 off 18517856256 (dev /dev/mapper/VG-a sector 67592)
[607775.425560] BTRFS info (device dm-7): read error corrected: ino 1 off 18517860352 (dev /dev/mapper/VG-a sector 67600)
[607775.425850] BTRFS info (device dm-7): read error corrected: ino 1 off 18517864448 (dev /dev/mapper/VG-a sector 67608)
[607775.431867] BTRFS error (device dm-7): parent transid verify failed on 16303439872 wanted 145 found 139
[607775.432973] BTRFS info (device dm-7): read error corrected: ino 1 off 16303439872 (dev /dev/mapper/VG-a sector 4262240)
[607775.433438] BTRFS info (device dm-7): read error corrected: ino 1 off 16303443968 (dev /dev/mapper/VG-a sector 4262248)
[607775.433842] BTRFS info (device dm-7): read error corrected: ino 1 off 16303448064 (dev /dev/mapper/VG-a sector 4262256)
[607775.434220] BTRFS info (device dm-7): read error corrected: ino 1 off 16303452160 (dev /dev/mapper/VG-a sector 4262264)
[607775.434847] BTRFS error (device dm-7): parent transid verify failed on 16303456256 wanted 145 found 139
[607775.435972] BTRFS info (device dm-7): read error corrected: ino 1 off 163034562
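The degraded-read behavior in the test above (rebuild the missing strip
from parity, then let the checksum tree veto a bad reconstruction) can
be modeled with plain XOR. This is a hypothetical sketch with invented
names, two data strips plus one parity; real btrfs verifies against
EXTENT_CSUM items using crc32c rather than zlib's crc32.

```python
import zlib

def xor(a, b):
    """Bytewise XOR of two equal-length strips."""
    return bytes(x ^ y for x, y in zip(a, b))

# A 3-device raid5 stripe: two data strips and their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)
csum_d0 = zlib.crc32(d0)  # stands in for the EXTENT_CSUM item

def degraded_read(surviving, parity_strip, expected_csum):
    """Rebuild the missing strip from parity, but let the checksum
    veto a bad reconstruction instead of returning garbage."""
    rebuilt = xor(surviving, parity_strip)
    if zlib.crc32(rebuilt) != expected_csum:
        return None  # csum failed -> EIO, exactly what cp reported
    return rebuilt

# Good parity: the lost strip d0 is reconstructed correctly.
assert degraded_read(d1, parity, csum_d0) == d0
# Corrupt parity: the rebuild is wrong and the csum check catches it.
assert degraded_read(d1, b"ZZZZ", csum_d0) is None
```

This is why the test produced Input/output errors rather than silently
returning data rebuilt from the deliberately corrupted parity.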
Re: Adventures in btrfs raid5 disk recovery
Interestingly enough, so far I'm finding with full stripe writes, i.e.
3x raid5, exactly 128KiB data writes, devid 3 is always parity. This is
raid4.

So... I wonder if some of these slow cases end up with a bunch of
stripes that are effectively raid4-like, and have a lot of parity
overwrites, which is where raid4 suffers due to disk contention. Totally
speculative, as the sample size is too small and distinctly non-random.

Chris Murphy
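The contrast between rotating parity (ordinary raid5; the left-symmetric
layout below is an illustrative assumption, not necessarily what btrfs
uses) and the observed raid4-like pinning can be shown with a toy layout
function:

```python
from collections import Counter

def parity_device(stripe_nr, num_devs):
    """Left-symmetric raid5 rotation: the parity device shifts by one
    each stripe so no single disk absorbs every parity write."""
    return (num_devs - 1 - stripe_nr) % num_devs

rotated = Counter(parity_device(s, 3) for s in range(300))
pinned = Counter(2 for _ in range(300))  # raid4-like: always devid 3

# Rotation spreads parity writes evenly across the three devices;
# pinning sends all 300 to one disk, which is the contention that
# makes raid4 slow under parity-heavy workloads.
assert rotated == Counter({0: 100, 1: 100, 2: 100})
assert pinned == Counter({2: 300})
```

If many stripes really do pin parity on one device, that device sees
roughly N times the parity write traffic of the rotated case, which
would fit the super-slow replace speculation above.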
[PATCH v3 00/24] Delete CURRENT_TIME_SEC and replace current_fs_time()
The series is aimed at getting rid of the CURRENT_TIME and
CURRENT_TIME_SEC macros and replacing current_fs_time() with
current_time(). The macros are not y2038 safe, and there is no plan to
transition them into being y2038 safe. The ktime_get_* APIs can be used
in their place, and these are y2038 safe. CURRENT_TIME will be deleted
after 4.8-rc1, as there is a dependency on the function time64_to_tm()
for one of the CURRENT_TIME occurrences.

Thanks to Arnd Bergmann for all the guidance and discussions.

Patches 3-5 were mostly generated using coccinelle.

All filesystem timestamps use current_fs_time() for the right
granularity, as mentioned in the respective commit texts of the patches.
This function has a changed signature, is renamed to current_time(), and
is moved to fs/inode.c.

This series also serves as a preparatory series to transition vfs to
64-bit timestamps, as outlined here:
https://lkml.org/lkml/2016/2/12/104 . As per Linus's suggestion in
https://lkml.org/lkml/2016/5/24/663 , all the inode timestamp changes
have been squashed into a single patch. Also, current_time() is now used
as the single generic vfs filesystem timestamp API. It also takes
struct inode* as an argument instead of struct super_block*.

Posting all patches together in a bigger series so that the big picture
is clear.

As per the suggestion in https://lwn.net/Articles/672598/, CURRENT_TIME
macro bug fixes are being handled in a series separate from
transitioning vfs to use 64-bit timestamps.

Changes since v2:
* Fix buildbot error for uninitialized sb in inode.
* Minor fixes according to Arnd's comments.
* Leave out the fnic and deletion of CURRENT_TIME to be submitted after
  4.8-rc1.
Deepa Dinamani (24):
  vfs: Add current_time() api
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace current_fs_time() with current_time()
  fs: jfs: Replace CURRENT_TIME_SEC by current_time()
  fs: ext4: Use current_time() for inode timestamps
  fs: ubifs: Replace CURRENT_TIME_SEC with current_time
  fs: btrfs: Use ktime_get_real_ts for root ctime
  fs: udf: Replace CURRENT_TIME with current_time()
  fs: cifs: Replace CURRENT_TIME by current_time()
  fs: cifs: Replace CURRENT_TIME with ktime_get_real_ts()
  fs: cifs: Replace CURRENT_TIME by get_seconds
  fs: f2fs: Use ktime_get_real_seconds for sit_info times
  drivers: staging: lustre: Replace CURRENT_TIME with current_time()
  fs: ocfs2: Use time64_t to represent orphan scan times
  fs: ocfs2: Replace CURRENT_TIME with ktime_get_real_seconds()
  audit: Use timespec64 to represent audit timestamps
  fs: nfs: Make nfs boot time y2038 safe
  block: Replace CURRENT_TIME with ktime_get_real_ts
  libceph: Replace CURRENT_TIME with ktime_get_real_ts
  fs: ceph: Replace current_fs_time for request stamp
  time: Delete current_fs_time() function
  time: Delete CURRENT_TIME_SEC

 arch/powerpc/platforms/cell/spufs/inode.c | 2 +-
 arch/s390/hypfs/inode.c | 4 +--
 drivers/block/rbd.c | 2 +-
 drivers/char/sonypi.c | 2 +-
 drivers/infiniband/hw/qib/qib_fs.c | 2 +-
 drivers/misc/ibmasm/ibmasmfs.c | 2 +-
 drivers/oprofile/oprofilefs.c | 2 +-
 drivers/platform/x86/sony-laptop.c | 2 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c | 16 ++--
 drivers/staging/lustre/lustre/llite/namei.c | 4 +--
 drivers/staging/lustre/lustre/mdc/mdc_reint.c | 6 ++---
 .../lustre/lustre/obdclass/linux/linux-obdo.c | 6 ++---
 drivers/staging/lustre/lustre/obdclass/obdo.c | 6 ++---
 drivers/staging/lustre/lustre/osc/osc_io.c | 2 +-
 drivers/usb/core/devio.c | 18 +++---
 drivers/usb/gadget/function/f_fs.c | 8 +++---
 drivers/usb/gadget/legacy/inode.c | 2 +-
 fs/9p/vfs_inode.c | 2 +-
 fs/adfs/inode.c | 2 +-
 fs/affs/amigaffs.c | 6 ++---
 fs/affs/inode.c | 2 +-
 fs/attr.c | 2 +-
 fs/autofs4/inode.c | 2 +-
 fs/autofs4/root.c | 6 ++---
 fs/bad_inode.c | 2 +-
 fs/bfs/dir.c | 14 +--
 fs/binfmt_misc.c | 2 +-
 fs/btrfs/file.c | 6 ++---
 fs/btrfs/inode.c | 22
 fs/btrfs/ioctl.c | 8 +++---
 fs/btrfs/root-tree.c | 3 ++-
 fs/btrfs/transaction.c | 4 +--
 fs/btrfs/xattr.c | 2 +-
 fs/
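The y2038 problem the cover letter refers to is easy to demonstrate: a
signed 32-bit time_t wraps negative exactly 2^31 seconds after the
epoch. A quick illustration in Python (not kernel code):

```python
import struct
from datetime import datetime, timezone

def to_time32(seconds):
    """Round-trip a timestamp through a signed 32-bit time_t, as a
    non-y2038-safe filesystem field would store it."""
    return struct.unpack("<i", struct.pack("<I", seconds & 0xFFFFFFFF))[0]

# A timestamp in 2030 is still representable in 32 bits.
ok = int(datetime(2030, 1, 1, tzinfo=timezone.utc).timestamp())
assert to_time32(ok) == ok

# 2^31 seconds after the epoch (2038-01-19 03:14:08 UTC) overflows to
# a negative value, i.e. a date back in 1901.
overflow = 2**31
assert to_time32(overflow) == -2**31
```

This overflow is why the series moves callers onto the 64-bit-safe
ktime_get_* helpers instead of the CURRENT_TIME macros.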
[PATCH v3 09/24] fs: btrfs: Use ktime_get_real_ts for root ctime
btrfs_root_item maintains the ctime for root updates. This is not part
of vfs_inode.

Since current_time() uses struct inode* as an argument, as Linus
suggested, it cannot be used to update root times unless we modify the
signature to use inode. Since btrfs uses nanosecond time granularity, it
can also use ktime_get_real_ts directly to obtain the timestamp for the
root. It is necessary to use the timespec time API here because the same
btrfs_set_stack_timespec_*() APIs are used for vfs inode times as well.
These can be transitioned to timespec64 when btrfs internally changes to
use timespec64 as well.

Signed-off-by: Deepa Dinamani
Cc: Chris Mason
Cc: Josef Bacik
Cc: David Sterba
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba
---
 fs/btrfs/root-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index f1c3086..161118b 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -496,10 +496,11 @@ void btrfs_update_root_times(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root)
 {
 	struct btrfs_root_item *item = &root->root_item;
-	struct timespec ct = current_fs_time(root->fs_info->sb);
+	struct timespec ct;
 
 	spin_lock(&root->root_item_lock);
 	btrfs_set_root_ctransid(item, trans->transid);
+	ktime_get_real_ts(&ct);
 	btrfs_set_stack_timespec_sec(&item->ctime, ct.tv_sec);
 	btrfs_set_stack_timespec_nsec(&item->ctime, ct.tv_nsec);
 	spin_unlock(&root->root_item_lock);
-- 
1.9.1
Re: Bad hard drive - checksum verify failure forces readonly mount
On Sat, Jun 25, 2016 at 2:10 PM, Vasco Almeida wrote:
> Citando Chris Murphy :
>>
>> I would do a couple things in order:
>> 1. Mount ro and copy off what you want in case the whole thing gets
>> worse and can't ever be mounted again.
>> 2. Mount with only these options: -o skip_balance,subvolid=5,nospace_cache
>
> I have mounted with those options and it was readwrite first, and then
> it forces readonly. You can see a delay between the first BTRFS
> messages and the "BTRFS info: forced readonly" message in dmesg.
>
> /dev/mapper/vg_pupu-lv_opensuse_root on /mnt type btrfs
> (ro,relatime,seclabel,nospace_cache,skip_balance,subvolid=5,subvol=/)
>
>> If it mounts rw, don't do anything with it, just see if it cleans up
>> after itself. It also looks from the previous trace it was trying to
>> remove a snapshot and there are complaints of problems in that
>> snapshot. So hopefully just waiting 5 minutes doing nothing and it'll
>> clean up after itself (you can check with top to see if there are any
>> btrfs related transactions that run including the btrfs-cleaner
>> process) wait until they're done.
>
> I can see the btrfs processes, including btrfs-cleaner, but they may
> not be doing much since the device was forced readonly after mounting
> it.

Readonly just refers to user space up to and including VFS, is my
understanding. The file system itself can still write to the block
device.

> I have umounted it normally (umount /mnt) after more than 20 minutes
> since mounting it.
>
>> 3. btrfs-image so that devs can see what's causing the problem that
>> the current code isn't handling well enough.
>
> btrfs-image does not create a dump image:
>
> # btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root btrfs-lv_opensuse_root.image
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> Csum didn't match
> Error reading metadata block
> Error adding block -5
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> Csum didn't match
> Error reading metadata block
> Error flushing pending -5
> create failed (Success)
> # echo $?
> 1

Well, it's pretty strange to have DUP metadata and for the checksum
verify to fail on both copies. I don't have much optimism that btrfsck
repair can fix it either, but it's still worth a shot since there's not
much else to go on.

-- 
Chris Murphy
Re: Bad hard drive - checksum verify failure forces readonly mount
Citando Chris Murphy :

> On Fri, Jun 24, 2016 at 6:06 PM, Vasco Almeida wrote:
>> Citando Chris Murphy :
>>
>> dmesg http://paste.fedoraproject.org/384352/80842814/
>
> [ 1837.386732] BTRFS info (device dm-9): continuing balance
> [ 1838.006038] BTRFS info (device dm-9): relocating block group 15799943168 flags 34
> [ 1838.684892] BTRFS info (device dm-9): relocating block group 10934550528 flags 36
> [ 1839.301453] [ cut here ]
> [ 1839.301495] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:1625 lookup_inline_extent_backref+0x45c/0x5a0 [btrfs]()
>
> followed by
>
> [ 1839.301797] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:2946 btrfs_run_delayed_refs+0x29d/0x2d0 [btrfs]()
> [ 1839.301798] BTRFS: Transaction aborted (error -5)
> [...]
> [ 1839.301972] BTRFS: error (device dm-9) in btrfs_run_delayed_refs:2946: errno=-5 IO failure
> [ 1839.301975] BTRFS info (device dm-9): forced readonly
>
> So it looks like it was resuming a balance automatically, and while
> processing delayed references it's running into something it doesn't
> expect and doesn't have a way to fix, so it goes read only to avoid
> causing more problems.
>
> I would do a couple things in order:
> 1. Mount ro and copy off what you want in case the whole thing gets
> worse and can't ever be mounted again.
> 2. Mount with only these options: -o skip_balance,subvolid=5,nospace_cache

I have mounted with those options and it was readwrite first, and then
it forces readonly. You can see a delay between the first BTRFS messages
and the "BTRFS info: forced readonly" message in dmesg.

/dev/mapper/vg_pupu-lv_opensuse_root on /mnt type btrfs
(ro,relatime,seclabel,nospace_cache,skip_balance,subvolid=5,subvol=/)

> If it mounts rw, don't do anything with it, just see if it cleans up
> after itself. It also looks from the previous trace it was trying to
> remove a snapshot and there are complaints of problems in that
> snapshot. So hopefully just waiting 5 minutes doing nothing and it'll
> clean up after itself (you can check with top to see if there are any
> btrfs related transactions that run including the btrfs-cleaner
> process) wait until they're done.

I can see the btrfs processes, including btrfs-cleaner, but they may not
be doing much since the device was forced readonly after mounting it.

> Then umount. If you want you could have two other consoles ready
> first, one for 'journalctl -f' and another for sysrq+t to issue in
> case you get a hang. This doesn't fix anything but it collects more
> information for a bug report for the devs. Once you get it umounted
> normally or by force, the next thing to do is

I have umounted it normally (umount /mnt) after more than 20 minutes
since mounting it.

> 3. btrfs-image so that devs can see what's causing the problem that
> the current code isn't handling well enough.

btrfs-image does not create a dump image:

# btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root btrfs-lv_opensuse_root.image
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
Csum didn't match
Error reading metadata block
Error adding block -5
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
Csum didn't match
Error reading metadata block
Error flushing pending -5
create failed (Success)
# echo $?
1

> 4. btrfs check --repair

Did not issue this command yet.

dmesg http://paste.fedoraproject.org/384799/14668851/

Thank you for helping.
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On 2016-06-25 19:58, Chris Murphy wrote:
[...]
>> Wow. So it sees the data strip corruption, uses good parity on disk to
>> fix it, writes the fix to disk, recomputes parity for some reason but
>> does it wrongly, and then overwrites good parity with bad parity?
>
> The wrong parity, is it valid for the data strips that include the
> (intentionally) corrupt data?
>
> Can parity computation happen before the csum check? Where sometimes
> you get:
>
> read data strips > compute parity > check csum fails > read good
> parity from disk > fix up the bad data chunk > write wrong parity
> (based on wrong data)?
>
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>
> Lines 2371-2383 suggest that there's a parity check, and parity is not
> always rewritten to disk if it's already correct. But it doesn't know
> it's not correct, it thinks it's wrong, so writes out the wrongly
> computed parity?

The parity is not valid for either the corrected data or the corrupted
data. It seems that the scrub process copies the contents of disk2 to
disk3. That could happen only if the contents of disk1 are zero.

BR
GB

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
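Goffredo's inference follows directly from the XOR definition of parity:
for a two-data-strip raid5 stripe, parity = disk1 XOR disk2, so parity
can equal a byte-for-byte copy of disk2 only when the strip on disk1 is
all zeroes. A minimal check:

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

zeros = bytes(4)             # strip on disk1: all zeroes
d2 = b"DATA"                 # strip on disk2

# Parity over a zero strip degenerates to a copy of the other strip.
assert xor(zeros, d2) == d2

# With any non-zero byte on disk1, parity and disk2 differ, so
# "parity looks like disk2" really does imply disk1 was zero-filled.
d1 = b"\x01\x00\x00\x00"
assert xor(d1, d2) != d2
```

That is, observing the scrub "copying disk2 onto disk3" is consistent
with parity being recomputed over a stripe whose disk1 strip is zero.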
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sat, Jun 25, 2016 at 11:25 AM, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 6:21 AM, Goffredo Baroncelli wrote:
>
>> 5) I check the disks at the offsets above, to verify that the
>> data/parity is correct
>>
>> However I found that:
>> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
>> corruption, but recomputes the parity (always correctly);
>
> This is mostly good news, that it is fixing bad parity during scrub.
> What's not clear due to the lack of any message is if the scrub is
> always writing out new parity, or only writes it if there's a
> mismatch.
>
>> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to
>> find the corruption. But I found two main behaviors:
>>
>> 2.a) the kernel repaired the damage, but computed the wrong parity.
>> Where the parity was, the kernel copied the data of the second disk
>> onto the parity disk
>
> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?

The wrong parity, is it valid for the data strips that include the
(intentionally) corrupt data?

Can parity computation happen before the csum check? Where sometimes you
get:

read data strips > compute parity > check csum fails > read good
parity from disk > fix up the bad data chunk > write wrong parity
(based on wrong data)?

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3

Lines 2371-2383 suggest that there's a parity check, and parity is not
always rewritten to disk if it's already correct. But it doesn't know
it's not correct, it thinks it's wrong, so writes out the wrongly
computed parity?

-- 
Chris Murphy
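The suspected mis-ordering (recomputing parity from the strips as first
read, before the csum-driven fixup) would produce exactly the observed
result: correct data, clobbered parity. The following simulates that
hypothesis only; it is not the actual raid56.c logic, and all names are
invented.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0_disk = b"g@rbage!"            # corrupt strip as read from disk
d0_good = b"goodata!"            # what the csum tree says it should be
d1 = b"otherstr"                 # the other data strip, intact
parity_disk = xor(d0_good, d1)   # on-disk parity is still correct

# Hypothesized ordering: parity is recomputed from the strips as first
# read (corrupt d0 included), before the csum check repairs d0.
stale_parity = xor(d0_disk, d1)

# The data fixup itself is right: good on-disk parity rebuilds good d0.
assert xor(parity_disk, d1) == d0_good

# ...but writing out the stale recomputation clobbers good parity,
# matching what Goffredo observed after scrub.
assert stale_parity != parity_disk
```

If this ordering is what happens, a second scrub over now-correct data
would recompute and rewrite valid parity, which is exactly the
"papered over" workaround proposed earlier in the thread.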
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sat, Jun 25, 2016 at 6:21 AM, Goffredo Baroncelli wrote:
> 5) I check the disks at the offsets above, to verify that the
> data/parity is correct
>
> However I found that:
> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
> corruption, but recomputes the parity (always correctly);

This is mostly good news, that it is fixing bad parity during scrub.
What's not clear due to the lack of any message is if the scrub is
always writing out new parity, or only writes it if there's a mismatch.

> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to
> find the corruption. But I found two main behaviors:
>
> 2.a) the kernel repaired the damage, but computed the wrong parity.
> Where the parity was, the kernel copied the data of the second disk
> onto the parity disk

Wow. So it sees the data strip corruption, uses good parity on disk to
fix it, writes the fix to disk, recomputes parity for some reason but
does it wrongly, and then overwrites good parity with bad parity?
That's fucked.

So in other words, if there are any errors fixed up during a scrub, you
should do a 2nd scrub. The first scrub should make sure data is correct,
and the 2nd scrub should make sure the bug is papered over by computing
correct parity and replacing the bad parity.

I wonder if the same problem happens with balance, or if this is just a
bug in the scrub code?

> but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)
>
> Another strangeness is that SCRUB sometimes reports
>   ERROR: there are uncorrectable errors
> and sometimes reports
>   WARNING: errors detected during scrubbing, corrected
>
> but also these seem UNrelated to the behavior 2.a) or 2.b) or msg1 or
> msg2

I've seen this also, errors in user space but no kernel messages.

-- 
Chris Murphy
Re: Trying to rescue my data :(
On Sat, Jun 25, 2016 at 10:39 AM, Steven Haigh wrote:

> Well, I did end up recovering the data that I cared about. I'm not
> really keen to ride the BTRFS RAID6 train again any time soon :\
>
> I now have the same as I've had for years - md RAID6 with XFS on top of
> it. I'm still copying data back to the array from the various sources I
> had to copy it to so I had enough space to do so.

Just make sure you've got each drive's SCT ERC shorter than the kernel SCSI command timer for each block device (/sys/block/<device>/device/timeout), or you can very easily end up with the same, if not worse, problem: total array collapse. The problem is rarer on md raid6 because the extra parity tends to paper over the damage caused by this misconfiguration, but it is a misconfiguration, and it's the default unless you're using enterprise- or NAS-specific drives that ship with short recovery times. The linux-raid@ list is full of problems resulting from this issue.

I think the obvious mistake here, though, is assuming reshapes entail no risk. There's a -f required for a reason. You could have ended up in just as bad a situation doing a reshape without a backup of an md or lvm based array. Yes, it should work, and if it doesn't it's a bug, but how much data do you want to lose today?

> What I find interesting is that the patterns of corruption in the BTRFS
> RAID6 is quite clustered. I have ~80Gb of MP3s ripped over the years -
> of that, the corruption would take out 3-4 songs in a row, then the next
> 10 albums or so were intact. What made recovery VERY hard, is that it
> got to several situations that just caused a complete system hang.

The data stripe size is 64KiB * (number of disks - 2). So in your case I think that's 64KiB * 3 = 192KiB. That's less than the size of one song, so that means roughly 15 bad stripes in a row. That's less than a block group, too.

The Btrfs conversion should be safer than the methods used by mdadm and lvm because the operation is CoW. The raid6 block group is supposed to remain intact and "live", if you will, until the single block group is written to stable media. The full crash set of kernel messages might be useful to find out what was happening that instigated all of this corruption. But even so, the subsequent mount should at worst roll back to a state of block groups with different profiles, where the most recent (failed) conversion still has its raid6 block group intact.

So I'd still say btrfs-image it and host it somewhere, file a bug, cross-reference this thread in the bug, and the bug URL in this thread. It might take months or even a year before a dev looks at it, but that's better than nothing.

> I tried it on bare metal - just in case it was a Xen thing, but it hard
> hung the entire machine then. In every case, it was a flurry of csum
> error messages, then instant death. I would have been much happier if
> the file had been skipped or returned as unavailable instead of having
> the entire machine crash.

Of course. The unanswered question, though, is why are there so many csum errors? Are these metadata csum errors, or are they EXTENT_CSUM errors, and how are they becoming wrong? Wrongly read, wrongly written, wrongly recomputed from parity? How did the parity go bad, if that's the case? It needs an autopsy or it just doesn't get better.

--
Chris Murphy
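The stripe-width arithmetic above can be sketched as a tiny helper (64KiB is the per-device strip size for btrfs raid5/6; the function name is just for illustration):

```shell
# data_stripe_kib NDISKS NPARITY -> KiB of data per full stripe
# (64 KiB per-device strip, minus the parity devices)
data_stripe_kib() {
    echo $(( 64 * ($1 - $2) ))
}

data_stripe_kib 5 2   # the 5-disk raid6 case above: prints 192
data_stripe_kib 3 1   # a 3-disk raid5: prints 128
```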
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn wrote:

> Well, the obvious major advantage that comes to mind for me to checksumming
> parity is that it would let us scrub the parity data itself and verify it.

OK, but hold on. During scrub, it should read data, compute checksums *and* parity, and compare those to what's on disk: EXTENT_CSUM in the checksum tree, and the parity strip in the chunk tree. And if the parity is wrong, then it should be replaced. Even md's "check > sync_action" does this. So, no pun intended, Btrfs isn't even at parity with mdadm on data integrity if it doesn't check whether the parity matches the data.

> I'd personally much rather know my parity is bad before I need to use it
> than after using it to reconstruct data and getting an error there, and I'd
> be willing to bet that most seasoned sysadmins working for companies using
> big storage arrays likely feel the same about it.

That doesn't require parity csums, though. It just requires computing parity during a scrub and comparing it to the parity on disk to make sure they're the same. If they aren't, assuming no other error for that full stripe read, then the parity block is replaced. So that's also something to check in the code, or poke a system with a stick and see what happens.

> I could see it being
> practical to have an option to turn this off for performance reasons or
> similar, but again, I have a feeling that most people would rather be able
> to check if a rebuild will eat data before trying to rebuild (depending on
> the situation in such a case, it will sometimes just make more sense to nuke
> the array and restore from a backup instead of spending time waiting for it
> to rebuild).

The much bigger problem we have right now, affecting Btrfs and LVM/mdadm md raid alike, is the silly bad default: non-enterprise drives with no configurable SCT ERC, with the ensuing long recovery times, and the kernel SCSI command timer at 30 seconds. That actually also fucks over regular single-disk users, because it means they don't get the "benefit" of long recovery times, which is the whole g'd point of that feature. This alone causes so many problems where bad sectors just get worse and never get fixed up, because of all the link resets.

So I still think it's a bullshit default kernel-side, because it affects the majority use case. It's only a non-problem with proprietary hardware raid, and with software raid using enterprise (or NAS-specific) drives that already have short recovery times by default. This has been true for a very long time, maybe a decade. And it's such complete utter crap that this hasn't been dealt with properly by any party: no distribution has fixed this for their users, upstream udev hasn't dealt with it, and kernel folks haven't dealt with it. It's a perverse joke on the user to do this out of the box.

--
Chris Murphy
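The SCT ERC vs SCSI command timer mismatch described above can be checked with something like the sketch below. smartctl reports ERC in tenths of a second, while the kernel timer is in seconds; the device loop and the assumed ERC value of 70 deciseconds (a common NAS-drive default) are illustrative, and smartctl output parsing is left out since its format varies.

```shell
# erc_ok ERC_DECISECONDS TIMEOUT_SECONDS -> success if the drive gives up
# before the kernel resets the link
erc_ok() {
    [ "$1" -lt $(( $2 * 10 )) ]
}

for dev in /sys/block/sd*; do
    [ -r "$dev/device/timeout" ] || continue
    name=$(basename "$dev")
    timeout=$(cat "$dev/device/timeout")   # kernel SCSI command timer, seconds
    # e.g. `smartctl -l scterc /dev/sdX` prints "Read: 70 (7.0 seconds)";
    # 70 deciseconds is assumed here rather than parsed
    erc=70
    if erc_ok "$erc" "$timeout"; then
        echo "$name: ok (ERC ${erc} ds < ${timeout} s timer)"
    else
        echo "$name: misconfigured; either shorten ERC or raise the timer:"
        echo "  smartctl -l scterc,70,70 /dev/$name"
        echo "  echo 180 > $dev/device/timeout"
    fi
done
```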
Re: Trying to rescue my data :(
On 26/06/16 02:25, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 10:19 PM, Steven Haigh wrote:
>
>> Interesting though that EVERY crash references:
>> kernel BUG at fs/btrfs/extent_io.c:2401!
>
> Yeah because you're mounted ro, and if this is 4.4.13 unmodified btrfs
> from kernel.org then that's the 3rd line:
>
>     if (head->is_data) {
>         ret = btrfs_del_csums(trans, root,
>                               node->bytenr,
>                               node->num_bytes);
>
> So why/what is it cleaning up if it's mounted ro? Anyway, once you're
> no longer making forward progress you could try something newer,
> although it's a coin toss what to try. There are some issues with
> 4.6.0-4.6.2 but there have been a lot of changes in btrfs/extent_io.c
> and btrfs/raid56.c between 4.4.13 that you're using and 4.6.2, so you
> could try that or even build 4.7-rc4 or rc5 by tomorrowish and see how
> that fares. It sounds like there's just too much (mostly metadata)
> corruption for the degraded state to deal with so it may not matter.
> I'm really skeptical of btrfsck on degraded fs's so I don't think
> that'll help.

Well, I did end up recovering the data that I cared about. I'm not really keen to ride the BTRFS RAID6 train again any time soon :\

I now have the same as I've had for years: md RAID6 with XFS on top of it. I'm still copying data back to the array from the various sources I had to copy it to so I had enough space to do so.

What I find interesting is that the pattern of corruption in the BTRFS RAID6 is quite clustered. I have ~80Gb of MP3s ripped over the years; of that, the corruption would take out 3-4 songs in a row, then the next 10 albums or so were intact. What made recovery VERY hard is that it got into several situations that just caused a complete system hang.

I tried it on bare metal, just in case it was a Xen thing, but it hard hung the entire machine then too. In every case, it was a flurry of csum error messages, then instant death. I would have been much happier if the file had been skipped or returned as unavailable instead of having the entire machine crash.

I ended up putting the bit of script that I posted earlier in /etc/rc.local, then just kept doing:

xl destroy myvm && xl create /etc/xen/myvm -c

Wait for the crash, run the above again. All in all, it took me about 350 boots with an average uptime of about 3 minutes to get out the data that I decided to keep.

While not a BTRFS loss, I did decide, given how long it was going to take, not to bother recovering ~3.5Tb of other data that is easily available in other places on the internet. If I really need the Fedora 24 KDE Spin ISO, or the CentOS 6 Install DVD, etc., I can download it again.

--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897

signature.asc
Description: OpenPGP digital signature
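The destroy/create cycle described above could be wrapped in a loop so it doesn't have to be retyped after every crash. A hedged sketch: the domain name "myvm" and the 10-second poll interval are examples, and the copy job itself lives in the guest's /etc/rc.local as in the post. The function is defined but not invoked; it loops until interrupted, so stop it once the copy is done.

```shell
# Call rescue_loop on a Xen dom0 to run it; Ctrl-C to stop.
rescue_loop() {
    while true; do
        xl destroy myvm 2>/dev/null   # tear down the crashed guest, if any
        xl create /etc/xen/myvm       # boot again; rc.local resumes the copy
        # poll until the domain disappears, i.e. the guest crashed again
        while xl list myvm >/dev/null 2>&1; do
            sleep 10
        done
    done
}
```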
Re: Trying to rescue my data :(
On Fri, Jun 24, 2016 at 10:19 PM, Steven Haigh wrote:
>
> Interesting though that EVERY crash references:
> kernel BUG at fs/btrfs/extent_io.c:2401!

Yeah, because you're mounted ro, and if this is 4.4.13 unmodified btrfs from kernel.org then that's the 3rd line:

    if (head->is_data) {
        ret = btrfs_del_csums(trans, root,
                              node->bytenr,
                              node->num_bytes);

So why/what is it cleaning up if it's mounted ro? Anyway, once you're no longer making forward progress you could try something newer, although it's a coin toss what to try. There are some issues with 4.6.0-4.6.2, but there have been a lot of changes in btrfs/extent_io.c and btrfs/raid56.c between the 4.4.13 that you're using and 4.6.2, so you could try that, or even build 4.7-rc4 or rc5 by tomorrowish and see how that fares. It sounds like there's just too much (mostly metadata) corruption for the degraded state to deal with, so it may not matter. I'm really skeptical of btrfsck on degraded fs's, so I don't think that'll help.

--
Chris Murphy
[GIT PULL 2/2] Btrfs
Hi Linus,

Btrfs part two was supposed to be a single patch on top of v4.7-rc4. Somehow I didn't notice that my part2 branch repeated a few of the patches in part 1 when I set it up earlier this week. Cherry-picking gone wrong as I folded a fix into Dave Sterba's original integration. I've been testing the git-merged result of part1, part2 and your master for a while, but I just rebased part2 so it didn't include any duplicates. I ran git diff to verify the merged result of today's pull is exactly the same as the one I've been testing.

My for-linus-4.7-part2 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.7-part2

has one patch from Omar to bring iterate_shared back to btrfs. We have a tree of work we queue up for directory items, and it doesn't lend itself well to shared access. While we're cleaning it up, Omar has changed things to use an exclusive lock when there are delayed items.

Omar Sandoval (1) commits (+34/-13):
    Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes

Total: (1) commits (+34/-13)

 fs/btrfs/delayed-inode.c | 27 ++-
 fs/btrfs/delayed-inode.h | 10 ++
 fs/btrfs/inode.c         | 10 ++
 3 files changed, 34 insertions(+), 13 deletions(-)
Re: Bad hard drive - checksum verify failure forces readonly mount
On Fri, Jun 24, 2016 at 6:06 PM, Vasco Almeida wrote:
> Citando Chris Murphy:
>> A lot of changes have happened since 4.1.2. I would still use something
>> newer and try to repair it.
>
> By repair do you mean issue "btrfs check --repair /device" ?

Once you have copied off the important stuff, yes. It's less likely to make things worse now. However, there are some things to do first:

> dmesg http://paste.fedoraproject.org/384352/80842814/

[ 1837.386732] BTRFS info (device dm-9): continuing balance
[ 1838.006038] BTRFS info (device dm-9): relocating block group 15799943168 flags 34
[ 1838.684892] BTRFS info (device dm-9): relocating block group 10934550528 flags 36
[ 1839.301453] [ cut here ]
[ 1839.301495] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:1625 lookup_inline_extent_backref+0x45c/0x5a0 [btrfs]()

followed by

[ 1839.301797] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:2946 btrfs_run_delayed_refs+0x29d/0x2d0 [btrfs]()
[ 1839.301798] BTRFS: Transaction aborted (error -5)
[...]
[ 1839.301972] BTRFS: error (device dm-9) in btrfs_run_delayed_refs:2946: errno=-5 IO failure
[ 1839.301975] BTRFS info (device dm-9): forced readonly

So it looks like it was resuming a balance automatically, and while processing delayed references it ran into something it doesn't expect and doesn't have a way to fix, so it went read-only to avoid causing more problems. I would do a couple of things, in order:

1. Mount ro and copy off what you want, in case the whole thing gets worse and can't ever be mounted again.

2. Mount with only these options: -o skip_balance,subvolid=5,nospace_cache

If it mounts rw, don't do anything with it; just see if it cleans up after itself. It also looks from the previous trace like it was trying to remove a snapshot, and there are complaints of problems in that snapshot. So hopefully after just waiting 5 minutes doing nothing it'll have cleaned up after itself (you can check with top whether any btrfs-related transactions are running, including the btrfs-cleaner process); wait until they're done. Then umount.

If you want, you could have two other consoles ready first, one for 'journalctl -f' and another for sysrq+t, to issue in case you get a hang. This doesn't fix anything, but it collects more information for a bug report for the devs. Once you get it umounted, normally or by force, the next things to do are:

3. btrfs-image, so that devs can see what's causing the problem that the current code isn't handling well enough.

4. btrfs check --repair

Let's see the results of that repair. You can use 'script btrfsrepair.txt' first and then 'btrfs check --repair', and it will log everything. After btrfs check completes, use 'exit' to stop script from recording, and you should have a btrfsrepair.txt file you can post somewhere. When using > not everything gets logged for some reason, but script will capture everything.

Depending on how the repair goes, there might be a couple more options left.

--
Chris Murphy
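The logging in step 4 can also be done non-interactively with script(1)'s -c option. A sketch, with a hypothetical wrapper function; /dev/dm-9 is the device from this thread, so substitute your own:

```shell
# log_repair DEVICE -- run the repair under script(1) so everything it
# prints is captured (plain `>` redirection drops some output, as noted
# above); -e propagates btrfs check's exit status
log_repair() {
    script -e -c "btrfs check --repair $1" btrfsrepair.txt
}

# Destructive: run only after copying data off and taking a btrfs-image.
#   log_repair /dev/dm-9
```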
[GIT PULL 1/2] Btrfs
Hi Linus,

I have a two part pull this time because one of the patches Dave Sterba collected needed to be against v4.7-rc2 or higher (we used rc4). I try to make my for-linus-xx branch testable on top of the last major so we can hand fixes to people on the list more easily, so I've split this pull in two.

My for-linus-4.7 branch has some fixes and two performance improvements that we've been testing for some time:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.7

Josef's two performance fixes are most notable. The transid tracking patch makes a big improvement on pretty much every workload.

Josef Bacik (2) commits (+38/-27):
    Btrfs: don't do nocow check unless we have to (+22/-22)
    Btrfs: track transid for delayed ref flushing (+16/-5)

Liu Bo (1) commits (+11/-2):
    Btrfs: fix error handling in map_private_extent_buffer

Chris Mason (1) commits (+11/-9):
    btrfs: fix deadlock in delayed_ref_async_start

Wei Yongjun (1) commits (+1/-1):
    Btrfs: fix error return code in btrfs_init_test_fs()

Chandan Rajendra (1) commits (+4/-6):
    Btrfs: Force stripesize to the value of sectorsize

Wang Xiaoguang (1) commits (+2/-1):
    btrfs: fix disk_i_size update bug when fallocate() fails

Total: (7) commits (+67/-46)

 fs/btrfs/ctree.c             |  6 +-
 fs/btrfs/ctree.h             |  2 +-
 fs/btrfs/disk-io.c           |  6 ++
 fs/btrfs/extent-tree.c       | 15 +--
 fs/btrfs/extent_io.c         |  7 ++-
 fs/btrfs/file.c              | 44 ++--
 fs/btrfs/inode.c             |  1 +
 fs/btrfs/ordered-data.c      |  3 ++-
 fs/btrfs/tests/btrfs-tests.c |  2 +-
 fs/btrfs/transaction.c       |  3 ++-
 fs/btrfs/volumes.c           |  4 ++--
 11 files changed, 57 insertions(+), 36 deletions(-)
[BUG] Btrfs scrub sometime recalculate wrong parity in raid5
Hi all,

following the thread "Adventures in btrfs raid5 disk recovery", I investigated a bit the BTRFS capability to scrub a corrupted raid5 filesystem. To test it, I first found where a file was stored, and then I tried to corrupt the data disks (while unmounted) or the parity disk. The result showed that sometimes the kernel recomputes the parity wrongly.

I tested the following kernels:
- 4.6.1
- 4.5.4
and both showed the same behavior.

The test was performed as described below:

1) create a filesystem in raid5 mode (for data and metadata) of 1500MB

truncate -s 500M disk1.img; losetup -f disk1.img
truncate -s 500M disk2.img; losetup -f disk2.img
truncate -s 500M disk3.img; losetup -f disk3.img
sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
sudo mount /dev/loop0 mnt/

2) create a file with a length of 128kb:

python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
sudo umount mnt/

3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and was able to find where the file stripe is located:

/dev/loop0: offset=81788928+16*4096 (64k, second half of the file: 'bd...')
/dev/loop1: offset=61865984+16*4096 (64k, first half of the file: 'ad...')
/dev/loop2: offset=61865984+16*4096 (64k, parity: '\x03\x00\x03\x03\x03...')

4) I tried to corrupt each disk (one disk per test), and then ran a scrub; for example, for the disk /dev/loop2:

sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
    seek=$((61865984+16*4096)) count=5
sudo mount /dev/loop0 mnt
sudo btrfs scrub start mnt/.

5) I checked the disks at the offsets above, to verify that the data/parity is correct.

However I found that:

1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any corruption, but recomputes the parity (always correctly);

2) when I corrupted the other disks (/dev/loop[01]), btrfs was able to find the corruption. But I found two main behaviors:

2.a) the kernel repaired the damage, but computed the wrong parity. Where the parity was, the kernel copied the data of the second disk onto the parity disk.

2.b) the kernel repaired the damage, and rebuilt a correct parity.

I have to point out another strange thing: in dmesg I found two kinds of messages:

msg1)
[ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
[ 1021.366949] BTRFS: has skinny extents
[ 1021.399208] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
[ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[ 1021.399291] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0

msg2)
[ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
[ 1017.435074] BTRFS: has skinny extents
[ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[ 1017.463403] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
[ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[ 1017.463467] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
[ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) error at logical 142802944 on dev /dev/loop0
[ 1017.463535] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0

but these seem to be UNrelated to the kernel behavior 2.a) or 2.b).

Another strangeness is that SCRUB sometimes reports
ERROR: there are uncorrectable errors
and sometimes reports
WARNING: errors detected during scrubbing, corrected
but these also seem UNrelated to the behavior 2.a) or 2.b) or msg1 or msg2.

Enclosed you can find the script which I used to trigger the bug. I had to rerun it several times to show the problem because it doesn't happen every time. Pay attention that the offsets and the loop device names are hard coded. You must run the script in the same directory where it is, e.g. "bash test.sh".

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

test.sh
Description: Bourne shell script
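[Editorial note for the archive: the attached test.sh is not reproduced here, so the following is a sketch reconstructed from steps 1-4 above, with the same hard-coded offset and loop-device names. It is wrapped in a function and not run, since it needs root, free loop devices, and python 2.]

```shell
# Call repro (as root) to run the reconstructed test; inspect the disks
# at the offsets from step 3 afterwards.
repro() {
    # step 1: three 500M loop devices, raid5 data+metadata
    for i in 1 2 3; do
        truncate -s 500M disk$i.img
        losetup -f disk$i.img
    done
    mkfs.btrfs -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2
    mkdir -p mnt
    mount /dev/loop0 mnt

    # step 2: a 128 KiB file spanning one full data stripe
    python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" > mnt/out.txt
    umount mnt

    # step 4: corrupt 5 bytes of the parity strip on /dev/loop2 (offset
    # found via btrfs-debug-tree in step 3), then scrub
    dd if=/dev/zero of=/dev/loop2 bs=1 seek=$((61865984 + 16*4096)) count=5
    mount /dev/loop0 mnt
    btrfs scrub start mnt
}
```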