Re: Trying to rescue my data :(

2016-06-25 Thread Steven Haigh
On 26/06/16 12:30, Duncan wrote:
> Steven Haigh posted on Sun, 26 Jun 2016 02:39:23 +1000 as excerpted:
> 
>> In every case, it was a flurry of csum error messages, then instant
>> death.
> 
> This is very possibly a known bug in btrfs, that occurs even in raid1 
> where a later scrub repairs all csum errors.  While in theory btrfs raid1 
> should simply pull from the mirrored copy if its first try fails checksum 
> (assuming the second one passes, of course), and it seems to do this just 
> fine if there's only an occasional csum error, if it gets too many at 
> once, it *does* unfortunately crash, despite the second copy being 
> available and being just fine as later demonstrated by the scrub fixing 
> the bad copy from the good one.
> 
> I'm used to dealing with that here any time I have a bad shutdown (and 
> I'm running live-git kde, which currently has a bug that triggers a 
> system crash if I let it idle and shut off the monitors, so I've been 
> getting crash shutdowns and having to deal with this unfortunately often, 
> recently).  Fortunately I keep my root, with all system executables, etc, 
> mounted read-only by default, so it's not affected and I can /almost/ 
> boot normally after such a crash.  The problem is /var/log and /home 
> (which has some parts of /var that need to be writable symlinked into /
> home/var, so / can stay read-only).  Something in the normal after-crash 
> boot triggers enough csum errors there that I often crash again.
> 
> So I have to boot to emergency mode and manually mount the filesystems in 
> question, so nothing's trying to access them until I run the scrub and 
> fix the csum errors.  Scrub itself doesn't trigger the crash, thankfully, 
> and once it has repaired all the csum errors due to partial writes on one 
> mirror that either were never made or were properly completed on the 
> other mirror, I can exit emergency mode and complete the normal boot (to 
> the multi-user default target).  As there's no more csum errors then 
> because scrub fixed them all, the boot doesn't crash due to too many such 
> errors, and I'm back in business.
> 
> 
> Tho I believe at least the csum bug that affects me may only trigger if 
> compression is (or perhaps has been in the past) enabled.  Since I run 
> compress=lzo everywhere, that would certainly affect me.  It would also 
> explain why the bug has remained around for quite some time as well, 
> since presumably the devs don't run with compression on enough for this 
> to have become a personal itch they needed to scratch, thus its remaining 
> untraced and unfixed.
> 
> So if you weren't using the compress option, your bug is probably 
> different, but either way, the whole thing about too many csum errors at 
> once triggering a system crash sure does sound familiar, here.

Yes, I was running the compress=lzo option as well... Maybe here lies a
common problem?

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Duncan
Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:

> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?
> That's fucked. So in other words, if there are any errors fixed up
> during a scrub, you should do a 2nd scrub. The first scrub should make
> sure data is correct, and the 2nd scrub should make sure the bug is
> papered over by computing correct parity and replacing the bad parity.
> 
> I wonder if the same problem happens with balance or if this is just a
> bug in scrub code?

Could this explain why people have been reporting so many raid56 mode 
cases of btrfs replacing a first drive appearing to succeed just fine, 
but then they go to btrfs replace a second drive, and the array crashes 
as if the first replace didn't work correctly after all, resulting in two 
bad devices once the second replace gets under way, of course bringing 
down the array?

If so, then it looks like we have our answer as to what has been going 
wrong that has been so hard to properly trace and thus to bugfix.

Combine that with the raid4 dedicated parity device behavior you're 
seeing if the writes are all exactly 128 KiB, with that possibly 
explaining the super-slow replaces, and this thread may have just given 
us answers to both of those until-now-untraceable issues.

Regardless, what's /very/ clear by now is that raid56 mode as it 
currently exists is more or less fatally flawed, and a full scrap and 
rewrite to an entirely different raid56 mode on-disk format may be 
necessary to fix it.

And what's even clearer is that people /really/ shouldn't be using raid56 
mode for anything but testing with throw-away data, at this point.  
Anything else is simply irresponsible.

Does that mean we need to put a "raid56 mode may eat your babies" level 
warning in the manpage and require a --force to either mkfs.btrfs or 
balance to raid56 mode?  Because that's about where I am on it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Trying to rescue my data :(

2016-06-25 Thread Duncan
Steven Haigh posted on Sun, 26 Jun 2016 02:39:23 +1000 as excerpted:

> In every case, it was a flurry of csum error messages, then instant
> death.

This is very possibly a known bug in btrfs, that occurs even in raid1 
where a later scrub repairs all csum errors.  While in theory btrfs raid1 
should simply pull from the mirrored copy if its first try fails checksum 
(assuming the second one passes, of course), and it seems to do this just 
fine if there's only an occasional csum error, if it gets too many at 
once, it *does* unfortunately crash, despite the second copy being 
available and being just fine as later demonstrated by the scrub fixing 
the bad copy from the good one.

I'm used to dealing with that here any time I have a bad shutdown (and 
I'm running live-git kde, which currently has a bug that triggers a 
system crash if I let it idle and shut off the monitors, so I've been 
getting crash shutdowns and having to deal with this unfortunately often, 
recently).  Fortunately I keep my root, with all system executables, etc, 
mounted read-only by default, so it's not affected and I can /almost/ 
boot normally after such a crash.  The problem is /var/log and /home 
(which has some parts of /var that need to be writable symlinked into /
home/var, so / can stay read-only).  Something in the normal after-crash 
boot triggers enough csum errors there that I often crash again.

So I have to boot to emergency mode and manually mount the filesystems in 
question, so nothing's trying to access them until I run the scrub and 
fix the csum errors.  Scrub itself doesn't trigger the crash, thankfully, 
and once it has repaired all the csum errors due to partial writes on one 
mirror that either were never made or were properly completed on the 
other mirror, I can exit emergency mode and complete the normal boot (to 
the multi-user default target).  As there's no more csum errors then 
because scrub fixed them all, the boot doesn't crash due to too many such 
errors, and I'm back in business.


Tho I believe at least the csum bug that affects me may only trigger if 
compression is (or perhaps has been in the past) enabled.  Since I run 
compress=lzo everywhere, that would certainly affect me.  It would also 
explain why the bug has remained around for quite some time as well, 
since presumably the devs don't run with compression on enough for this 
to have become a personal itch they needed to scratch, thus its remaining 
untraced and unfixed.

So if you weren't using the compress option, your bug is probably 
different, but either way, the whole thing about too many csum errors at 
once triggering a system crash sure does sound familiar, here.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Chris Murphy
On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli
 wrote:
> On 2016-06-25 19:58, Chris Murphy wrote:
> [...]
>>> Wow. So it sees the data strip corruption, uses good parity on disk to
>>> fix it, writes the fix to disk, recomputes parity for some reason but
>>> does it wrongly, and then overwrites good parity with bad parity?
>>
>> The wrong parity, is it valid for the data strips that include the
>> (intentionally) corrupt data?
>>
>> Can parity computation happen before the csum check? Where sometimes you get:
>>
>> read data strips > compute parity > check csum fails > read good
>> parity from disk > fix up the bad data chunk > write wrong parity
>> (based on wrong data)?
>>
>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>
>> Lines 2371-2383 suggest that there's a parity check, and that parity isn't
>> always rewritten to disk if it's already correct. But it doesn't know it's
>> not correct; it thinks it's wrong, so it writes out the wrongly computed
>> parity?
>
> The parity is valid neither for the corrected data nor for the corrupted
> data. It seems that the scrub process copies the contents of disk2 to disk3.
> That could happen only if the contents of disk1 are zero.

I'm not sure what it takes to hit this exactly. I just tested 3x
raid5 with two 128KiB files, "a" and "b", so that's a full stripe
write for each. I corrupted the 64KiB of "a" on devid 1 and the 64KiB
of "b" on devid 2, did a scrub, and the errors were detected and
corrected, and the parity is still correct.

I also tried corrupting both parities and scrubbing, and like you I get
no messages from scrub in user space or the kernel, but the parity is
corrected.

The fixup is also not cow'd. It is an overwrite, which seems
unproblematic to me at face value. But?

Next I corrupted parities, failed one drive, mounted degraded, and
read in both files. If there is a write hole, I should get back
corrupt data from parity reconstruction blindly being trusted and
wrongly reconstructed.

[root@f24s ~]# cp /mnt/5/* /mnt/1/tmp
cp: error reading '/mnt/5/a128.txt': Input/output error
cp: error reading '/mnt/5/b128.txt': Input/output error

[607594.478720] BTRFS warning (device dm-7): csum failed ino 295 off 0
csum 1940348404 expected csum 650595490
[607594.478818] BTRFS warning (device dm-7): csum failed ino 295 off
4096 csum 463855480 expected csum 650595490
[607594.478869] BTRFS warning (device dm-7): csum failed ino 295 off
8192 csum 3317251692 expected csum 650595490
[607594.479227] BTRFS warning (device dm-7): csum failed ino 295 off
12288 csum 2973611336 expected csum 650595490
[607594.479244] BTRFS warning (device dm-7): csum failed ino 295 off
16384 csum 2556299655 expected csum 650595490
[607594.479254] BTRFS warning (device dm-7): csum failed ino 295 off
20480 csum 1098993191 expected csum 650595490
[607594.479263] BTRFS warning (device dm-7): csum failed ino 295 off
24576 csum 1503293813 expected csum 650595490
[607594.479272] BTRFS warning (device dm-7): csum failed ino 295 off
28672 csum 1538866238 expected csum 650595490
[607594.479282] BTRFS warning (device dm-7): csum failed ino 295 off
36864 csum 2855931166 expected csum 650595490
[607594.479292] BTRFS warning (device dm-7): csum failed ino 295 off
32768 csum 3351364818 expected csum 650595490


So... no write hole? Clearly it must reconstruct from the corrupt
parity, then check the csum tree for EXTENT_CSUM; it doesn't match, so
the bad data fails to propagate upstream. And it doesn't result in a
fixup. Good.

What happens if I umount, make the missing device visible again, and
mount not degraded?

[607775.394504] BTRFS error (device dm-7): parent transid verify
failed on 18517852160 wanted 143 found 140
[607775.424505] BTRFS info (device dm-7): read error corrected: ino 1
off 18517852160 (dev /dev/mapper/VG-a sector 67584)
[607775.425055] BTRFS info (device dm-7): read error corrected: ino 1
off 18517856256 (dev /dev/mapper/VG-a sector 67592)
[607775.425560] BTRFS info (device dm-7): read error corrected: ino 1
off 18517860352 (dev /dev/mapper/VG-a sector 67600)
[607775.425850] BTRFS info (device dm-7): read error corrected: ino 1
off 18517864448 (dev /dev/mapper/VG-a sector 67608)
[607775.431867] BTRFS error (device dm-7): parent transid verify
failed on 16303439872 wanted 145 found 139
[607775.432973] BTRFS info (device dm-7): read error corrected: ino 1
off 16303439872 (dev /dev/mapper/VG-a sector 4262240)
[607775.433438] BTRFS info (device dm-7): read error corrected: ino 1
off 16303443968 (dev /dev/mapper/VG-a sector 4262248)
[607775.433842] BTRFS info (device dm-7): read error corrected: ino 1
off 16303448064 (dev /dev/mapper/VG-a sector 4262256)
[607775.434220] BTRFS info (device dm-7): read error corrected: ino 1
off 16303452160 (dev /dev/mapper/VG-a sector 4262264)
[607775.434847] BTRFS error (device dm-7): parent transid verify
failed on 16303456256 wanted 145 found 139
[607775.435972] BTRFS info (device dm-7): read error corrected: ino 1
off 163034562

Re: Adventures in btrfs raid5 disk recovery

2016-06-25 Thread Chris Murphy
Interestingly enough, so far I'm finding that with full stripe writes,
i.e. 3x raid5 and exactly 128KiB data writes, devid 3 is always parity.
This is raid4. So... I wonder if some of these slow cases end up with a
bunch of stripes that are effectively raid4-like, and have a lot of
parity overwrites, which is where raid4 suffers due to disk
contention.

Totally speculative as the sample size is too small and distinctly non-random.


Chris Murphy


[PATCH v3 00/24] Delete CURRENT_TIME_SEC and replace current_fs_time()

2016-06-25 Thread Deepa Dinamani
The series is aimed at getting rid of the CURRENT_TIME and CURRENT_TIME_SEC macros
and replacing current_fs_time() with current_time().
The macros are not y2038 safe, and there is no plan to transition them into being
y2038 safe. The ktime_get_* APIs, which are y2038 safe, can be used in their place.

CURRENT_TIME will be deleted after 4.8-rc1, as one of the remaining CURRENT_TIME
occurrences depends on the function time64_to_tm().

Thanks to Arnd Bergmann for all the guidance and discussions.

Patches 3-5 were mostly generated using coccinelle.

All filesystem timestamps use current_fs_time() for the right granularity, as
mentioned in the commit texts of the respective patches. The function has a
changed signature, is renamed to current_time(), and is moved to fs/inode.c.

This series also serves as a preparatory series to transition vfs to 64 bit
timestamps, as outlined here: https://lkml.org/lkml/2016/2/12/104 .

As per Linus's suggestion in https://lkml.org/lkml/2016/5/24/663 , all the
inode timestamp changes have been squashed into a single patch. Also,
current_time() is now used as the single generic vfs filesystem timestamp API.
It takes struct inode* as its argument instead of struct super_block*.
All patches are posted together in a bigger series so that the big picture is
clear.

As per the suggestion in https://lwn.net/Articles/672598/, the CURRENT_TIME macro
bug fixes are being handled in a series separate from the vfs transition to
64 bit timestamps.

Changes since v2:
* Fix buildbot error for uninitialized sb in inode.
* Minor fixes according to Arnd's comments.
* Leave out the fnic and deletion of CURRENT_TIME to be submitted after 4.8 rc1.

Deepa Dinamani (24):
  vfs: Add current_time() api
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace current_fs_time() with current_time()
  fs: jfs: Replace CURRENT_TIME_SEC by current_time()
  fs: ext4: Use current_time() for inode timestamps
  fs: ubifs: Replace CURRENT_TIME_SEC with current_time
  fs: btrfs: Use ktime_get_real_ts for root ctime
  fs: udf: Replace CURRENT_TIME with current_time()
  fs: cifs: Replace CURRENT_TIME by current_time()
  fs: cifs: Replace CURRENT_TIME with ktime_get_real_ts()
  fs: cifs: Replace CURRENT_TIME by get_seconds
  fs: f2fs: Use ktime_get_real_seconds for sit_info times
  drivers: staging: lustre: Replace CURRENT_TIME with current_time()
  fs: ocfs2: Use time64_t to represent orphan scan times
  fs: ocfs2: Replace CURRENT_TIME with ktime_get_real_seconds()
  audit: Use timespec64 to represent audit timestamps
  fs: nfs: Make nfs boot time y2038 safe
  block: Replace CURRENT_TIME with ktime_get_real_ts
  libceph: Replace CURRENT_TIME with ktime_get_real_ts
  fs: ceph: Replace current_fs_time for request stamp
  time: Delete current_fs_time() function
  time: Delete CURRENT_TIME_SEC

 arch/powerpc/platforms/cell/spufs/inode.c  |  2 +-
 arch/s390/hypfs/inode.c|  4 +--
 drivers/block/rbd.c|  2 +-
 drivers/char/sonypi.c  |  2 +-
 drivers/infiniband/hw/qib/qib_fs.c |  2 +-
 drivers/misc/ibmasm/ibmasmfs.c |  2 +-
 drivers/oprofile/oprofilefs.c  |  2 +-
 drivers/platform/x86/sony-laptop.c |  2 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c| 16 ++--
 drivers/staging/lustre/lustre/llite/namei.c|  4 +--
 drivers/staging/lustre/lustre/mdc/mdc_reint.c  |  6 ++---
 .../lustre/lustre/obdclass/linux/linux-obdo.c  |  6 ++---
 drivers/staging/lustre/lustre/obdclass/obdo.c  |  6 ++---
 drivers/staging/lustre/lustre/osc/osc_io.c |  2 +-
 drivers/usb/core/devio.c   | 18 +++---
 drivers/usb/gadget/function/f_fs.c |  8 +++---
 drivers/usb/gadget/legacy/inode.c  |  2 +-
 fs/9p/vfs_inode.c  |  2 +-
 fs/adfs/inode.c|  2 +-
 fs/affs/amigaffs.c |  6 ++---
 fs/affs/inode.c|  2 +-
 fs/attr.c  |  2 +-
 fs/autofs4/inode.c |  2 +-
 fs/autofs4/root.c  |  6 ++---
 fs/bad_inode.c |  2 +-
 fs/bfs/dir.c   | 14 +--
 fs/binfmt_misc.c   |  2 +-
 fs/btrfs/file.c|  6 ++---
 fs/btrfs/inode.c   | 22 
 fs/btrfs/ioctl.c   |  8 +++---
 fs/btrfs/root-tree.c   |  3 ++-
 fs/btrfs/transaction.c |  4 +--
 fs/btrfs/xattr.c   |  2 +-
 fs/

[PATCH v3 09/24] fs: btrfs: Use ktime_get_real_ts for root ctime

2016-06-25 Thread Deepa Dinamani
btrfs_root_item maintains the ctime for root updates.
This is not part of vfs_inode.

Since current_time() takes struct inode* as an argument,
as Linus suggested, it cannot be used to update root
times unless we modify the signature to use an inode.

Since btrfs uses nanosecond time granularity, it can also
use ktime_get_real_ts directly to obtain the timestamp for
the root. It is necessary to use the timespec time API
here because the same btrfs_set_stack_timespec_*() APIs
are used for vfs inode times as well. These can be
transitioned to timespec64 when btrfs internally
changes to use timespec64 as well.

Signed-off-by: Deepa Dinamani 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba 
---
 fs/btrfs/root-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index f1c3086..161118b 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -496,10 +496,11 @@ void btrfs_update_root_times(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root)
 {
 	struct btrfs_root_item *item = &root->root_item;
-	struct timespec ct = current_fs_time(root->fs_info->sb);
+	struct timespec ct;
 
 	spin_lock(&root->root_item_lock);
 	btrfs_set_root_ctransid(item, trans->transid);
+	ktime_get_real_ts(&ct);
 	btrfs_set_stack_timespec_sec(&item->ctime, ct.tv_sec);
 	btrfs_set_stack_timespec_nsec(&item->ctime, ct.tv_nsec);
 	spin_unlock(&root->root_item_lock);
-- 
1.9.1



Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-25 Thread Chris Murphy
On Sat, Jun 25, 2016 at 2:10 PM, Vasco Almeida  wrote:
> Citando Chris Murphy :

>>
>> I would do a couple things in order:
>> 1. Mount ro and copy off what you want in case the whole thing gets
>> worse and can't ever be mounted again.
>> 2. Mount with only these options: -o skip_balance,subvolid=5,nospace_cache
>
>
> I have mounted with those options, and it was read-write at first and then it
> forced readonly. You can see a delay between the first BTRFS messages and the
> "BTRFS info: forced readonly" message in dmesg.
>
> /dev/mapper/vg_pupu-lv_opensuse_root on /mnt type btrfs
> (ro,relatime,seclabel,nospace_cache,skip_balance,subvolid=5,subvol=/)
>
>
>> If it mounts rw, don't do anything with it, just see if it cleans up
>> after itself. It also looks from the previous trace it was trying to
>> remove a snapshot and there are complaints of problems in that
>> snapshot. So hopefully just waiting 5 minutes doing nothing and it'll
>> clean up after itself (you can check with top to see if there are any
>> btrfs related transactions that run including the btrfs-cleaner
>> process) wait until they're done.
>
>
> I can see the btrfs processes, including btrfs-cleaner, but they may not be
> doing much since the device was forced readonly after mounting it.

Readonly just refers to user space up to and including VFS, as I
understand it. The file system itself can still write to the block
device.


> I have unmounted it normally (umount /mnt) more than 20 minutes after
> mounting it.
>
>> 3. btrfs-image so that devs can see what's causing the problem that
>> the current code isn't handling well enough.
>
>
> btrfs-image does not create dump image:
>
> # btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root
> btrfs-lv_opensuse_root.image
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> Csum didn't match
> Error reading metadata block
> Error adding block -5
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> Csum didn't match
> Error reading metadata block
> Error flushing pending -5
> create failed (Success)
> # echo $?
> 1

Well it's pretty strange to have DUP metadata and for the checksum
verify to fail on both copies. I don't have much optimism that btrfsck
repair can fix it either. But still it's worth a shot since there's
not much else to go on.


-- 
Chris Murphy


Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-25 Thread Vasco Almeida

Citando Chris Murphy :


On Fri, Jun 24, 2016 at 6:06 PM, Vasco Almeida  wrote:

Citando Chris Murphy :
dmesg http://paste.fedoraproject.org/384352/80842814/


[ 1837.386732] BTRFS info (device dm-9): continuing balance
[ 1838.006038] BTRFS info (device dm-9): relocating block group
15799943168 flags 34
[ 1838.684892] BTRFS info (device dm-9): relocating block group
10934550528 flags 36
[ 1839.301453] [ cut here ]
[ 1839.301495] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:1625
lookup_inline_extent_backref+0x45c/0x5a0 [btrfs]()

followed by

[ 1839.301797] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:2946
btrfs_run_delayed_refs+0x29d/0x2d0 [btrfs]()
[ 1839.301798] BTRFS: Transaction aborted (error -5)
[...]
[ 1839.301972] BTRFS: error (device dm-9) in
btrfs_run_delayed_refs:2946: errno=-5 IO failure
[ 1839.301975] BTRFS info (device dm-9): forced readonly

So it looks like it was resuming a balance automatically, and while
processing delayed references it's running into something it doesn't
expect and doesn't have a way to fix, so it goes read only to avoid
causing more problems.

I would do a couple things in order:
1. Mount ro and copy off what you want in case the whole thing gets
worse and can't ever be mounted again.
2. Mount with only these options: -o skip_balance,subvolid=5,nospace_cache


I have mounted with those options, and it was read-write at first and then it
forced readonly. You can see a delay between the first BTRFS messages and
the "BTRFS info: forced readonly" message in dmesg.


/dev/mapper/vg_pupu-lv_opensuse_root on /mnt type btrfs  
(ro,relatime,seclabel,nospace_cache,skip_balance,subvolid=5,subvol=/)




If it mounts rw, don't do anything with it, just see if it cleans up
after itself. It also looks from the previous trace it was trying to
remove a snapshot and there are complaints of problems in that
snapshot. So hopefully just waiting 5 minutes doing nothing and it'll
clean up after itself (you can check with top to see if there are any
btrfs related transactions that run including the btrfs-cleaner
process) wait until they're done.


I can see the btrfs processes, including btrfs-cleaner, but they may not be
doing much since the device was forced readonly after mounting it.



Then umount. If you want you could have two other consoles ready
first, one for 'journalctl -f' and another for sysrq+t to issue in
case you get a hang. This doesn't fix anything but it collects more
information for a bug report for the devs.

Once you get it umounted normally or by force, the next thing to do is


I have unmounted it normally (umount /mnt) more than 20 minutes after
mounting it.



3. btrfs-image so that devs can see what's causing the problem that
the current code isn't handling well enough.


btrfs-image does not create dump image:

# btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root  
btrfs-lv_opensuse_root.image

checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
Csum didn't match
Error reading metadata block
Error adding block -5
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
Csum didn't match
Error reading metadata block
Error flushing pending -5
create failed (Success)
# echo $?
1



4. btrfs check --repair


I did not issue this command yet.

dmesg http://paste.fedoraproject.org/384799/14668851/

Thank you for helping.



Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Goffredo Baroncelli
On 2016-06-25 19:58, Chris Murphy wrote:
[...]
>> Wow. So it sees the data strip corruption, uses good parity on disk to
>> fix it, writes the fix to disk, recomputes parity for some reason but
>> does it wrongly, and then overwrites good parity with bad parity?
> 
> The wrong parity, is it valid for the data strips that include the
> (intentionally) corrupt data?
> 
> Can parity computation happen before the csum check? Where sometimes you get:
> 
> read data strips > compute parity > check csum fails > read good
> parity from disk > fix up the bad data chunk > write wrong parity
> (based on wrong data)?
> 
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
> 
> Lines 2371-2383 suggest that there's a parity check, and that parity isn't
> always rewritten to disk if it's already correct. But it doesn't know it's
> not correct; it thinks it's wrong, so it writes out the wrongly computed
> parity?

The parity is valid neither for the corrected data nor for the corrupted data.
It seems that the scrub process copies the contents of disk2 to disk3. That
could happen only if the contents of disk1 are zero.

BR
GB


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Chris Murphy
On Sat, Jun 25, 2016 at 11:25 AM, Chris Murphy  wrote:
> On Sat, Jun 25, 2016 at 6:21 AM, Goffredo Baroncelli  
> wrote:
>
>> 5) I check the disks at the offsets above, to verify that the data/parity is 
>> correct
>>
>> However I found that:
>> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
>> corruption, but recomputes the parity (always correctly);
>
> This is mostly good news, that it is fixing bad parity during scrub.
> What's not clear due to the lack of any message is if the scrub is
> always writing out new parity, or only writes it if there's a
> mismatch.
>
>
>> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find 
>> the corruption. But I found two main behaviors:
>>
>> 2.a) the kernel repaired the damage, but computed the wrong parity. Where the
>> parity was, the kernel copied the data of the second disk onto the parity
>> disk
>
> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?

The wrong parity, is it valid for the data strips that include the
(intentionally) corrupt data?

Can parity computation happen before the csum check? Where sometimes you get:

read data strips > compute parity > check csum fails > read good
parity from disk > fix up the bad data chunk > write wrong parity
(based on wrong data)?

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3

Lines 2371-2383 suggest that there's a parity check, and that parity isn't
always rewritten to disk if it's already correct. But it doesn't know it's
not correct; it thinks it's wrong, so it writes out the wrongly computed
parity?



-- 
Chris Murphy


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Chris Murphy
On Sat, Jun 25, 2016 at 6:21 AM, Goffredo Baroncelli  wrote:

> 5) I check the disks at the offsets above, to verify that the data/parity is 
> correct
>
> However I found that:
> 1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any
> corruption, but recomputes the parity (always correctly);

This is mostly good news, that it is fixing bad parity during scrub.
What's not clear due to the lack of any message is if the scrub is
always writing out new parity, or only writes it if there's a
mismatch.


> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find 
> the corruption. But I found two main behaviors:
>
> 2.a) the kernel repaired the damage, but computed the wrong parity. Where the
> parity was, the kernel copied the data of the second disk onto the parity
> disk

Wow. So it sees the data strip corruption, uses good parity on disk to
fix it, writes the fix to disk, recomputes parity for some reason but
does it wrongly, and then overwrites good parity with bad parity?
That's fucked. So in other words, if there are any errors fixed up
during a scrub, you should do a 2nd scrub. The first scrub should make
sure data is correct, and the 2nd scrub should make sure the bug is
papered over by computing correct parity and replacing the bad parity.

I wonder if the same problem happens with balance or if this is just a
bug in scrub code?


> but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)
>
> Another strangeness is that SCRUB sometimes reports
>  ERROR: there are uncorrectable errors
> and sometimes reports
>  WARNING: errors detected during scrubbing, corrected
>
> but these also seem UNrelated to behavior 2.a) or 2.b), or to msg1 or msg2

I've seen this also, errors in user space but no kernel messages.


-- 
Chris Murphy


Re: Trying to rescue my data :(

2016-06-25 Thread Chris Murphy
On Sat, Jun 25, 2016 at 10:39 AM, Steven Haigh  wrote:

> Well, I did end up recovering the data that I cared about. I'm not
> really keen to ride the BTRFS RAID6 train again any time soon :\
>
> I now have the same as I've had for years - md RAID6 with XFS on top of
> it. I'm still copying data back to the array from the various sources I
> had to copy it to so I had enough space to do so.

Just make sure you've got each drive's SCT ERC shorter than the kernel
SCSI command timer for each block device in
/sys/block/device-name/device/timeout, or you can very easily end up
with the same if not worse problem, which is total array collapse. It's
rarer to see the problem on md raid6 because the extra parity ends
up papering over the problem caused by this misconfiguration, but it's
a misconfiguration that's the default unless you're using
enterprise/NAS-specific drives with short recoveries set on them by
default. The linux-raid@ list is full of problems resulting from this
issue.
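
As a rough sketch (assuming smartmontools and a SATA drive at /dev/sda;
the values are only illustrative), checking and setting the two sides of
that comparison looks something like:

  # Show whether the drive supports SCT ERC and its current setting
  smartctl -l scterc /dev/sda

  # If supported, cap read/write error recovery at 7 seconds (100 ms units)
  smartctl -l scterc,70,70 /dev/sda

  # Kernel SCSI command timer for the same device, in seconds (default 30)
  cat /sys/block/sda/device/timeout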

I think the obvious mistake here though is assuming reshapes entail no
risk. There's a -f required for a reason. You could have ended up in
just as bad a situation doing a reshape without a backup of an md or lvm
based array. Yes it should work, and if it doesn't it's a bug, but how
much data do you want to lose today?



> What I find interesting is that the patterns of corruption in the BTRFS
> RAID6 are quite clustered. I have ~80GB of MP3s ripped over the years -
> of that, the corruption would take out 3-4 songs in a row, then the next
> 10 albums or so were intact. What made recovery VERY hard, is that it
> got to several situations that just caused a complete system hang.

The data stripe size is 64KiB * (num of disks - 2). So in your case I
think that's 64 * 3 = 192KiB. That's less than the size of one song, so
that means roughly 15 bad stripes in a row. That's less than a block
group also.

The Btrfs conversion should be safer than methods used by mdadm and
lvm because the operation is cow. The raid6 block group is supposed to
remain intact and "live" if you will, until the single block group is
written to stable media. The full crash set of kernel messages might
be useful to find out what was happening that instigated all of this
corruption. But even still the subsequent mount should at worst
rollback to state of block groups of different profiles where the most
recent (failed) conversion is still a raid6 block group intact.

So, I'd still say btrfs-image it and host it somewhere, file a bug,
cross reference this thread in the bug, and the bug URL in this
thread. Might take months or even a year before a dev looks at it, but
better than nothing.


>
> I tried it on bare metal - just in case it was a Xen thing, but it hard
> hung the entire machine then. In every case, it was a flurry of csum
> error messages, then instant death. I would have been much happier if
> the file had been skipped or returned as unavailable instead of having
> the entire machine crash.

Of course. The unanswered question though is why are there so many
csum errors? Are these metadata csum errors, or are they EXTENT_CSUM
errors, and how are they becoming wrong? Wrongly read, wrongly
written, wrongly recomputed from parity? How did the parity go bad if
that's the case? So it needs an autopsy or it just doesn't get better.


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-25 Thread Chris Murphy
On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
 wrote:

> Well, the obvious major advantage that comes to mind for me to checksumming
> parity is that it would let us scrub the parity data itself and verify it.

OK but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on-disk: the EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if
parity is wrong, then it should be replaced.

Even md's check (writing 'check' to md/sync_action) does this. So no pun
intended, but Btrfs isn't even at parity with mdadm on data integrity if
it doesn't check whether the parity matches the data.


> I'd personally much rather know my parity is bad before I need to use it
> than after using it to reconstruct data and getting an error there, and I'd
> be willing to be that most seasoned sysadmins working for companies using
> big storage arrays likely feel the same about it.

That doesn't require parity csums though. It just requires computing
parity during a scrub and comparing it to the parity on disk to make
sure they're the same. If they aren't, assuming no other error for
that full stripe read, then the parity block is replaced.

So that's also something to check in the code or poke a system with a
stick and see what happens.

> I could see it being
> practical to have an option to turn this off for performance reasons or
> similar, but again, I have a feeling that most people would rather be able
> to check if a rebuild will eat data before trying to rebuild (depending on
> the situation in such a case, it will sometimes just make more sense to nuke
> the array and restore from a backup instead of spending time waiting for it
> to rebuild).

The much bigger problem we have right now, affecting Btrfs and
LVM/mdadm md raid alike, is this silly bad default with non-enterprise
drives having no configurable SCT ERC, with ensuing long recovery
times, and the kernel SCSI command timer at 30 seconds - which also
fucks over regular single disk users, because it means they don't get
the "benefit" of long recovery times, which is the whole g'd point of
that feature. This itself causes so many problems where bad sectors
just get worse and don't get fixed up because of all the link resets.
So I still think it's a bullshit default on the kernel side, because it
pretty much affects the majority use case; it is only a non-problem
with proprietary hardware raid, and with software raid using enterprise
(or NAS specific) drives that already have short recovery times by
default.
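
For drives where SCT ERC can't be enabled at all, the usual workaround is
the other direction: raise the command timer well above the drive's
worst-case internal recovery time. A rough sketch, with illustrative
values and device names:

  # Raise the SCSI command timer if 'smartctl -l scterc' says unsupported
  echo 180 > /sys/block/sdX/device/timeout

  # A udev rule along these lines could make it persistent (illustrative only):
  # ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"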

This has been true for a very long time, maybe a decade. And it's such
complete utter crap that this hasn't been dealt with properly by any
party. No distribution has fixed this for their users. Upstream udev
hasn't dealt with it. And kernel folks haven't dealt with it. It's a
perverse joke on the user to do this out of the box.



-- 
Chris Murphy


Re: Trying to rescue my data :(

2016-06-25 Thread Steven Haigh
On 26/06/16 02:25, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 10:19 PM, Steven Haigh  wrote:
> 
>>
>> Interesting though that EVERY crash references:
>> kernel BUG at fs/btrfs/extent_io.c:2401!
> 
> Yeah because you're mounted ro, and if this is 4.4.13 unmodified btrfs
> from kernel.org then that's the 3rd line:
> 
> if (head->is_data) {
> ret = btrfs_del_csums(trans, root,
>node->bytenr,
>node->num_bytes);
> 
> So why/what is it cleaning up if it's mounted ro? Anyway, once you're
> no longer making forward progress you could try something newer,
> although it's a coin toss what to try. There are some issues with
> 4.6.0-4.6.2 but there have been a lot of changes in btrfs/extent_io.c
> and btrfs/raid56.c between 4.4.13 that you're using and 4.6.2, so you
> could try that or even build 4.7-rc4 or rc5 by tomorrowish and see how
> that fares. It sounds like there's just too much (mostly metadata)
> corruption for the degraded state to deal with so it may not matter.
> I'm really skeptical of btrfsck on degraded fs's so I don't think
> that'll help.

Well, I did end up recovering the data that I cared about. I'm not
really keen to ride the BTRFS RAID6 train again any time soon :\

I now have the same as I've had for years - md RAID6 with XFS on top of
it. I'm still copying data back to the array from the various sources I
had to copy it to so I had enough space to do so.

What I find interesting is that the patterns of corruption in the BTRFS
RAID6 are quite clustered. I have ~80GB of MP3s ripped over the years -
of that, the corruption would take out 3-4 songs in a row, then the next
10 albums or so were intact. What made recovery VERY hard, is that it
got to several situations that just caused a complete system hang.

I tried it on bare metal - just in case it was a Xen thing, but it hard
hung the entire machine then. In every case, it was a flurry of csum
error messages, then instant death. I would have been much happier if
the file had been skipped or returned as unavailable instead of having
the entire machine crash.

I ended up putting the bit of script that I posted earlier in
/etc/rc.local - then just kept doing:
xl destroy myvm && xl create /etc/xen/myvm -c

Wait for the crash, run the above again.

All in all, it took me about 350 boots with an average uptime of about 3
minutes to get the data out that I decided to keep. While not a BTRFS
loss, given how long it was going to take, I decided not to bother
recovering ~3.5TB of other data that is easily available in other places
on the internet. If I really need the Fedora 24 KDE Spin ISO, or the
CentOS 6 Install DVD, etc etc, I can download it again.

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: Trying to rescue my data :(

2016-06-25 Thread Chris Murphy
On Fri, Jun 24, 2016 at 10:19 PM, Steven Haigh  wrote:

>
> Interesting though that EVERY crash references:
> kernel BUG at fs/btrfs/extent_io.c:2401!

Yeah because you're mounted ro, and if this is 4.4.13 unmodified btrfs
from kernel.org then that's the 3rd line:

if (head->is_data) {
ret = btrfs_del_csums(trans, root,
   node->bytenr,
   node->num_bytes);

So why/what is it cleaning up if it's mounted ro? Anyway, once you're
no longer making forward progress you could try something newer,
although it's a coin toss what to try. There are some issues with
4.6.0-4.6.2 but there have been a lot of changes in btrfs/extent_io.c
and btrfs/raid56.c between 4.4.13 that you're using and 4.6.2, so you
could try that or even build 4.7-rc4 or rc5 by tomorrowish and see how
that fares. It sounds like there's just too much (mostly metadata)
corruption for the degraded state to deal with so it may not matter.
I'm really skeptical of btrfsck on degraded fs's so I don't think
that'll help.


-- 
Chris Murphy


[GIT PULL 2/2] Btrfs

2016-06-25 Thread Chris Mason
Hi Linus,

Btrfs part two was supposed to be a single patch on top of v4.7-rc4.
Somehow I didn't notice that my part2 branch repeated a few of the
patches in part 1 when I set it up earlier this week.  Cherry-picking
gone wrong as I folded a fix into Dave Sterba's original integration.

I've been testing the git-merged result of part1, part2 and your
master for a while, but I just rebased part2 so it didn't include
any duplicates.  I ran git diff to verify the merged result of
today's pull is exactly the same as the one I've been testing.

My for-linus-4.7-part2 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.7-part2

Has one patch from Omar to bring iterate_shared back to btrfs.  We have
a tree of work we queue up for directory items and it doesn't
lend itself well to shared access.  While we're cleaning it up, Omar
has changed things to use an exclusive lock when there are delayed
items.

Omar Sandoval (1) commits (+34/-13):
Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes

Total: (1) commits (+34/-13)

 fs/btrfs/delayed-inode.c | 27 ++-
 fs/btrfs/delayed-inode.h | 10 ++
 fs/btrfs/inode.c | 10 ++
 3 files changed, 34 insertions(+), 13 deletions(-)


Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-25 Thread Chris Murphy
On Fri, Jun 24, 2016 at 6:06 PM, Vasco Almeida  wrote:
> Citando Chris Murphy :

>> A lot of changes have happened since 4.1.2. I would still use something
>> newer and try to repair it.
>
>
> By repair do you mean issue "btrfs check --repair /device" ?

Once you have copied off the important stuff, yes. It's less likely to
make things worse now. However, there are some things to do first:




> dmesg http://paste.fedoraproject.org/384352/80842814/

[ 1837.386732] BTRFS info (device dm-9): continuing balance
[ 1838.006038] BTRFS info (device dm-9): relocating block group
15799943168 flags 34
[ 1838.684892] BTRFS info (device dm-9): relocating block group
10934550528 flags 36
[ 1839.301453] [ cut here ]
[ 1839.301495] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:1625
lookup_inline_extent_backref+0x45c/0x5a0 [btrfs]()

followed by

[ 1839.301797] WARNING: CPU: 3 PID: 76 at fs/btrfs/extent-tree.c:2946
btrfs_run_delayed_refs+0x29d/0x2d0 [btrfs]()
[ 1839.301798] BTRFS: Transaction aborted (error -5)
[...]
[ 1839.301972] BTRFS: error (device dm-9) in
btrfs_run_delayed_refs:2946: errno=-5 IO failure
[ 1839.301975] BTRFS info (device dm-9): forced readonly

So it looks like it was resuming a balance automatically, and while
processing delayed references it's running into something it doesn't
expect and doesn't have a way to fix, so it goes read only to avoid
causing more problems.

I would do a couple things in order:
1. Mount ro and copy off what you want in case the whole thing gets
worse and can't ever be mounted again.
2. Mount with only these options: -o skip_balance,subvolid=5,nospace_cache

If it mounts rw, don't do anything with it, just see if it cleans up
after itself. It also looks from the previous trace it was trying to
remove a snapshot and there are complaints of problems in that
snapshot. So hopefully just waiting 5 minutes doing nothing and it'll
clean up after itself (you can check with top to see if there are any
btrfs related transactions that run including the btrfs-cleaner
process) wait until they're done.

Then umount. If you want you could have two other consoles ready
first, one for 'journalctl -f' and another for sysrq+t to issue in
case you get a hang. This doesn't fix anything but it collects more
information for a bug report for the devs.

Once you get it umounted normally or by force, the next thing to do is

3. btrfs-image so that devs can see what's causing the problem that
the current code isn't handling well enough.
4. btrfs check --repair

Let's see the results of that repair. You can use 'script
btrfsrepair.txt' first and then 'btrfs check --repair', and it will log
everything. After btrfs check completes, use 'exit' to stop script
from recording, and you should have a btrfsrepair.txt file you can post
somewhere. When using '>' redirection, not everything gets logged for
some reason, but script will capture everything.
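
A minimal sketch of that capture, using the device path from earlier in
this thread:

  script btrfsrepair.txt
  btrfs check --repair /dev/mapper/vg_pupu-lv_opensuse_root
  exit    # ends the 'script' session; btrfsrepair.txt now holds the full log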

Depending on how the repair goes, there might be a couple more options left.



-- 
Chris Murphy


[GIT PULL 1/2] Btrfs

2016-06-25 Thread Chris Mason
Hi Linus,

I have a two part pull this time because one of the patches Dave Sterba
collected needed to be against v4.7-rc2 or higher (we used rc4).  I try
to make my for-linus-xx branch testable on top of the last major release
so we can hand fixes to people on the list more easily, so I've split
this pull in two.

My for-linus-4.7 branch has some fixes and two performance improvements
that we've been testing for some time.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.7

Josef's two performance fixes are most notable.  The transid tracking
patch makes a big improvement on pretty much every workload.

Josef Bacik (2) commits (+38/-27):
Btrfs: don't do nocow check unless we have to (+22/-22)
Btrfs: track transid for delayed ref flushing (+16/-5)

Liu Bo (1) commits (+11/-2):
Btrfs: fix error handling in map_private_extent_buffer

Chris Mason (1) commits (+11/-9):
btrfs: fix deadlock in delayed_ref_async_start

Wei Yongjun (1) commits (+1/-1):
Btrfs: fix error return code in btrfs_init_test_fs()

Chandan Rajendra (1) commits (+4/-6):
Btrfs: Force stripesize to the value of sectorsize

Wang Xiaoguang (1) commits (+2/-1):
btrfs: fix disk_i_size update bug when fallocate() fails

Total: (7) commits (+67/-46)

 fs/btrfs/ctree.c |  6 +-
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/disk-io.c   |  6 ++
 fs/btrfs/extent-tree.c   | 15 +--
 fs/btrfs/extent_io.c |  7 ++-
 fs/btrfs/file.c  | 44 ++--
 fs/btrfs/inode.c |  1 +
 fs/btrfs/ordered-data.c  |  3 ++-
 fs/btrfs/tests/btrfs-tests.c |  2 +-
 fs/btrfs/transaction.c   |  3 ++-
 fs/btrfs/volumes.c   |  4 ++--
 11 files changed, 57 insertions(+), 36 deletions(-)


[BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-25 Thread Goffredo Baroncelli
Hi all,

following the thread "Adventures in btrfs raid5 disk recovery", I investigated
a bit the BTRFS capability to scrub a corrupted raid5 filesystem. To test it, I
first found where a file was stored, and then I tried to corrupt the data disks
(while unmounted) or the parity disk.
The results showed that sometimes the kernel recomputed the parity wrongly.

I tested the following kernels
- 4.6.1
- 4.5.4
and both showed the same behavior.

The test was performed as described below:

1) create a filesystem in raid5 mode (for data and metadata) of 1500MB 

truncate -s 500M disk1.img; losetup -f disk1.img
truncate -s 500M disk2.img; losetup -f disk2.img
truncate -s 500M disk3.img; losetup -f disk3.img
sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
sudo mount /dev/loop0 mnt/

2) I created a file with a length of 128KiB:

python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
sudo umount mnt/

3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was able to 
find where the file stripe is located:

/dev/loop0: offset=81788928+16*4096(64k, second half of the file: 
'bd.)
/dev/loop1: offset=61865984+16*4096(64k, first half of the file: 
'ad.)
/dev/loop2: offset=61865984+16*4096(64k, parity: 
'\x03\x00\x03\x03\x03.)

4) I tried to corrupt each disk (one disk per test), and then run a scrub:

for example for the disk /dev/loop2:
sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
seek=$((61865984+16*4096)) count=5
sudo mount /dev/loop0 mnt
sudo btrfs scrub start mnt/.

5) I check the disks at the offsets above, to verify that the data/parity is 
correct

However I found that:
1) if I corrupt the parity disk (/dev/loop2), scrub doesn't find any corruption,
but recomputes the parity (always correctly);

2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find the 
corruption. But I found two main behaviors:

2.a) the kernel repaired the damage, but computed the wrong parity. Where the
parity was, the kernel copied the data of the second disk onto the parity disk

2.b) the kernel repaired the damage, and rebuilt a correct parity

I have to point out another strange thing: in dmesg I found two kinds of 
messages:

msg1)
[]
[ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
[ 1021.366949] BTRFS: has skinny extents
[ 1021.399208] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, 
length 4096, links 1 (path: out.txt)
[ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 1, gen 0
[ 1021.399291] BTRFS error (device loop2): fixed up error at logical 
142802944 on dev /dev/loop0

msg2)
[ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
[ 1017.435074] BTRFS: has skinny extents
[ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 1, gen 0
[ 1017.463403] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872,  root 5, inode 257, offset 
65536, length 4096, links 1 (path: out.txt)
[ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 2, gen 0
[ 1017.463467] BTRFS warning (device loop2): checksum error at logical 
142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, 
length 4096, links 1 (path: out.txt)
[ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, 
rd 0, flush 0, corrupt 3, gen 0
[ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) 
error at logical 142802944 on dev /dev/loop0
[ 1017.463535] BTRFS error (device loop2): fixed up error at logical 
142802944 on dev /dev/loop0


but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)

Another strangeness is that SCRUB sometimes reports
 ERROR: there are uncorrectable errors
and sometimes reports
 WARNING: errors detected during scrubbing, corrected

but these also seem UNrelated to behavior 2.a) or 2.b), or to msg1 or msg2


Enclosed you can find the script which I used to trigger the bug. I have to
rerun it several times to show the problem because it doesn't happen every
time. Note that the offsets and the loop device names are hard coded.
You must run the script from the same directory where it lives, e.g. "bash test.sh".
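
A rough sketch of the procedure in steps 1-5, with the offsets from step 3
hard coded (illustrative only, not the attached test.sh):

  #!/bin/bash
  # Rough reproduction sketch; run as root from the directory holding mnt/
  set -e

  truncate -s 500M disk1.img; losetup -f disk1.img
  truncate -s 500M disk2.img; losetup -f disk2.img
  truncate -s 500M disk3.img; losetup -f disk3.img

  mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[0-2]
  mount /dev/loop0 mnt/
  python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" > mnt/out.txt
  umount mnt/

  # Corrupt a few bytes of one strip while unmounted (here: the parity on /dev/loop2)
  dd if=/dev/zero of=/dev/loop2 bs=1 seek=$((61865984+16*4096)) count=5

  mount /dev/loop0 mnt/
  btrfs scrub start -B mnt/
  umount mnt/

  # Inspect the strips at the known offsets to see what scrub wrote back
  dd if=/dev/loop0 bs=1 skip=$((81788928+16*4096)) count=16 2>/dev/null | xxd
  dd if=/dev/loop1 bs=1 skip=$((61865984+16*4096)) count=16 2>/dev/null | xxd
  dd if=/dev/loop2 bs=1 skip=$((61865984+16*4096)) count=16 2>/dev/null | xxd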

Br
G.Baroncelli


 
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




test.sh
Description: Bourne shell script