Re: Seeking Help on Corruption Issues

2017-10-04 Thread Hugo Mills
On Tue, Oct 03, 2017 at 03:49:25PM -0700, Stephen Nesbitt wrote:
> 
> On 10/3/2017 2:11 PM, Hugo Mills wrote:
> >Hi, Stephen,
> >
> >On Tue, Oct 03, 2017 at 08:52:04PM +, Stephen Nesbitt wrote:
> >>Here it is. There are a couple of out-of-order entries beginning at 117. And
> >>yes I did uncover a bad stick of RAM:
> >>
> >>btrfs-progs v4.9.1
> >>leaf 2589782867968 items 134 free space 6753 generation 3351574 owner 2
> >>fs uuid 24b768c3-2141-44bf-ae93-1c3833c8c8e3
> >>chunk uuid 19ce12f0-d271-46b8-a691-e0d26c1790c6
> >[snip]
> >>item 116 key (1623012749312 EXTENT_ITEM 45056) itemoff 10908 itemsize 53
> >>extent refs 1 gen 3346444 flags DATA
> >>extent data backref root 271 objectid 2478 offset 0 count 1
> >>item 117 key (1621939052544 EXTENT_ITEM 8192) itemoff 10855 itemsize 53
> >>extent refs 1 gen 3346495 flags DATA
> >>extent data backref root 271 objectid 21751764 offset 6733824 count 1
> >>item 118 key (1623012450304 EXTENT_ITEM 8192) itemoff 10802 itemsize 53
> >>extent refs 1 gen 3351513 flags DATA
> >>extent data backref root 271 objectid 5724364 offset 680640512 count 1
> >>item 119 key (1623012802560 EXTENT_ITEM 12288) itemoff 10749 itemsize 53
> >>extent refs 1 gen 3346376 flags DATA
> >>extent data backref root 271 objectid 21751764 offset 6701056 count 1
> > >>> hex(1623012749312)
> >'0x179e3193000'
> > >>> hex(1621939052544)
> >'0x179a319e000'
> > >>> hex(1623012450304)
> >'0x179e314a000'
> > >>> hex(1623012802560)
> >'0x179e31a0000'
> >
> >That's "e" -> "a" in the fourth hex digit, which is a single-bit
> >flip, and should be fixable by btrfs check (I think). However, even
> >fixing that, it's not ordered, because 118 is then before 117, which
> >could be another bitflip ("9" -> "4" in the 7th digit), but two bad
> >bits that close to each other seems unlikely to me.
> >
> >Hugo.
> 
> Hope this is a duplicate reply - I might have fat fingered something.
> 
> The underlying file is disposable/replaceable. Any way to zero
> out/zap the bad BTRFS entry?

   Not really. Even trying to delete the related file(s), it's going
to fall over when reading the metadata in the first place. (The key
order check is a metadata invariant, like the csum checks and transid
checks.)
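
   For illustration only, the key order invariant can be sketched in
Python. This is not the kernel's or btrfs-progs' code: it assumes keys
compare as (objectid, type, offset) tuples, uses 168 as the numeric
value of EXTENT_ITEM, and the helper name first_bad_slot is made up for
this example. The four keys are the ones from the dump quoted above.

```python
EXTENT_ITEM = 168  # assumed numeric key type for EXTENT_ITEM

# (slot, key) pairs copied from the dump; key = (objectid, type, offset)
items = [
    (116, (1623012749312, EXTENT_ITEM, 45056)),
    (117, (1621939052544, EXTENT_ITEM, 8192)),
    (118, (1623012450304, EXTENT_ITEM, 8192)),
    (119, (1623012802560, EXTENT_ITEM, 12288)),
]

def first_bad_slot(items):
    """Return the first slot whose key is not strictly below the next key, else None."""
    for (slot, key), (_, next_key) in zip(items, items[1:]):
        if key >= next_key:  # tuples compare field by field, left to right
            return slot
    return None

print(first_bad_slot(items))  # 116, the slot reported in dmesg
```

   A leaf that fails a test like this is rejected as a whole when it
is read, which is why deleting the file does not get past it.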

   At best, you'd have to get btrfs check to fix it. It should be able
to manage a single-bit error, but you've got two single-bit errors in
close proximity, and I'm not sure it'll be able to deal with it. Might
be worth trying it. The FS _might_ blow up as a result of an attempted
fix, but you say it's replaceable, so that's kind of OK. The worst I'd
_expect_ to happen with btrfs check --repair is that it just won't be
able to deal with it and you're left where you started.

   Go for it.

   Hugo.

-- 
Hugo Mills | You shouldn't anthropomorphise computers. They
hugo@... carfax.org.uk | really don't like that.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Seeking Help on Corruption Issues

2017-10-03 Thread Stephen Nesbitt


On 10/3/2017 2:11 PM, Hugo Mills wrote:
>Hi, Stephen,
>
>On Tue, Oct 03, 2017 at 08:52:04PM +, Stephen Nesbitt wrote:
>>Here it is. There are a couple of out-of-order entries beginning at 117. And
>>yes I did uncover a bad stick of RAM:
>>
>>btrfs-progs v4.9.1
>>leaf 2589782867968 items 134 free space 6753 generation 3351574 owner 2
>>fs uuid 24b768c3-2141-44bf-ae93-1c3833c8c8e3
>>chunk uuid 19ce12f0-d271-46b8-a691-e0d26c1790c6
>[snip]
>>item 116 key (1623012749312 EXTENT_ITEM 45056) itemoff 10908 itemsize 53
>>extent refs 1 gen 3346444 flags DATA
>>extent data backref root 271 objectid 2478 offset 0 count 1
>>item 117 key (1621939052544 EXTENT_ITEM 8192) itemoff 10855 itemsize 53
>>extent refs 1 gen 3346495 flags DATA
>>extent data backref root 271 objectid 21751764 offset 6733824 count 1
>>item 118 key (1623012450304 EXTENT_ITEM 8192) itemoff 10802 itemsize 53
>>extent refs 1 gen 3351513 flags DATA
>>extent data backref root 271 objectid 5724364 offset 680640512 count 1
>>item 119 key (1623012802560 EXTENT_ITEM 12288) itemoff 10749 itemsize 53
>>extent refs 1 gen 3346376 flags DATA
>>extent data backref root 271 objectid 21751764 offset 6701056 count 1
> >>> hex(1623012749312)
>'0x179e3193000'
> >>> hex(1621939052544)
>'0x179a319e000'
> >>> hex(1623012450304)
>'0x179e314a000'
> >>> hex(1623012802560)
>'0x179e31a0000'
>
>That's "e" -> "a" in the fourth hex digit, which is a single-bit
>flip, and should be fixable by btrfs check (I think). However, even
>fixing that, it's not ordered, because 118 is then before 117, which
>could be another bitflip ("9" -> "4" in the 7th digit), but two bad
>bits that close to each other seems unlikely to me.
>
>Hugo.


Hope this is a duplicate reply - I might have fat fingered something.

The underlying file is disposable/replaceable. Any way to zero out/zap 
the bad BTRFS entry?


-steve

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Seeking Help on Corruption Issues

2017-10-03 Thread Hugo Mills
   Hi, Stephen,

On Tue, Oct 03, 2017 at 08:52:04PM +, Stephen Nesbitt wrote:
> Here it is. There are a couple of out-of-order entries beginning at 117. And
> yes I did uncover a bad stick of RAM:
> 
> btrfs-progs v4.9.1
> leaf 2589782867968 items 134 free space 6753 generation 3351574 owner 2
> fs uuid 24b768c3-2141-44bf-ae93-1c3833c8c8e3
> chunk uuid 19ce12f0-d271-46b8-a691-e0d26c1790c6
[snip]
> item 116 key (1623012749312 EXTENT_ITEM 45056) itemoff 10908 itemsize 53
> extent refs 1 gen 3346444 flags DATA
> extent data backref root 271 objectid 2478 offset 0 count 1
> item 117 key (1621939052544 EXTENT_ITEM 8192) itemoff 10855 itemsize 53
> extent refs 1 gen 3346495 flags DATA
> extent data backref root 271 objectid 21751764 offset 6733824 count 1
> item 118 key (1623012450304 EXTENT_ITEM 8192) itemoff 10802 itemsize 53
> extent refs 1 gen 3351513 flags DATA
> extent data backref root 271 objectid 5724364 offset 680640512 count 1
> item 119 key (1623012802560 EXTENT_ITEM 12288) itemoff 10749 itemsize 53
> extent refs 1 gen 3346376 flags DATA
> extent data backref root 271 objectid 21751764 offset 6701056 count 1

>>> hex(1623012749312)
'0x179e3193000'
>>> hex(1621939052544)
'0x179a319e000'
>>> hex(1623012450304)
'0x179e314a000'
>>> hex(1623012802560)
'0x179e31a0000'

   That's "e" -> "a" in the fourth hex digit, which is a single-bit
flip, and should be fixable by btrfs check (I think). However, even
fixing that, it's not ordered, because 118 is then before 117, which
could be another bitflip ("9" -> "4" in the 7th digit), but two bad
bits that close to each other seems unlikely to me.
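
   A quick way to check guesses like these is to XOR the two values
and count the set bits. The sketch below is only an illustration: the
helper name bits_differing is made up, and the "expected" key for item
117 is an assumption (the dumped value with its fourth hex digit
changed back from "a" to "e").

```python
def bits_differing(a, b):
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

item117 = 1621939052544    # 0x179a319e000, as dumped
guess117 = 0x179e319e000   # assumption: fourth hex digit restored to "e"

print(hex(item117 ^ guess117))            # 0x40000000: the bit that changed
print(bits_differing(item117, guess117))  # 1, i.e. a single-bit flip

# The 7th hex digit of items 116 and 118: "9" versus "4"
print(bits_differing(0x9, 0x4))           # 3
```

   The last line prints 3 because 0x9 is 1001 and 0x4 is 0100 in
binary, so that digit differs in three bits rather than one.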

   Hugo.

-- 
Hugo Mills | Great films about cricket: Silly Point Break
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Seeking Help on Corruption Issues

2017-10-03 Thread Hugo Mills
On Tue, Oct 03, 2017 at 01:06:50PM -0700, Stephen Nesbitt wrote:
> All:
> 
> I came back to my computer yesterday to find my filesystem in
> read-only mode. Running btrfs scrub start -dB aborts as follows:
> 
> btrfs scrub start -dB /mnt
> ERROR: scrubbing /mnt failed for device id 4: ret=-1, errno=5
> (Input/output error)
> ERROR: scrubbing /mnt failed for device id 5: ret=-1, errno=5
> (Input/output error)
> scrub device /dev/sdb (id 4) canceled
>     scrub started at Mon Oct  2 21:51:46 2017 and was aborted after
> 00:09:02
>     total bytes scrubbed: 75.58GiB with 1 errors
>     error details: csum=1
>     corrected errors: 0, uncorrectable errors: 1, unverified errors: 0
> scrub device /dev/sdc (id 5) canceled
>     scrub started at Mon Oct  2 21:51:46 2017 and was aborted after
> 00:11:11
>     total bytes scrubbed: 50.75GiB with 0 errors
> 
> The resulting dmesg is:
> [  699.534066] BTRFS error (device sdc): bdev /dev/sdb errs: wr 0,
> rd 0, flush 0, corrupt 6, gen 0
> [  699.703045] BTRFS error (device sdc): unable to fixup (regular)
> error at logical 1609808347136 on dev /dev/sdb
> [  783.306525] BTRFS critical (device sdc): corrupt leaf, bad key
> order: block=2589782867968, root=1, slot=116

   This error usually means bad RAM. Can you show us the output of
"btrfs-debug-tree -b 2589782867968 /dev/sdc"?

   Hugo.

> [  789.776132] BTRFS critical (device sdc): corrupt leaf, bad key
> order: block=2589782867968, root=1, slot=116
> [  911.529842] BTRFS critical (device sdc): corrupt leaf, bad key
> order: block=2589782867968, root=1, slot=116
> [  918.365225] BTRFS critical (device sdc): corrupt leaf, bad key
> order: block=2589782867968, root=1, slot=116
> 
> Running btrfs check /dev/sdc results in:
> btrfs check /dev/sdc
> Checking filesystem on /dev/sdc
> UUID: 24b768c3-2141-44bf-ae93-1c3833c8c8e3
> checking extents
> bad key ordering 116 117
> bad block 2589782867968
> ERROR: errors found in extent allocation tree or chunk allocation
> checking free space cache
> There is no free space entry for 1623012450304-1623012663296
> There is no free space entry for 1623012450304-1623225008128
> cache appears valid but isn't 1622151266304
> found 288815742976 bytes used err is -22
> total csum bytes: 0
> total tree bytes: 350781440
> total fs tree bytes: 0
> total extent tree bytes: 350027776
> btree space waste bytes: 115829777
> file data blocks allocated: 156499968
> 
> uname -a:
> Linux sysresccd 4.9.24-std500-amd64 #2 SMP Sat Apr 22 17:14:43 UTC
> 2017 x86_64 Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz GenuineIntel
> GNU/Linux
> 
> btrfs --version: btrfs-progs v4.9.1
> 
> btrfs fi show:
> Label: none  uuid: 24b768c3-2141-44bf-ae93-1c3833c8c8e3
>     Total devices 2 FS bytes used 475.08GiB
>     devid    4 size 931.51GiB used 612.06GiB path /dev/sdb
>     devid    5 size 931.51GiB used 613.09GiB path /dev/sdc
> 
> btrfs fi df /mnt:
> Data, RAID1: total=603.00GiB, used=468.03GiB
> System, RAID1: total=64.00MiB, used=112.00KiB
> System, single: total=32.00MiB, used=0.00B
> Metadata, RAID1: total=9.00GiB, used=7.04GiB
> Metadata, single: total=1.00GiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> What is the recommended procedure at this point? Run btrfs check
> --repair? I have backups so losing a file or two isn't critical, but
> I really don't want to go through the effort of a bare metal
> reinstall.
> 
> In the process of researching this I did uncover a bad DIMM. Am I
> correct that the problems I'm seeing are likely linked to the
> resulting memory errors?
> 
> Thx in advance,
> 
> -steve
> 

-- 
Hugo Mills | Quidquid latine dictum sit, altum videtur
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Seeking Help on Corruption Issues

2017-10-03 Thread Stephen Nesbitt

All:

I came back to my computer yesterday to find my filesystem in read-only
mode. Running btrfs scrub start -dB aborts as follows:


btrfs scrub start -dB /mnt
ERROR: scrubbing /mnt failed for device id 4: ret=-1, errno=5 (Input/output error)
ERROR: scrubbing /mnt failed for device id 5: ret=-1, errno=5 (Input/output error)
scrub device /dev/sdb (id 4) canceled
    scrub started at Mon Oct  2 21:51:46 2017 and was aborted after 00:09:02
    total bytes scrubbed: 75.58GiB with 1 errors
    error details: csum=1
    corrected errors: 0, uncorrectable errors: 1, unverified errors: 0
scrub device /dev/sdc (id 5) canceled
    scrub started at Mon Oct  2 21:51:46 2017 and was aborted after 00:11:11
    total bytes scrubbed: 50.75GiB with 0 errors

The resulting dmesg is:
[  699.534066] BTRFS error (device sdc): bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
[  699.703045] BTRFS error (device sdc): unable to fixup (regular) error at logical 1609808347136 on dev /dev/sdb
[  783.306525] BTRFS critical (device sdc): corrupt leaf, bad key order: block=2589782867968, root=1, slot=116
[  789.776132] BTRFS critical (device sdc): corrupt leaf, bad key order: block=2589782867968, root=1, slot=116
[  911.529842] BTRFS critical (device sdc): corrupt leaf, bad key order: block=2589782867968, root=1, slot=116
[  918.365225] BTRFS critical (device sdc): corrupt leaf, bad key order: block=2589782867968, root=1, slot=116


Running btrfs check /dev/sdc results in:
btrfs check /dev/sdc
Checking filesystem on /dev/sdc
UUID: 24b768c3-2141-44bf-ae93-1c3833c8c8e3
checking extents
bad key ordering 116 117
bad block 2589782867968
ERROR: errors found in extent allocation tree or chunk allocation
checking free space cache
There is no free space entry for 1623012450304-1623012663296
There is no free space entry for 1623012450304-1623225008128
cache appears valid but isn't 1622151266304
found 288815742976 bytes used err is -22
total csum bytes: 0
total tree bytes: 350781440
total fs tree bytes: 0
total extent tree bytes: 350027776
btree space waste bytes: 115829777
file data blocks allocated: 156499968

uname -a:
Linux sysresccd 4.9.24-std500-amd64 #2 SMP Sat Apr 22 17:14:43 UTC 2017 x86_64 Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz GenuineIntel GNU/Linux


btrfs --version: btrfs-progs v4.9.1

btrfs fi show:
Label: none  uuid: 24b768c3-2141-44bf-ae93-1c3833c8c8e3
    Total devices 2 FS bytes used 475.08GiB
    devid    4 size 931.51GiB used 612.06GiB path /dev/sdb
    devid    5 size 931.51GiB used 613.09GiB path /dev/sdc

btrfs fi df /mnt:
Data, RAID1: total=603.00GiB, used=468.03GiB
System, RAID1: total=64.00MiB, used=112.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, RAID1: total=9.00GiB, used=7.04GiB
Metadata, single: total=1.00GiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

What is the recommended procedure at this point? Run btrfs check 
--repair? I have backups so losing a file or two isn't critical, but I 
really don't want to go through the effort of a bare metal reinstall.


In the process of researching this I did uncover a bad DIMM. Am I 
correct that the problems I'm seeing are likely linked to the resulting 
memory errors?


Thx in advance,

-steve
