RE: raid6 file system in a bad state

2016-10-17 Thread Jason D. Michaelson
I've been following that thread. It's been my fear.

I'm in the process of restoring what I can get off of it so that I can 
re-create the file system with raid1 which, if I'm reading that thread 
correctly, doesn't suffer from the read-modify-write (rmw) problems present 
in the raid5/6 code at the moment.
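
Once the restore finishes, the recreate step would look something like this 
(a sketch only; the device list is illustrative, not my actual drive letters):

# wipe and recreate with raid1 for both metadata and data
mkfs.btrfs -f -m raid1 -d raid1 /dev/sda /dev/sdd /dev/sde /dev/sdf /dev/sdg

With -m raid1 -d raid1 every block is a plain second copy on another device, 
so repair reads a mirror instead of reconstructing from parity, which is what 
sidesteps the rmw path entirely.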

Again, thanks for your help.

> -Original Message-
> From: ch...@colorremedies.com [mailto:ch...@colorremedies.com] On
> Behalf Of Chris Murphy
> Sent: Friday, October 14, 2016 4:55 PM
> To: Chris Murphy
> Cc: Jason D. Michaelson; Btrfs BTRFS
> Subject: Re: raid6 file system in a bad state
> 
> This may be relevant and is pretty terrible.
> 
> http://www.spinics.net/lists/linux-btrfs/msg59741.html
> 
> 
> Chris Murphy



Re: raid6 file system in a bad state

2016-10-14 Thread Chris Murphy
This may be relevant and is pretty terrible.

http://www.spinics.net/lists/linux-btrfs/msg59741.html


Chris Murphy


Re: raid6 file system in a bad state

2016-10-12 Thread Chris Murphy
On Wed, Oct 12, 2016 at 11:59 AM, Jason D. Michaelson
 wrote:

> With the bad disc in place:
>
> root@castor:~/btrfs-progs# ./btrfs restore -t 4844272943104 -D  /dev/sda 
> /dev/null
> parent transid verify failed on 4844272943104 wanted 161562 found 161476
> parent transid verify failed on 4844272943104 wanted 161562 found 161476
> checksum verify failed on 4844272943104 found E808AB28 wanted 0CEB169E
> checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640
> checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640
> bytenr mismatch, want=4844272943104, have=66211125067776
> Couldn't read tree root
> Could not open root, trying backup super
> warning, device 6 is missing
> warning, device 5 is missing
> warning, device 4 is missing
> warning, device 3 is missing
> warning, device 2 is missing
> checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
> checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
> bytenr mismatch, want=20971520, have=267714560
> ERROR: cannot read chunk root
> Could not open root, trying backup super
> warning, device 6 is missing
> warning, device 5 is missing
> warning, device 4 is missing
> warning, device 3 is missing
> warning, device 2 is missing
> checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
> checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
> bytenr mismatch, want=20971520, have=267714560
> ERROR: cannot read chunk root
> Could not open root, trying backup super


Don't all of these device-missing messages seem bogus? I don't know
how to find out what's going on here. If it were me, I'd try to
reproduce this with a couple of distros' live images (Fedora Rawhide
and openSUSE Tumbleweed), and if both reproduce this "missing" output,
I'd file a bugzilla.kernel.org bug with an strace. I mean, this stuff
is hard enough as it is without bugs like this getting in the way.
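
Capturing the trace would look something like this (a sketch; the output
file name is arbitrary):

# -f follows forked children, -o writes the trace to a file for the bug report
strace -f -o btrfs-restore.trace btrfs restore -D /dev/sda /dev/null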

Fedora 25 nightly:
https://kojipkgs.fedoraproject.org/compose/branched/Fedora-25-20161008.n.0/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-25-20161008.n.0.iso

That'll have some version of kernel 4.8, not sure which one. And it
will have btrfs-progs 4.6.1, which is safe but showing its age in
Btrfs years.

You can do this from inside the live environment:

sudo dnf update
https://kojipkgs.fedoraproject.org//packages/btrfs-progs/4.7.3/1.fc26/x86_64/btrfs-progs-4.7.3-1.fc26.x86_64.rpm

or

sudo dnf update
https://kojipkgs.fedoraproject.org//packages/btrfs-progs/4.8.1/1.fc26/x86_64/btrfs-progs-4.8.1-1.fc26.x86_64.rpm

It's probably just as valid to do this with whatever you have now:
strace it and file a bug. But that doesn't conclusively isolate
whether the problem is local or not.


>
> And what's interesting is that when I move /dev/sdd (the current bad 
> disc) out of /dev, rescan, and run btrfs restore with the main root, I get 
> similar output:
>
> root@castor:~/btrfs-progs# ./btrfs restore -D  /dev/sda /dev/null
> warning, device 2 is missing
> checksum verify failed on 21430272 found 71001E6E wanted 95E3A3D8
> checksum verify failed on 21430272 found 992E0C37 wanted 36992D8B
> checksum verify failed on 21430272 found 992E0C37 wanted 36992D8B
> bytenr mismatch, want=21430272, have=264830976
> Couldn't read chunk tree
> Could not open root, trying backup super
> warning, device 6 is missing
> warning, device 5 is missing
> warning, device 4 is missing
> warning, device 3 is missing
> warning, device 2 is missing
> checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
> checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
> bytenr mismatch, want=20971520, have=267714560
> ERROR: cannot read chunk root
> Could not open root, trying backup super
> warning, device 6 is missing
> warning, device 5 is missing
> warning, device 4 is missing
> warning, device 3 is missing
> warning, device 2 is missing
> checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
> checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
> bytenr mismatch, want=20971520, have=267714560
> ERROR: cannot read chunk root
> Could not open root, trying backup super
>
> So it doesn't seem to work, but the difference in output between the two, at 
> least to my untrained eyes, is intriguing, to say the least.

Yeah I'm not sure what to recommend now.



>
>>
>>
>> OK at this point I'm thinking that fixing the super blocks won't change
>> anything because it sounds like it's using the new ones anyway and
>> maybe the thing to try is going back to a tree root that isn't in any
>> of the new supers. That means losing anything that was being written
>> when the lost writes happened. However, for all we know some overwrites
>> happened so this won't work. And also it does nothing to deal with the
>> fragile state of having at least two flaky devices, and one of the
>> system chunks with no redundancy.
>>
>
> This is the one thing I'm not following you on. I know there's one device 
> 

RE: raid6 file system in a bad state

2016-10-12 Thread Jason D. Michaelson


> -Original Message-
> From: ch...@colorremedies.com [mailto:ch...@colorremedies.com] On
> Behalf Of Chris Murphy
> Sent: Tuesday, October 11, 2016 3:38 PM
> To: Jason D. Michaelson; Btrfs BTRFS
> Cc: Chris Murphy
> Subject: Re: raid6 file system in a bad state
> 
> readding btrfs
> 
> On Tue, Oct 11, 2016 at 1:00 PM, Jason D. Michaelson
>  wrote:
> >
> >
> >> -Original Message-
> >> From: ch...@colorremedies.com [mailto:ch...@colorremedies.com] On
> >> Behalf Of Chris Murphy
> >> Sent: Tuesday, October 11, 2016 12:41 PM
> >> To: Jason D. Michaelson
> >> Cc: Chris Murphy; Btrfs BTRFS
> >> Subject: Re: raid6 file system in a bad state
> >>
> >> On Tue, Oct 11, 2016 at 10:10 AM, Jason D. Michaelson
> >>  wrote:
> >> > superblock: bytenr=65536, device=/dev/sda
> >> > -
> >> > generation  161562
> >> > root5752616386560
> >>
> >>
> >>
> >> > superblock: bytenr=65536, device=/dev/sdh
> >> > -
> >> > generation  161474
> >> > root4844272943104
> >>
> >> OK so the most obvious thing is that the bad super is many
> >> generations behind the good super. That's expected given all the
> >> write errors.
> >>
> >>
> >
> > Is there any chance/way of going back to use this generation/root as
> a source for btrfs restore?
> 
> Yes, with the -t option and that root bytenr for the generation you want
> to restore. Thing is, that's so far back that the metadata may already
> be gone (overwritten). But it's worth a shot; I've recovered recently
> deleted files this way.

With the bad disc in place:

root@castor:~/btrfs-progs# ./btrfs restore -t 4844272943104 -D  /dev/sda 
/dev/null
parent transid verify failed on 4844272943104 wanted 161562 found 161476
parent transid verify failed on 4844272943104 wanted 161562 found 161476
checksum verify failed on 4844272943104 found E808AB28 wanted 0CEB169E
checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640
checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640
bytenr mismatch, want=4844272943104, have=66211125067776
Couldn't read tree root
Could not open root, trying backup super
warning, device 6 is missing
warning, device 5 is missing
warning, device 4 is missing
warning, device 3 is missing
warning, device 2 is missing
checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
bytenr mismatch, want=20971520, have=267714560
ERROR: cannot read chunk root
Could not open root, trying backup super
warning, device 6 is missing
warning, device 5 is missing
warning, device 4 is missing
warning, device 3 is missing
warning, device 2 is missing
checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
bytenr mismatch, want=20971520, have=267714560
ERROR: cannot read chunk root
Could not open root, trying backup super

And what's interesting is that when I move /dev/sdd (the current bad disc) 
out of /dev, rescan, and run btrfs restore with the main root, I get similar 
output:

root@castor:~/btrfs-progs# ./btrfs restore -D  /dev/sda /dev/null
warning, device 2 is missing
checksum verify failed on 21430272 found 71001E6E wanted 95E3A3D8
checksum verify failed on 21430272 found 992E0C37 wanted 36992D8B
checksum verify failed on 21430272 found 992E0C37 wanted 36992D8B
bytenr mismatch, want=21430272, have=264830976
Couldn't read chunk tree
Could not open root, trying backup super
warning, device 6 is missing
warning, device 5 is missing
warning, device 4 is missing
warning, device 3 is missing
warning, device 2 is missing
checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
bytenr mismatch, want=20971520, have=267714560
ERROR: cannot read chunk root
Could not open root, trying backup super
warning, device 6 is missing
warning, device 5 is missing
warning, device 4 is missing
warning, device 3 is missing
warning, device 2 is missing
checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB
bytenr mismatch, want=20971520, have=267714560
ERROR: cannot read chunk root
Could not open root, trying backup super

So it doesn't seem to work, but the difference in output between the two, at 
least to my untrained eyes, is intriguing, to say the least.

> 
> 
> OK at this point I'm thinking that fixing the super blocks won't c

Re: raid6 file system in a bad state

2016-10-11 Thread Chris Murphy
readding btrfs

On Tue, Oct 11, 2016 at 1:00 PM, Jason D. Michaelson
 wrote:
>
>
>> -Original Message-
>> From: ch...@colorremedies.com [mailto:ch...@colorremedies.com] On
>> Behalf Of Chris Murphy
>> Sent: Tuesday, October 11, 2016 12:41 PM
>> To: Jason D. Michaelson
>> Cc: Chris Murphy; Btrfs BTRFS
>> Subject: Re: raid6 file system in a bad state
>>
>> On Tue, Oct 11, 2016 at 10:10 AM, Jason D. Michaelson
>>  wrote:
>> > superblock: bytenr=65536, device=/dev/sda
>> > -
>> > generation  161562
>> > root5752616386560
>>
>>
>>
>> > superblock: bytenr=65536, device=/dev/sdh
>> > -
>> > generation  161474
>> > root4844272943104
>>
>> OK so the most obvious thing is that the bad super is many generations
>> behind the good super. That's expected given all the write errors.
>>
>>
>
> Is there any chance/way of going back to use this generation/root as a source 
> for btrfs restore?

Yes, with the -t option and that root bytenr for the generation you want
to restore. Thing is, that's so far back that the metadata may already
be gone (overwritten). But it's worth a shot; I've recovered recently
deleted files this way.
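
Concretely, something like this (a sketch; -D makes it a dry run, so
nothing is written):

btrfs restore -t 4844272943104 -D /dev/sda /dev/null

If the dry run lists the files you care about, drop -D and point the last
argument at a real destination directory instead of /dev/null.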


OK at this point I'm thinking that fixing the super blocks won't
change anything because it sounds like it's using the new ones anyway
and maybe the thing to try is going back to a tree root that isn't in
any of the new supers. That means losing anything that was being
written when the lost writes happened. However, for all we know some
overwrites happened so this won't work. And also it does nothing to
deal with the fragile state of having at least two flaky devices, and
one of the system chunks with no redundancy.


Try 'btrfs check' without repair, and then also try it with the -r flag
using the various tree roots we've seen so far. Try explicitly using
5752616386560, which is what it ought to use first anyway, and then
also 4844272943104.
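
Something like this (a sketch; without --repair these are read-only, and
any member device should do as the argument):

btrfs check /dev/sda
btrfs check -r 5752616386560 /dev/sda
btrfs check -r 4844272943104 /dev/sda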

That might go back far enough to predate the bad sectors. Normally
you'd want it to use one of the backup roots, but it consistently runs
into a problem with all of them when using the recovery mount option.





-- 
Chris Murphy


Re: raid6 file system in a bad state

2016-10-11 Thread Chris Murphy
On Tue, Oct 11, 2016 at 10:10 AM, Jason D. Michaelson
 wrote:
> superblock: bytenr=65536, device=/dev/sda
> -
> generation  161562
> root5752616386560



> superblock: bytenr=65536, device=/dev/sdh
> -
> generation  161474
> root4844272943104

OK so the most obvious thing is that the bad super is many generations
behind the good super. That's expected given all the write errors.


>root@castor:~/logs# btrfs-find-root /dev/sda
>parent transid verify failed on 5752357961728 wanted 161562 found 159746
>parent transid verify failed on 5752357961728 wanted 161562 found 159746
>Couldn't setup extent tree
>Superblock thinks the generation is 161562
>Superblock thinks the level is 1


This squares with the good super, so btrfs-find-root is using a good
super. I don't know what 5752357961728 is for; maybe it's possible to
read that with btrfs-debug-tree -b 5752357961728 <device> and see what
comes back. This is not the tree root according to the super, though.
So what do you get for btrfs-debug-tree -b 5752616386560 <device>?

Going back to your logs


[   38.810575] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state
recovery directory
[   38.810595] NFSD: starting 90-second grace period (net b12e5b80)
[  241.292816] INFO: task bfad_worker:234 blocked for more than 120 seconds.
[  241.299135]   Not tainted 4.7.0-0.bpo.1-amd64 #1
[  241.305645] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

I don't know what this kernel is. I think you'd be better off with
stable 4.7.7 or 4.8.1 for this work, so you're not running into a
bunch of weird blocked task problems in addition to whatever is going
on with the fs.


[   20.552205] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
devid 3 transid 161562 /dev/sdd
[   20.552372] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
devid 5 transid 161562 /dev/sdf
[   20.552524] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
devid 6 transid 161562 /dev/sde
[   20.552689] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
devid 4 transid 161562 /dev/sdg
[   20.552858] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
devid 1 transid 161562 /dev/sda
[  669.843166] BTRFS warning (device sda): devid 2 uuid
dc8760f1-2c54-4134-a9a7-a0ac2b7a9f1c is missing
[232572.871243] sd 0:0:8:0: [sdh] tag#4 Sense Key : Medium Error [current]


Two items are missing, in effect, for this failed read: one literally
missing, and the other missing due to an unrecoverable read error.
The fact it's not trying to fix anything suggests it hasn't really
finished mounting; there must be something wrong where it either just
gets confused and won't fix (because it might make things worse) or
there isn't redundancy.


[52799.495999] mce: [Hardware Error]: Machine check events logged
[53249.491975] mce: [Hardware Error]: Machine check events logged
[231298.005594] mce: [Hardware Error]: Machine check events logged

Bunch of other hardware issues...

I *really* think you need to get the hardware issues sorted out before
working on this file system, unless you just don't care that much about
it. There are already enough unknowns without adding whatever effects
the hardware issues are having while you try to repair things, or even
to understand what's going on.



> sys_chunk_array[2048]:
> item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 0)
> chunk length 4194304 owner 2 stripe_len 65536
> type SYSTEM num_stripes 1
> stripe 0 devid 1 offset 0
> dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
> item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520)
> chunk length 11010048 owner 2 stripe_len 65536
> type SYSTEM|RAID6 num_stripes 6
> stripe 0 devid 6 offset 1048576
> dev uuid: 390a1fd8-cc6c-40e7-b0b5-88ca7dcbcc32
> stripe 1 devid 5 offset 1048576
> dev uuid: 2df974c5-9dde-4062-81e9-c613db62
> stripe 2 devid 4 offset 1048576
> dev uuid: dce3d159-721d-4859-9955-37a03769bb0d
> stripe 3 devid 3 offset 1048576
>  

RE: raid6 file system in a bad state

2016-10-11 Thread Jason D. Michaelson
> 
> 
> Bad superblocks can't be a good thing and would only cause confusion.
> I'd think that a known bad superblock would be ignored at mount time
> and even by btrfs-find-root, or maybe even replaced like any other kind
> of known bad metadata where good copies are available.
> 
> btrfs-show-super -f /dev/sda
> btrfs-show-super -f /dev/sdh
> 
> 
> Find out what the difference is between good and bad supers.
> 
root@castor:~# btrfs-show-super -f /dev/sda
superblock: bytenr=65536, device=/dev/sda
-
csum_type   0 (crc32c)
csum_size   4
csum0x45278835 [match]
bytenr  65536
flags   0x1
( WRITTEN )
magic   _BHRfS_M [match]
fsid73ed01df-fb2a-4b27-b6fc-12a57da934bd
label
generation  161562
root5752616386560
sys_array_size  354
chunk_root_generation   156893
root_level  1
chunk_root  20971520
chunk_root_level1
log_root0
log_root_transid0
log_root_level  0
total_bytes 18003557892096
bytes_used  7107627130880
sectorsize  4096
nodesize16384
leafsize16384
stripesize  4096
root_dir6
num_devices 6
compat_flags0x0
compat_ro_flags 0x0
incompat_flags  0xe1
( MIXED_BACKREF |
  BIG_METADATA |
  EXTENDED_IREF |
  RAID56 )
cache_generation161562
uuid_tree_generation161562
dev_item.uuid   08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
dev_item.fsid   73ed01df-fb2a-4b27-b6fc-12a57da934bd [match]
dev_item.type   0
dev_item.total_bytes3000592982016
dev_item.bytes_used 1800957198336
dev_item.io_align   4096
dev_item.io_width   4096
dev_item.sector_size4096
dev_item.devid  1
dev_item.dev_group  0
dev_item.seek_speed 0
dev_item.bandwidth  0
dev_item.generation 0
sys_chunk_array[2048]:
item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 0)
chunk length 4194304 owner 2 stripe_len 65536
type SYSTEM num_stripes 1
stripe 0 devid 1 offset 0
dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520)
chunk length 11010048 owner 2 stripe_len 65536
type SYSTEM|RAID6 num_stripes 6
stripe 0 devid 6 offset 1048576
dev uuid: 390a1fd8-cc6c-40e7-b0b5-88ca7dcbcc32
stripe 1 devid 5 offset 1048576
dev uuid: 2df974c5-9dde-4062-81e9-c613db62
stripe 2 devid 4 offset 1048576
dev uuid: dce3d159-721d-4859-9955-37a03769bb0d
stripe 3 devid 3 offset 1048576
dev uuid: 6f7142db-824c-4791-a5b2-d6ce11c81c8f
stripe 4 devid 2 offset 1048576
dev uuid: dc8760f1-2c54-4134-a9a7-a0ac2b7a9f1c
stripe 5 devid 1 offset 20971520
dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
backup_roots[4]:
backup 0:
backup_tree_root:   5752437456896   gen: 161561 level: 1
backup_chunk_root:  20971520gen: 156893 level: 1
backup_extent_root: 5752385224704   gen: 161561 level: 2
backup_fs_root: 124387328   gen: 74008  level: 0
backup_dev_root:5752437587968   gen: 161561 level: 1
backup_csum_root:   5752389615616   gen: 161561 level: 3
backup_total_bytes: 18003557892096
backup_bytes_used:  7112579833856
backup_num_devices: 6

backup 1:
backup_tree_root:   5752616386560   gen: 161562 level: 1
backup_chunk_root:  20971520gen: 156893 level: 1
backup_extent_root: 5752649416704   gen: 161563 level: 2
backup_fs_root: 124387328   gen: 74008  level: 0
backup_dev_root:5752616501248   gen: 161562 level: 1
backup_csum_root:   5752650203136   gen: 161563 level: 3
backup_total_bytes: 18003557892096
backup_bytes_used:  7107602407424
backup_num_devices: 6

backup 2:
backup_tree_root:   5752112103424   gen: 161559 level: 1
backup_chunk_root:  20971520gen: 156893 level: 1
backup_extent_root: 5752207409152   gen: 161560 level: 2
b

Re: raid6 file system in a bad state

2016-10-11 Thread Chris Murphy
On Tue, Oct 11, 2016 at 9:52 AM, Jason D. Michaelson
 wrote:

>> btrfs rescue super-recover -v <device>
>
> root@castor:~/logs# btrfs rescue super-recover -v /dev/sda
> All Devices:
> Device: id = 2, name = /dev/sdh
> Device: id = 3, name = /dev/sdd
> Device: id = 5, name = /dev/sdf
> Device: id = 6, name = /dev/sde
> Device: id = 4, name = /dev/sdg
> Device: id = 1, name = /dev/sda
>
> Before Recovering:
> [All good supers]:
> device name = /dev/sdd
> superblock bytenr = 65536
>
> device name = /dev/sdd
> superblock bytenr = 67108864
>
> device name = /dev/sdd
> superblock bytenr = 274877906944
>
> device name = /dev/sdf
> superblock bytenr = 65536
>
> device name = /dev/sdf
> superblock bytenr = 67108864
>
> device name = /dev/sdf
> superblock bytenr = 274877906944
>
> device name = /dev/sde
> superblock bytenr = 65536
>
> device name = /dev/sde
> superblock bytenr = 67108864
>
> device name = /dev/sde
> superblock bytenr = 274877906944
>
> device name = /dev/sdg
> superblock bytenr = 65536
>
> device name = /dev/sdg
> superblock bytenr = 67108864
>
> device name = /dev/sdg
> superblock bytenr = 274877906944
>
> device name = /dev/sda
> superblock bytenr = 65536
>
> device name = /dev/sda
> superblock bytenr = 67108864
>
> device name = /dev/sda
> superblock bytenr = 274877906944
>
> [All bad supers]:
> device name = /dev/sdh
> superblock bytenr = 65536
>
> device name = /dev/sdh
> superblock bytenr = 67108864
>
> device name = /dev/sdh
> superblock bytenr = 274877906944
>
>
> Make sure this is a btrfs disk otherwise the tool will destroy other fs, Are 
> you sure? [y/N]: n
> Aborted to recover bad superblocks
>
> I aborted this, waiting for instructions from the list on whether to proceed.


Bad superblocks can't be a good thing and would only cause confusion.
I'd think that a known bad superblock would be ignored at mount time
and even by btrfs-find-root, or maybe even replaced like any other
kind of known bad metadata where good copies are available.

btrfs-show-super -f /dev/sda
btrfs-show-super -f /dev/sdh


Find out what the difference is between good and bad supers.
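
A quick way to compare them (a sketch; the file names are arbitrary):

btrfs-show-super -f /dev/sda > super-good.txt
btrfs-show-super -f /dev/sdh > super-bad.txt
diff super-good.txt super-bad.txt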


-- 
Chris Murphy


RE: raid6 file system in a bad state

2016-10-11 Thread Jason D. Michaelson

> -Original Message-
> From: ch...@colorremedies.com [mailto:ch...@colorremedies.com] On
> Behalf Of Chris Murphy
> Sent: Monday, October 10, 2016 11:23 PM
> To: Jason D. Michaelson
> Cc: Chris Murphy; Btrfs BTRFS
> Subject: Re: raid6 file system in a bad state
> 
> What do you get for
> 
> btrfs-find-root <device>

root@castor:~/logs# btrfs-find-root /dev/sda
parent transid verify failed on 5752357961728 wanted 161562 found 159746
parent transid verify failed on 5752357961728 wanted 161562 found 159746
Couldn't setup extent tree
Superblock thinks the generation is 161562
Superblock thinks the level is 1

There's no further output, and btrfs-find-root is pegged at 100% CPU.

At the moment, the perceived bad disc is connected; I got the same results 
with it disconnected as well.

> btrfs rescue super-recover -v <device>

root@castor:~/logs# btrfs rescue super-recover -v /dev/sda
All Devices:
Device: id = 2, name = /dev/sdh
Device: id = 3, name = /dev/sdd
Device: id = 5, name = /dev/sdf
Device: id = 6, name = /dev/sde
Device: id = 4, name = /dev/sdg
Device: id = 1, name = /dev/sda

Before Recovering:
[All good supers]:
device name = /dev/sdd
superblock bytenr = 65536

device name = /dev/sdd
superblock bytenr = 67108864

device name = /dev/sdd
superblock bytenr = 274877906944

device name = /dev/sdf
superblock bytenr = 65536

device name = /dev/sdf
superblock bytenr = 67108864

device name = /dev/sdf
superblock bytenr = 274877906944

device name = /dev/sde
superblock bytenr = 65536

device name = /dev/sde
superblock bytenr = 67108864

device name = /dev/sde
superblock bytenr = 274877906944

device name = /dev/sdg
superblock bytenr = 65536

device name = /dev/sdg
superblock bytenr = 67108864

device name = /dev/sdg
superblock bytenr = 274877906944

device name = /dev/sda
superblock bytenr = 65536

device name = /dev/sda
superblock bytenr = 67108864

device name = /dev/sda
superblock bytenr = 274877906944

[All bad supers]:
device name = /dev/sdh
superblock bytenr = 65536

device name = /dev/sdh
superblock bytenr = 67108864

device name = /dev/sdh
superblock bytenr = 274877906944


Make sure this is a btrfs disk otherwise the tool will destroy other fs, Are 
you sure? [y/N]: n
Aborted to recover bad superblocks

I aborted this, waiting for instructions from the list on whether to proceed.

>   
> 
> 
> It shouldn't matter which dev you pick, unless it face plants, then try
> another.



Re: raid6 file system in a bad state

2016-10-10 Thread Chris Murphy
What do you get for

btrfs-find-root <device>
btrfs rescue super-recover -v <device>



It shouldn't matter which dev you pick, unless it face plants, then try another.


Re: raid6 file system in a bad state

2016-10-10 Thread Chris Murphy
On Mon, Oct 10, 2016 at 10:04 AM, Jason D. Michaelson
 wrote:

> One of the disks had a write problem, unbeknownst to me, which caused the
> entire pool and its subvolumes to remount read only.

Can you be more specific about the write problem? What are the
messages from the logs about these write problems, and is that problem
now fixed?
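
For reference, the running per-device error counters can be dumped with
something like this, assuming the file system is mountable at all:

btrfs device stats /mnt

That reports the same write/read/flush/corruption/generation counters the
kernel prints as "bdev ... errs" lines.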




>
> When this problem occurred I was on debian jessie kernel 3.16.something.
> Following list advice I upgraded to the latest in jessie-backports, 4.7.5.
> My git clone of btrfs-progs is at commit
> 81f4d96f3d6368dc4e5edf7e3cb9d19bb4d00c4f
>
> Not knowing the cause of the problem, I unmounted and attempted to remount,
> which failed, with the following coming from dmesg:
>
> [308063.610960] BTRFS info (device sda): allowing degraded mounts
> [308063.610972] BTRFS info (device sda): disk space caching is enabled
> [308063.723461] BTRFS error (device sda): parent transid verify failed on
> 5752357961728 wanted 161562 found 159746
> [308063.815224] BTRFS info (device sda): bdev /dev/sdh errs: wr 261, rd 1,
> flush 87, corrupt 0, gen 0
> [308063.849613] BTRFS error (device sda): parent transid verify failed on
> 5752642420736 wanted 161562 found 159786
> [308063.881024] BTRFS error (device sda): parent transid verify failed on
> 5752472338432 wanted 161562 found 159751
> [308063.940225] BTRFS error (device sda): parent transid verify failed on
> 5752478842880 wanted 161562 found 159752
> [308063.979517] BTRFS error (device sda): parent transid verify failed on
> 5752543526912 wanted 161562 found 159764
> [308064.012479] BTRFS error (device sda): parent transid verify failed on
> 5752513036288 wanted 161562 found 159764
> [308064.049169] BTRFS error (device sda): parent transid verify failed on
> 5752642617344 wanted 161562 found 159786
> [308064.080507] BTRFS error (device sda): parent transid verify failed on
> 5752642650112 wanted 161562 found 159786
> [308064.138951] BTRFS error (device sda): parent transid verify failed on
> 5752610603008 wanted 161562 found 159783
> [308064.164326] BTRFS error (device sda): bad tree block start
> 5918360357649457268 5752610603008
> [308064.173752] BTRFS error (device sda): bad tree block start
> 5567295971165396096 5752610603008
> [308064.182026] BTRFS error (device sda): failed to read block groups: -5
> [308064.234174] BTRFS: open_ctree failed

Sometimes it will be more tolerant with mount -o degraded,ro.
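
For example (a sketch; the mount point is arbitrary):

mount -o degraded,ro /dev/sda /mnt

Read-only also guarantees nothing gets written while you poke around.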



> /dev/sdh is the disc that had the write error
>


> [232578.796809] mpt2sas_cm0: log_info(0x3108): originator(PL),
> code(0x08), sub_code(0x)
> [232578.796838] sd 0:0:8:0: [sdh] tag#4 CDB: Read(16) 88 00 00 00 00 00 34
> 55 61 f0 00 00 00 40 00 00
> [232578.796845] mpt2sas_cm0:sas_address(0x50030480002e5946), phy(6)
> [232578.796850] mpt2sas_cm0:
> enclosure_logical_id(0x50030442523a2033),slot(2)
> [232578.796856] mpt2sas_cm0:handle(0x0012), ioc_status(success)(0x),
> smid(36)
> [232578.796860] mpt2sas_cm0:request_len(32768), underflow(32768),
> resid(0)
> [232578.796864] mpt2sas_cm0:tag(0), transfer_count(32768),
> sc->result(0x)
> [232578.796869] mpt2sas_cm0:scsi_status(check condition)(0x02),
> scsi_state(autosense valid )(0x01)
> [232578.796874] mpt2sas_cm0:[sense_key,asc,ascq]: [0x03,0x11,0x00],
> count(18)
> [232578.797129] sd 0:0:8:0: [sdh] tag#4 FAILED Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [232578.797138] sd 0:0:8:0: [sdh] tag#4 Sense Key : Medium Error [current]
> [232578.797146] sd 0:0:8:0: [sdh] tag#4 Add. Sense: Unrecovered read error
> [232578.797154] sd 0:0:8:0: [sdh] tag#4 CDB: Read(16) 88 00 00 00 00 00 34
> 55 61 f0 00 00 00 40 00 00
> [232578.797160] blk_update_request: critical medium error, dev sdh, sector
> 878010888

Each one of these complains about a read error, and a different LBA is
reported each time.

These should get fixed automatically by Btrfs since kernel 3.19. The
problem is that you were using 3.16, so they were left to accumulate;
a 3.16 kernel needed a full balance to fix these.
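
Forcing that rewrite looked something like this (a sketch; /mnt stands in
for the real mount point):

btrfs balance start /mnt

On 3.19 and later, a scrub (btrfs scrub start /mnt) repairs such blocks
from the remaining copies or parity as it encounters them.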




> [232581.663794] mpt2sas_cm0: log_info(0x3108): originator(PL),
> code(0x08), sub_code(0x)
> [232581.663823] sd 0:0:8:0: [sdh] tag#1 CDB: Read(16) 88 00 00 00 00 00 34
> 55 62 30 00 00 00 80 00 00
> [232581.663830] mpt2sas_cm0:sas_address(0x50030480002e5946), phy(6)
> [232581.663835] mpt2sas_cm0:
> enclosure_logical_id(0x50030442523a2033),slot(2)
> [232581.663841] mpt2sas_cm0:handle(0x0012), ioc_status(success)(0x),
> smid(62)
> [232581.663845] mpt2sas_cm0:request_len(65536), underflow(65536),
> resid(65536)
> [232581.663849] mpt2sas_cm0:tag(0), transfer_count(0),
> sc->result(0x)
> [232581.663854] mpt2sas_cm0:scsi_status(check condition)(0x02),
> scsi_state(autosense valid )(0x01)
> [232581.663859] mpt2sas_cm0:[sense_key,asc,ascq]: [0x03,0x11,0x00],
> count(18)
> [232581.663918] sd 0:0:8:0: [sdh] tag#1 FAILED Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [232581.663937] sd 0:0:8:0: [sdh] tag#1 Sense Key : Medium Error [