Re: help needed with raid 6 filesystem with errors

2021-03-30 Thread Zygo Blaxell
On Tue, Mar 30, 2021 at 07:20:14PM +0200, Bas Hulsken wrote:
> On Tue, 2021-03-30 at 11:46 -0400, Zygo Blaxell wrote:
> > On Tue, Mar 30, 2021 at 03:01:57PM +0200, Bas Hulsken wrote:
> > > I followed your advice, Zygo and Chris, and did both:
> > > 1) smartctl -l scterc,70,70 /dev/sdX for all 4 drives in the array
> > > (the drives do support this)
> > > 2) echo 180 > /sys/block/sdX/device/timeout for all 4 drives
> > > 
> > > with that I attempted another scrub (on the single failing device,
> > > not on the filesystem), but with bad results again. The drive is
> > > basically still not responsive after the first error, this is the
> > > error according to smartctl:
> > > 
> > > Error 4 occurred at disk power-on lifetime: 7124 hours (296 days + 20
> > > hours)
> > >   When the command that caused the error occurred, the device was
> > > active or idle.
> > > 
> > >   After command completion occurred, registers were:
> > >   ER ST SC SN CL CH DH
> > >   -- -- -- -- -- -- --
> > >   40 41 98 68 24 00 40  Error: UNC at LBA = 0x2468 = 9320
> > > 
> > >   Commands leading to the command that caused the error were:
> > >   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
> > >   -- -- -- -- -- -- -- --  ----------------  --------------------
> > >   60 80 00 80 25 00 40 00   2d+20:57:19.109  READ FPDMA QUEUED
> > >   60 80 f8 80 23 00 40 00   2d+20:57:16.423  READ FPDMA QUEUED
> > >   60 80 f0 80 21 00 40 00   2d+20:57:16.422  READ FPDMA QUEUED
> > >   60 80 e8 80 1f 00 40 00   2d+20:57:16.421  READ FPDMA QUEUED
> > >   60 80 e0 80 1d 00 40 00   2d+20:57:16.420  READ FPDMA QUEUED
> > > 
> > > The other errors are all the same (Error: UNC at LBA = 0x2468 =
> > > 9320), and at exactly the same LBA. Once scrub gets to this LBA,
> > > the drive basically no longer responds, and querying it with
> > > smartctl will return garbage characters, or nothing at all. I've
> > > attached a dmesg including the io errors this time.
> > > 
> > > So: I conclude scrub is not going to fix this problem, and I should
> > > really replace the disk.
> > 
> > Agreed.  It is now properly configured, there are UNC sectors logged
> > in SMART, and UNC recovery is still not working.  The drive is broken
> > and will likely stay that way.
> > 
> > > @Zygo: following your advice, and using btrfs replace -r with the
> > > failing drive online, I take it it reads only sectors from the
> > > failing disk if at least 2 other disks are failing at that spot
> > > (given it's raid6), correct? 
> > 
> > That's the general idea.
> > 
> > > If so I would be comfortable giving that a shot. I do
> > > expect that while doing a replace and reading the same LBA from the
> > > disk, it will just crash again and ruin my replace.
> > 
> > There's still another redundant disk in the array, so there's no need
> > to put too much effort into recovering one failing drive.  The disk
> > seems really broken, so take it offline and do a replace in degraded
> > mode.
> 
> Thanks for the clear help and explanations, I have 2 final questions
> (famous last words :-) )
> 1) In your earlier reply you mentioned known bugs, including the
> "Spurious read errors in btrfs raid5 degraded mode". Would replacing
> with "-r" while the faulty drive is still online not prevent this from
> happening? 

The bug only seems to affect kernel read code, which tries to avoid
unnecessary reads so it has a lot of special cases (not degraded,
degraded, P corrupted, Q corrupted...).  Scrub uses different code which
is much simpler, always reads the entire stripe at once, and doesn't
seem to be affected by the read bug.  Replace is implemented as a special
case of scrub internally, so it has the same read behavior as scrub.

In testing I've always hit the spurious read failures with reads and
never with scrub or replace.

> Assuming the replace speed is similar to the scrub speed,

Replace speed will be the same as the scrub speed _for scrub on one
drive_.

Running scrub on all disks at once will dramatically reduce performance
compared to running scrub on a single disk (or even each disk one at
a time).  If you have been running scrub with a mountpoint argument
instead of individual devices, then it has been running scrubs on all
disks in parallel (i.e. competing with each other), and taking far longer
than it could have.
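
For example, something along these lines (the device names here are just
placeholders for the four members of this array):

    # scrub one member device at a time; -B waits for each to finish
    for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        btrfs scrub start -B "$dev"
    done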

> I'm looking at 4 days to replace the drive; I'd prefer to keep using
> the filesystem while that happens. Otherwise wiping it and restoring
> from backup might actually be the fastest option.

I would avoid using it as much as possible during the replace.  I have
tested running raid5 in degraded mode with a fully active read/write
workload, and there were a handful of lost data blocks (17, 84K) on a
20TB restore.  I don't know if the extra Q disk for raid6 helps; raid6
is not something I'm testing so far.

> 2) If I go the offline way, how would I actually do that? I do not see
> a command in the btrfs manual to flag a disk as faulty, or any other
> command to move into "degraded" mode. I could unplug it while powered
> off, of course; is that the best / only way?

Re: help needed with raid 6 filesystem with errors

2021-03-30 Thread Bas Hulsken
On Tue, 2021-03-30 at 11:46 -0400, Zygo Blaxell wrote:
> On Tue, Mar 30, 2021 at 03:01:57PM +0200, Bas Hulsken wrote:
> > I followed your advice, Zygo and Chris, and did both:
> > 1) smartctl -l scterc,70,70 /dev/sdX for all 4 drives in the array
> > (the drives do support this)
> > 2) echo 180 > /sys/block/sdX/device/timeout for all 4 drives
> > 
> > with that I attempted another scrub (on the single failing device,
> > not on the filesystem), but with bad results again. The drive is
> > basically still not responsive after the first error, this is the
> > error according to smartctl:
> > 
> > Error 4 occurred at disk power-on lifetime: 7124 hours (296 days + 20
> > hours)
> >   When the command that caused the error occurred, the device was
> > active or idle.
> > 
> >   After command completion occurred, registers were:
> >   ER ST SC SN CL CH DH
> >   -- -- -- -- -- -- --
> >   40 41 98 68 24 00 40  Error: UNC at LBA = 0x2468 = 9320
> > 
> >   Commands leading to the command that caused the error were:
> >   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
> >   -- -- -- -- -- -- -- --  ----------------  --------------------
> >   60 80 00 80 25 00 40 00   2d+20:57:19.109  READ FPDMA QUEUED
> >   60 80 f8 80 23 00 40 00   2d+20:57:16.423  READ FPDMA QUEUED
> >   60 80 f0 80 21 00 40 00   2d+20:57:16.422  READ FPDMA QUEUED
> >   60 80 e8 80 1f 00 40 00   2d+20:57:16.421  READ FPDMA QUEUED
> >   60 80 e0 80 1d 00 40 00   2d+20:57:16.420  READ FPDMA QUEUED
> > 
> > The other errors are all the same (Error: UNC at LBA = 0x2468 =
> > 9320), and at exactly the same LBA. Once scrub gets to this LBA,
> > the drive basically no longer responds, and querying it with
> > smartctl will return garbage characters, or nothing at all. I've
> > attached a dmesg including the io errors this time.
> > 
> > So: I conclude scrub is not going to fix this problem, and I should
> > really replace the disk.
> 
> Agreed.  It is now properly configured, there are UNC sectors logged
> in SMART, and UNC recovery is still not working.  The drive is broken
> and will likely stay that way.
> 
> > @Zygo: following your advice, and using btrfs replace -r with the
> > failing drive online, I take it it reads only sectors from the
> > failing disk if at least 2 other disks are failing at that spot
> > (given it's raid6), correct? 
> 
> That's the general idea.
> 
> > If so I would be comfortable giving that a shot. I do
> > expect that while doing a replace and reading the same LBA from the
> > disk, it will just crash again and ruin my replace.
> 
> There's still another redundant disk in the array, so there's no need
> to put too much effort into recovering one failing drive.  The disk
> seems really broken, so take it offline and do a replace in degraded
> mode.

Thanks for the clear help and explanations, I have 2 final questions
(famous last words :-) )
1) In your earlier reply you mentioned known bugs, including the
"Spurious read errors in btrfs raid5 degraded mode". Would replacing
with "-r" while the faulty drive is still online not prevent this from
happening? Assuming the replace speed is similar to the scrub speed,
I'm looking at 4 days to replace the drive; I'd prefer to keep using
the filesystem while that happens. Otherwise wiping it and restoring
from backup might actually be the fastest option.
2) If I go the offline way, how would I actually do that? I do not see
a command in the btrfs manual to flag a disk as faulty, or any other
command to move into "degraded" mode. I could unplug it while powered
off, of course; is that the best / only way?


> 
> > thanks!
> > 
> > 
> > On Mon, 2021-03-29 at 17:05 -0400, Zygo Blaxell wrote:
> > > On Mon, Mar 29, 2021 at 02:03:06PM +0200, Bas Hulsken wrote:
> > > > Dear list,
> > > 
> > > > due to a disk intermittently failing in my 4 disk array, I'm
> > > > getting "transid verify failed" errors on my btrfs filesystem
> > > > (see attached dmesg | grep -i btrfs dump in btrfs_dmesg.txt). 
> > > 
> > > Scary!  But in this case, it looks like they were automatically
> > > recovered already.
> > > 
> > > > When I run a scrub,
> > > > the bad disk (/dev/sdd) becomes unresponsive, so I'm hesitant to
> > > > try that again (happened 3 times now, and was the root cause of
> > > > the transid verify failed errors possibly, at least they did not
> > > > show up earlier than the failed scrub). 
> > > 
> > > That is quite common when disks fail.  The extra IO load results
> > > in a firmware crash, either due to failure of the electronics
> > > disrupting the embedded CPU so it can't run any program correctly,
> > > or an error condition in the rest of the disk that the firmware
> > > doesn't handle properly.  Any unflushed writes in the write cache
> > > at this time are lost.  Lost metadata writes will result in parent
> > > transid verify failures later on.

Re: help needed with raid 6 filesystem with errors

2021-03-30 Thread Zygo Blaxell
On Tue, Mar 30, 2021 at 03:01:57PM +0200, Bas Hulsken wrote:
> I followed your advice, Zygo and Chris, and did both:
> 1) smartctl -l scterc,70,70 /dev/sdX for all 4 drives in the array (the
> drives do support this)
> 2) echo 180 > /sys/block/sdX/device/timeout for all 4 drives
> 
> with that I attempted another scrub (on the single failing device, not
> on the filesystem), but with bad results again. The drive is basically
> still not responsive after the first error, this is the error according
> to smartctl:
> 
> Error 4 occurred at disk power-on lifetime: 7124 hours (296 days + 20
> hours)
>   When the command that caused the error occurred, the device was
> active or idle.
> 
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 41 98 68 24 00 40  Error: UNC at LBA = 0x2468 = 9320
> 
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   60 80 00 80 25 00 40 00   2d+20:57:19.109  READ FPDMA QUEUED
>   60 80 f8 80 23 00 40 00   2d+20:57:16.423  READ FPDMA QUEUED
>   60 80 f0 80 21 00 40 00   2d+20:57:16.422  READ FPDMA QUEUED
>   60 80 e8 80 1f 00 40 00   2d+20:57:16.421  READ FPDMA QUEUED
>   60 80 e0 80 1d 00 40 00   2d+20:57:16.420  READ FPDMA QUEUED
> 
> The other errors are all the same (Error: UNC at LBA = 0x2468 =
> 9320), and at exactly the same LBA. Once scrub gets to this LBA, the
> drive basically no longer responds, and querying it with smartctl will
> return garbage characters, or nothing at all. I've attached a dmesg
> including the io errors this time.
> 
> So: I conclude scrub is not going to fix this problem, and I should
> really replace the disk.

Agreed.  It is now properly configured, there are UNC sectors logged in
SMART, and UNC recovery is still not working.  The drive is broken and
will likely stay that way.

> @Zygo: following your advice, and using btrfs replace -r with the
> failing drive online, I take it it reads only sectors from the failing
> disk if at least 2 other disks are failing at that spot (given it's
> raid6), correct? 

That's the general idea.

> If so I would be comfortable giving that a shot. I do
> expect that while doing a replace and reading the same LBA from the
> disk, it will just crash again and ruin my replace.

There's still another redundant disk in the array, so there's no need to
put too much effort into recovering one failing drive.  The disk seems
really broken, so take it offline and do a replace in degraded mode.
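
A rough sketch of how that degraded replace could look (the devid,
mountpoint and new-device name below are assumptions; check 'btrfs
filesystem show' for the devid actually reported missing):

    # with the bad disk disconnected / offline
    mount -o degraded /dev/sdg /mnt
    btrfs filesystem show /mnt           # note which devid is missing
    btrfs replace start 4 /dev/sde /mnt  # 4 = devid of the missing disk
    btrfs replace status /mnt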

> thanks!
> 
> 
> On Mon, 2021-03-29 at 17:05 -0400, Zygo Blaxell wrote:
> > On Mon, Mar 29, 2021 at 02:03:06PM +0200, Bas Hulsken wrote:
> > > Dear list,
> > 
> > > due to a disk intermittently failing in my 4 disk array, I'm getting
> > > "transid verify failed" errors on my btrfs filesystem (see attached
> > > dmesg | grep -i btrfs dump in btrfs_dmesg.txt). 
> > 
> > Scary!  But in this case, it looks like they were automatically
> > recovered already.
> > 
> > > When I run a scrub,
> > > the bad disk (/dev/sdd) becomes unresponsive, so I'm hesitant to try
> > > that again (happened 3 times now, and was the root cause of the
> > > transid verify failed errors possibly, at least they did not show
> > > up earlier than the failed scrub). 
> > 
> > That is quite common when disks fail.  The extra IO load results in a
> > firmware crash, either due to failure of the electronics disrupting the
> > embedded CPU so it can't run any program correctly, or an error
> > condition in the rest of the disk that the firmware doesn't handle
> > properly.
> > Any unflushed writes in the write cache at this time are lost.  Lost
> > metadata writes will result in parent transid verify failures later on.
> > 
> > Low end desktop drives have very large SCTERC timeouts but no SCTERC
> > controls, so they have very long IO error retry loops (2 minutes).
> > That can look like an intermittent failure in the logs, but in fact
> > it's
> > an ordinary remappable UNC sector.  The kernel has a default timeout
> > of 30 seconds, so the kernel forces a drive reset before the drive can
> > report the bad block.  The drive can often be used normally by setting
> > the kernel timeout with 'echo 180 > /sys/block/sd.../device/timeout'.
> > 
> > Whether you _want_ to use a disk with firmware that waits two full
> > minutes before reporting an IO error is a separate question, but this
> > is a feature of several popular cheap drive models, and you _can_ use
> > these disks if needed.
> > 
> > > A new disk is on its way to use btrfs replace,
> > > but I'm not sure whether that will be a wise choice for a filesystem
> > > with errors. There was never a crash/power failure, so the filesystem
> > > was unmounted at every reboot, but as said on 3 occasions (after a
> > > scrub), that unmount was with one of the four drives unresponsive.
> > 
> > Not

Re: help needed with raid 6 filesystem with errors

2021-03-29 Thread Zygo Blaxell
On Mon, Mar 29, 2021 at 02:03:06PM +0200, Bas Hulsken wrote:
> Dear list,

> due to a disk intermittently failing in my 4 disk array, I'm getting
> "transid verify failed" errors on my btrfs filesystem (see attached
> dmesg | grep -i btrfs dump in btrfs_dmesg.txt). 

Scary!  But in this case, it looks like they were automatically recovered
already.

> When I run a scrub,
> the bad disk (/dev/sdd) becomes unresponsive, so I'm hesitant to try
> that again (happened 3 times now, and was the root cause of the transid
> verify failed errors possibly, at least they did not show up earlier
> than the failed scrub). 

That is quite common when disks fail.  The extra IO load results in a
firmware crash, either due to failure of the electronics disrupting the
embedded CPU so it can't run any program correctly, or an error condition
in the rest of the disk that the firmware doesn't handle properly.
Any unflushed writes in the write cache at this time are lost.  Lost
metadata writes will result in parent transid verify failures later on.

Low end desktop drives have very large SCTERC timeouts but no SCTERC
controls, so they have very long IO error retry loops (2 minutes).
That can look like an intermittent failure in the logs, but in fact it's
an ordinary remappable UNC sector.  The kernel has a default timeout
of 30 seconds, so the kernel forces a drive reset before the drive can
report the bad block.  The drive can often be used normally by setting
the kernel timeout with 'echo 180 > /sys/block/sd.../device/timeout'.
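
For reference, both settings can be inspected per drive, e.g. (sdd here
is just an example):

    smartctl -l scterc /dev/sdd          # current SCT ERC read/write timeouts
    cat /sys/block/sdd/device/timeout    # current kernel command timeout, in seconds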

Whether you _want_ to use a disk with firmware that waits two full
minutes before reporting an IO error is a separate question, but this
is a feature of several popular cheap drive models, and you _can_ use
these disks if needed.

> A new disk is on its way to use btrfs replace,
> but I'm not sure whether that will be a wise choice for a filesystem
> with errors. There was never a crash/power failure, so the filesystem
> was unmounted at every reboot, but as said on 3 occasions (after a
> scrub), that unmount was with one of the four drives unresponsive.

Note that 

1.  in the logs each distinct bytenr occurs exactly once (more
precisely, "not more than N - 1 times for a RAID profile with
N copies"), and

2.  it is immediately followed by 4x "read error corrected"

e.g.

> [38079.437411] BTRFS error (device sdg): parent transid verify failed on 12884760723456 wanted 360620 found 359101
> [38079.457879] BTRFS info (device sdg): read error corrected: ino 0 off 12884760723456 (dev /dev/sdd sector 12559526656)
> [38079.459418] BTRFS info (device sdg): read error corrected: ino 0 off 12884760727552 (dev /dev/sdd sector 12559526664)
> [38079.460390] BTRFS info (device sdg): read error corrected: ino 0 off 12884760731648 (dev /dev/sdd sector 12559526672)
> [38079.460585] BTRFS info (device sdg): read error corrected: ino 0 off 12884760735744 (dev /dev/sdd sector 12559526680)

Metadata pages are 16K by default, and filesystem pages are 4K on
amd64/x86/arm/aarch64, so these 4 "read error corrected" lines are btrfs
replacing one broken metadata page on sdd using data from other mirrors.
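
(As a quick sanity check on that: the four corrected offsets above step
by 4096 bytes each, 12884760723456 through 12884760735744, i.e. 4 x 4K
= one 16K metadata page, and the sdd sectors likewise step by 8 x 512
bytes = 4K.)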

If you are using raid1* metadata, this is part of normal recovery from
a disk failure.

The failing disk will not be keeping up with metadata updates (because
it's failing, you can't assume it will be doing anything correctly).
Writes will be lost on sdd that are not lost on the other mirror drives.
btrfs will continue without error as long as at least one mirror drive
is OK.  btrfs will notice during later reads that some metadata pages
are not up to date on the failing disk, and correct the failing disk
using redundant copies of the metadata from the other mirrors.

Similar correction is applied to data when the csums do not match.

nodatacow files (which do not have csums) will be corrupted.  That is
part of the cost of nodatacow--no recovery from data corruption errors.

> Funnily enough, after a reboot every time the filesystem gets mounted
> without issues (the unresponsive drive is back online), and btrfs
> check --readonly claims the filesystem has no errors (see attached
> btrfs_sdd_check.txt).

There's no error on the disk by the time you run btrfs check or reboot and
mount again.  "read error corrected" means correct data was written back
to the failing disk.  With UNC sector remapping in disk firmware, btrfs
could even repair the UNC sector so the disk is no longer failing.
The only hint would be "Reallocated sector count" in SMART stats--and
only if you are lucky enough to have that count reported accurately by
your drive firmware.
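
Something like the following would show it (sdd as an example; the
attribute name varies a little between vendors):

    smartctl -A /dev/sdd | grep -i reallocated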

Possibly there are still errors on disk, but btrfs check didn't
happen to read that particular block from that particular mirror.
btrfs check won't verify th

Re: help needed with raid 6 filesystem with errors

2021-03-29 Thread Chris Murphy
On Mon, Mar 29, 2021 at 4:22 AM Bas Hulsken  wrote:
>
> Dear list,
>
> due to a disk intermittently failing in my 4 disk array, I'm getting
> "transid verify failed" errors on my btrfs filesystem (see attached
> dmesg | grep -i btrfs dump in btrfs_dmesg.txt). When I run a scrub, the
> bad disk (/dev/sdd) becomes unresponsive, so I'm hesitant to try that
> again (happened 3 times now, and was the root cause of the transid
> verify failed errors possibly, at least they did not show up earlier
> than the failed scrub).

Is the dmesg filtered? An unfiltered dmesg might help show what is going
on when the drive becomes unresponsive, whether it's spitting out any
kind of errors itself or whether there are kernel link reset messages.

Check if the drive supports SCT ERC.

smartctl -l scterc /dev/sdX

If it does but it isn't enabled, enable it. Do this for all the drives.

smartctl -l scterc,70,70 /dev/sdX

That will result in the drive giving up on errors much sooner rather
than doing the very slow "deep recovery" on reads. If this goes beyond
30 seconds, the kernel's command timer will think the device is
unresponsive and issue a link reset which is ... bad for this use
case. You really want the drive to error out quickly and allow Btrfs
to do the fixups.

If you can't configure the SCT ERC on the drives, you'll need to
increase the kernel command timeout, which is a per-device value in
/sys/block/sdX/device/timeout - the default is 30, and chances are 180
is enough (which sounds terribly high, and it is, but reportedly some
consumer drives can have such high timeouts).

Basically you want the drive timeout to be shorter than the kernel's.
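
Roughly, for all members of the array (a sketch; the drive names are just
an example, and check the actual smartctl output on your own drives):

    for d in sda sdb sdc sdd; do
        if smartctl -l scterc,70,70 /dev/$d | grep -q 'seconds'; then
            echo "$d: SCT ERC set to 7.0 seconds"
        else
            # no SCT ERC support: raise the kernel command timeout instead
            echo 180 > /sys/block/$d/device/timeout
        fi
    done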

> A new disk is on its way to use btrfs replace,
> but I'm not sure whether that will be a wise choice for a filesystem
> with errors. There was never a crash/power failure, so the filesystem
> was unmounted at every reboot, but as said on 3 occasions (after a
> scrub), that unmount was with one of the four drives unresponsive.

The least amount of risk is to not change anything. When you do the
replace, make sure you use a recent btrfs-progs and use 'btrfs replace'
instead of 'btrfs device add/remove'.

https://lore.kernel.org/linux-btrfs/20200627032414.gx10...@hungrycats.org/
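
For example, a replace of the flaky drive might look roughly like this
(a sketch; /dev/sdd, /dev/sde and /mnt are placeholders for the failing
disk, the new disk and the mountpoint):

    # -r: only read from /dev/sdd when the data can't be reconstructed
    # from the other devices
    btrfs replace start -r /dev/sdd /dev/sde /mnt
    btrfs replace status /mnt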

If metadata is raid5 too, or if it's not already using space_cache v2,
I'd probably leave it alone until after the flakey device is replaced.


> Funnily enough, after a reboot every time the filesystem gets mounted
> without issues (the unresponsive drive is back online), and btrfs check
> --readonly claims the filesystem has no errors (see attached
> btrfs_sdd_check.txt).

I'd take advantage of its cooperative moment by making sure backups
are fresh in case things get worse.

> Not sure what to do next, so seeking your advice! The important data on
> the drive is backed up, and I'll be running a verify to see if there
> are any corruptions overnight. Would still like to try to save the
> filesystem if possible though.



-- 
Chris Murphy