Re: help needed with raid 6 filesystem with errors
On Tue, Mar 30, 2021 at 07:20:14PM +0200, Bas Hulsken wrote:
> On Tue, 2021-03-30 at 11:46 -0400, Zygo Blaxell wrote:
> > On Tue, Mar 30, 2021 at 03:01:57PM +0200, Bas Hulsken wrote:
> > > I followed your advice, Zygo and Chris, and did both:
> > > 1) smartctl -l scterc,70,70 /dev/sdX for all 4 drives in the array
> > > (the drives do support this)
> > > 2) echo 180 > /sys/block/sdX/device/timeout for all 4 drives
> > >
> > > with that I attempted another scrub (on the single failing device,
> > > not on the filesystem), but with bad results again. The drive is
> > > basically still not responsive after the first error, this is the
> > > error according to smartctl:
> > >
> > > Error 4 occurred at disk power-on lifetime: 7124 hours (296 days + 20 hours)
> > > When the command that caused the error occurred, the device was
> > > active or idle.
> > >
> > > After command completion occurred, registers were:
> > > ER ST SC SN CL CH DH
> > > -- -- -- -- -- -- --
> > > 40 41 98 68 24 00 40  Error: UNC at LBA = 0x2468 = 9320
> > >
> > > Commands leading to the command that caused the error were:
> > > CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
> > > -- -- -- -- -- -- -- --  ---------------  --------------------
> > > 60 80 00 80 25 00 40 00  2d+20:57:19.109  READ FPDMA QUEUED
> > > 60 80 f8 80 23 00 40 00  2d+20:57:16.423  READ FPDMA QUEUED
> > > 60 80 f0 80 21 00 40 00  2d+20:57:16.422  READ FPDMA QUEUED
> > > 60 80 e8 80 1f 00 40 00  2d+20:57:16.421  READ FPDMA QUEUED
> > > 60 80 e0 80 1d 00 40 00  2d+20:57:16.420  READ FPDMA QUEUED
> > >
> > > The other errors are all the same (Error: UNC at LBA = 0x2468 =
> > > 9320), and at exactly the same LBA; once scrub gets to this LBA,
> > > the drive basically no longer responds, and querying it with
> > > smartctl will return garbage characters, or nothing at all. I've
> > > attached a dmesg with also the io errors this time.
> > >
> > > So: I conclude scrub is not going to fix this problem, and I
> > > should really replace the disk.
> >
> > Agreed. It is now properly configured, there are UNC sectors logged
> > in SMART, and UNC recovery is still not working. The drive is
> > broken and will likely stay that way.
> >
> > > @Zygo: following your advice, and using btrfs replace -r with the
> > > failing drive online, I take it it reads only sectors from the
> > > failing disk if at least 2 other disks are failing at that spot
> > > (given it's raid6), correct?
> >
> > That's the general idea.
> >
> > > If so I would be comfortable giving that a shot. I do expect that
> > > while doing a replace and reading the same LBA from the disk, it
> > > will just crash again and ruin my replace.
> >
> > There's still another redundant disk in the array, so there's no
> > need to put too much effort into recovering one failing drive.
> > The disk seems really broken, so take it offline and do a replace
> > in degraded mode.
>
> Thanks for the clear help and explanations, I have 2 final questions
> (famous last words :-) )
>
> 1) In your earlier reply you mentioned known bugs, including the
> "Spurious read errors in btrfs raid5 degraded mode". Would replacing
> with "-r" while the faulty drive is still online not prevent this
> from happening?

The bug only seems to affect the kernel read code, which tries to avoid
unnecessary reads and so has a lot of special cases (not degraded,
degraded, P corrupted, Q corrupted...). Scrub uses different code which
is much simpler, always reads the entire stripe at once, and doesn't
seem to be affected by the read bug. Replace is implemented internally
as a special case of scrub, so it has the same read behavior as scrub.
In testing I've always hit the spurious read failures with plain reads
and never with scrub or replace.

> Assuming the replace speed is similar to the scrub speed,

Replace speed will be the same as the scrub speed _for scrub on one
drive_. Running scrub on all disks at once will dramatically reduce
performance compared to running scrub on a single disk (or even on each
disk one at a time). If you have been running scrub with a mountpoint
argument instead of individual devices, then it has been scrubbing all
disks in parallel (i.e. competing with each other), and taking far
longer than it could have.

> I'm looking at 4 days to replace the drive, would prefer if I could
> keep using the filesystem while that happens.. otherwise wiping it
> and restoring from backup might actually be the fastest option.

I would avoid using it as much as possible during the replace. I have
tested running raid5 in degraded mode with a fully active read/write
workload, and there were a handful of lost data blocks (17, 84K) on a
20TB restore. I don't know if the extra Q disk for raid6 helps; raid6
is not something I'm testing so far.

> 2) If I go the offline way, how would I actually do that? I do not s
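The per-device scrub advice above can be sketched as a short loop. The device names and the DRY_RUN guard are assumptions for illustration, not part of the original advice; with DRY_RUN=1 (the default here, for safety) the commands are only printed, never executed.

```shell
#!/bin/sh
# Sketch: scrub one device at a time instead of handing scrub the
# mountpoint (which scrubs every device in parallel, so the disks
# compete with each other for seeks).  DRIVES is a hypothetical list --
# substitute the members of your own array.
DRIVES="${DRIVES:-/dev/sda /dev/sdb /dev/sdc /dev/sdd}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print instead of execute unless DRY_RUN is explicitly disabled.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

for dev in $DRIVES; do
    # -B keeps scrub in the foreground, so the loop does not start the
    # next device until the current one has finished.
    run btrfs scrub start -B "$dev"
done
```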
Re: help needed with raid 6 filesystem with errors
On Tue, 2021-03-30 at 11:46 -0400, Zygo Blaxell wrote:
> On Tue, Mar 30, 2021 at 03:01:57PM +0200, Bas Hulsken wrote:
> > I followed your advice, Zygo and Chris, and did both:
> > 1) smartctl -l scterc,70,70 /dev/sdX for all 4 drives in the array
> > (the drives do support this)
> > 2) echo 180 > /sys/block/sdX/device/timeout for all 4 drives
> >
> > with that I attempted another scrub (on the single failing device,
> > not on the filesystem), but with bad results again. The drive is
> > basically still not responsive after the first error, this is the
> > error according to smartctl:
> >
> > Error 4 occurred at disk power-on lifetime: 7124 hours (296 days + 20 hours)
> > When the command that caused the error occurred, the device was
> > active or idle.
> >
> > After command completion occurred, registers were:
> > ER ST SC SN CL CH DH
> > -- -- -- -- -- -- --
> > 40 41 98 68 24 00 40  Error: UNC at LBA = 0x2468 = 9320
> >
> > Commands leading to the command that caused the error were:
> > CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
> > -- -- -- -- -- -- -- --  ---------------  --------------------
> > 60 80 00 80 25 00 40 00  2d+20:57:19.109  READ FPDMA QUEUED
> > 60 80 f8 80 23 00 40 00  2d+20:57:16.423  READ FPDMA QUEUED
> > 60 80 f0 80 21 00 40 00  2d+20:57:16.422  READ FPDMA QUEUED
> > 60 80 e8 80 1f 00 40 00  2d+20:57:16.421  READ FPDMA QUEUED
> > 60 80 e0 80 1d 00 40 00  2d+20:57:16.420  READ FPDMA QUEUED
> >
> > The other errors are all the same (Error: UNC at LBA = 0x2468 =
> > 9320), and at exactly the same LBA; once scrub gets to this LBA,
> > the drive basically no longer responds, and querying it with
> > smartctl will return garbage characters, or nothing at all. I've
> > attached a dmesg with also the io errors this time.
> >
> > So: I conclude scrub is not going to fix this problem, and I should
> > really replace the disk.
>
> Agreed. It is now properly configured, there are UNC sectors logged
> in SMART, and UNC recovery is still not working. The drive is broken
> and will likely stay that way.
>
> > @Zygo: following your advice, and using btrfs replace -r with the
> > failing drive online, I take it it reads only sectors from the
> > failing disk if at least 2 other disks are failing at that spot
> > (given it's raid6), correct?
>
> That's the general idea.
>
> > If so I would be comfortable giving that a shot. I do expect that
> > while doing a replace and reading the same LBA from the disk, it
> > will just crash again and ruin my replace.
>
> There's still another redundant disk in the array, so there's no need
> to put too much effort into recovering one failing drive. The disk
> seems really broken, so take it offline and do a replace in degraded
> mode.

Thanks for the clear help and explanations, I have 2 final questions
(famous last words :-) )

1) In your earlier reply you mentioned known bugs, including the
"Spurious read errors in btrfs raid5 degraded mode". Would replacing
with "-r" while the faulty drive is still online not prevent this from
happening? Assuming the replace speed is similar to the scrub speed,
I'm looking at 4 days to replace the drive; I would prefer to keep
using the filesystem while that happens.. otherwise wiping it and
restoring from backup might actually be the fastest option.

2) If I go the offline way, how would I actually do that? I do not see
a command in the btrfs manual to flag a disk as faulty, or any other
command to move into "degraded" mode. I could unplug it while powered
off, of course; is that the best / only way?

> > thanks!
> >
> > On Mon, 2021-03-29 at 17:05 -0400, Zygo Blaxell wrote:
> > > On Mon, Mar 29, 2021 at 02:03:06PM +0200, Bas Hulsken wrote:
> > > > Dear list,
> > > > due to a disk intermittently failing in my 4 disk array, I'm
> > > > getting "transid verify failed" errors on my btrfs filesystem
> > > > (see attached dmesg | grep -i btrfs dump in btrfs_dmesg.txt).
> > >
> > > Scary! But in this case, it looks like they were automatically
> > > recovered already.
> > >
> > > > When I run a scrub, the bad disk (/dev/sdd) becomes
> > > > unresponsive, so I'm hesitant to try that again (happened 3
> > > > times now, and was the root cause of the transid verify failed
> > > > errors possibly, at least they did not show up earlier than the
> > > > failed scrub).
> > >
> > > That is quite common when disks fail. The extra IO load results
> > > in a firmware crash, either due to failure of the electronics
> > > disrupting the embedded CPU so it can't run any program
> > > correctly, or an error condition in the rest of the disk that the
> > > firmware doesn't handle properly. Any unflushed writes in the
> > > write cache at this time are lost. Lost metadata writes will
> > > result in parent transid verify failures lat
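Question 2 above (how to go degraded) is answered in outline elsewhere in the thread; as a reference, the offline route looks roughly like the sketch below. The mountpoint /mnt/pool, devid 4, and /dev/sde as the new disk are all hypothetical names for illustration, and with DRY_RUN=1 (the default) the commands are only printed.

```shell
#!/bin/sh
# Sketch of a replace in degraded mode.  Check 'btrfs filesystem show'
# for the real devid of the missing disk; the names below are
# assumptions, not taken from the thread.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print instead of execute unless DRY_RUN is explicitly disabled.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run umount /mnt/pool
# ...power off and unplug the failing drive here, then mount the
# remaining members without it:
run mount -o degraded /dev/sdb /mnt/pool
# The old device is now absent, so name it by devid instead of by path:
run btrfs replace start -B 4 /dev/sde /mnt/pool
run btrfs replace status /mnt/pool
```

There is no "mark this disk faulty" command in btrfs-progs; physically removing the device and mounting with `-o degraded` is the usual way to take it out of service.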
Re: help needed with raid 6 filesystem with errors
On Tue, Mar 30, 2021 at 03:01:57PM +0200, Bas Hulsken wrote:
> I followed your advice, Zygo and Chris, and did both:
> 1) smartctl -l scterc,70,70 /dev/sdX for all 4 drives in the array
> (the drives do support this)
> 2) echo 180 > /sys/block/sdX/device/timeout for all 4 drives
>
> with that I attempted another scrub (on the single failing device,
> not on the filesystem), but with bad results again. The drive is
> basically still not responsive after the first error, this is the
> error according to smartctl:
>
> Error 4 occurred at disk power-on lifetime: 7124 hours (296 days + 20 hours)
> When the command that caused the error occurred, the device was
> active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 41 98 68 24 00 40  Error: UNC at LBA = 0x2468 = 9320
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
> -- -- -- -- -- -- -- --  ---------------  --------------------
> 60 80 00 80 25 00 40 00  2d+20:57:19.109  READ FPDMA QUEUED
> 60 80 f8 80 23 00 40 00  2d+20:57:16.423  READ FPDMA QUEUED
> 60 80 f0 80 21 00 40 00  2d+20:57:16.422  READ FPDMA QUEUED
> 60 80 e8 80 1f 00 40 00  2d+20:57:16.421  READ FPDMA QUEUED
> 60 80 e0 80 1d 00 40 00  2d+20:57:16.420  READ FPDMA QUEUED
>
> The other errors are all the same (Error: UNC at LBA = 0x2468 =
> 9320), and at exactly the same LBA; once scrub gets to this LBA, the
> drive basically no longer responds, and querying it with smartctl
> will return garbage characters, or nothing at all. I've attached a
> dmesg with also the io errors this time.
>
> So: I conclude scrub is not going to fix this problem, and I should
> really replace the disk.

Agreed. It is now properly configured, there are UNC sectors logged in
SMART, and UNC recovery is still not working. The drive is broken and
will likely stay that way.

> @Zygo: following your advice, and using btrfs replace -r with the
> failing drive online, I take it it reads only sectors from the
> failing disk if at least 2 other disks are failing at that spot
> (given it's raid6), correct?

That's the general idea.

> If so I would be comfortable giving that a shot. I do expect that
> while doing a replace and reading the same LBA from the disk, it
> will just crash again and ruin my replace.

There's still another redundant disk in the array, so there's no need
to put too much effort into recovering one failing drive. The disk
seems really broken, so take it offline and do a replace in degraded
mode.

> thanks!
>
> On Mon, 2021-03-29 at 17:05 -0400, Zygo Blaxell wrote:
> > On Mon, Mar 29, 2021 at 02:03:06PM +0200, Bas Hulsken wrote:
> > > Dear list,
> > > due to a disk intermittently failing in my 4 disk array, I'm
> > > getting "transid verify failed" errors on my btrfs filesystem
> > > (see attached dmesg | grep -i btrfs dump in btrfs_dmesg.txt).
> >
> > Scary! But in this case, it looks like they were automatically
> > recovered already.
> >
> > > When I run a scrub, the bad disk (/dev/sdd) becomes unresponsive,
> > > so I'm hesitant to try that again (happened 3 times now, and was
> > > the root cause of the transid verify failed errors possibly, at
> > > least they did not show up earlier than the failed scrub).
> >
> > That is quite common when disks fail. The extra IO load results in
> > a firmware crash, either due to failure of the electronics
> > disrupting the embedded CPU so it can't run any program correctly,
> > or an error condition in the rest of the disk that the firmware
> > doesn't handle properly. Any unflushed writes in the write cache at
> > this time are lost. Lost metadata writes will result in parent
> > transid verify failures later on.
> >
> > Low end desktop drives have very large SCTERC timeouts but no
> > SCTERC controls, so they have very long IO error retry loops (2
> > minutes). That can look like an intermittent failure in the logs,
> > but in fact it's an ordinary remappable UNC sector. The kernel has
> > a default timeout of 30 seconds, so the kernel forces a drive reset
> > before the drive can report the bad block. The drive can often be
> > used normally by setting the kernel timeout with
> > 'echo 180 > /sys/block/sd.../device/timeout'.
> >
> > Whether you _want_ to use a disk with firmware that waits two full
> > minutes before reporting an IO error is a separate question, but
> > this is a feature of several popular cheap drive models, and you
> > _can_ use these disks if needed.
> >
> > > A new disk is on its way to use btrfs replace, but I'm not sure
> > > whether that will be a wise choice for a filesystem with errors.
> > > There was never a crash/power failure, so the filesystem was
> > > unmounted at every reboot, but as said on 3 occasions (after a
> > > scrub), that unmount was with one of the four drives
> > > unresponsive.
> >
> > Not
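The two settings quoted at the top of this message (SCT ERC at 7 seconds, kernel command timeout at 180 seconds) can be applied to every drive with a loop along these lines. The drive list is an assumption, and DRY_RUN=1 (the default) only prints the commands.

```shell
#!/bin/sh
# Sketch: make each drive give up on a bad sector after 7 s while
# giving the kernel a much longer leash (180 s), so the drive always
# reports the error before the kernel resets the link.  DRIVES is a
# hypothetical list -- substitute your own array members.
DRIVES="${DRIVES:-sda sdb sdc sdd}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print instead of execute unless DRY_RUN is explicitly disabled.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

for d in $DRIVES; do
    # 70 = 7.0 seconds (tenths of a second), for both reads and writes.
    run smartctl -l scterc,70,70 "/dev/$d"
    # The sysfs write needs its own shell because of the redirection.
    run sh -c "echo 180 > /sys/block/$d/device/timeout"
done
```

Note that neither setting survives a reboot; they would normally be reapplied from a boot script or udev rule.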
Re: help needed with raid 6 filesystem with errors
On Mon, Mar 29, 2021 at 02:03:06PM +0200, Bas Hulsken wrote:
> Dear list,
> due to a disk intermittently failing in my 4 disk array, I'm getting
> "transid verify failed" errors on my btrfs filesystem (see attached
> dmesg | grep -i btrfs dump in btrfs_dmesg.txt).

Scary! But in this case, it looks like they were automatically
recovered already.

> When I run a scrub, the bad disk (/dev/sdd) becomes unresponsive, so
> I'm hesitant to try that again (happened 3 times now, and was the
> root cause of the transid verify failed errors possibly, at least
> they did not show up earlier than the failed scrub).

That is quite common when disks fail. The extra IO load results in a
firmware crash, either due to failure of the electronics disrupting
the embedded CPU so it can't run any program correctly, or an error
condition in the rest of the disk that the firmware doesn't handle
properly. Any unflushed writes in the write cache at this time are
lost. Lost metadata writes will result in parent transid verify
failures later on.

Low end desktop drives have very large SCTERC timeouts but no SCTERC
controls, so they have very long IO error retry loops (2 minutes).
That can look like an intermittent failure in the logs, but in fact
it's an ordinary remappable UNC sector. The kernel has a default
timeout of 30 seconds, so the kernel forces a drive reset before the
drive can report the bad block. The drive can often be used normally
by setting the kernel timeout with
'echo 180 > /sys/block/sd.../device/timeout'.

Whether you _want_ to use a disk with firmware that waits two full
minutes before reporting an IO error is a separate question, but this
is a feature of several popular cheap drive models, and you _can_ use
these disks if needed.

> A new disk is on its way to use btrfs replace, but I'm not sure
> whether that will be a wise choice for a filesystem with errors.
> There was never a crash/power failure, so the filesystem was
> unmounted at every reboot, but as said on 3 occasions (after a
> scrub), that unmount was with one of the four drives unresponsive.

Note that in the logs:

1. each distinct bytenr occurs exactly once (more precisely, not more
   than N - 1 times for a RAID profile with N copies), and
2. it is immediately followed by 4x "read error corrected",

e.g.

> [38079.437411] BTRFS error (device sdg): parent transid verify
> failed on 12884760723456 wanted 360620 found 359101
> [38079.457879] BTRFS info (device sdg): read error corrected: ino 0
> off 12884760723456 (dev /dev/sdd sector 12559526656)
> [38079.459418] BTRFS info (device sdg): read error corrected: ino 0
> off 12884760727552 (dev /dev/sdd sector 12559526664)
> [38079.460390] BTRFS info (device sdg): read error corrected: ino 0
> off 12884760731648 (dev /dev/sdd sector 12559526672)
> [38079.460585] BTRFS info (device sdg): read error corrected: ino 0
> off 12884760735744 (dev /dev/sdd sector 12559526680)

Metadata pages are 16K by default, and filesystem pages are 4K on
amd64/x86/arm/aarch64, so these 4 "read error corrected" lines are
btrfs replacing one broken metadata page on sdd using data from other
mirrors. If you are using raid1* metadata, this is part of normal
recovery from a disk failure. The failing disk will not be keeping up
with metadata updates (because it's failing, you can't assume it will
be doing anything correctly). Writes will be lost on sdd that are not
lost on the other mirror drives. btrfs will continue without error as
long as at least one mirror drive is OK. btrfs will notice during
later reads that some metadata pages are not up to date on the failing
disk, and correct the failing disk using redundant copies of the
metadata from the other mirrors.

Similar correction is applied to data when the csums do not match.
nodatacow files (which do not have csums) will be corrupted. That is
part of the cost of nodatacow--no recovery from data corruption
errors.

> Funnily enough, after a reboot every time the filesystem gets
> mounted without issues (the unresponsive drive is back online), and
> btrfs check --readonly claims the filesystem has no errors (see
> attached btrfs_sdd_check.txt).

There's no error on the disk by the time you run btrfs check or reboot
and mount again. "read error corrected" means correct data was written
back to the failing disk. With UNC sector remapping in disk firmware,
btrfs could even repair the UNC sector so the disk is no longer
failing. The only hint would be "Reallocated sector count" in SMART
stats--and only if you are lucky enough to have that count reported
accurately by your drive firmware.

Possibly there are still errors on disk, but btrfs check didn't happen
to read that particular block from that particular mirror. btrfs check
won't verify th
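A sketch of the two places such a silently-repaired sector would still leave a trace, per the paragraph above; /dev/sdd and /mnt/pool are hypothetical names, and DRY_RUN=1 (the default) only prints the commands.

```shell
#!/bin/sh
# Sketch: where a sector repaired behind your back still shows up.
# Device and mountpoint names are assumptions for illustration.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print instead of execute unless DRY_RUN is explicitly disabled.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Firmware side: honest firmware bumps Reallocated_Sector_Ct in the
# SMART attribute table when it remaps a UNC sector.
run smartctl -A /dev/sdd
# btrfs side: per-device read/write/csum error counters persist across
# reboots until explicitly cleared with 'btrfs device stats -z'.
run btrfs device stats /mnt/pool
```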
Re: help needed with raid 6 filesystem with errors
On Mon, Mar 29, 2021 at 4:22 AM Bas Hulsken wrote:
>
> Dear list,
>
> due to a disk intermittently failing in my 4 disk array, I'm getting
> "transid verify failed" errors on my btrfs filesystem (see attached
> dmesg | grep -i btrfs dump in btrfs_dmesg.txt). When I run a scrub,
> the bad disk (/dev/sdd) becomes unresponsive, so I'm hesitant to try
> that again (happened 3 times now, and was the root cause of the
> transid verify failed errors possibly, at least they did not show up
> earlier than the failed scrub).

Is the dmesg filtered? An unfiltered dmesg might help us understand
what might be going on with the drive being unresponsive: whether it's
spitting out any kind of errors itself, or whether there are kernel
link reset messages.

Check if the drive supports SCT ERC:

smartctl -l scterc /dev/sdX

If it does but it isn't enabled, enable it. This is true for all the
drives:

smartctl -l scterc,70,70

That will result in the drive giving up on errors much sooner rather
than doing the very slow "deep recovery" on reads. If this goes beyond
30 seconds, the kernel's command timer will think the device is
unresponsive and issue a link reset, which is ... bad for this use
case. You really want the drive to error out quickly and allow Btrfs
to do the fixups.

If you can't configure the SCT ERC on the drives, you'll need to
increase the kernel command timeout, which is a per device value in
/sys/block/sdX/device/timeout - the default is 30, and chances are 180
is enough (which sounds terribly high, and it is, but reportedly some
consumer drives can have such high timeouts). Basically you want the
drive timeout to be shorter than the kernel's.

> A new disk is on its way to use btrfs replace, but I'm not sure
> whether that will be a wise choice for a filesystem with errors.
> There was never a crash/power failure, so the filesystem was
> unmounted at every reboot, but as said on 3 occasions (after a
> scrub), that unmount was with one of the four drives unresponsive.

The least amount of risk is to not change anything. When you do the
replace, make sure you use recent btrfs-progs, and use 'btrfs replace'
instead of 'btrfs device add/remove':

https://lore.kernel.org/linux-btrfs/20200627032414.gx10...@hungrycats.org/

If metadata is raid5 too, or if it's not already using space_cache v2,
I'd probably leave it alone until after the flaky device is replaced.

> Funnily enough, after a reboot every time the filesystem gets
> mounted without issues (the unresponsive drive is back online), and
> btrfs check --readonly claims the filesystem has no errors (see
> attached btrfs_sdd_check.txt).

I'd take advantage of its cooperative moment by making sure backups
are fresh in case things get worse.

> Not sure what to do next, so seeking your advice! The important data
> on the drive is backed up, and I'll be running a verify to see if
> there are any corruptions overnight. Would still like to try to save
> the filesystem if possible though.

--
Chris Murphy
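For reference, the two approaches Chris contrasts ('btrfs replace' vs 'device add/remove') look roughly like this. The device names and mountpoint are assumptions, only the first form is the one recommended in this thread, and DRY_RUN=1 (the default) only prints the commands.

```shell
#!/bin/sh
# Sketch: the recommended replace, and the discouraged add/remove
# alternative.  /dev/sdd, /dev/sde and /mnt/pool are hypothetical.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print instead of execute unless DRY_RUN is explicitly disabled.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Recommended: one pass that rebuilds directly onto the new disk; with
# -r, the failing source disk is read only when the data cannot be
# reconstructed from the other devices.
run btrfs replace start -B -r /dev/sdd /dev/sde /mnt/pool

# Discouraged alternative: device add followed by device remove, which
# moves the data twice as a full rebalance instead of a targeted copy.
run btrfs device add /dev/sde /mnt/pool
run btrfs device remove /dev/sdd /mnt/pool
```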