Re: IO errors when building RAID1.... ?
On Mon, Sep 3, 2018 at 4:23 AM, Adam Borowski wrote:
> On Sun, Sep 02, 2018 at 09:15:25PM -0600, Chris Murphy wrote:
>> For > 10 years drive firmware handles bad sector remapping internally.
>> It remaps the sector logical address to a reserve physical sector.
>>
>> NTFS and ext[234] have a means of accepting a list of bad sectors, and
>> will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
>> and I think even FAT, lack this capability.
>
> FAT entry FF7 (FAT12)/FFF7 (FAT16)/...

Oh yeah even Linux mkdosfs does have -c option to check for bad
sectors and presumably will remove them from use. It doesn't accept a
separate list though, like badblocks + mke2fs.

--
Chris Murphy
Re: IO errors when building RAID1.... ?
On Sun, Sep 02, 2018 at 09:15:25PM -0600, Chris Murphy wrote:
> For > 10 years drive firmware handles bad sector remapping internally.
> It remaps the sector logical address to a reserve physical sector.
>
> NTFS and ext[234] have a means of accepting a list of bad sectors, and
> will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
> and I think even FAT, lack this capability.

FAT entry FF7 (FAT12)/FFF7 (FAT16)/...

--
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]
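To make the FAT point concrete: the file allocation table itself acts as the bad-sector list, because a cluster whose table entry holds the reserved "bad cluster" value is never handed out. The following is only a minimal Python sketch of that check. The FAT12 and FAT16 markers are the ones quoted above; the FAT32 marker (0x0FFFFFF7, with the top four bits of each entry reserved) is added here from the FAT specification, and the names `BAD_CLUSTER_MARK`, `ENTRY_MASK`, `is_bad_cluster` and `bad_clusters` are made up for illustration.

```python
# Bad-cluster marker per FAT variant. FAT12/FAT16 values are the ones quoted
# in the message above; the FAT32 value is an addition for illustration.
BAD_CLUSTER_MARK = {12: 0xFF7, 16: 0xFFF7, 32: 0x0FFFFFF7}

# Number of significant bits in one table entry (FAT32 uses only 28 of 32).
ENTRY_MASK = {12: 0xFFF, 16: 0xFFFF, 32: 0x0FFFFFFF}


def is_bad_cluster(entry: int, fat_bits: int) -> bool:
    """Return True if a raw FAT entry carries the bad-cluster marker."""
    return (entry & ENTRY_MASK[fat_bits]) == BAD_CLUSTER_MARK[fat_bits]


def bad_clusters(entries, fat_bits):
    """List the cluster numbers marked bad; entries[i] is the entry of cluster i.

    Clusters 0 and 1 are reserved, so data clusters start at index 2.
    """
    return [
        i for i, entry in enumerate(entries)
        if i >= 2 and is_bad_cluster(entry, fat_bits)
    ]
```

For example, `bad_clusters([0xFFF8, 0xFFFF, 0x0003, 0xFFF7, 0xFFFF], 16)` reports cluster 3 as the only one the file system would skip.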
Re: IO errors when building RAID1.... ?
On 09/03/2018 05:15 AM, Chris Murphy wrote:
> On Sat, Sep 1, 2018 at 1:03 AM, Pierre Couderc wrote:
>> On 08/31/2018 08:52 PM, Chris Murphy wrote:
>>> Bad sector which is failing write. This is fatal, there isn't anything
>>> the block layer or Btrfs (or ext4 or XFS) can do about it. Well,
>>> ext234 do have an option to scan for bad sectors and create a bad
>>> sector map which then can be used at mkfs time, and ext234 will avoid
>>> using those sectors. And also the md driver has a bad sector option
>>> for the same, and does remapping. But XFS and Btrfs don't do that.
>>>
>>> If the drive is under warranty, get it swapped out, this is definitely
>>> a warranty covered problem.
>>
>> Thank you very much.
>>
>> Once upon a time... (I am old), there were lists of bad sectors, and the
>> software did avoid writing in them. It seems to have disappeared. For which
>> reason? Maybe because these errors occur so rarely, that it is not worth
>> the trouble?
>
> For > 10 years drive firmware handles bad sector remapping internally.
> It remaps the sector logical address to a reserve physical sector.
>
> NTFS and ext[234] have a means of accepting a list of bad sectors, and
> will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
> and I think even FAT, lack this capability.
>
> I'm not aware of any file system that once had bad sector tracking,
> that has since dropped the capability.

Thank you, you are very clear.
Re: IO errors when building RAID1.... ?
On Sat, Sep 1, 2018 at 1:03 AM, Pierre Couderc wrote:
>
> On 08/31/2018 08:52 PM, Chris Murphy wrote:
>>
>> Bad sector which is failing write. This is fatal, there isn't anything
>> the block layer or Btrfs (or ext4 or XFS) can do about it. Well,
>> ext234 do have an option to scan for bad sectors and create a bad
>> sector map which then can be used at mkfs time, and ext234 will avoid
>> using those sectors. And also the md driver has a bad sector option
>> for the same, and does remapping. But XFS and Btrfs don't do that.
>>
>> If the drive is under warranty, get it swapped out, this is definitely
>> a warranty covered problem.
>>
> Thank you very much.
>
> Once upon a time... (I am old), there were lists of bad sectors, and the
> software did avoid writing in them. It seems to have disappeared. For which
> reason? Maybe because these errors occur so rarely, that it is not worth
> the trouble?

For > 10 years drive firmware handles bad sector remapping internally.
It remaps the sector logical address to a reserve physical sector.

NTFS and ext[234] have a means of accepting a list of bad sectors, and
will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
and I think even FAT, lack this capability.

I'm not aware of any file system that once had bad sector tracking,
that has since dropped the capability.

--
Chris Murphy
Re: IO errors when building RAID1.... ?
On 09/01/2018 03:35 AM, Duncan wrote:
> Chris Murphy posted on Fri, 31 Aug 2018 13:02:16 -0600 as excerpted:
>
>> If you want you can post the output from 'sudo smartctl -x /dev/sda'
>> which will contain more information... but this is in some sense
>> superfluous. The problem is very clearly a bad drive, the drive
>> explicitly reported to libata a write error, and included the sector LBA
>> affected, and only the drive firmware would know that. It's not likely a
>> cable problem or something like that. And that the write error is
>> reported at all means it's persistent, not transient.
>
> Two points:
>
> 1) Does this happen to be an archive/SMR (shingled magnetic recording)
> device? If so that might be the problem as such devices really aren't
> suited to normal usage (they really are designed for archiving), and
> btrfs' COW patterns can exacerbate the issue. It's quite possible that
> the original install didn't load up the IO as heavily as the
> balance-convert does, so the problem appears with convert but not for
> install.
>
> 2) Assuming it's /not/ an SMR issue, and smartctl doesn't say it's dying,
> I'd suggest running badblocks -w (make sure the device doesn't have
> anything valuable on it!) on the device -- note that this will take
> a while, probably a couple of days, perhaps longer, as it writes four
> different patterns to the entire device one at a time, reading everything
> back to verify the pattern was written correctly, so it's actually going
> over the entire device 8 times, alternating write and read, but it should
> settle the issue of the reliability of the device.
>
> Or if you'd rather spend the money than the time and it's no longer under
> warranty, just replace it, or at least buy a new one to use while you
> run the tests on that one. I fully understand that tying up the thing
> running tests on it for days straight may not be viable.

Thank you. sudo smartctl -x /dev/sda finds 4 errors...

Thank you, I shall spend no more time on that and replace the HDD...

Anyway, I should expect a less cryptic message from btrfs...

PC
Re: IO errors when building RAID1.... ?
On 08/31/2018 08:52 PM, Chris Murphy wrote:
> Bad sector which is failing write. This is fatal, there isn't anything
> the block layer or Btrfs (or ext4 or XFS) can do about it. Well,
> ext234 do have an option to scan for bad sectors and create a bad
> sector map which then can be used at mkfs time, and ext234 will avoid
> using those sectors. And also the md driver has a bad sector option
> for the same, and does remapping. But XFS and Btrfs don't do that.
>
> If the drive is under warranty, get it swapped out, this is definitely
> a warranty covered problem.

Thank you very much.

Once upon a time... (I am old), there were lists of bad sectors, and the
software did avoid writing in them. It seems to have disappeared. For which
reason? Maybe because these errors occur so rarely, that it is not worth
the trouble?
Re: IO errors when building RAID1.... ?
Chris Murphy posted on Fri, 31 Aug 2018 13:02:16 -0600 as excerpted:

> If you want you can post the output from 'sudo smartctl -x /dev/sda'
> which will contain more information... but this is in some sense
> superfluous. The problem is very clearly a bad drive, the drive
> explicitly reported to libata a write error, and included the sector LBA
> affected, and only the drive firmware would know that. It's not likely a
> cable problem or something like that. And that the write error is
> reported at all means it's persistent, not transient.

Two points:

1) Does this happen to be an archive/SMR (shingled magnetic recording)
device? If so that might be the problem as such devices really aren't
suited to normal usage (they really are designed for archiving), and
btrfs' COW patterns can exacerbate the issue. It's quite possible that
the original install didn't load up the IO as heavily as the
balance-convert does, so the problem appears with convert but not for
install.

2) Assuming it's /not/ an SMR issue, and smartctl doesn't say it's dying,
I'd suggest running badblocks -w (make sure the device doesn't have
anything valuable on it!) on the device -- note that this will take
a while, probably a couple of days, perhaps longer, as it writes four
different patterns to the entire device one at a time, reading everything
back to verify the pattern was written correctly, so it's actually going
over the entire device 8 times, alternating write and read, but it should
settle the issue of the reliability of the device.

Or if you'd rather spend the money than the time and it's no longer under
warranty, just replace it, or at least buy a new one to use while you
run the tests on that one. I fully understand that tying up the thing
running tests on it for days straight may not be viable.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
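The "8 times over the device" figure in point 2 can be turned into a rough duration estimate. The sketch below is only back-of-the-envelope arithmetic in Python: the four patterns and the write-plus-read-back cycle come from the message above, the 1.82 TiB size is the one btrfs fi show reports in the original post, and the throughput values are pure assumptions (a real drive slows down towards the inner tracks, and a failing one may be far slower). The function names are hypothetical.

```python
PATTERNS = 4            # badblocks -w writes four test patterns
PASSES_PER_PATTERN = 2  # one full write pass plus one full read-back pass


def full_passes() -> int:
    """Number of complete passes over the device."""
    return PATTERNS * PASSES_PER_PATTERN


def estimate_hours(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Estimated run time in hours, assuming a constant average throughput."""
    seconds = full_passes() * size_bytes / throughput_bytes_per_s
    return seconds / 3600


if __name__ == "__main__":
    size = 1.82 * 2**40  # 1.82 TiB, as shown by btrfs fi show in this thread
    for mb_per_s in (80, 150):  # assumed average throughputs, not measurements
        hours = estimate_hours(size, mb_per_s * 1e6)
        print(f"{mb_per_s} MB/s: roughly {hours:.0f} hours")
```

Both assumed throughputs land in the range of one to a few days, which is consistent with the "couple of days" warning above.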
Re: IO errors when building RAID1.... ?
If you want you can post the output from 'sudo smartctl -x /dev/sda'
which will contain more information... but this is in some sense
superfluous. The problem is very clearly a bad drive, the drive
explicitly reported to libata a write error, and included the sector LBA
affected, and only the drive firmware would know that. It's not likely
a cable problem or something like that. And that the write error is
reported at all means it's persistent, not transient.

Chris Murphy
Re: IO errors when building RAID1.... ?
On Fri, Aug 31, 2018 at 10:35 AM, Pierre Couderc wrote:
>
> Aug 31 17:34:55 server su[559]: Successful su for root by nous
> Aug 31 17:34:55 server su[559]: + /dev/pts/1 nous:root
> Aug 31 17:34:55 server su[559]: pam_unix(su:session): session opened for user root by nous(uid=1000)
> Aug 31 17:34:55 server su[559]: pam_systemd(su:session): Cannot create session: Already running in a session
> Aug 31 17:35:03 server kernel: BTRFS info (device sda1): disk added /dev/sdb1
> Aug 31 17:35:40 server kernel: BTRFS info (device sda1): relocating block group 1103101952 flags 1
> Aug 31 17:36:12 server sshd[572]: Accepted password for nous from 2a01:e34:eeaf:c5f0:e54:15ff:feb1:b1c9 port 49308 ssh2
> Aug 31 17:36:12 server sshd[572]: pam_unix(sshd:session): session opened for user nous by (uid=0)
> Aug 31 17:36:12 server systemd-logind[415]: New session 4 of user nous.
> Aug 31 17:36:12 server systemd[1]: Started Session 4 of user nous.
> Aug 31 17:36:16 server kernel: ata1: lost interrupt (Status 0x50)
> Aug 31 17:36:16 server kernel: ata1.00: exception Emask 0x50 SAct 0x0 SErr 0x40d0802 action 0xe frozen
> Aug 31 17:36:16 server kernel: ata1.00: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }
> Aug 31 17:36:16 server kernel: ata1.00: failed command: READ DMA
> Aug 31 17:36:16 server kernel: ata1.00: cmd c8/00:60:00:cd:02/00:00:00:00:00/e0 tag 0 dma 49152 in
>          res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x54 (ATA bus error)
> Aug 31 17:36:16 server kernel: ata1.00: status: { DRDY }
> Aug 31 17:36:16 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:17 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:18 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:18 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> Aug 31 17:36:18 server kernel: ata1.01: SATA link down (SStatus 4 SControl 0)
> Aug 31 17:36:18 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:18 server kernel: ata1.00: revalidation failed (errno=-2)
> Aug 31 17:36:20 server su[590]: Successful su for root by nous
> Aug 31 17:36:20 server su[590]: + /dev/pts/2 nous:root
> Aug 31 17:36:20 server su[590]: pam_unix(su:session): session opened for user root by nous(uid=1000)
> Aug 31 17:36:20 server su[590]: pam_systemd(su:session): Cannot create session: Already running in a session
> Aug 31 17:36:23 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:23 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:24 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:25 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> Aug 31 17:36:25 server kernel: ata1.01: SATA link down (SStatus 4 SControl 0)
> Aug 31 17:36:25 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:25 server kernel: ata1.00: revalidation failed (errno=-2)
> Aug 31 17:36:30 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:30 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:31 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:31 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> Aug 31 17:36:31 server kernel: ata1.01: SATA link down (SStatus 4 SControl 0)
> Aug 31 17:36:31 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:31 server kernel: ata1.00: revalidation failed (errno=-2)
> Aug 31 17:36:31 server kernel: ata1.00: disabled
> Aug 31 17:36:36 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:37 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:38 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:38 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> Aug 31 17:36:38 server kernel: ata1.01: SATA link down (SStatus 4 SControl 0)
> Aug 31 17:36:38 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Illegal Request [current]
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: Unaligned write command
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 02 cd 00 00 00 60 00
> Aug 31 17:36:38 server kernel: blk_update_request: I/O error, dev sda, sector 183552
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] killing request
> Aug 31
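The "CDB: Read(10) 28 00 00 02 cd 00 00 00 60 00" entry in the quoted log names the exact request that failed, and it can be decoded by hand. As a minimal Python sketch, assuming the standard 10-byte READ(10) layout (opcode 0x28, a 4-byte big-endian LBA at bytes 2 to 5, a 2-byte big-endian block count at bytes 7 to 8) and 512-byte logical sectors; the function name `decode_read10` is made up for illustration:

```python
import struct


def decode_read10(cdb_hex: str):
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Returns (lba, block_count).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command descriptor block")
    # Fields: opcode, flags, LBA (4 bytes), group number,
    # transfer length (2 bytes), control. All big-endian, no padding.
    _opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    return lba, blocks


if __name__ == "__main__":
    lba, blocks = decode_read10("28 00 00 02 cd 00 00 00 60 00")
    # Assumes 512-byte logical sectors.
    print(f"LBA {lba}, {blocks} sectors, {blocks * 512} bytes")
```

The decoded values are LBA 183552 and 96 sectors (49152 bytes), which line up with the "I/O error, dev sda, sector 183552" entry and the "dma 49152 in" of the failed READ DMA command in the same log.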
IO errors when building RAID1.... ?
I get IO errors when trying to build a RAID1 on the main fs. After a normal Debian stretch install:

root@server:/home/nous# btrfs device add /dev/sdb1 /
root@server:/home/nous# btrfs fi show
Label: none  uuid: ef0b9dad-c0eb-4a3b-9b41-e5e249363abc
        Total devices 2 FS bytes used 824.60MiB
        devid    1 size 1.82TiB used 3.02GiB path /dev/sda1
        devid    2 size 1.82TiB used 0.00B path /dev/sdb1

root@server:/home/nous# btrfs balance start -v -mconvert=raid1 -dconvert=raid1 /
Dumping filters: flags 0x7, state 0x0, force is off
  DATA (flags 0x100): converting, target=16, soft is off
  METADATA (flags 0x100): converting, target=16, soft is off
  SYSTEM (flags 0x100): converting, target=16, soft is off
Killed
root@server:/home/nous# btrfs fi show
Label: none  uuid: ef0b9dad-c0eb-4a3b-9b41-e5e249363abc
        Total devices 2 FS bytes used 1.29GiB
        devid    2 size 1.82TiB used 1.00GiB path /dev/sdb1
        *** Some devices missing

Some IO errors on /dev/sda are found in journalctl (see them below).

I cannot believe that /dev/sda has no hard disk errors when installing without problems, but has many when I run "btrfs device add /dev/sdb1 /".

I can reproduce the problem: reinstall (3 times...) and try "btrfs device add /dev/sdb1 /" with the same results...

Aug 31 17:34:55 server su[559]: Successful su for root by nous
Aug 31 17:34:55 server su[559]: + /dev/pts/1 nous:root
Aug 31 17:34:55 server su[559]: pam_unix(su:session): session opened for user root by nous(uid=1000)
Aug 31 17:34:55 server su[559]: pam_systemd(su:session): Cannot create session: Already running in a session
Aug 31 17:35:03 server kernel: BTRFS info (device sda1): disk added /dev/sdb1
Aug 31 17:35:40 server kernel: BTRFS info (device sda1): relocating block group 1103101952 flags 1
Aug 31 17:36:12 server sshd[572]: Accepted password for nous from 2a01:e34:eeaf:c5f0:e54:15ff:feb1:b1c9 port 49308 ssh2
Aug 31 17:36:12 server sshd[572]: pam_unix(sshd:session): session opened for user nous by (uid=0)
Aug 31 17:36:12 server systemd-logind[415]: New session 4 of user nous.
Aug 31 17:36:12 server systemd[1]: Started Session 4 of user nous.
Aug 31 17:36:16 server kernel: ata1: lost interrupt (Status 0x50)
Aug 31 17:36:16 server kernel: ata1.00: exception Emask 0x50 SAct 0x0 SErr 0x40d0802 action 0xe frozen
Aug 31 17:36:16 server kernel: ata1.00: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }
Aug 31 17:36:16 server kernel: ata1.00: failed command: READ DMA
Aug 31 17:36:16 server kernel: ata1.00: cmd c8/00:60:00:cd:02/00:00:00:00:00/e0 tag 0 dma 49152 in
         res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x54 (ATA bus error)
Aug 31 17:36:16 server kernel: ata1.00: status: { DRDY }
Aug 31 17:36:16 server kernel: ata1.00: hard resetting link
Aug 31 17:36:17 server kernel: ata1.01: hard resetting link
Aug 31 17:36:18 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:18 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 31 17:36:18 server kernel: ata1.01: SATA link down (SStatus 4 SControl 0)
Aug 31 17:36:18 server kernel: ata1.00: NODEV after polling detection
Aug 31 17:36:18 server kernel: ata1.00: revalidation failed (errno=-2)
Aug 31 17:36:20 server su[590]: Successful su for root by nous
Aug 31 17:36:20 server su[590]: + /dev/pts/2 nous:root
Aug 31 17:36:20 server su[590]: pam_unix(su:session): session opened for user root by nous(uid=1000)
Aug 31 17:36:20 server su[590]: pam_systemd(su:session): Cannot create session: Already running in a session
Aug 31 17:36:23 server kernel: ata1.00: hard resetting link
Aug 31 17:36:23 server kernel: ata1.01: hard resetting link
Aug 31 17:36:24 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:25 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 31 17:36:25 server kernel: ata1.01: SATA link down (SStatus 4 SControl 0)
Aug 31 17:36:25 server kernel: ata1.00: NODEV after polling detection
Aug 31 17:36:25 server kernel: ata1.00: revalidation failed (errno=-2)
Aug 31 17:36:30 server kernel: ata1.00: hard resetting link
Aug 31 17:36:30 server kernel: ata1.01: hard resetting link
Aug 31 17:36:31 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:31 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 31 17:36:31 server kernel: ata1.01: SATA link down (SStatus 4 SControl 0)
Aug 31 17:36:31 server kernel: ata1.00: NODEV after polling detection
Aug 31 17:36:31 server kernel: ata1.00: revalidation failed (errno=-2)
Aug 31 17:36:31 server kernel: ata1.00: disabled
Aug 31 17:36:36 server kernel: ata1.00: hard resetting link
Aug 31 17:36:37 server kernel: ata1.01: hard resetting link
Aug 31 17:36:38 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:38 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 31 17:36:38 server kernel: ata1.01: SATA