Re: IO errors when building RAID1.... ?

2018-09-03 Thread Chris Murphy
On Mon, Sep 3, 2018 at 4:23 AM, Adam Borowski  wrote:
> On Sun, Sep 02, 2018 at 09:15:25PM -0600, Chris Murphy wrote:
>> For > 10 years drive firmware handles bad sector remapping internally.
>> It remaps the sector logical address to a reserve physical sector.
>>
>> NTFS and ext[234] have a means of accepting a list of bad sectors, and
>> will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
>> and I think even FAT, lack this capability.
>
> 
> FAT entry FF7 (FAT12)/FFF7 (FAT16)/...
> 

Oh yeah, even Linux mkdosfs has a -c option to check for bad
sectors, and presumably it will remove them from use. It doesn't accept a
separate list though, like badblocks + mke2fs.

-- 
Chris Murphy


Re: IO errors when building RAID1.... ?

2018-09-03 Thread Adam Borowski
On Sun, Sep 02, 2018 at 09:15:25PM -0600, Chris Murphy wrote:
> For > 10 years drive firmware handles bad sector remapping internally.
> It remaps the sector logical address to a reserve physical sector.
> 
> NTFS and ext[234] have a means of accepting a list of bad sectors, and
> will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
> and I think even FAT, lack this capability.


FAT entry FF7 (FAT12)/FFF7 (FAT16)/...
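
For readers unfamiliar with the markers above, here is a minimal Python sketch of how a FAT bad-cluster entry can be recognised. The FAT12 and FAT16 values are the ones given in this message; the FAT32 value (0x0FFFFFF7, with the upper four bits reserved) comes from the published FAT specification, not from this thread. The function names are hypothetical and only illustrate the idea.

```python
# Bad-cluster marker per FAT variant. FAT12/FAT16 values are from the message
# above; the FAT32 value is taken from the FAT specification (assumption: not
# stated anywhere in this thread).
BAD_CLUSTER_MARKER = {12: 0xFF7, 16: 0xFFF7, 32: 0x0FFFFFF7}


def is_bad_cluster(entry: int, fat_bits: int) -> bool:
    """Return True if a FAT entry marks its cluster as bad."""
    marker = BAD_CLUSTER_MARKER[fat_bits]
    if fat_bits == 32:
        entry &= 0x0FFFFFFF  # the upper 4 bits of a FAT32 entry are reserved
    return entry == marker


def bad_clusters(fat_entries, fat_bits):
    """List cluster numbers whose FAT entry carries the bad-cluster marker.

    Entries 0 and 1 are reserved, so data clusters start at index 2.
    """
    return [n for n, e in enumerate(fat_entries)
            if n >= 2 and is_bad_cluster(e, fat_bits)]
```

A file system driver that sees this marker simply never allocates that cluster, which is the "list of bad sectors" mechanism being discussed.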



-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: IO errors when building RAID1.... ?

2018-09-03 Thread Pierre Couderc




On 09/03/2018 05:15 AM, Chris Murphy wrote:
> On Sat, Sep 1, 2018 at 1:03 AM, Pierre Couderc wrote:
>>
>> On 08/31/2018 08:52 PM, Chris Murphy wrote:
>>>
>>> Bad sector which is failing write. This is fatal, there isn't anything
>>> the block layer or Btrfs (or ext4 or XFS) can do about it. Well,
>>> ext234 do have an option to scan for bad sectors and create a bad
>>> sector map which then can be used at mkfs time, and ext234 will avoid
>>> using those sectors. And also the md driver has a bad sector option
>>> for the same, and does remapping. But XFS and Btrfs don't do that.
>>>
>>> If the drive is under warranty, get it swapped out, this is definitely
>>> a warranty covered problem.
>>>
>> Thank you very much.
>>
>> Once upon a time...(I am old), there were lists of bad sectors, and the
>> software did avoid writing in them. It seems to have disappeared. For which
>> reason ? Maybe because these errors occur so rarely, that it is not worth
>> the trouble ?
>
> For > 10 years drive firmware handles bad sector remapping internally.
> It remaps the sector logical address to a reserve physical sector.
>
> NTFS and ext[234] have a means of accepting a list of bad sectors, and
> will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
> and I think even FAT, lack this capability. I'm not aware of any file
> system that once had bad sector tracking, that has since dropped the
> capability.

Thank you, you are very clear.


Re: IO errors when building RAID1.... ?

2018-09-02 Thread Chris Murphy
On Sat, Sep 1, 2018 at 1:03 AM, Pierre Couderc  wrote:
>
>
> On 08/31/2018 08:52 PM, Chris Murphy wrote:
>>
>>
>> Bad sector which is failing write. This is fatal, there isn't anything
>> the block layer or Btrfs (or ext4 or XFS) can do about it. Well,
>> ext234 do have an option to scan for bad sectors and create a bad
>> sector map which then can be used at mkfs time, and ext234 will avoid
>> using those sectors. And also the md driver has a bad sector option
>> for the same, and does remapping. But XFS and Btrfs don't do that.
>>
>> If the drive is under warranty, get it swapped out, this is definitely
>> a warranty covered problem.
>>
>>
>>
>>
> Thank you very much.
>
> Once upon a time...(I am old), there were lists of bad sectors, and the
> software did avoid writing in them. It seems to have disappeared. For which
> reason ? Maybe because these errors occur so rarely, that it is not worth
> the trouble ?

For > 10 years drive firmware handles bad sector remapping internally.
It remaps the sector logical address to a reserve physical sector.

NTFS and ext[234] have a means of accepting a list of bad sectors, and
will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
and I think even FAT, lack this capability. I'm not aware of any file
system that once had bad sector tracking, that has since dropped the
capability.

-- 
Chris Murphy


Re: IO errors when building RAID1.... ?

2018-09-01 Thread Pierre Couderc




On 09/01/2018 03:35 AM, Duncan wrote:
> Chris Murphy posted on Fri, 31 Aug 2018 13:02:16 -0600 as excerpted:
>
>> If you want you can post the output from 'sudo smartctl -x /dev/sda'
>> which will contain more information... but this is in some sense
>> superfluous. The problem is very clearly a bad drive, the drive
>> explicitly reported to libata a write error, and included the sector LBA
>> affected, and only the drive firmware would know that. It's not likely a
>> cable problem or something like that. And that the write error is
>> reported at all means it's persistent, not transient.
>
> Two points:
>
> 1) Does this happen to be an archive/SMR (shingled magnetic recording)
> device?  If so that might be the problem as such devices really aren't
> suited to normal usage (they really are designed for archiving), and
> btrfs' COW patterns can exacerbate the issue.  It's quite possible that
> the original install didn't load up the IO as heavily as the
> balance-convert does, so the problem appears with convert but not for
> install.
>
> 2) Assuming it's /not/ an SMR issue, and smartctl doesn't say it's dying,
> I'd suggest running badblocks -w (make sure the device doesn't have
> anything valuable on it!) on the device -- note that this will take
> awhile, probably a couple days perhaps longer, as it writes four
> different patterns to the entire device one at a time, reading everything
> back to verify the pattern was written correctly, so it's actually going
> over the entire device 8 times, alternating write and read, but it should
> settle the issue of the reliability of the device.
>
> Or if you'd rather spend the money than the time and it's no longer under
> warranty, just replace it, or at least buy a new one to use while
> you run the tests on that one.  I fully understand that tying up the
> thing running tests on it for days straight may not be viable.

Thank you

sudo smartctl -x /dev/sda

finds 4 errors...

Thank you, I shall spend no more time on that and replace the HDD...

Anyway, I would have expected a less cryptic message from btrfs...

PC



Re: IO errors when building RAID1.... ?

2018-09-01 Thread Pierre Couderc




On 08/31/2018 08:52 PM, Chris Murphy wrote:
>
> Bad sector which is failing write. This is fatal, there isn't anything
> the block layer or Btrfs (or ext4 or XFS) can do about it. Well,
> ext234 do have an option to scan for bad sectors and create a bad
> sector map which then can be used at mkfs time, and ext234 will avoid
> using those sectors. And also the md driver has a bad sector option
> for the same, and does remapping. But XFS and Btrfs don't do that.
>
> If the drive is under warranty, get it swapped out, this is definitely
> a warranty covered problem.
>

Thank you very much.

Once upon a time...(I am old), there were lists of bad sectors, and the
software did avoid writing in them. It seems to have disappeared. For
which reason ? Maybe because these errors occur so rarely, that it is
not worth the trouble ?


Re: IO errors when building RAID1.... ?

2018-08-31 Thread Duncan
Chris Murphy posted on Fri, 31 Aug 2018 13:02:16 -0600 as excerpted:

> If you want you can post the output from 'sudo smartctl -x /dev/sda'
> which will contain more information... but this is in some sense
> superfluous. The problem is very clearly a bad drive, the drive
> explicitly reported to libata a write error, and included the sector LBA
> affected, and only the drive firmware would know that. It's not likely a
> cable problem or something like that. And that the write error is
> reported at all means it's persistent, not transient.

Two points:

1) Does this happen to be an archive/SMR (shingled magnetic recording) 
device?  If so that might be the problem as such devices really aren't 
suited to normal usage (they really are designed for archiving), and 
btrfs' COW patterns can exacerbate the issue.  It's quite possible that 
the original install didn't load up the IO as heavily as the balance-
convert does, so the problem appears with convert but not for install.

2) Assuming it's /not/ an SMR issue, and smartctl doesn't say it's dying, 
I'd suggest running badblocks -w (make sure the device doesn't have 
anything valuable on it!) on the device -- note that this will take 
awhile, probably a couple days perhaps longer, as it writes four 
different patterns to the entire device one at a time, reading everything 
back to verify the pattern was written correctly, so it's actually going 
over the entire device 8 times, alternating write and read, but it should 
settle the issue of the reliability of the device.
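
To put a number on "a couple days", here is a rough back-of-the-envelope sketch in Python. It follows the description above (four patterns, each written once and read back once, so eight full passes) and uses the 1.82TiB device size from the `btrfs fi show` output in this thread. The sustained throughput of 100 MB/s is purely an assumption for illustration; the real figure depends on the drive, and a failing drive may be far slower. The function name is hypothetical.

```python
def badblocks_w_hours(device_bytes: int, mb_per_s: float, patterns: int = 4) -> float:
    """Rough duration of `badblocks -w` in hours.

    Each pattern is written over the whole device and then read back,
    so the device is traversed 2 * patterns times.
    """
    passes = patterns * 2
    seconds = passes * device_bytes / (mb_per_s * 1_000_000)
    return seconds / 3600


TIB = 1024 ** 4
# 1.82 TiB is the device size shown in this thread; 100 MB/s is an assumed
# average sequential throughput, not a measured value.
hours = badblocks_w_hours(int(1.82 * TIB), mb_per_s=100.0)
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
```

Under those assumptions the estimate comes out between 44 and 45 hours, which is consistent with the "couple days" figure.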

Or if you'd rather spend the money than the time and it's no longer under
warranty, just replace it, or at least buy a new one to use while
you run the tests on that one.  I fully understand that tying up the
thing running tests on it for days straight may not be viable.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: IO errors when building RAID1.... ?

2018-08-31 Thread Chris Murphy
If you want you can post the output from 'sudo smartctl -x /dev/sda'
which will contain more information... but this is in some sense
superfluous. The problem is very clearly a bad drive, the drive
explicitly reported to libata a write error, and included the sector LBA
affected, and only the drive firmware would know that. It's not likely
a cable problem or something like that. And that the write error is
reported at all means it's persistent, not transient.
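
As an illustration of where the sector number in the log comes from, here is a small Python sketch that decodes the SCSI CDB printed in the kernel log posted in this thread (`CDB: Read(10) 28 00 00 02 cd 00 00 00 60 00`). The field layout is the standard READ(10) layout (opcode, flags, 4-byte big-endian LBA, group, 2-byte big-endian length, control). The function name is hypothetical, and the 512-byte logical sector size used in the last step is an assumption.

```python
def decode_read10(cdb_hex: str):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (lba, block_count).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2..5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return lba, count


lba, count = decode_read10("28 00 00 02 cd 00 00 00 60 00")
# Assumes 512-byte logical sectors for the byte count.
print(lba, count, count * 512)  # prints "183552 96 49152"
```

The decoded LBA matches the `I/O error, dev sda, sector 183552` line, and the byte count matches the `dma 49152 in` field of the failed ATA command in the same log.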


Chris Murphy


Re: IO errors when building RAID1.... ?

2018-08-31 Thread Chris Murphy
On Fri, Aug 31, 2018 at 10:35 AM, Pierre Couderc  wrote:


>
> Aug 31 17:34:55 server su[559]: Successful su for root by nous
> Aug 31 17:34:55 server su[559]: + /dev/pts/1 nous:root
> Aug 31 17:34:55 server su[559]: pam_unix(su:session): session opened for
> user root by nous(uid=1000)
> Aug 31 17:34:55 server su[559]: pam_systemd(su:session): Cannot create
> session: Already running in a session
> Aug 31 17:35:03 server kernel: BTRFS info (device sda1): disk added
> /dev/sdb1
> Aug 31 17:35:40 server kernel: BTRFS info (device sda1): relocating block
> group 1103101952 flags 1
> Aug 31 17:36:12 server sshd[572]: Accepted password for nous from
> 2a01:e34:eeaf:c5f0:e54:15ff:feb1:b1c9 port 49308 ssh2
> Aug 31 17:36:12 server sshd[572]: pam_unix(sshd:session): session opened for
> user nous by (uid=0)
> Aug 31 17:36:12 server systemd-logind[415]: New session 4 of user nous.
> Aug 31 17:36:12 server systemd[1]: Started Session 4 of user nous.
> Aug 31 17:36:16 server kernel: ata1: lost interrupt (Status 0x50)
> Aug 31 17:36:16 server kernel: ata1.00: exception Emask 0x50 SAct 0x0 SErr
> 0x40d0802 action 0xe frozen
> Aug 31 17:36:16 server kernel: ata1.00: SError: { RecovComm HostInt
> PHYRdyChg CommWake 10B8B DevExch }
> Aug 31 17:36:16 server kernel: ata1.00: failed command: READ DMA
> Aug 31 17:36:16 server kernel: ata1.00: cmd
> c8/00:60:00:cd:02/00:00:00:00:00/e0 tag 0 dma 49152 in
> res
> 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x54 (ATA bus error)
> Aug 31 17:36:16 server kernel: ata1.00: status: { DRDY }
> Aug 31 17:36:16 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:17 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:18 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:18 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133
> SControl 300)
> Aug 31 17:36:18 server kernel: ata1.01: SATA link down (SStatus 4 SControl
> 0)
> Aug 31 17:36:18 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:18 server kernel: ata1.00: revalidation failed (errno=-2)
> Aug 31 17:36:20 server su[590]: Successful su for root by nous
> Aug 31 17:36:20 server su[590]: + /dev/pts/2 nous:root
> Aug 31 17:36:20 server su[590]: pam_unix(su:session): session opened for
> user root by nous(uid=1000)
> Aug 31 17:36:20 server su[590]: pam_systemd(su:session): Cannot create
> session: Already running in a session
> Aug 31 17:36:23 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:23 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:24 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:25 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133
> SControl 300)
> Aug 31 17:36:25 server kernel: ata1.01: SATA link down (SStatus 4 SControl
> 0)
> Aug 31 17:36:25 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:25 server kernel: ata1.00: revalidation failed (errno=-2)
> Aug 31 17:36:30 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:30 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:31 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:31 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133
> SControl 300)
> Aug 31 17:36:31 server kernel: ata1.01: SATA link down (SStatus 4 SControl
> 0)
> Aug 31 17:36:31 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:31 server kernel: ata1.00: revalidation failed (errno=-2)
> Aug 31 17:36:31 server kernel: ata1.00: disabled
> Aug 31 17:36:36 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:37 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:38 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:38 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133
> SControl 300)
> Aug 31 17:36:38 server kernel: ata1.01: SATA link down (SStatus 4 SControl
> 0)
> Aug 31 17:36:38 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result:
> hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Illegal
> Request [current]
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: Unaligned
> write command
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 CDB: Read(10) 28 00
> 00 02 cd 00 00 00 60 00
> Aug 31 17:36:38 server kernel: blk_update_request: I/O error, dev sda,
> sector 183552
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] killing request
> Aug 31 

IO errors when building RAID1.... ?

2018-08-31 Thread Pierre Couderc
I get IO errors when trying to build a RAID1 on the main filesystem,
after a normal Debian stretch install:




root@server:/home/nous# btrfs device add /dev/sdb1 /
root@server:/home/nous# btrfs fi show
Label: none  uuid: ef0b9dad-c0eb-4a3b-9b41-e5e249363abc
    Total devices 2 FS bytes used 824.60MiB
    devid    1 size 1.82TiB used 3.02GiB path /dev/sda1
    devid    2 size 1.82TiB used 0.00B path /dev/sdb1

root@server:/home/nous# btrfs balance start -v -mconvert=raid1 -dconvert=raid1 /

Dumping filters: flags 0x7, state 0x0, force is off
  DATA (flags 0x100): converting, target=16, soft is off
  METADATA (flags 0x100): converting, target=16, soft is off
  SYSTEM (flags 0x100): converting, target=16, soft is off
Killed
root@server:/home/nous# btrfs fi show
Label: none  uuid: ef0b9dad-c0eb-4a3b-9b41-e5e249363abc
    Total devices 2 FS bytes used 1.29GiB
    devid    2 size 1.82TiB used 1.00GiB path /dev/sdb1
    *** Some devices missing


Some IO errors on /dev/sda are found in journalctl (see them below)
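
For anyone sifting through a long journal like this one, here is a minimal Python sketch that pulls the per-device Btrfs error counters out of the kernel messages (lines of the form `BTRFS error (device sda1): bdev /dev/sda1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0`, which appear in the log from this thread). The regular expression and the function name are my own and only assume that message format; `btrfs device stats` reports the same counters directly on a mounted filesystem.

```python
import re

# Matches the counter line even if a mail client wrapped it before "errs:".
ERRS_RE = re.compile(
    r"BTRFS error \(device (?P<fs>[^)]+)\): bdev (?P<bdev>\S+)\s+"
    r"errs: wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def latest_btrfs_counters(journal_text: str) -> dict:
    """Return the most recent error counters seen for each block device."""
    latest = {}
    for m in ERRS_RE.finditer(journal_text):
        # Later log lines overwrite earlier ones, so the last value wins.
        latest[m["bdev"]] = {name: int(m[name]) for name in COUNTERS}
    return latest
```

Fed the journal text, this returns one dictionary per device, which makes it easy to see whether the errors are reads, writes, or checksum failures.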

I find it hard to believe that /dev/sda shows no hard disk errors during
a problem-free install, but shows many as soon as I run
"btrfs device add /dev/sdb1 /".


I can reproduce the problem: I reinstalled (3 times...) and ran
"btrfs device add /dev/sdb1 /" again, with the same results each time.





Aug 31 17:34:55 server su[559]: Successful su for root by nous
Aug 31 17:34:55 server su[559]: + /dev/pts/1 nous:root
Aug 31 17:34:55 server su[559]: pam_unix(su:session): session opened for 
user root by nous(uid=1000)
Aug 31 17:34:55 server su[559]: pam_systemd(su:session): Cannot create 
session: Already running in a session
Aug 31 17:35:03 server kernel: BTRFS info (device sda1): disk added 
/dev/sdb1
Aug 31 17:35:40 server kernel: BTRFS info (device sda1): relocating 
block group 1103101952 flags 1
Aug 31 17:36:12 server sshd[572]: Accepted password for nous from 
2a01:e34:eeaf:c5f0:e54:15ff:feb1:b1c9 port 49308 ssh2
Aug 31 17:36:12 server sshd[572]: pam_unix(sshd:session): session opened 
for user nous by (uid=0)

Aug 31 17:36:12 server systemd-logind[415]: New session 4 of user nous.
Aug 31 17:36:12 server systemd[1]: Started Session 4 of user nous.
Aug 31 17:36:16 server kernel: ata1: lost interrupt (Status 0x50)
Aug 31 17:36:16 server kernel: ata1.00: exception Emask 0x50 SAct 0x0 
SErr 0x40d0802 action 0xe frozen
Aug 31 17:36:16 server kernel: ata1.00: SError: { RecovComm HostInt 
PHYRdyChg CommWake 10B8B DevExch }

Aug 31 17:36:16 server kernel: ata1.00: failed command: READ DMA
Aug 31 17:36:16 server kernel: ata1.00: cmd 
c8/00:60:00:cd:02/00:00:00:00:00/e0 tag 0 dma 49152 in
    res 
40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x54 (ATA bus error)

Aug 31 17:36:16 server kernel: ata1.00: status: { DRDY }
Aug 31 17:36:16 server kernel: ata1.00: hard resetting link
Aug 31 17:36:17 server kernel: ata1.01: hard resetting link
Aug 31 17:36:18 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:18 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 
133 SControl 300)
Aug 31 17:36:18 server kernel: ata1.01: SATA link down (SStatus 4 
SControl 0)

Aug 31 17:36:18 server kernel: ata1.00: NODEV after polling detection
Aug 31 17:36:18 server kernel: ata1.00: revalidation failed (errno=-2)
Aug 31 17:36:20 server su[590]: Successful su for root by nous
Aug 31 17:36:20 server su[590]: + /dev/pts/2 nous:root
Aug 31 17:36:20 server su[590]: pam_unix(su:session): session opened for 
user root by nous(uid=1000)
Aug 31 17:36:20 server su[590]: pam_systemd(su:session): Cannot create 
session: Already running in a session

Aug 31 17:36:23 server kernel: ata1.00: hard resetting link
Aug 31 17:36:23 server kernel: ata1.01: hard resetting link
Aug 31 17:36:24 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:25 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 
133 SControl 300)
Aug 31 17:36:25 server kernel: ata1.01: SATA link down (SStatus 4 
SControl 0)

Aug 31 17:36:25 server kernel: ata1.00: NODEV after polling detection
Aug 31 17:36:25 server kernel: ata1.00: revalidation failed (errno=-2)
Aug 31 17:36:30 server kernel: ata1.00: hard resetting link
Aug 31 17:36:30 server kernel: ata1.01: hard resetting link
Aug 31 17:36:31 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:31 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 
133 SControl 300)
Aug 31 17:36:31 server kernel: ata1.01: SATA link down (SStatus 4 
SControl 0)

Aug 31 17:36:31 server kernel: ata1.00: NODEV after polling detection
Aug 31 17:36:31 server kernel: ata1.00: revalidation failed (errno=-2)
Aug 31 17:36:31 server kernel: ata1.00: disabled
Aug 31 17:36:36 server kernel: ata1.00: hard resetting link
Aug 31 17:36:37 server kernel: ata1.01: hard resetting link
Aug 31 17:36:38 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:38 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 
133 SControl 300)
Aug 31 17:36:38 server kernel: ata1.01: SATA