Re: having issue removing a drive with a bad block

2018-04-15 Thread Chris Murphy
On Sun, Apr 15, 2018 at 10:33 PM, Chris Murphy  wrote:
> On Sun, Apr 15, 2018 at 7:45 PM, Alexander Zapatka
>  wrote:
>> thanks, Chris.  i have given a timeout of 300 to all the drives.  they
>> are all USB, all connected to an apollo lake based htpc.  then i
>> started the command again... the dmesg output is here from a few
>> minutes after i started the btrfs device remove command.
>> https://paste.ee/p/H1R0i.  no hopes, high or low, but i'm still
>> getting the same errors.  i'll let it run through the night tho, as it
>> doesn't seem to hurt anything other than slowly lock the system up.
>>
>> on a side note, all the USB drives are either powered or are connected
>> to a powered hub..  thanks again!
>
> That's 5 minutes. I'd say something is wrong/badly designed if it's
> not giving a clear error message inside of 1 minute but the
> manufacturers have apparently decided upwards of 180. I haven't heard
> of it taking longer than 180, but good grief.

Bad proof reading: That's 180 seconds.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: having issue removing a drive with a bad block

2018-04-15 Thread Chris Murphy
On Sun, Apr 15, 2018 at 7:45 PM, Alexander Zapatka
 wrote:
> thanks, Chris.  i have given a timeout of 300 to all the drives.  they
> are all USB, all connected to an apollo lake based htpc.  then i
> started the command again... the dmesg output is here from a few
> minutes after i started the btrfs device remove command.
> https://paste.ee/p/H1R0i.  no hopes, high or low, but i'm still
> getting the same errors.  i'll let it run through the night tho, as it
> doesn't seem to hurt anything other than slowly lock the system up.
>
> on a side note, all the USB drives are either powered or are connected
> to a powered hub..  thanks again!

That's 5 minutes. I'd say something is wrong/badly designed if it's
not giving a clear error message inside of 1 minute but the
manufacturers have apparently decided upwards of 180. I haven't heard
of it taking longer than 180, but good grief.

Anyway, at this point it sounds like it continues indefinitely in this
state and there's no point in doing that. I would not keep trying
device remove until this problem is fixed, and I would use scrub to
confirm the fix rather than balance or device removal. Scrub is
faster and it's safer.


From smartctl, it's /dev/sdc that has the bad sector, and usb 2-3 is
the device being reset, which dmesg shows is also sdc:

[3.921241] usb 2-3: Product: Elements 107C
[3.921243] usb 2-3: Manufacturer: Western Digital
[3.921245] usb 2-3: SerialNumber: 57434334453443414636334E

[3.929353] usb-storage 2-3:1.0: USB Mass Storage device detected
[3.929651] scsi host3: usb-storage 2-3:1.0

[4.994087] sd 3:0:0:0: [sdc] 976746240 4096-byte logical blocks: (4.00 TB/3.64 TiB)

So the device with bad sector is also the device being reset but even
a 300 second command timer isn't causing the drive to report a read
error, it just hangs instead.
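The port-to-disk mapping can also be pulled out of dmesg mechanically.
This is only a sketch: usb_port_to_disk is a made-up helper name, and it
assumes dmesg lines shaped like the three quoted above ("scsi hostN:
usb-storage PORT:..." followed by "sd N:0:0:0: [sdX]").

```shell
#!/bin/sh
# Sketch: follow a USB port through dmesg text (read on stdin) to its sd name.
# usb_port_to_disk is an illustrative helper, not an existing tool.
usb_port_to_disk() {
    port="$1"
    awk -v port="$port" '
        # "scsi host3: usb-storage 2-3:1.0" -> remember host number 3
        match($0, /scsi host[0-9]+: usb-storage /) {
            rest = substr($0, RSTART + RLENGTH)
            if (index(rest, port ":") == 1) {
                h = substr($0, RSTART + 9, RLENGTH - 9)
                sub(/:.*/, "", h)
                host = h
            }
        }
        # "sd 3:0:0:0: [sdc] ..." -> print sdc when the host number matches
        host != "" && match($0, /sd [0-9]+:[0-9]+:[0-9]+:[0-9]+: \[[a-z]+\]/) {
            s = substr($0, RSTART + 3, RLENGTH - 3)
            split(s, parts, ":")
            if (parts[1] == host) {
                name = s
                sub(/.*\[/, "", name)
                sub(/\]/, "", name)
                print name
                exit
            }
        }
    '
}

# Usage: dmesg | usb_port_to_disk 2-3
```

It prints nothing if the port never shows up as a usb-storage host, so an
empty result means the assumption about the log format did not hold.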

That's not expected.

But also, this is the only device that has a 4096 byte logical sector
size, which probably isn't related unless there's a bug here.

The SMART-reported LBA 1372896792 for the first error should be a
4096-byte-based LBA in that case, so the proper command to just toss the
data in this sector and cause firmware remapping if necessary is:

dd if=/dev/zero of=/dev/sdX bs=4096 count=1 seek=1372896792 oflag=direct

Confirm the suspect drive is still in fact /dev/sdc, since that can
change between boots. And of course umount the file system first.
There's no reason to step on 16KiB.
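Before writing, it may be worth checking the LBA against the device size.
A rough sketch, using only the two numbers quoted in this thread; the
function names are made up, and whether SMART on this USB bridge counts
in 512-byte or 4096-byte units is an open question, not something the
thread establishes.

```shell
#!/bin/sh
# Sketch: sanity-check an LBA against the device size before writing to it.
# lba_in_range and convert_lba are illustrative names, not part of any tool.

# lba_in_range LBA TOTAL_BLOCKS -> exit 0 if LBA addresses a block on the device
lba_in_range() {
    [ "$1" -ge 0 ] && [ "$1" -lt "$2" ]
}

# convert_lba LBA FROM_SIZE TO_SIZE -> same byte offset in TO_SIZE-byte units
convert_lba() {
    echo $(( $1 * $2 / $3 ))
}

lba=1372896792        # first error LBA reported by SMART (from this thread)
blocks=976746240      # 4096-byte logical blocks reported by dmesg for sdc

if lba_in_range "$lba" "$blocks"; then
    echo "LBA $lba fits: dd bs=4096 seek=$lba"
else
    alt=$(convert_lba "$lba" 512 4096)
    echo "LBA $lba is past the last 4096-byte block ($((blocks - 1)))"
    echo "if SMART counts 512-byte units, the 4096-byte block would be $alt"
fi
```

With the numbers above the LBA is larger than the block count, so the
second branch runs. On a live system, blockdev --getsize64 and
blockdev --getss give the real size and logical sector size to plug in
instead of the dmesg figure.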

You can try that and then restart the long test from just prior to
that LBA and see if it finishes or stops on another sector.

smartctl -t select,1372896792-max /dev/sdX

Then mount the volume and do a scrub and see if it completes with no errors.
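Put together, the sequence looks roughly like the following. It is a
sketch that only prints the commands; repair_plan is a made-up name, and
/dev/sdX, /mnt/pool and the LBA are placeholders to confirm by hand
before running anything.

```shell
#!/bin/sh
# Sketch: print the repair sequence from this message instead of executing it.
# repair_plan is an illustrative name; device, mount point and LBA are
# placeholders supplied by the caller.
repair_plan() {
    dev="$1"; mnt="$2"; lba="$3"
    echo "umount $mnt"
    echo "dd if=/dev/zero of=$dev bs=4096 count=1 seek=$lba oflag=direct"
    echo "smartctl -t select,$lba-max $dev"
    # The self-test runs inside the drive; check progress before moving on.
    echo "smartctl -l selftest $dev"
    echo "mount $dev $mnt"
    # -B keeps scrub in the foreground so the result is visible when it ends.
    echo "btrfs scrub start -B $mnt"
}

repair_plan /dev/sdX /mnt/pool 1372896792
```

Printing first makes it easy to double check the device name, which is
the part most likely to have changed since the last boot.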


-- 
Chris Murphy


Re: having issue removing a drive with a bad block

2018-04-15 Thread Alexander Zapatka
thanks, Chris.  i have given a timeout of 300 to all the drives.  they
are all USB, all connected to an apollo lake based htpc.  then i
started the command again... the dmesg output is here from a few
minutes after i started the btrfs device remove command.
https://paste.ee/p/H1R0i.  no hopes, high or low, but i'm still
getting the same errors.  i'll let it run through the night tho, as it
doesn't seem to hurt anything other than slowly lock the system up.

on a side note, all the USB drives are either powered or are connected
to a powered hub..  thanks again!

On Sun, Apr 15, 2018 at 8:52 PM, Chris Murphy  wrote:
> On Sun, Apr 15, 2018 at 6:30 PM, Chris Murphy  wrote:
>
>> # echo value > /sys/block/device-name/device/timeout
>>
>
> Also note that this is not a persistent setting. It needs to be done
> per boot. But before you change it, use cat to find out what the value
> is. Default is 30.
>
> I'm seeing this:
> https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec
>
> Which could maybe be adapted from mdadm raid to look for Btrfs
> instead, or in addition to.
>
> --
> Chris Murphy



-- 
 -o)
 /\\    Message void if penguin violated
_\_V    Don't mess with the penguin


Re: having issue removing a drive with a bad block

2018-04-15 Thread Chris Murphy
On Sun, Apr 15, 2018 at 6:30 PM, Chris Murphy  wrote:

> # echo value > /sys/block/device-name/device/timeout
>

Also note that this is not a persistent setting. It needs to be done
per boot. But before you change it, use cat to find out what the value
is. Default is 30.
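As a sketch of doing this for several drives at once: the function name
and the SYSFS_ROOT override are made up for illustration, while the
/sys/block/device-name/device/timeout path is the one quoted above. It
prints the old value before changing it.

```shell
#!/bin/sh
# Sketch: show the current SCSI command timer for each disk, then set a new one.
# set_scsi_timeout and SYSFS_ROOT are illustrative; only the sysfs path
# /sys/block/<dev>/device/timeout comes from this thread.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"

set_scsi_timeout() {
    # $1 = timeout in seconds, remaining args = device names (sdb, sdc, ...)
    secs="$1"; shift
    for dev in "$@"; do
        f="$SYSFS_ROOT/$dev/device/timeout"
        if [ ! -w "$f" ]; then
            echo "$dev: $f not writable, skipping" >&2
            continue
        fi
        old=$(cat "$f")
        echo "$secs" > "$f"
        echo "$dev: $old -> $(cat "$f")"
    done
}

# Example (as root; the setting is lost at reboot):
#   set_scsi_timeout 180 sdb sdc sdd
```

Since the value does not persist, something like this would have to be
rerun after every boot, for example from a boot script or a udev rule.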

I'm seeing this:
https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec

Which could maybe be adapted from mdadm raid to look for Btrfs
instead, or in addition to.

-- 
Chris Murphy


Re: having issue removing a drive with a bad block

2018-04-15 Thread Chris Murphy
Please keep the list in the cc:

On Sun, Apr 15, 2018 at 5:55 PM, Alexander Zapatka
 wrote:
> output:
>
> $  sudo smartctl -l scterc /dev/sdc
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.0-38-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control command not supported

OK you'll need to increase the scsi command timer to something like
120. Hopefully that works. This needs to be done for each device.


# echo value > /sys/block/device-name/device/timeout

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/scsi-command-timer-device-status


> after the last reboot i haven't done anything to restart the scrub or
> remove the device...   i have a syslog from the last crash, you can
> see it here https://paste.ee/p/R2Pt7.  if is not enough, i will
> certainly start a scrub and let it crash.

>Apr 13 23:53:41 kodbox kernel: [225349.101299] usb 1-1.4.2: reset high-speed USB device number 7 using xhci_hcd

Hmmm, could be there's a power issue. Not sure if it's related to the
problem or not. I see this when I connect laptop drives in USB-powered
enclosures (no external power) directly to my Intel NUC, but
then the problem goes away when the drive is connected to a suitably
powered USB hub, which is then connected to the computer.

But it's worth a shot to see first whether changing the scsi command
timer as described resolves the problem.

-- 
Chris Murphy


Re: having issue removing a drive with a bad block

2018-04-15 Thread Chris Murphy
On Sun, Apr 15, 2018 at 6:14 AM, Alexander Zapatka
 wrote:
> i recently set up a drive pool in single mode on my little media
> server.  about a week later SMART started telling me that the drive
> was having issue and there is one bad sector.  since the array is far
> from full i decided to remove the drive from the pool.  but running
>
> btrfs device remove /dev/sdc /mnt/pool
>
> resulted in a deadlock.  everything crashed, and i had to pull the
> plug to reboot.  once up i did a btrfs check of the drive and it
> reported no issues with the file system...  but running the remove
> again results in a dead lock.  i have tried running a scrub and it
> eventually results in a dead lock also.

What do you get for:

$ sudo smartctl -l scterc /dev/sdc

And can you post a complete dmesg somewhere? Chances are this deadlock
is not really a deadlock, the system is hanging because Btrfs keeps
trying to read a bad block, and it's taking the drive so long to
recover that the kernel does a SATA link reset, and then Btrfs tries
to read again and then you get another hang while the drive decides
what to do - etc and it just doesn't end. But we need the dmesg even
if it takes 30 minutes for the dmesg command to complete - it's
probably easiest to do this with ssh remotely so that the dmesg result
when it finally appears is already on another machine and you don't
have to additionally mess around with outputting it to a file and then
getting the file off the hanging machine.

And don't hard reset it. 'sudo reboot -f' should be sufficient and
safe, even if not immediate; it might take a couple of minutes for it
to actually reboot.

What I'm betting is that you've got a mismatch between the kernel's
scsi command timer (defaults to 30 seconds) and the SCT ERC setting
for the drives. If they're consumer drives they either don't support
SCT ERC or it's disabled by default, in either case the recovery can
be well in excess of 30 seconds. So what you have to do is flip that
around so the drive gives up before the kernel. So either the command
timer has to be increased, or the drive SCT ERC value must be
decreased. And hence we need more info as requested above.
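The decision could be sketched like this. erc_action is a made-up name
that reads smartctl -l scterc output on stdin. Only the "not supported"
wording is taken from output posted to this thread; the "Disabled" and
"Read:" forms are assumed from typical smartctl output and may differ by
version.

```shell
#!/bin/sh
# Sketch: pick a next step from `smartctl -l scterc` output read on stdin.
# erc_action is an illustrative name. The "Disabled" and "Read:" patterns are
# assumptions about smartctl's output format, not taken from this thread.
erc_action() {
    out=$(cat)
    case "$out" in
        # Drive cannot shorten its recovery: make the kernel wait longer,
        # e.g. echo 180 > /sys/block/sdX/device/timeout
        *"not supported"*) echo "raise-command-timer" ;;
        # Supported but off: enable it, e.g. smartctl -l scterc,70,70 /dev/sdX
        # (values are in tenths of a second, so 70 means 7.0 seconds)
        *Disabled*)        echo "enable-scterc" ;;
        # Already set: make sure it is below the kernel command timer
        *Read:*)           echo "compare-with-command-timer" ;;
        *)                 echo "unknown" ;;
    esac
}

# Usage: sudo smartctl -l scterc /dev/sdX | erc_action
```

Either way the goal is the same as described above: the drive has to
give up and report an error before the kernel's command timer expires.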




-- 
Chris Murphy