Re: having issue removing a drive with a bad block
On Sun, Apr 15, 2018 at 10:33 PM, Chris Murphy wrote:
> On Sun, Apr 15, 2018 at 7:45 PM, Alexander Zapatka wrote:
>> thanks, Chris. i have given a timeout of 300 to all the drives. they
>> are all USB, all connected to an apollo lake based htpc. then i
>> started the command again... the dmesg output is here from a few
>> minutes after i started the btrfs device remove command.
>> https://paste.ee/p/H1R0i. no hopes, high or low, but i'm still
>> getting the same errors. i'll let it run through the night tho, as it
>> doesn't seem to hurt anything other than slowly lock the system up.
>>
>> on a side note, all the USB drives are either powered or are connected
>> to a powered hub.. thanks again!
>
> That's 5 minutes. I'd say something is wrong/badly designed if it's
> not giving a clear error message inside of 1 minute but the
> manufacturers have apparently decided upwards of 180. I haven't heard
> of it taking longer than 180, but good grief.

Bad proofreading: that's 180 seconds.

--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: having issue removing a drive with a bad block
On Sun, Apr 15, 2018 at 7:45 PM, Alexander Zapatka wrote:
> thanks, Chris. i have given a timeout of 300 to all the drives. they
> are all USB, all connected to an apollo lake based htpc. then i
> started the command again... the dmesg output is here from a few
> minutes after i started the btrfs device remove command.
> https://paste.ee/p/H1R0i. no hopes, high or low, but i'm still
> getting the same errors. i'll let it run through the night tho, as it
> doesn't seem to hurt anything other than slowly lock the system up.
>
> on a side note, all the USB drives are either powered or are connected
> to a powered hub.. thanks again!

That's 5 minutes. I'd say something is wrong/badly designed if it's
not giving a clear error message inside of 1 minute but the
manufacturers have apparently decided upwards of 180. I haven't heard
of it taking longer than 180, but good grief.

Anyway, at this point it sounds like it continues indefinitely in this
state and there's no point in doing that. I would not persist in trying
to use device remove until this problem is fixed, and I'd use scrub to
confirm the fix rather than either balance or device removal. Scrub is
faster and it's safer.

From smartctl, it's /dev/sdc that has the bad sector, and usb 2-3 is
the device being reset, which is this one:

[3.921241] usb 2-3: Product: Elements 107C
[3.921243] usb 2-3: Manufacturer: Western Digital
[3.921245] usb 2-3: SerialNumber: 57434334453443414636334E
[3.929353] usb-storage 2-3:1.0: USB Mass Storage device detected
[3.929651] scsi host3: usb-storage 2-3:1.0
[4.994087] sd 3:0:0:0: [sdc] 976746240 4096-byte logical blocks: (4.00 TB/3.64 TiB)

So the device with the bad sector is also the device being reset, but
even a 300 second command timer isn't causing the drive to report a
read error; it just hangs instead. That's not expected. But also, this
is the only device that has a 4096 byte logical sector size, which
probably isn't related unless there's a bug here.
The SMART-reported LBA 1372896792 for the first error should be a 4096
byte based LBA in that case, so the proper command to just toss the data
in this sector and cause firmware remapping if necessary is:

dd if=/dev/zero of=/dev/sdX bs=4096 count=1 seek=1372896792 oflag=direct

Confirm the suspect drive is still in fact /dev/sdc, since that can
change between boots. And of course umount the file system first.
There's no reason to step on 16KiB.

You can try that and then restart the long test from just prior to that
LBA and see if it finishes or stops on another sector.

smartctl -t select,1372896792-max /dev/sdX

Then mount the volume and do a scrub and see if it completes with no
errors.

--
Chris Murphy
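Before writing to a raw device it may be worth checking that the seek
value actually falls inside it. The following is only a sketch: the
function name lba_in_range is made up for illustration, the block count
is taken from the dmesg line quoted in this message, and the LBA is the
one from smartctl. The divide-by-8 branch rests on an assumption that is
not confirmed anywhere in this thread, namely that SMART counts 512-byte
sectors.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a block number lies inside the device.
#   $1 = block number to write (the dd seek= value)
#   $2 = number of logical blocks on the device
lba_in_range() {
    [ "$1" -ge 0 ] && [ "$1" -lt "$2" ]
}

blocks=976746240      # from dmesg: 976746240 4096-byte logical blocks
lba=1372896792        # first error LBA reported by smartctl

if lba_in_range "$lba" "$blocks"; then
    echo "LBA $lba is inside the device"
else
    echo "LBA $lba is past the last block ($((blocks - 1)))"
    # Assumption, not confirmed in the thread: SMART counts 512-byte sectors.
    echo "as a 512-byte LBA it maps to 4096-byte block $((lba / 8))"
fi
```

With the figures quoted here the check fails, because 1372896792 is
larger than 976746240. One possible explanation (again, an assumption)
is that the USB bridge presents 4096-byte logical sectors while SMART
reports the LBA in 512-byte units, in which case the seek value would
need dividing by 8. That is worth verifying before writing anything.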
Re: having issue removing a drive with a bad block
thanks, Chris. i have given a timeout of 300 to all the drives. they
are all USB, all connected to an apollo lake based htpc. then i
started the command again... the dmesg output is here from a few
minutes after i started the btrfs device remove command.
https://paste.ee/p/H1R0i. no hopes, high or low, but i'm still
getting the same errors. i'll let it run through the night tho, as it
doesn't seem to hurt anything other than slowly lock the system up.

on a side note, all the USB drives are either powered or are connected
to a powered hub.. thanks again!

On Sun, Apr 15, 2018 at 8:52 PM, Chris Murphy wrote:
> On Sun, Apr 15, 2018 at 6:30 PM, Chris Murphy wrote:
>
>> # echo value > /sys/block/device-name/device/timeout
>>
>
> Also note that this is not a persistent setting. It needs to be done
> per boot. But before you change it, use cat to find out what the value
> is. Default is 30.
>
> I'm seeing this:
> https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec
>
> Which could maybe be adapted from mdadm raid to look for Btrfs
> instead, or in addition to.
>
> --
> Chris Murphy

--
 -o)
 /\\   Message void if penguin violated
_\_V   Don't mess with the penguin
Re: having issue removing a drive with a bad block
On Sun, Apr 15, 2018 at 6:30 PM, Chris Murphy wrote:

> # echo value > /sys/block/device-name/device/timeout
>

Also note that this is not a persistent setting. It needs to be done
per boot. But before you change it, use cat to find out what the value
is. Default is 30.

I'm seeing this:
https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec

Which could maybe be adapted from mdadm raid to look for Btrfs
instead, or in addition to.

--
Chris Murphy
Re: having issue removing a drive with a bad block
Please keep the list in the cc:

On Sun, Apr 15, 2018 at 5:55 PM, Alexander Zapatka wrote:
> output:
>
> $ sudo smartctl -l scterc /dev/sdc
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.0-38-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control command not supported

OK you'll need to increase the scsi command timer to something like
120. Hopefully that works. This needs to be done for each device.

# echo value > /sys/block/device-name/device/timeout

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/scsi-command-timer-device-status

> after the last reboot i haven't done anything to restart the scrub or
> remove the device... i have a syslog from the last crash, you can
> see it here https://paste.ee/p/R2Pt7. if it is not enough, i will
> certainly start a scrub and let it crash.

> Apr 13 23:53:41 kodbox kernel: [225349.101299] usb 1-1.4.2: reset high-speed USB device number 7 using xhci_hcd

Hmmm, could be there's a power issue. Not sure if it's related to the
problem or not. I see this when I connect laptop drives in USB powered
enclosures (no external power) directly to my Intel NUC, but then the
problem goes away when the drive is connected to a suitably powered USB
hub, which is then connected to the computer.

But it's worth a shot to first see whether changing the scsi command
timer as described resolves the problem.

--
Chris Murphy
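Since the timer has to be set for each device, a small loop saves some
typing. This is a minimal sketch, not a tested recipe for this system:
the function name set_scsi_timeouts is made up, and the optional second
argument exists only so the loop can be tried against a scratch
directory instead of the real /sys/block.

```shell
#!/bin/sh
# Hypothetical helper: write a new SCSI command timer for every sd* disk.
#   $1 = timeout in seconds
#   $2 = sysfs block directory (defaults to /sys/block; override for a dry run)
set_scsi_timeouts() {
    new=$1
    root=${2:-/sys/block}
    for f in "$root"/sd*/device/timeout; do
        [ -e "$f" ] || continue          # the glob matched nothing
        old=$(cat "$f")                  # note the previous value first
        echo "$new" > "$f"
        echo "$f: $old -> $new"
    done
}

# Dry run against a scratch directory that imitates the sysfs layout.
scratch=$(mktemp -d)
mkdir -p "$scratch/sdc/device"
echo 30 > "$scratch/sdc/device/timeout"
set_scsi_timeouts 120 "$scratch"
```

On the real machine the function would be called as root with only the
first argument. The setting does not survive a reboot, so it has to be
repeated per boot.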
Re: having issue removing a drive with a bad block
On Sun, Apr 15, 2018 at 6:14 AM, Alexander Zapatka wrote:
> i recently set up a drive pool in single mode on my little media
> server. about a week later SMART started telling me that the drive
> was having issues and there is one bad sector. since the array is far
> from full i decided to remove the drive from the pool. but running
>
> btrfs device remove /dev/sdc /mnt/pool
>
> resulted in a deadlock. everything crashed, and i had to pull the
> plug to reboot. once up i did a btrfs check of the drive and it
> reported no issues with the file system... but running the remove
> again results in a deadlock. i have tried running a scrub and it
> eventually results in a deadlock also.

What do you get for:

$ sudo smartctl -l scterc

And can you post a complete dmesg somewhere?

Chances are this deadlock is not really a deadlock. The system is
hanging because Btrfs keeps trying to read a bad block, and it's taking
the drive so long to recover that the kernel does a SATA link reset,
and then Btrfs tries to read again and then you get another hang while
the drive decides what to do - etc and it just doesn't end.

But we need the dmesg even if it takes 30 minutes for the dmesg command
to complete - it's probably easiest to do this with ssh remotely so
that the dmesg result, when it finally appears, is already on another
machine and you don't have to additionally mess around with outputting
it to a file and then getting the file off the hanging machine.

And don't hard reset it. 'sudo reboot -f' should be sufficient and
safe, even if not immediate; it might take a couple of minutes for it
to actually reboot.

What I'm betting is that you've got a mismatch between the kernel's
scsi command timer (defaults to 30 seconds) and the SCT ERC setting for
the drives. If they're consumer drives they either don't support SCT
ERC or it's disabled by default; in either case the recovery can be
well in excess of 30 seconds.
So what you have to do is flip that around so the drive gives up before
the kernel. So either the command timer has to be increased, or the
drive SCT ERC value must be decreased. And hence we need more info as
requested above.

--
Chris Murphy
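The "drive gives up before the kernel" condition comes down to one
comparison. Here is a rough sketch of it, assuming the SCT ERC limit is
expressed in tenths of a second (the unit smartctl uses for scterc
values) and the kernel timer in whole seconds. The function name
timers_ok and the sample values are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the drive's error recovery limit
# is shorter than the kernel's SCSI command timer.
#   $1 = SCT ERC limit in deciseconds (e.g. 70 means 7.0 s)
#   $2 = kernel command timer in seconds (contents of .../device/timeout)
timers_ok() {
    erc_ds=$1
    timeout_s=$2
    # Compare in deciseconds to avoid fractional arithmetic.
    [ "$erc_ds" -lt $((timeout_s * 10)) ]
}

# 7.0 s recovery limit against the default 30 s timer: the drive gives up first.
timers_ok 70 30 && echo "ok: drive reports the error before the kernel resets it"

# Recovery that may take up to 180 s against the default 30 s timer: mismatch.
timers_ok 1800 30 || echo "mismatch: raise the timer or lower SCT ERC"
```

For a drive that does not support SCT ERC at all there is no first
argument to tune, so raising the command timer is the only side of the
comparison that can be changed.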