Re: scrub implies failing drive - smartctl blissfully unaware
On 11/25/2014 6:13 PM, Chris Murphy wrote:

The drive will only issue a read error when its ECC absolutely cannot recover the data, hard fail. A few years ago companies including Western Digital started shipping large cheap drives, think of the green drives. These had very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later they completely took out the ability to configure this error recovery timing, so you only get the upward of 2 minutes to actually get a read error reported by the drive. Presumably if the ECC determines it's a hard fail and there's no point in reading the same sector 14000 times, it would issue a read error much sooner. But again, the linux-raid list is full of cases where this doesn't happen, and merely by changing the Linux SCSI command timer from 30 to 121 seconds, now the drive reports an explicit read error with LBA information included, and now md can correct the problem.

I have one of those and took it out of service when it started reporting read errors (not timeouts). I tried several times to write over the bad sectors to force reallocation and it worked again for a while... then the bad sectors kept coming back. Oddly, the SMART values never indicated anything had been reallocated.

That's my whole point. When the link is reset, no read error is submitted by the drive; the md driver has no idea what the drive's problem was, no idea that it's a read problem, no idea what LBA is affected, and thus no way of writing over the affected bad sector. If the SCSI command timer is raised well above 30 seconds, this problem is resolved. Also, replacing the drive with one that definitively errors out (or can be configured with smartctl -l scterc) before 30 seconds is another option.

It doesn't know why or exactly where, but it does know *something* went wrong.

It doesn't really matter; clearly its timeout for drive commands is much higher than the Linux default of 30 seconds.
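The two remedies discussed here can be sketched as shell commands. This is a sketch only: /dev/sda and the sysfs name sda are placeholders for the actual RAID member, and the scterc command fails on drives whose firmware dropped SCT ERC support.

```shell
# Query the drive's current SCT ERC (error recovery) setting:
smartctl -l scterc /dev/sda

# Set read/write error recovery to 70 deciseconds (7 s), the usual
# RAID-friendly value. This setting is lost on power cycle, so it
# belongs in a boot script or udev rule:
smartctl -l scterc,70,70 /dev/sda

# If the drive's SCT ERC cannot be configured, raise the kernel's
# SCSI command timer above the drive's worst-case recovery (~2 min)
# so the drive gets to report an explicit read error:
echo 180 > /sys/block/sda/device/timeout
```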
Only if you are running Linux and can see the timeouts. You can't assume that's what is going on under Windows just because the desktop stutters.

OK, that doesn't actually happen, and it would be completely f'n wrong behavior if it were happening. All the kernel knows is the command timer has expired; it doesn't know why the drive isn't responding. It doesn't know there are uncorrectable sector errors causing the problem. To just assume link resets are the same thing as bad sectors and to just wholesale start writing possibly a metric shit ton of data when you don't know what the problem is would be asinine. It might even be sabotage. Jesus...

In normal single disk operation, sure: the kernel resets the drive and retries the request. But like I said before, I could have sworn there was an early failure flag that md uses to tell the lower layers NOT to attempt that kind of normal recovery, and instead just to return the failure right away so md can just go grab the data from the drive that isn't wigging out. That prevents the system from stalling on paging IO while the drive plays around with its deep recovery, and copying back 512k to the drive with the one bad sector isn't really that big of a deal.

Then there is one option which is to increase the value of the SCSI command timer. And that applies to all raid: md, lvm, btrfs, and hardware.

And then you get stupid hanging when you could just get the data from the other drive immediately.
-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
On 25 November 2014 at 22:34, Phillip Susi ps...@ubuntu.com wrote:

On 11/19/2014 7:05 PM, Chris Murphy wrote: I'm not a hard drive engineer, so I can't argue either point. But consumer drives clearly do behave this way. On Linux, the kernel's default 30 second command timer eventually results in what look like link errors rather than drive read errors. And instead of the problems being fixed with the normal md and btrfs recovery mechanisms, the errors simply get worse and eventually there's data loss. Exhibits A, B, C, D - the linux-raid list is full to the brim of such reports and their solution.

I have seen plenty of error logs of people with drives that do properly give up and return an error instead of timing out, so I get the feeling that most drives are properly behaved. Is there a particular make/model of drive that is known to exhibit this silly behavior?

I had a couple of Seagate Barracuda 7200.11 (codename Moose) drives with seriously broken firmware. They never reported a read error AFAIK but began to time out instead. They wouldn't even respond after a link reset. I had to power cycle the disks. Funny days with ddrescue. Got almost everything off them.
Re: scrub implies failing drive - smartctl blissfully unaware
On 25 November 2014 at 23:14, Phillip Susi ps...@ubuntu.com wrote:

On 11/19/2014 6:59 PM, Duncan wrote: The paper specifically mentioned that it wasn't necessarily the more expensive devices that were the best, either, but the ones that fared best did tend to have longer device-ready times. The conclusion was that a lot of devices are cutting corners on device-ready, gambling that in normal use they'll work fine, leading to an acceptable return rate, and evidently, the gamble pays off most of the time.

I believe I read the same study and don't recall any such conclusion. Instead the conclusion was that the badly behaving drives aren't ordering their internal writes correctly and flushing their metadata from RAM to flash before completing the write request. The problem was on the power *loss* side, not the power application.

I've found:
http://www.usenix.org/conference/fast13/technical-sessions/presentation/zheng
http://lkcl.net/reports/ssd_analysis.html

Are there any more studies?
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/19/2014 7:05 PM, Chris Murphy wrote: I'm not a hard drive engineer, so I can't argue either point. But consumer drives clearly do behave this way. On Linux, the kernel's default 30 second command timer eventually results in what look like link errors rather than drive read errors. And instead of the problems being fixed with the normal md and btrfs recovery mechanisms, the errors simply get worse and eventually there's data loss. Exhibits A, B, C, D - the linux-raid list is full to the brim of such reports and their solution.

I have seen plenty of error logs of people with drives that do properly give up and return an error instead of timing out, so I get the feeling that most drives are properly behaved. Is there a particular make/model of drive that is known to exhibit this silly behavior?

IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.

Well that's very coarse. It's not at a sector level, so as long as the drive continues to try to read from a particular LBA, but fails to either succeed reading or give up and report a read error within 30 seconds, you just get a bunch of wonky system behavior.

I don't understand this response at all. The drive isn't going to keep trying to read the same bad LBA; after the kernel times out, it resets the drive, and tries reading different smaller parts to see which it can read and which it can't.

Conversely, what I've observed on Windows in such a case is that it tolerates these deep recoveries on consumer drives. So they just get really slow, but the drive does seem to eventually recover (until it doesn't). But yeah, 2 minutes is a long time. So then the user gets annoyed and reinstalls their system.
Since that means writing to the affected drive, the firmware logic causes bad sectors to be dereferenced when the write error is persistent. Problem solved, faster system.

That seems like rather unsubstantiated guesswork. i.e. the 2 minute+ delays are likely not on an individual request, but from several requests that each go into deep recovery, possibly because Windows is retrying the same sector or a few consecutive sectors are bad.

Because now you have a member drive that's inconsistent. At least in the md raid case, a certain number of read failures causes the drive to be ejected from the array. Anytime there's a write failure, it's ejected from the array too. What you want is for the drive to give up sooner with an explicit read error, so md can help fix the problem by writing good data to the affected LBA. That doesn't happen when there are a bunch of link resets happening.

What? It is no different than when it does return an error, with the exception that the error is incorrectly applied to the entire request instead of just the affected sector.

Again, if your drive's SCT ERC is configurable, and set to something sane like 70 deciseconds, that read failure happens at MOST 7 seconds after the read attempt. And md is notified of *exactly* what sectors are affected; it immediately goes to mirror data, or rebuilds it from parity, and then writes the correct data to the previously reported bad sectors. And that will fix the problem.

Yes... I'm talking about when the drive doesn't support that.
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/19/2014 6:59 PM, Duncan wrote: It's not physical spinup, but electronic device-ready. It happens on SSDs too and they don't have anything to spin up.

If you have an SSD that isn't handling IO within 5 seconds or so of power on, it is badly broken.

But, for instance on my old Seagate 300-gigs that I used to have in 4-way mdraid, when I tried to resume from hibernate the drives would be spun up and talking to the kernel, but for some seconds to a couple minutes or so after spinup, they'd sometimes return something like (example) Seagrte3x0 instead of Seagate300. Of course that wasn't the exact string; I think it was the model number or perhaps the serial number or something. But looking at dmesg I could see the ATA layer come up for each of the four devices, the connection establish and seem to be returning good data, then the mdraid layer would try to assemble and would kick out a drive or two due to the device string mismatch compared to what was there before the hibernate. With the string mismatch, from its perspective the device had disappeared and been replaced with something else.

Again, these drives were badly broken then. Even if a drive needs extra time to come up for some reason, it shouldn't be reporting that it is ready and returning incorrect information.

And now I've seen similar behavior resuming from suspend on SSDs with btrfs raid. (The old hardware wouldn't resume from suspend to RAM, only hibernate; the new hardware resumes from suspend to RAM just fine, but I had trouble getting it to resume from hibernate back when I first set it up and tried it. I've not tried hibernate since, and didn't even set up swap to hibernate to when I got the SSDs, so I've not tried it for a couple years.) Btrfs isn't as informative as mdraid was on why it kicks a device, but dmesg says both devices are up, while btrfs is suddenly spitting errors on one device.
A reboot later and both devices are back in the btrfs, and I can do a scrub to resync, which generally finds and fixes errors on the btrfs filesystems that were writable (/home and /var/log), but of course not on the btrfs mounted as root, since it's read-only by default.

Several months back I was working on some patches to avoid blocking a resume until after all disks had spun up (someone else ended up getting a different version merged to the mainline kernel). I looked quite hard at the timings of things during suspend and found that my SSD was ready and handling IO darn near instantly, and the HD (5900 rpm WD green at the time) took something like 7 seconds before it was completing IO. These days I'm running a raid10 on 3 7200 rpm blues and it comes right up from suspend with no problems, just as it should.

The paper specifically mentioned that it wasn't necessarily the more expensive devices that were the best, either, but the ones that fared best did tend to have longer device-ready times. The conclusion was that a lot of devices are cutting corners on device-ready, gambling that in normal use they'll work fine, leading to an acceptable return rate, and evidently, the gamble pays off most of the time.

I believe I read the same study and don't recall any such conclusion. Instead the conclusion was that the badly behaving drives aren't ordering their internal writes correctly and flushing their metadata from RAM to flash before completing the write request. The problem was on the power *loss* side, not the power application.

The spinning rust in that study fared far better, with I think none of the devices scrambling their own firmware, and while there was some damage to storage, it was generally far better confined.

That is because they don't have a flash translation layer to get mucked up and prevent them from knowing where the blocks are on disk.
The worst thing you get out of a hdd losing power during a write is that the sector it was writing is corrupted and you have to re-write it.

My experience says otherwise. Else explain why those problems occur in the first two minutes, but don't occur if I hold it at the grub prompt to stabilize for two minutes, and never during normal post-stabilization operation. Of course perhaps there's another explanation for that, and I'm conflating the two things. But so far, experience matches the theory.

I don't know what was broken about these drives, only that it wasn't capacitors, since those charge in milliseconds, not seconds. Further, all systems using microprocessors (like the one in the drive that controls it) have reset circuitry that prevents them from running until after any caps have charged enough to get the power rail up to the required voltage.
Re: scrub implies failing drive - smartctl blissfully unaware
On Tue, Nov 25, 2014 at 2:34 PM, Phillip Susi ps...@ubuntu.com wrote: I have seen plenty of error logs of people with drives that do properly give up and return an error instead of timing out so I get the feeling that most drives are properly behaved. Is there a particular make/model of drive that is known to exhibit this silly behavior?

The drive will only issue a read error when its ECC absolutely cannot recover the data, hard fail. A few years ago companies including Western Digital started shipping large cheap drives, think of the green drives. These had very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later they completely took out the ability to configure this error recovery timing, so you only get the upward of 2 minutes to actually get a read error reported by the drive. Presumably if the ECC determines it's a hard fail and there's no point in reading the same sector 14000 times, it would issue a read error much sooner. But again, the linux-raid list is full of cases where this doesn't happen, and merely by changing the Linux SCSI command timer from 30 to 121 seconds, now the drive reports an explicit read error with LBA information included, and now md can correct the problem.

IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.

Well that's very coarse. It's not at a sector level, so as long as the drive continues to try to read from a particular LBA, but fails to either succeed reading or give up and report a read error within 30 seconds, you just get a bunch of wonky system behavior.

I don't understand this response at all. The drive isn't going to keep trying to read the same bad LBA; after the kernel times out, it resets the drive, and tries reading different smaller parts to see which it can read and which it can't.

That's my whole point.
When the link is reset, no read error is submitted by the drive; the md driver has no idea what the drive's problem was, no idea that it's a read problem, no idea what LBA is affected, and thus no way of writing over the affected bad sector. If the SCSI command timer is raised well above 30 seconds, this problem is resolved. Also, replacing the drive with one that definitively errors out (or can be configured with smartctl -l scterc) before 30 seconds is another option.

Conversely, what I've observed on Windows in such a case is that it tolerates these deep recoveries on consumer drives. So they just get really slow, but the drive does seem to eventually recover (until it doesn't). But yeah, 2 minutes is a long time. So then the user gets annoyed and reinstalls their system.

Since that means writing to the affected drive, the firmware logic causes bad sectors to be dereferenced when the write error is persistent. Problem solved, faster system.

That seems like rather unsubstantiated guesswork. i.e. the 2 minute+ delays are likely not on an individual request, but from several requests that each go into deep recovery, possibly because Windows is retrying the same sector or a few consecutive sectors are bad.

It doesn't really matter; clearly its timeout for drive commands is much higher than the Linux default of 30 seconds.

Because now you have a member drive that's inconsistent. At least in the md raid case, a certain number of read failures causes the drive to be ejected from the array. Anytime there's a write failure, it's ejected from the array too. What you want is for the drive to give up sooner with an explicit read error, so md can help fix the problem by writing good data to the affected LBA. That doesn't happen when there are a bunch of link resets happening.

What? It is no different than when it does return an error, with the exception that the error is incorrectly applied to the entire request instead of just the affected sector.
OK, that doesn't actually happen, and it would be completely f'n wrong behavior if it were happening. All the kernel knows is the command timer has expired; it doesn't know why the drive isn't responding. It doesn't know there are uncorrectable sector errors causing the problem. To just assume link resets are the same thing as bad sectors and to just wholesale start writing possibly a metric shit ton of data when you don't know what the problem is would be asinine. It might even be sabotage. Jesus...

Again, if your drive's SCT ERC is configurable, and set to something sane like 70 deciseconds, that read failure happens at MOST 7 seconds after the read attempt. And md is notified of *exactly* what sectors are affected; it immediately goes to mirror data, or rebuilds it from parity, and then writes the correct data to the previously reported bad sectors. And that will fix the problem.

Yes... I'm talking about when the drive doesn't support that.

Then there is one option which is to increase the value of the SCSI command timer.
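Raising the SCSI command timer doesn't survive a reboot, so for drives without configurable SCT ERC it is usually done via a udev rule. A hedged sketch (the rule file path is just a conventional choice, and 180 seconds is picked to exceed a consumer drive's worst-case ~2 minute recovery):

```shell
# /etc/udev/rules.d/60-disk-timeout.rules (hypothetical file name)
# Raise the kernel's per-command timeout to 180 s on every SATA/SCSI
# disk at boot, so a drive in deep recovery can still report an
# explicit read error instead of triggering a link reset.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"
```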
Re: scrub implies failing drive - smartctl blissfully unaware
On Tue, Nov 25, 2014 at 6:13 PM, Chris Murphy li...@colorremedies.com wrote: A few years ago companies including Western Digital started shipping large cheap drives, think of the green drives. These had very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later they completely took out the ability to configure this error recovery timing so you only get the upward of 2 minutes to actually get a read error reported by the drive.

Why sell an $80 hard drive when you can change a few bytes in the firmware and sell a crippled $80 drive and an otherwise-identical non-crippled $130 drive?

-- Rich
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/21/2014 04:12 PM, Robert White wrote: Here's a bug from 2005 of someone having a problem with the ACPI IDE support...

That is not ACPI emulation. ACPI is not used to access the disk; rather it has hooks that give it a chance to diddle with the disk to do things like configure it to lie about its maximum size, or issue a security unlock during suspend/resume.

People debating the merits of the ACPI IDE drivers in 2005.

No... that's not a debate at all; it is one guy asking if he should use IDE or ACPI mode... someone who again meant AHCI and typed the wrong acronym. Even when you get me for referencing Windows, you're still wrong...

How many times will you try to get out of being hideously horribly wrong about ACPI supporting disk/storage IO? It is neither recent nor rare. How much egg does your face really need before you just see that your fantasy that it's new and uncommon is a delusional mistake?

Project much? It seems I've proven just about everything I originally said you got wrong now, so hopefully we can be done.
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, 21 Nov 2014 09:05:32 +0200, Brendan Hide wrote:

On 2014/11/21 06:58, Zygo Blaxell wrote: I also notice you are not running regular SMART self-tests (e.g. by smartctl -t long) and the last (and first, and only!) self-test the drive ran was ~12000 hours ago. That means most of your SMART data is about 18 months old. The drive won't know about sectors that went bad in the last year and a half unless the host happens to stumble across them during a read. The drive is over five years old in operating hours alone. It is probably so fragile now that it will break if you try to move it.

All interesting points. Do you schedule SMART self-tests on your own systems? I have smartd running. In theory it tracks changes and sends alerts if it figures a drive is going to fail. But, based on what you've indicated, that isn't good enough.

Simply monitoring the SMART status without a self-test isn't really that great. I'm not sure of the default config, but smartd can be made to initiate a SMART self-test at regular intervals. Depending on the test type (short, long, etc.) it could include a full surface scan. This can reveal things like bad sectors before you ever hit them during normal system usage.

WARNING: errors detected during scrubbing, corrected. [snip]
scrub device /dev/sdb2 (id 2) done
    scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
    total bytes scrubbed: 189.49GiB with 5420 errors
    error details: read=5 csum=5415
    corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164

That seems a little off. If there were 5 read errors, I'd expect the drive to have errors in the SMART error log. Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. There have been a number of fixes to csums in btrfs pulled into the kernel recently, and I've retired two five-year-old computers this summer due to RAM/CPU failures.

The difference here is that the issue only affects the one drive.
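The scheduled self-testing described above is driven by smartd's -s directive. A sketch of an /etc/smartd.conf entry (the device name and mail recipient are placeholders):

```shell
# /etc/smartd.conf entry: monitor all SMART attributes (-a), enable
# automatic offline data collection (-o on) and attribute autosave
# (-S on), run a short self-test every day at 02:00 and a long
# self-test (full surface scan) every Saturday at 03:00, and mail
# root on failures.
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
```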
This leaves the probable cause at:
- the drive itself
- the cable/ports
with a negligibly-possible cause at the motherboard chipset.

This is the same problem that I'm currently trying to resolve. I have one drive in a raid1 setup which shows no issues in SMART status but often has checksum errors. In my situation what I've found is that if I scrub and let it fix the errors, then a second pass immediately after will show no errors. If I then leave it a few days and try again, there will be errors, even in old files which have not been accessed for months. If I do a read-only scrub to get a list of errors, a second scrub immediately after will show exactly the same errors. Apart from the scrub errors, the system logs show no issues with that particular drive.

My next step is to disable autodefrag and see if the problem persists. (I'm not suggesting a problem with autodefrag; I just want to remove it from the equation and ensure that, outside of normal file access, data isn't being rewritten between scrubs.)

-- Ian
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/20/2014 5:45 PM, Robert White wrote: Nice attempt at saving face, but wrong as _always_. The CONFIG_PATA_ACPI option has been in the kernel since 2008 and lots of people have used it. If you search for ACPI ide you'll find people complaining in 2008-2010 about Windows error messages indicating the device is present in their system but no OS driver is available.

Nope... not finding it. The closest thing was one or two people who said ACPI when they meant AHCI (and were quickly corrected). This is probably what you were thinking of, since Windows XP did not ship with an AHCI driver, so it was quite common for WinXP users to have this problem when in _AHCI_ mode.

That you have yet to see a single system that implements it is about the worst piece of internet research I've ever seen. Do you not _get_ that your opinion about what exists and how it works is not authoritative?

Show me one and I'll give you a cookie. I have disassembled a number of ACPI tables and have yet to see one that has it. What's more, motherboard vendors tend to implement only the absolute minimum they have to. Since nobody actually needs this feature, they aren't going to bother with it. Do you not get that your hand-waving arguments of "you can google for it" are not authoritative?

You can also find articles about both Windows and Linux systems actively using ACPI fan control going back to 2009.

Maybe you should have actually read those articles. Linux supports ACPI fan control; unfortunately, almost no motherboards actually implement it. Almost everyone who wants fan control working in Linux has to install lm-sensors, load a driver that directly accesses one of the embedded controllers that motherboards tend to use, and run the fancontrol script to manipulate the pwm channels on that controller. These days you also have to boot with a kernel argument to allow loading the driver, since ACPI claims those IO ports for its own use, which creates a conflict.
Windows users that want to do this have to install a program... I believe a popular one is called q-fan... that likewise directly accesses the embedded controller registers to control the fan, since the ACPI tables don't bother properly implementing the ACPI fan spec. Then there are ThinkPads, and one or two other laptops (Asus comes to mind) that went and implemented their own proprietary ACPI interfaces for fan control instead of following the spec, which required some reverse engineering and yet more drivers to handle these proprietary ACPI interfaces. You can google for thinkfan if you want to see this.

These are not hard searches to pull off. These are not obscure references. Go to the google box and start typing ACPI fan... and check the autocomplete. I'll skip over all the parts where you don't know how a chipset works and blah, blah, blah... You really should have just stopped at "I don't know" and "I've never" because you keep demonstrating that you _don't_ know, and that you really _should_ _never_. Tell us more about the lizard aliens controlling your computer, I find your versions of reality fascinating...

By all means, keep embarrassing yourself with nonsense and trying to cover it up by being rude and insulting.
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/20/2014 6:08 PM, Robert White wrote: Well you should have _actually_ trimmed your response down to not pressing send. _Many_ motherboards have complete RAID support at levels 0, 1, 10, and 5. A few have RAID6. Some of them even use the LSI chip-set.

Yes, there are some expensive server-class motherboards out there with integrated real raid chips. Your average consumer-class motherboards are not those. They contain Intel, Nvidia, SiI, Promise, and VIA chipsets that are fake raid.

Seriously... are you trolling this list with disinformation or just repeating tribal knowledge from fifteen-year-old copies of PC Magazine? Please drop the penis measuring. Yea, some of the IDE motherboards that only had RAID1 and RAID0 (and indeed some of the add-on controllers) back in the IDE-only days were really lame just-forked-write devices with no integrity checks (hence "fake raid"), but that's from like the 1990s; it's paleolithic-age wisdom at this point.

Wrong again... fakeraid became popular with the advent of SATA, since it was easy to add a knob to the BIOS to switch it between AHCI and RAID mode and just change the PCI device id. These chipsets are still quite common today, and several of them do support raid5 and raid10 (well, really it's raid 0 + raid 1, but that's a whole nother can of worms). Recent Intel chips also now have a caching mode for having an SSD cache a larger HDD. Intel has also done a lot of work integrating support for their chipset into mdadm in the last year or three.
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong bt...@iarmst.co.uk wrote: In my situation what I've found is that if I scrub and let it fix the errors, then a second pass immediately after will show no errors. If I then leave it a few days and try again there will be errors, even in old files which have not been accessed for months. What are the devices? And if they're SSDs are they powered off for these few days? I take it the scrub error type is corruption? You can use badblocks to write a known pattern to the drive. Then power off and leave it for a few days. Then read the drive, matching against the pattern, and see if there are any discrepancies. Doing this outside the code path of Btrfs would fairly conclusively indicate whether it's hardware or software induced. Assuming you have another copy of all of these files :-) you could just sha256sum the two copies to see if they have in fact changed. If they have, well then you've got some silent data corruption somewhere somehow. But if they always match, then that suggests a bug. I don't see how you can get bogus corruption messages, and for it to not be a bug. When you do these scrubs that come up clean, and then later come up with corruptions, have you done any software updates? My next step is to disable autodefrag and see if the problem persists. (I'm not suggesting a problem with autodefrag, I just want to remove it from the equation and ensure that outside of normal file access, data isn't being rewritten between scrubs) I wouldn't expect autodefrag to touch old files not accessed for months. Doesn't it only affect actively used files? -- Chris Murphy
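[A concrete sketch of the write-pattern / wait / verify test Chris describes. badblocks -w does its write-and-verify in a single pass, so for the "write, power off for days, then check" split, one way is a fixed pattern via dd and cmp (GNU coreutils assumed). It is demonstrated on a scratch file standing in for the drive; pointing dev at a real block device is DESTRUCTIVE to everything on it.]

```shell
# Demonstrated on a scratch file; substituting a real device (e.g. /dev/sdX)
# for "dev" destroys all data on it.
dev=scratch.img
dd if=/dev/zero of="$dev" bs=1M count=8 2>/dev/null     # stand-in "drive"
size=$(stat -c %s "$dev")     # for a real device: blockdev --getsize64 "$dev"

# 1. Fill the device with a known pattern (0xAA) and flush it to media.
tr '\0' '\252' </dev/zero \
  | dd of="$dev" bs=1M count=8 iflag=fullblock conv=notrunc,fsync 2>/dev/null

# 2. Power off, wait a few days, then compare against the same pattern.
tr '\0' '\252' </dev/zero | cmp -n "$size" - "$dev" && echo "pattern intact"
```

Any mismatch makes cmp report the first differing byte offset, which you can then map back to a sector.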
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, Nov 21, 2014 at 09:05:32AM +0200, Brendan Hide wrote: On 2014/11/21 06:58, Zygo Blaxell wrote: You have one reallocated sector, so the drive has lost some data at some time in the last 49000(!) hours. Normally reallocations happen during writes so the data that was lost was data you were in the process of overwriting anyway; however, the reallocated sector count could also be a sign of deteriorating drive integrity. In /var/lib/smartmontools there might be a csv file with logged error attribute data that you could use to figure out whether that reallocation was recent. I also notice you are not running regular SMART self-tests (e.g. by smartctl -t long) and the last (and first, and only!) self-test the drive ran was ~12000 hours ago. That means most of your SMART data is about 18 months old. The drive won't know about sectors that went bad in the last year and a half unless the host happens to stumble across them during a read. The drive is over five years old in operating hours alone. It is probably so fragile now that it will break if you try to move it. All interesting points. Do you schedule SMART self-tests on your own systems? I have smartd running. In theory it tracks changes and sends alerts if it figures a drive is going to fail. But, based on what you've indicated, that isn't good enough. I run 'smartctl -t long' from cron overnight (or whenever the drives are most idle). You can also set up smartd.conf to launch the self tests; however, the syntax for test scheduling is byzantine compared to cron (and that's saying something!). On multi-drive systems I schedule a different drive for each night. If you are also doing btrfs scrub, then stagger the scheduling so e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. smartd is OK for monitoring test logs and email alerts. I've had no problems there. WARNING: errors detected during scrubbing, corrected. 
[snip]
scrub device /dev/sdb2 (id 2) done
	scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
	total bytes scrubbed: 189.49GiB with 5420 errors
	error details: read=5 csum=5415
	corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
That seems a little off. If there were 5 read errors, I'd expect the drive to have errors in the SMART error log. Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. There have been a number of fixes to csums in btrfs pulled into the kernel recently, and I've retired two five-year-old computers this summer due to RAM/CPU failures. The difference here is that the issue only affects the one drive. This leaves the probable cause at:
- the drive itself
- the cable/ports
with a negligibly-possible cause at the motherboard chipset. If it was the cable, there should be UDMA CRC errors or similar in the SMART counters, but they are zero. You can also try swapping the cable and seeing whether the errors move. I've found many bad cables that way. The drive itself could be failing in some way that prevents recording SMART errors (e.g. because of host timeouts triggering a bus reset, which also prevents the SMART counter update for what was going wrong at the time). This is unfortunately quite common, especially with drives configured for non-RAID workloads.
-- 
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
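[The staggered scheduling described above could be sketched as a cron fragment like this. Device names, times, and the /mnt/pool mount point are assumptions, not anything from the thread; note that % must be escaped in crontab entries.]

```shell
# /etc/cron.d/disk-health -- sketch only; devices, times, and /mnt/pool
# are placeholders.
# Long SMART self-test, a different drive each night, while disks are idle:
30 3 * * 1  root  smartctl -t long /dev/sda
30 3 * * 2  root  smartctl -t long /dev/sdb
30 3 * * 3  root  smartctl -t long /dev/sdc
# btrfs scrub on odd ISO weeks only, leaving even weeks to the SMART tests:
0 4 * * 0   root  [ $(( $(date +\%V) \% 2 )) -eq 1 ] && btrfs scrub start -B /mnt/pool
```

smartctl -t long only kicks the test off in the drive's background; the result shows up later via smartd or smartctl -l selftest.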
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell zblax...@furryterror.org wrote: I run 'smartctl -t long' from cron overnight (or whenever the drives are most idle). You can also set up smartd.conf to launch the self tests; however, the syntax for test scheduling is byzantine compared to cron (and that's saying something!). On multi-drive systems I schedule a different drive for each night. If you are also doing btrfs scrub, then stagger the scheduling so e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. smartd is OK for monitoring test logs and email alerts. I've had no problems there. Most attributes are always updated without issuing a smart test of any kind. A drive I have here only has four offline updateable attributes. When it comes to bad sectors, the drive won't use a sector that persistently fails writes. So you don't really have to worry about latent bad sectors that don't have data on them already. The sectors you care about are the ones with data. A scrub reads all of those sectors. First, the drive could report a read error, in which case Btrfs raid1/10, and any (md, lvm, hardware) raid, can use mirrored data, or rebuild it from parity, and write to the affected sector; and this same mechanism happens in normal reads, so it's a kind of passive scrub. But it happens to miss data that isn't being actively read, which a scrub will check. Second, the drive could report no problem, and Btrfs raid1/10 could still fix the problem in case of a csum mismatch. And it looks like soonish we'll see this apply to raid5/6. So I think a nightly long smart test is a bit overkill. I think you could do nightly -t short tests, which will report problems scrub won't notice, such as higher seek times or lower throughput performance. And then scrub once a week. The drive itself could be failing in some way that prevents recording SMART errors (e.g.
because of host timeouts triggering a bus reset, which also prevents the SMART counter update for what was going wrong at the time). This is unfortunately quite common, especially with drives configured for non-RAID workloads. Libata resetting the link should be recorded in kernel messages. -- Chris Murphy
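[For reference, a sketch of how one might look for those libata resets and apply the timeout adjustments discussed earlier in the thread. /dev/sdX and the 121-second value are placeholders (121 s follows the suggestion upthread); the live commands need root and a drive that supports SCT ERC, so they are shown as comments.]

```shell
# Pattern matching typical libata reset/timeout lines in the kernel log:
reset_pattern='ata[0-9]+.*(hard resetting link|SError|frozen)'

# On a live system (placeholders, root required):
#   dmesg | grep -E "$reset_pattern"            # any link resets logged?
#   smartctl -l scterc /dev/sdX                 # query SCT ERC support/values
#   smartctl -l scterc,70,70 /dev/sdX           # cap recovery at 7 s read/write
#   echo 121 > /sys/block/sdX/device/timeout    # kernel command timer, seconds
```

If the grep turns up resets with no accompanying drive read error, that is the silent-timeout failure mode described in this thread.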
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/21/2014 07:11 AM, Phillip Susi wrote: On 11/20/2014 5:45 PM, Robert White wrote: If you search for ACPI ide you'll find people complaining in 2008-2010 about windows error messages indicating the device is present in their system but no OS driver is available. Nope... not finding it. The closest thing was one or two people who said ACPI when they meant AHCI ( and were quickly corrected ). This is probably what you were thinking of since windows xp did not ship with an ahci driver so it was quite common for winxp users to have this problem when in _AHCI_ mode. I have to give you that one... I should have never trusted any reference to windows. Most of those references to windows support were getting AHCI and ACPI mixed up. Lolz windows users... They didn't get into ACPI disk support till 2010. I should have known they were behind the times. I had to scroll down almost a whole page to find the linux support. So let's just look at the top of ide/ide-acpi.c from linux 2.6 to consult about when ACPI got into the IDE business...

linux/drivers/ide/ide-acpi.c:
/*
 * Provides ACPI support for IDE drives.
 *
 * Copyright (C) 2005 Intel Corp.
 * Copyright (C) 2005 Randy Dunlap
 * Copyright (C) 2006 SUSE Linux Products GmbH
 * Copyright (C) 2006 Hannes Reinecke
 */

Here's a bug from 2005 of someone having a problem with the ACPI IDE support... https://www.google.com/url?sa=trct=jq=esrc=ssource=webcd=6cad=rjauact=8ved=0CDkQFjAFurl=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D5604ei=g6VvVL73K-HLsASIrYKIDgusg=AFQjCNGTuuXPJk91svGJtRAf35DUqVqrLgsig2=eHxwbLYXn4ED5jG-guoZqg People debating the merits of the ACPI IDE drivers in 2005. https://www.google.com/url?sa=trct=jq=esrc=ssource=webcd=12cad=rjauact=8ved=0CGUQFjALurl=http%3A%2F%2Fwww.linuxquestions.org%2Fquestions%2Fslackware-14%2Fbare-ide-and-bare-acpi-kernels-297525%2Fei=g6VvVL73K-HLsASIrYKIDgusg=AFQjCNFoyKgH2sOteWwRN_Tdrfw9hOmVGQsig2=BmMVcZl24KRz4s4gEvLN_w So you got me...
windows was behind the curve by five years instead of just three... my bad... But yea, nobody has ever used that ACPI disk drive support that's been in the kernel for nine years. Even when you get me for referencing windows, you're still wrong... How many times will you try to get out of being hideously horribly wrong about ACPI supporting disk/storage IO? It is neither recent nor rare. How much egg does your face really need before you just see that your fantasy that it's new and uncommon is a delusional mistake? Methinks Misters Dunning and Kruger need a word with you...
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/21/2014 01:12 PM, Robert White wrote: (wrong links included in post...) Dangit... those two links were bad... wrong clipboard... /sigh... I'll just stand on the pasted text from the driver. 8-)
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, Nov 21, 2014 at 11:06:19AM -0700, Chris Murphy wrote: On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell zblax...@furryterror.org wrote: I run 'smartctl -t long' from cron overnight (or whenever the drives are most idle). You can also set up smartd.conf to launch the self tests; however, the syntax for test scheduling is byzantine compared to cron (and that's saying something!). On multi-drive systems I schedule a different drive for each night. If you are also doing btrfs scrub, then stagger the scheduling so e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. smartd is OK for monitoring test logs and email alerts. I've had no problems there. Most attributes are always updated without issuing a smart test of any kind. A drive I have here only has four offline updateable attributes. One of those four is Offline_Uncorrectable, which is a really important attribute to monitor! When it comes to bad sectors, the drive won't use a sector that persistently fails writes. So you don't really have to worry about latent bad sectors that don't have data on them already. The sectors you care about are the ones with data. A scrub reads all of those sectors. A scrub reads all the _allocated_ sectors. A long selftest reads _everything_, and also exercises the electronics and mechanics of the drive in ways that normal operation doesn't. I have several disks that are less than 25% occupied, which means scrubs will ignore 75% of the disk surface at any given time. A sharp increase in the number of bad sectors (no matter how they are detected) usually indicates a total drive failure is coming. Many drives have been nice enough to give me enough warning for their RMA replacements to be delivered just a few hours before the drive totally fails. 
First the drive could report a read error in which case Btrfs raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or rebuild it from parity, and write to the affected sector; and also this same mechanism happens in normal reads so it's a kind of passive scrub. But it happens to miss data that isn't being actively read, which a scrub will check. Second, the drive could report no problem, and Btrfs raid1/10 could still fix the problem in case of a csum mismatch. And it looks like soonish we'll see this apply to raid5/6. So I think a nightly long smart test is a bit overkill. I think you could do nightly -t short tests, which will report problems scrub won't notice, such as higher seek times or lower throughput performance. And then scrub once a week. Drives quite often drop a sector or two over the years, and it can be harmless. What you want to be watching out for is hundreds of bad sectors showing up over a period of a few days--that means something is rattling around on the disk platters, damaging the hardware as it goes. To get that data, you have to test the disks every few days. The drive itself could be failing in some way that prevents recording SMART errors (e.g. because of host timeouts triggering a bus reset, which also prevents the SMART counter update for what was going wrong at the time). This is unfortunately quite common, especially with drives configured for non-RAID workloads. Libata resetting the link should be recorded in kernel messages. This is true, but the original question was about SMART data coverage. This is why it's important to monitor both. -- Chris Murphy
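[A minimal sketch of the kind of attribute-growth check described above. The threshold, the state-file path, and the choice of attribute are assumptions, not anything smartmontools provides out of the box.]

```shell
# raw_attr: extract the RAW_VALUE column for one attribute id from
# `smartctl -A` output read on stdin.
raw_attr() { awk -v id="$1" '$1 == id { print $NF }'; }

# check_growth CURRENT PREVIOUS THRESHOLD: warn when the count jumped.
check_growth() {
  [ $(( $1 - $2 )) -ge "$3" ] && echo "WARNING: +$(( $1 - $2 )) bad sectors"
}

# On a live system (placeholder device), e.g. from a daily cron job:
#   cur=$(smartctl -A /dev/sdX | raw_attr 198)   # 198 = Offline_Uncorrectable
#   check_growth "$cur" "$(cat /var/tmp/offl 2>/dev/null || echo "$cur")" 10
#   echo "$cur" > /var/tmp/offl
```

A slow trickle of reallocations stays under the threshold; the "hundreds over a few days" pattern trips the warning.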
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, 21 Nov 2014 10:45:21 -0700 Chris Murphy wrote: On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong bt...@iarmst.co.uk wrote: In my situation what I've found is that if I scrub and let it fix the errors, then a second pass immediately after will show no errors. If I then leave it a few days and try again there will be errors, even in old files which have not been accessed for months. What are the devices? And if they're SSDs are they powered off for these few days? I take it the scrub error type is corruption? It's spinning rust and the checksum error is always on the one drive (a SAMSUNG HD204UI). The firmware has been updated, since some were shipped with a bad version which could result in data corruption. You can use badblocks to write a known pattern to the drive. Then power off and leave it for a few days. Then read the drive, matching against the pattern, and see if there are any discrepancies. Doing this outside the code path of Btrfs would fairly conclusively indicate whether it's hardware or software induced. Unfortunately I'm reluctant to go the badblocks route for the entire drive since it's the second drive in a 2 drive raid1 and I don't currently have a spare. There is a small 6G partition that I can use, but given that the drive is large and the errors are few, it could take a while for anything to show. I also have a second 2 drive btrfs raid1 in the same machine that doesn't have this problem. All the drives are running off the same controller. Assuming you have another copy of all of these files :-) you could just sha256sum the two copies to see if they have in fact changed. If they have, well then you've got some silent data corruption somewhere somehow. But if they always match, then that suggests a bug. Some of the files already have an md5 linked to them, while others have parity files to give some level of recovery from corruption or damage. Checking against these shows no problems, so I assume that btrfs is doing its job and only serving an intact file.
I don't see how you can get bogus corruption messages, and for it to not be a bug. When you do these scrubs that come up clean, and then later come up with corruptions, have you done any software updates? No software updates between clean and corrupt. I don't have to power down or reboot either for checksum errors to appear. I don't think the corruption messages are bogus, but are indicating a genuine problem. What I would like to be able to do is compare the corrupt block with the one used to repair it and see what the difference is. As I've already stated, the system logs are clean and the smart logs aren't showing any issues. (Well, until today when a self-test failed with a read error, but it must be an unused sector since the scrub doesn't hit it, and there are no re-allocated sectors yet) My next step is to disable autodefrag and see if the problem persists. (I'm not suggesting a problem with autodefrag, I just want to remove it from the equation and ensure that outside of normal file access, data isn't being rewritten between scrubs) I wouldn't expect autodefrag to touch old files not accessed for months. Doesn't it only affect actively used files? The drive is mainly used to hold old archive files, though there are daily rotating files on it as well. The corruption affects both new and old files. -- Ian
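[The stored-checksum approach Ian mentions can be as simple as this self-contained sketch; the file and manifest names are illustrative.]

```shell
# Record checksums while the files are known-good; verify on later runs.
mkdir -p archive && printf 'payload\n' > archive/a.bin
md5sum archive/*.bin > manifest.md5
md5sum -c manifest.md5        # prints "archive/a.bin: OK"; any silently
                              # changed file shows up as FAILED
```

For new protection sha256sum works identically and is a stronger check than md5, though for detecting accidental corruption either will do.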
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/19/2014 5:25 PM, Robert White wrote: The controller, the thing that sets the ready bit and sends the interrupt is distinct from the driver, the thing that polls the ready bit when the interrupt is sent. At the bus level there are fixed delays and retries. Try putting two drives on a pin-select IDE bus and strapping them both as _slave_ (or indeed master) sometime and watch the shower of fixed delay retries. No, it does not. In classical IDE, the controller is really just a bus bridge. When you read from the status register in the controller, the read bus cycle is propagated down the IDE ribbon, and into the drive, and you are, in fact, reading the register directly from the drive. That is where the name Integrated Device Electronics came from: because the controller was really integrated into the drive. The only fixed delays at the bus level are the bus cycle speed. There are no retries. There are only 3 mentions of the word retry in the ATA8-APT and they all refer to the host driver. That's odd... my bios reads from storage to boot the device and it does so using the ACPI storage methods. No, it doesn't. It does so by accessing the IDE or AHCI registers just as PC bios always has. I suppose I also need to remind you that we are talking about the context of linux here, and linux does not make use of the bios for disk access. ACPI 4.0 Specification Section 9.8 even disagrees with you at some length. Let's just do the titles shall we:
9.8 ATA Controller Devices
9.8.1 Objects for both ATA and SATA Controllers
9.8.2 IDE Controller Device
9.8.3 Serial ATA (SATA) Controller Device
Oh, and _lookie_ _here_ in Linux Kernel Menuconfig at:
Device Drivers ->
  * Serial ATA and Parallel ATA drivers (libata) ->
    * ACPI firmware driver for PATA
CONFIG_PATA_ACPI: This option enables an ACPI method driver which drives motherboard PATA controller interfaces through the ACPI firmware in the BIOS.
This driver can sometimes handle otherwise unsupported hardware. You are a storage _genius_ for knowing that all that stuff doesn't exist... the rest of us must simply muddle along in our delusion... Yes, ACPI 4.0 added this mess. I have yet to see a single system that actually implements it. I can't believe they even bothered adding this driver to the kernel. Is there anyone in the world who has ever used it? If no motherboard vendor has bothered implementing the ACPI FAN specs, I very much doubt anyone will ever bother with this. Do tell us more... I didn't say the driver would cause long delays, I said that the time it takes to error out other improperly supported drivers and fall back to this one could induce long delays and resets. There is no error out and fall back. If the device is in AHCI mode then it identifies itself as such and the AHCI driver is loaded. If it is in IDE mode, then it identifies itself as such, and the IDE driver is loaded.
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/19/2014 5:33 PM, Robert White wrote: That would be fake raid, not hardware raid. The LSI MegaRaid controller people would _love_ to hear more about your insight into how their battery-backed multi-drive RAID controller is fake. You should go work for them. Try the contact us link at the bottom of this page. I'm sure they are waiting for your insight with bated breath! Forgive me, I should have trimmed the quote a bit more. I was responding specifically to the many motherboards have hardware RAID support available through the bios part, not the lsi part. Odd, my MegaRaid controller takes about fifteen seconds by-the-clock to initialize and do the integrity check on my single initialized drive. It is almost certainly spending those 15 seconds on something else, like bootstrapping its firmware code from a slow serial eeprom or waiting for you to press the magic key to enter the bios utility. I would be very surprised to see that time double if you add a second disk. If it does, then they are doing something *very* wrong, and certainly quite different from any other real or fake raid controller I've ever used. It's amazing that with a fail and retry it would be _faster_... I have no idea what you are talking about here. I said that they aren't going to retry a read that *succeeded* but came back without their magic signature. It isn't like reading it again is going to magically give different results.
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/20/2014 12:26 PM, Phillip Susi wrote: Yes, ACPI 4.0 added this mess. I have yet to see a single system that actually implements it. I can't believe they even bothered adding this driver to the kernel. Is there anyone in the world who has ever used it? If no motherboard vendor has bothered implementing the ACPI FAN specs, I very much doubt anyone will ever bother with this. Nice attempt at saving face, but wrong as _always_. The CONFIG_PATA_ACPI option has been in the kernel since 2008 and lots of people have used it. If you search for ACPI ide you'll find people complaining in 2008-2010 about windows error messages indicating the device is present in their system but no OS driver is available. That you have yet to see a single system that implements it is about the worst piece of internet research I've ever seen. Do you not _get_ that your opinion about what exists and how it works is not authoritative? You can also find articles about both windows and linux systems actively using ACPI fan control going back to 2009. These are not hard searches to pull off. These are not obscure references. Go to the google box and start typing ACPI fan... and check the autocomplete. I'll skip over all the parts where you don't know how a chipset works and blah, blah, blah... You really should have just stopped at I don't know and I've never because you keep demonstrating that you _don't_ know, and that you really _should_ _never_. Tell us more about the lizard aliens controlling your computer, I find your versions of reality fascinating...
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/20/2014 12:34 PM, Phillip Susi wrote: On 11/19/2014 5:33 PM, Robert White wrote: That would be fake raid, not hardware raid. The LSI MegaRaid controller people would _love_ to hear more about your insight into how their battery-backed multi-drive RAID controller is fake. You should go work for them. Try the contact us link at the bottom of this page. I'm sure they are waiting for your insight with bated breath! Forgive me, I should have trimmed the quote a bit more. I was responding specifically to the many motherboards have hardware RAID support available through the bios part, not the lsi part. Well you should have _actually_ trimmed your response down to not pressing send. _Many_ motherboards have complete RAID support at levels 0, 1, 10, and 5. A few have RAID6. Some of them even use the LSI chip-set. Seriously... are you trolling this list with disinformation or just repeating tribal knowledge from fifteen year old copies of PC Magazine? Yea, some of the motherboards that only had RAID1 and RAID0 (and indeed some of the add-on controllers) back in the IDE-only days were really lame just-forked-write devices with no integrity checks (hence fake raid) but that's from like the 1990s; it's paleolithic age wisdom at this point. Phillip say sky god angry, all go hide in cave! /D'oh...
Re: scrub implies failing drive - smartctl blissfully unaware
On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote: Hey, guys See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now. The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard fare for S.M.A.R.T. output? Here are (I think) the important bits of the smartctl output for $(smartctl -a /dev/sdb) (the full results are attached):

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  100   253   006    Pre-fail Always  -           0
  5 Reallocated_Sector_Ct   0x0033  100   100   036    Pre-fail Always  -           1
  7 Seek_Error_Rate         0x000f  086   060   030    Pre-fail Always  -           440801014
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x      100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x0032  100   253   000    Old_age  Always  -           0

You have one reallocated sector, so the drive has lost some data at some time in the last 49000(!) hours. Normally reallocations happen during writes so the data that was lost was data you were in the process of overwriting anyway; however, the reallocated sector count could also be a sign of deteriorating drive integrity. In /var/lib/smartmontools there might be a csv file with logged error attribute data that you could use to figure out whether that reallocation was recent. I also notice you are not running regular SMART self-tests (e.g. by smartctl -t long) and the last (and first, and only!) self-test the drive ran was ~12000 hours ago. That means most of your SMART data is about 18 months old. The drive won't know about sectors that went bad in the last year and a half unless the host happens to stumble across them during a read. The drive is over five years old in operating hours alone.
It is probably so fragile now that it will break if you try to move it.

Original Message
Subject: Cron root@watricky /usr/local/sbin/btrfs-scrub-all
Date: Tue, 18 Nov 2014 04:19:12 +0200
From: (Cron Daemon) root@watricky
To: brendan@watricky

WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
	scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
	total bytes scrubbed: 189.49GiB with 5420 errors
	error details: read=5 csum=5415
	corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164

That seems a little off. If there were 5 read errors, I'd expect the drive to have errors in the SMART error log. Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. There have been a number of fixes to csums in btrfs pulled into the kernel recently, and I've retired two five-year-old computers this summer due to RAM/CPU failures.

[snip]
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.17.2-1-ARCH] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3250410AS
Serial Number:    6RYF5NP7
Firmware Version: 4.AAA
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Tue Nov 18 09:16:03 2014 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 430) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command.
Re: scrub implies failing drive - smartctl blissfully unaware
On 2014/11/21 06:58, Zygo Blaxell wrote:

You have one reallocated sector, so the drive has lost some data at some time in the last 49000(!) hours. Normally reallocations happen during writes, so the data that was lost was data you were in the process of overwriting anyway; however, the reallocated sector count could also be a sign of deteriorating drive integrity. In /var/lib/smartmontools there might be a csv file with logged error attribute data that you could use to figure out whether that reallocation was recent. I also notice you are not running regular SMART self-tests (e.g. by smartctl -t long), and the last (and first, and only!) self-test the drive ran was ~12000 hours ago. That means most of your SMART data is about 18 months old. The drive won't know about sectors that went bad in the last year and a half unless the host happens to stumble across them during a read. The drive is over five years old in operating hours alone. It is probably so fragile now that it will break if you try to move it.

All interesting points. Do you schedule SMART self-tests on your own systems? I have smartd running. In theory it tracks changes and sends alerts if it figures a drive is going to fail. But, based on what you've indicated, that isn't good enough.

WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
        scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
        total bytes scrubbed: 189.49GiB with 5420 errors
        error details: read=5 csum=5415
        corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164

That seems a little off. If there were 5 read errors, I'd expect the drive to have errors in the SMART error log. Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. There have been a number of fixes to csums in btrfs pulled into the kernel recently, and I've retired two five-year-old computers this summer due to RAM/CPU failures.
The difference here is that the issue only affects the one drive. This leaves the probable cause at: - the drive itself - the cable/ports with a negligibly-possible cause at the motherboard chipset. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
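[Editorial note: the smartd setup discussed above can also schedule the regular self-tests Zygo recommends, not just watch attribute changes. A sketch of an /etc/smartd.conf entry; the device name and mail target are assumptions, and the -s argument uses smartd's T/MM/DD/d/HH schedule regexp:]

```
# /etc/smartd.conf -- sketch; device name is an assumption
# -a      : monitor all SMART attributes and log errors
# -o on   : enable the drive's automatic offline data collection
# -s ...  : short self-test daily at 02:00, long self-test Sundays at 03:00
# -m root : mail a warning to root when something trips
/dev/sdb -a -o on -s (S/../.././02|L/../../7/03) -m root
```

With this in place the attribute-history csv files in /var/lib/smartmontools stay current instead of going stale for 18 months.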
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/18/2014 9:40 PM, Chris Murphy wrote:

It's well known on linux-raid@ that consumer drives have well over 30 second deep recoveries when they lack SCT command support. The WDC and Seagate "green" drives are over 2 minutes apparently. This isn't easy to test because it requires a sector with enough error that it requires the ECC to do something, and yet not so much error that it gives up in less than 30 seconds. So you have to track down a drive model spec document (one of those 100 pagers). This makes sense, sorta, because the manufacturer use case is typically single drive only, and most proscribe raid5/6 with such products. So it's a "recover data at all costs" behavior because it's assumed to be the only (immediately) available copy.

It doesn't make sense to me. If it can't recover the data after one or two hundred retries in one or two seconds, it can keep trying until the cows come home and it just isn't ever going to work.

I don't see how that's possible, because unless the drive explicitly produces a read error (which includes the affected LBAs), it's ambiguous what the actual problem is as far as the kernel is concerned. It has no way of knowing which of possibly dozens of ata commands queued up in the drive have actually hung up the drive. It has no idea why the drive is hung up as well.

IIRC, this is true when the drive returns failure as well. The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out.

No, I think 30 is pretty sane for servers using SATA drives, because if the bus is reset all pending commands in the queue get obliterated, which is worse than just waiting up to 30 seconds. With SAS drives maybe less time makes sense. But in either case you still need configurable SCT ERC, or it needs to be a sane fixed default like 70 deciseconds.
Who cares if multiple commands in the queue are obliterated if they can all be retried on the other mirror? Better to fall back to the other mirror NOW instead of waiting 30 seconds ( or longer! ). Sure, you might end up recovering more than you really had to, but that won't hurt anything.
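[Editorial note: the timeout mismatch being argued over here can be checked and adjusted from userspace. A minimal sketch, assuming smartmontools and a SATA drive at /dev/sda; the commands are echoed rather than executed so nothing here touches hardware, drop the echo prefixes to apply them:]

```shell
#!/bin/sh
DEV=/dev/sda            # assumption: the drive under test
SYSDEV=${DEV#/dev/}     # sysfs name, e.g. "sda"

# Ask the drive to give up on an unreadable sector after 7 seconds
# (70 deciseconds) instead of a minutes-deep internal recovery:
echo smartctl -l scterc,70,70 "$DEV"

# And/or raise the kernel's SCSI command timer well above the drive's
# worst-case internal recovery time (the default is 30 seconds):
echo "echo 120 > /sys/block/$SYSDEV/device/timeout"
```

If the drive rejects `smartctl -l scterc` outright (many consumer drives do), raising the kernel command timer is the only remaining option.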
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/18/2014 9:46 PM, Duncan wrote:

I'm not sure about normal operation, but certainly, many drives take longer than 30 seconds to stabilize after power-on, and I routinely see resets during this time.

As far as I have seen, typical drive spin up time is on the order of 3-7 seconds. Hell, I remember my pair of first generation seagate cheetah 15,000 rpm drives seemed to take *forever* to spin up and that still was maybe only 15 seconds. If a drive takes longer than 30 seconds, then there is something wrong with it. I figure there is a reason why spin up time is tracked by SMART, so it seems like long spin up time is a sign of a sick drive.

This doesn't happen on single-hardware-device block devices and filesystems because in that case it's either up or down; if the device doesn't come up in time the resume simply fails entirely, instead of coming up with one or more devices there, but others missing as they didn't stabilize in time, as is unfortunately all too common in the multi-device scenario.

No, the resume doesn't fail entirely. The drive is reset, and the IO request is retried, and by then it should succeed.

I've seen this with both spinning rust and with SSDs, with mdraid and btrfs, with multiple mobos and device controllers, and with resume both from suspend to ram (if the machine powers down the storage devices in that case, as most modern ones do) and hibernate to permanent storage device, over several years' worth of kernel series, so it's a reasonably widespread phenomenon, at least among consumer-level SATA devices. (My experience doesn't extend to enterprise-raid-level devices or proper SCSI, etc, so I simply don't know, there.)

If you are restoring from hibernation, then the drives are already spun up before the kernel is loaded.

While two minutes is getting a bit long, I think it's still within normal range, and some devices definitely take over a minute enough of the time to be both noticeable and irritating.
It certainly is not normal for a drive to take that long to spin up. IIRC, the 30 second timeout comes from the ATA specs, which state that it can take up to 30 seconds for a drive to spin up.

That said, I SHOULD say I'd be far *MORE* irritated if the device simply pretended it was stable and started reading/writing data before it really had stabilized, particularly with SSDs where that sort of behavior has been observed and is known to put some devices at risk of complete scrambling of either media or firmware, beyond recovery at times. That of course is the risk of going the other direction, and I'd a WHOLE lot rather have devices play it safe for another 30 seconds or so after they /think/ they're stable and be SURE, than pretend to be just fine when voltages have NOT stabilized yet and thus end up scrambling things irrecoverably. I've never had that happen here though I've never stress-tested for it, only done normal operation, but I've seen testing reports where the testers DID make it happen surprisingly easily, to a surprising number of their test devices.

Power supply voltage is stable within milliseconds. What takes HDDs time to start up is mechanically bringing the spinning rust up to speed. On SSDs, I think you are confusing testing done on power *cycling* ( i.e. yanking the power cord in the middle of a write ) with startup.

So, umm... I suspect the 2-minute default is 2 minutes due to power-up stabilizing issues, where two minutes is a reasonable compromise between failing the boot most of the time if the timeout is too low, and taking excessively long for very little further gain.

The default is 30 seconds, not 2 minutes.

[I'm not] sure whether it's even possible, without some specific hardware feature available to tell the kernel that it has in fact NOT been in power-saving mode for say 5-10 minutes, hopefully long enough that voltage readings really /are/ fully stabilized and a shorter timeout is possible.
Again, there is no several minute period where voltage stabilizes and the drive takes longer to access. This is a complete red herring.
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/19/2014 08:07 AM, Phillip Susi wrote: On 11/18/2014 9:46 PM, Duncan wrote: I'm not sure about normal operation, but certainly, many drives take longer than 30 seconds to stabilize after power-on, and I routinely see resets during this time.

As far as I have seen, typical drive spin up time is on the order of 3-7 seconds. Hell, I remember my pair of first generation seagate cheetah 15,000 rpm drives seemed to take *forever* to spin up and that still was maybe only 15 seconds. If a drive takes longer than 30 seconds, then there is something wrong with it. I figure there is a reason why spin up time is tracked by SMART so it seems like long spin up time is a sign of a sick drive.

I was recently re-factoring Underdog (http://underdog.sourceforge.net) startup scripts to separate out the various startup domains (e.g. lvm, luks, mdadm) in the prototype init. So I notice you (Duncan) use the word stabilize, as do a small number of drivers in the linux kernel. This word has very little to do with disks per se. Between SCSI probing LUNs (where the controller tries every theoretical address and gives a potential device ample time to reply), and usb-storage having a simple timer delay set for each volume it sees, there is a lot of waiting in the name of safety going on in the linux kernel at device initialization.

When I added the messages scanning /dev/sd?? to the startup sequence as I iterate through the disks and partitions present, I discovered that the first time I called blkid (e.g. right between /dev/sda and /dev/sda1) I'd get a huge hit of many human seconds (I didn't time it, but I'd say eight or so) just for having a 2TB My Book WD 3.0 disk enclosure attached as /dev/sdc. That this enclosure had spun up in the previous boot cycle, and that this was only a soft reboot, was immaterial. In this case usb-storage is going to take its time and do its deal regardless of the state of the physical drive itself.
So there are _lots_ of places where you are going to get delays, and very few of them involve the disk itself going from power-off to ready. You said it yourself with respect to SSDs. It's cheaper, and less error prone, and less likely to generate customer returns if the generic controller chips just send init, wait a fixed delay, then request a status, compared to trying to are-you-there-yet poll each device like a nagging child. And you are going to see that at every level. And you are going to see it multiply with _sparsely_ provisioned buses where the cycle is going to be retried for absent LUNs (one disk on a Wide SCSI bus and a controller set to probe all LUNs is particularly egregious).

One of the reasons that the whole industry has started favoring point-to-point (SATA, SAS) or physical intercessor chaining point-to-point (eSATA) buses is to remove a lot of those wait-and-see delays.

That said, you should not see a drive (or target enclosure, or controller) reset during spin up. In a SCSI setting this is almost always a cabling, termination, or addressing issue. In IDE it's a jumper mismatch (master vs slave vs cable-select). Less often it's a partitioning issue (trying to access sectors beyond the end of the drive).

Another strong actor is selecting the wrong storage controller chipset driver. In that case you may be falling back from the high-end device you think it is, through an intermediate chip-set, and back to ACPI or BIOS emulation.

Another common cause is having a dedicated hardware RAID controller (dell likes to put LSI MegaRaid controllers in their boxes for example), many mother boards have hardware RAID support available through the bios, etc, leaving that feature active, then adding a drive and _not_ initializing that drive with the RAID controller disk setup.
In this case the controller is going to repeatedly probe the drive for its proprietary controller signature blocks (and reset the drive after each attempt) and then finally fall back to raw block pass-through. This can take a long time (thirty seconds to a minute). But seriously, if you are seeing reset anywhere in any storage chain during a normal power-on cycle then you've got a problem with geometry or configuration.
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/19/2014 4:05 PM, Robert White wrote:

It's cheaper, and less error prone, and less likely to generate customer returns if the generic controller chips just send init, wait a fixed delay, then request a status compared to trying to are-you-there-yet poll each device like a nagging child. And you are going to see that at every level. And you are going to see it multiply with _sparsely_ provisioned buses where the cycle is going to be retried for absent LUNs (one disk on a Wide SCSI bus and a controller set to probe all LUNs is particularly egregious)

No, they do not wait a fixed time, then proceed. They do in fact issue the command, then poll or wait for an interrupt to know when it is done, then time out and give up if that doesn't happen within a reasonable amount of time.

One of the reasons that the whole industry has started favoring point-to-point (SATA, SAS) or physical intercessor chaining point-to-point (eSATA) buses is to remove a lot of those wait-and-see delays.

Nope... even with the ancient PIO mode PATA interface, you polled a ready bit in the status register to see if it was done yet. If you always waited 30 seconds for every command your system wouldn't boot up until next year.

Another strong actor is selecting the wrong storage controller chipset driver. In that case you may be falling back from the high-end device you think it is, through intermediate chip-set, and back to ACPI or BIOS emulation

There is no such thing as ACPI or BIOS emulation. AHCI SATA controllers do usually have an old IDE emulation mode instead of AHCI mode, but this isn't going to cause ridiculously long delays.

Another common cause is having a dedicated hardware RAID controller (dell likes to put LSI MegaRaid controllers in their boxes for example), many mother boards have hardware RAID support available through the bios, etc, leaving that feature active, then adding a drive and

That would be fake raid, not hardware raid.
_not_ initializing that drive with the RAID controller disk setup. In this case the controller is going to repeatedly probe the drive for its proprietary controller signature blocks (and reset the drive after each attempt) and then finally fall back to raw block pass-through. This can take a long time (thirty seconds to a minute). No, no, and no. If it reads the drive and does not find its metadata, it falls back to pass through. The actual read takes only milliseconds, though it may have to wait a few seconds for the drive to spin up. There is no reason it would keep retrying after a successful read. The way you end up with 30-60 second startup time with a raid is if you have several drives and staggered spinup mode enabled, then each drive is started one at a time instead of all at once so their cumulative startup time can add up fairly high.
Re: scrub implies failing drive - smartctl blissfully unaware
Shame you already know everything? On 11/19/2014 01:47 PM, Phillip Susi wrote: On 11/19/2014 4:05 PM, Robert White wrote: One of the reasons that the whole industry has started favoring point-to-point (SATA, SAS) or physical intercessor chaining point-to-point (eSATA) buses is to remove a lot of those wait-and-see delays. Nope... even with the ancient PIO mode PATA interface, you polled a ready bit in the status register to see if it was done yet. If you always waited 30 seconds for every command your system wouldn't boot up until next year. The controller, the thing that sets the ready bit and sends the interrupt is distinct from the driver, the thing that polls the ready bit when the interrupt is sent. At the bus level there are fixed delays and retries. Try putting two drives on a pin-select IDE bus and strapping them both as _slave_ (or indeed master) sometime and watch the shower of fixed delay retries. Another strong actor is selecting the wrong storage controller chipset driver. In that case you may be faling back from high-end device you think it is, through intermediate chip-set, and back to ACPI or BIOS emulation There is no such thing as ACPI or BIOS emulation. That's odd... my bios reads from storage to boot the device and it does so using the ACPI storage methods. ACPI 4.0 Specification Section 9.8 even disagrees with you at some length. Let's just do the titles shall we: 9.8 ATA Controller Devices 9.8.1 Objects for both ATA and SATA Controllers. 9.8.2 IDE Controller Device 9.8.3 Serial ATA (SATA) controller Device Oh, and _lookie_ _here_ in Linux Kernel Menuconfig at Device Drivers - * Serial ATA and Parallel ATA drivers (libata) - * ACPI firmware driver for PATA CONFIG_PATA_ACPI: This option enables an ACPI method driver which drives motherboard PATA controller interfaces through the ACPI firmware in the BIOS. This driver can sometimes handle otherwise unsupported hardware. You are a storage _genius_ for knowing that all that stuff doesn't exist... 
the rest of us must simply muddle along in our delusion... AHCI SATA controllers do usually have an old IDE emulation mode instead of AHCI mode, but this isn't going to cause ridiculously long delays. Do tell us more... I didn't say the driver would cause long delays, I said that the time it takes to error out other improperly supported drivers and fall back to this one could induce long delays and resets. I think I am done with your expertise in the question of all things storage related. Not to be rude... but I'm physically ill and maybe I shouldn't be posting right now... 8-) -- Rob.
Re: scrub implies failing drive - smartctl blissfully unaware
P.S. On 11/19/2014 01:47 PM, Phillip Susi wrote: Another common cause is having a dedicated hardware RAID controller (dell likes to put LSI MegaRaid controllers in their boxes for example), many mother boards have hardware RAID support available through the bios, etc, leaving that feature active, then adding a drive and That would be fake raid, not hardware raid. The LSI MegaRaid controller people would _love_ to hear more about your insight into how their battery-backed multi-drive RAID controller is fake. You should go work for them. Try the contact us link at the bottom of this page. I'm sure they are waiting for your insight with bated breath! http://www.lsi.com/products/raid-controllers/pages/megaraid-sas-9260-8i.aspx _not_ initializing that drive with the RAID controller disk setup. In this case the controller is going to repeatedly probe the drive for its proprietary controller signature blocks (and reset the drive after each attempt) and then finally fall back to raw block pass-through. This can take a long time (thirty seconds to a minute). No, no, and no. If it reads the drive and does not find its metadata, it falls back to pass through. The actual read takes only milliseconds, though it may have to wait a few seconds for the drive to spin up. There is no reason it would keep retrying after a successful read. Odd, my MegaRaid controller takes about fifteen seconds by-the-clock to initialize and do the integrity check on my single initialized drive. It's amazing that with a fail and retry it would be _faster_... It's like you know _everything_...
Re: scrub implies failing drive - smartctl blissfully unaware
Phillip Susi posted on Wed, 19 Nov 2014 11:07:43 -0500 as excerpted: On 11/18/2014 9:46 PM, Duncan wrote: I'm not sure about normal operation, but certainly, many drives take longer than 30 seconds to stabilize after power-on, and I routinely see resets during this time. As far as I have seen, typical drive spin up time is on the order of 3-7 seconds. Hell, I remember my pair of first generation seagate cheetah 15,000 rpm drives seemed to take *forever* to spin up and that still was maybe only 15 seconds. If a drive takes longer than 30 seconds, then there is something wrong with it. I figure there is a reason why spin up time is tracked by SMART so it seems like long spin up time is a sign of a sick drive.

It's not physical spinup, but electronic device-ready. It happens on SSDs too, and they don't have anything to spin up. But, for instance on my old seagate 300-gigs that I used to have in 4-way mdraid, when I tried to resume from hibernate the drives would be spun up and talking to the kernel, but for some seconds to a couple minutes or so after spinup, they'd sometimes return something like (example) Seagrte3x0 instead of Seagate300. Of course that wasn't the exact string, I think it was the model number or perhaps the serial number or something, but looking at dmesg I could see the ATA layer up for each of the four devices, the connection establish and seem to be returning good data, then the mdraid layer would try to assemble and would kick out a drive or two due to the device string mismatch compared to what was there before the hibernate. With the string mismatch, from its perspective the device had disappeared and been replaced with something else. But if I held it at the grub prompt for a couple minutes and /then/ let it go, or part of the time on its own, all four drives would match and it'd work fine.
For just short hibernates (as when testing hibernate/resume), it'd come back just fine, as it would nearly all the time out to two hours or so. Beyond that, out to 10 or 12 hours, the longer it sat the more likely it would be to fail, if I didn't hold it at the grub prompt for a few minutes to let it stabilize. And now I've seen similar behavior resuming from suspend (the old hardware wouldn't resume from suspend to ram, only hibernate; the new hardware resumes from suspend to ram just fine, but I had trouble getting it to resume from hibernate back when I first set it up and tried it; I've not tried hibernate since, and didn't even set up swap to hibernate to when I got the SSDs, so I've not tried it for a couple years) on SSDs with btrfs raid. Btrfs isn't as informative as was mdraid on why it kicks a device, but dmesg says both devices are up, while btrfs is suddenly spitting errors on one device. A reboot later and both devices are back in the btrfs and I can do a scrub to resync, which generally finds and fixes errors on the btrfs that were writable (/home and /var/log), but of course not on the btrfs mounted as root, since it's read-only by default. Same pattern. Immediate suspend and resume is fine. Out to about 6 hours it tends to be fine as well. But at 8-10 hours in suspend, btrfs starts spitting errors often enough that I generally quit trying to suspend at all; I simply shut down now. (With SSDs and systemd, shutdown and restart is fast enough, and the delay from having to refill cache low enough, that the time difference between suspend and full shutdown is hardly worth troubling with anyway, certainly not when there's a risk to data due to failure to properly resume.) But it worked fine when I had only a single device to bring back up. Nothing to be slower than another device to respond and thus to be kicked out as dead.
I finally realized what was happening after I read a study paper mentioning capacitor charge time and solid-state stability time, and how a lot of cheap devices say they're ready before the electronics have actually properly stabilized. On SSDs, this is a MUCH worse issue than it is on spinning rust, because the logical layout isn't practically forced to serial like it is on spinning rust, and the firmware can get so jumbled it pretty much scrambles the device. And it's not just the normal storage either. In the study, many devices corrupted their own firmware as well! Now that was definitely a worst-case study in that they were deliberately yanking and/or fast-switching the power, not just doing time-on waits, but still, a surprisingly high proportion of SSDs not only scrambled the storage, but scrambled their firmware as well. (On those devices the firmware may well have been on the same media as the storage, with the firmware simply read in first in a hardware bootstrap mode, and the firmware programmed to avoid that area in normal operation, thus making it as easily corrupted as the normal storage.) The paper specifically
Re: scrub implies failing drive - smartctl blissfully unaware
On Wed, Nov 19, 2014 at 8:11 AM, Phillip Susi ps...@ubuntu.com wrote: On 11/18/2014 9:40 PM, Chris Murphy wrote: It’s well known on linux-raid@ that consumer drives have well over 30 second deep recoveries when they lack SCT command support. The WDC and Seagate “green” drives are over 2 minutes apparently. This isn’t easy to test because it requires a sector with enough error that it requires the ECC to do something, and yet not so much error that it gives up in less than 30 seconds. So you have to track down a drive model spec document (one of those 100 pagers). This makes sense, sorta, because the manufacturer use case is typically single drive only, and most proscribe raid5/6 with such products. So it’s a “recover data at all costs” behavior because it’s assumed to be the only (immediately) available copy. It doesn't make sense to me. If it can't recover the data after one or two hundred retries in one or two seconds, it can keep trying until the cows come home and it just isn't ever going to work. I'm not a hard drive engineer, so I can't argue either point. But consumer drives clearly do behave this way. On Linux, the kernel's default 30 second command timer eventually results in what look like link errors rather than drive read errors. And instead of the problems being fixed with the normal md and btrfs recovery mechanisms, the errors simply get worse and eventually there's data loss. Exhibits A, B, C, D - the linux-raid list is full to the brim of such reports and their solution. I don’t see how that’s possible because anything other than the drive explicitly producing a read error (which includes the affected LBA’s), it’s ambiguous what the actual problem is as far as the kernel is concerned. It has no way of knowing which of possibly dozens of ata commands queued up in the drive have actually hung up the drive. It has no idea why the drive is hung up as well. IIRC, this is true when the drive returns failure as well.
The whole bio is marked as failed, and the page cache layer then begins retrying with progressively smaller requests to see if it can get *some* data out. Well that's very coarse. It's not at a sector level, so as long as the drive continues to try to read from a particular LBA, but fails to either succeed reading or give up and report a read error, within 30 seconds, then you just get a bunch of wonky system behavior. Conversely what I've observed on Windows in such a case, is it tolerates these deep recoveries on consumer drives. So they just get really slow but the drive does seem to eventually recover (until it doesn't). But yeah 2 minutes is a long time. So then the user gets annoyed and reinstalls their system. Since that means writing to the affected drive, the firmware logic causes bad sectors to be dereferenced when the write error is persistent. Problem solved, faster system. No I think 30 is pretty sane for servers using SATA drives because if the bus is reset all pending commands in the queue get obliterated which is worse than just waiting up to 30 seconds. With SAS drives maybe less time makes sense. But in either case you still need configurable SCT ERC, or it needs to be a sane fixed default like 70 deciseconds. Who cares if multiple commands in the queue are obliterated if they can all be retried on the other mirror? Because now you have a member drive that's inconsistent. At least in the md raid case, a certain number of read failures causes the drive to be ejected from the array. Anytime there's a write failure, it's ejected from the array too. What you want is for the drive to give up sooner with an explicit read error, so md can help fix the problem by writing good data to the affected LBA. That doesn't happen when there are a bunch of link resets happening. Better to fall back to the other mirror NOW instead of waiting 30 seconds ( or longer! ). Sure, you might end up recovering more than you really had to, but that won't hurt anything.
Again, if your drive SCT ERC is configurable, and set to something sane like 70 deciseconds, that read failure happens at MOST 7 seconds after the read attempt. And md is notified of *exactly* what sectors are affected, it immediately goes to mirror data, or rebuilds it from parity, and then writes the correct data to the previously reported bad sectors. And that will fix the problem. So really, if you're going to play the multiple device game, you need drive error timing to be shorter than the kernel's. Chris Murphy -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
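For concreteness, the tuning described above looks roughly like this on a typical Linux system (a sketch; /dev/sda and the 180-second fallback value are example placeholders, and not every drive accepts the SCT command):

```shell
# Set the drive's error recovery limit to 70 deciseconds (7 seconds)
# for both reads and writes; this only works if the drive supports SCT ERC:
smartctl -l scterc,70,70 /dev/sda

# The kernel's SCSI command timer must stay longer than the drive's
# recovery limit; it defaults to 30 seconds:
cat /sys/block/sda/device/timeout

# If the drive does NOT support SCT ERC, raise the kernel timer above
# the drive's worst-case internal recovery time instead (example: 180 s):
echo 180 > /sys/block/sda/device/timeout
```

Both values revert (the ERC setting typically on drive power cycle, the timer on reboot), so they need to be reapplied at boot.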
Re: scrub implies failing drive - smartctl blissfully unaware
Robert White posted on Wed, 19 Nov 2014 13:05:13 -0800 as excerpted: One of the reasons that the whole industry has started favoring point-to-point (SATA, SAS) or physical intercessor chaining point-to-point (eSATA) buses is to remove a lot of those wait-and-see delays. That said, you should not see a drive (or target enclosure, or controller) reset during spin up. In a SCSI setting this is almost always a cabling, termination, or addressing issue. In IDE it's jumper mismatch (master vs slave vs cable-select). Less often it's a partitioning issue (trying to access sectors beyond the end of the drive). Another strong actor is selecting the wrong storage controller chipset driver. In that case you may be falling back from the high-end device you think it is, through intermediate chip-set, and back to ACPI or BIOS emulation. FWIW I run a custom-built monolithic kernel, with only the specific drivers (SATA/AHCI in this case) builtin. There are no drivers for anything else it could fall back to. Once in a while I do see it try at say 6-gig speeds, then eventually fall back to 3 and ultimately 1.5, but that /is/ indicative of other issues when I see it. And like I said, there's no other drivers to fall back to, so obviously I never see it doing that. Another common cause is having a dedicated hardware RAID controller (dell likes to put LSI MegaRaid controllers in their boxes for example), many motherboards have hardware RAID support available through the bios, etc, leaving that feature active, then adding a drive and _not_ initializing that drive with the RAID controller disk setup. In this case the controller is going to repeatedly probe the drive for its proprietary controller signature blocks (and reset the drive after each attempt) and then finally fall back to raw block pass-through. This can take a long time (thirty seconds to a minute). Everything's set JBOD here. I don't trust those proprietary firmware raid things. Besides, that kills portability.
JBOD SATA and AHCI are sufficiently standardized that should the hardware die, I can switch out to something else and not have to worry about rebuilding the custom kernel with the new drivers. Some proprietary firmware raid, requiring dmraid at the software kernel level to support, when I can just as easily use full software mdraid on standardized JBOD, no thanks! And be sure, that's one of the first things I check when I set up a new box: any so-called hardware raid that's actually firmware/software raid, disabled; JBOD mode, enabled. But seriously, if you are seeing reset anywhere in any storage chain during a normal power-on cycle then you've got a problem with geometry or configuration. IIRC I don't get it routinely. But I've seen it a few times, attributing it as I said to the 30-second SATA level timeout not being long enough. Most often, however, it's at resume, not original startup, which is understandable as state at resume doesn't match state at suspend/hibernate. The irritating thing, as previously discussed, is when one device takes long enough to come back that mdraid or btrfs drops it out, generally forcing the reboot I was trying to avoid with the suspend/hibernate in the first place, along with a re-add and resync (for mdraid) or a scrub (for btrfs raid). -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/19/2014 04:25 PM, Duncan wrote: Most often, however, it's at resume, not original startup, which is understandable as state at resume doesn't match state at suspend/hibernate. The irritating thing, as previously discussed, is when one device takes long enough to come back that mdraid or btrfs drops it out, generally forcing the reboot I was trying to avoid with the suspend/hibernate in the first place, along with a re-add and resync (for mdraid) or a scrub (for btrfs raid). If you want a practical solution you might want to look at http://underdog.sourceforge.net (my project, shameless plug). The actual user context return isn't in there but I use the project to build initramfs images into all my kernels. [DISCLAIMER: The cryptsetup and LUKS stuff is rock solid but the mdadm incremental build stuff is very rough and only lightly tested] You could easily add a drive preheat code block (spin up and status check all drives with pause and repeat function) as a preamble function that could/would safely take place before any glance is made towards the resume stage. Extemporaneous example:

--- snip ---
cat <<'EOT' > /opt/underdog/utility/preheat.mod
#!/bin/bash
# ROOT_COMMANDS+=( commands your preheat needs )
UNDERDOG+=( init.d/preheat )
EOT

cat <<'EOT' > /opt/underdog/prototype/init.d/preheat
#!/bin/bash
function __preamble_preheat() {
    # whatever logic you need
    return 0
}
__preamble_funcs+=( [preheat]=__preamble_preheat )
EOT
--- snip ---

Install underdog, paste the above into a shell once, then edit /opt/underdog/prototype/init.d/preheat to put whatever logic in you need. Follow the instructions in /opt/underdog/README.txt for making the initramfs image or, as I do, build the initramfs into the kernel image. The preamble will be run in the resultant /init script before the swap partitions are submitted for attempted resume.
(The system does support complexity like resuming from a swap partition inside an LVM/LV built over a LUKS encrypted media expanse, or just a plain laptop with one plain partitioned disk, with zero changes to the necessary default config.) -- Rob.
Re: scrub implies failing drive - smartctl blissfully unaware
On 2014-11-18 02:29, Brendan Hide wrote: Hey, guys See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now. The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard fare for S.M.A.R.T. output? Here are (I think) the important bits of the smartctl output for $(smartctl -a /dev/sdb) (the full results are attached):

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 100   253   006    Pre-fail Always  -           0
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           1
  7 Seek_Error_Rate         0x000f 086   060   030    Pre-fail Always  -           440801014
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x     100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x0032 100   253   000    Old_age  Always  -           0

Original Message Subject: Cron root@watricky /usr/local/sbin/btrfs-scrub-all Date: Tue, 18 Nov 2014 04:19:12 +0200 From: (Cron Daemon) root@watricky To: brendan@watricky

WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
total bytes scrubbed: 189.49GiB with 5420 errors
error details: read=5 csum=5415
corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
[snip]

In addition to the storage controller being a possibility as mentioned in another reply, there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write-caching, which may improve things (and if it's a cheap disk, may provide better reliability for BTRFS as well).
The other thing I would suggest trying is a different data cable to the drive itself; I've had issues with some SATA cables (the cheap red ones you get in the retail packaging for some hard disks in particular) having either bad connectors, or bad strain-reliefs, and failing after only a few hundred hours of use.
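As a point of reference, the write-cache toggle mentioned above is exposed through hdparm on ATA drives (a sketch, using /dev/sdb as in this thread; note the setting usually reverts on power cycle, so it needs a boot script or udev rule to persist):

```shell
# Show the current write-cache state:
hdparm -W /dev/sdb

# Disable write caching (re-enable later with -W1 if it makes no
# difference for the csum errors):
hdparm -W0 /dev/sdb
```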
Re: scrub implies failing drive - smartctl blissfully unaware
On 2014/11/18 09:36, Roman Mamedov wrote: On Tue, 18 Nov 2014 09:29:54 +0200 Brendan Hide bren...@swiftspirit.co.za wrote: Hey, guys See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now. The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard fare for S.M.A.R.T. output? Not necessarily the disk's fault, could be a SATA controller issue. How are your disks connected, which controller brand and chip? Add lspci output, at least if something other than the ordinary to the motherboard chipset's built-in ports. In this case, yup, it's directly to the motherboard chipset's built-in ports. This is a very old desktop, and the other 3 disks don't have any issues. I'm checking out the alternative pointed out by Austin. SATA-relevant lspci output: 00:1f.2 SATA controller: Intel Corporation 82801JD/DO (ICH10 Family) SATA AHCI Controller (rev 02) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
Re: scrub implies failing drive - smartctl blissfully unaware
On 2014/11/18 14:08, Austin S Hemmelgarn wrote: [snip] there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write-caching. It's an old and replaceable disk - but if the cable replacement doesn't work I'll try this for kicks. :) The other thing I would suggest trying is a different data cable to the drive itself, I've had issues with some SATA cables (the cheap red ones you get in the retail packaging for some hard disks in particular) having either bad connectors, or bad strain-reliefs, and failing after only a few hundred hours of use. Thanks. I'll try this first. :)
Re: scrub implies failing drive - smartctl blissfully unaware
Brendan Hide posted on Tue, 18 Nov 2014 15:24:48 +0200 as excerpted: In this case, yup, it's directly to the motherboard chipset's built-in ports. This is a very old desktop, and the other 3 disks don't have any issues. I'm checking out the alternative pointed out by Austin. SATA-relevant lspci output: 00:1f.2 SATA controller: Intel Corporation 82801JD/DO (ICH10 Family) SATA AHCI Controller (rev 02) I guess your definition of a _very_ old desktop and mine are _very_ different.

* A quick check of wikipedia says the ICH10 wasn't even /introduced/ until 2008 (the wiki link for the 82801jo/do points to an Intel page, which says it was launched Q3-2008), and it would have been some time after that, likely 2009, that you actually purchased the machine. 2009 is five years ago, middle-aged yes, arguably old, but _very_ old, not so much in this day and age of longer system replace cycles.
* It has SATA, not IDE/PATA.
* It was PCIE 1.1, not PCI-X or PCI and AGP, and DEFINITELY not ISA bus, with or without VLB!
* It has USB 2.0 ports, not USB 1.1, and not only serial/parallel/ps2, and DEFINITELY not an AT keyboard.
* It has Gigabit Ethernet, not simply Fast Ethernet or just Ethernet, and DEFINITELY Ethernet not token-ring.
* It already has Intel Virtualization technology and HD audio instead of AC97 or earlier.

Now I can certainly imagine an old desktop having most of these, but you said _very_ old, not simply old, and _very_ old to me would mean PATA/USB-1/AGP/PCI/FastEthernet with AC97 audio or earlier and no virtualization. 64-bit would be questionable as well. FWIW, I've been playing minitube/youtube C64 music the last few days. Martin Galway, etc. Now C64 really _IS_ _very_ old! Also FWIW, only a couple years ago now (well, about three, time flies!), my old 2003 vintage original 3-digit Opteron based mobo died due to bulging/burst capacitors, after serving me 8 years. I was shooting for a full decade but didn't quite make it...
So indeed, 2009 vintage system, five years, definitely not _very_ old, arguably not even old, more like middle-aged. =:^)
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/18/2014 7:08 AM, Austin S Hemmelgarn wrote: In addition to the storage controller being a possibility as mentioned in another reply, there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write-caching, which may improve things (and if it's a cheap disk, may provide better reliability for BTRFS as well). The other thing I would suggest trying is a different data cable to the drive itself, I've had issues with some SATA cables (the cheap red ones you get in the retail packaging for some hard disks in particular) having either bad connectors, or bad strain-reliefs, and failing after only a few hundred hours of use. SATA does CRC the data going across it so if it is a bad cable, you get CRC, or often times 8b10b coding errors and the transfer is aborted rather than returning bad data.
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/18/2014 10:35 AM, Marc MERLIN wrote: Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any: http://hdrecover.sourceforge.net/ He doesn't have blocks that are failing; he has blocks that are being silently corrupted.
Re: scrub implies failing drive - smartctl blissfully unaware
On Tue, Nov 18, 2014 at 11:04:00AM -0500, Phillip Susi wrote: On 11/18/2014 10:35 AM, Marc MERLIN wrote: Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any: http://hdrecover.sourceforge.net/ He doesn't have blocks that are failing; he has blocks that are being silently corrupted. That seems to be the case, but hdrecover will rule that part out at least. Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/18/2014 11:11 AM, Marc MERLIN wrote: That seems to be the case, but hdrecover will rule that part out at least. It's already ruled out: if the read failed that is what the error message would have said rather than a bad checksum.
Re: scrub implies failing drive - smartctl blissfully unaware
On Nov 18, 2014, at 8:35 AM, Marc MERLIN m...@merlins.org wrote: On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote: Hey, guys See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now. The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard fare for S.M.A.R.T. output? Try running hdrecover on your drive, it'll scan all your blocks and try to rewrite the ones that are failing, if any: http://hdrecover.sourceforge.net/ The only way it can know if there is a bad sector is if the drive returns a read error, which will include the LBA for the affected sector(s). This is the same thing that would be done with scrub, except that scrub won't touch bad sectors that don't contain data. A common problem getting a drive to issue the read error, however, is a mismatch between the scsi command timer setting (default 30 seconds) and the SCT error recovery control setting for the drive. The drive SCT ERC value needs to be shorter than the scsi command timer value, otherwise some bad sector errors will cause the drive to go into a longer recovery attempt beyond the scsi command timer value. If that happens, the ata link is reset, and there’s no possibility of finding out what the affected sector is. So a.) use smartctl -l scterc to change the value below 30 seconds (300 deciseconds) with 70 deciseconds being reasonable. If the drive doesn’t support SCT commands, then b.) change the linux scsi command timer to be greater than 120 seconds. Strictly speaking the command timer would be set to a value that ensures there are no link reset messages in dmesg, that it’s long enough that the drive itself times out and actually reports a read error. This could be much shorter than 120 seconds. I don’t know if there are any consumer drives that try longer than 2 minutes to recover data from a marginally bad sector.
Ideally though, don’t use drives that lack SCT support in multiple device volume configurations. An up to 2 minute hang of the storage stack isn’t production compatible for most workflows. Chris Murphy
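Since both settings revert (SCT ERC typically on drive power cycle, the command timer on reboot), a udev rule is a common way to reapply the a/b logic above on every boot. A rough, untested sketch; the rule filename and the 180-second fallback are example values:

```shell
# /etc/udev/rules.d/60-scterc.rules (illustrative):
# For each whole SATA disk, try to set 7-second ERC; if the drive
# rejects the SCT command, raise the kernel command timer instead.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
  RUN+="/bin/sh -c 'smartctl -l scterc,70,70 /dev/%k || echo 180 > /sys/block/%k/device/timeout'"
```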
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/18/2014 1:57 PM, Chris Murphy wrote: So a.) use smartctl -l scterc to change the value below 30 seconds (300 deciseconds) with 70 deciseconds being reasonable. If the drive doesn’t support SCT commands, then b.) change the linux scsi command timer to be greater than 120 seconds. Strictly speaking the command timer would be set to a value that ensures there are no link reset messages in dmesg, that it’s long enough that the drive itself times out and actually reports a read error. This could be much shorter than 120 seconds. I don’t know if there are any consumer drives that try longer than 2 minutes to recover data from a marginally bad sector. Are there really any that take longer than 30 seconds? That's enough time for thousands of retries. If it can't be read after a dozen tries, it ain't never gonna work. It seems absurd that a drive would keep trying for so long. Ideally though, don’t use drives that lack SCT support in multiple device volume configurations. An up to 2 minute hang of the storage stack isn’t production compatible for most workflows. Wasn't there an early failure flag that md ( and therefore, btrfs when doing raid ) sets so the scsi stack doesn't bother with recovery attempts and just fails the request? Thus if the drive takes longer than the scsi_timeout, the failure would be reported to btrfs, which then can recover using the other copy, write it back to the bad drive, and hopefully that fixes it? In that case, you probably want to lower the timeout so that the recovery kicks in sooner instead of hanging your IO stack for 30 seconds.
Re: scrub implies failing drive - smartctl blissfully unaware
On Nov 18, 2014, at 1:58 PM, Phillip Susi ps...@ubuntu.com wrote: On 11/18/2014 1:57 PM, Chris Murphy wrote: So a.) use smartctl -l scterc to change the value below 30 seconds (300 deciseconds) with 70 deciseconds being reasonable. If the drive doesn’t support SCT commands, then b.) change the linux scsi command timer to be greater than 120 seconds. Strictly speaking the command timer would be set to a value that ensures there are no link reset messages in dmesg, that it’s long enough that the drive itself times out and actually reports a read error. This could be much shorter than 120 seconds. I don’t know if there are any consumer drives that try longer than 2 minutes to recover data from a marginally bad sector. Are there really any that take longer than 30 seconds? That's enough time for thousands of retries. If it can't be read after a dozen tries, it ain't never gonna work. It seems absurd that a drive would keep trying for so long. It’s well known on linux-raid@ that consumer drives have well over 30 second deep recoveries when they lack SCT command support. The WDC and Seagate “green” drives are over 2 minutes apparently. This isn’t easy to test because it requires a sector with enough error that it requires the ECC to do something, and yet not so much error that it gives up in less than 30 seconds. So you have to track down a drive model spec document (one of those 100 pagers). This makes sense, sorta, because the manufacturer use case is typically single drive only, and most proscribe raid5/6 with such products. So it’s a “recover data at all costs” behavior because it’s assumed to be the only (immediately) available copy. Ideally though, don’t use drives that lack SCT support in multiple device volume configurations. An up to 2 minute hang of the storage stack isn’t production compatible for most workflows.
Wasn't there an early failure flag that md ( and therefore, btrfs when doing raid ) sets so the scsi stack doesn't bother with recovery attempts and just fails the request? Thus if the drive takes longer than the scsi_timeout, the failure would be reported to btrfs, which then can recover using the other copy, write it back to the bad drive, and hopefully that fixes it? I don’t see how that’s possible because anything other than the drive explicitly producing a read error (which includes the affected LBAs), it’s ambiguous what the actual problem is as far as the kernel is concerned. It has no way of knowing which of possibly dozens of ata commands queued up in the drive have actually hung up the drive. It has no idea why the drive is hung up as well. The linux-raid@ list is chock full of users having these kinds of problems. It comes up pretty much every week. Someone has an e.g. raid5, and in dmesg all they get are a bunch of ata bus reset messages. So someone tells them to change the scsi command timer for all the block devices that are members of the array in question, and retry (reading file, or scrub or whatever) and lo and behold no more ata bus reset messages. Instead they get explicit read errors with LBAs and now md can fix the problem. In that case, you probably want to lower the timeout so that the recovery kicks in sooner instead of hanging your IO stack for 30 seconds. No I think 30 is pretty sane for servers using SATA drives because if the bus is reset all pending commands in the queue get obliterated which is worse than just waiting up to 30 seconds. With SAS drives maybe less time makes sense. But in either case you still need configurable SCT ERC, or it needs to be a sane fixed default like 70 deciseconds. Chris Murphy
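Applying the timer change to every member of an md array can be scripted; a sketch, assuming the array is /dev/md0 and 180 seconds is an acceptable upper bound (the slaves/ directory listing member devices is standard sysfs):

```shell
# Raise the SCSI command timer to 180 s for every member of /dev/md0:
for link in /sys/block/md0/slaves/*; do
    member=$(basename "$link")
    echo 180 > "/sys/block/$member/device/timeout"
done
```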
Re: scrub implies failing drive - smartctl blissfully unaware
Phillip Susi posted on Tue, 18 Nov 2014 15:58:18 -0500 as excerpted: Are there really any that take longer than 30 seconds? That's enough time for thousands of retries. If it can't be read after a dozen tries, it ain't never gonna work. It seems absurd that a drive would keep trying for so long. I'm not sure about normal operation, but certainly, many drives take longer than 30 seconds to stabilize after power-on, and I routinely see resets during this time. In fact, as I recently posted, power-up stabilization time can and often does kill reliable multi-drive device or filesystem (my experience is with mdraid and btrfs raid) resume from suspend to RAM or hibernate to disk, either one or both, because it's often enough the case that one device or another will take enough longer to stabilize than the other, that it'll be failed out of the raid. This doesn't happen on single-hardware-device block devices and filesystems because in that case it's either up or down; if the device doesn't come up in time the resume simply fails entirely, instead of coming up with one or more devices there, but others missing as they didn't stabilize in time, as is unfortunately all too common in the multi-device scenario. I've seen this with both spinning rust and with SSDs, with mdraid and btrfs, with multiple mobos and device controllers, and with resume both from suspend to ram (if the machine powers down the storage devices in that case, as most modern ones do) and hibernate to permanent storage device, over several years' worth of kernel series, so it's a reasonably widespread phenomenon, at least among consumer-level SATA devices. (My experience doesn't extend to enterprise-raid-level devices or proper SCSI, etc, so I simply don't know, there.) While two minutes is getting a bit long, I think it's still within normal range, and some devices definitely take over a minute enough of the time to be both noticeable and irritating.
That said, I SHOULD say I'd be far *MORE* irritated if the device simply pretended it was stable and started reading/writing data before it really had stabilized, particularly with SSDs where that sort of behavior has been observed and is known to put some devices at risk of complete scrambling of either media or firmware, beyond recovery at times. That of course is the risk of going the other direction, and I'd a WHOLE lot rather have devices play it safe for another 30 seconds or so after they /think/ they're stable and be SURE, than pretend to be just fine when voltages have NOT stabilized yet and thus end up scrambling things irrecoverably. I've never had that happen here tho I've never stress-tested for it, only done normal operation, but I've seen testing reports where the testers DID make it happen surprisingly easily, to a surprising number of their test devices. So, umm... I suspect the 2-minute default is 2 minutes due to power-up stabilizing issues, where two minutes is a reasonable compromise between failing the boot most of the time if the timeout is too low, and taking excessively long for very little further gain. And in my experience, the only way around that, at the consumer level at least, would be to split the timeouts, perhaps setting something even higher, 2.5-3 minutes on power-on, while lowering the operational timeout to something more sane for operation, probably 30 seconds or so by default, but easily tunable down to 10-20 seconds (or even lower, 5 seconds, even for consumer level devices?) for those who had hardware that fit within that tolerance and wanted the performance.
But at least to my knowledge, there's no such split in reset timeout values available (maybe for SCSI?), and due to auto-spindown and power-saving, I'm not sure whether it's even possible, without some specific hardware feature available to tell the kernel that it has in fact NOT been in power-saving mode for say 5-10 minutes, hopefully long enough that voltage readings really /are/ fully stabilized and a shorter timeout is possible.
Re: scrub implies failing drive - smartctl blissfully unaware
On Tue, 18 Nov 2014 09:29:54 +0200 Brendan Hide bren...@swiftspirit.co.za wrote: Hey, guys See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now. The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard fare for S.M.A.R.T. output? Not necessarily the disk's fault, could be a SATA controller issue. How are your disks connected, which controller brand and chip? Add lspci output, at least if something other than the ordinary to the motherboard chipset's built-in ports. -- With respect, Roman