Re: scrub implies failing drive - smartctl blissfully unaware

2014-12-01 Thread Phillip Susi

On 11/25/2014 6:13 PM, Chris Murphy wrote:
 The drive will only issue a read error when its ECC absolutely
 cannot recover the data, hard fail.
 
 A few years ago companies including Western Digital started
 shipping large cheap drives, think of the green drives. These had
 very high TLER (Time Limited Error Recovery) settings, a.k.a. SCT
 ERC. Later they completely took out the ability to configure this
 error recovery timing, so it can take upward of 2 minutes to
 actually get a read error reported by the drive. Presumably if the
 ECC determines it's a hard fail and no point in reading the same
 sector 14000 times, it would issue a read error much sooner. But
 again, the linux-raid list is full of cases where this doesn't
 happen, and merely by changing the linux SCSI command timer from 30
 to 121 seconds, now the drive reports an explicit read error with
 LBA information included, and now md can correct the problem.

I have one of those and took it out of service when it started reporting
read errors ( not timeouts ).  I tried several times to write over the
bad sectors to force reallocation and it worked again for a while...
then the bad sectors kept coming back.  Oddly, the SMART values never
indicated anything had been reallocated.
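
For reference, the overwrite-to-force-reallocation step usually looks
something like the sketch below; the LBA and device name are placeholders,
not values from this thread, and the write destroys whatever was stored in
that sector:

  # check whether the suspect sector is readable at all
  hdparm --read-sector 123456 /dev/sdb
  # overwrite it so the firmware can remap it if the medium is bad
  hdparm --yes-i-know-what-i-am-doing --write-sector 123456 /dev/sdb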

 That's my whole point. When the link is reset, no read error is 
 submitted by the drive, the md driver has no idea what the drive's 
 problem was, no idea that it's a read problem, no idea what LBA is 
 affected, and thus no way of writing over the affected bad sector.
 If the SCSI command timer is raised well above 30 seconds, this
 problem is resolved. Also replacing the drive with one that
 definitively errors out (or can be configured with smartctl -l
 scterc) before 30 seconds is another option.

It doesn't know why or exactly where, but it does know *something* went
wrong.

 It doesn't really matter; clearly its timeout for drive commands
 is much higher than the Linux default of 30 seconds.

Only if you are running linux and can see the timeouts.  You can't
assume that's what is going on under windows just because the desktop
stutters.

 OK that doesn't actually happen and it would be completely f'n
 wrong behavior if it were happening. All the kernel knows is the
 command timer has expired, it doesn't know why the drive isn't
 responding. It doesn't know there are uncorrectable sector errors
 causing the problem. To just assume link resets are the same thing
 as bad sectors and to just wholesale start writing possibly a
 metric shit ton of data when you don't know what the problem is
 would be asinine. It might even be sabotage. Jesus...

In normal single disk operation sure: the kernel resets the drive and
retries the request.  But like I said before, I could have sworn there
was an early failure flag that md uses to tell the lower layers NOT to
attempt that kind of normal recovery, and instead just to return the
failure right away so md can just go grab the data from the drive that
isn't wigging out.  That prevents the system from stalling on paging IO
while the drive plays around with its deep recovery, and copying back
512k to the drive with the one bad sector isn't really that big of a
deal.

 Then there is one option which is to increase the value of the
 SCSI command timer. And that applies to all raid: md, lvm, btrfs,
 and hardware.

And then you get stupid hanging when you could just get the data from
the other drive immediately.


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-28 Thread Patrik Lundquist
On 25 November 2014 at 22:34, Phillip Susi ps...@ubuntu.com wrote:
 On 11/19/2014 7:05 PM, Chris Murphy wrote:
  I'm not a hard drive engineer, so I can't argue either point. But
  consumer drives clearly do behave this way. On Linux, the kernel's
  default 30 second command timer eventually results in what look
  like link errors rather than drive read errors. And instead of the
  problems being fixed with the normal md and btrfs recovery
  mechanisms, the errors simply get worse and eventually there's data
  loss. Exhibits A, B, C, D - the linux-raid list is full to the brim
  of such reports and their solution.

 I have seen plenty of error logs of people with drives that do
 properly give up and return an error instead of timing out so I get
 the feeling that most drives are properly behaved.  Is there a
 particular make/model of drive that is known to exhibit this silly
 behavior?

I had a couple of Seagate Barracuda 7200.11 (codename Moose) drives
with seriously retarded firmware.

They never reported a read error AFAIK but began to time out instead.
They wouldn't even respond after a link reset. I had to power cycle
the disks.

Funny days with ddrescue. Got almost everything off them.


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-28 Thread Patrik Lundquist
On 25 November 2014 at 23:14, Phillip Susi ps...@ubuntu.com wrote:
 On 11/19/2014 6:59 PM, Duncan wrote:

 The paper specifically mentioned that it wasn't necessarily the
 more expensive devices that were the best, either, but the ones
 that fared best did tend to have longer device-ready times.  The
 conclusion was that a lot of devices are cutting corners on
 device-ready, gambling that in normal use they'll work fine,
 leading to an acceptable return rate, and evidently, the gamble
 pays off most of the time.

 I believe I read the same study and don't recall any such conclusion.
  Instead the conclusion was that the badly behaving drives aren't
 ordering their internal writes correctly and flushing their metadata
 from ram to flash before completing the write request.  The problem
 was on the power *loss* side, not the power application.

I've found:

http://www.usenix.org/conference/fast13/technical-sessions/presentation/zheng
http://lkcl.net/reports/ssd_analysis.html

Are there any more studies?


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-25 Thread Phillip Susi

On 11/19/2014 7:05 PM, Chris Murphy wrote:
 I'm not a hard drive engineer, so I can't argue either point. But 
 consumer drives clearly do behave this way. On Linux, the kernel's 
 default 30 second command timer eventually results in what look
 like link errors rather than drive read errors. And instead of the
 problems being fixed with the normal md and btrfs recovery
 mechanisms, the errors simply get worse and eventually there's data
 loss. Exhibits A, B, C, D - the linux-raid list is full to the brim
 of such reports and their solution.

I have seen plenty of error logs of people with drives that do
properly give up and return an error instead of timing out so I get
the feeling that most drives are properly behaved.  Is there a
particular make/model of drive that is known to exhibit this silly
behavior?

 IIRC, this is true when the drive returns failure as well.  The
 whole bio is marked as failed, and the page cache layer then
 begins retrying with progressively smaller requests to see if it
 can get *some* data out.
 
 Well that's very coarse. It's not at a sector level, so as long as
 the drive continues to try to read from a particular LBA, but fails
 to either succeed reading or give up and report a read error,
 within 30 seconds, then you just get a bunch of wonky system
 behavior.

I don't understand this response at all.  The drive isn't going to
keep trying to read the same bad lba; after the kernel times out, it
resets the drive, and tries reading different smaller parts to see
which it can read and which it can't.

 Conversely what I've observed on Windows in such a case, is it 
 tolerates these deep recoveries on consumer drives. So they just
 get really slow but the drive does seem to eventually recover
 (until it doesn't). But yeah 2 minutes is a long time. So then the
 user gets annoyed and reinstalls their system. Since that means
 writing to the affected drive, the firmware logic causes bad
 sectors to be dereferenced when the write error is persistent.
 Problem solved, faster system.

That seems like rather unsubstantiated guesswork.  i.e. the 2 minute+
delays are likely not on an individual request, but from several
requests that each go into deep recovery, possibly because windows is
retrying the same sector or a few consecutive sectors are bad.

 Because now you have a member drive that's inconsistent. At least
 in the md raid case, a certain number of read failures causes the
 drive to be ejected from the array. Anytime there's a write
 failure, it's ejected from the array too. What you want is for the
 drive to give up sooner with an explicit read error, so md can help
 fix the problem by writing good data to the affected LBA. That
 doesn't happen when there are a bunch of link resets happening.

What?  It is no different than when it does return an error, with the
exception that the error is incorrectly applied to the entire request
instead of just the affected sector.

 Again, if your drive SCT ERC is configurable, and set to something 
 sane like 70 deciseconds, that read failure happens at MOST 7
 seconds after the read attempt. And md is notified of *exactly*
 what sectors are affected, it immediately goes to mirror data, or
 rebuilds it from parity, and then writes the correct data to the
 previously reported bad sectors. And that will fix the problem.

Yes... I'm talking about when the drive doesn't support that.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-25 Thread Phillip Susi

On 11/19/2014 6:59 PM, Duncan wrote:
 It's not physical spinup, but electronic device-ready.  It happens
 on SSDs too and they don't have anything to spinup.

If you have an SSD that isn't handling IO within 5 seconds or so of
power on, it is badly broken.

 But, for instance on my old seagate 300-gigs that I used to have in
 4-way mdraid, when I tried to resume from hibernate the drives
 would be spunup and talking to the kernel, but for some seconds to
 a couple minutes or so after spinup, they'd sometimes return
 something like (example) Seagrte3x0 instead of Seagate300.  Of
 course that wasn't the exact string, I think it was the model
 number or perhaps the serial number or something, but looking at
 dmesg I could see the ATA layer up for each of the four devices, the
 connection establish and seem to be returning good data, then the
 mdraid layer would try to assemble and would kick out a drive or
 two due to the device string mismatch compared to what was there 
 before the hibernate.  With the string mismatch, from its
 perspective the device had disappeared and been replaced with
 something else.

Again, these drives were badly broken then.  Even if it needs extra
time to come up for some reason, it shouldn't be reporting that it is
ready and returning incorrect information.

 And now I seen similar behavior resuming from suspend (the old
 hardware wouldn't resume from suspend to ram, only hibernate, the
 new hardware resumes from suspend to ram just fine, but I had
 trouble getting it to resume from hibernate back when I first setup
 and tried it; I've not tried hibernate since and didn't even setup
 swap to hibernate to when I got the SSDs so I've not tried it for a
 couple years) on SSDs with btrfs raid.  Btrfs isn't as informative
 as was mdraid on why it kicks a device, but dmesg says both devices
 are up, while btrfs is suddenly spitting errors on one device.  A
 reboot later and both devices are back in the btrfs and I can do a
 scrub to resync, which generally finds and fixes errors on the
 btrfs that were writable (/home and /var/log), but of course not on
 the btrfs mounted as root, since it's read-only by default.

Several months back I was working on some patches to avoid blocking a
resume until after all disks had spun up ( someone else ended up
getting a different version merged to the mainline kernel ).  I looked
quite hard at the timings of things during suspend and found that my
ssd was ready and handling IO darn near instantly and the hd ( 5900
rpm wd green at the time ) took something like 7 seconds before it was
completing IO.  These days I'm running a raid10 on 3 7200 rpm blues
and it comes right up from suspend with no problems, just as it should.

 The paper specifically mentioned that it wasn't necessarily the
 more expensive devices that were the best, either, but the ones
 that fared best did tend to have longer device-ready times.  The
 conclusion was that a lot of devices are cutting corners on
 device-ready, gambling that in normal use they'll work fine,
 leading to an acceptable return rate, and evidently, the gamble
 pays off most of the time.

I believe I read the same study and don't recall any such conclusion.
 Instead the conclusion was that the badly behaving drives aren't
ordering their internal writes correctly and flushing their metadata
from ram to flash before completing the write request.  The problem
was on the power *loss* side, not the power application.

 The spinning rust in that study fared far better, with I think
 none of the devices scrambling their own firmware, and while there
 was some damage to storage, it was generally far better confined.

That is because they don't have a flash translation layer to get
mucked up and prevent them from knowing where the blocks are on disk.
 The worst thing you get out of an HDD losing power during a write is
the sector it was writing is corrupted and you have to re-write it.

 My experience says otherwise.  Else explain why those problems
 occur in the first two minutes, but don't occur if I hold it at the
 grub prompt to stabilize for two minutes, and never during normal
 post- stabilization operation.  Of course perhaps there's another
 explanation for that, and I'm conflating the two things.  But so
 far, experience matches the theory.

I don't know what was broken about these drives, only that it wasn't
capacitors since those charge in milliseconds, not seconds.  Further,
all systems using microprocessors ( like the one in the drive that
controls it ) have reset circuitry that prevents them from running
until after any caps have charged enough to get the power rail up to
the required voltage.



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-25 Thread Chris Murphy
On Tue, Nov 25, 2014 at 2:34 PM, Phillip Susi ps...@ubuntu.com wrote:

 I have seen plenty of error logs of people with drives that do
 properly give up and return an error instead of timing out so I get
 the feeling that most drives are properly behaved.  Is there a
 particular make/model of drive that is known to exhibit this silly
 behavior?

The drive will only issue a read error when its ECC absolutely cannot
recover the data, hard fail.

A few years ago companies including Western Digital started shipping
large cheap drives, think of the green drives. These had very high
TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later
they completely took out the ability to configure this error recovery
timing, so it can take upward of 2 minutes to actually get a read
error reported by the drive. Presumably if the ECC determines it's a
hard fail and no point in reading the same sector 14000 times, it
would issue a read error much sooner. But again, the linux-raid list
is full of cases where this doesn't happen, and merely by changing the
linux SCSI command timer from 30 to 121 seconds, now the drive reports
an explicit read error with LBA information included, and now md can
correct the problem.





 IIRC, this is true when the drive returns failure as well.  The
 whole bio is marked as failed, and the page cache layer then
 begins retrying with progressively smaller requests to see if it
 can get *some* data out.

 Well that's very coarse. It's not at a sector level, so as long as
 the drive continues to try to read from a particular LBA, but fails
 to either succeed reading or give up and report a read error,
 within 30 seconds, then you just get a bunch of wonky system
 behavior.

 I don't understand this response at all.  The drive isn't going to
 keep trying to read the same bad lba; after the kernel times out, it
 resets the drive, and tries reading different smaller parts to see
 which it can read and which it can't.

That's my whole point. When the link is reset, no read error is
submitted by the drive, the md driver has no idea what the drive's
problem was, no idea that it's a read problem, no idea what LBA is
affected, and thus no way of writing over the affected bad sector. If
the SCSI command timer is raised well above 30 seconds, this problem
is resolved. Also replacing the drive with one that definitively
errors out (or can be configured with smartctl -l scterc) before 30
seconds is another option.
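
For reference, the two knobs being discussed look roughly like the sketch
below; /dev/sda and the values are placeholders, and not every drive
accepts the SCT ERC setting:

  # raise the kernel's SCSI command timer for this disk (default is 30s)
  echo 180 > /sys/block/sda/device/timeout
  # or, on drives that support it, cap the drive's own error recovery at 7s
  smartctl -l scterc,70,70 /dev/sda

Neither setting is persistent across reboots on its own, so it typically
has to be reapplied from a boot script or udev rule.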



 Conversely what I've observed on Windows in such a case, is it
 tolerates these deep recoveries on consumer drives. So they just
 get really slow but the drive does seem to eventually recover
 (until it doesn't). But yeah 2 minutes is a long time. So then the
 user gets annoyed and reinstalls their system. Since that means
 writing to the affected drive, the firmware logic causes bad
 sectors to be dereferenced when the write error is persistent.
 Problem solved, faster system.

 That seems like rather unsubstantiated guesswork.  i.e. the 2 minute+
 delays are likely not on an individual request, but from several
 requests that each go into deep recovery, possibly because windows is
 retrying the same sector or a few consecutive sectors are bad.

It doesn't really matter; clearly its timeout for drive commands is
much higher than the Linux default of 30 seconds.


 Because now you have a member drive that's inconsistent. At least
 in the md raid case, a certain number of read failures causes the
 drive to be ejected from the array. Anytime there's a write
 failure, it's ejected from the array too. What you want is for the
 drive to give up sooner with an explicit read error, so md can help
 fix the problem by writing good data to the affected LBA. That
 doesn't happen when there are a bunch of link resets happening.

 What?  It is no different than when it does return an error, with the
 exception that the error is incorrectly applied to the entire request
 instead of just the affected sector.

OK that doesn't actually happen and it would be completely f'n wrong
behavior if it were happening. All the kernel knows is the command
timer has expired, it doesn't know why the drive isn't responding. It
doesn't know there are uncorrectable sector errors causing the
problem. To just assume link resets are the same thing as bad sectors
and to just wholesale start writing possibly a metric shit ton of data
when you don't know what the problem is would be asinine. It might
even be sabotage. Jesus...





 Again, if your drive SCT ERC is configurable, and set to something
 sane like 70 deciseconds, that read failure happens at MOST 7
 seconds after the read attempt. And md is notified of *exactly*
 what sectors are affected, it immediately goes to mirror data, or
 rebuilds it from parity, and then writes the correct data to the
 previously reported bad sectors. And that will fix the problem.

 Yes... I'm talking about when the drive doesn't support that.

Then there is one option which is to increase the value of the SCSI
command timer. And that applies to all raid: md, lvm, btrfs, and hardware.

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-25 Thread Rich Freeman
On Tue, Nov 25, 2014 at 6:13 PM, Chris Murphy li...@colorremedies.com wrote:
 A few years ago companies including Western Digital started shipping
 large cheap drives, think of the green drives. These had very high
 TLER (Time Limited Error Recovery) settings, a.k.a. SCT ERC. Later
 they completely took out the ability to configure this error recovery
 timing so you only get the upward of 2 minutes to actually get a read
 error reported by the drive.

Why sell an $80 hard drive when you can change a few bytes in the
firmware and sell a crippled $80 drive and an otherwise-identical
non-crippled $130 drive?

--
Rich


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-22 Thread Phillip Susi

On 11/21/2014 04:12 PM, Robert White wrote:
 Here's a bug from 2005 of someone having a problem with the ACPI
 IDE support...

That is not ACPI emulation.  ACPI is not used to access the disk,
but rather it has hooks that give it a chance to diddle with the disk
to do things like configure it to lie about its maximum size, or issue
a security unlock during suspend/resume.

 People debating the merits of the ACPI IDE drivers in 2005.

No... that's not a debate at all; it is one guy asking if he should
use IDE or ACPI mode... someone who again meant AHCI and typed the
wrong acronym.

 Even when you get me for referencing windows, you're still 
 wrong...
 
 How many times will you try to get out of being hideously horribly
 wrong about ACPI supporting disk/storage IO? It is neither recent
 nor rare.
 
 How much egg does your face really need before you just see that
 your fantasy that it's new and uncommon is a delusional mistake?

Project much?

It seems I've proven just about everything I originally said you got
wrong now so hopefully we can be done.



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Ian Armstrong
On Fri, 21 Nov 2014 09:05:32 +0200, Brendan Hide wrote:

 On 2014/11/21 06:58, Zygo Blaxell wrote:

  I also notice you are not running regular SMART self-tests (e.g.
  by smartctl -t long) and the last (and first, and only!) self-test
  the drive ran was ~12000 hours ago.  That means most of your SMART
  data is about 18 months old.  The drive won't know about sectors
  that went bad in the last year and a half unless the host happens
  to stumble across them during a read.
 
  The drive is over five years old in operating hours alone.  It is
  probably so fragile now that it will break if you try to move it.

 All interesting points. Do you schedule SMART self-tests on your own 
 systems? I have smartd running. In theory it tracks changes and sends 
 alerts if it figures a drive is going to fail. But, based on what
 you've indicated, that isn't good enough.

Simply monitoring the smart status without a self-test isn't really that
great. I'm not sure on the default config, but smartd can be made to
initiate a smart self-test at regular intervals. Depending on the test
type (short, long, etc) it could include a full surface scan. This can
reveal things like bad sectors before you ever hit them during normal
system usage.
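
As a sketch of what that can look like (the device and schedule here are
only an illustration, not something from this thread; see smartd.conf(5)
for the -s syntax):

  # in /etc/smartd.conf: monitor everything, short self-test daily at
  # 02:00, long self-test every Saturday at 03:00
  /dev/sda -a -s (S/../.././02|L/../../6/03)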

 
  WARNING: errors detected during scrubbing, corrected.
  [snip]
  scrub device /dev/sdb2 (id 2) done
  scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
  total bytes scrubbed: 189.49GiB with 5420 errors
  error details: read=5 csum=5415
  corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164

  That seems a little off.  If there were 5 read errors, I'd expect the
  drive to have errors in the SMART error log.
 
  Checksum errors could just as easily be a btrfs bug or a RAM/CPU
  problem. There have been a number of fixes to csums in btrfs pulled
  into the kernel recently, and I've retired two five-year-old
  computers this summer due to RAM/CPU failures.

 The difference here is that the issue only affects the one drive.
 This leaves the probable cause at:
 - the drive itself
 - the cable/ports
 
 with a negligibly-possible cause at the motherboard chipset.

This is the same problem that I'm currently trying to resolve. I have
one drive in a raid1 setup which shows no issues in smart status but
often has checksum errors.

In my situation what I've found is that if I scrub & let it fix the
errors then a second pass immediately after will show no errors. If I
then leave it a few days & try again there will be errors, even in
old files which have not been accessed for months.

If I do a read-only scrub to get a list of errors, a second scrub
immediately after will show exactly the same errors.

Apart from the scrub errors the system logs shows no issues with that
particular drive.

My next step is to disable autodefrag & see if the problem persists.
(I'm not suggesting a problem with autodefrag, I just want to remove it
from the equation & ensure that outside of normal file access, data
isn't being rewritten between scrubs)

-- 
Ian


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Phillip Susi

On 11/20/2014 5:45 PM, Robert White wrote:
 Nice attempt at saving face, but wrong as _always_.
 
 The CONFIG_PATA_ACPI option has been in the kernel since 2008 and
 lots of people have used it.
 
 If you search for ACPI ide you'll find people complaining in
 2008-2010 about windows error messages indicating the device is
 present in their system but no OS driver is available.

Nope... not finding it.  The closest thing was one or two people who
said ACPI when they meant AHCI ( and were quickly corrected ).  This
is probably what you were thinking of since windows xp did not ship
with an ahci driver so it was quite common for winxp users to have
this problem when in _AHCI_ mode.

 That you have yet to see a single system that implements it is
 about the worst piece of internet research I've ever seen. Do you
 not _get_ that your opinion about what exists and how it works is
 not authoritative?

Show me one and I'll give you a cookie.  I have disassembled a number
of acpi tables and yet to see one that has it.  What's more,
motherboard vendors tend to implement only the absolute minimum they
have to.  Since nobody actually needs this feature, they aren't going
to bother with it.  Do you not get that your hand waving arguments of
you can google for it are not authoritative?

 You can also find articles about both windows and linux systems
 actively using ACPI fan control going back to 2009

Maybe you should have actually read those articles.  Linux supports
acpi fan control; unfortunately, almost no motherboards actually
implement it.  Almost everyone who wants fan control working in linux
has to install lm-sensors and load a driver that directly accesses one
of the embedded controllers that motherboards tend to use and run the
fancontrol script to manipulate the pwm channels on that controller.
These days you also have to boot with a kernel argument to allow
loading the driver since ACPI claims those IO ports for its own use
which creates a conflict.

Windows users that want to do this have to install a program... I
believe a popular one is called q-fan, that likewise directly accesses
the embedded controller registers to control the fan, since the acpi
tables don't bother properly implementing the acpi fan spec.

Then there are thinkpads, and one or two other laptops ( asus comes to
mind ) that went and implemented their own proprietary acpi interfaces
for fancontrol instead of following the spec, which required some
reverse engineering and yet more drivers to handle these proprietary
acpi interfaces.  You can google for thinkfan if you want to see this.

 These are not hard searches to pull off. These are not obscure 
 references. Go to the google box and start typing ACPI fan...
 and check the autocomplete.
 
 I'll skip over all the parts where you don't know how a chipset
 works and blah, blah, blah...
 
 You really should have just stopped at I don't know and I've
 never because you keep demonstrating that you _don't_ know, and
 that you really _should_ _never_.
 
 Tell us more about the lizard aliens controlling your computer, I
 find your versions of reality fascinating...

By all means, keep embarrassing yourself with nonsense and trying to
cover it up by being rude and insulting.



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Phillip Susi

On 11/20/2014 6:08 PM, Robert White wrote:
 Well you should have _actually_ trimmed your response down to not 
 pressing send.
 
 _Many_ motherboards have complete RAID support at levels 0, 1, 10,
 and 5. A few have RAID6.
 
 Some of them even use the LSI chip-set.

Yes, there are some expensive server class motherboards out there with
integrated real raid chips.  Your average consumer class motherboards
are not those.  They contain intel, nvidia, sil, promise, and via
chipsets that are fake raid.

 Seriously... are you trolling this list with disinformation or
 just repeating tribal knowledge from fifteen year old copies of PC
 Magazine?

Please drop the penis measuring.

 Yea, some of the IDE motherboards and that only had RAID1 and RAID0
 (and indeed some of the add-on controllers) back in the IDE-only
 days were really lame just-forked-write devices with no integrity
 checks (hence fake raid) but that's from like the 1990s; it's
 paleolithic age wisdom at this point.

Wrong again... fakeraid became popular with the advent of SATA since
it was easy to add a knob to the bios to switch it between AHCI and
RAID mode, and just change the pci device id.  These chipsets are
still quite common today and several of them do support raid5 and
raid10 ( well, really it's raid 0 + raid1, but that's a whole nother
can of worms ).  Recent intel chips also now have a caching mode for
having an SSD cache a larger HDD.  Intel has also done a lot of work
integrating support for their chipset into mdadm in the last year or
three.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Chris Murphy
On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong bt...@iarmst.co.uk wrote:

 In my situation what I've found is that if I scrub & let it fix the
 errors then a second pass immediately after will show no errors. If I
 then leave it a few days & try again there will be errors, even in
 old files which have not been accessed for months.

What are the devices? And if they're SSDs are they powered off for
these few days? I take it the scrub error type is corruption?

You can use badblocks to write a known pattern to the drive. Then
power off and leave it for a few days. Then read the drive, matching
against the pattern, and see if there are any discrepancies. Doing
this outside the code path of Btrfs would fairly conclusively indicate
whether it's hardware or software induced.
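
A sketch of that procedure, assuming a spare partition such as /dev/sdb5
(the partition name is a placeholder, and the test destroys whatever is
on it):

  # write the 0xaa pattern everywhere and verify it once
  badblocks -w -t 0xaa -s /dev/sdb5
  # ...power down, wait a few days, then re-read and check the same pattern
  badblocks -t 0xaa -s /dev/sdb5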

Assuming you have another copy of all of these files :-) you could
just sha256sum the two copies to see if they have in fact changed. If
they have, well then you've got some silent data corruption somewhere
somehow. But if they always match, then that suggests a bug. I don't
see how you can get bogus corruption messages, and for it to not be a
bug. When you do these scrubs that come up clean, and then later come
up with corruptions, have you done any software updates?


 My next step is to disable autodefrag & see if the problem persists.
 (I'm not suggesting a problem with autodefrag, I just want to remove it
 from the equation & ensure that outside of normal file access, data
 isn't being rewritten between scrubs)

I wouldn't expect autodefrag to touch old files not accessed for
months. Doesn't it only affect actively used files?


-- 
Chris Murphy


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Zygo Blaxell
On Fri, Nov 21, 2014 at 09:05:32AM +0200, Brendan Hide wrote:
 On 2014/11/21 06:58, Zygo Blaxell wrote:
 You have one reallocated sector, so the drive has lost some data at some
 time in the last 49000(!) hours.  Normally reallocations happen during
 writes so the data that was lost was data you were in the process of
 overwriting anyway; however, the reallocated sector count could also be
 a sign of deteriorating drive integrity.
 
 In /var/lib/smartmontools there might be a csv file with logged error
 attribute data that you could use to figure out whether that reallocation
 was recent.
 
 I also notice you are not running regular SMART self-tests (e.g.
 by smartctl -t long) and the last (and first, and only!) self-test the
 drive ran was ~12000 hours ago.  That means most of your SMART data is
 about 18 months old.  The drive won't know about sectors that went bad
 in the last year and a half unless the host happens to stumble across
 them during a read.
 
 The drive is over five years old in operating hours alone.  It is probably
 so fragile now that it will break if you try to move it.
 All interesting points. Do you schedule SMART self-tests on your own
 systems? I have smartd running. In theory it tracks changes and
 sends alerts if it figures a drive is going to fail. But, based on
 what you've indicated, that isn't good enough.

I run 'smartctl -t long' from cron overnight (or whenever the drives
are most idle).  You can also set up smartd.conf to launch the self
tests; however, the syntax for test scheduling is byzantine compared to
cron (and that's saying something!).  On multi-drive systems I schedule
a different drive for each night.

If you are also doing btrfs scrub, then stagger the scheduling so
e.g. smart runs in even weeks and btrfs scrub runs in odd weeks.
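
A minimal sketch of that kind of staggering in /etc/crontab, simplified to
alternate by weekday rather than by week since plain cron can't easily
express even/odd weeks (device names, mount point and times are
placeholders):

  # long self-test on a different drive each night, while the system is idle
  0 3 * * 1  root  smartctl -t long /dev/sda
  0 3 * * 2  root  smartctl -t long /dev/sdb
  # btrfs scrub on another night so the two never overlap
  0 3 * * 6  root  btrfs scrub start -Bd /mnt/data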

smartd is OK for monitoring test logs and email alerts.  I've had no
problems there.

 WARNING: errors detected during scrubbing, corrected.
 [snip]
 scrub device /dev/sdb2 (id 2) done
  scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
  total bytes scrubbed: 189.49GiB with 5420 errors
  error details: read=5 csum=5415
  corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
 That seems a little off.  If there were 5 read errors, I'd expect the
 drive to have errors in the SMART error log.
 
 Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem.
 There have been a number of fixes to csums in btrfs pulled into the kernel
 recently, and I've retired two five-year-old computers this summer due
 to RAM/CPU failures.
 The difference here is that the issue only affects the one drive.
 This leaves the probable cause at:
 - the drive itself
 - the cable/ports
 
 with a negligibly-possible cause at the motherboard chipset.

If it was cable, there should be UDMA CRC errors or similar in the SMART
counters, but they are zero.  You can also try swapping the cable and
seeing whether the errors move.  I've found many bad cables that way.

The drive itself could be failing in some way that prevents recording
SMART errors (e.g. because of host timeouts triggering a bus reset,
which also prevents the SMART counter update for what was going wrong at
the time).  This is unfortunately quite common, especially with drives
configured for non-RAID workloads.

 
 -- 
 __
 Brendan Hide
 http://swiftspirit.co.za/
 http://www.webafrica.co.za/?AFF1E97
 




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Chris Murphy
On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell zblax...@furryterror.org wrote:

 I run 'smartctl -t long' from cron overnight (or whenever the drives
 are most idle).  You can also set up smartd.conf to launch the self
 tests; however, the syntax for test scheduling is byzantine compared to
 cron (and that's saying something!).  On multi-drive systems I schedule
 a different drive for each night.

 If you are also doing btrfs scrub, then stagger the scheduling so
 e.g. smart runs in even weeks and btrfs scrub runs in odd weeks.

 smartd is OK for monitoring test logs and email alerts.  I've had no
 problems there.

Most attributes are always updated without issuing a smart test of any
kind. A drive I have here only has four offline updateable attributes.

When it comes to bad sectors, the drive won't use a sector that
persistently fails writes. So you don't really have to worry about
latent bad sectors that don't have data on them already. The sectors
you care about are the ones with data. A scrub reads all of those
sectors.

First the drive could report a read error in which case Btrfs
raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or
rebuild it from parity, and write to the affected sector; and also
this same mechanism happens in normal reads so it's a kind of passive
scrub. But it misses data that isn't actively being read, which a
scrub will check.

Second, the drive could report no problem, and Btrfs raid1/10 could
still fix the problem in case of a csum mismatch. And it looks like
soonish we'll see this apply to raid5/6.

So I think a nightly long smart test is a bit overkill. I think you
could do nightly -t short tests which will report problems scrub won't
notice, such as higher seek times or lower throughput performance. And
then scrub once a week.


 The drive itself could be failing in some way that prevents recording
 SMART errors (e.g. because of host timeouts triggering a bus reset,
 which also prevents the SMART counter update for what was going wrong at
 the time).  This is unfortunately quite common, especially with drives
 configured for non-RAID workloads.

Libata resetting the link should be recorded in kernel messages.



-- 
Chris Murphy


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Robert White

On 11/21/2014 07:11 AM, Phillip Susi wrote:

On 11/20/2014 5:45 PM, Robert White wrote:

If you search for ACPI ide you'll find people complaining in
2008-2010 about windows error messages indicating the device is
present in their system but no OS driver is available.


Nope... not finding it.  The closest thing was one or two people who
said ACPI when they meant AHCI ( and were quickly corrected ).  This
is probably what you were thinking of since windows xp did not ship
with an ahci driver so it was quite common for winxp users to have
this problem when in _AHCI_ mode.


I have to give you that one... I should have never trusted any reference 
to windows.


Most of those references to windows support were getting AHCI and ACPI 
mixed up. Lolz windows users... They didn't get into ACPI disk support 
till 2010. I should have known they were behind the times. I had to 
scroll down almost a whole page to find the linux support.


So let's just look at the top of ide/ide-acpi.c from linux 2.6 to 
consult about when ACPI got into the IDE business...


linux/drivers/ide/ide-acpi.c
/*
 * Provides ACPI support for IDE drives.
 *
 * Copyright (C) 2005 Intel Corp.
 * Copyright (C) 2005 Randy Dunlap
 * Copyright (C) 2006 SUSE Linux Products GmbH
 * Copyright (C) 2006 Hannes Reinecke
 */

Here's a bug from 2005 of someone having a problem with the ACPI IDE 
support...


https://www.google.com/url?sa=trct=jq=esrc=ssource=webcd=6cad=rjauact=8ved=0CDkQFjAFurl=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D5604ei=g6VvVL73K-HLsASIrYKIDgusg=AFQjCNGTuuXPJk91svGJtRAf35DUqVqrLgsig2=eHxwbLYXn4ED5jG-guoZqg

People debating the merits of the ACPI IDE drivers in 2005.

https://www.google.com/url?sa=trct=jq=esrc=ssource=webcd=12cad=rjauact=8ved=0CGUQFjALurl=http%3A%2F%2Fwww.linuxquestions.org%2Fquestions%2Fslackware-14%2Fbare-ide-and-bare-acpi-kernels-297525%2Fei=g6VvVL73K-HLsASIrYKIDgusg=AFQjCNFoyKgH2sOteWwRN_Tdrfw9hOmVGQsig2=BmMVcZl24KRz4s4gEvLN_w

So you got me... windows was behind the curve by five years instead of 
just three... my bad...


But yea, nobody has ever used that ACPI disk drive support that's been 
in the kernel for nine years.


Even when you get me for referencing windows, you're still wrong...

How many times will you try to get out of being hideously horribly wrong 
about ACPI supporting disk/storage IO? It is neither recent nor rare.


How much egg does your face really need before you just see that your 
fantasy that it's new and uncommon is a delusional mistake?



Methinks Misters Dunning and Kruger need a word with you...




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Robert White

On 11/21/2014 01:12 PM, Robert White wrote:
 (wrong links included in post...)
Dangit... those two links were bad... wrong clipboard... /sigh...

I'll just stand on the pasted text from the driver. 8-)


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Zygo Blaxell
On Fri, Nov 21, 2014 at 11:06:19AM -0700, Chris Murphy wrote:
 On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell zblax...@furryterror.org 
 wrote:
 
  I run 'smartctl -t long' from cron overnight (or whenever the drives
  are most idle).  You can also set up smartd.conf to launch the self
  tests; however, the syntax for test scheduling is byzantine compared to
  cron (and that's saying something!).  On multi-drive systems I schedule
  a different drive for each night.
 
  If you are also doing btrfs scrub, then stagger the scheduling so
  e.g. smart runs in even weeks and btrfs scrub runs in odd weeks.
 
  smartd is OK for monitoring test logs and email alerts.  I've had no
  problems there.
 
 Most attributes are always updated without issuing a smart test of any
 kind. A drive I have here only has four offline updateable attributes.

One of those four is Offline_Uncorrectable, which is a really important
attribute to monitor!

 When it comes to bad sectors, the drive won't use a sector that
 persistently fails writes. So you don't really have to worry about
 latent bad sectors that don't have data on them already. The sectors
 you care about are the ones with data. A scrub reads all of those
 sectors.

A scrub reads all the _allocated_ sectors.  A long selftest reads
_everything_, and also exercises the electronics and mechanics of the
drive in ways that normal operation doesn't.  I have several disks that
are less than 25% occupied, which means scrubs will ignore 75% of the
disk surface at any given time.

A sharp increase in the number of bad sectors (no matter how they are
detected) usually indicates a total drive failure is coming.  Many drives
have been nice enough to give me enough warning for their RMA replacements
to be delivered just a few hours before the drive totally fails.

 First the drive could report a read error in which case Btrfs
 raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or
 rebuild it from parity, and write to the affected sector; and also
 this same mechanism happens in normal reads so it's a kind of passive
 scrub. But it happens to miss checking inactively read data, which a
 scrub will check.
 
 Second, the drive could report no problem, and Btrfs raid1/10 could
 still fix the problem in case of a csum mismatch. And it looks like
 soonish we'll see this apply to raid5/6.
 
 So I think a nightly long smart test is a bit overkill. I think you
 could do nightly -t short tests which will report problems scrub won't
 notice, such as higher seek times or lower throughput performance. And
 then scrub once a week.

Drives quite often drop a sector or two over the years, and it can
be harmless.  What you want to be watching out for is hundreds of bad
sectors showing up over a period of few days--that means something is
rattling around on the disk platters, damaging the hardware as it goes.
To get that data, you have to test the disks every few days.

  The drive itself could be failing in some way that prevents recording
  SMART errors (e.g. because of host timeouts triggering a bus reset,
  which also prevents the SMART counter update for what was going wrong at
  the time).  This is unfortunately quite common, especially with drives
  configured for non-RAID workloads.
 
 Libata resetting the link should be recorded in kernel messages.

This is true, but the original question was about SMART data coverage.
This is why it's important to monitor both.

 -- 
 Chris Murphy




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Ian Armstrong
On Fri, 21 Nov 2014 10:45:21 -0700 Chris Murphy wrote:

 On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong bt...@iarmst.co.uk
 wrote:
 
  In my situation what I've found is that if I scrub & let it fix the
  errors then a second pass immediately after will show no errors. If
  I then leave it a few days & try again there will be errors, even in
  old files which have not been accessed for months.
 
 What are the devices? And if they're SSDs are they powered off for
 these few days? I take it the scrub error type is corruption?

It's spinning rust and the checksum error is always on the one drive
(a SAMSUNG HD204UI). The firmware has been updated, since some were
shipped with a bad version which could result in data corruption.

 You can use badblocks to write a known pattern to the drive. Then
 power off and leave it for a few days. Then read the drive, matching
 against the pattern, and see if there are any discrepancies. Doing
 this outside the code path of Btrfs would fairly conclusively indicate
 whether it's hardware or software induced.

Unfortunately I'm reluctant to go the badblock route for the entire
drive since it's the second drive in a 2 drive raid1 and I don't
currently have a spare. There is a small 6G partition that I can use,
but given that the drive is large and the errors are few, it could take
a while for anything to show.

I also have a second 2 drive btrfs raid1 in the same machine that
doesn't have this problem. All the drives are running off the same
controller.

 Assuming you have another copy of all of these files :-) you could
 just sha256sum the two copies to see if they have in fact changed. If
 they have, well then you've got some silent data corruption somewhere
 somehow. But if they always match, then that suggests a bug.

Some of the files already have an md5 linked to them, while others have
parity files to give some level of recovery from corruption or damage.
Checking against these shows no problems, so I assume that btrfs is
doing its job & only serving an intact file.

 I don't
 see how you can get bogus corruption messages, and for it to not be a
 bug. When you do these scrubs that come up clean, and then later come
 up with corruptions, have you done any software updates?

No software updates between clean & corrupt. I don't have to power down
or reboot either for checksum errors to appear.

I don't think the corruption messages are bogus, but are indicating a
genuine problem. What I would like to be able to do is compare the
corrupt block with the one used to repair it and see what the difference
is. As I've already stated, the system logs are clean & the smart logs
aren't showing any issues. (Well, until today when a self-test failed
with a read error, but it must be an unused sector since the scrub
doesn't hit it & there are no re-allocated sectors yet)

  My next step is to disable autodefrag & see if the problem persists.
  (I'm not suggesting a problem with autodefrag, I just want to
  remove it from the equation & ensure that outside of normal file
  access, data isn't being rewritten between scrubs)
 
 I wouldn't expect autodefrag to touch old files not accessed for
 months. Doesn't it only affect actively used files?

The drive is mainly used to hold old archive files, though there are
daily rotating files on it as well. The corruption affects both new and
old files.

-- 
Ian


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-20 Thread Phillip Susi

On 11/19/2014 5:25 PM, Robert White wrote:
 The controller, the thing that sets the ready bit and sends the 
 interrupt is distinct from the driver, the thing that polls the
 ready bit when the interrupt is sent. At the bus level there are
 fixed delays and retries. Try putting two drives on a pin-select
 IDE bus and strapping them both as _slave_ (or indeed master)
 sometime and watch the shower of fixed delay retries.

No, it does not.  In classical IDE, the controller is really just a
bus bridge.  When you read from the status register in the controller,
the read bus cycle is propagated down the IDE ribbon, and into the
drive, and you are in fact, reading the register directly from the
drive.  That is where the name Integrated Device Electronics came
from: because the controller was really integrated into the drive.

The only fixed delays at the bus level are the bus cycle speed.  There
are no retries.  There are only 3 mentions of the word retry in the
ATA8-APT and they all refer to the host driver.

 That's odd... my bios reads from storage to boot the device and it
 does so using the ACPI storage methods.

No, it doesn't.  It does so by accessing the IDE or AHCI registers
just as pc bios always has.  I suppose I also need to remind you that
we are talking about the context of linux here, and linux does not
make use of the bios for disk access.

 ACPI 4.0 Specification Section 9.8 even disagrees with you at some
 length.
 
 Let's just do the titles shall we:
 
 9.8 ATA Controller Devices 9.8.1 Objects for both ATA and SATA
 Controllers. 9.8.2 IDE Controller Device 9.8.3 Serial ATA (SATA)
 controller Device
 
 Oh, and _lookie_ _here_ in Linux Kernel Menuconfig at Device
 Drivers - * Serial ATA and Parallel ATA drivers (libata) - *
 ACPI firmware driver for PATA
 
 CONFIG_PATA_ACPI:
 
 This option enables an ACPI method driver which drives motherboard
 PATA controller interfaces through the ACPI firmware in the BIOS.
 This driver can sometimes handle otherwise unsupported hardware.
 
 You are a storage _genius_ for knowing that all that stuff doesn't 
 exist... the rest of us must simply muddle along in our
 delusion...

Yes, ACPI 4.0 added this mess.  I have yet to see a single system that
actually implements it.  I can't believe they even bothered adding
this driver to the kernel.  Is there anyone in the world who has ever
used it?  If no motherboard vendor has bothered implementing the ACPI
FAN specs, I very much doubt anyone will ever bother with this.

 Do tell us more... I didn't say the driver would cause long delays,
 I said that the time it takes to error out other improperly
 supported drivers and fall back to this one could induce long
 delays and resets.

There is no error out and fall back.  If the device is in AHCI
mode then it identifies itself as such and the AHCI driver is loaded.
 If it is in IDE mode, then it identifies itself as such, and the IDE
driver is loaded.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-20 Thread Phillip Susi

On 11/19/2014 5:33 PM, Robert White wrote:
 That would be fake raid, not hardware raid.
 
 The LSI MegaRaid controller people would _love_ to hear more about
 your insight into how their battery-backed multi-drive RAID
 controller is fake. You should go work for them. Try the contact
 us link at the bottom of this page. I'm sure they are waiting for
 your insight with bated breath!

Forgive me, I should have trimmed the quote a bit more.  I was
responding specifically to the "many mother boards have hardware RAID
support available through the bios" part, not the LSI part.

 Odd, my MegaRaid controller takes about fifteen seconds
 by-the-clock to initialize and do the integrity check on my single
 initialized drive.

It is almost certainly spending those 15 seconds on something else,
like bootstrapping its firmware code from a slow serial eeprom or
waiting for you to press the magic key to enter the bios utility.  I
would be very surprised to see that time double if you add a second
disk.  If it does, then they are doing something *very* wrong, and
certainly quite different from any other real or fake raid controller
I've ever used.

 It's amazing that with a fail and retry it would be _faster_...

I have no idea what you are talking about here.  I said that they
aren't going to retry a read that *succeeded* but came back without
their magic signature.  It isn't like reading it again is going to
magically give different results.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-20 Thread Robert White

On 11/20/2014 12:26 PM, Phillip Susi wrote:

Yes, ACPI 4.0 added this mess.  I have yet to see a single system that
actually implements it.  I can't believe they even bothered adding
this driver to the kernel.  Is there anyone in the world who has ever
used it?  If no motherboard vendor has bothered implementing the ACPI
FAN specs, I very much doubt anyone will ever bother with this.


Nice attempt at saving face, but wrong as _always_.

The CONFIG_PATA_ACPI option has been in the kernel since 2008 and lots 
of people have used it.


If you search for ACPI ide you'll find people complaining in 2008-2010 
about windows error messages indicating the device is present in their 
system but no OS driver is available.


That you have yet to see a single system that implements it is about 
the worst piece of internet research I've ever seen. Do you not _get_ 
that your opinion about what exists and how it works is not authoritative?


You can also find articles about both windows and linux systems actively 
using ACPI fan control going back to 2009


These are not hard searches to pull off. These are not obscure 
references. Go to the google box and start typing ACPI fan... and 
check the autocomplete.


I'll skip over all the parts where you don't know how a chipset works 
and blah, blah, blah...


You really should have just stopped at I don't know and I've never 
because you keep demonstrating that you _don't_ know, and that you 
really _should_ _never_.


Tell us more about the lizard aliens controlling your computer, I find 
your versions of reality fascinating...



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-20 Thread Robert White

On 11/20/2014 12:34 PM, Phillip Susi wrote:


On 11/19/2014 5:33 PM, Robert White wrote:

That would be fake raid, not hardware raid.


The LSI MegaRaid controller people would _love_ to hear more about
your insight into how their battery-backed multi-drive RAID
controller is fake. You should go work for them. Try the contact
us link at the bottom of this page. I'm sure they are waiting for
your insight with baited breath!


Forgive me, I should have trimmed the quote a bit more.  I was
responding specifically to the "many mother boards have hardware RAID
support available through the bios" part, not the LSI part.


Well you should have _actually_ trimmed your response down to not 
pressing send.


_Many_ motherboards have complete RAID support at levels 0, 1, 10, and 
5. A few have RAID6.


Some of them even use the LSI chip-set.

Seriously... are you trolling this list with disinformation or just 
repeating tribal knowledge from fifteen-year-old copies of PC Magazine?


Yea, some of the IDE motherboards that only had RAID1 and RAID0 (and 
indeed some of the add-on controllers) back in the IDE-only days were 
really lame just-forked-write devices with no integrity checks (hence 
fake raid) but that's from like the 1990s; it's paleolithic-age 
wisdom at this point.


Phillip say sky god angry, all go hide in cave! /D'oh...


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-20 Thread Zygo Blaxell
On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote:
 Hey, guys
 
 See further below extracted output from a daily scrub showing csum
 errors on sdb, part of a raid1 btrfs. Looking back, it has been
 getting errors like this for a few days now.
 
 The disk is patently unreliable but smartctl's output implies there
 are no issues. Is this somehow standard fare for S.M.A.R.T. output?
 
 Here are (I think) the important bits of the smartctl output for
 $(smartctl -a /dev/sdb) (the full results are attached):
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail Always       -       0
   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail Always       -       1
   7 Seek_Error_Rate         0x000f   086   060   030    Pre-fail Always       -       440801014
 197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always       -       0
 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline      -       0
 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always       -       0
 200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age  Offline      -       0
 202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age  Always       -       0

You have one reallocated sector, so the drive has lost some data at some
time in the last 49000(!) hours.  Normally reallocations happen during
writes so the data that was lost was data you were in the process of
overwriting anyway; however, the reallocated sector count could also be
a sign of deteriorating drive integrity.

In /var/lib/smartmontools there might be a csv file with logged error
attribute data that you could use to figure out whether that reallocation
was recent.
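
(Something along these lines -- assuming your distro runs smartd with
attribute logging enabled at all; the directory is the usual default but
the file name varies:)

    ls -l /var/lib/smartmontools/
    # if an attrlog*.csv file is there, each line is a timestamped snapshot
    # of the attribute table; look for the snapshot where attribute 5's raw
    # value goes from 0 to 1
    less /var/lib/smartmontools/attrlog.*.csv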

I also notice you are not running regular SMART self-tests (e.g.
by smartctl -t long) and the last (and first, and only!) self-test the
drive ran was ~12000 hours ago.  That means most of your SMART data is
about 18 months old.  The drive won't know about sectors that went bad
in the last year and a half unless the host happens to stumble across
them during a read.
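
(For example, with /dev/sdb as in your output:)

    smartctl -t long /dev/sdb      # start a full-surface self-test; the drive runs it on its own
    # ...several hours later...
    smartctl -l selftest /dev/sdb  # see whether it completed or died at some LBA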

The drive is over five years old in operating hours alone.  It is probably
so fragile now that it will break if you try to move it.


 
 
  Original Message 
 Subject:  Cron root@watricky /usr/local/sbin/btrfs-scrub-all
 Date: Tue, 18 Nov 2014 04:19:12 +0200
 From: (Cron Daemon) root@watricky
 To:   brendan@watricky
 
 
 
 WARNING: errors detected during scrubbing, corrected.
 [snip]
 scrub device /dev/sdb2 (id 2) done
   scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 
 seconds
   total bytes scrubbed: 189.49GiB with 5420 errors
   error details: read=5 csum=5415
   corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164

That seems a little off.  If there were 5 read errors, I'd expect the drive to
have errors in the SMART error log.
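
Worth checking directly, e.g.:

    smartctl -l error /dev/sdb     # the drive's own log of recently failed commands
    smartctl -x /dev/sdb           # everything it will admit to, in one go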

Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem.
There have been a number of fixes to csums in btrfs pulled into the kernel
recently, and I've retired two five-year-old computers this summer due
to RAM/CPU failures.

 [snip]
 

 smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.17.2-1-ARCH] (local build)
 Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF INFORMATION SECTION ===
 Model Family: Seagate Barracuda 7200.10
 Device Model: ST3250410AS
 Serial Number:6RYF5NP7
 Firmware Version: 4.AAA
 User Capacity:250,059,350,016 bytes [250 GB]
 Sector Size:  512 bytes logical/physical
 Device is:In smartctl database [for details use: -P show]
 ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
 Local Time is:Tue Nov 18 09:16:03 2014 SAST
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled
 
 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED
 See vendor-specific Attribute list for marginal Attributes.
 
 General SMART Values:
 Offline data collection status:  (0x82)   Offline data collection activity
   was completed without error.
   Auto Offline Data Collection: Enabled.
 Self-test execution status:  (   0)   The previous self-test routine 
 completed
   without error or no self-test has ever 
   been run.
 Total time to complete Offline 
 data collection:  (  430) seconds.
 Offline data collection
 capabilities:  (0x5b) SMART execute Offline immediate.
   Auto Offline data collection on/off 
 support.
   Suspend Offline collection upon new
   command.
  

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-20 Thread Brendan Hide

On 2014/11/21 06:58, Zygo Blaxell wrote:

You have one reallocated sector, so the drive has lost some data at some
time in the last 49000(!) hours.  Normally reallocations happen during
writes so the data that was lost was data you were in the process of
overwriting anyway; however, the reallocated sector count could also be
a sign of deteriorating drive integrity.

In /var/lib/smartmontools there might be a csv file with logged error
attribute data that you could use to figure out whether that reallocation
was recent.

I also notice you are not running regular SMART self-tests (e.g.
by smartctl -t long) and the last (and first, and only!) self-test the
drive ran was ~12000 hours ago.  That means most of your SMART data is
about 18 months old.  The drive won't know about sectors that went bad
in the last year and a half unless the host happens to stumble across
them during a read.

The drive is over five years old in operating hours alone.  It is probably
so fragile now that it will break if you try to move it.

All interesting points. Do you schedule SMART self-tests on your own 
systems? I have smartd running. In theory it tracks changes and sends 
alerts if it figures a drive is going to fail. But, based on what you've 
indicated, that isn't good enough.
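
I guess the fix is to have smartd schedule the tests itself. If I'm reading 
smartd.conf(5) right, a line something like this (times and device are just 
an example) would run a short test daily and a long test weekly, and mail 
on trouble:

    /dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root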



WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 
seconds
total bytes scrubbed: 189.49GiB with 5420 errors
error details: read=5 csum=5415
corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
That seems a little off.  If there were 5 read errors, I'd expect the drive to
have errors in the SMART error log.

Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem.
There have been a number of fixes to csums in btrfs pulled into the kernel
recently, and I've retired two five-year-old computers this summer due
to RAM/CPU failures.

The difference here is that the issue only affects the one drive. This 
leaves the probable cause at:

- the drive itself
- the cable/ports

with a negligibly-possible cause at the motherboard chipset.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Phillip Susi

On 11/18/2014 9:40 PM, Chris Murphy wrote:
 It’s well known on linux-raid@ that consumer drives have well over
 30 second deep recoveries when they lack SCT command support. The
 WDC and Seagate “green” drives are over 2 minutes apparently. This
 isn’t easy to test because it requires a sector with enough error
 that it requires the ECC to do something, and yet not so much error
 that it gives up in less than 30 seconds. So you have to track down
 a drive model spec document (one of those 100 pagers).
 
 This makes sense, sorta, because the manufacturer use case is 
 typically single drive only, and most proscribe raid5/6 with such 
 products. So it’s a “recover data at all costs” behavior because
 it’s assumed to be the only (immediately) available copy.

It doesn't make sense to me.  If it can't recover the data after one
or two hundred retries in one or two seconds, it can keep trying until
the cows come home and it just isn't ever going to work.

 I don’t see how that’s possible because anything other than the
 drive explicitly producing  a read error (which includes the
 affected LBA’s), it’s ambiguous what the actual problem is as far
 as the kernel is concerned. It has no way of knowing which of
 possibly dozens of ata commands queued up in the drive have
 actually hung up the drive. It has no idea why the drive is hung up
 as well.

IIRC, this is true when the drive returns failure as well.  The whole
bio is marked as failed, and the page cache layer then begins retrying
with progressively smaller requests to see if it can get *some* data out.

 No I think 30 is pretty sane for servers using SATA drives because
 if the bus is reset all pending commands in the queue get
 obliterated which is worse than just waiting up to 30 seconds. With
 SAS drives maybe less time makes sense. But in either case you
 still need configurable SCT ERC, or it needs to be a sane fixed
 default like 70 deciseconds.

Who cares if multiple commands in the queue are obliterated if they
can all be retried on the other mirror?  Better to fall back to the
other mirror NOW instead of waiting 30 seconds ( or longer! ).  Sure,
you might end up recovering more than you really had to, but that
won't hurt anything.



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Phillip Susi

On 11/18/2014 9:46 PM, Duncan wrote:
 I'm not sure about normal operation, but certainly, many drives
 take longer than 30 seconds to stabilize after power-on, and I
 routinely see resets during this time.

As far as I have seen, typical drive spin up time is on the order of
3-7 seconds.  Hell, I remember my pair of first generation seagate
cheetah 15,000 rpm drives seemed to take *forever* to spin up and that
still was maybe only 15 seconds.  If a drive takes longer than 30
seconds, then there is something wrong with it.  I figure there is a
reason why spin up time is tracked by SMART so it seems like long spin
up time is a sign of a sick drive.

 This doesn't happen on single-hardware-device block devices and 
 filesystems because in that case it's either up or down, if the
 device doesn't come up in time the resume simply fails entirely,
 instead of coming up with one or more devices there, but others
 missing as they didn't stabilize in time, as is unfortunately all
 too common in the multi- device scenario.

No, the resume doesn't fail entirely.  The drive is reset, and the
IO request is retried, and by then it should succeed.

 I've seen this with both spinning rust and with SSDs, with mdraid
 and btrfs, with multiple mobos and device controllers, and with
 resume both from suspend to ram (if the machine powers down the
 storage devices in that case, as most modern ones do) and hibernate
 to permanent storage device, over several years worth of kernel
 series, so it's a reasonably widespread phenomena, at least among
 consumer-level SATA devices.  (My experience doesn't extend to
 enterprise-raid-level devices or proper SCSI, etc, so I simply
 don't know, there.)

If you are restoring from hibernation, then the drives are already
spun up before the kernel is loaded.

 While two minutes is getting a bit long, I think it's still within
 normal range, and some devices definitely take over a minute enough
 of the time to be both noticeable and irritating.

It certainly is not normal for a drive to take that long to spin up.
IIRC, the 30 second timeout comes from the ATA specs which state that
it can take up to 30 seconds for a drive to spin up.

 That said, I SHOULD say I'd be far *MORE* irritated if the device
 simply pretended it was stable and started reading/writing data
 before it really had stabilized, particularly with SSDs where that
 sort of behavior has been observed and is known to put some devices
 at risk of complete scrambling of either media or firmware, beyond
 recovery at times.  That of course is the risk of going the other
 direction, and I'd a WHOLE lot rather have devices play it safe for
 another 30 seconds or so after they / think/ they're stable and be
 SURE, than pretend to be just fine when voltages have NOT
 stabilized yet and thus end up scrambling things irrecoverably.
 I've never had that happen here tho I've never stress- tested for
 it, only done normal operation, but I've seen testing reports where
 the testers DID make it happen surprisingly easily, to a surprising
  number of their test devices.

Power supply voltage is stable within milliseconds.  What takes HDDs
time to start up is mechanically bringing the spinning rust up to
speed.  On SSDs, I think you are confusing testing done on power
*cycling* ( i.e. yanking the power cord in the middle of a write )
with startup.

 So, umm... I suspect the 2-minute default is 2 minutes due to
 power-up stabilizing issues, where two minutes is a reasonable
 compromise between failing the boot most of the time if the timeout
 is too low, and taking excessively long for very little further
 gain.

The default is 30 seconds, not 2 minutes.
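
(The knob in question, for reference -- the device name and value are just
examples:)

    cat /sys/block/sdb/device/timeout          # kernel command timeout in seconds, 30 by default
    echo 180 > /sys/block/sdb/device/timeout   # raise it if the drive can take minutes to give up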

 sure whether it's even possible, without some specific hardware
 feature available to tell the kernel that it has in fact NOT been
 in power-saving mode for say 5-10 minutes, hopefully long enough
 that voltage readings really /are/ fully stabilized and a shorter
 timeout is possible.

Again, there is no several minute period where voltage stabilizes and
the drive takes longer to access.  This is a complete red herring.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Robert White

On 11/19/2014 08:07 AM, Phillip Susi wrote:

On 11/18/2014 9:46 PM, Duncan wrote:

I'm not sure about normal operation, but certainly, many drives
take longer than 30 seconds to stabilize after power-on, and I
routinely see resets during this time.


As far as I have seen, typical drive spin up time is on the order of
3-7 seconds.  Hell, I remember my pair of first generation seagate
cheetah 15,000 rpm drives seemed to take *forever* to spin up and that
still was maybe only 15 seconds.  If a drive takes longer than 30
seconds, then there is something wrong with it.  I figure there is a
reason why spin up time is tracked by SMART so it seems like long spin
up time is a sign of a sick drive.


I was recently re-factoring Underdog (http://underdog.sourceforge.net) 
startup scripts to separate out the various startup domains (e.g. lvm, 
luks, mdadm) in the prototype init.


So I notice you (Duncan) use the word "stabilize", as do a small number 
of drivers in the linux kernel. This word has very little to do with 
disks per se.


Between SCSI probing LUNs (where the controller tries every theoretical 
address and gives a potential device ample time to reply), and 
usb-storage having a simple timer delay set for each volume it sees, 
there is a lot of waiting in the name of safety going on in the linux 
kernel at device initialization.


When I added the "scanning /dev/sd??" messages to the startup sequence 
as I iterate through the disks and partitions present, I discovered that 
the first time I called blkid (e.g. right between /dev/sda and 
/dev/sda1) I'd get a huge hit of many human seconds (I didn't time it, 
but I'd say eight or so) just for having a 2TB My Book WD 3.0 disk 
enclosure attached as /dev/sdc. That the enclosure had spun up in the 
previous boot cycle, and that this was only a soft reboot, was 
immaterial. In this case usb-storage is going to take its time and do 
its deal regardless of the state of the physical drive itself.


So there are _lots_ of places where you are going to get delays and very 
few of them involve the disk itself going from power-off to ready.


You said it yourself with respect to SSDs.

It's cheaper, and less error prone, and less likely to generate customer 
returns if the generic controller chips just send init, wait a fixed 
delay, then request a status compared to trying to are-you-there-yet 
poll each device like a nagging child. And you are going to see that at 
every level. And you are going to see it multiply with _sparsely_ 
provisioned buses where the cycle is going to be retried for absent LUNs 
(one disk on a Wide SCSI bus and a controller set to probe all LUNs is 
particularly egregious)


One of the reasons that the whole industry has started favoring 
point-to-point (SATA, SAS) or physical intercessor chaining 
point-to-point (eSATA) buses is to remove a lot of those wait-and-see 
delays.


That said, you should not see a drive (or target enclosure, or 
controller) reset during spin up. In a SCSI setting this is almost 
always a cabling, termination, or addressing issue. In IDE it's a jumper 
mismatch (master vs slave vs cable-select). Less often it's a 
partitioning issue (trying to access sectors beyond the end of the drive).


Another strong actor is selecting the wrong storage controller chipset 
driver. In that case you may be falling back from the high-end device you 
think it is, through intermediate chip-set, and back to ACPI or BIOS 
emulation


Another common cause is having a dedicated hardware RAID controller 
(dell likes to put LSI MegaRaid controllers in their boxes for example), 
many mother boards have hardware RAID support available through the 
bios, etc, leaving that feature active, then adding a drive and 
_not_ initializing that drive with the RAID controller disk setup. In 
this case the controller is going to repeatedly probe the drive for its 
proprietary controller signature blocks (and reset the drive after each 
attempt) and then finally fall back to raw block pass-through. This can 
take a long time (thirty seconds to a minute).


But seriously, if you are seeing reset anywhere in any storage chain 
during a normal power-on cycle then you've got a problem  with geometry 
or configuration.



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Phillip Susi

On 11/19/2014 4:05 PM, Robert White wrote:
 It's cheaper, and less error prone, and less likely to generate
 customer returns if the generic controller chips just send init,
 wait a fixed delay, then request a status compared to trying to
 are-you-there-yet poll each device like a nagging child. And you
 are going to see that at every level. And you are going to see it
 multiply with _sparsely_ provisioned buses where the cycle is going
 to be retried for absent LUNs (one disk on a Wide SCSI bus and a
 controller set to probe all LUNs is particularly egregious)

No, they do not wait a fixed time, then proceed.  They do in fact
issue the command, then poll or wait for an interrupt to know when it
is done, then time out and give up if that doesn't happen within a
reasonable amount of time.

 One of the reasons that the whole industry has started favoring 
 point-to-point (SATA, SAS) or physical intercessor chaining 
 point-to-point (eSATA) buses is to remove a lot of those
 wait-and-see delays.

Nope... even with the ancient PIO mode PATA interface, you polled a
ready bit in the status register to see if it was done yet.  If you
always waited 30 seconds for every command your system wouldn't boot
up until next year.

 Another strong actor is selecting the wrong storage controller
 chipset driver. In that case you may be faling back from high-end
 device you think it is, through intermediate chip-set, and back to
 ACPI or BIOS emulation

There is no such thing as ACPI or BIOS emulation.  AHCI SATA
controllers do usually have an old IDE emulation mode instead of AHCI
mode, but this isn't going to cause ridiculously long delays.

 Another common cause is having a dedicated hardware RAID
 controller (dell likes to put LSI MegaRaid controllers in their
 boxes for example), many mother boards have hardware RAID support
 available through the bios, etc, leaving that feature active, then
 adding a drive and

That would be fake raid, not hardware raid.

 _not_ initializing that drive with the RAID controller disk setup.
 In this case the controller is going to repeatedly probe the drive
 for its proprietary controller signature blocks (and reset the
 drive after each attempt) and then finally fall back to raw block
 pass-through. This can take a long time (thirty seconds to a
 minute).

No, no, and no.  If it reads the drive and does not find its metadata,
it falls back to pass through.  The actual read takes only
milliseconds, though it may have to wait a few seconds for the drive
to spin up.  There is no reason it would keep retrying after a
successful read.

The way you end up with 30-60 second startup time with a raid is if
you have several drives and staggered spinup mode enabled, then each
drive is started one at a time instead of all at once so their
cumulative startup time can add up fairly high.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Robert White

Shame you already know everything?

On 11/19/2014 01:47 PM, Phillip Susi wrote:

On 11/19/2014 4:05 PM, Robert White wrote:






One of the reasons that the whole industry has started favoring
point-to-point (SATA, SAS) or physical intercessor chaining
point-to-point (eSATA) buses is to remove a lot of those
wait-and-see delays.


Nope... even with the ancient PIO mode PATA interface, you polled a
ready bit in the status register to see if it was done yet.  If you
always waited 30 seconds for every command your system wouldn't boot
up until next year.


The controller, the thing that sets the ready bit and sends the 
interrupt is distinct from the driver, the thing that polls the ready 
bit when the interrupt is sent. At the bus level there are fixed delays 
and retries. Try putting two drives on a pin-select IDE bus and 
strapping them both as _slave_ (or indeed master) sometime and watch the 
shower of fixed delay retries.



Another strong actor is selecting the wrong storage controller
chipset driver. In that case you may be falling back from the high-end
device you think it is, through intermediate chip-set, and back to
ACPI or BIOS emulation


There is no such thing as ACPI or BIOS emulation.


That's odd... my bios reads from storage to boot the device and it does 
so using the ACPI storage methods.


ACPI 4.0 Specification Section 9.8 even disagrees with you at some length.

Let's just do the titles shall we:

9.8 ATA Controller Devices
9.8.1 Objects for both ATA and SATA Controllers.
9.8.2 IDE Controller Device
9.8.3 Serial ATA (SATA) controller Device

Oh, and _lookie_ _here_ in Linux Kernel Menuconfig at
Device Drivers ->
 * Serial ATA and Parallel ATA drivers (libata) ->
  * ACPI firmware driver for PATA

CONFIG_PATA_ACPI:

This option enables an ACPI method driver which drives motherboard PATA 
controller interfaces through the ACPI firmware in the BIOS. This driver 
can sometimes handle otherwise unsupported hardware.


You are a storage _genius_ for knowing that all that stuff doesn't 
exist... the rest of us must simply muddle along in our delusion...


 AHCI SATA
 controllers do usually have an old IDE emulation mode instead of AHCI
 mode, but this isn't going to cause ridiculously long delays.

Do tell us more... I didn't say the driver would cause long delays, I 
said that the time it takes to error out other improperly supported 
drivers and fall back to this one could induce long delays and resets.


I think I am done with your expertise in the question of all things 
storage related.


Not to be rude... but I'm physically ill and maybe I shouldn't be 
posting right now... 8-)


-- Rob.


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Robert White

P.S.

On 11/19/2014 01:47 PM, Phillip Susi wrote:

Another common cause is having a dedicated hardware RAID
controller (dell likes to put LSI MegaRaid controllers in their
boxes for example), many mother boards have hardware RAID support
available through the bios, etc, leaving that feature active, then
adding a drive and


That would be fake raid, not hardware raid.


The LSI MegaRaid controller people would _love_ to hear more about your 
insight into how their battery-backed multi-drive RAID controller is 
fake. You should go work for them. Try the contact us link at the 
bottom of this page. I'm sure they are waiting for your insight with 
bated breath!


http://www.lsi.com/products/raid-controllers/pages/megaraid-sas-9260-8i.aspx


_not_ initializing that drive with the RAID controller disk setup.
In this case the controller is going to repeatedly probe the drive
for its proprietary controller signature blocks (and reset the
drive after each attempt) and then finally fall back to raw block
pass-through. This can take a long time (thirty seconds to a
minute).


No, no, and no.  If it reads the drive and does not find its metadata,
it falls back to pass through.  The actual read takes only
milliseconds, though it may have to wait a few seconds for the drive
to spin up.  There is no reason it would keep retrying after a
successful read.


Odd, my MegaRaid controller takes about fifteen seconds by-the-clock to 
initialize and do the integrity check on my single initialized drive. 
It's amazing that with a fail and retry it would be _faster_...


It's like you know _everything_...




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Duncan
Phillip Susi posted on Wed, 19 Nov 2014 11:07:43 -0500 as excerpted:

 
 On 11/18/2014 9:46 PM, Duncan wrote:
 I'm not sure about normal operation, but certainly, many drives take
 longer than 30 seconds to stabilize after power-on, and I routinely see
 resets during this time.
 
 As far as I have seen, typical drive spin up time is on the order of 3-7
 seconds.  Hell, I remember my pair of first generation seagate cheetah
 15,000 rpm drives seemed to take *forever* to spin up and that still was
 maybe only 15 seconds.  If a drive takes longer than 30 seconds, then
 there is something wrong with it.  I figure there is a reason why spin
 up time is tracked by SMART so it seems like long spin up time is a sign
 of a sick drive.

It's not physical spinup, but electronic device-ready.  It happens on 
SSDs too and they don't have anything to spinup.

But, for instance on my old seagate 300-gigs that I used to have in 4-way 
mdraid, when I tried to resume from hibernate the drives would be spun up 
and talking to the kernel, but for some seconds to a couple minutes or so 
after spinup, they'd sometimes return something like (example) 
Seagrte3x0 instead of Seagate300.  Of course that wasn't the exact 
string, I think it was the model number or perhaps the serial number or 
something, but looking at dmsg I could see the ATA layer up for each of 
the four devices, the connection establish and seem to be returning good 
data, then the mdraid layer would try to assemble and would kick out a 
drive or two due to the device string mismatch compared to what was there 
before the hibernate.  With the string mismatch, from its perspective the 
device had disappeared and been replaced with something else.

But if I held it at the grub prompt for a couple minutes and /then/ let 
it go, or part of the time on its own, all four drives would match and 
it'd work fine.  For just short hibernates (as when testing hibernate/
resume), it'd come back just fine; as it would nearly all the time out to 
two hours or so.  Beyond that, out to 10 or 12 hours, the longer it sat 
the more likely it would be to fail, if I didn't hold it at the grub 
prompt for a few minutes to let it stabilize.

And now I've seen similar behavior resuming from suspend (the old hardware 
wouldn't resume from suspend to ram, only hibernate, the new hardware 
resumes from suspend to ram just fine, but I had trouble getting it to 
resume from hibernate back when I first setup and tried it; I've not 
tried hibernate since and didn't even setup swap to hibernate to when I 
got the SSDs so I've not tried it for a couple years) on SSDs with btrfs 
raid.  Btrfs isn't as informative as was mdraid on why it kicks a device, 
but dmesg says both devices are up, while btrfs is suddenly spitting 
errors on one device.  A reboot later and both devices are back in the 
btrfs and I can do a scrub to resync, which generally finds and fixes 
errors on the btrfs that were writable (/home and /var/log), but of 
course not on the btrfs mounted as root, since it's read-only by default.

Same pattern.  Immediate suspend and resume is fine.  Out to about 6 
hours it tends to be fine as well.  But at 8-10 hours in suspend, btrfs 
starts spitting errors often enough that I generally quit trying to 
suspend at all, I simply shut down now.  (With SSDs and systemd, shutdown 
and restart is fast enough, and the delay from having to refill cache low 
enough, that the time difference between suspend and full shutdown is 
hardly worth troubling with anyway, certainly not when there's a risk to 
data due to failure to properly resume.)

But it worked fine when I had only a single device to bring back up.  
Nothing to be slower than another device to respond and thus to be kicked 
out as dead.


I finally realized what was happening after I read a study paper 
mentioning capacitor charge time and solid-state stability time, and how 
a lot of cheap devices say they're ready before the electronics have 
actually properly stabilized.  On SSDs, this is a MUCH worse issue than 
it is on spinning rust, because the logical layout isn't practically 
forced to serial like it is on spinning rust, and the firmware can get so 
jumbled it pretty much scrambles the device.  And it's not just the 
normal storage either.  In the study, many devices corrupted their own 
firmware as well!

Now that was definitely a worst-case study in that they were deliberately 
yanking and/or fast-switching the power, not just doing time-on waits, 
but still, a surprisingly high proportion of SSDs not only scrambled the 
storage, but scrambled their firmware as well.  (On those devices the 
firmware may well have been on the same media as the storage, with the 
firmware simply read in first in a hardware bootstrap mode, and the 
firmware programmed to avoid that area in normal operation thus making it 
as easily corrupted as the normal storage.)

The paper specifically 

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Chris Murphy
On Wed, Nov 19, 2014 at 8:11 AM, Phillip Susi ps...@ubuntu.com wrote:

 On 11/18/2014 9:40 PM, Chris Murphy wrote:
 It’s well known on linux-raid@ that consumer drives have well over
 30 second deep recoveries when they lack SCT command support. The
 WDC and Seagate “green” drives are over 2 minutes apparently. This
 isn’t easy to test because it requires a sector with enough error
 that it requires the ECC to do something, and yet not so much error
 that it gives up in less than 30 seconds. So you have to track down
 a drive model spec document (one of those 100 pagers).

 This makes sense, sorta, because the manufacturer use case is
 typically single drive only, and most proscribe raid5/6 with such
 products. So it’s a “recover data at all costs” behavior because
 it’s assumed to be the only (immediately) available copy.

 It doesn't make sense to me.  If it can't recover the data after one
 or two hundred retries in one or two seconds, it can keep trying until
 the cows come home and it just isn't ever going to work.

I'm not a hard drive engineer, so I can't argue either point. But
consumer drives clearly do behave this way. On Linux, the kernel's
default 30 second command timer eventually results in what look like
link errors rather than drive read errors. And instead of the problems
being fixed with the normal md and btrfs recovery mechanisms, the
errors simply get worse and eventually there's data loss. Exhibits A,
B, C, D - the linux-raid list is full to the brim of such reports and
their solution.


 I don’t see how that’s possible because anything other than the
 drive explicitly producing  a read error (which includes the
 affected LBA’s), it’s ambiguous what the actual problem is as far
 as the kernel is concerned. It has no way of knowing which of
 possibly dozens of ata commands queued up in the drive have
 actually hung up the drive. It has no idea why the drive is hung up
 as well.

 IIRC, this is true when the drive returns failure as well.  The whole
 bio is marked as failed, and the page cache layer then begins retrying
 with progressively smaller requests to see if it can get *some* data out.

Well that's very coarse. It's not at a sector level, so as long as the
drive continues to try to read from a particular LBA, but fails to
either succeed or to give up and report a read error within 30
seconds, you just get a bunch of wonky system behavior.

Conversely what I've observed on Windows in such a case, is it
tolerates these deep recoveries on consumer drives. So they just get
really slow but the drive does seem to eventually recover (until it
doesn't). But yeah 2 minutes is a long time. So then the user gets
annoyed and reinstalls their system. Since that means writing to the
affected drive, the firmware logic causes bad sectors to be
dereferenced when the write error is persistent. Problem solved,
faster system.




 No I think 30 is pretty sane for servers using SATA drives because
 if the bus is reset all pending commands in the queue get
 obliterated which is worse than just waiting up to 30 seconds. With
 SAS drives maybe less time makes sense. But in either case you
 still need configurable SCT ERC, or it needs to be a sane fixed
 default like 70 deciseconds.

 Who cares if multiple commands in the queue are obliterated if they
 can all be retried on the other mirror?

Because now you have a member drive that's inconsistent. At least in
the md raid case, a certain number of read failures causes the drive
to be ejected from the array. Anytime there's a write failure, it's
ejected from the array too. What you want is for the drive to give up
sooner with an explicit read error, so md can help fix the problem by
writing good data to the affected LBA. That doesn't happen when there
are a bunch of link resets happening.


 Better to fall back to the
 other mirror NOW instead of waiting 30 seconds ( or longer! ).  Sure,
 you might end up recovering more than you really had to, but that
 won't hurt anything.

Again, if your drive SCT ERC is configurable, and set to something
sane like 70 deciseconds, that read failure happens at MOST 7 seconds
after the read attempt. And md is notified of *exactly* what sectors
are affected, it immediately goes to mirror data, or rebuilds it from
parity, and then writes the correct data to the previously reported
bad sectors. And that will fix the problem.

So really, if you're going to play the multiple device game, you need
drive error timing to be shorter than the kernel's.
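
Concretely, for drives that support it (sdX being whichever member you're
looking at):

    smartctl -l scterc /dev/sdX          # show the current SCT ERC read/write timers
    smartctl -l scterc,70,70 /dev/sdX    # cap recovery at 70 deciseconds, well under the 30 s kernel timer

(On many drives that setting doesn't survive a power cycle, so it wants to
be reapplied at boot.)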



Chris Murphy


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Duncan
Robert White posted on Wed, 19 Nov 2014 13:05:13 -0800 as excerpted:

 One of the reasons that the whole industry has started favoring
 point-to-point (SATA, SAS) or physical intercessor chaining
 point-to-point (eSATA) buses is to remove a lot of those wait-and-see
 delays.
 
 That said, you should not see a drive (or target enclosure, or
 controller) reset during spin up. In a SCSI setting this is almost
 always a cabling, termination, or addressing issue. In IDE it's a jumper
 mismatch (master vs slave vs cable-select). Less often it's a
 partitioning issue (trying to access sectors beyond the end of the
 drive).
 
 Another strong actor is selecting the wrong storage controller chipset
 driver. In that case you may be falling back from the high-end device you
 think it is, through intermediate chip-set, and back to ACPI or BIOS
 emulation

FWIW I run a custom-built monolithic kernel, with only the specific 
drivers (SATA/AHCI in this case) builtin.  There's no drivers for 
anything else it could fallback to.

Once in a while I do see it try at say 6-gig speeds, then eventually fall 
back to 3 and ultimately 1.5, but that /is/ indicative of other issues 
when I see it.  And like I said, there are no other drivers to fall back 
to, so obviously I never see it doing that.

 Another common cause is having a dedicated hardware RAID controller
 (dell likes to put LSI MegaRaid controllers in their boxes for example),
 many mother boards have hardware RAID support available through the
 bios, etc, leaving that feature active, then the adding a drive and
 _not_ initializing that drive with the RAID controller disk setup. In
 this case the controller is going to repeatedly probe the drive for its
 proprietary controller signature blocks (and reset the drive after each
 attempt) and then finally fall back to raw block pass-through. This can
 take a long time (thirty seconds to a minute).

Everything's set JBOD here.  I don't trust those proprietary firmware 
raid things.  Besides, that kills portability.  JBOD SATA and AHCI are 
sufficiently standardized that should the hardware die, I can switch out 
to something else and not have to worry about rebuilding the custom 
kernel with the new drivers.  Some proprietary firmware raid, requiring 
dmraid at the software kernel level to support, when I can just as easily 
use full software mdraid on standardized JBOD, no thanks!

And be sure, that's one of the first things I check when I setup a new 
box, any so-called hardware raid that's actually firmware/software raid, 
disabled, JBOD mode, enabled.

 But seriously, if you are seeing reset anywhere in any storage chain
 during a normal power-on cycle then you've got a problem  with geometry
 or configuration.

IIRC I don't get it routinely.  But I've seen it a few times, attributing 
it as I said to the 30-second SATA level timeout not being long enough.

Most often, however, it's at resume, not original startup, which is 
understandable as state at resume doesn't match state at suspend/
hibernate.  The irritating thing, as previously discussed, is when one 
device takes long enough to come back that mdraid or btrfs drops it out, 
generally forcing the reboot I was trying to avoid with the suspend/
hibernate in the first place, along with a re-add and resync (for mdraid) 
or a scrub (for btrfs raid).
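
(The cleanup itself is at least mechanical -- device and mount names below
are examples, not anything canonical:)

    mdadm /dev/md0 --re-add /dev/sdb1    # put the kicked member back; md resyncs it
    btrfs scrub start /mnt               # for the btrfs raid, rewrite stale copies from the good device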

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-19 Thread Robert White

On 11/19/2014 04:25 PM, Duncan wrote:

Most often, however, it's at resume, not original startup, which is
understandable as state at resume doesn't match state at suspend/
hibernate.  The irritating thing, as previously discussed, is when one
device takes long enough to come back that mdraid or btrfs drops it out,
generally forcing the reboot I was trying to avoid with the suspend/
hibernate in the first place, along with a re-add and resync (for mdraid)
or a scrub (for btrfs raid).


If you want a practical solution you might want to look at 
http://underdog.sourceforge.net (my project, shameless plug). The actual 
user context return isn't in there but I use the project to build 
initramfs images into all my kernels.


[DISCLAIMER: The cryptsetup and LUKS stuff is rock solid but the mdadm 
incremental build stuff is very rough and only lightly tested]


You could easily add a drive preheat code block (spin up and status 
check all drives with pause and repeat function) as a preamble function 
that could/would safely take place before any glance is made towards the 
resume stage.


extemporaneous example::

--- snip ---
cat <<'EOT' >/opt/underdog/utility/preheat.mod
#!/bin/bash
# ROOT_COMMANDS+=( commands your preheat needs )
UNDERDOG+=( init.d/preheat )
EOT

cat <<'EOT' >/opt/underdog/prototype/init.d/preheat
#!/bin/bash
function __preamble_preheat() {
    # whatever logic you need
    return 0
}
__preamble_funcs+=( [preheat]=__preamble_preheat )
EOT
--- snip ---

Install underdog, paste the above into a shell once, then edit 
/opt/underdog/prototype/init.d/preamble to put in whatever logic you need.
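
As a sketch of what that logic might look like (not part of underdog, just
one possible body for the placeholder above -- it assumes plain /dev/sd?
disks and nothing exotic):

    function __preamble_preheat() {
        local d
        for d in /dev/sd?; do
            [ -b "$d" ] || continue
            # one sector read is enough to force a spin-up / ready check;
            # give any straggler a couple of seconds before moving on
            dd if="$d" of=/dev/null bs=512 count=1 2>/dev/null || sleep 2
        done
        return 0
    }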


Follow the instructions in /opt/underdog/README.txt for making the 
initramfs image or, as I do, build the initramfs into the kernel image.


The preamble will be run in the resultant /init script before the swap 
partitions are submitted for attempted resume.


(The system does support complexity like resuming from a swap partition 
inside an LVM/LV built over a LUKS encrypted media expanse, or just a 
plain laptop with one plain partitioned disk, with zero changes to the 
necessary default config.)


-- Rob.





Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Austin S Hemmelgarn

On 2014-11-18 02:29, Brendan Hide wrote:

Hey, guys

See further below extracted output from a daily scrub showing csum
errors on sdb, part of a raid1 btrfs. Looking back, it has been getting
errors like this for a few days now.

The disk is patently unreliable but smartctl's output implies there are
no issues. Is this somehow standard fare for S.M.A.R.T. output?

Here are (I think) the important bits of the smartctl output for
$(smartctl -a /dev/sdb) (the full results are attached):
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail Always       -       1
  7 Seek_Error_Rate         0x000f   086   060   030    Pre-fail Always       -       440801014
197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age  Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age  Always       -       0



 Original Message 
Subject: Cron root@watricky /usr/local/sbin/btrfs-scrub-all
Date: Tue, 18 Nov 2014 04:19:12 +0200
From: (Cron Daemon) root@watricky
To: brendan@watricky



WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
 scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682
seconds
 total bytes scrubbed: 189.49GiB with 5420 errors
 error details: read=5 csum=5415
 corrected errors: 5420, uncorrectable errors: 0, unverified errors:
164
[snip]

In addition to the storage controller being a possibility as mentioned 
in another reply, there are some parts of the drive that aren't covered 
by SMART attributes on most disks, most notably the on-drive cache. 
There really isn't a way to disable the read cache on the drive, but you 
can disable write-caching, which may improve things (and if it's a cheap 
disk, may provide better reliability for BTRFS as well).  The other 
thing I would suggest trying is a different data cable to the drive 
itself, I've had issues with some SATA cables (the cheap red ones you 
get in the retail packaging for some hard disks in particular) having 
either bad connectors, or bad strain-reliefs, and failing after only a 
few hundred hours of use.
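
If you do try the write-cache angle, it's just (assuming the drive honours
it):

    hdparm -W /dev/sdb     # show the current write-cache setting
    hdparm -W 0 /dev/sdb   # turn write caching off; slower writes, less to lose on a bad flush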






Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Brendan Hide

On 2014/11/18 09:36, Roman Mamedov wrote:

On Tue, 18 Nov 2014 09:29:54 +0200
Brendan Hide bren...@swiftspirit.co.za wrote:


Hey, guys

See further below extracted output from a daily scrub showing csum
errors on sdb, part of a raid1 btrfs. Looking back, it has been getting
errors like this for a few days now.

The disk is patently unreliable but smartctl's output implies there are
no issues. Is this somehow standard fare for S.M.A.R.T. output?

Not necessarily the disk's fault, could be a SATA controller issue. How are
your disks connected, which controller brand and chip? Add lspci output, at
least if it's something other than the ordinary motherboard chipset's
built-in ports.


In this case, yup, it's directly to the motherboard chipset's built-in ports. 
This is a very old desktop, and the other 3 disks don't have any issues. I'm 
checking out the alternative pointed out by Austin.

SATA-relevant lspci output:
00:1f.2 SATA controller: Intel Corporation 82801JD/DO (ICH10 Family) SATA AHCI 
Controller (rev 02)


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Brendan Hide

On 2014/11/18 14:08, Austin S Hemmelgarn wrote:

[snip] there are some parts of the drive that aren't covered by SMART 
attributes on most disks, most notably the on-drive cache. There really isn't a 
way to disable the read cache on the drive, but you can disable write-caching.
It's an old and replaceable disk - but if the cable replacement doesn't 
work I'll try this for kicks. :)

The other thing I would suggest trying is a different data cable to the drive 
itself, I've had issues with some SATA cables (the cheap red ones you get in 
the retail packaging for some hard disks in particular) having either bad 
connectors, or bad strain-reliefs, and failing after only a few hundred hours 
of use.

Thanks. I'll try this first. :)

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Duncan
Brendan Hide posted on Tue, 18 Nov 2014 15:24:48 +0200 as excerpted:

 In this case, yup, its directly to the motherboard chipset's built-in
 ports. This is a very old desktop, and the other 3 disks don't have any
 issues. I'm checking out the alternative pointed out by Austin.
 
 SATA-relevant lspci output:
 00:1f.2 SATA controller: Intel Corporation 82801JD/DO (ICH10 Family)
 SATA AHCI Controller (rev 02)

I guess your definition of _very_ old desktop, and mine, are _very_ 
different.

* A quick check of wikipedia says the ICH10 wasn't even 
/introduced/ until 2008 (the wiki link for the 82801jo/do points to an 
Intel page, which says it was launched Q3-2008), and it would have been 
some time after that, likely 2009, that you actually purchased the 
machine.

2009 is five years ago, middle-aged yes, arguably old, but _very_ old, 
not so much in this day and age of longer system replace cycles.

* It has SATA, not IDE/PATA.

* It was PCIE 1.1, not PCI-X or PCI and AGP, and DEFINITELY not ISA bus, 
with or without VLB!

* It has USB 2.0 ports, not USB 1.1, and not only serial/parallel/ps2, 
and DEFINITELY not an AT keyboard.

* It has Gigabit Ethernet, not simply Fast Ethernet or just Ethernet, and 
DEFINITELY Ethernet not token-ring.

* It already has Intel Virtualization technology and HD audio instead of 
AC97 or earlier.

Now I can certainly imagine an old desktop having most of these, but 
you said _very_ old, not simply old, and _very_ old to me would mean PATA/
USB-1/AGP/PCI/FastEthernet with AC97 audio or earlier and no 
virtualization.  64-bit would be questionable as well.


FWIW, I've been playing minitube/youtube C64 music the last few days.  
Martin Galway, etc.  Now C64 really _IS_ _very_ old!

Also FWIW, only a couple years ago now (well, about three,
time flies!), my old 2003 vintage original 3-digit Opteron based mobo 
died due to bulging/burst capacitors, after serving me 8 years.  I was 
shooting for a full decade but didn't quite make it...

So indeed, 2009 vintage system, five years, definitely not _very_ old, 
arguably not even old, more like middle-aged. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/18/2014 7:08 AM, Austin S Hemmelgarn wrote:
 In addition to the storage controller being a possibility as
 mentioned in another reply, there are some parts of the drive that
 aren't covered by SMART attributes on most disks, most notably the
 on-drive cache. There really isn't a way to disable the read cache
 on the drive, but you can disable write-caching, which may improve
 things (and if it's a cheap disk, may provide better reliability
 for BTRFS as well).  The other thing I would suggest trying is a
 different data cable to the drive itself, I've had issues with some
 SATA cables (the cheap red ones you get in the retail packaging for
 some hard disks in particular) having either bad connectors, or bad
 strain-reliefs, and failing after only a few hundred hours of use.

SATA applies a CRC to the data going across the link, so with a bad cable
you typically get CRC errors (or often 8b/10b coding errors) and the
transfer is aborted, rather than bad data being returned.
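
(If cable-induced corruption is suspected, the drive's interface CRC counter
is worth a quick look; a sketch, again assuming the disk is /dev/sdb:)

  # On most SATA drives, link-level CRC errors increment SMART attribute 199
  # (UDMA_CRC_Error_Count)
  smartctl -A /dev/sdb | grep -i crc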




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/18/2014 10:35 AM, Marc MERLIN wrote:
 Try running hdrecover on your drive, it'll scan all your blocks and
 try to rewrite the ones that are failing, if any: 
 http://hdrecover.sourceforge.net/

He doesn't have blocks that are failing; he has blocks that are being
silently corrupted.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Marc MERLIN
On Tue, Nov 18, 2014 at 11:04:00AM -0500, Phillip Susi wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 On 11/18/2014 10:35 AM, Marc MERLIN wrote:
  Try running hdrecover on your drive, it'll scan all your blocks and
  try to rewrite the ones that are failing, if any: 
  http://hdrecover.sourceforge.net/
 
 He doesn't have blocks that are failing; he has blocks that are being
 silently corrupted.

That seems to be the case, but hdrecover will rule that part out at least.

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/18/2014 11:11 AM, Marc MERLIN wrote:
 That seems to be the case, but hdrecover will rule that part out at
 least.

It's already ruled out: if the read had failed, that is what the error
message would have said, rather than reporting a bad checksum.




Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Chris Murphy

On Nov 18, 2014, at 8:35 AM, Marc MERLIN m...@merlins.org wrote:

 On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote:
 Hey, guys
 
 See further below extracted output from a daily scrub showing csum 
 errors on sdb, part of a raid1 btrfs. Looking back, it has been getting 
 errors like this for a few days now.
 
 The disk is patently unreliable but smartctl's output implies there are 
 no issues. Is this somehow standard faire for S.M.A.R.T. output?
 
 Try running hdrecover on your drive, it'll scan all your blocks and try to
 rewrite the ones that are failing, if any:
 http://hdrecover.sourceforge.net/

The only way it can know there is a bad sector is if the drive returns a 
read error, which includes the LBA of the affected sector(s). This is the same 
thing a scrub would do, except that a scrub won’t touch bad sectors that don’t 
contain data. A common problem in getting a drive to issue the read error, 
however, is a mismatch between the scsi command timer setting (default 30 
seconds) and the SCT error recovery control setting of the drive. The drive’s 
SCT ERC value needs to be shorter than the scsi command timer value, otherwise 
some bad sector errors will send the drive into a recovery attempt that runs 
past the scsi command timer. If that happens, the ata link is reset, and 
there’s no possibility of finding out what the affected sector is.

So a.) use smartctl -l scterc to set the drive’s error recovery time to 
something below the 30 second (300 decisecond) command timer, with 70 
deciseconds being reasonable. If the drive doesn’t support SCT commands, then 
b.) change the linux scsi command timer to be greater than 120 seconds.
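
(As a concrete sketch of both options, with sdX standing in for the affected
drive and the values just suggested:)

  # a) cap the drive's own error recovery at 7.0 seconds, if SCT ERC is supported
  smartctl -l scterc,70,70 /dev/sdX

  # b) otherwise, raise the kernel's command timer for that device
  #    (value in seconds; this does not persist across reboots)
  echo 180 > /sys/block/sdX/device/timeout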

Strictly speaking, the command timer should be set to a value that ensures 
there are no link reset messages in dmesg, i.e. long enough that the drive 
itself times out and actually reports a read error. This could be much shorter 
than 120 seconds. I don’t know if there are any consumer drives that try for 
longer than 2 minutes to recover data from a marginally bad sector.

Ideally though, don’t use drives that lack SCT support in multiple device 
volume configurations. An up to 2 minute hang of the storage stack isn’t 
production compatible for most workflows.


Chris Murphy


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Phillip Susi
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/18/2014 1:57 PM, Chris Murphy wrote:
 So a.) use smartctl -l scterc to change the value below 30 seconds 
 (300 deciseconds) with 70 deciseconds being reasonable. If the
 drive doesn’t support SCT commands, then b.) change the linux scsi
 command timer to be greater than 120 seconds.
 
 Strictly speaking the command timer would be set to a value that 
 ensures there are no link reset messages in dmesg, that it’s long 
 enough that the drive itself times out and actually reports a read 
 error. This could be much shorter than 120 seconds. I don’t know
 if there are any consumer drives that try longer than 2 minutes to 
 recover data from a marginally bad sector.

Are there really any that take longer than 30 seconds?  That's enough
time for thousands of retries.  If it can't be read after a dozen
tries, it ain't never gonna work.  It seems absurd that a drive would
keep trying for so long.

 Ideally though, don’t use drives that lack SCT support in multiple 
 device volume configurations. An up to 2 minute hang of the
 storage stack isn’t production compatible for most workflows.

Wasn't there an early failure flag that md ( and therefore, btrfs when
doing raid ) sets so the scsi stack doesn't bother with recovery
attempts and just fails the request?  Thus if the drive takes longer
than the scsi_timeout, the failure would be reported to btrfs, which
then can recover using the other copy, write it back to the bad drive,
and hopefully that fixes it?

In that case, you probably want to lower the timeout so that the
recover kicks in sooner instead of hanging your IO stack for 30 seconds.



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Chris Murphy

On Nov 18, 2014, at 1:58 PM, Phillip Susi ps...@ubuntu.com wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 On 11/18/2014 1:57 PM, Chris Murphy wrote:
 So a.) use smartctl -l scterc to change the value below 30 seconds 
 (300 deciseconds) with 70 deciseconds being reasonable. If the
 drive doesn’t support SCT commands, then b.) change the linux scsi
 command timer to be greater than 120 seconds.
 
 Strictly speaking the command timer would be set to a value that 
 ensures there are no link reset messages in dmesg, that it’s long 
 enough that the drive itself times out and actually reports a read 
 error. This could be much shorter than 120 seconds. I don’t know
 if there are any consumer drives that try longer than 2 minutes to 
 recover data from a marginally bad sector.
 
 Are there really any that take longer than 30 seconds?  That's enough
 time for thousands of retries.  If it can't be read after a dozen
 tries, it ain't never gonna work.  It seems absurd that a drive would
 keep trying for so long.

It’s well known on linux-raid@ that consumer drives have well over 30-second 
deep recoveries when they lack SCT command support. The WDC and Seagate 
“green” drives are over 2 minutes, apparently. This isn’t easy to test, because 
it requires a sector damaged enough that the ECC has to do real recovery work, 
and yet not so damaged that the drive gives up in less than 30 seconds. So you 
have to track down a drive model spec document (one of those 100-page ones).
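
(Whether a given drive supports SCT ERC at all is quick to check, for example:)

  # Prints the current read/write recovery timeouts, or reports that
  # SCT Error Recovery Control is not supported by this drive
  smartctl -l scterc /dev/sdX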

This makes sense, sorta, because the manufacturer use case is typically single 
drive only, and most proscribe raid5/6 with such products. So it’s a “recover 
data at all costs” behavior because it’s assumed to be the only (immediately) 
available copy.


 
 Ideally though, don’t use drives that lack SCT support in multiple 
 device volume configurations. An up to 2 minute hang of the
 storage stack isn’t production compatible for most workflows.
 
 Wasn't there an early failure flag that md ( and therefore, btrfs when
 doing raid ) sets so the scsi stack doesn't bother with recovery
 attempts and just fails the request?  Thus if the drive takes longer
 than the scsi_timeout, the failure would be reported to btrfs, which
 then can recover using the other copy, write it back to the bad drive,
 and hopefully that fixes it?

I don’t see how that’s possible, because unless the drive explicitly produces 
a read error (which includes the affected LBAs), it’s ambiguous what the actual 
problem is as far as the kernel is concerned. It has no way of knowing which of 
the possibly dozens of ata commands queued up in the drive has actually hung 
the drive, and no idea why the drive is hung up either.

The linux-raid@ list is chock full of users having these kinds of problems. It 
comes up pretty much every week. Someone has, say, a raid5, and in dmesg all 
they get are a bunch of ata bus reset messages. So someone tells them to change 
the scsi command timer for all the block devices that are members of the array 
in question and retry (reading the file, or scrubbing, or whatever), and lo and 
behold, no more ata bus reset messages. Instead they get explicit read errors 
with LBAs, and now md can fix the problem.
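
(A sketch of that workaround; the device names are illustrative, and the
change does not survive a reboot:)

  # Raise the scsi command timer (in seconds) on each whole-disk member of the array
  for disk in sda sdb sdc sdd; do
      echo 180 > /sys/block/"$disk"/device/timeout
  done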


 
 In that case, you probably want to lower the timeout so that the
 recover kicks in sooner instead of hanging your IO stack for 30 seconds.

No, I think 30 is pretty sane for servers using SATA drives, because if the bus 
is reset, all pending commands in the queue get obliterated, which is worse than 
just waiting up to 30 seconds. With SAS drives maybe less time makes sense. But 
in either case you still need configurable SCT ERC, or it needs to be a sane 
fixed default like 70 deciseconds.


Chris Murphy


Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Duncan
Phillip Susi posted on Tue, 18 Nov 2014 15:58:18 -0500 as excerpted:

 Are there really any that take longer than 30 seconds?  That's enough
 time for thousands of retries.  If it can't be read after a dozen tries,
 it ain't never gonna work.  It seems absurd that a drive would keep
 trying for so long.

I'm not sure about normal operation, but certainly, many drives take 
longer than 30 seconds to stabilize after power-on, and I routinely see 
resets during this time.

In fact, as I recently posted, power-up stabilization time can and often 
does break reliable resume of a multi-device block device or filesystem (my 
experience is with mdraid and btrfs raid) from suspend to RAM or hibernate to 
disk, either one or both, because often enough one device will take enough 
longer to stabilize than the others that it gets failed out of the raid.

This doesn't happen on single-hardware-device block devices and 
filesystems because in that case it's either up or down: if the device 
doesn't come up in time, the resume simply fails entirely, instead of 
coming up with one or more devices present but others missing because 
they didn't stabilize in time, as is unfortunately all too common in the 
multi-device scenario.

I've seen this with both spinning rust and SSDs, with mdraid and 
btrfs, with multiple mobos and device controllers, and with resume both 
from suspend to RAM (if the machine powers down the storage devices in 
that case, as most modern ones do) and hibernate to permanent storage, 
over several years' worth of kernel series, so it's a reasonably 
widespread phenomenon, at least among consumer-level SATA devices.  (My 
experience doesn't extend to enterprise-raid-level devices or proper 
SCSI, etc, so I simply don't know, there.)

While two minutes is getting a bit long, I think it's still within the normal 
range, and some devices definitely take over a minute often enough for it 
to be both noticeable and irritating.

That said, I SHOULD say I'd be far *MORE* irritated if the device simply 
pretended it was stable and started reading/writing data before it really 
had stabilized, particularly with SSDs where that sort of behavior has 
been observed and is known to put some devices at risk of complete 
scrambling of either media or firmware, beyond recovery at times.  That 
of course is the risk of going the other direction, and I'd a WHOLE lot 
rather have devices play it safe for another 30 seconds or so after they /
think/ they're stable and be SURE, than pretend to be just fine when 
voltages have NOT stabilized yet and thus end up scrambling things 
irrecoverably.  I've never had that happen here tho I've never stress-
tested for it, only done normal operation, but I've seen testing reports 
where the testers DID make it happen surprisingly easily, to a surprising 
number of their test devices.

So, umm... I suspect the 2-minute default is 2 minutes due to power-up 
stabilizing issues, where two minutes is a reasonable compromise between 
failing the boot most of the time if the timeout is too low, and taking 
excessively long for very little further gain.

And in my experience, the only way around that, at the consumer level at 
least, would be to split the timeouts: set something even higher, say 
2.5-3 minutes, for power-on, while lowering the operational timeout to 
something more sane, probably 30 seconds or so by default, but easily 
tunable down to 10-20 seconds (or even lower, 5 seconds, even for consumer-
level devices?) for those who have hardware that fits within that tolerance 
and want the performance.  But at least to my knowledge there's no such 
split in reset timeout values available (maybe for SCSI?), and due to auto-
spindown and power-saving I'm not sure whether it's even possible without 
some specific hardware feature to tell the kernel that the device has in 
fact NOT been in power-saving mode for, say, 5-10 minutes, hopefully long 
enough that voltage readings really /are/ fully stabilized and a shorter 
timeout is possible.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-17 Thread Roman Mamedov
On Tue, 18 Nov 2014 09:29:54 +0200
Brendan Hide bren...@swiftspirit.co.za wrote:

 Hey, guys
 
 See further below extracted output from a daily scrub showing csum 
 errors on sdb, part of a raid1 btrfs. Looking back, it has been getting 
 errors like this for a few days now.
 
 The disk is patently unreliable but smartctl's output implies there are 
 no issues. Is this somehow standard faire for S.M.A.R.T. output?

Not necessarily the disk's fault; it could be a SATA controller issue. How are
your disks connected, and which controller brand and chip? Add lspci output, at
least if the disks are connected to something other than the ordinary
motherboard chipset's built-in ports.
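
(Something along these lines is enough, for example:)

  # Show SATA/RAID/IDE controllers with their vendor:device IDs
  lspci -nn | grep -iE 'sata|raid|ide'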

-- 
With respect,
Roman