Re: [zfs-discuss] BAD WD drives - defective by design?

2011-09-08 Thread Ian Collins

 On 09/09/11 06:40 AM, Richard Elling wrote:

On Sep 7, 2011, at 2:05 AM, Roy Sigurd Karlsbakk wrote:

A drive retrying a single sector for two whole minutes is nonsense, even on a
desktop or laptop, at least when it does so without logging the error to SMART
or summing up the issues so as to flag the disk unusable. And, believe it or
not, a drive spending 2 minutes trying to fetch 512 bytes from a dead sector is
quite unusable when the number of bad sectors starts climbing.

Yes, but that is the current state of the market, and this change has become
more pronounced in the past few generations. Experienced systems architects
know this and design accordingly.


I think I've fallen victim to this one!  I had a pair of older WD Black 
drives in a test system in a hostile environment (my garage!) for 18 
months without any issues, so I recently replaced the remaining 8 drives 
in the pool with the current model.  I now see these warnings in my logs:


scsi: [ID 243001 kern.warning] WARNING: 
/pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0):

   mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120303

Now, whether that's due to a change of controller or one of the new 
drives wandering off into la-la land, I'm not sure.



The disk vendors provide product roadmaps so you can plan for the future
(Seagate's is quite good reading :-)


Indeed.  I'll have to update my reading habits!

--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] BAD WD drives - defective by design?

2011-09-08 Thread Richard Elling
On Sep 7, 2011, at 2:05 AM, Roy Sigurd Karlsbakk wrote:

>> The common use for desktop drives is having a single disk without
>> redundancy.. If a sector is feeling bad, it's better if it tries a bit
>> harder to recover it than just say "blah, there was a bit of dirt in
>> the corner.. I don't feel like looking at it, so I'll just say your data
>> is screwed instead".. In a raid setup, that data is sitting safe(?) on
>> some other disk as well, so it might as well give up early.
> 
> Still, there's a wee difference between shaving and cutting your head off.

Today, it is in the best interest of the suppliers to do this. They can show
concrete product differentiation to support increased margins. Business 101.

> A drive retrying a single sector for two whole minutes is nonsense, even on a
> desktop or laptop, at least when it does so without logging the error to SMART
> or summing up the issues so as to flag the disk unusable. And, believe it or
> not, a drive spending 2 minutes trying to fetch 512 bytes from a dead sector
> is quite unusable when the number of bad sectors starts climbing.

Yes, but that is the current state of the market, and this change has become
more pronounced in the past few generations. Experienced systems architects
know this and design accordingly. The disk vendors provide product roadmaps
so you can plan for the future (Seagate's is quite good reading :-)
 -- richard



Re: [zfs-discuss] BAD WD drives - defective by design?

2011-09-07 Thread Roy Sigurd Karlsbakk
> The common use for desktop drives is having a single disk without
> redundancy.. If a sector is feeling bad, it's better if it tries a bit
> harder to recover it than just say "blah, there was a bit of dirt in
> the corner.. I don't feel like looking at it, so I'll just say your data
> is screwed instead".. In a raid setup, that data is sitting safe(?) on
> some other disk as well, so it might as well give up early.

Still, there's a wee difference between shaving and cutting your head off. A 
drive retrying a single sector for two whole minutes is nonsense, even on a 
desktop or laptop, at least when it does so without logging the error to SMART 
or summing up the issues so as to flag the disk unusable. And, believe it or 
not, a drive spending 2 minutes trying to fetch 512 bytes from a dead sector is 
quite unusable when the number of bad sectors starts climbing.
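Roy's point about drives not surfacing their retries can at least be checked from the host side: smartmontools reports the relevant attributes, and a few lines of scripting can flag a drive whose reallocated or pending sector counts are climbing. A minimal sketch, assuming the usual `smartctl -A` column layout; the thresholds and the sample rows below are illustrative assumptions, not WD or SMART policy:

```python
# Sketch: flag a drive as suspect from smartctl-style attribute output.
# The thresholds here are illustrative, not vendor policy.
SUSPECT_ATTRS = {"Reallocated_Sector_Ct": 10, "Current_Pending_Sector": 1}

def suspect_attributes(smartctl_output: str) -> dict:
    """Return {attribute: raw_value} for attributes at or over threshold."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # A real attribute row has 10 columns: ID# NAME FLAG VALUE WORST
        # THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10:
            continue
        name, raw = fields[1], fields[-1]
        if name in SUSPECT_ATTRS and raw.isdigit():
            if int(raw) >= SUSPECT_ATTRS[name]:
                flagged[name] = int(raw)
    return flagged

# Hand-written sample rows in the usual `smartctl -A` layout:
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033 100 100 140 Pre-fail Always - 37
197 Current_Pending_Sector  0x0032 200 200 000 Old_age  Always - 2
"""

print(suspect_attributes(SAMPLE))
```

Run periodically across a pool's members, this gives an early warning even when the drive itself refuses to give up on a sector.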

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
[In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for every pedagogue to avoid excessive use of
idioms of foreign origin. In most cases adequate and relevant synonyms exist
in Norwegian.]


Re: [zfs-discuss] BAD WD drives - defective by design?

2011-09-07 Thread Tomas Forsman
On 07 September, 2011 - Roy Sigurd Karlsbakk sent me these 2,0K bytes:

> > http://wdc.custhelp.com/app/answers/detail/a_id/1397/~/difference-between-desktop-edition-and-raid-%28enterprise%29-edition-drives
> 
> "When an error is found on a desktop edition hard drive, the drive will enter 
> into a deep recovery cycle to attempt to repair the error, recover the data 
> from the problematic area, and then reallocate a dedicated area to replace 
> the problematic area. This process can take up to 2 minutes depending on the 
> severity of the issue"
> 
> Or in other words: "When an error occurs on a desktop drive, the drive will 
> refuse to realize the sector is bad, and retry forever "

The common use for desktop drives is having a single disk without
redundancy.. If a sector is feeling bad, it's better if it tries a bit
harder to recover it than just say "blah, there was a bit of dirt in the
corner.. I don't feel like looking at it, so I'll just say your data is
screwed instead".. In a raid setup, that data is sitting safe(?) on some
other disk as well, so it might as well give up early.

So don't use desktop drives in RAID and don't use RAID disks in a
desktop setup. Of course, this is just a config setting - but it's still
reality.
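The "config setting" in question is SCT Error Recovery Control (WD's marketing name is TLER). On drives that still honour it, smartmontools can query and set it; `/dev/sda` is a placeholder, and, as this thread suggests, newer WD desktop firmware may simply reject the command:

```shell
# Query the current SCT Error Recovery Control (TLER/ERC) setting.
# Requires smartmontools; /dev/sda is a placeholder device.
smartctl -l scterc /dev/sda

# Cap read/write recovery at 7.0 seconds each (values are in 100 ms
# units), roughly matching RAID-edition behaviour. The setting is
# typically not persistent across power cycles, so reapply it at boot.
smartctl -l scterc,70,70 /dev/sda
```

If the first command reports that SCT ERC is unsupported, the firmware has closed this door and no host-side tuning will shorten the recovery cycle.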

/Tomas
-- 
Tomas Forsman, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] BAD WD drives - defective by design?

2011-09-07 Thread Roy Sigurd Karlsbakk
> http://wdc.custhelp.com/app/answers/detail/a_id/1397/~/difference-between-desktop-edition-and-raid-%28enterprise%29-edition-drives

"When an error is found on a desktop edition hard drive, the drive will enter 
into a deep recovery cycle to attempt to repair the error, recover the data 
from the problematic area, and then reallocate a dedicated area to replace the 
problematic area. This process can take up to 2 minutes depending on the 
severity of the issue"

Or in other words: "When an error occurs on a desktop drive, the drive will 
refuse to accept that the sector is bad and retry forever and ever without even 
incrementing the SMART counters, so that even if Western Digital Data LifeGuard 
Diagnostics needs 36 hours to test the drive (as opposed to the normal five-ish 
hours for a 2TB 7200rpm drive), WD will refuse to take the drive back because IT 
WORKS".

Or in yet other words: "Desktop drives aren't meant to be used for anything 
productive or important."

Vennlige hilsener / Best regards

roy


Re: [zfs-discuss] BAD WD drives - defective by design?

2011-09-06 Thread John Martin

http://wdc.custhelp.com/app/answers/detail/a_id/1397/~/difference-between-desktop-edition-and-raid-%28enterprise%29-edition-drives


Re: [zfs-discuss] BAD WD drives - defective by design?

2011-09-06 Thread Richard Elling
On Aug 29, 2011, at 2:07 PM, Roy Sigurd Karlsbakk wrote:

> Hi all
> 
> It seems recent WD drives that aren't "RAID edition" can cause rather a lot
> of problems on RAID systems. We have a few machines with LSI controllers
> (6801/6081/9201) and we're seeing massive errors occurring. The usual pattern
> is a drive failing, or even a resilver/scrub starting, and then, suddenly,
> most drives on the whole backplane report errors. These are usually No Device
> (as reported by iostat -En), but the result is that we may see data
> corruption at the end. We also have a system set up with Hitachi Deskstars,
> which has been running for almost a year without issues. One system with a
> mixture of WD Blacks and Greens showed the same errors as described, but has
> been working well after the WD drives were replaced by Deskstars.

Sounds familiar.

> 
> Now, it seems WD has changed their firmware to inhibit people from using them 
> for other things than toys (read: PCs etc). Since we've seen this issue on 
> different controllers and different drives, and can't reproduce it with 
> Hitachi Deskstars, I would guess the firmware "upgrade" from WD is the issue.

Likely.

> 
> Would it be possible to fix this in ZFS somehow?

No, the error is 1-2 layers below ZFS.

> The drives seem to work well except for those "No Device" errors….

'nuff said.
 -- richard



[zfs-discuss] BAD WD drives - defective by design?

2011-08-29 Thread Roy Sigurd Karlsbakk
Hi all

It seems recent WD drives that aren't "RAID edition" can cause rather a lot of 
problems on RAID systems. We have a few machines with LSI controllers 
(6801/6081/9201) and we're seeing massive errors occurring. The usual pattern is 
a drive failing, or even a resilver/scrub starting, and then, suddenly, most 
drives on the whole backplane report errors. These are usually No Device (as 
reported by iostat -En), but the result is that we may see data corruption at 
the end. We also have a system set up with Hitachi Deskstars, which has been 
running for almost a year without issues. One system with a mixture of WD 
Blacks and Greens showed the same errors as described, but has been working 
well after the WD drives were replaced by Deskstars.

Now, it seems WD has changed their firmware to prevent people from using these 
drives for anything other than toys (read: PCs etc.). Since we've seen this 
issue on different controllers and different drives, and can't reproduce it 
with Hitachi Deskstars, I would guess the firmware "upgrade" from WD is the 
issue.

Would it be possible to fix this in ZFS somehow? The drives seem to work well 
except for those "No Device" errors
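Those "No Device" events do accumulate in the per-device counters that `iostat -En` prints, so a short script can show which drives on a backplane are actually racking up errors. A sketch, assuming the usual Solaris `iostat -En` layout; the device names and counts in SAMPLE are invented for illustration:

```python
import re

# Match the first line of each device stanza in `iostat -En` output,
# e.g. "c7t2d0  Soft Errors: 0 Hard Errors: 43 Transport Errors: 118"
DEVICE_RE = re.compile(
    r"^(\S+)\s+Soft Errors: (\d+) Hard Errors: (\d+) Transport Errors: (\d+)")

def error_counts(iostat_output: str) -> dict:
    """Map device name -> (soft, hard, transport) error counts."""
    counts = {}
    for line in iostat_output.splitlines():
        m = DEVICE_RE.match(line)
        if m:
            dev, soft, hard, transport = m.groups()
            counts[dev] = (int(soft), int(hard), int(transport))
    return counts

# Invented sample output for illustration:
SAMPLE = """\
c7t2d0 Soft Errors: 0 Hard Errors: 43 Transport Errors: 118
Vendor: ATA Product: WDC WD2002FAEX-0 Revision: 1D05
c7t3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Hitachi HDS72202 Revision: A3EA
"""

# Drives with any hard or transport errors are the ones to watch.
noisy = {d: c for d, c in error_counts(SAMPLE).items() if c[1] or c[2]}
print(noisy)
```

Comparing these counts before and after a scrub makes it easier to tell a genuinely dying drive from a backplane-wide cascade like the one described above.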

Vennlige hilsener / Best regards

roy