Re: [zfs-discuss] 1tb SATA drives

2010-07-24 Thread Haudy Kazemi



But if it were just the difference between 5min freeze when a drive
fails, and 1min freeze when a drive fails, I don't see that anyone
would care---both are bad enough to invoke upper-layer application
timeouts of iSCSI connections and load balancers, but not disastrous.

But it's not.  ZFS doesn't immediately offline the drive after one read
error.  Some people find it doesn't offline the drive at all, until
they notice which drive is taking multiple seconds to complete
commands and offline it manually.  So you get 1-5 minute freezes
several times a day, every time the slowly-failing drive hits a latent
sector error.

I'm saying the works:notworks comparison is not between TLER-broken
and non-TLER-broken.  I think the TLER fans are taking advantage of
people's binary debating bias to imply that TLER is the ``works OK''
case and non-TLER is ``broken: dont u see it's 5x slower.''  There are
three cases to compare for any given failure mode: TLER-failed,
non-TLER-failed, and working.  The proper comparison is therefore
between a successful read (7ms) and an unsuccessful read (7000ms * n
cargo-cult retries put into various parts of the stack to work around
some scar someone has on their knee from some weird thing an FC switch
once did in 1999).
  
If you give a drive enough retries on a sector that returns a read error,
sometimes it can get the data back.  I once had a project with an 80 GB
Maxtor IDE drive that I needed to get all the files off of.  One file (a
ZIP archive) sat over a sector with a read error.  I found that I could
get what appeared to be partial data from the sector using Ontrack
EasyRecovery, but the data read back from the 512-byte sector was
slightly different each time.  I manually repeated this a few times and
narrowed it down to a few bytes out of the 512 that differed on each
re-read attempt.  Looking at those further, I saw it was actually only a
few bits within each of those bytes that changed between reads, and I
could narrow that down as well by looking at how frequently each value
came back.  I knew the ZIP file stored a CRC-32 that would only match
the correct byte sequence, and figured I could write a brute-force
recovery for the remaining uncertain bytes.
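
For what it's worth, that brute-force idea is only a dozen lines of
Python.  The sketch below is hypothetical: the names and candidate lists
are invented, and it assumes the CRC-32 can be checked directly against
the recovered bytes (i.e. a stored rather than deflated ZIP member); for
a compressed member you would decompress each candidate before checking.

    # Hypothetical sketch: brute-force the uncertain bytes of a bad sector
    # against a known CRC-32.  'known_data' is the file's bytes with the bad
    # sector filled in from the most frequent read results; 'uncertain' maps
    # byte offsets to the candidate values seen across re-reads.
    import itertools
    import zlib

    def brute_force_bytes(known_data, uncertain, expected_crc32):
        positions = sorted(uncertain)
        for combo in itertools.product(*(uncertain[p] for p in positions)):
            candidate = bytearray(known_data)
            for pos, value in zip(positions, combo):
                candidate[pos] = value
            if zlib.crc32(bytes(candidate)) & 0xffffffff == expected_crc32:
                return bytes(candidate)   # this combination passes the CRC check
        return None                       # no combination matched

    # e.g. 4 uncertain bytes with 4 candidate values each is only 4**4 = 256
    # CRC computations, so the search finishes instantly.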


I didn't end up writing the code to do that, because I found something
else: GNU ddrescue.  It can image a drive with as many automatic
retries as you like, including infinitely many.  I didn't need the
drive right away, so I started ddrescue and let it work on the drive
over a whole weekend.  There was only one sector on the whole drive
that ddrescue was still trying to recover: the one under that file.
About two days later it finished reading, and when I mounted the drive
image I was able to open the ZIP file.  The CRC check passed, confirming
that after days of re-read attempts the drive had finally returned that
last sector.


It was really slow, but I had nothing to lose, and just wanted to see 
what would happen.  I've tried it since on other bad sectors with 
varying results.  Sometimes a couple hundred or thousand retries will 
get a lucky break and recover the sector.  Sometimes not.
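
The core of what ddrescue was doing on that one stubborn sector is just
a retry loop around a raw read.  A minimal, hypothetical illustration
(the device path, sector number, and retry limit are made-up examples;
this is not a substitute for ddrescue, which also keeps a map of what it
has recovered):

    # Hypothetical sketch: keep re-reading one sector of a raw device until
    # the drive finally returns it or we give up.  Reading a block device
    # directly usually requires root.
    import os
    import time

    def read_sector_with_retries(device, sector, sector_size=512,
                                 max_retries=1000):
        fd = os.open(device, os.O_RDONLY)
        try:
            for _ in range(max_retries):
                try:
                    os.lseek(fd, sector * sector_size, os.SEEK_SET)
                    return os.read(fd, sector_size)   # the drive got it this time
                except OSError:                       # typically EIO on a latent sector error
                    time.sleep(0.1)                   # brief pause before the next attempt
            return None                               # still unreadable after max_retries
        finally:
            os.close(fd)

    # e.g. read_sector_with_retries("/dev/sdb", 123456789)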




The unsuccessful read is thousands of times slower than normal
performance.  It doesn't make your array seem 5x slower during the
fail like the false TLER vs non-TLER comparison makes it seem.  It
makes your array seem entirely frozen.  The actual speed doesn't
matter: it's FROZEN.  Having TLER does not make FROZEN any faster than
FROZEN.
  

I agree.


The story here sounds great, so I can see why it spreads so well:
``during drive failures, the array drags performance a little, maybe
5x, until you locate the drive and replace it.  However, if you have
used +1 MAGICAL DRIVES OF RECKONING, the dragging is much reduced!
Unfortunately +1 magical drives are only appropriate for ENTERPRISE
use while at home we use non-magic drives, but you get what you pay
for.''  That all sounds fair, reasonable, and like good fun gameplay.
Unfortunately ZFS isn't a video game: it just fucking freezes.

bh The difference is that a fast fail with ZFS relies on ZFS to
bh fix the problem rather than degrading the array.

OK, but the decision of ``degrading the array'' means ``not sending
commands to the slowly-failing drive any more''.

which is actually the correct decision, the wrong course being to
continue sending commands there and ``patiently waiting'' for them to
fail instead of re-issuing them to redundant drives, even when waiting
thousands of standard deviations outside the mean request time.  TLER
or not, a failing drive will poison the array by making reads
thousands of times slower.
  
I agree.  This is the behavior all RAID-type devices should have,
whether hardware RAID, Linux md RAID, or ZFS: if a drive is slow to
respond, stop sending it read commands as long as there is enough
redundancy remaining to compute the data.  ZFS should have no problem
with this even though I understand that it needs
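
For illustration, here is a hypothetical sketch of that read policy
(invented names throughout; it has nothing to do with how ZFS actually
schedules I/O): prefer drives whose recent latency looks normal, and
only fall back to a slow drive when there isn't enough redundancy left
to avoid it.

    # Hypothetical sketch of a latency-aware read policy: skip any drive whose
    # recent latency is wildly out of line, as long as enough healthy drives
    # remain to satisfy the read.  Invented names; not ZFS code.
    from dataclasses import dataclass

    @dataclass
    class DriveStats:
        name: str
        mean_latency_ms: float   # rolling average of recent request latency
        online: bool = True

    def pick_read_targets(drives, needed, slow_factor=100.0):
        healthy = [d for d in drives if d.online]
        baseline = min(d.mean_latency_ms for d in healthy)
        fast = [d for d in healthy if d.mean_latency_ms <= baseline * slow_factor]
        pool = fast if len(fast) >= needed else healthy   # fall back if redundancy is gone
        return sorted(pool, key=lambda d: d.mean_latency_ms)[:needed]

    # A 3-way mirror needs only one copy, so a drive averaging 7000 ms against
    # a 7 ms baseline simply stops being chosen:
    drives = [DriveStats("disk0", 7.0), DriveStats("disk1", 7000.0),
              DriveStats("disk2", 8.0)]
    print([d.name for d in pick_read_targets(drives, needed=1)])   # ['disk0']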

Re: [zfs-discuss] 1tb SATA drives

2010-07-22 Thread Brandon High
On Fri, Jul 16, 2010 at 11:32 AM, Jordan McQuown
j...@larsondesigngroup.com wrote:
 I’m curious to know what other people are running for HD’s in white box
 systems? I’m currently looking at Seagate Barracuda’s and Hitachi Deskstars.
 I’m looking at the 1tb models. These will be attached to an LSI expander in
 a sc847e2 chassis driven by an LSI 9211-8i HBA. This system will be used as
 a large storage array for backups and archiving.

Some of the Deskstars are qualified to run in a RAID configuration,
but not all. The E7K1000 is, but the 7K1000, 7K1000.B, 7K1000.C and
7K2000 are not. Curiously, many of the drives are recommended for
video editing arrays, and the 7K2000 and A7K2000 share the same
specifications, including vibration tolerance. The only difference
appears to be the warranty and error rate.

I would not suggest using consumer drives from WD or Seagate for a
large array. Recent versions no longer support enabling TLER or ERC.
To the best of my knowledge, Samsung and Hitachi drives all support
CCTL, which is yet another name for the same thing.

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] 1tb SATA drives

2010-07-22 Thread Miles Nordin
 bh == Brandon High bh...@freaks.com writes:

bh Recent versions no longer support enabling TLER or ERC.  To
bh the best of my knowledge, Samsung and Hitachi drives all
bh support CCTL, which is yet another name for the same thing.

Once again I have to ask: has anyone actually found these features to
make a verified positive difference with ZFS?

Some of those things you cannot even set on Solaris, because the
channel to the drive through an LSI controller isn't sufficiently
transparent to support smartctl, and the settings don't survive
reboots.  Brandon, have you actually set it yourself, or are you just
aggregating forum discussion?

The experience I've read about here so far has been:

 * if a drive goes bad completely

   + ZFS will mark the drive unavailable after a delay that depends on
     the controller you're using, with lengths like 60 seconds,
     180 seconds, 2 hours, or forever.  The delay is not sane or
     reasonable with every controller, and even if redundancy is
     available ZFS will patiently wait for the controller.  The delay
     depends on the controller driver; it's part of the Solaris code.
     Best case, the zpool will freeze until the delay is up, but there
     are application timeouts and iSCSI initiator-target timeouts,
     too---getting the equivalent of an NFS hard mount is hard these
     days (even with NFS, in some people's experiences).

   + the delay is different if the system is running when the drive
     fails versus when it's trying to boot up.  For example, iSCSI will
     ``patiently wait'' forever for a drive to appear while booting
     up, but will notice after 180 seconds while running.

   + because the disk is completely bad, TLER, ERC, CCTL, whatever you
     call it, doesn't apply.  The drive might not answer commands
     ever, at all.  The timer is not in the drive: the drive is bad
     starting now, continuing forever.

 * if a drive goes partially bad (large and increasing numbers of
   latent sector errors, which for me happens more often than
   bad-completely):

   + the zpool becomes unusably slow

   + it stays unusably slow until you use 'iostat' or 'fmdump' to find
     the marginal drive and offline it

   + TLER, ERC, CCTL only changes the slowness factor: roughly
     7ms : 7000ms with the feature versus 7ms : tens of seconds
     without it.  In other words, it's unusably slow with or
     without the feature.

AFAICT the feature is useful as a workaround for buggy RAID card
firmware and nothing else.  It's a cost differentiator, and you're
swallowing it hook, line and sinker.

If you know otherwise please reinform me, but the discussion here so
far doesn't match what I've learned about ZFS and Solaris exception
handling.

That said, to reword Don Marti, ``uninformed Western Digital bashing
is better than no Western Digital bashing at all.''




Re: [zfs-discuss] 1tb SATA drives

2010-07-22 Thread Brandon High
On Thu, Jul 22, 2010 at 11:14 AM, Miles Nordin car...@ivy.net wrote:
 reboots.  Brandon, have you actually set it yourself, or are you just
 aggregating forum discussion?

I'm using an older revision of the WD10EADS drives that allows TLER to
be enabled via WDTLER.EXE. I have not had a drive fail in this
environment, so I can't speak from personal experience.

I'm basing my statement on what I've read in the product specs from
the manufacturer and what I've heard about newer revisions of the
drives.

 AFAICT the feature is useful as a workaround for buggy RAID card
 firmware and nothing else.  It's a cost differentiator, and you're
 swallowing it hook, line and sinker.

ERC is part of the ATA-8 spec. WD and Seagate fail to recognize the
command on their desktop drives. Hitachi and Samsung implement it.

 If you know otherwise please reinform me, but the discussion here so
 far doesn't match what I've learned about ZFS and Solaris exception
 handling.

The idea of ERC is to have the drive return an error before the host
timeout expires. With a 60-second timeout and 5 retries, it could
conceivably take 5 minutes for a bad read to fail past the SCSI driver.
For those 5 minutes you'll see horrible performance. If the drive
returns an error within 7-10 seconds instead, it would only take 35-50
seconds to fail.
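
To spell out that arithmetic (the figures below are just the ones quoted
above, not values any particular driver guarantees):

    # Worst-case stall for one bad read = per-command recovery time x host retries.
    # The numbers are the example figures from the paragraph above.
    def worst_case_stall_seconds(per_command_timeout_s, host_retries=5):
        return per_command_timeout_s * host_retries

    print(worst_case_stall_seconds(60))   # no ERC: 60 s x 5 retries = 300 s (5 minutes)
    print(worst_case_stall_seconds(7))    # ERC at 7 s:  35 s
    print(worst_case_stall_seconds(10))   # ERC at 10 s: 50 s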

ERC allows you to fast-fail with the assumption that you'll correct
the error at a higher level. This is true of hardware RAID cards that
offline a disk that is slow to respond, as well as of ZFS and other
software RAID mechanisms. The difference is that a fast fail with ZFS
relies on ZFS to fix the problem rather than degrading the array.

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] 1tb SATA drives

2010-07-19 Thread Eric D. Mudama

On Fri, Jul 16 at 18:32, Jordan McQuown wrote:

  I'm curious to know what other people are running for HD's in white box
  systems? I'm currently looking at Seagate Barracuda's and Hitachi
  Deskstars. I'm looking at the 1tb models. These will be attached to an LSI
  expander in a sc847e2 chassis driven by an LSI 9211-8i HBA. This system
  will be used as a large storage array for backups and archiving.


Dell shipped us WD RE3 drives in the server we bought from them;
they've been working great and come in a 1 TB size.  Not sure about the
expander, but they talk just fine to the 9211 HBAs.


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



[zfs-discuss] 1tb SATA drives

2010-07-16 Thread Jordan McQuown
I'm curious to know what other people are running for HDs in white-box
systems.  I'm currently looking at Seagate Barracudas and Hitachi
Deskstars, specifically the 1 TB models.  These will be attached to an
LSI expander in an SC847E2 chassis driven by an LSI 9211-8i HBA.  This
system will be used as a large storage array for backups and archiving.

Thanks,
Jordan



Re: [zfs-discuss] 1tb SATA drives

2010-07-16 Thread Arne Jansen

Jordan McQuown wrote:
I’m curious to know what other people are running for HD’s in white box 
systems? I’m currently looking at Seagate Barracuda’s and Hitachi 
Deskstars. I’m looking at the 1tb models. These will be attached to an 
LSI expander in a sc847e2 chassis driven by an LSI 9211-8i HBA. This 
system will be used as a large storage array for backups and archiving.


I wouldn't recommend using desktop drives in a server RAID.  They don't
handle the vibration present in a server chassis well.  I'd recommend
at least the Seagate Constellation or the Hitachi Ultrastar, though I
haven't tested the Deskstar myself.

--Arne

 
Thanks,

Jordan
 




