Re: [zfs-discuss] hot spares - in standby?

2007-02-05 Thread Richard Elling

Torrey McMahon wrote:

Richard Elling wrote:

Good question. If you consider that mechanical wear out is what ultimately
causes many failure modes, then the argument can be made that a spun down
disk should last longer. The problem is that there are failure modes which
are triggered by a spin up.  I've never seen field data showing the difference
between the two.


Often, the spare is up and running but for whatever reason you'll have a
bad block on it and you'll die during the reconstruct. Periodically
checking the spare means reading from and writing to it over time to
make sure it's still OK. (You take the spare out of the trunk, you look
at it, you check the tire pressure, etc.) The issue I see coming down
the road is that we'll start getting into a "Golden Gate paint job"
situation, where it takes so long to check the spare that we'll just
keep the process going constantly. Not as much wear and tear as real
I/O, but the spare will still be up and running the entire time and you
won't be able to spin it down.


In my experience, checking the spare tire leads to getting a flat and needing
the spare about a week later :-)  It has happened to me twice in the past
few years... I suspect a conspiracy... :-)

Back to the topic, I'd believe that some combination of hot, warm, and
cold spares would be optimal.

Anton B. Rang wrote:
 Shouldn't SCSI/ATA block sparing handle this?  Reconstruction should be
 purely a matter of writing, so bit rot shouldn't be an issue; or are
 there cases I'm not thinking of? (Yes, I know there are a limited number of
 spare blocks, but I wouldn't expect a spare which is turned off to develop
 severe media problems...am I wrong?)

In the disk, at the disk block level, there is fairly substantial ECC.
Yet, we still see data loss.  There are many mechanisms at work here.  One
that we have studied in some detail is superparamagnetic decay -- the medium
wishes to decay to a lower-energy state, losing information in the process.
One way to prevent this is to rewrite the data -- basically resetting the
decay clock.  The study we did on this says that rewriting your data once
per year is reasonable.  Note that ZFS is COW, and scrubbing is currently a
read operation which will only write when data needs to be reconstructed.
I look at this as: rewrite-style scrubbing is preventative; read-and-verify-style
scrubbing is prescriptive.  Either is better than neither.

In short, use spares and scrub.
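
As a rough illustration (not an official recommendation), a periodic scrub
can be driven from cron; the pool name "tank" and the schedule here are just
placeholders:

  # root crontab fragment: scrub the pool every Sunday at 03:00
  0 3 * * 0 /usr/sbin/zpool scrub tank
  # review the results later with: zpool status -v tank

zpool scrub returns immediately and runs in the background, so a later
"zpool status" is what actually tells you whether anything had to be repaired.
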
 -- richard


Re: [zfs-discuss] hot spares - in standby?

2007-02-02 Thread Torrey McMahon

Richard Elling wrote:


Good question. If you consider that mechanical wear out is what ultimately
causes many failure modes, then the argument can be made that a spun down
disk should last longer. The problem is that there are failure modes which
are triggered by a spin up.  I've never seen field data showing the difference
between the two.


Often, the spare is up and running but for whatever reason you'll have a
bad block on it and you'll die during the reconstruct. Periodically
checking the spare means reading from and writing to it over time to
make sure it's still OK. (You take the spare out of the trunk, you look
at it, you check the tire pressure, etc.) The issue I see coming down
the road is that we'll start getting into a "Golden Gate paint job"
situation, where it takes so long to check the spare that we'll just
keep the process going constantly. Not as much wear and tear as real
I/O, but the spare will still be up and running the entire time and you
won't be able to spin it down.





Re: [zfs-discuss] hot spares - in standby?

2007-02-01 Thread Al Hopper
On Wed, 31 Jan 2007 [EMAIL PROTECTED] wrote:


 I understand all the math involved with RAID 5/6 and failure rates,
 but it's wise to remember that even if the probabilities are small
 they aren't zero. :)

Agreed.  Another thing I've seen, is that if you have an A/C (Air
Conditioning) event in the data center or lab, you will usually see a
cluster of failures over the next 2 to 3 weeks.  Effectively, all your
disk drives have been thermally stressed and are likely to exhibit a spike
in the failure rates in the near term.

Often, in a larger environment, the facilities personnel don't understand
the correlation between an A/C event and disk drive failure rates.  And
major A/C upgrade work is often scheduled over a (long) weekend when most
of the technical talent won't be present.  After the work is completed,
everyone is told that it went very well (because the organization does
not "do" bad news), and then you lose two drives in a RAID5 array ...

 And after 3-5 years of continuous operation, you better decommission the
 whole thing or you will have many disk failures.

Agreed.  We took an 11-disk FC hardware RAID box offline recently because
all the drives were 5 years old.  It's tough to hit those power-off
switches and scrap working disk drives, but much better than the business
disruption and professional embarrassment caused by data loss.  And much
better to be in control of, and experience, *scheduled* downtime.  BTW:
don't forget that if you plan to continue to use the disk enclosure
hardware you need to replace _all_ the fans first.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
 OpenSolaris Governing Board (OGB) Member - Feb 2006


Re: [zfs-discuss] hot spares - in standby?

2007-01-31 Thread David Magda

On Jan 30, 2007, at 09:52, Luke Scharf wrote:

Hey, I can take a double-drive failure now!  And I don't even need  
to rebuild!  Just like having a hot spare with raid5, but without  
the rebuild time!


Theoretically you want to rebuild as soon as possible, because  
running in degraded mode (even with dual-parity) increases your  
chances of data loss (even though the probabilities involved may seem  
remote).


Case in point: recently at work we had a drive fail in a server with a
5+1 RAID5 configuration.  We replaced it, and about 2-3 weeks later a
separate drive failed. Even with dual-parity, if we hadn't replaced /
rebuilt things we would now be cutting it close.


I understand all the math involved with RAID 5/6 and failure rates,
but it's wise to remember that even if the probabilities are small
they aren't zero. :)




Re: [zfs-discuss] hot spares - in standby?

2007-01-31 Thread Casper . Dik

I understand all the math involved with RAID 5/6 and failure rates,
but it's wise to remember that even if the probabilities are small
they aren't zero. :)

And after 3-5 years of continuous operation, you better decommission the
whole thing or you will have many disk failures.

Casper


Re: [zfs-discuss] hot spares - in standby?

2007-01-31 Thread Luke Scharf

David Magda wrote:

On Jan 30, 2007, at 09:52, Luke Scharf wrote:

Hey, I can take a double-drive failure now!  And I don't even need 
to rebuild!  Just like having a hot spare with raid5, but without the 
rebuild time!


Theoretically you want to rebuild as soon as possible, because running 
in degraded mode (even with dual-parity) increases your chances of 
data loss (even though the probabilities involved may seem remote).
Case in point: recently at work we had a drive fail in a server with a
5+1 RAID5 configuration.  We replaced it, and about 2-3 weeks later a
separate drive failed. Even with dual-parity, if we hadn't replaced / 
rebuilt things we would now be cutting it close. 


I did misspeak -- with raidz2, I still do have to replace the failed 
drive ASAP!


However, with raidz2, you don't have to wait hours for the rebuild to 
occur before the second drive can fail; with a hot-spare, the first and 
second failures (provided that the failures occur on the array-drives, 
rather than on the spare) must happen several hours apart.  With raidz2 
on the same hardware, the two failures can happen at the same time -- 
and the array can still be rebuilt.
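
To make the comparison concrete, here is a rough sketch of the two layouts
on the same six drives (device names are made up):

  # 5-drive single-parity raidz plus one hot spare
  zpool create tank raidz c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 spare c0t6d0

  # the same six drives as double-parity raidz2, no spare
  zpool create tank raidz2 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0

Both can survive two drive failures in some sense, but only the raidz2
layout survives two failures that arrive before a resilver onto the spare
has completed.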


But, I guess the utility of the hot-spare depends a lot on the number of 
drives available, and on the layout.  In my case, most of the hardware 
that I have is Apple XRaid units and, when using the hardware RAID
inside the unit, the hot-spare must be in the same half of the box as
the failed drive -- in these small, constrained RAIDs, raidz2 would be much
better than raidz and a spare because of the rebuild-time.  With 
Thumper+ZFS or something like that, though, the spare could be anywhere, 
and I think I'd like having a few hot/warm spares on the machine that 
could be zinged into service if an array member fails.


-Luke





Re: [zfs-discuss] hot spares - in standby?

2007-01-30 Thread Luke Scharf

David Magda wrote:

What about a rotating spare?

When setting up a pool a lot of people would (say) balance things 
around buses and controllers to minimize single  points of failure, 
and a rotating spare could disrupt this organization, but would it be 
useful at all?


Functionally, that sounds a lot like raidz2!

Hey, I can take a double-drive failure now!  And I don't even need to 
rebuild!  Just like having a hot spare with raid5, but without the 
rebuild time!


Though I can see a "raidz-N" being useful -- just tell ZFS how many
parity drives you want, and we'll take care of the rest.


-Luke






Re: [zfs-discuss] hot spares - in standby?

2007-01-30 Thread Albert Chin
On Mon, Jan 29, 2007 at 09:37:57PM -0500, David Magda wrote:
 On Jan 29, 2007, at 20:27, Toby Thain wrote:
 
 On 29-Jan-07, at 11:02 PM, Jason J. W. Williams wrote:
 
 I seem to remember the Massive Array of Independent Disk guys ran  
 into
 a problem I think they called static friction, where idle drives  
 would
 fail on spin up after being idle for a long time:
 
 You'd think that probably wouldn't happen to a spare drive that was  
 spun up from time to time. In fact this problem would be (mitigated  
 and/or) caught by the periodic health check I suggested.
 
 What about a rotating spare?
 
 When setting up a pool a lot of people would (say) balance things  
 around buses and controllers to minimize single  points of failure,  
 and a rotating spare could disrupt this organization, but would it be  
 useful at all?

Agami Systems has the concept of Enterprise Sparing, where the hot
spare is distributed amongst data drives in the array. When a failure
occurs, the rebuild occurs in parallel across _all_ drives in the
array:
  http://www.issidata.com/specs/agami/enterprise-classreliability.pdf

-- 
albert chin ([EMAIL PROTECTED])


[zfs-discuss] hot spares - in standby?

2007-01-29 Thread Toby Thain

Hi,

This is not exactly ZFS specific, but this still seems like a  
fruitful place to ask.


It occurred to me today that hot spares could sit in standby (spun  
down) until needed (I know ATA can do this, I'm supposing SCSI does  
too, but I haven't looked at a spec recently). Does anybody do this?  
Or does everybody do this already?


Does the tub curve (chance of early life failure) imply that hot  
spares should be burned in, instead of sitting there doing nothing  
from new? Just like a data disk, seems to me you'd want to know if a  
hot spare fails while waiting to be swapped in. Do they get tested  
periodically?


--Toby


Re: [zfs-discuss] hot spares - in standby?

2007-01-29 Thread Bill Moore
You could easily do this in Solaris today by just using power.conf(4).
Just have it spin down any drives that have been idle for a day or more.
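
For example (the device path and threshold below are illustrative only, not a
tested recipe), power.conf(4) entries along these lines tell the power
framework to spin a drive down after a long idle period:

  # /etc/power.conf fragment
  device-thresholds   /pci@0,0/pci1022,7450@2/scsi@3/disk@6,0   24h
  autopm              enable

Run pmconfig(1M) afterwards to have the changes picked up.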

The periodic testing part would be an interesting project to kick off.


--Bill


On Mon, Jan 29, 2007 at 08:21:16PM -0200, Toby Thain wrote:
 Hi,
 
 This is not exactly ZFS specific, but this still seems like a  
 fruitful place to ask.
 
 It occurred to me today that hot spares could sit in standby (spun  
 down) until needed (I know ATA can do this, I'm supposing SCSI does  
 too, but I haven't looked at a spec recently). Does anybody do this?  
 Or does everybody do this already?
 
 Does the tub curve (chance of early life failure) imply that hot  
 spares should be burned in, instead of sitting there doing nothing  
 from new? Just like a data disk, seems to me you'd want to know if a  
 hot spare fails while waiting to be swapped in. Do they get tested  
 periodically?
 
 --Toby


Re: [zfs-discuss] hot spares - in standby?

2007-01-29 Thread Al Hopper
On Mon, 29 Jan 2007, Toby Thain wrote:

 Hi,

 This is not exactly ZFS specific, but this still seems like a
 fruitful place to ask.

 It occurred to me today that hot spares could sit in standby (spun
 down) until needed (I know ATA can do this, I'm supposing SCSI does
 too, but I haven't looked at a spec recently). Does anybody do this?
 Or does everybody do this already?

I don't work with enough disk storage systems to know what is the industry
norm.  But there are 3 broad categories of disk drive spares:

a) Cold Spare.  A spare where the power is not connected until it is
required.  [1]

b) Warm Spare.  A spare that is active but placed into a low power mode,
or into a low mechanical wear-and-tear mode.  In the case of a disk drive,
the controller board is active but the HDA (Head Disk Assembly) is
inactive (platters are stationary, heads unloaded [if the heads are
physically unloaded]); it has power applied and can be made hot by a
command over its data/command (bus) connection.  The supervisory
hardware/software/firmware knows how long it *should* take the drive to
go from warm to hot.

c) Hot Spare.  A spare that is spun up and ready to accept
read/write/position (etc) requests.

 Does the tub curve (chance of early life failure) imply that hot
 spares should be burned in, instead of sitting there doing nothing
 from new? Just like a data disk, seems to me you'd want to know if a
 hot spare fails while waiting to be swapped in. Do they get tested
 periodically?

The ideal scenario, as you already allude to, would be for the disk
subsystem to initially configure the drive as a hot spare and send it
periodic test events for, say, the first 48 hours.  This would get it
past the first segment of the bathtub reliability curve - often referred
to as the infant mortality phase.  After that, (ideally) it would be
placed into warm standby mode and it would be periodically tested (once
a month??).
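
ZFS doesn't automate any of this today, but a crude approximation of the
monthly test is possible by hand; the device name and schedule below are
purely hypothetical:

  # root crontab fragment: on the 1st of each month, do a full sequential
  # read of the spare so that latent media or spin-up problems show up
  # before the spare is actually needed
  0 2 1 * * /usr/bin/dd if=/dev/rdsk/c0t6d0s2 of=/dev/null bs=1024k

A read pass like this spins the drive up and sweeps the whole surface without
touching the data; anything smarter (burn-in windows, warm/cold transitions)
needs the kind of subsystem support discussed in the rest of this message.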

If saving power was the highest priority, then the ideal situation would
be where the disk subsystem could apply/remove power to the spare and move
it from warm to cold upon command.

One trick with disk subsystems, like ZFS that have yet to have the FMA
type functionality added and which (today) provide for hot spares only, is
to initially configure a pool with one (hot) spare, and then add a 2nd hot
spare, based on installing a brand new device, say, 12 months later.  And
another spare 12 months later.  What you are trying to achieve, with this
strategy, is to avoid the scenario whereby mechanical systems, like disk
drives, tend to wear out within the same general, relatively short,
timeframe.

One (obvious) issue with this strategy, is that it may be impossible to
purchase the same disk drive 12 and 24 months later.  However, it's always
possible to purchase a larger disk drive and simply commit to the fact
that the extra space provided by the newer drive will be wasted.
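
In zpool terms the staggering is just a matter of adding spares over time; a
minimal sketch, with made-up pool and device names:

  # year 0: create the pool with one hot spare
  zpool create tank raidz c0t1d0 c0t2d0 c0t3d0 c0t4d0 spare c0t5d0

  # roughly 12 months later: add a second, newly purchased spare
  zpool add tank spare c0t6d0

  # roughly 24 months later: add a third
  zpool add tank spare c0t7d0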

[1] The most common example is a disk drive mounted on a carrier but not
seated within the disk drive enclosure.  Simply push it in when required.

Off Topic: To go off on a tangent - the same strategy applies to a UPS
(Uninterruptable Power Supply).  As per the following time line:

year 0: purchase the UPS and one battery cabinet
year 1: purchase and attach an additional battery cabinet
year 2: purchase and attach an additional battery cabinet
year 3: purchase and attach an additional battery cabinet
year 4: purchase and attach an additional battery cabinet and remove the
oldest battery cabinet
year 5 ... N: repeat year 4's scenario until it's time to replace the UPS.

The advantage of this scenario is that you can budget a *fixed* cost for
the UPS and your management understands that there is a recurring cost so
that, when the power fails, your UPS will have working batteries!!

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
 OpenSolaris Governing Board (OGB) Member - Feb 2006


Re: [zfs-discuss] hot spares - in standby?

2007-01-29 Thread Toby Thain


On 29-Jan-07, at 9:04 PM, Al Hopper wrote:


On Mon, 29 Jan 2007, Toby Thain wrote:


Hi,

This is not exactly ZFS specific, but this still seems like a
fruitful place to ask.

It occurred to me today that hot spares could sit in standby (spun
down) until needed (I know ATA can do this, I'm supposing SCSI does
too, but I haven't looked at a spec recently). Does anybody do this?
Or does everybody do this already?


I don't work with enough disk storage systems to know what is the industry
norm.  But there are 3 broad categories of disk drive spares:

a) Cold Spare.  A spare where the power is not connected until it is
required.  [1]

b) Warm Spare.  A spare that is active but placed into a low power  
mode. ...


c) Hot Spare.  A spare that is spun up and ready to accept
read/write/position (etc) requests.


Hi Al,

Thanks for reminding me of the distinction. It seems very few  
installations would actually require (c)?





Does the tub curve (chance of early life failure) imply that hot
spares should be burned in, instead of sitting there doing nothing
from new? Just like a data disk, seems to me you'd want to know if a
hot spare fails while waiting to be swapped in. Do they get tested
periodically?


The ideal scenario, as you already allude to, would be for the disk
subsystem to initially configure the drive as a hot spare and send it
periodic test events for, say, the first 48 hours.


For some reason that's a little shorter than I had in mind, but I  
take your word that that's enough burn-in for semiconductors, motors,  
servos, etc.



This would get it
past the first segment of the bathtub reliability curve ...

If saving power was the highest priority, then the ideal situation would
be where the disk subsystem could apply/remove power to the spare and move
it from warm to cold upon command.


I am surmising that it would also considerably increase the spare's  
useful lifespan versus hot and spinning.




One trick with disk subsystems, like ZFS that have yet to have the FMA
type functionality added and which (today) provide for hot spares only, is
to initially configure a pool with one (hot) spare, and then add a 2nd hot
spare, based on installing a brand new device, say, 12 months later.  And
another spare 12 months later.  What you are trying to achieve, with this
strategy, is to avoid the scenario whereby mechanical systems, like disk
drives, tend to wear out within the same general, relatively short,
timeframe.

One (obvious) issue with this strategy, is that it may be impossible to
purchase the same disk drive 12 and 24 months later.  However, it's always
possible to purchase a larger disk drive


...which is not guaranteed to be compatible with your storage  
subsystem...!


--Toby


and simply commit to the fact
that the extra space provided by the newer drive will be wasted.

[1] The most common example is a disk drive mounted on a carrier but not
seated within the disk drive enclosure.  Simply push it in when required.

...
Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
 OpenSolaris Governing Board (OGB) Member - Feb 2006




Re: [zfs-discuss] hot spares - in standby?

2007-01-29 Thread Jason J. W. Williams

Hi Guys,

I seem to remember the Massive Array of Independent Disk guys ran into
a problem I think they called static friction, where idle drives would
fail on spin up after being idle for a long time:
http://www.eweek.com/article2/0,1895,1941205,00.asp

Would that apply here?

Best Regards,
Jason

On 1/29/07, Toby Thain [EMAIL PROTECTED] wrote:


On 29-Jan-07, at 9:04 PM, Al Hopper wrote:

 On Mon, 29 Jan 2007, Toby Thain wrote:

 Hi,

 This is not exactly ZFS specific, but this still seems like a
 fruitful place to ask.

 It occurred to me today that hot spares could sit in standby (spun
 down) until needed (I know ATA can do this, I'm supposing SCSI does
 too, but I haven't looked at a spec recently). Does anybody do this?
 Or does everybody do this already?

 I don't work with enough disk storage systems to know what is the
 industry
 norm.  But there are 3 broad categories of disk drive spares:

 a) Cold Spare.  A spare where the power is not connected until it is
 required.  [1]

 b) Warm Spare.  A spare that is active but placed into a low power
 mode. ...

 c) Hot Spare.  A spare that is spun up and ready to accept
 read/write/position (etc) requests.

Hi Al,

Thanks for reminding me of the distinction. It seems very few
installations would actually require (c)?


 Does the tub curve (chance of early life failure) imply that hot
 spares should be burned in, instead of sitting there doing nothing
 from new? Just like a data disk, seems to me you'd want to know if a
 hot spare fails while waiting to be swapped in. Do they get tested
 periodically?

 The ideal scenario, as you already allude to, would be for the disk
 subsystem to initially configure the drive as a hot spare and send it
 periodic test events for, say, the first 48 hours.

For some reason that's a little shorter than I had in mind, but I
take your word that that's enough burn-in for semiconductors, motors,
servos, etc.

 This would get it
 past the first segment of the bathtub reliability curve ...

 If saving power was the highest priority, then the ideal situation
 would
 be where the disk subsystem could apply/remove power to the spare
 and move
 it from warm to cold upon command.

I am surmising that it would also considerably increase the spare's
useful lifespan versus hot and spinning.


 One trick with disk subsystems, like ZFS that have yet to have
 the FMA
 type functionality added and which (today) provide for hot spares
 only, is
 to initially configure a pool with one (hot) spare, and then add a
 2nd hot
 spare, based on installing a brand new device, say, 12 months
 later.  And
 another spare 12 months later.  What you are trying to achieve,
 with this
 strategy, is to avoid the scenario whereby mechanical systems, like
 disk
 drives, tend to wear out within the same general, relatively short,
 timeframe.

 One (obvious) issue with this strategy, is that it may be
 impossible to
 purchase the same disk drive 12 and 24 months later.  However, it's
 always
 possible to purchase a larger disk drive

...which is not guaranteed to be compatible with your storage
subsystem...!

--Toby

 and simply commit to the fact
 that the extra space provided by the newer drive will be wasted.

 [1] The most common example is a disk drive mounted on a carrier
 but not
 seated within the disk drive enclosure.  Simple push in when
 required.
 ...
 Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
 OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
  OpenSolaris Governing Board (OGB) Member - Feb 2006



Re: [zfs-discuss] hot spares - in standby?

2007-01-29 Thread David Magda

On Jan 29, 2007, at 20:27, Toby Thain wrote:


On 29-Jan-07, at 11:02 PM, Jason J. W. Williams wrote:

I seem to remember the Massive Array of Independent Disk guys ran  
into
a problem I think they called static friction, where idle drives  
would

fail on spin up after being idle for a long time:


You'd think that probably wouldn't happen to a spare drive that was  
spun up from time to time. In fact this problem would be (mitigated  
and/or) caught by the periodic health check I suggested.


What about a rotating spare?

When setting up a pool a lot of people would (say) balance things  
around buses and controllers to minimize single  points of failure,  
and a rotating spare could disrupt this organization, but would it be  
useful at all?




Re: [zfs-discuss] hot spares - in standby?

2007-01-29 Thread Wee Yeh Tan

On 1/30/07, David Magda [EMAIL PROTECTED] wrote:

What about a rotating spare?

When setting up a pool a lot of people would (say) balance things
around buses and controllers to minimize single  points of failure,
and a rotating spare could disrupt this organization, but would it be
useful at all?


The costs involved in rotating spares in terms of IOPS reduction may
not be worth it.


--
Just me,
Wire ...


Re: [zfs-discuss] hot spares - in standby?

2007-01-29 Thread Nathan Kroenert

Random thoughts:

If we were to use some intelligence in the design, we could perhaps have 
a monitor that profiles the workload on the system (a pool, for example) 
over a [week|month|whatever] and selects a point in time, based on
history, at which it would expect the disks to be quiet, and can 'pre-build'
the spare with the contents of the disk it's about to swap out. At the
point of switch-over, it could be pretty much instantaneous... It could 
also bail if it happened that the system actually started to get 
genuinely busy...


That might actually be quite cool, though, if all disks are rotated, we 
end up with a whole bunch of disks that are evenly worn out again, which 
is just what we are really trying to avoid! ;)


Nathan.

Wee Yeh Tan wrote:

On 1/30/07, David Magda [EMAIL PROTECTED] wrote:

What about a rotating spare?

When setting up a pool a lot of people would (say) balance things
around buses and controllers to minimize single  points of failure,
and a rotating spare could disrupt this organization, but would it be
useful at all?


The costs involved in rotating spares in terms of IOPS reduction may
not be worth it.





Re: [zfs-discuss] hot spares - in standby?

2007-01-29 Thread Jason J. W. Williams

Hi Toby,

You're right. The healthcheck would definitely find any issues. I
misinterpreted your comment to that effect as a question and didn't
quite latch on. A zpool MAID-mode with that healthcheck might also be
interesting on something like a Thumper for pure-archival, D2D backup
work. Would dramatically cut down on the power. What do y'all think?

Best Regards,
Jason

On 1/29/07, Toby Thain [EMAIL PROTECTED] wrote:


On 29-Jan-07, at 11:02 PM, Jason J. W. Williams wrote:

 Hi Guys,

 I seem to remember the Massive Array of Independent Disk guys ran into
 a problem I think they called static friction, where idle drives would
 fail on spin up after being idle for a long time:

You'd think that probably wouldn't happen to a spare drive that was
spun up from time to time. In fact this problem would be (mitigated
and/or) caught by the periodic health check I suggested.

--T

 http://www.eweek.com/article2/0,1895,1941205,00.asp

 Would that apply here?

 Best Regards,
 Jason

 On 1/29/07, Toby Thain [EMAIL PROTECTED] wrote:

 On 29-Jan-07, at 9:04 PM, Al Hopper wrote:

  On Mon, 29 Jan 2007, Toby Thain wrote:
 
  Hi,
 
  This is not exactly ZFS specific, but this still seems like a
  fruitful place to ask.
 
  It occurred to me today that hot spares could sit in standby (spun
  down) until needed (I know ATA can do this, I'm supposing SCSI
 does
  too, but I haven't looked at a spec recently). Does anybody do
 this?
  Or does everybody do this already?
 
  I don't work with enough disk storage systems to know what is the
  industry
  norm.  But there are 3 broad categories of disk drive spares:
 
  a) Cold Spare.  A spare where the power is not connected until
 it is
  required.  [1]
 
  b) Warm Spare.  A spare that is active but placed into a low power
  mode. ...
 
  c) Hot Spare.  A spare that is spun up and ready to accept
  read/write/position (etc) requests.

 Hi Al,

 Thanks for reminding me of the distinction. It seems very few
 installations would actually require (c)?

 
  Does the tub curve (chance of early life failure) imply that hot
  spares should be burned in, instead of sitting there doing nothing
  from new? Just like a data disk, seems to me you'd want to know
 if a
  hot spare fails while waiting to be swapped in. Do they get tested
  periodically?
 
  The ideal scenario, as you already allude to, would be for the disk
  subsystem to initially configure the drive as a hot spare and
 send it
  periodic test events for, say, the first 48 hours.

 For some reason that's a little shorter than I had in mind, but I
 take your word that that's enough burn-in for semiconductors, motors,
 servos, etc.

  This would get it
  past the first segment of the bathtub reliability curve ...
 
  If saving power was the highest priority, then the ideal situation
  would
  be where the disk subsystem could apply/remove power to the spare
  and move
  it from warm to cold upon command.

 I am surmising that it would also considerably increase the spare's
 useful lifespan versus hot and spinning.

 
  One trick with disk subsystems, like ZFS that have yet to have
  the FMA
  type functionality added and which (today) provide for hot spares
  only, is
  to initially configure a pool with one (hot) spare, and then add a
  2nd hot
  spare, based on installing a brand new device, say, 12 months
  later.  And
  another spare 12 months later.  What you are trying to achieve,
  with this
  strategy, is to avoid the scenario whereby mechanical systems, like
  disk
  drives, tend to wear out within the same general, relatively
 short,
  timeframe.
 
  One (obvious) issue with this strategy, is that it may be
  impossible to
  purchase the same disk drive 12 and 24 months later.  However, it's
  always
  possible to purchase a larger disk drive

 ...which is not guaranteed to be compatible with your storage
 subsystem...!

 --Toby

  and simply commit to the fact
  that the extra space provided by the newer drive will be wasted.
 
  [1] The most common example is a disk drive mounted on a carrier
  but not
  seated within the disk drive enclosure.  Simple push in when
  required.
  ...
  Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
 Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
  OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
   OpenSolaris Governing Board (OGB) Member - Feb 2006
