Re: [zfs-discuss] # devices in raidz.

2007-04-11 Thread Cindy Swearingen

Mike,

This RFE is still being worked on, and I have no ETA on completion...

cs

Mike Seda wrote:
I noticed that there is still an open bug regarding removing devices 
from a zpool:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
Does anyone know if or when this feature will be implemented?


Cindy Swearingen wrote:


Hi Mike,

Yes, outside of the hot-spares feature, you can detach, offline, and 
replace existing devices in a pool, but you can't remove devices, yet.


This feature work is being tracked under this RFE:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783

Cindy

Mike Seda wrote:


Hi All,
From reading the docs, it seems that you can add devices 
(non-spares) to a zpool, but you cannot take them away, right?

Best,
Mike


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2007-04-10 Thread Mike Seda
I noticed that there is still an open bug regarding removing devices 
from a zpool:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783
Does anyone know if or when this feature will be implemented?


Cindy Swearingen wrote:

Hi Mike,

Yes, outside of the hot-spares feature, you can detach, offline, and 
replace existing devices in a pool, but you can't remove devices, yet.


This feature work is being tracked under this RFE:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783

Cindy

Mike Seda wrote:

Hi All,
From reading the docs, it seems that you can add devices 
(non-spares) to a zpool, but you cannot take them away, right?

Best,
Mike


Victor Latushkin wrote:


Maybe something like the slow parameter of VxVM?

   slow[=iodelay]
Reduces toe system performance impact of copy
operations.  Such operations are usually per-
formed on small regions of the  volume  (nor-
mally  from  16  kilobytes to 128 kilobytes).
This  option  inserts  a  delay  between  the
recovery  of  each  such  region . A specific
delay can be  specified  with  iodelay  as  a
number  of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).



For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?

NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky?  In the bad old
days when disks were small, and the systems were slow, this made some
sense.  The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.



Well, we are trying to balance impact of resilvering on running 
applications with a speed of resilvering.


I think that having an option to tell filesystem to postpone 
full-throttle resilvering till some quieter period of time may help. 
This may be combined with some throttling mechanism so during 
quieter period resilvering is done with full speed, and during busy 
period it may continue with reduced speed. Such arrangement may be 
useful for customers with e.g. well-defined SLAs.


Wbr,
Victor
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-15 Thread Richard Elling - PAE

Torrey McMahon wrote:

Richard Elling - PAE wrote:

Torrey McMahon wrote:

Robert Milkowski wrote:

Hello Torrey,

Friday, November 10, 2006, 11:31:31 PM, you wrote:

[SNIP]

Tunable in the form of a pool property, with default 100%.

On the other hand maybe the simple algorithm Veritas has used is good
enough - a simple delay between scrubbing/resilvering some data.


I think a not-too-convoluted algorithm as people have suggested would 
be ideal and then let people override it as necessary. I would think 
a 100% default might be a call generator but I'm up for debate. 
(Hey my array just went crazy. All the lights are blinking but my 
application isn't doing any I/O. What gives?)


I'll argue that *any* random % is bogus.  What you really want to
do is prioritize activity where resources are constrained.  From a RAS
perspective, idle systems are the devil's playground :-).  ZFS already
does prioritize I/O that it knows about.  Prioritizing on CPU might have
some merit, but to integrate into Solaris' resource management system
might bring some added system admin complexity which is unwanted.



I agree but the problem as I see it is that nothing has an overview of 
the entire environment. ZFS knows what I/O is coming in and what it's 
sending out but that's it. Even if we had an easy to use resource 
management framework across all the Sun applications and devices we'd 
still run into non-Sun bits that place demands on shared components 
like networking, san, arrays, etc. Anything that can be auto-tuned is 
great but I'm afraid we're still going to need manual tuning in some 
cases.

I think this is reason #7429 why I hate SANs: no meaningful QoS,
related to reason #85823 why I hate SANs: sdd_max_throttle is a 
butt-ugly hack

:-)
-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-14 Thread Torrey McMahon

Richard Elling - PAE wrote:

Torrey McMahon wrote:

Robert Milkowski wrote:

Hello Torrey,

Friday, November 10, 2006, 11:31:31 PM, you wrote:

[SNIP]

Tunable in the form of a pool property, with default 100%.

On the other hand maybe the simple algorithm Veritas has used is good
enough - a simple delay between scrubbing/resilvering some data.


I think a not-too-convoluted algorithm as people have suggested would 
be ideal and then let people override it as necessary. I would think 
a 100% default might be a call generator but I'm up for debate. (Hey 
my array just went crazy. All the lights are blinking but my 
application isn't doing any I/O. What gives?)


I'll argue that *any* random % is bogus.  What you really want to
do is prioritize activity where resources are constrained.  From a RAS
perspective, idle systems are the devil's playground :-).  ZFS already
does prioritize I/O that it knows about.  Prioritizing on CPU might have
some merit, but to integrate into Solaris' resource management system
might bring some added system admin complexity which is unwanted.



I agree but the problem as I see it is that nothing has an overview of 
the entire environment. ZFS knows what I/O is coming in and what it's 
sending out but that's it. Even if we had an easy to use resource 
management framework across all the Sun applications and devices we'd 
still run into non-Sun bits that place demands on shared components like 
networking, san, arrays, etc. Anything that can be auto-tuned is great 
but I'm afraid we're still going to need manual tuning in some cases.




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-13 Thread Torrey McMahon

Howdy Robert.

Robert Milkowski wrote:


You've got the same behavior with any LVM when you replace a disk.
So it's not something unexpected for admins. Also most of the time
they expect LVM to resilver ASAP. With the default setting not being 100%
you'll definitely see people complaining ZFS is slooow, etc.
  



It's quite possible that I've only seen the other side of the coin but 
in my past I've had support calls where
customers complained that they {replaced a drive, resilvered a mirror, 
... } and it knocked down the performance of other things. My fave was a set 
of A5200s on a hub: after they cranked the I/O rate up on the mirror, 
it caused some other app - methinks it was Oracle - to get too slow, think 
there was a disk problem, crash(!), and then initiate a cluster 
failover. Given that the disk group was not in perfect health... oh, the fun 
we had.


In any case the key is documenting the behavior well enough so people 
can see what is going on, how to tune it slower or faster on the fly, 
etc. I'm more concerned with that than the actual algorithm or method used.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-13 Thread Victor Latushkin

Maybe something like the slow parameter of VxVM?

   slow[=iodelay]
Reduces the system performance impact of copy
operations.  Such operations are usually per-
formed on small regions of the  volume  (nor-
mally  from  16  kilobytes to 128 kilobytes).
This  option  inserts  a  delay  between  the
recovery  of  each  such  region.  A specific
delay can be  specified  with  iodelay  as  a
number  of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).


For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?

NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky?  In the bad old
days when disks were small, and the systems were slow, this made some
sense.  The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.


Well, we are trying to balance the impact of resilvering on running 
applications with the speed of resilvering.


I think that having an option to tell the filesystem to postpone 
full-throttle resilvering until some quieter period of time may help. 
This could be combined with a throttling mechanism, so that during a quiet 
period resilvering runs at full speed, and during a busy period it 
continues at reduced speed. Such an arrangement may be useful for 
customers with e.g. well-defined SLAs.
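
As a rough illustration of this idea only (not an existing ZFS feature -- the
names, the quiet-hours window, and the 250 ms delay below are all hypothetical,
the delay simply borrowing the VxVM slow= default), the policy could look
something like this in Python:

    import time
    from datetime import datetime

    QUIET_HOURS = range(0, 6)        # hypothetical quiet period: midnight-06:00
    BUSY_DELAY_S = 0.250             # VxVM-style inter-region delay when busy

    def resilver(regions, read_region, write_region):
        """Resilver region by region, throttling only outside the quiet period."""
        for region in regions:
            write_region(region, read_region(region))
            if datetime.now().hour not in QUIET_HOURS:
                # busy period: insert a delay between regions (reduced speed)
                time.sleep(BUSY_DELAY_S)
            # quiet period: no delay, resilver at full speed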


Wbr,
Victor
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-13 Thread Cindy Swearingen

Hi Mike,

Yes, outside of the hot-spares feature, you can detach, offline, and 
replace existing devices in a pool, but you can't remove devices, yet.


This feature work is being tracked under this RFE:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4852783

Cindy

Mike Seda wrote:

Hi All,
From reading the docs, it seems that you can add devices (non-spares) 
to a zpool, but you cannot take them away, right?

Best,
Mike


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-13 Thread Richard Elling - PAE

Torrey McMahon wrote:

Robert Milkowski wrote:

Hello Torrey,

Friday, November 10, 2006, 11:31:31 PM, you wrote:

TM Robert Milkowski wrote:
 
Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.


REP The big question, though, is 10% of what?  User CPU?  iops?


AH Probably N% of I/O Ops/Second would work well.

Or if 100% means full speed, then 10% means that expected time should
be approximately 10x more (instead of 1h, make it 10h).

It would be more intuitive than specifying some numbers like IOPS,
etc.
  


TM In any case you're still going to have to provide a tunable for this 
TM even if the resulting algorithm works well on the host side. Keep in 
TM mind that a scrub can also impact the array(s) your filesystem lives
TM on. If all my ZFS systems started scrubbing at full speed - Because they
TM thought they weren't busy - at the same time it might cause issues with
TM other I/O on the array itself.

Tunable in the form of a pool property, with default 100%.

On the other hand maybe the simple algorithm Veritas has used is good
enough - a simple delay between scrubbing/resilvering some data.


I think a not-too-convoluted algorithm as people have suggested would 
be ideal and then let people override it as necessary. I would think a 
100% default might be a call generator but I'm up for debate. (Hey my 
array just went crazy. All the lights are blinking but my application 
isn't doing any I/O. What gives?)


I'll argue that *any* random % is bogus.  What you really want to
do is prioritize activity where resources are constrained.  From a RAS
perspective, idle systems are the devil's playground :-).  ZFS already
does prioritize I/O that it knows about.  Prioritizing on CPU might have
some merit, but to integrate into Solaris' resource management system
might bring some added system admin complexity which is unwanted.
-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-12 Thread Torrey McMahon

Robert Milkowski wrote:

Hello Torrey,

Friday, November 10, 2006, 11:31:31 PM, you wrote:

TM Robert Milkowski wrote:
  

Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.


REP The big question, though, is 10% of what?  User CPU?  iops?
  
  

AH Probably N% of I/O Ops/Second would work well.

Or if 100% means full speed, then 10% means that expected time should
be approximately 10x more (instead of 1h, make it 10h).

It would be more intuitive than specifying some numbers like IOPS,
etc.
  


TM In any case you're still going to have to provide a tunable for this 
TM even if the resulting algorithm works well on the host side. Keep in 
TM mind that a scrub can also impact the array(s) your filesystem lives
TM on. If all my ZFS systems started scrubbing at full speed - Because they
TM thought they weren't busy - at the same time it might cause issues with
TM other I/O on the array itself.

Tunable in the form of a pool property, with default 100%.

On the other hand maybe the simple algorithm Veritas has used is good
enough - a simple delay between scrubbing/resilvering some data.


I think a not-too-convoluted algorithm as people have suggested would be 
ideal and then let people override it as necessary. I would think a 100% 
default might be a call generator but I'm up for debate. (Hey my array 
just went crazy. All the lights are blinking but my application isn't 
doing any I/O. What gives?)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-10 Thread Torrey McMahon

Robert Milkowski wrote:

Also scrub can consume all CPU power on smaller and older machines and
that's not always what I would like.


REP The big question, though, is 10% of what?  User CPU?  iops?
  


AH Probably N% of I/O Ops/Second would work well.

Or if 100% means full speed, then 10% means that expected time should
be approximately 10x more (instead of 1h, make it 10h).

It would be more intuitive than specifying some numbers like IOPS,
etc.


In any case you're still going to have to provide a tunable for this 
even if the resulting algorithm works well on the host side. Keep in 
mind that a scrub can also impact the array(s) your filesystem lives 
on. If all my ZFS systems started scrubbing at full speed - Because they 
thought they weren't busy - at the same time it might cause issues with 
other I/O on the array itself.
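
As a back-of-the-envelope check on the "10% means roughly 10x longer" reading
quoted above (just a sketch of the arithmetic, not anything ZFS implements;
the 128 kB region size and per-region copy time are assumed):

    # If copying one region takes t_io seconds at full speed, then running at
    # r percent of full speed means sleeping this long between regions, since
    # the per-region time becomes t_io * 100 / r:
    def inter_region_delay(t_io, r_percent):
        return t_io * (100.0 - r_percent) / r_percent

    t_io = 0.032                          # e.g. a 128 kB region at ~4 MB/s
    print(inter_region_delay(t_io, 10))   # ~0.29 s sleep -> ~10x elapsed time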



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-07 Thread Richard Elling - PAE

Robert Milkowski wrote:

Saturday, November 4, 2006, 12:46:05 AM, you wrote:
REP Incidentally, since ZFS schedules the resync iops itself, then it can
REP really move along on a mostly idle system.  You should be able to resync
REP at near the media speed for an idle system.  By contrast, a hardware
REP RAID array has no knowledge of the context of the data or the I/O scheduling,
REP so they will perform resyncs using a throttle.  Not only do they end up
REP resyncing unused space, but they also take a long time (4-18 GBytes/hr for
REP some arrays) and thus expose you to a higher probability of second disk
REP failure.

However some mechanism to slow or freeze scrub/resilvering would be
useful. Especially in cases where the server does many other things and
not only file serving - and scrub/resilver can take much CPU power on
slower servers.

Something like 'zpool scrub -r 10 pool' - which would mean 10% of
speed.


I think this has some merit for scrubs, but I wouldn't suggest it for resilver.
If your data is at risk, there is nothing more important than protecting it.
While that sounds harsh, in reality there is a practical limit determined by
the ability of a single LUN to absorb a (large, sequential?) write workload.
For JBODs, that would be approximately the media speed.

The big question, though, is 10% of what?  User CPU?  iops?
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-07 Thread Daniel Rock

Richard Elling - PAE schrieb:

The big question, though, is 10% of what?  User CPU?  iops?


Maybe something like the slow parameter of VxVM?

   slow[=iodelay]
Reduces the system performance impact of copy
operations.  Such operations are usually per-
formed on small regions of the  volume  (nor-
mally  from  16  kilobytes to 128 kilobytes).
This  option  inserts  a  delay  between  the
recovery  of  each  such  region.  A specific
delay can be  specified  with  iodelay  as  a
number  of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-07 Thread Richard Elling - PAE

Daniel Rock wrote:

Richard Elling - PAE schrieb:

The big question, though, is 10% of what?  User CPU?  iops?


Maybe something like the slow parameter of VxVM?

   slow[=iodelay]
Reduces the system performance impact of copy
operations.  Such operations are usually per-
formed on small regions of the  volume  (nor-
mally  from  16  kilobytes to 128 kilobytes).
This  option  inserts  a  delay  between  the
recovery  of  each  such  region.  A specific
delay can be  specified  with  iodelay  as  a
number  of milliseconds; otherwise, a default
is chosen (normally 250 milliseconds).


For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?

NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky?  In the bad old
days when disks were small, and the systems were slow, this made some
sense.  The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS.
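
For reference, the arithmetic behind that figure (a quick sketch; it assumes
decimal gigabytes and the ~4 regions per second implied by a 250 ms delay):

    # 500 GB drive resilvered 128 kB at a time, at ~4 regions per second:
    regions = 500e9 / 128e3        # ~3.9 million regions
    seconds = regions / 4          # ~977,000 seconds
    print(seconds / 86400)         # ~11.3 days, roughly the 11 days 8 hours quoted
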
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-07 Thread Torrey McMahon

Richard Elling - PAE wrote:

The better approach is for the file system to do what it needs
to do as efficiently as possible, which is the current state of ZFS. 


This implies that the filesystem has exclusive use of the channel - SAN 
or otherwise - as well as the storage array front end controllers, 
cache, and the raid groups that may be behind it. What we really need in 
this case, and a few others, is the filesystem and backend storage 
working together... but I'll save that rant for another day. ;)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-07 Thread Daniel Rock

Richard Elling - PAE schrieb:

For modern machines, which *should* be the design point, the channel
bandwidth is underutilized, so why not use it?


And what about encrypted disks? Simply create a zpool with checksum=sha256, 
fill it up, then scrub. I'd be happy if I could still use my machine during 
scrubbing. Throttling the scrub would help. Maybe also run the 
scrubbing at a high nice level in the kernel.





NB. At 4 128kByte iops per second, it would take 11 days and 8 hours
to resilver a single 500 GByte drive -- feeling lucky?


250ms is the Veritas default. It doesn't have to be the ZFS default also.


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-06 Thread Torrey McMahon

Richard Elling - PAE wrote:



Incidentally, since ZFS schedules the resync iops itself, then it can
really move along on a mostly idle system.  You should be able to resync
at near the media speed for an idle system.  By contrast, a hardware
RAID array has no knowledge of the context of the data or the I/O scheduling,
so they will perform resyncs using a throttle.  Not only do they end up
resyncing unused space, but they also take a long time (4-18 GBytes/hr for
some arrays) and thus expose you to a higher probability of second disk
failure. 


Just as another data point: It is true that the array doesn't know the 
context of the data or the i/o scheduling but some arrays do watch the 
incoming data rate and throttle accordingly. (T3 used to for example.)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-03 Thread Robert Milkowski
Hello ozan,

Friday, November 3, 2006, 3:57:00 PM, you wrote:

osy for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
osy basis for this recommendation? i assume it is performance and not failure
osy resilience, but i am just guessing... [i know, recommendation was intended
osy for people who know their raid cold, so it needed no further explanation]

Performance reason for random reads.

ps. however, the bigger the raid-z group, the riskier it could be - but
this is obvious.


-- 
Best regards,
 Robert    mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-03 Thread Richard Elling - PAE

ozan s. yigit wrote:

for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
basis for this recommendation? i assume it is performance and not failure
resilience, but i am just guessing... [i know, recommendation was intended
for people who know their raid cold, so it needed no further explanation]


Both actually.
The small, random read performance will approximate that of a single disk.
The probability of data loss increases as you add disks to RAID-5/6/Z/Z2
volumes.

For example, suppose you have 12 disks and insist on RAID-Z.
Given
1. small, random read iops for a single disk is 141 (e.g. a 2.5" SAS
   10k rpm drive)
2. MTBF = 1.4M hours (0.63% AFR) (so says the disk vendor)
3. no spares
4. service time = 24 hours, resync rate 100 GBytes/hr, 50% space
   utilization
5. infinite service life

Scenario 1: 12-way RAID-Z
performance = 141 iops
MTTDL[1] = 68,530 years
space = 11 * disk size

Scenario 2: 2x 6-way RAID-Z+0
performance = 282 iops
MTTDL[1] = 150,767 years
space = 10 * disk size

[1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)
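
These figures can be reproduced approximately as follows (a sketch; the drive
size is not stated above, so a 146 GB drive is assumed here because it roughly
matches the quoted results):

    MTBF = 1.4e6                       # hours, per the disk vendor
    HOURS_PER_YEAR = 8766
    # MTTR = service time + resync time = 24 h + size * utilization / rate
    MTTR = 24 + (146 * 0.50) / 100     # assumed 146 GB drive, 50% used, 100 GB/hr

    def mttdl_years(n, groups=1):
        # MTTDL = MTBF^2 / (N * (N-1) * MTTR), divided by the number of groups
        return MTBF**2 / (n * (n - 1) * MTTR) / groups / HOURS_PER_YEAR

    print(mttdl_years(12))             # ~68,500 years  (12-way RAID-Z)
    print(mttdl_years(6, groups=2))    # ~150,700 years (2x 6-way RAID-Z+0)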

 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-03 Thread Al Hopper
On Fri, 3 Nov 2006, Richard Elling - PAE wrote:

 ozan s. yigit wrote:
  for s10u2, documentation recommends 3 to 9 devices in raidz. what is the
  basis for this recommendation? i assume it is performance and not failure
  resilience, but i am just guessing... [i know, recommendation was intended
  for people who know their raid cold, so it needed no further explanation]

 Both actually.
 The small, random read performance will approximate that of a single disk.
 The probability of data loss increases as you add disks to RAID-5/6/Z/Z2
 volumes.

 For example, suppose you have 12 disks and insist on RAID-Z.
 Given
   1. small, random read iops for a single disk is 141 (e.g. a 2.5" SAS
  10k rpm drive)
   2. MTBF = 1.4M hours (0.63% AFR) (so says the disk vendor)
   3. no spares
   4. service time = 24 hours, resync rate 100 GBytes/hr, 50% space
  utilization
   5. infinite service life

 Scenario 1: 12-way RAID-Z
   performance = 141 iops
   MTTDL[1] = 68,530 years
   space = 11 * disk size

 Scenario 2: 2x 6-way RAID-Z+0
   performance = 282 iops
   MTTDL[1] = 150,767 years
   space = 10 * disk size

 [1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)

But ... I'm not sure I buy into your numbers given the probability that
more than one disk will fail inside the service window - given that the
disks are identical?  Or ... a disk failure occurs at 5:01 PM (quitting
time) on a Friday and won't be replaced until 8:00AM on Monday morning.
Does the failure data you have access to support my hypothesis that
failures of identical mechanical systems tend to occur in small clusters
within a relatively small window of time?

Call me paranoid, but I'd prefer to see a product like thumper configured
with 50% of the disks manufactured by vendor A and the other 50%
manufactured by someone else.

This paranoia is based on a personal experience, many years ago (before we
had smart fans etc), where we had a rack full of expensive custom
equipment cooled by (what we thought was) a highly redundant group of 5
fans.  One fan suffered infant mortality and its failure went unnoticed,
leaving 4 fans running.  Two of the fans died on the same extended weekend
(public holiday).  It was an expensive and embarrassing disaster.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
 OpenSolaris Governing Board (OGB) Member - Feb 2006
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # devices in raidz.

2006-11-03 Thread Richard Elling - PAE

Al Hopper wrote:

[1] Using MTTDL = MTBF^2 / (N * (N-1) * MTTR)


But ... I'm not sure I buy into your numbers given the probability that
more than one disk will fail inside the service window - given that the
disks are identical?  Or ... a disk failure occurs at 5:01 PM (quitting
time) on a Friday and won't be replaced until 8:00AM on Monday morning.
Does the failure data you have access to support my hypothesis that
failures of identical mechanical systems tend to occur in small clusters
within a relatively small window of time?


Separating the right hand side:
MTTDL = (MTBF/N) * (MTBF / ((N-1) * MTTR))

the right-most factor is the inverse of the probability that one of the N-1
disks fails during the recovery window for the first disk's failure.  As the
MTTR increases, the probability of a 2nd disk failure also increases.
RAIDoptimizer calculates the MTTR as:
MTTR = service response time + resync time
where
resync time = size * space used (%) / resync rate

Incidentally, since ZFS schedules the resync iops itself, then it can
really move along on a mostly idle system.  You should be able to resync
at near the media speed for an idle system.  By contrast, a hardware
RAID array has no knowledge of the context of the data or the I/O scheduling,
so they will perform resyncs using a throttle.  Not only do they end up
resyncing unused space, but they also take a long time (4-18 GBytes/hr for
some arrays) and thus expose you to a higher probability of second disk
failure.
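
As a small sketch of how much the resync rate matters in that model (the
500 GB drive size and 50% utilization here are assumed, not from the post):

    # MTTR = service time + resync time; since MTTDL scales as 1/MTTR, a
    # slower resync rate lowers MTTDL roughly in proportion.
    def mttr_hours(size_gb, used=0.5, rate_gb_per_hr=100.0, service_hr=24.0):
        return service_hr + size_gb * used / rate_gb_per_hr

    print(mttr_hours(500, rate_gb_per_hr=100))   # ~26.5 h, near-media resync
    print(mttr_hours(500, rate_gb_per_hr=4))     # ~86.5 h, slow array resync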


Call me paranoid, but I'd prefer to see a product like thumper configured
with 50% of the disks manufactured by vendor A and the other 50%
manufactured by someone else.


Diversity is usually a good thing.  Unfortunately, this is often impractical
for a manufacturer.


This paranoia is based on a personal experience, many years ago (before we
had smart fans etc), where we had a rack full of expensive custom
equipment cooled by (what we thought was) a highly redundant group of 5
fans.  One fan suffered infant mortality and its failure went unnoticed,
leaving 4 fans running.  Two of the fans died on the same extended weekend
(public holiday).  It was an expensive and embarrassing disaster.


Modelling such as this assumes independence of failures.  Common cause or
bad lots are not that hard to model, but you may never find any failure rate
data for them.  You can look at the MTBF sensitivities, though that is an
opening to another set of results.  I prefer to ignore the absolute values
and judge competing designs by their relative results.  To wit, I fully
expect to be beyond dust in 150,767 years, and the expected lifetime of
most disks is 5 years.  But given two competing designs using the same
model, a design predicting an MTTDL of 150,767 years will very likely demonstrate
better MTTDL than a design predicting 68,530 years.
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss