Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-20 Thread Erik Trimble
Since my name was mentioned, a couple of things:

(a) I'm not infallible. :-)

(b) In my posts, I used "slab" where I should have said "record";
"record" is more accurate to what's actually going on.

(c) It is possible for the constituent drives in a raidz to be issued
concurrent requests for portions of a record, which *may* increase
efficiency. So the "assembly" of a complete record isn't an entirely
serial operation; that is, ZFS doesn't wait for all the parts of one
record to be assembled before issuing requests for the next record.
Drives may therefore have requests for portions of multiple records
sitting in their "todo" queues, and all the "good" (i.e. being rebuilt
*from*) drives should be constantly busy, not waiting around for others
to finish reading data.  That said, I don't see where in the code the
limit on how many records can be in flight in parallel is set. 2? 4?
20?  It matters quite a bit.
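
To make the pipelining concrete, here is a toy model in Python. This is
NOT the ZFS implementation; DRIVES and IN_FLIGHT are made-up knobs, and
IN_FLIGHT is exactly the constant I couldn't find in the code:

    from collections import deque

    DRIVES = 5        # surviving drives in the vdev (assumed)
    IN_FLIGHT = 4     # records allowed in flight at once (2? 4? 20?)

    def resilver(records):
        queues = [deque() for _ in range(DRIVES)]  # per-drive "todo" queues
        pending = deque(records)
        in_flight = []
        steps = 0
        while pending or in_flight:
            # Issue reads for new records without waiting for earlier
            # records to finish assembling.
            while pending and len(in_flight) < IN_FLIGHT:
                rec = pending.popleft()
                for q in queues:                   # one chunk per drive
                    q.append(rec)
                in_flight.append(rec)
            # Every surviving drive services its own queue each time
            # step, so no drive sits idle waiting on the others.
            for q in queues:
                if q:
                    q.popleft()
            steps += 1
            # A record is assembled once all its chunks have been read.
            in_flight = [r for r in in_flight if any(r in q for q in queues)]
        return steps

    print(resilver(range(100)))   # ~100 steps: one record done per step

With IN_FLIGHT = 1 the drives stall between records; with a larger
window the queues stay primed, which is the efficiency point above.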

(d) Writing completed record parts (i.e. the segments that need to be
resilvered) is also queued up, so for the most part the replacement
drive is doing relatively sequential IO.  That is, the head *usually*
doesn't have to seek and *may* not even have to wait much for rotational
delay; it just stays where it left off and writes the next reconstructed
data. For drives which are not replaced, but rather just "stale", this
is often not true, and those drives may be stuck seeking quite a bit.
But since they're usually only slightly stale, it isn't noticed much.


(e) Given (c) above, the average performance of a drive being read does
tend to be "average" for random IO - that is, half the maximum seek
time, plus the average rotational latency (half a revolution). NCQ etc.
will help by clustering reads, so actual performance should be better
than a pure average, but I'd not bet on a significant improvement.  And,
for typical pools, I'm going to make a bald-faced statement that the HD
read cache is going to be much less helpful than usual (for a typical
filesystem with lots of small files, most files fit in a single record,
and the next location on the HD is likely NOT something you want) - that
is, HD read-ahead cache misses are going to be frequent.  All this
assumes you are reconstructing a drive which has not been written
sequentially - those kinds of zpools will resilver much faster than
zpools exposed to "typical" read/write patterns.
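
Back-of-the-envelope, with assumed figures for a 7200 rpm drive (not
measurements from any particular disk):

    avg_seek_ms = 8.0                    # assumed average seek time
    rotation_ms = 60_000 / 7200          # one revolution = 8.33 ms
    access_ms = avg_seek_ms + rotation_ms / 2   # + avg rotational latency
    print(f"~{access_ms:.1f} ms per random read -> "
          f"~{1000 / access_ms:.0f} IOPS")
    # ~12.2 ms per random read -> ~82 IOPS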

(f) IOPS is going to be the limiting factor, particularly for the
resilvering drive, since there is less opportunity to group writes than
to group reads (even allowing for (d) above).  My reading of the code
says that ZFS issues writes to the resilvering drive as the opportunity
arises; that is, ZFS itself doesn't try to batch multiple records into a
single write request.  I'd like verification of this, though.
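
A rough sketch of what that implies for resilver time, with every figure
assumed purely for illustration:

    used_bytes = 0.5 * 500e9      # 500 GB drive, half full (assumed)
    avg_record = 32 * 1024        # assumed average record size on disk
    write_iops = 150              # assumed sustainable writes/s, no batching
    records = used_bytes / avg_record
    print(f"{records / 1e6:.1f}M records / {write_iops} IOPS "
          f"= ~{records / write_iops / 3600:.0f} hours")
    # 7.6M records / 150 IOPS = ~14 hours

Change any one assumption and the hours move proportionally, which is
why the per-record write rate is the number that matters.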



-Erik


-- 
Erik Trimble
Java System Support
Mailstop:  usca22-317
Phone:  x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-20 Thread Tuomas Leikola
On Wed, Oct 20, 2010 at 4:05 PM, Edward Ned Harvey  wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>>
>> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3,
>> and
>> they're trying to resilver at the same time.  Does the system ignore
>> subsequently failed disks and concentrate on restoring a single disk
>> quickly?  Or does the system try to resilver them all simultaneously
>> and
>> therefore double or triple the time before any one disk is fully
>> resilvered?
>
> This is a legitimate question.  If anyone knows, I'd like to know...
>

My recent experience with os_111b, os_134 and oi_147 was that a
subsequent failure and disk replacement causes the resilver to restart
from the beginning, including the new disks on the later pass. If the
disk is not replaced, the resilver runs to completion (and a replace can
then be performed, triggering a new resilver).

This area is still under active development, however, so changes may be
coming.

-- 
- Tuomas


Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-20 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
> 
> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3,
> and
> they're trying to resilver at the same time.  Does the system ignore
> subsequently failed disks and concentrate on restoring a single disk
> quickly?  Or does the system try to resilver them all simultaneously
> and
> therefore double or triple the time before any one disk is fully
> resilvered?

This is a legitimate question.  If anyone knows, I'd like to know...



Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-18 Thread Richard Elling
On Oct 18, 2010, at 6:52 AM, Edward Ned Harvey wrote:

>> From: Richard Elling [mailto:richard.ell...@gmail.com]
>> 
>>> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg41998.html
>> 
>> Slabs don't matter. So the rest of this argument is moot.
> 
> Tell it to Erik.  He might want to know.  Or maybe he knows better than you.

You were the one who posted this.  If you intend to follow citations, then
there are quite a number of useful discussions on resilvering in the 2007-2008
archives.

>> 2. Each slab is spread across many disks, so the average seek time to
>> fetch
>> the slab approaches the maximum seek time of a single disk.  That means
>> an
>> average 2x longer than average seek time.
>> 
>> nope.
> 
Anything intelligent to add?  Or just "nope"?

The assertion of an average seek time 2x longer than a single disk's
average is wrong. The reads are issued in parallel, not serially, so
there is no 2x penalty.

>> Seeks are usually quite small compared to the rotational delay, due to
>> the way data is written.
> 
> I'm using the term "seek time" to refer to the time from when the drive
> receives an instruction to the time it is actually able to read/write
> the requested data.  In drive spec sheets this is often referred to as
> "seek time", so I don't think I'm misusing the term, and it includes the
> rotational delay.

It is important because you have concentrated your concern on seek time.
Even if the seek time were zero, you can't get past the rotational delay
on HDDs.  For reads, which are what we are concerned about here, the
likelihood of the data already being in the track cache is high, so the
penalty of a blown revolution is low.

>> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3,
>> and
>> they're trying to resilver at the same time.  Does the system ignore
>> subsequently failed disks and concentrate on restoring a single disk
>> quickly?
>> 
>> No, of course.
>> 
>> 
>> Or does the system try to resilver them all simultaneously and
>> therefore double or triple the time before any one disk is fully
>> resilvered?
>> 
>> Yes, of course.
> 
> Are those supposed to be real answers?  Or are you mocking me?  It sounds
> like mocking.
> 
> If you don't mind, please try to stick with productive conversation.  I'm
> just skipping the rest of your reply from here down, because I'm considering
> it hostile and unnecessary to read or reply further.

If you want to recommend configurations and compare or contrast their
merits, then you should be able to defend your decisions. In engineering
this would be known as a critical design review, where the operational
definition of "critical" is: expressing or involving an analysis of the
merits and faults of a work product, incorporating a detailed and
scholarly analysis and commentary. While people who are not experienced
with critical design reviews may view them as hostile, the desire to
achieve a better product or result is the ultimate goal.  Check your ego
at the door.
 -- richard




Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-18 Thread Edward Ned Harvey
> From: Richard Elling [mailto:richard.ell...@gmail.com]
>
> > http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg41998.html
> 
> Slabs don't matter. So the rest of this argument is moot.

Tell it to Erik.  He might want to know.  Or maybe he knows better than you.


> 2. Each slab is spread across many disks, so the average seek time to
> fetch
> the slab approaches the maximum seek time of a single disk.  That means
> an
> average 2x longer than average seek time.
> 
> nope.

Anything intelligent to add?  Or just "nope"?


> Seeks are usually quite small compared to the rotational delay, due to
> the way data is written.

I'm using the term "seek time" to refer to the time from when the drive
receives an instruction to the time it is actually able to read/write
the requested data.  In drive spec sheets this is often referred to as
"seek time", so I don't think I'm misusing the term, and it includes the
rotational delay.


> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3,
> and
> they're trying to resilver at the same time.  Does the system ignore
> subsequently failed disks and concentrate on restoring a single disk
> quickly?
> 
> No, of course.
> 
> 
> Or does the system try to resilver them all simultaneously and
> therefore double or triple the time before any one disk is fully
> resilvered?
> 
> Yes, of course.

Are those supposed to be real answers?  Or are you mocking me?  It sounds
like mocking.

If you don't mind, please try to stick with productive conversation.  I'm
just skipping the rest of your reply from here down, because I'm considering
it hostile and unnecessary to read or reply further.



Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-17 Thread Richard Elling
On Oct 16, 2010, at 4:57 AM, Edward Ned Harvey wrote:

>> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
>> 
>>> raidzN takes a really long time to resilver (code written
>> inefficiently,
>>> it's a known problem.)  If you had a huge raidz3, it would literally
>> never
>>> finish, because it couldn't resilver as fast as new data appears.  A
>> week
>> 
>> In what way is the code written inefficiently?
> 
> Here is a link to one message in the middle of a really long thread, which
> touched on a lot of things, so it's difficult to read the thread now and get
> what it all boils down to and which parts are relevant to the present
> discussion.  Relevant comments below...
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg41998.html
> 
> In conclusion of the referenced thread:
> 
> The raidzN resilver code is inefficient, especially when there are a lot of
> disks in the vdev, because...
> 
> 1. It processes one slab at a time.  That's very important.  Each disk
> spends a lot of idle time waiting for the next disk to fetch something, so
> there is an opportunity to start prefetching data on the idle disks, and
> that is not happening.

Slabs don't matter. So the rest of this argument is moot.

> 2. Each slab is spread across many disks, so the average seek time to fetch
> the slab approaches the maximum seek time of a single disk.  That means an
> average 2x longer than average seek time.

nope.

> 2a. The more disks in the vdev, the smaller the piece of data that gets
> written to each individual disk.  So you are waiting for the maximum seek
> time, in order to fetch a slab fragment which is tiny ...

This is an oversimplification.  In all of the resilvering tests I've done, the
resilver time is entirely based on the random write performance of the
resilvering disk. 

> 3. The order of slab fetching is determined by creation time, not by disk
> layout.  This is a huge setback.  It means each seek is essentially random,
> which yields maximum seek time, instead of being sequential which approaches
> zero seek time.  If you could cut the seek time down to zero, you would have
> infinitely faster IOPS.  Something divided by zero is infinity.  Suddenly
> you wouldn't care about seek time and you'd start paying attention to some
> other limiting factor.
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg42017.html

Seeks are usually quite small compared to the rotational delay, due to
the way data is written.

> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
> they're trying to resilver at the same time.  Does the system ignore
> subsequently failed disks and concentrate on restoring a single disk
> quickly?  

No, of course.

> Or does the system try to resilver them all simultaneously and
> therefore double or triple the time before any one disk is fully resilvered?

Yes, of course.

> 5. If all your files reside in one big raidz3, that means a little piece of
> *every* slab in the pool must be on each disk.  We've concluded above that
> you are approaching maximum seek time,

No, you are jumping to the conclusion that data is allocated at the beginning
and the end of the device, which is not the case.

> and now we're also concluding you
> must do the maximum number of possible seeks.  If instead, you break your
> big raidz3 vdev into 3 raidz1 vdev's, that means each raidz1 vdev will have
> approx 33% as many slab pieces on it.  

Again, misuse of the term "slab."  A record will exist in only one set.
So it is simply a matter of finding the records that need to be
resilvered.

> If you need to resilver a disk, even
> though you're resilvering approximately the same number of bytes per disk as
> you would have in raidz3, in the raidz1 you've cut the number of seeks down
> to 33%, and you've reduced the time necessary for each of those seeks.

No, not really. The metadata contains the information you need to locate
the records to be resilvered. By design, the metadata is redundant and
spread across top-level vdevs or, in the case of a single top-level
vdev, made redundant and diverse. So there are two activities in play:
1. metadata is read in time order and prefetched
2. records are reconstructed from the surviving vdevs
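
In pseudo-Python, a sketch of that flow (conceptual only; every name
below is invented for illustration, and the real logic in the ZFS
SPA/DSL code is far more involved):

    def resilver(pool, target_vdev):
        # 1. Walk the (redundant, prefetched) metadata in time order.
        for record in pool.walk_metadata_in_txg_order():  # hypothetical
            if not record.stored_on(target_vdev):
                continue
            # 2. Rebuild the record from the surviving disks and write
            #    the missing piece back to the resilvering device.
            data = record.reconstruct_from_survivors()    # hypothetical
            target_vdev.write(record.offset, data)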

> Still better ... Compare a 23-disk raidz3 (capacity of 20 disks) against 20
> mirrors.  Resilver one disk.  You only require 5% as many seeks, and each
> seek will go twice as fast.  

Again, this is an oversimplification that assumes seeks are not done in
parallel. In reality, the I/Os are scheduled to each device in the set 
concurrently,
so the total number of seeks per set is moot.

> So the mirror will resilver 40x faster.  

I've never seen data to support this.  And yes, I've done many experiments
and observed real-life reconstruction.

> Also,
> if anybody is actually using the pool during that time, only 5% of the user
> operations will result in a seek on the resilvering mirror disk, while 100%
> of the user operations will hurt the raidz3 resilver.

Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-17 Thread Orvar Korvar
I would definitely consider raidz2 or raidz3 in several vdevs. Maximum 8-9 
drives in each vdev. Not a huge 20 disc vdev.

One vdev gives you roughly the IOPS of one single drive. If you have
three vdevs, you get the IOPS of three drives. That is better than one
single vdev of 20 discs.
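
In rough numbers (the per-drive figure is an assumption):

    disk_iops = 100           # assumed random IOPS for one 7200 rpm drive
    print(1 * disk_iops)      # one 20-disc raidz3 vdev:  ~100 IOPS
    print(3 * disk_iops)      # three smaller vdevs:      ~300 IOPS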


Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-16 Thread Edward Ned Harvey
> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
> 
> > raidzN takes a really long time to resilver (code written
> inefficiently,
> > it's a known problem.)  If you had a huge raidz3, it would literally
> never
> > finish, because it couldn't resilver as fast as new data appears.  A
> week
> 
> In what way is the code written inefficiently?

Here is a link to one message in the middle of a really long thread, which
touched on a lot of things, so it's difficult to read the thread now and get
what it all boils down to and which parts are relevant to the present
discussion.  Relevant comments below...
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg41998.html

In conclusion of the referenced thread:

The raidzN resilver code is inefficient, especially when there are a lot of
disks in the vdev, because...

1. It processes one slab at a time.  That's very important.  Each disk
spends a lot of idle time waiting for the next disk to fetch something, so
there is an opportunity to start prefetching data on the idle disks, and
that is not happening.

2. Each slab is spread across many disks, so the average seek time to fetch
the slab approaches the maximum seek time of a single disk.  That means an
average 2x longer than average seek time.

2a. The more disks in the vdev, the smaller the piece of data that gets
written to each individual disk.  So you are waiting for the maximum seek
time, in order to fetch a slab fragment which is tiny ...

3. The order of slab fetching is determined by creation time, not by disk
layout.  This is a huge setback.  It means each seek is essentially random,
which yields maximum seek time, instead of being sequential which approaches
zero seek time.  If you could cut the seek time down to zero, you would have
infinitely faster IOPS.  Something divided by zero is infinity.  Suddenly
you wouldn't care about seek time and you'd start paying attention to some
other limiting factor.
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg42017.html

4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
they're trying to resilver at the same time.  Does the system ignore
subsequently failed disks and concentrate on restoring a single disk
quickly?  Or does the system try to resilver them all simultaneously and
therefore double or triple the time before any one disk is fully resilvered?

5. If all your files reside in one big raidz3, that means a little piece of
*every* slab in the pool must be on each disk.  We've concluded above that
you are approaching maximum seek time, and now we're also concluding you
must do the maximum number of possible seeks.  If instead, you break your
big raidz3 vdev into 3 raidz1 vdev's, that means each raidz1 vdev will have
approx 33% as many slab pieces on it.  If you need to resilver a disk, even
though you're resilvering approximately the same number of bytes per disk as
you would have in raidz3, in the raidz1 you've cut the number of seeks down
to 33%, and you've reduced the time necessary for each of those seeks.
Still better ... Compare a 23-disk raidz3 (capacity of 20 disks) against 20
mirrors.  Resilver one disk.  You only require 5% as many seeks, and each
seek will go twice as fast.  So the mirror will resilver 40x faster.  Also,
if anybody is actually using the pool during that time, only 5% of the user
operations will result in a seek on the resilvering mirror disk, while 100%
of the user operations will hurt the raidz3 resilver.

6. Please see the following calculation of the probability of failure of
20 mirrors vs. a 23-disk raidz3.  According to my calculations, the
probability of a 4-disk failure in the raidz3 is approx 4.4E-4, and the
probability of 2 disks in the same mirror failing is approx 5E-5.  So
the chance of either pool failing is very small, but the raidz3 is
approx 10x more likely to suffer pool failure than the mirror setup.
Granted, there is some linear estimation which is not entirely accurate,
but I think the calculation comes within an order of magnitude of being
correct.  The mirror setup is 65% more hardware, 10x more reliable, and
much faster than the raidz3 setup, with the same usable capacity.
http://dl.dropbox.com/u/543241/raidz3%20vs%20mirrors.pdf 
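
For a rough feel of the shape of that math, here is a minimal sketch.
The failure probability and the 40x window ratio below are assumed
inputs, not the figures from the PDF, so the outputs differ; the point
is only how the comparison is structured:

    from math import comb

    q_z = 0.01       # assumed: chance a disk dies during the long
                     # raidz3 resilver window
    q_m = q_z / 40   # mirror resilvers ~40x faster, so ~40x less exposure

    # After a first failure: the 23-disk raidz3 dies if 3 or more of the
    # remaining 22 disks fail before its resilver finishes; a mirror
    # dies if its single partner disk fails before its short resilver.
    p_raidz3 = sum(comb(22, k) * q_z**k * (1 - q_z)**(22 - k)
                   for k in range(3, 23))
    p_mirror = q_m

    print(f"raidz3 {p_raidz3:.1e} vs mirror {p_mirror:.1e} "
          f"-> {p_raidz3 / p_mirror:.0f}x")   # ~5x with these inputs

The result is very sensitive to the assumed per-disk failure chance and
resilver windows, which is why the linear estimation caveat matters.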

...

Compare the 21-disk raidz3 versus 3 vdevs of 7-disk raidz1.  You get
more than 3x faster resilver time with the smaller vdevs, and you only
get 3x the redundancy in the raidz3.  That means the probability of 4
simultaneously failed disks in the raidz3 is higher than the probability
of 2 failed disks in a single raidz1 vdev.



Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-15 Thread Ian Collins

On 10/16/10 12:29 PM, Marty Scholes wrote:
>> On Fri, Oct 15, 2010 at 3:16 PM, Marty Scholes wrote:
>>> My home server's main storage is a 22 (19 + 3) disk RAIDZ3 pool
>>> backed up hourly to a 14 (11+3) RAIDZ3 backup pool.
>>
>> How long does it take to resilver a disk in that pool?  And how long
>> does it take to run a scrub?
>>
>> When I initially setup a 24-disk raidz2 vdev, it died trying to
>> resilver a single 500 GB SATA disk.  I/O under 1 MBps, all 24 drives
>> thrashing like crazy, could barely even login to the system and type
>> onscreen.  It was a nightmare.
>>
>> That, and normal (no scrub, no resilver) disk I/O was abysmal.
>>
>> Since then, I've avoided any vdev with more than 8 drives in it.
>
> My situation is kind of unique.  I picked up 120 15K 73GB FC disks
> early this year for $2 per.  As such, spindle count is a non-issue.
> As a home server, it has very little need for write iops and I have
> 8 disks for L2ARC on the main pool.

I'd hate to be paying your power bill!

> Main pool is at 40% capacity and backup pool is at 65% capacity.  Both
> take about 70 minutes to scrub.  The last time I tested a resilver it
> took about 3 hours.

So a tiny fast drive takes three hours; consider how long a 30x bigger,
much slower drive will take.


--
Ian.



Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-15 Thread Marty Scholes
> On Fri, Oct 15, 2010 at 3:16 PM, Marty Scholes
>  wrote:
> > My home server's main storage is a 22 (19 + 3) disk
> RAIDZ3 pool backed up hourly to a 14 (11+3) RAIDZ3
> backup pool.
> 
> How long does it take to resilver a disk in that
> pool?  And how long
> does it take to run a scrub?
> 
> When I initially setup a 24-disk raidz2 vdev, it died
> trying to
> resilver a single 500 GB SATA disk.  I/O under 1
> MBps, all 24 drives
> thrashing like crazy, could barely even login to the
> system and type
> onscreen.  It was a nightmare.
> 
> That, and normal (no scrub, no resilver) disk I/O was
> abysmal.
> 
> Since then, I've avoided any vdev with more than 8
> drives in it.

My situation is kind of unique.  I picked up 120 15K 73GB FC disks early this 
year for $2 per.  As such, spindle count is a non-issue.  As a home server, it 
has very little need for write iops and I have 8 disks for L2ARC on the main 
pool.

Main pool is at 40% capacity and backup pool is at 65% capacity.  Both take 
about 70 minutes to scrub.  The last time I tested a resilver it took about 3 
hours.

The difference is that these are low capacity 15K FC spindles and the pool has 
very little sustained I/O; it only bursts now and again.  Resilvers would go 
mostly uncontested, and with RAIDZ3 + autoreplace=off, I can actually schedule 
a resilver.


Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-15 Thread Freddie Cash
On Fri, Oct 15, 2010 at 3:16 PM, Marty Scholes  wrote:
> My home server's main storage is a 22 (19 + 3) disk RAIDZ3 pool backed up 
> hourly to a 14 (11+3) RAIDZ3 backup pool.

How long does it take to resilver a disk in that pool?  And how long
does it take to run a scrub?

When I initially setup a 24-disk raidz2 vdev, it died trying to
resilver a single 500 GB SATA disk.  I/O under 1 MBps, all 24 drives
thrashing like crazy, could barely even login to the system and type
onscreen.  It was a nightmare.

That, and normal (no scrub, no resilver) disk I/O was abysmal.

Since then, I've avoided any vdev with more than 8 drives in it.

-- 
Freddie Cash
fjwc...@gmail.com


Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-15 Thread Marty Scholes
Sorry, I can't not respond...

Edward Ned Harvey wrote:
> whatever you do, *don't* configure one huge raidz3.

Peter, whatever you do, *don't* make a decision based on blanket 
generalizations.

> If you can afford mirrors, your risk is much lower.  Because although
> it's physically possible for 2 disks to fail simultaneously and ruin
> the pool, the probability of that happening is smaller than the
> probability of 3 simultaneous disk failures on the raidz3.

Edward, I normally agree with most of what you have to say, but this has gone 
off the deep end.  I can think of counter-use-cases far faster than I can type.

>  Due to
> smaller resilver window.

Coupled with a smaller MTTDL, smaller cabinet space yield, smaller $/GB ratio, 
etc.

> I highly endorse mirrors for nearly all purposes.

Clearly.

Peter, go straight to the source.

http://blogs.sun.com/roch/entry/when_to_and_not_to

In short:
1. vdev_count = spindle_count / (stripe_width + parity_count)
2. IO/s is proportional to vdev_count
3. Usable capacity is proportional to stripe_width * vdev_count
4. A mirror can be approximated by a stripe of width one
5. Mean time to data loss increases exponentially with parity_count
6. Resilver time increases (super)linearly with stripe width

Balance capacity available, storage needed, performance needed and your own 
level of paranoia regarding data loss.
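
A minimal sketch turning rules 1-4 above into a calculator
(approximations only, using the 20-disk example from this thread):

    def layout(spindles, stripe_width, parity):
        vdevs = spindles // (stripe_width + parity)          # rule 1
        return {"vdevs": vdevs,
                "iops_in_disks": vdevs,                      # rule 2
                "capacity_in_disks": stripe_width * vdevs}   # rule 3

    print(layout(20, 17, 3))  # one wide raidz3:    1 vdev, 17 disks usable
    print(layout(20, 8, 2))   # two 10-disk raidz2: 2 vdevs, 16 disks usable
    print(layout(20, 1, 1))   # mirrors (rule 4):  10 vdevs, 10 disks usable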

My home server's main storage is a 22 (19 + 3) disk RAIDZ3 pool backed up 
hourly to a 14 (11+3) RAIDZ3 backup pool.

Clearly this is not a production Oracle server.  Equally clear is that my 
paranoia index is rather high.

ZFS will let you choose the combination of stripe width and parity count which 
works for you.

There is no "one size fits all."


Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-15 Thread Bob Friesenhahn

On Wed, 13 Oct 2010, Edward Ned Harvey wrote:


raidzN takes a really long time to resilver (code written inefficiently,
it's a known problem.)  If you had a huge raidz3, it would literally never
finish, because it couldn't resilver as fast as new data appears.  A week


In what way is the code written inefficiently?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-14 Thread Edward Ned Harvey
> From: David Magda [mailto:dma...@ee.ryerson.ca]
> 
> On Wed, October 13, 2010 21:26, Edward Ned Harvey wrote:
> 
> > I highly endorse mirrors for nearly all purposes.
> 
> Are you a member of BAARF?
> 
> http://www.miracleas.com/BAARF/BAARF2.html

Never heard of it.  I don't quite get it ... They want people to stop
talking about pros/cons of various types of raid?  That's definitely not me.


I think there are lots of pros/cons, many of them with nuances that vary
by implementation...  I think it's important to keep talking about it,
so all of us "experts" in the field can keep current on all this ...

Take, for example, the number of people discussing things in this mailing
list, who say they still use hardware raid.  That alone demonstrates
misinformation (in most cases) and warrants more discussion.  ;-)



Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-14 Thread David Magda
On Wed, October 13, 2010 21:26, Edward Ned Harvey wrote:

> I highly endorse mirrors for nearly all purposes.

Are you a member of BAARF?

http://www.miracleas.com/BAARF/BAARF2.html

 :)




Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-13 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Peter Taps
> 
> If I have 20 disks to build a raidz3 pool, do I create one big raidz
> vdev or do I create multiple raidz3 vdevs? Is there any advantage of
> having multiple raidz3 vdevs in a single pool?

whatever you do, *don't* configure one huge raidz3.

Consider either 3 vdevs of 7-disk raidz1 each, or 3 vdevs of 7-disk
raidz2, or something along these lines.  Perhaps 3 vdevs of 6-disk
raidz1 each, plus two hot spares.

raidzN takes a really long time to resilver (code written inefficiently,
it's a known problem.)  If you had a huge raidz3, it would literally never
finish, because it couldn't resilver as fast as new data appears.  A week
later you'd destroy & rebuild your whole pool.

If you can afford mirrors, your risk is much lower.  Because although it's
physically possible for 2 disks to fail simultaneously and ruin the pool,
the probability of that happening is smaller than the probability of 3
simultaneous disk failures on the raidz3.  Due to smaller resilver window.

I highly endorse mirrors for nearly all purposes.



Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-13 Thread Scott Meilicke
Hello Peter, 

Read the ZFS Best Practices Guide to start. If you still have questions, post 
back to the list.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pool_Performance_Considerations

-Scott

On Oct 13, 2010, at 3:21 PM, Peter Taps wrote:

> Folks,
> 
> If I have 20 disks to build a raidz3 pool, do I create one big raidz vdev or 
> do I create multiple raidz3 vdevs? Is there any advantage of having multiple 
> raidz3 vdevs in a single pool?
> 
> Thank you in advance for your help.
> 
> Regards,
> Peter

Scott Meilicke





[zfs-discuss] Optimal raidz3 configuration

2010-10-13 Thread Peter Taps
Folks,

If I have 20 disks to build a raidz3 pool, do I create one big raidz vdev or do 
I create multiple raidz3 vdevs? Is there any advantage of having multiple 
raidz3 vdevs in a single pool?

Thank you in advance for your help.

Regards,
Peter