[zfs-discuss] All (pure) SSD pool rehash

2011-09-27 Thread Matt Banks
I know there was a thread about this a few months ago.

However, with the costs of SSD's falling like they have, the idea of an Oracle 
X4270 M2/Cisco C210 M2/IBM x3650 M3 class of machine with a 13 drive RAIDZ2 
zpool (1 hot spare) is really starting to sound alluring to me/us. Especially 
with something like the OCZ Deneva 2 drives (Sandforce 2281 with a supercap), 
the SanDisk (Pliant) Lightning series, or perhaps the Hitachi SSD400M's coming 
in at prices that aren't a whole lot more than 600GB 15k drives. (From an 
enterprise perspective anyway.)

Systems with a similar (OLTP) load are frequently I/O bound, e.g. a server with 
a Sun 2540 FC array w/ 11x 300GB 15k SAS drives and 2x X25-E's for ZIL/L2ARC, so 
the extra bandwidth would be welcome.

Am I crazy for putting something like this into production using Solaris 10/11? 
On paper, it really seems ideal for our needs.

Also, maybe I read it wrong, but why is it that (in the previous thread about 
hw raid and zpools) zpools with large numbers of physical drives (eg 20+) were 
frowned upon? I know that ZFS!=WAFL but it's so common in the NetApp world that 
I was surprised to read that. A 20 drive RAID-Z2 pool really wouldn't/couldn't 
recover (resilver) from a drive failure? That seems to fly in the face of the 
x4500 boxes from a few years ago.

matt


[zfs-discuss] Mirror Gone

2011-09-27 Thread Tony MacDoodle
Hello,

Looks like the mirror was removed or deleted... Can I get it back to its
original configuration?

Original:
  mirror-0  ONLINE   0 0 0
    c1t2d0  ONLINE   0 0 0
    c1t3d0  ONLINE   0 0 0
  mirror-1  ONLINE   0 0 0
    c1t4d0  ONLINE   0 0 0
    c1t5d0  ONLINE   0 0 0

Now:
  mirror-0  ONLINE   0 0 0
    c1t2d0  ONLINE   0 0 0
    c1t3d0  ONLINE   0 0 0
  c1t4d0    ONLINE   0 0 0
  c1t5d0    ONLINE   0 0 0


Thanks


Re: [zfs-discuss] All (pure) SSD pool rehash

2011-09-27 Thread Bob Friesenhahn

On Tue, 27 Sep 2011, Matt Banks wrote:


Am I crazy for putting something like this into production using Solaris 10/11? 
On paper, it really seems ideal for our needs.


As long as the drive firmware operates correctly, I don't see a 
problem.


Also, maybe I read it wrong, but why is it that (in the previous 
thread about hw raid and zpools) zpools with large numbers of 
physical drives (eg 20+) were frowned upon? I know that ZFS!=WAFL


There is no concern with a large number of physical drives in a pool. 
The primary concern is with the number of drives per vdev.  Any 
variation in the latency of the drives hinders performance and each 
I/O to a vdev consumes 1 IOP across all of the drives in the vdev 
(or stripe) when raidzN is used.  Having more vdevs is better for 
consistent performance and more available IOPS.
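
As a minimal sketch (pool and device names here are hypothetical), the same 
twelve disks can be laid out as one wide raidz2 vdev or as two narrower ones; 
the second form costs two extra parity disks but roughly doubles the pool's 
random IOPS:

   # one 12-disk raidz2 vdev: the pool delivers roughly one disk's worth of random IOPS
   zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
                            c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0

   # two 6-disk raidz2 vdevs: same disks, about twice the random IOPS
   zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
                     raidz2 c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0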


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] All (pure) SSD pool rehash

2011-09-27 Thread Paul Kraus
On Tue, Sep 27, 2011 at 1:21 PM, Matt Banks mattba...@gmail.com wrote:

 Also, maybe I read it wrong, but why is it that (in the previous thread about
 hw raid and zpools) zpools with large numbers of physical drives (eg 20+)
 were frowned upon? I know that ZFS!=WAFL but it's so common in the
 NetApp world that I was surprised to read that. A 20 drive RAID-Z2 pool
 really wouldn't/couldn't recover (resilver) from a drive failure? That seems
 to fly in the face of the x4500 boxes from a few years ago.

There is a world of difference between a zpool with 20+ drives and
a single vdev with 20+ drives. What has been frowned upon is a single
vdev with more than about 8 drives. I have a zpool with 120 drives, 22
vdevs each with 5 drives in a raidz2 and 10 hot spares. The only
failures I had to resilver were before it went into production (and I had
little data in it at the time), but I expect resilver times to be
reasonable based on experience with other configurations I have had.

Keep in mind that random read I/O is proportional to the number of
vdevs, NOT the number of drives. See
https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc&output=html
for the results of some of my testing.
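
As a rough sketch (the per-disk IOPS number below is an assumption, not a
measurement from that pool), the expected random-read ceiling tracks the 22
vdevs rather than the 110 data disks:

   per_disk_iops=150   # assumed random IOPS for one spinning disk
   vdevs=22
   echo "~$(( vdevs * per_disk_iops )) random read IOPS from 22 raidz2 vdevs"
   echo "(a single 110-disk vdev would still top out near ${per_disk_iops} IOPS)"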

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Designer: Frankenstein, A New Musical
(http://www.facebook.com/event.php?eid=123170297765140)
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, RPI Players


Re: [zfs-discuss] All (pure) SSD pool rehash

2011-09-27 Thread Erik Trimble

On 9/27/2011 10:39 AM, Bob Friesenhahn wrote:

On Tue, 27 Sep 2011, Matt Banks wrote:

Also, maybe I read it wrong, but why is it that (in the previous 
thread about hw raid and zpools) zpools with large numbers of 
physical drives (eg 20+) were frowned upon? I know that ZFS!=WAFL


There is no concern with a large number of physical drives in a pool. 
The primary concern is with the number of drives per vdev.  Any 
variation in the latency of the drives hinders performance and each 
I/O to a vdev consumes 1 IOP across all of the drives in the vdev 
(or stripe) when raidzN is used.  Having more vdevs is better for 
consistent performance and more available IOPS.


Bob


To expound just a bit on Bob's reply: the reason that large numbers of 
disks in a RAIDZ* vdev are frowned upon is that the IOPS of a RAIDZ vdev 
are essentially constant (O(1)), regardless of how many disks are in the 
vdev. So the IOPS throughput of a 20-disk vdev is the same as that of a 
5-disk vdev.  Streaming throughput is significantly higher (it scales as 
O(N)), but you're unlikely to get pure streaming for the vast majority 
of workloads.


Because resilvering a RAIDZ* is IOPS-bound, you quickly run into the 
situation where the time to resilver X amount of data on a 5-drive RAIDZ 
is the same as on a 30-drive RAIDZ.  And because you're highly likely to 
store much more data on a larger vdev, your resilver time to replace a 
drive goes up roughly linearly with the number of drives in a RAIDZ vdev.


This leads to the following situation: if I have 20 x 1TB drives, here are 
several possible configurations and their relative resilver times 
(relative, because without knowing the exact layout of the data itself, I 
can't estimate wall-clock resilver times):


(a) 5 x 4-disk RAIDZ:  15TB usable, takes N amount of time to replace a failed disk
(b) 4 x 5-disk RAIDZ:  16TB usable, takes 1.25N time to replace a disk
(c) 2 x 10-disk RAIDZ: 18TB usable, takes 2.5N time to replace a disk
(d) 1 x 20-disk RAIDZ: 19TB usable, takes 5N time to replace a disk

Notice that by doubling the number of drives in a RAIDZ, you double the 
resilver time for the same amount of data in the ZPOOL.
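
That relationship is easy to reproduce. A quick sketch (assuming data is spread 
evenly across vdevs and that resilver time tracks the data per vdev) regenerates 
the table above:

   awk 'BEGIN {
       split("4 5 10 20", dpv, " ")          # disks per raidz1 vdev
       for (i = 1; i <= 4; i++) {
           vdevs  = 20 / dpv[i]              # 20 x 1TB drives in total
           usable = vdevs * (dpv[i] - 1)     # one parity disk per vdev
           printf "%d x %2d-disk RAIDZ: %2dTB usable, ~%.2fN resilver\n",
                  vdevs, dpv[i], usable, 5 / vdevs
       }
   }'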


The above also applies to RAIDZ[23], as the additional parity disk 
doesn't materially impact resilver times in either direction (and, yes, 
it's not really a parity disk, I'm just being sloppy).


The other main reason is that a larger number of drives in a single 
vdev means a higher probability that multiple disk failures will 
result in loss of data. Richard Elling had some data on the exact 
calculations, but it boils down to the fact that your chance of total 
data loss from multiple drive failures goes up MORE THAN LINEARLY as 
you add drives to a vdev.  Thus, a 1 x 10-disk RAIDZ has well over 2x 
the chance of failure that a 2 x 5-disk RAIDZ zpool has.


-Erik


Re: [zfs-discuss] Mirror Gone

2011-09-27 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Tony MacDoodle
 
 Original:
 mirror-0  ONLINE   0 0 0
     c1t2d0  ONLINE   0 0 0
     c1t3d0  ONLINE   0 0 0
   mirror-1  ONLINE   0 0 0
     c1t4d0  ONLINE   0 0 0
     c1t5d0  ONLINE   0 0 0
 
 Now:
 mirror-0  ONLINE   0 0 0
     c1t2d0  ONLINE   0 0 0
     c1t3d0  ONLINE   0 0 0
   c1t4d0    ONLINE   0 0 0
   c1t5d0    ONLINE   0 0 0

There is only one way for this to make sense:  You did not have mirror-1 in
the first place.  You accidentally added 4 & 5 without mirroring.  The only
way to fix it is to (a) add redundancy to both 4 & 5, or (b) destroy and
recreate the pool, and this time be very careful that you mirror 4 & 5.
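
If spare disks are available, option (a) can be done online with zpool attach; a
sketch, in which the pool name and the two new device names are hypothetical:

   # attach a new disk to each stray top-level device, turning it into a 2-way mirror
   zpool attach tank c1t4d0 c1t6d0
   zpool attach tank c1t5d0 c1t7d0
   zpool status tank    # c1t4d0 and c1t5d0 should now sit under mirror vdevs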


Re: [zfs-discuss] Mirror Gone

2011-09-27 Thread Mark Musante

On 27 Sep 2011, at 18:29, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Tony MacDoodle
 
 
 Now:
 mirror-0  ONLINE   0 0 0
 c1t2d0  ONLINE   0 0 0
 c1t3d0  ONLINE   0 0 0
   c1t4d0ONLINE   0 0 0
   c1t5d0ONLINE   0 0 0
 
 There is only one way for this to make sense:  You did not have mirror-1 in
 the first place.  

An easy way to tell is to take a look at the output of the 'zpool history' command for this pool.
What does that show?
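
For example, something along these lines (the pool name is a placeholder) should
show whether the devices went in via 'zpool add' rather than 'zpool attach':

   zpool history tank | egrep 'add|attach'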



Re: [zfs-discuss] All (pure) SSD pool rehash

2011-09-27 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Matt Banks
 
 Am I crazy for putting something like this into production using Solaris 10/11?
 On paper, it really seems ideal for our needs.

Do you have an objection to solaris 10/11 for some reason?
No, it's not crazy (and I wonder why you would ask).


 Also, maybe I read it wrong, but why is it that (in the previous thread about
 hw raid and zpools) zpools with large numbers of physical drives (eg 20+)

Clarification that I know others have already added, but I reiterate:  It's
not the number of devices in a zpool that matters.  It's the amount of data
in the resilvering vdev, and the number of devices inside the vdev, and your
usage patterns (where the typical use pattern is the worst case usage
pattern, especially for a database server).  Together these of course have a
relation to the number of devices in the pool, but that's not what matters.

The problem basically applies to HDD's.  By creating your pool of SSD's,
this problem should be eliminated.

Here is the problem:

Assuming the data in the pool is evenly distributed amongst the vdev's, then
the more vdev's you have, the less data is in each one.  If you make your
pool of a small number of large raidzN vdev's, then you're going to have
relatively a lot of data in each vdev, and therefore a lot of data in the
resilvering vdev.

When a vdev resilvers, it will read each slab of data, in essentially time
order, which is approximately random disk order, in order to reconstruct the
data that must be written on the resilvering device.  This creates two
problems, (a) Since each disk must fetch a piece of each slab, the random
access time of the vdev as a whole is approximately the random access time
of the slowest individual device.  So the more devices in the vdev, the
worse the IOPS for the vdev...  And (b) the more data slabs in the vdev, the
more iterations of random IO operations must be completed.  

In other words, during resilvers, you're IOPS limited.  If your pool is made
of all SSD's, then problem (a) is basically nonexistent, since the random
access time of all the devices are equal and essentially zero.  Problem (b)
isn't necessarily a problem...  It's like, if somebody is giving you $1,000
for free every month and then they suddenly drop down to only $500, you
complain about what you've lost.   ;-)  (See below.)

In a hardware raid system, resilvering will be done sequentially on all
disks in the array.  Depending on your specs, a typical time might be 2hrs.
All blocks will be resilvered regardless of whether or not they're used.
But in ZFS, only used blocks will be resilvered.  That means, if your vdev
is empty, your resilver is completed instantly.  Also, if your vdev is made
of SSD's, then the random access times will be just like the sequential
access times, and your worst case is still equal to hardware raid resilver.

The only time there's a problem is when you have a vdev made of HDD's, and
there's a bunch of data in it, and it's scattered randomly (which typically
happens due to the nature of COW and snapshot deletion/creation over time).
So the HDD's thrash around spending all their time doing random access, with
very little payload for each random op.  In these cases, even HDD mirrors
end up having resilver times that are several times longer than sequentially
resilvering the whole disk including unused blocks.  In this case, mirrors
are the best case scenario, because they're both (a) minimal data in each
vdev, and (b) minimal number of devices in the resilvering vdev.  Even so,
the mirror resilver time might be like 12 hours, in my experience, instead
of the 2hrs that hardware would have needed to resilver the whole disk.  But
if you were using a big vdev (raidzN) of a bunch of HDD's (let's say, 21
disks in a raidz3), you might get resilver times that are a couple orders of
magnitude too long...  Like 20 days instead of 10 hours.  At this level, you
should assume your resilver will never complete.
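
To put rough numbers on that, here is a back-of-envelope sketch in which every
figure is an assumption rather than a measurement:

   used_tb=6            # data in the resilvering vdev (hypothetical)
   avg_block_kb=32      # average block size after years of COW churn (hypothetical)
   hdd_iops=150         # random IOPS of the slowest member disk (rough spinning-disk figure)
   blocks=$(( used_tb * 1024 * 1024 * 1024 / avg_block_kb ))
   echo "~$(( blocks / hdd_iops / 3600 )) hours just to touch every block at ${hdd_iops} IOPS"

Substitute an SSD-class IOPS figure (tens of thousands) and the same walk drops
to a couple of hours, which is exactly the point above.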

So again:  Not a problem if you're making your pool out of SSD's.



Re: [zfs-discuss] All (pure) SSD pool rehash

2011-09-27 Thread Fajar A. Nugraha
On Wed, Sep 28, 2011 at 8:21 AM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 When a vdev resilvers, it will read each slab of data, in essentially time
 order, which is approximately random disk order, in order to reconstruct the
 data that must be written on the resilvering device.  This creates two
 problems, (a) Since each disk must fetch a piece of each slab, the random
 access time of the vdev as a whole is approximately the random access time
 of the slowest individual device.  So the more devices in the vdev, the
 worse the IOPS for the vdev...  And (b) the more data slabs in the vdev, the
 more iterations of random IO operations must be completed.

 In other words, during resilvers, you're IOPS limited.  If your pool is made
 of all SSD's, then problem (a) is basically nonexistent, since the random
 access time of all the devices are equal and essentially zero.  Problem (b)
 isn't necessarily a problem...  It's like, if somebody is giving you $1,000
 for free every month and then they suddenly drop down to only $500, you
 complain about what you've lost.   ;-)  (See below.)

If you regularly spend all of the given $1,000, then you're going to
complain hard when it suddenly drops to $500.

 So again:  Not a problem if you're making your pool out of SSD's.

Big problem if your system is already using most of the available IOPS
during normal operation.

-- 
Fajar


Re: [zfs-discuss] All (pure) SSD pool rehash

2011-09-27 Thread Bob Friesenhahn

On Tue, 27 Sep 2011, Edward Ned Harvey wrote:


The problem basically applies to HDD's.  By creating your pool of SSD's,
this problem should be eliminated.


This is not completely true.  SSDs will help significantly but they 
will still suffer from the synchronized commit of a transaction group. 
SSDs don't suffer from seek time, but they still suffer from 
erase/write time and many SSDs are capable of only a few thousand 
flushed writes per second.  It is just a matter of degree.


SSDs which do garbage collection during the write cycle could cause 
the whole vdev to temporarily hang until the last SSD has committed 
its write.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/