> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Matt Banks
> Am I crazy for putting something like this into production using Solaris
> On paper, it really seems ideal for our needs.
Do you have an objection to Solaris 10/11 for some reason?
No, it's not crazy (and I wonder why you would ask).
> Also, maybe I read it wrong, but why is it that (in the previous thread
> hw raid and zpools) zpools with large numbers of physical drives (eg 20+)
Clarification that I know others have already added, but I'll reiterate: it's
not the number of devices in a zpool that matters. What matters is the amount
of data in the resilvering vdev, the number of devices inside that vdev, and
your usage patterns (where the typical usage pattern is the worst-case usage
pattern, especially for a database server). Together these of course correlate
with the number of devices in the pool, but the pool size itself is not what
matters.
The problem basically applies to HDD's. Building your pool out of SSD's
should eliminate it.
Here is the problem:
Assuming the data in the pool is evenly distributed amongst the vdev's, then
the more vdev's you have, the less data is in each one. If you make your
pool of a small number of large raidzN vdev's, then you're going to have
relatively a lot of data in each vdev, and therefore a lot of data in the
resilvering vdev.
When a vdev resilvers, it will read each slab of data, in essentially time
order, which is approximately random disk order, in order to reconstruct the
data that must be written on the resilvering device. This creates two
problems: (a) since each disk must fetch a piece of each slab, the random
access time of the vdev as a whole is approximately the random access time
of the slowest individual device, so the more devices in the vdev, the
worse the vdev's IOPS; and (b) the more data slabs in the vdev, the more
iterations of random IO must be completed.
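As a back-of-envelope sketch of the two effects above (all numbers here are
illustrative assumptions, not measurements from any real pool): model the
IOPS-bound resilver as one random op per used slab, at the random IOPS of the
slowest member device:

```python
# Back-of-envelope model of an IOPS-bound resilver: the vdev's random
# IOPS is bounded by its slowest member (a), and every used slab needs
# one random read pass (b). All figures are illustrative assumptions.

def resilver_hours(used_bytes, avg_slab_bytes, slowest_disk_iops):
    """Estimate resilver time when the vdev is IOPS-bound."""
    slabs = used_bytes / avg_slab_bytes      # iterations of random IO
    seconds = slabs / slowest_disk_iops      # one random op per slab
    return seconds / 3600

# 4 TiB of used data, 128 KiB average slab, ~100 random IOPS for a
# 7200 rpm HDD vs ~50,000 for a commodity SSD (assumed values)
hdd = resilver_hours(4 * 2**40, 128 * 2**10, 100)
ssd = resilver_hours(4 * 2**40, 128 * 2**10, 50_000)
print(f"HDD-bound: {hdd:.0f} h, SSD-bound: {ssd:.1f} h")
```

With these assumed inputs, the same amount of used data takes roughly 93
hours on the HDD vdev but well under an hour on the SSD vdev, which is the
whole point of the argument.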
In other words, during resilvers, you're IOPS limited. If your pool is made
of all SSD's, then problem (a) is basically nonexistent, since the random
access times of all the devices are equal and essentially zero. Problem (b)
isn't necessarily a problem... It's like, if somebody is giving you $1,000
for free every month and then they suddenly drop down to only $500, you
complain about what you've lost. ;-) (See below.)
In a hardware raid system, resilvering is done sequentially on all disks in
the array. Depending on your specs, a typical time might be 2 hours. All
blocks are resilvered regardless of whether or not they're used.
But in ZFS, only used blocks will be resilvered. That means, if your vdev
is empty, your resilver is completed instantly. Also, if your vdev is made
of SSD's, then the random access times will be just like the sequential
access times, and your worst case is still equal to hardware raid resilver.
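To put the sequential case in numbers (illustrative figures, not from this
thread): a hardware-style rebuild streams the whole disk regardless of how
full it is, so its time is just capacity divided by sequential throughput:

```python
# Sequential (hardware-RAID-style) rebuild time depends only on disk
# size and streaming rate, not on how much data is in use.
# Assumed figures: 1 TB disk, ~150 MB/s sustained sequential rate.
disk_bytes = 1 * 10**12
seq_bytes_per_s = 150 * 10**6
seq_hours = disk_bytes / seq_bytes_per_s / 3600
print(f"sequential rebuild: ~{seq_hours:.1f} h")
```

Under these assumptions that's roughly 1.9 hours, in the same ballpark as the
"typical 2 hours" quoted above, and it's also the worst case for an all-SSD
ZFS vdev, since SSD random reads run at roughly sequential speed.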
The only time there's a problem is when you have a vdev made of HDD's, and
there's a bunch of data in it, and it's scattered randomly (which typically
happens due to the nature of COW and snapshot deletion/creation over time).
So the HDD's thrash around spending all their time doing random access, with
very little payload for each random op. In these cases, even HDD mirrors
end up having resilver times that are several times longer than sequentially
resilvering the whole disk including unused blocks. In this case, mirrors
are the best case scenario, because they're both (a) minimal data in each
vdev, and (b) minimal number of devices in the resilvering vdev. Even so,
the mirror resilver time might be something like 12 hours, in my experience,
instead of the 2 hours that hardware would have needed to resilver the whole
disk. But if you were using a big vdev (raidzN) of a bunch of HDD's (let's
say, 21 disks in a raidz3), you might see resilver times that are a couple of
orders of magnitude longer... like 20 days instead of 10 hours. At that
point, you should assume your resilver will never complete.
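The mirror-vs-wide-raidz gap above can be sketched with the same IOPS-bound
model (every figure here is an illustrative assumption: hypothetical pool
size, slab size, and disk IOPS, not data from this thread):

```python
# Rough comparison of IOPS-bound resilver times for the two layouts
# discussed above, assuming the same total used data in the pool.
# All figures are illustrative assumptions.

def vdev_resilver_hours(data_in_vdev_bytes, avg_slab_bytes, iops):
    return (data_in_vdev_bytes / avg_slab_bytes) / iops / 3600

total_used = 20 * 2**40   # assume 20 TiB of used data in the pool
slab = 128 * 2**10        # assume 128 KiB average slab
hdd_iops = 100            # assume ~100 random IOPS per 7200 rpm HDD

# 12 mirror vdevs: each vdev holds roughly 1/12 of the used data
mirror = vdev_resilver_hours(total_used / 12, slab, hdd_iops)
# One wide raidz3 vdev (e.g. 21 disks): all used data sits in it,
# and the vdev's random IOPS is still about one disk's worth
raidz = vdev_resilver_hours(total_used, slab, hdd_iops)
print(f"mirror: ~{mirror:.0f} h, wide raidz3: ~{raidz / 24:.0f} days")
```

With these assumed inputs the mirrors come out around a day and a half per
vdev while the single wide raidz3 lands near 20 days, matching the order of
magnitude the post describes.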
So again: Not a problem if you're making your pool out of SSD's.
zfs-discuss mailing list