Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

Jim Klimov Wed, 21 Mar 2012 05:00:49 -0700

2012-03-21 7:16, MLR wrote:

I read the "ZFS_Best_Practices_Guide" and "ZFS_Evil_Tuning_Guide", and have some
questions:


  1. Cache device for L2ARC
      Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool
setup shouldn't we be reading at least at the 500MB/s read/write range? Why
would we want a ~500MB/s cache?


Basically, SSDs shine in random I/O. For example, the
(consumer-grade) 2TB disks in my home NAS yield up to 160MB/s
in linear reads, but drop to about 3MB/s in random reads,
occasionally bursting to 10-20MB/s for a short time.

The COW-based on-disk layout of ZFS is quite fragmented, so there
are many random seeks. Raw low-level performance gets hurt
as a tradeoff for reliability, and SSDs along with large
RAM buffers are ways to recover and boost the performance.
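
A rough back-of-the-envelope sketch of why random reads on a spinning
disk are so much slower than linear ones. The access time and read size
below are illustrative assumptions, not measurements from the NAS
mentioned above.

```python
def random_read_mb_per_s(access_ms, block_kb):
    """Estimate random-read throughput when every block costs one full access."""
    iops = 1000.0 / access_ms            # I/O operations per second
    return iops * block_kb / 1024.0      # KB/s -> MB/s


# Assumed values: ~12 ms per seek plus rotation, 32 KB read per I/O.
print(f"{random_read_mb_per_s(12.0, 32.0):.1f} MB/s")
```

With these assumed numbers the estimate lands in the low single-digit
MB/s range, the same order of magnitude as the random-read figure quoted
above.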

There is an especially large amount of metadata work when/if
you use deduplication - tens of gigabytes of RAM are recommended
for a decent-sized pool of a few TB.

  2. ZFS dynamically stripes along the top-most vdevs and that "performance for 1
vdev is equivalent to performance of one drive in that group". Am I correct in
thinking this means, for example, I have a single 14 disk raidz2 vdev zpool, the
disks will go ~100MB/s each , this zpool would theoretically read/write at
~100MB/s max (how about real world average?)? If this was RAID6 I think this
would go theoretically ~1.4GB/s, but in real life I am thinking ~1GB/s (aka 10x-
14x faster than zfs, and both provide the same amount of redundancy)? Is my
thinking off in the RAID6 or RAIDZ2 numbers?


I think your numbers are not right. They would make sense
for RAID0 of 14 drives though.

All correctly implemented synchronously-redundant schemes
must wait for all storage devices to complete writing, so
they are "not faster" than single devices during writes,
and due to bus contention, etc. are often a bit slower
overall.

Reads on the other hand can be parallelised on RAIDzN as
well as on RAID5/6 and can boost read performance like
striping more or less.

As for "same level of redundancy", many people would point
a finger at that statement: usual RAIDs don't have a
method to know which part of the array is faulty (i.e. when
one sector in a RAID stripe becomes corrupted, there is no
way to reliably reconstruct correct data, and often no quick
way to detect the corruption either). Many arrays depend on
timestamps of the component disks to detect stale data,
and can only recover well from full-disk failures.
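
A toy sketch of the point above: XOR parity alone can tell that a stripe
is inconsistent, but not which sector is wrong. A separately stored
checksum of the data lets each candidate repair be verified. This is an
illustration only, not the actual ZFS or RAID implementation; the
function names and the use of SHA-256 are assumptions made for the
example.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(data_blocks):
    """Checksum over the whole logical block, stored apart from the data."""
    return hashlib.sha256(b"".join(data_blocks)).hexdigest()


def find_and_repair(data_blocks, parity, expected_sum):
    """Rebuild each data sector from parity in turn until the checksum matches.

    Returns (index_of_bad_sector, repaired_blocks); the index is None if
    the data was already good.
    """
    if checksum(data_blocks) == expected_sum:
        return None, list(data_blocks)
    for i in range(len(data_blocks)):
        others = [b for j, b in enumerate(data_blocks) if j != i]
        candidate = list(data_blocks)
        candidate[i] = xor_blocks(others + [parity])
        if checksum(candidate) == expected_sum:
            return i, candidate
    raise ValueError("damage exceeds what single parity can repair")


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
good_sum = checksum(data)
damaged = [data[0], b"BxBB", data[2]]      # silent corruption of sector 1
print(find_and_repair(damaged, parity, good_sum))
```

Without `good_sum`, the only available fact is that the XOR of all
sectors is non-zero, which does not identify the damaged one.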

Why doesn't ZFS try to dynamically stripe inside vdevs (and if it is, is there
an easy to understand explanation why
a vdev doesn't read from multiple drives at once when requesting data, or why a
zpool wouldn't make N number of requests to a vdev with N being the number of
disks in that vdev)?


It does, to some extent. In RAID terms you can think of a
ZFS pool with several top-level devices, each made up of
several leaf devices, as implementing RAID50 or RAID60
to hold lots of "blocks".

There are "banks" (top-level vdevs, TLVDEVs) of disks in redundant arrays,
and these have block data (and redundancy blocks) striped
across sectors of different disks. A pool stripes (RAID0)
userdata across several TLVDEVs by storing different blocks
in different "banks". Loss of a whole TLVDEV is fatal, like
in RAID50.

ZFS has a variable step though, so depending on block size,
the block-stripe size within a TLVDEV can vary. For minimal
sized blocks on a raidz or raidz2 TLVDEV you'd have one or
two redundancy sectors and a data sector using two or three
disks only. Other "same-numbered" sectors of the other disks in
the TLVDEV can be used by another such stripe.
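
A simplified sketch of the variable stripe width described above: how
many data and parity sectors a block needs on a raidzN TLVDEV, and how
many disks its first row touches. It ignores allocation padding and any
other allocator details, and the function name and the 512-byte default
sector size are assumptions for the example.

```python
import math


def raidz_sectors(block_bytes, ndisks, nparity, sector_bytes=512):
    """Return (data_sectors, parity_sectors, disks_touched) for one block."""
    data = math.ceil(block_bytes / sector_bytes)
    # Each row of up to (ndisks - nparity) data sectors carries nparity parity sectors.
    rows = math.ceil(data / (ndisks - nparity))
    parity = nparity * rows
    disks_touched = min(ndisks, data + nparity)
    return data, parity, disks_touched


print(raidz_sectors(512, 6, 1))          # minimal block on a 6-disk raidz1
print(raidz_sectors(512, 6, 2))          # minimal block on a 6-disk raidz2
print(raidz_sectors(128 * 1024, 6, 2))   # 128 KB block on a 6-disk raidz2
```

The first two calls reproduce the "two or three disks only" case for
minimal-sized blocks; the last one spans the full width of the TLVDEV.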

There are nice illustrations in the docs and blogs regarding
the layout.

Note that redundancy disks are not used during normal reads
of uncorrupted data. However, I believe that there may be a
slight benefit from ZFS for smaller blocks which are not
using the whole raidzN array stripe, since parallel disks
can be used to read parts of different blocks. But the random
seeks involved in mechanical disks would probably make it
unnoticeable, and there's probably a lot of randomness in
storage of small blocks.


Since "performance for 1 vdev is equivalent to performance of one drive in that
group" it seems like the higher raidzN are not very useful. If you're using raidzN
you're probably looking for a lower than mirroring parity (aka 10%-33%), but if
you try to use raidz3 with 15% parity you're putting 20 HDDs in 1 vdev which is
terrible (almost unimaginable) if you're running at 1/20 the "ideal" performance.


There are several tradeoffs, and other people on the list can
explain them better (and did in the past - search the archives).
Mostly this concerns resilver times (how many disks are used to
rebuild another disk) and striping performance. There were also
some calculations regarding e.g. 10-disk sets: you can make two
raidz1 arrays or one raidz2 array. They give you the same usable
size (8 data disks), but the latter is deemed a lot more reliable.
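
The 10-disk comparison can be checked with a purely combinatorial
sketch: count which sets of simultaneously failed disks exceed the
parity of some TLVDEV. It deliberately ignores timing, resilver windows
and partial (per-block) errors, so it only illustrates the trend; the
function names are made up for the example.

```python
from itertools import combinations


def data_disks(vdevs):
    """vdevs is a list of (ndisks, nparity) tuples, one per TLVDEV."""
    return sum(n - p for n, p in vdevs)


def fatal_fraction(vdevs, failures):
    """Fraction of `failures`-disk loss combinations that destroy the pool."""
    disks = [v for v, (n, _) in enumerate(vdevs) for _ in range(n)]
    fatal = total = 0
    for combo in combinations(range(len(disks)), failures):
        total += 1
        lost = [0] * len(vdevs)
        for d in combo:
            lost[disks[d]] += 1
        # The pool dies if any TLVDEV loses more disks than it has parity.
        if any(lost[v] > vdevs[v][1] for v in range(len(vdevs))):
            fatal += 1
    return fatal / total


two_raidz1 = [(5, 1), (5, 1)]
one_raidz2 = [(10, 2)]
print(data_disks(two_raidz1), data_disks(one_raidz2))
print(fatal_fraction(two_raidz1, 2), fatal_fraction(one_raidz2, 2))
```

Both layouts have 8 data disks, but 20 of the 45 possible two-disk
failures are fatal for the pair of raidz1 sets, and none are for the
single raidz2.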

Basically, with mirroring you pay the most (2x-3x redundancy
for each disk) and get the best performance as well as best
redundancy. With raidzN you get more usable space on the
same disks, so it is cheaper, but at a greater hit to performance.

For many home users that does not matter. Say, your camera's
CF card can stream its photos at 10MB/s to save it into your home
storage box, so sustained 10 or 50MB/s of writes suffice for you.

One thing to note though is that with larger drives you get longer
times to just read in the whole drive trying to detect errors when
scrubbing - and this is something your system should proactively do.
This opens windows to multiple-drive errors, which can happen to
become unrecoverable (e.g. several hits to the same block exceeding
its redundancy level). With multi-TB disks it is recommended to
have at least 3-disk redundancy via 3-4-way mirrors or raidz3 or
in traditional systems "RAID7" or "RAID6.3" as some call it.

Apparently, spending 3 disks' worth of parity on a raidz3 array
only becomes reasonable once the array reaches a certain minimal
size (perhaps 8-10 disks overall).
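
To get a feel for how long it takes just to read a whole drive, here is
a trivial estimate. The sustained rates are assumptions; a real scrub
reads only allocated blocks and is often seek-bound, so actual times can
differ a lot in either direction.

```python
def full_read_hours(capacity_tb, mb_per_s):
    """Hours needed to read an entire disk at a sustained rate (decimal units)."""
    seconds = capacity_tb * 1e12 / (mb_per_s * 1e6)
    return seconds / 3600.0


for rate in (160, 100, 30):   # assumed MB/s: best case, typical, seek-heavy
    print(f"3TB at {rate} MB/s: {full_read_hours(3, rate):.1f} h")
```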



Main Question:
  3. I am updating my old RAID5 and want to reuse my old drives. I have 8 1.5TB
drives and buying new 3TB drives to fill up the rest of a 20 disk enclosure
(Norco RPC-4220); there is also 1 spare, plus the bootdrive so 22 total. I want
around 20%-25% parity. My system is like so:

Main Application: Home NAS
* Like to optimize max space with 20%(ideal) or 25% parity - would like 'decent'
reading performance
   - 'decent' being max of 10GigE Ethernet, right now it is only 1 gigabit 
Ethernet but hope to leave room to update in future if 10GigE becomes cheaper.
My RAID5 runs at ~500MB/s so was hoping to get at least above that with the 20
disk raid.


10GigE is a theoretical 1250MB/s. That might be achievable
for writes with mirrored disks and/or good fast caching (in
bursts or if your working set fits in the cache), but seems
unlikely with raidz sets.
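
The arithmetic behind the 1250MB/s figure, plus a naive count of how
many disks would have to stream in parallel to fill such a link. The
100MB/s per-disk rate is an assumption taken from the question, and the
count ignores protocol overhead, seeks and parity.

```python
import math


def link_mb_per_s(gbit_per_s):
    """Theoretical payload rate of a link, in decimal MB/s."""
    return gbit_per_s * 1e9 / 8 / 1e6


def streaming_disks_needed(target_mb_per_s, per_disk_mb_per_s):
    """Disks that must deliver data concurrently to reach the target rate."""
    return math.ceil(target_mb_per_s / per_disk_mb_per_s)


rate = link_mb_per_s(10)
print(rate, streaming_disks_needed(rate, 100))
```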

For reads caching would likewise help; disk speeds would be
good if you have written lots of data contiguously (so that
the disks won't have to seek too much and yield linear reads).

I am not ready to conjure up some numbers out of thin air now,
and hopefully someone else would reply to your main question
in detail.

I assume your other hardware won't be a bottleneck?
(PCI buses, disk controllers, RAM access, etc.)

* 16GB RAM


Not so much for ZFS advanced features - don't try dedup ;)
Also, remember that L2ARC indexing still needs some RAM to
reference the cached blocks. Reference size is constant
(about 200 bytes per block), but due to varying block size
the ratio (GB of RAM => GB of L2ARC) can be different and
depends on your usage. In particular, for DEDUP the ratio
is very bad, about 2x (a dedup-table entry is about twice
as large as the reference to it from RAM ARC to L2ARC).
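
A small sketch of how the "about 200 bytes per block" figure turns into
a RAM requirement. The L2ARC size and average block sizes are made-up
example inputs; only the 200-byte reference size comes from the text
above.

```python
def l2arc_header_ram_gb(l2arc_gb, avg_block_kb, header_bytes=200):
    """RAM (GiB) needed to reference every block held in an L2ARC device."""
    blocks = l2arc_gb * 1024**3 / (avg_block_kb * 1024)
    return blocks * header_bytes / 1024**3


for block_kb in (8, 128):   # assumed average block sizes
    ram = l2arc_header_ram_gb(120, block_kb)
    print(f"120GB L2ARC, {block_kb}KB blocks: ~{ram:.2f} GB RAM")
```

Small blocks make the ratio much worse: with these inputs the 8KB case
needs roughly 16 times the RAM of the 128KB case for the same cache
device.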

* Open to using ZIL/L2ARC, but, left out for now: writing doesn't occur much
(~7GB a week, maybe a big burst every couple months), and don't really read same
data multiple times.


Dedicated fast and reliable (e.g. mirrored SSD or RAMDrive)
ZIL would help if you have synchronous writes. For example -
compilation of large projects creating many files, especially
over NFS.

ZIL is a rather specific investment, so it might not help you
at all, and ideally it is a write-only device (read in only
after crashes). So for SSDs you should expect a lot of wear,
and aim for a mirror of SLC devices. Or RAM disks. Or maybe
small dedicated HDDs to offload the write-seeks from main pool
(that last idea is often argued for/against)...


What would be the best setup? I'm thinking one of the following:
     a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)?
(~200MB/s reading, best reliability)
     b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)? (~400MB/s
reading, evens out size across vdevs)
     c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s
reading, maximize vdevs for performance)

I am leaning towards "a." since I am thinking "raidz3"+"raidz2" should provide a
little more reliability than 5 "raidz1"'s, but, worried that the real world
read/write performance will be low (theoretical is ~200MB/s, and, since the 2nd
vdev is 3x the size as the 1st, I am probably looking at more like 133MB/s?).
The 12 disk array is also above the "9 disk group max" recommendation in the
Best Practices guide, so not sure if this affects read performance (if it is
just resilver time I am not as worried about it as long it isn't like 3x
longer)?
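
For reference, the raw capacity, usable capacity and parity share of the
three layouts a, b and c can be tallied with a short sketch. It uses
nominal decimal TB and ignores metadata and allocation overhead, so the
numbers are upper bounds; the function name is made up for the example.

```python
def layout_summary(vdevs):
    """vdevs: list of (ndisks, disk_tb, nparity). Returns (raw, usable, parity_share)."""
    raw = sum(n * tb for n, tb, _ in vdevs)
    usable = sum((n - p) * tb for n, tb, p in vdevs)
    return raw, usable, 1 - usable / raw


options = {
    "a": [(8, 1.5, 2), (12, 3.0, 3)],
    "b": [(8, 1.5, 2)] + [(4, 3.0, 1)] * 3,
    "c": [(4, 1.5, 1)] * 2 + [(4, 3.0, 1)] * 3,
}

for name, vdevs in options.items():
    raw, usable, share = layout_summary(vdevs)
    print(f"{name}: raw {raw:.0f}TB, usable {usable:.0f}TB, parity {share:.0%}")
```

All three layouts come out at 48TB raw, 36TB usable and 25% parity, so
they differ in reliability and performance rather than in space.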



One thing to note is that many people would not recommend using
an "unbalanced" ZFS array - one expanded by adding a TLVDEV after
many writes, or one consisting of differently-sized TLVDEVs.

ZFS does a rather good job of trying to use available storage
most efficiently, but it was often reported that it hits some
algorithmic bottleneck when one of the TLVDEVs is about 80-90%
full (even if others are new and empty). Blocks are balanced
across TLVDEVs on write, so your old data is not magically
redistributed until you explicitly rewrite it (e.g. zfs send
or rsync into another dataset on this pool).

So I'd suggest that you keep your disks separate, with two
pools made from 1.5TB disks and from 3TB disks, and use these
pools for different tasks (e.g. a working set with relatively
high turnover and fragmentation, and WORM static data with
little fragmentation and high read performance).
Also this would allow you to more easily upgrade/replace the
whole set of 1.5TB disks when the time comes.

Note that the two disk types can also have other different
characteristics, most notably the native sector size (4KB vs.
512B). You might expose your pool to a hit in reliability and
performance if you used the 4KB-sectored disks with emulated
512B sectors as a 512B-sectored disk, however you'd gain some
more usable space in exchange. You don't have these negative
hits when you use a native 512B disk as a 512B disk.
It is likely that when you decide to replace the 1.5TB disks,
all those available on the market would be 4KB-sectored, so
in-place replacement of disks (replacing pool disks one by one
and resilvering) would be a bad option, IF your 1.5TB disks
have native 512B sectors and you use them as such in the pool.
If interested, read up more on "ashift=9 vs. ashift=12" issues
in ZFS.
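
The ashift value is just the base-2 logarithm of the sector size the
pool assumes, which also sets the smallest unit of allocation. A small
sketch of that relation and of the space tradeoff mentioned above (the
helper names are made up for the example, and parity and padding are
ignored):

```python
def ashift_for(sector_bytes):
    """Return the ashift matching a power-of-two sector size."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1


def allocated_bytes(block_bytes, ashift):
    """Round a block up to whole sectors of size 2**ashift."""
    sector = 1 << ashift
    return -(-block_bytes // sector) * sector


print(ashift_for(512), ashift_for(4096))
print(allocated_bytes(1000, 9), allocated_bytes(1000, 12))
```

A 1000-byte block occupies 1024 bytes with ashift=9 but a full 4096
bytes with ashift=12, which is where the extra usable space of the
smaller setting comes from.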


I guess I'm hoping "a." really isn't ~200MB/s hehe, if it is I'm leaning towards
"b.", but, if so, all three are downgrades from my initial setup read
performance wise -_-.

Is someone able to correct my understanding if some of my numbers are off, or
would someone have a better raidzN configuration I should consider? Thanks for
any help.


Again, I hope someone else would correctly suggest the setup
for your numbers. I'm somewhat more successful with theory now ;(

HTH,
//Jim Klimov
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
