Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-25 Thread Jeff Bacon
 In general, mixing SATA and SAS directly behind expanders (e.g. without
 SAS/SATA interposers) seems to be bad juju that an OS can't fix.

In general I'd agree. Just mixing them on the same box can be problematic,
I've noticed - though I think as much as anything that the firmware
on the 3G/s expanders just isn't as well-tuned as the firmware
on the 6G/s expanders, and for all I know there's a firmware update
that will make things better. 

SSDs seem to be an exception, however. Several boxes have a mix of 
Crucial C300, OCZ Vertex Pro, and OCZ Vertex-3 SSDs for the usual
purposes on the expander with the constellations, or in one case, 
Cheetah 15ks. One box has SSDs and Cheetah 15ks/constellations on the same
expander under massive loads - the aforementioned box suffering from
80k ZIO queues - with nary a blip. (The SSDs are swap drives, and
we were force-swapping processes out to disk as part of task management.
Meanwhile, the Java processes are doing batch import processing using
the Cheetahs as staging area, so those two expanders are under constant
heavy load. Yes that is as ugly as it sounds, don't ask, and don't do
this yourself. This is what happens when you develop a database without
clear specs and have to just throw hardware underneath it guessing
all the way. But to give you an idea of the load they were/are under.) 

The SSDs were chosen with an eye towards expander-friendliness, and tested
relatively extensively before use. YMMV of course, and this is not the place
to skimp with a-data or Kingston; buy what Anand says to buy and you
seem to do very well.

I would say, never do it on LSI 3G/s expanders. Be careful with using 
SATA spindles. Test the hell out of any SSD you use first. But you seem
to be able to get away with the better consumer-class SATA SSDs. 
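
(One cheap sanity check while doing that, at least on Solaris: hammer the
candidate SSD behind the expander for a while, then see whether the error
counters against it are climbing - a rough sketch, nothing more:

    # per-device soft/hard/transport error counters
    iostat -En

Transport errors that keep incrementing under load are usually the first
sign of an expander that doesn't like the drive.)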

(I realize that many here would say that if you are going to use
SSD in an enterprise config, you shouldn't be messing with anything
short of Deneva or the SAS-based SSDs. I'd say there are simply
a bunch of caveats with the consumer MLC SSDs in such situations
to consider and if you are very clear about them up front, then 
they can be just fine. 

I suspect the real difficulty in these situations is in having
a management chain that is capable of both grokking the caveats up
front and remembering that they agreed to them when something
does go wrong. :)   As in this case I am the management chain,
it's not an issue. This is of course not the usual case.) 

-bacon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-25 Thread Richard Elling
On Mar 25, 2012, at 6:26 AM, Jeff Bacon wrote:

 In general, mixing SATA and SAS directly behind expanders (e.g. without
 SAS/SATA interposers) seems to be bad juju that an OS can't fix.
 
 In general I'd agree. Just mixing them on the same box can be problematic,
 I've noticed - though I think as much as anything that the firmware
 on the 3G/s expanders just isn't as well-tuned as the firmware
 on the 6G/s expanders, and for all I know there's a firmware update
 that will make things better. 

I haven't noticed a big difference in the expanders; does anyone else see
issues with 6G expanders?

 SSDs seem to be an exception, however. Several boxes have a mix of 
 Crucial C300, OCZ Vertex Pro, and OCZ Vertex-3 SSDs for the usual
 purposes on the expander with the constellations, or in one case, 
 Cheetah 15ks. One box has SSDs and Cheetah 15ks/constellations on the same
 expander under massive loads - the aforementioned box suffering from
 80k ZIO queues - with nary a blip. (The SSDs are swap drives, and
 we were force-swapping processes out to disk as part of task management.
 Meanwhile, the Java processes are doing batch import processing using
 the Cheetahs as staging area, so those two expanders are under constant
 heavy load. Yes that is as ugly as it sounds, don't ask, and don't do
 this yourself. This is what happens when you develop a database without
 clear specs and have to just throw hardware underneath it guessing
 all the way. But to give you an idea of the load they were/are under.) 

Sometime over beers we can trade war stories... many beers... :-)

 
 The SSDs were chosen with an eye towards expander-friendliness, and tested
 relatively extensively before use. YMMV of course, and this is not the place
 to skimp with a-data or Kingston; buy what Anand says to buy and you
 seem to do very well.

Yes. Be aware that companies like Kingston rebadge drives from other,
reputable suppliers. And some reputable suppliers have less-than-perfect
models.

 
 I would say, never do it on LSI 3G/s expanders. Be careful with using 
 SATA spindles. Test the hell out of any SSD you use first. But you seem
 to be able to get away with the better consumer-class SATA SSDs. 
 
 (I realize that many here would say that if you are going to use
 SSD in an enterprise config, you shouldn't be messing with anything
 short of Deneva or the SAS-based SSDs. I'd say there are simply
 a bunch of caveats with the consumer MLC SSDs in such situations
 to consider and if you are very clear about them up front, then 
 they can be just fine. 
 
 I suspect the real difficulty in these situations is in having
 a management chain that is capable of both grokking the caveats up
 front and remembering that they agreed to them when something
 does go wrong. :)   As in this case I am the management chain,
 it's not an issue. This is of course not the usual case.) 

We'd like to think that given the correct information, reasonable people will
make the best choice.  And then there are PHBs.
 -- richard

--
DTrace Conference, April 3, 2012, 
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-24 Thread Jeff Bacon
 2012-03-21 16:41, Paul Kraus wrote:
   I have been running ZFS in a mission critical application since
  zpool version 10 and have not seen any issues with some of the vdevs
  in a zpool full while others are virtually empty. We have been running
  commercial Solaris 10 releases. The configuration was that each
 
 Thanks for sharing some real-life data from larger deployments,
 as you often did. That's something I don't often have access
 to nowadays, with a liberty to tell :)

Here's another datapoint, then: 

I'm using sol10u9 and u10 on a number of supermicro boxes,
mostly X8DTH boards with LSI 9211/9208 controllers and E5600 CPUs.
Application is NFS file service to a bunch of clients, and 
we also have an in-house database application written in Java
which implements a column-oriented db in files. Just about all
of it is raidz2, much of it running gzip-compressed.

Since I can't find anything saying not to - other than some common
wisdom about not putting all your eggs in one basket, which I'm
choosing to reject in some cases - I just keep adding vdevs to
the pool. We started with 2TB barracudas for dev/test/archive
usage and constellations for prod, now 3TB drives, and have just
added some of the new Pipeline drives, with nothing particularly
of interest to note therefrom.
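
For anyone following along, the mechanics of that kind of growth are just
repeated zpool add - roughly like this, with made-up device names standing
in for the real layout:

    # initial pool: one 7-drive raidz2 top-level vdev, gzip compression
    zpool create srv raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0
    zfs set compression=gzip srv
    # each later expansion appends another top-level raidz2 vdev
    zpool add srv raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0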

You can create a startlingly large pool this way:

ny-fs7(68)% zpool list
NAME   SIZE  ALLOC   FREE   CAP  HEALTH  ALTROOT
srv    177T   114T  63.3T   64%  ONLINE  -

Most pools are smaller. This is an archive box that's also
the guinea pig: 12 vdevs of 7-drive raidz2. The largest prod
one is 130TB in 11 vdevs of 8-drive raidz2; I won't guess
at the mix of 2TB and 3TB. These are both sol10u9.
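
(If you want to see how evenly that has spread out, zpool iostat -v breaks
the allocated/free numbers out per top-level vdev and per disk:

    zpool iostat -v srv
)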

Another box has 150TB in 6 pools, raidz2/gzip using 2TB
constellations, dual X5690s with 144GB RAM running 20-30
Java db workers. We do manage to break this box on the
odd occasion - there's a race condition in the ZIO code 
where a buffer can be freed while the block buffer is in
the process of being loaned out to the compression code.
However, it takes 600 zpool threads plus another 600-900
java threads running at the same time with a backlog of 
80k ZIOs in queue, so it's not the sort of thing that
anyone's likely to run across much. :) It's fixed
in sol11, I understand; however, our intended fix is
to split the whole thing so that the workload (which
for various reasons needs to be on one box) is moved
to a 4-socket Westmere, and all of the data pools
are served via NFS from other boxes. 

I did lose some data once, long ago, using LSI 1068-based 
controllers on older kit, but pretty much I can attribute
that to something between me-being-stupid and the 1068s
really not being especially friendly towards the LSI
expander chips in the older 3Gb/s SMC backplanes when used
for SATA-over-SAS tunneling. The current arrangements 
are pretty solid otherwise. 

The SATA-based boxes can be a little cranky when a drive
toasts, of course - they sit and hang for a while until they
finally decide to offline the drive. We take that as par
for the course; for the application in question (basically,
storing huge amounts of data on the odd occasion that someone
has a need for it), it's not exactly a showstopper.
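
(When we get tired of waiting for the timeout dance, kicking the drive out
by hand is an option - the usual commands, with placeholder device names:

    zpool offline srv c3t5d0          # stop sending I/O to the sick drive
    zpool replace srv c3t5d0 c3t9d0   # resilver onto its replacement
)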


I am curious as to whether there is any practical upper limit
on the number of vdevs, or how far one might push this kind of
configuration in terms of pool size - assuming a sufficient
quantity of RAM, of course. I'm sure I will need to
split this up someday, but for the application there's just
something hideously convenient about leaving it all in one
filesystem in one pool.


-bacon

 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-24 Thread Richard Elling
Thanks for sharing, Jeff!
Comments below...

On Mar 24, 2012, at 4:33 PM, Jeff Bacon wrote:

 2012-03-21 16:41, Paul Kraus wrote:
 I have been running ZFS in a mission critical application since
 zpool version 10 and have not seen any issues with some of the vdevs
 in a zpool full while others are virtually empty. We have been running
 commercial Solaris 10 releases. The configuration was that each
 
 Thanks for sharing some real-life data from larger deployments,
 as you often did. That's something I don't often have access
 to nowadays, with a liberty to tell :)
 
 Here's another datapoint, then: 
 
 I'm using sol10u9 and u10 on a number of supermicro boxes,
 mostly X8DTH boards with LSI 9211/9208 controllers and E5600 CPUs.
 Application is NFS file service to a bunch of clients, and 
 we also have an in-house database application written in Java
 which implements a column-oriented db in files. Just about all
 of it is raidz2, much of it running gzip-compressed.
 
 Since I can't find anything saying not to other than some common
 wisdom about not putting your eggs all in one basket that I'm
 choosing to reject in some cases, I just keep adding vdevs to
 the pool. started with 2TB barracudas for dev/test/archive
 usage and constellations for prod, now 3TB drives, have just
 added some of the new Pipeline drives with nothing particularly
 of interest to note therefrom. 
 
 You can create a startlingly large pool this way:
 
 ny-fs7(68)% zpool list
 NAME   SIZE  ALLOC   FREE   CAP  HEALTH  ALTROOT
 srv    177T   114T  63.3T   64%  ONLINE  -
 
 most pools are smaller. this is an archive box that's also
 the guinea pig, 12 vdevs of 7 drives raidz2. the largest prod
 one is 130TB in 11 vdevs of 8 drives raidz2. I won't guess
 at the mix of 2TB and 3TB. these are both sol10u9. 
 
 Another box has 150TB in 6 pools, raidz2/gzip using 2TB
 constellations, dual X5690s with 144GB RAM running 20-30
 Java db workers. We do manage to break this box on the
 odd occasion - there's a race condition in the ZIO code 
 where a buffer can be freed while the block buffer is in
 the process of being loaned out to the compression code.
 However, it takes 600 zpool threads plus another 600-900
 java threads running at the same time with a backlog of 
 80k ZIOs in queue, so it's not the sort of thing that
 anyone's likely to run across much. :) It's fixed
 in sol11, I understand; however, our intended fix is
 to split the whole thing so that the workload (which
 for various reasons needs to be on one box) is moved
 to a 4-socket Westmere, and all of the data pools
 are served via NFS from other boxes. 
 
 I did lose some data once, long ago, using LSI 1068-based 
 controllers on older kit, but pretty much I can attribute
 that to something between me-being-stupid and the 1068s
 really not being especially friendly towards the LSI
 expander chips in the older 3Gb/s SMC backplanes when used
 for SATA-over-SAS tunneling. The current arrangements 
 are pretty solid otherwise. 

In general, mixing SATA and SAS directly behind expanders (e.g. without
SAS/SATA interposers) seems to be bad juju that an OS can't fix.

 
 The SATA-based boxes can be a little cranky when a drive
 toasts, of course - they sit and hang for a while until they
 finally decide to offline the drive. We take that as par
 for the course; for the application in question (basically,
 storing huge amounts of data on the odd occasion that someone
 has a need for it), it's not exactly a showstopper.
 
 
 I am curious as to whether there is any practical upper-limit
 on the number of vdevs, or how far one might push this kind of
 configuration in terms of pool size - assuming a sufficient
 quantity of RAM, of course I'm sure I will need to 
 split this up someday but for the application there's just
 something hideously convenient about leaving it all in one
 filesystem in one pool. 

I've run pools with > 100 top-level vdevs. It is not uncommon to see
40+ top-level vdevs.
 -- richard

--
DTrace Conference, April 3, 2012, 
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-22 Thread Jim Klimov

2012-03-21 22:53, Richard Elling wrote:
...

This is why a single
vdev's random-read performance is equivalent to the random-read
performance of
a single drive.


It is not as bad as that. The actual worst case number for a HDD with
zfs_vdev_max_pending
of one is:
average IOPS * ((D+P) / D)
where,
D = number of data disks
P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
total disks per set = D + P


I wrote in this thread that AFAIK for small blocks (i.e. 1-sector
size worth of data) there would be P+1 sectors used to store the
block, which is an even worse case at least capacity-wise, as well
as impacting fragmentation (and thus seeks), but might occasionally allow
parallel reads of different objects (tasks running on disks not
involved in storage of the one data sector and maybe its parities
when required).

Is there any truth to this picture?

Were there any research or tests regarding storage of many small
files (1-sector sized or close to that) on different vdev layouts?
I believe that such files would use a single-sector-sized set of
indirect blocks (dittoed at least twice), so one single-sector
sized file would use at least 9 sectors in raidz2.

Thanks :)



We did many studies that verified this. More recent studies show
zfs_vdev_max_pending
has a huge impact on average latency of HDDs, which I also described in
my talk at
OpenStorage Summit last fall.


What about drives without (a good implementation of) NCQ/TCQ/whatever?
Does ZFS in-kernel caching, queuing and sorting of pending requests
provide a similar service? Is it controllable with the same switch?

Or, alternatively, is it a kernel-only feature which does not depend
on hardware *CQ? Are there any benefits to disks with *CQ then? :)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-22 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Paul Kraus
 
  There are two different cases here... resilver to reconstruct
 data from a failed drive and a scrub to pro-actively find bad sectors.
 
  The best case situation for the first case (bad drive
 replacement) is a mirrored drive in my experience. In that case only
 the data involved in the failure needs to be read and written. I am

During resilver, all the data in the vdev must be read and reconstructed to
write the new disk.  Notice I said vdev.  If you have a pool made of a
single vdev, then it means all the data in your pool.  However if you have a
pool made of a million vdev's, then ~ one millionth of the pool must be
read.  If you configured your pool using mirrors instead of raidz, then you
have minimized the size of your vdev's, and maximized the IOPS you're able
to perform *per* vdev.  So mirrors resilver many times faster than raidz,
but still, mirrors in my experience resilver ~ 10x slower than blindly
reading and writing the entire disk serially.  In my experience, hardware raid
resilver takes a couple or a few hours (divide total disk size by total
sustainable throughput), while zfs mirror resilver takes a dozen hours, or a
day or two (lots of random IO).  While raidz takes several days, if not
multiple weeks to resilver.  Of course all this is variable and dependent on
both your data and usage patterns. 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-22 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of maillist reader
 
 I read though that ZFS does not have a defragmentation tool, is this
still the
 case? 

True.


 It would seem with such a performance difference between
 sequential reads and random reads for raidzN's, a defragmentation tool
 would be very high on ZFS's TODO list ;).

It is high on the todo list, and in fact a lot of other useful stuff is
dependent on the same code, so when/if implemented, it will enable a lot of
new features, where defrag is just one such new feature.

However, there's a very difficult decision regarding *what* you count as
defragmentation.  (Not to mention, a lot of work to be done.)  The goal of
defrag is to align data on disks serially so as to maximize the useful speed
of the disks.  Unfortunately, there are some really big competing demands -
where data is read in different orders.

For example, the traditional perception of defrag would align disk blocks of
individual files.  Thus, when you later return to read those files
sequentially, you would have maximum performance.  But that's not the same
order of data read as compared to scrub/resilver/zfs send.
Scrub/resilver/zfs send operate in (at least approximate) temporal order.  

So if you defrag at a file level, you hurt the performance of
scrub/resilver/send.  If you defrag at the temporal pool level (which is the
default position, current behavior) you hurt performance of file operations.
Pick your poison.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-22 Thread Richard Elling
On Mar 22, 2012, at 3:03 AM, Jim Klimov wrote:

 2012-03-21 22:53, Richard Elling wrote:
 ...
 This is why a single
 vdev's random-read performance is equivalent to the random-read
 performance of
 a single drive.
 
 It is not as bad as that. The actual worst case number for a HDD with
 zfs_vdev_max_pending
 of one is:
 average IOPS * ((D+P) / D)
 where,
 D = number of data disks
 P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
 total disks per set = D + P
 
 I wrote in this thread that AFAIK for small blocks (i.e. 1-sector
 size worth of data) there would be P+1 sectors used to store the
 block, which is an even worse case at least capacity-wise, as well
 as impacting fragmentation = seeks, but might occasionally allow
 parallel reads of different objects (tasks running on disks not
 involved in storage of the one data sector and maybe its parities
 when required).
 
 Is there any truth to this picture?

Yes, but it is a rare case for 512b sectors. It could be more common for 4KB
sector disks when ashift=12. However, in that case the performance increases
to the equivalent of mirroring, so there are some benefits.

FWIW, some people call this RAID-1E

 
 Were there any research or tests regarding storage of many small
 files (1-sector sized or close to that) on different vdev layouts?

It is not a common case, so why bother?

 I believe that such files would use a single-sector-sized set of
 indirect blocks (dittoed at least twice), so one single-sector
 sized file would use at least 9 sectors in raidz2.

No. You can't account for the metadata that way. Metadata space is not 1:1 with
data space. Metadata tends to get written in 16KB chunks, compressed.

 
 Thanks :)
 
 
 We did many studies that verified this. More recent studies show
 zfs_vdev_max_pending
 has a huge impact on average latency of HDDs, which I also described in
 my talk at
 OpenStorage Summit last fall.
 
 What about drives without (a good implementation of) NCQ/TCQ/whatever?

All HDDs I've tested suck. The form of the suckage is that the number of IOPS
stays relatively constant, but the average latency increases dramatically.  This
makes sense, due to the way elevator algorithms work.

 Does ZFS in-kernel caching, queuing and sorting of pending requests
 provide a similar service? Is it controllable with the same switch?

There are many caches at play here, with many tunables. The analysis doesn't
fit in an email.

 
 Or, alternatively, is it a kernel-only feature which does not depend
 on hardware *CQ? Are there any benefits to disks with *CQ then? :)

Yes, SSDs with NCQ work very well.
 -- richard

--
DTrace Conference, April 3, 2012, 
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-22 Thread Jim Klimov

2012-03-22 20:52, Richard Elling wrote:

Yes, but it is a rare case for 512b sectors.

 It could be more common for 4KB sector disks when ashift=12.
...

Were there any research or tests regarding storage of many small
files (1-sector sized or close to that) on different vdev layouts?


It is not a common case, so why bother?


I think that a certain Bob F. would disagree, especially
when larger native sectors and ashift=12 come into play.
Namely, one scenario where this is important is automated
storage of thumbnails for websites, or some similar small
objects in vast amounts.

I agree that hordes of 512b files would be rare; 4kb-sized
files (or a bit larger - 2-3 userdata sectors) are a lot
more probable ;)




I believe that such files would use a single-sector-sized set of
indirect blocks (dittoed at least twice), so one single-sector
sized file would use at least 9 sectors in raidz2.


No. You can't account for the metadata that way. Metadata space is not
1:1 with
data space. Metadata tends to get written in 16KB chunks, compressed.


I purposely made an example of single-sector-sized files.
The way I get it (maybe wrong though), the tree of indirect
blocks (dnode?) for a file is stored separately from other
similar objects. While different L0 blkptr_t objects (BPs)
parented by the same L1 object are stored as a single
block on disk (128 BPs sized 128 bytes each = 16kb), further
made redundant and ditto-copied, I believe that L0 BPs from
different files are stored in separate blocks - as well
as L0 BPs parented by different L1 BPs from different
byterange stretches of the same file. Likewise for other
layers of L(N+1) pointers if the file is sufficiently
large (in amount of blocks used to write it).

The BP tree for a file is itself an object for a ZFS dataset,
individually referenced (as inode number) and there's a
pointer to its root from the DMU dnode of the dataset.

If the above rant is true, then the single-block file should
have a single L0 blkptr playing as its whole indirect tree
of block pointers, and that L0 would be stored in a dedicated
block (not shared with other files' BPs), inflated by ditto
copies=2 and raidz/mirror redundancy.

Right/wrong?

Thanks,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-22 Thread Bob Friesenhahn

On Thu, 22 Mar 2012, Jim Klimov wrote:


I think that a certain Bob F. would disagree, especially
when larger native sectors and ashist=12 come into play.
Namely, one scenario where this is important is automated
storage of thumbnails for websites, or some similar small
objects in vast amounts.


I don't know about that Bob F. but this Bob F. just took a look and 
noticed that thumbnail files for full-color images are typically 4KB 
or a bit larger.  Low-color thumbnails can be much smaller.


For a very large photo site, it would make sense to replicate just the 
thumbnails across a number of front-end servers and put the larger 
files on fewer storage servers because they are requested much less 
often and stream out better.  This would mean that those front-end 
thumbnail servers would primarily contain small files.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread Jim Klimov

2012-03-21 7:16, MLR wrote:

I read the ZFS_Best_Practices_Guide and ZFS_Evil_Tuning_Guide, and have some
questions:

  1. Cache device for L2ARC
  Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool
setup shouldn't we be reading at least at the 500MB/s read/write range? Why
would we want a ~500MB/s cache?


Basically, SSDs shine best in random IOs. For example, my
(consumer-grade) 2TB disks in a home NAS yield up to 160MB/s
in linear reads, but drop to about 3MB/s in random performance,
occasionally bursting 10-20MB/s for a short time.

ZFS COW-based data structure is quite fragmented, so there
are many random seeks. Raw low-level performance gets hurt
as a tradeoff for reliability, and SSDs along with large
RAM buffers are ways to recover and boost the performance.

There is an especially large amount of work with metadata when/if you
use deduplication - tens of gigabytes of RAM are recommended
for a decent-sized pool of a few TB.
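
(If you are tempted by dedup anyway, zdb can estimate the dedup table
before you turn it on - something like the following, which simulates
dedup over the existing data and prints a histogram; it can run for a
long while on a big pool:

    zdb -S poolname
)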


  2. ZFS dynamically strips along the top-most vdev's and that performance for 
1
vdev is equivalent to performance of one drive in that group. Am I correct in
thinking this means, for example, I have a single 14 disk raidz2 vdev zpool, the
disks will go ~100MB/s each , this zpool would theoretically read/write at
~100MB/s max (how about real world average?)? If this was RAID6 I think this
would go theoretically ~1.4GB/s, but in real life I am thinking ~1GB/s (aka 10x-
14x faster than zfs, and both provide the same amount of redundancy)? Is my
thinking off in the RAID6 or RAIDZ2 numbers?


I think your numbers are not right. They would make sense
for RAID0 of 14 drives though.

All correctly implemented synchronously-redundant schemes
must wait for all storage devices to complete writing, so
they are not faster than single devices during writes,
and due to bus contention, etc. are often a bit slower
overall.

Reads on the other hand can be parallelised on RAIDzN as
well as on RAID5/6 and can boost read performance like
striping more or less.

As for the same level of redundancy, many people would point out
that usual RAIDs don't have a
method to know which part of the array is faulty (i.e. when
one sector in a RAID stripe becomes corrupted, there is no
way to certainly reconstruct correct data, and often no quick
way to detect the corruption either). Many arrays depend on
timestamps of the component disks so as to detect stale data,
and can only recover well from full-disk failures.

 Why doesn't ZFS try to dynamically

strip inside vdevs (and if it is, is there an easy to understand explanation why
a vdev doesn't read from multiple drives at once when requesting data, or why a
zpool wouldn't make N number of requests to a vdev with N being the number of
disks in that vdev)?


That it does, somewhat. In RAID terms you can think of a
ZFS pool with several top-level devices each made up from
several leaf devices, as implementing RAID50 or RAID60,
to contain lots of blocks.

There are banks (TLVDevs) of disks in redundant arrays,
and these have block data (and redundancy blocks) striped
across sectors of different disks. A pool stripes (RAID0)
userdata across several TLVDEVs by storing different blocks
in different banks. Loss of a whole TLVDEV is fatal, like
in RAID50.

ZFS has a variable step though, so depending on block size,
the block-stripe size within a TLVDEV can vary. For minimal
sized blocks on a raidz or raidz2 TLVDEV you'd have one or
two redundancy sectors and a data sector using two or three
disks only. Other same-numbered sectors of other disk in
the TLVDEV can be used by another such stripe.

There are nice illustrations in the docs and blogs regarding
the layout.

Note that redundancy disks are not used during normal reads
of uncorrupted data. However, I believe that there may be a
slight benefit from ZFS for smaller blocks which are not
using the whole raidzN array stripe, since parallel disks
can be used to read parts of different blocks. But the random
seeks involved in mechanical disks would probably make it
unnoticeable, and there's probably lot of randomness in
storage of small blocks.



Since performance for 1 vdev is equivalent to performance of one drive in that
group it seems like the higher raidzN are not very useful. If your using raidzN
your probably looking for a lower than mirroring parity (aka 10%-33%), but if
you try to use raidz3 with 15% parity your putting 20 HDDs in 1 vdev which is
terrible (almost unimaginable) if your running at 1/20 the ideal performance.


There are several tradeoffs, and other people on the list can
explain them better (and did in the past - search the archives).
Mostly this regards resilver times (how many disks are used to
rebuild another disk) and striping performance. There were also
some calculations regarding i.e. 10-disk sets: you can make two
raidz1 arrays or one raidz2 array. They give you same userspace
sizes (8 data disks), but the latter is deemed a lot more reliable.


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread Paul Kraus
On Wed, Mar 21, 2012 at 7:56 AM, Jim Klimov jimkli...@cos.ru wrote:
 2012-03-21 7:16, MLR wrote:

 One thing to note is that many people would not recommend using
 a disbalanced ZFS array - one expanded by adding a TLVDEV after
 many writes, or one consisting of differently-sized TLVDEVs.

 ZFS does a rather good job of trying to use available storage
 most efficiently, but it was often reported that it hits some
 algorithmic bottleneck when one of the TLVDEVs is about 80-90%
 full (even if others are new and empty). Blocks are balanced
 across TLVDEVs on write, so your old data is not magically
 redistributed until you explicitly rewrite it (i.e. zfs send
 or rsync into another dataset on this pool).

I have been running ZFS in a mission critical application since
zpool version 10 and have not seen any issues with some of the vdevs
in a zpool full while others are virtually empty. We have been running
commercial Solaris 10 releases. The configuration was that each
business unit had a separate zpool consisting of mirrored pairs of 500
GB LUNs from SAN based storage. Each zpool started with enough storage
for that business unit. As each business unit filled their space, we
added additional mirrored pairs of LUNs. So the smallest unit had one
mirror vdev and the largest had 13 vdevs. In the case of the two
largest (13 and 11 vdevs) most of the vdevs were well above 90%
utilized and there were 2 or 3 almost empty vdevs. We never saw any
reliability issues with this condition. In terms of performance, the
storage was NOT our performance bottleneck, so I do not know if there
were any performance issue with this situation.

 So I'd suggest that you keep your disks separate, with two
 pools made from 1.5Tb disks and from 3Tb disks, and use these
 pools for different tasks (i.e. a working set with relatively
 high turnaround and fragmentation, and WORM static data with
 little fragmentation and high read performance).
 Also this would allow you to more easily upgrade/replace the
 whole set of 1.5Tb disks when the time comes.

I have never tried mixing drives of different size or performance
characteristic in the same zpool or vdev, except as a temporary
migration strategy. You already know that growing a RAIDz vdev is
currently impossible, so with a RAIDz strategy your only option for
growth is to add complete RAIDz vdevs, and you _want_ those to match
in terms of performance or you will have unpredictable performance.
For situations where you _might_ want to grow the data capacity in the
future I recommend mirrors, but ... and Richard Elling posted hard
data on this to the list a while back, to get the reliability of
RAIDz2 you need more than a 2-way mirror. In my mind, the larger the
amount of data (and size of drives) the _more_ reliability you need.

We are no longer using the configuration described above. The
current configuration is five JBOD chassis of 24 drives each. We have
22 vdevs, each a RAIDz2 consisting of one drive from each chassis and
10 hot spares. Our priority was reliability followed by capacity and
performance. If we could have, we would have just used 3 or 4 way
mirrors, but we needed more capacity than that provided. I note that
in pre-production testing we did have two of the five JBOD chassis go
offline at once and did not lose _any_ data. The total pool size is
about 40 TB.

We also have a redundant copy of the data on a remote system. That
system only has two JBOD chassis and capacity  is the priority. The
zpool consists of two vdevs each a RAIDz2 of 23 drives and two hot
spares. The performance is dreadful, but we _have_ the data in case of
a real disaster.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, Troy Civic Theatre Company
- Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread Jim Klimov

2012-03-21 16:41, Paul Kraus wrote:

 I have been running ZFS in a mission critical application since
zpool version 10 and have not seen any issues with some of the vdevs
in a zpool full while others are virtually empty. We have been running
commercial Solaris 10 releases. The configuration was that each


Thanks for sharing some real-life data from larger deployments,
as you often do. That's something I don't often have access
to nowadays - or the liberty to tell about :)

Nice to hear about the lack of degradation in this scenario of yours;
it was one proposed a few years back on the Sun Forums,
I believe. Perhaps the problems come if you similarly expand
raidz-based arrays by adding TLVDEVs, or with OpenSolaris's
experimental features?.. I don't know, really :)

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of MLR
 
  Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD
zpool
 setup shouldn't we be reading at least at the 500MB/s read/write range?
 Why
 would we want a ~500MB/s cache?

You don't add l2arc because you care about MB/sec.  You add it because you
care about IOPS (read).

Similarly, you don't add dedicated log device for MB/sec.  You add it for
IOPS (sync write).

Any pool - raidz, raidz2, mirror - will give you optimum *sequential*
throughput.  All the performance enhancements are for random IO.  Mirrors
outperform raidzN, but in either case, you get improvements by adding log and
cache.
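
For example (placeholder device names), both are added after the fact:

    zpool add tank log mirror c2t0d0 c2t1d0   # dedicated slog for sync-write IOPS
    zpool add tank cache c2t2d0               # L2ARC device for random-read IOPS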


 Am I correct in
 thinking this means, for example, I have a single 14 disk raidz2 vdev
zpool,

It's not advisable to put more than ~8 disks in a single vdev, because it
really hurts during resilver time.  Maybe a week or two to resilver like
that.


 the
 disks will go ~100MB/s each , this zpool would theoretically read/write at

No matter which configuration you choose, you can expect optimum throughput
from all drives in sequential operations.  Random IO is a different story.


 What would be the best setup? I'm thinking one of the following:
 a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)?
 (~200MB/s reading, best reliability)

No.  12 in a single vdev is too much.


 b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)?
(~400MB/s
 reading, evens out size across vdevs)

Not bad, but different size vdev's will perform differently (8 disks vs 4)
so...  See below.


 c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)?
(~500MB/s
 reading, maximize vdevs for performance)

This would be your optimal configuration.
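
If that all ends up in one pool, the creation would look roughly like this
(placeholder device names; whether the 1.5TB and 3TB vdevs belong in one
pool or two is a separate argument):

    zpool create tank \
      raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 \
      raidz c0t4d0 c0t5d0 c0t6d0 c0t7d0 \
      raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
      raidz c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
      raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0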

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread Paul Kraus
On Tue, Mar 20, 2012 at 11:16 PM, MLR maillistread...@gmail.com wrote:

  1. Cache device for L2ARC
     Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool
 setup shouldn't we be reading at least at the 500MB/s read/write range? Why
 would we want a ~500MB/s cache?

Without knowing the I/O pattern, saying 500 MB/sec. is
meaningless. Achieving 500MB/sec. with 8KB files and lots of random
accesses is really hard, even with 20 HDDs. Achieving 500MB/sec. of
sequential streaming of 100MB+ files is much easier. An SSD will be as
fast on random I/O as on sequential (compared to an HDD). An SSD will
be as fast on small I/O as large (once again, compared to an HDD). Due
to its COW design, once a file is _changed_, ZFS no longer accesses
it strictly sequentially. If the files are written once and never
changed, then they _may_ be sequential on disk.

An important point to remember about the ARC  / L2ARC is that it
(they ?) are ADAPTIVE. The amount of space used by the ARC will grow
as ZFS reads data and shrinks as other processes need memory. I also
suspect that data eventually ages out of the ARC. The L2ARC is
(mostly) just an extension of the ARC, except that it does not have to
give up capacity as other processes need more memory.
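
(On Solaris you can watch the ARC grow and shrink via kstat - for example:

    kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max

size is the current ARC footprint, c the current target, c_max the ceiling.)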

  2. ZFS dynamically strips along the top-most vdev's and that performance 
 for 1
 vdev is equivalent to performance of one drive in that group. Am I correct in
 thinking this means, for example, I have a single 14 disk raidz2 vdev zpool, 
 the
 disks will go ~100MB/s each ,

   Assuming the disks will do 100MB/sec. for your data :-)

 this zpool would theoretically read/write at
 ~100MB/s max (how about real world average?)?

Yes. In a RAIDzn when a write is dispatched to the vdev _all_
the drives must complete the write before the write is complete. All
the drives in the vdev are written to in parallel. This is (or should
be) the case for _any_ RAID scheme, including RAID1 (mirroring). If a
zpool has more than one vdev, then writes are distributed among the
vdevs based on a number of factors (which others are _much_ more
qualified to discuss).

For ZFS, performance is proportional to the number of vdevs NOT
the number of drives or the number of drives per vdev. See
https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc
for some testing I did a while back. I did not test sequential read as
that is not part of our workload.

 If this was RAID6 I think this
 would go theoretically ~1.4GB/s, but in real life I am thinking ~1GB/s (aka 
 10x-
 14x faster than zfs, and both provide the same amount of redundancy)? Is my
 thinking off in the RAID6 or RAIDZ2 numbers? Why doesn't ZFS try to 
 dynamically
 strip inside vdevs (and if it is, is there an easy to understand explanation 
 why
 a vdev doesn't read from multiple drives at once when requesting data, or why 
 a
 zpool wouldn't make N number of requests to a vdev with N being the number of
 disks in that vdev)?

I understand why the read performance scales with the number of
vdevs, but I have never really understood _why_ it does not also scale
with the number of drives in each vdev. When I did my testing with 40
drives, I expected similar READ performance regardless of the layout,
but that was NOT the case.

 Since performance for 1 vdev is equivalent to performance of one drive in 
 that
 group it seems like the higher raidzN are not very useful. If your using 
 raidzN
 your probably looking for a lower than mirroring parity (aka 10%-33%), but if
 you try to use raidz3 with 15% parity your putting 20 HDDs in 1 vdev which is
 terrible (almost unimaginable) if your running at 1/20 the ideal 
 performance.

The recommendation is to not go over 8 or so drives per vdev, but
that is a performance issue NOT a reliability one. I have also not
been able to duplicate others observations that 2^N drives per vdev is
a magic number (4, 8, 16, etc). As you can see from the above, even a
40 drive vdev works and is reliable, just (relatively) slow :-)

 Main Question:
  3. I am updating my old RAID5 and want to reuse my old drives. I have 8 1.5TB
 drives and buying new 3TB drives to fill up the rest of a 20 disk enclosure
 (Norco RPC-4220); there is also 1 spare, plus the bootdrive so 22 total. I 
 want
 around 20%-25% parity. My system is like so:

Is the enclosure just a JBOD? If it is not, can it present drives
directly? If you cannot get at the drives individually, then the rest
of the discussion is largely moot.

You are buying 3TB drives, by definition you are NOT looking for
performance or reliability but capacity. What is the uncorrectable
error rate on these 3TB drives? What is the real random I/Ops
capability of these 3TB drives? I am not trying to be mean here, but I
would hate to see you put a ton of effort into this and then be
disappointed with the result due to a poor choice of hardware.

 Main Application: Home NAS
 * Like to optimize max space 

Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread Jim Klimov

2012-03-21 17:28, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of MLR

...

Am I correct in thinking this means, for example, I have a single

 14 disk raidz2 vdev zpool,


It's not advisable to put more than ~8 disks in a single vdev, because it
really hurts during resilver time.  Maybe a week or two to resilver like
that.


Yes, that's important to note also. While ZFS marketing initially
stressed that unlike traditional RAID systems, a rebuild of ZFS
onto a spare/replacement disk only needs to copy referenced data
and not the whole disk, it somehow fell out of the picture that such
a rebuild is a lot of random IO - because the data block tree must
be read in as a tree walk, often with emphasis on block age (its
birth TXG number). If your pool is reasonably full (and who runs
it empty?) then this is indeed lots of random IO, and a blind
full-disk copy would have gone orders of magnitude faster.
The fewer disks participate in this thrashing, the faster it will
go (less data needed overall to reconstruct a disk's worth of
sectors from redundancy data).

That's the way I understand the problem, anyway...

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread Paul Kraus
On Wed, Mar 21, 2012 at 9:51 AM, Jim Klimov jimkli...@cos.ru wrote:
 2012-03-21 17:28, Edward Ned Harvey wrote:

 It's not advisable to put more than ~8 disks in a single vdev, because it
 really hurts during resilver time.  Maybe a week or two to resilver like
 that.

 Yes, that's important to note also. While ZFS marketing initially
 stressed that unlike traditional RAID systems, a rebuild of ZFS
 onto a spare/replacement disk only needs to copy referenced data
 and not the whole disk, it somehow fell off the picture that such
 rebuild is a lot of random IO - because the data block tree must
 be read in as a tree walk, often with emphasis on block age (its
 birth TXG number). If your pool is reasonably full (and who runs
 it empty?) then this is indeed lots of random IO, and a blind
 full-disk copy would have gone orders of magnitude faster.
 The less disk participate in this thrashing - the faster it will
 go (less data needed overall to reconstruct a disk's worth of
 sectors from redundancy data).

 There are two different cases here... resilver to reconstruct
data from a failed drive and a scrub to pro-actively find bad sectors.

 The best case situation for the first case (bad drive
replacement) is a mirrored drive in my experience. In that case only
the data involved in the failure needs to be read and written. I am
unclear how much of the data is read in the case of a failure of a
drive in a RAIDzn vdev _from_other_vdevs_. I have seen disk activity
on non-failure related vdevs during a drive replacement, which is why
I am unsure in this case.

In the case of a scrub, _all_ of the data in the zpool is read
and the checksums checked. My 22 vdev zpool takes about 300 hours for
this while the 2 vdev zpool takes over 600 hours. Both have comparable
amounts of data and snapshots. The 22 vdev zpool is on a production
server with normal I/O activity, the 2 vdev case is only receiving zfs
snapshots and doing no other I/O.
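
(For reference, that is just a periodic

    zpool scrub poolname

and then watching the percent-done estimate in zpool status until it
completes.)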

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread Marion Hakanson
p...@kraus-haus.org said:
 Without knowing the I/O pattern, saying 500 MB/sec. is meaningless.
 Achieving 500MB/sec. with 8KB files and lots of random accesses is really
 hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of
 100MB+ files is much easier.
 . . .
 For ZFS, performance is proportional to the number of vdevs NOT the
 number of drives or the number of drives per vdev. See
 https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc
 for some testing I did a while back. I did not test sequential read as
 that is not part of our workload.
 . . .
 I understand why the read performance scales with the number of vdevs,
 but I have never really understood _why_ it does not also scale with the
 number of drives in each vdev. When I did my testing with 40 drives, I
 expected similar READ performance regardless of the layout, but that was NOT
 the case. 

In your first paragraph you make the important point that "performance"
is too ambiguous in this discussion.  Yet in the 2nd and 3rd paragraphs above,
you go back to using "performance" in its ambiguous form.  I assume that
by "performance" you are mostly focussing on random-read performance.

My experience is that sequential read performance _does_ scale with the number
of drives in each vdev.  Both sequential and random write performance also
scales in this manner (note that ZFS tends to save up small, random writes
and flush them out in a sequential batch).

Small, random read performance does not scale with the number of drives in each
raidz[123] vdev because of the dynamic striping.  In order to read a single
logical block, ZFS has to read all the segments of that logical block, which
have been spread out across multiple drives, in order to validate the checksum
before returning that logical block to the application.  This is why a single
vdev's random-read performance is equivalent to the random-read performance of
a single drive.


p...@kraus-haus.org said:
 The recommendation is to not go over 8 or so drives per vdev, but that is
 a performance issue NOT a reliability one. I have also not been able to
 duplicate others observations that 2^N drives per vdev is a magic number (4,
 8, 16, etc). As you can see from the above, even a 40 drive vdev works and is
 reliable, just (relatively) slow :-) 

Again, the performance issue you describe above is for the random-read
case, not sequential.  If you rarely experience small-random-read workloads,
then raidz* will perform just fine.  We often see 2000 MBytes/sec sequential
read (and write) performance on a raidz3 pool consisting of 3, 12-disk vdev's
(using 2TB drives).

However, when a disk fails and must be resilvered, that's when you will
run into the slow performance of the small, random read workload.  This
is why I use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives,
especially of the 1TB+ size.  That way if it takes 200 hours to resilver,
you've still got a lot of redundancy in place.

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread Jim Klimov

2012-03-21 21:40, Marion Hakanson wrote:

Small, random read performance does not scale with the number of drives in each
raidz[123] vdev because of the dynamic striping.  In order to read a single
logical block, ZFS has to read all the segments of that logical block, which
have been spread out across multiple drives, in order to validate the checksum
before returning that logical block to the application.  This is why a single
vdev's random-read performance is equivalent to the random-read performance of
a single drive.


True, but if the stars align so nicely that all the sectors
related to the block are read simultaneously in parallel
from several drives of the top-level vdev, so there is no
(substantial) *latency* incurred by waiting between the first
and last drives to complete the read request, then the
*aggregate bandwidth* of the array is (should be) similar
to performance (bandwidth) of a stripe.

This gain would probably be hidden by caches and averages,
unless the stars align so nicely for many blocks in a row,
such as a sequential uninterrupted read of a file written
out sequentially - so that component drives would stream
it off the platter track by track in a row... Ah, what a
wonderful world that would be! ;)

Also, after the sector is read by the disk and passed to
the OS, it is supposedly cached until all sectors of the
block arrive into the cache and the checksum matches.
During this time the HDD is available to do other queued
mechanical tasks. I am not sure which cache that might be:
too early for ARC - no block yet, and the vdev-caches now
drop non-metadata sectors. Perhaps it is just a variable
buffer space in the instance of the reading routine which
tries to gather all pieces of the block together and pass
it to the reader (and into ARC)...

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread Richard Elling
comments below...

On Mar 21, 2012, at 10:40 AM, Marion Hakanson wrote:

 p...@kraus-haus.org said:
Without knowing the I/O pattern, saying 500 MB/sec. is meaningless.
 Achieving 500MB/sec. with 8KB files and lots of random accesses is really
 hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of
 100MB+ files is much easier.
 . . .
For ZFS, performance is proportional to the number of vdevs NOT the
 number of drives or the number of drives per vdev. See
 https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc
 for some testing I did a while back. I did not test sequential read as
 that is not part of our workload.

Actually, few people have sequential workloads. Many think they do, but I say
prove it with iopattern.
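
(iopattern is the DTraceToolkit script; run as root it summarizes, per
interval, how much of the actual disk I/O was random vs. sequential:

    ./iopattern 10

which makes it easy to check the claim before sizing for streaming reads.)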

 . . .
I understand why the read performance scales with the number of vdevs,
 but I have never really understood _why_ it does not also scale with the
 number of drives in each vdev. When I did my testing with 40 dribves, I
 expected similar READ performance regardless of the layout, but that was NOT
 the case. 
 
 In your first paragraph you make the important point that performance
 is too ambiguous in this discussion.  Yet in the 2nd and 3rd paragraphs above,
 you go back to using performance in its ambiguous form.  I assume that
 by performance you are mostly focussing on random-read performance
 
 My experience is that sequential read performance _does_ scale with the number
 of drives in each vdev.  Both sequential and random write performance also
 scales in this manner (note that ZFS tends to save up small, random writes
 and flush them out in a sequential batch).

Yes.

I wrote a small, random read performance model that considers the various 
caches.
It is described here:
http://info.nexenta.com/rs/nexenta/images/tech_brief_nexenta_performance.pdf

The spreadsheet shown in figure 3 is available for the asking (and it works on 
your
iphone or ipad :-)

 Small, random read performance does not scale with the number of drives in 
 each
 raidz[123] vdev because of the dynamic striping.  In order to read a single
 logical block, ZFS has to read all the segments of that logical block, which
 have been spread out across multiple drives, in order to validate the checksum
 before returning that logical block to the application.  This is why a single
 vdev's random-read performance is equivalent to the random-read performance of
 a single drive.

It is not as bad as that. The actual worst case number for a HDD with 
zfs_vdev_max_pending
of one is:
average IOPS * ((D+P) / D)
where,
D = number of data disks
P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
total disks per set = D + P
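
As a worked example of the formula: a 7-disk raidz2 has D = 5 and P = 2, so
drives that each deliver ~100 random-read IOPS bottom out around
100 * ((5+2)/5) = 140 IOPS for the whole vdev - better than a single drive,
but nowhere near 7 drives' worth.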

We did many studies that verified this. More recent studies show 
zfs_vdev_max_pending
has a huge impact on average latency of HDDs, which I also described in my talk 
at 
OpenStorage Summit last fall.
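
(For anyone who wants to experiment, the usual knobs on Solaris-era ZFS
are roughly as below - treat them as things to test, not to copy blindly:

    # inspect the current value on a live system
    echo zfs_vdev_max_pending/D | mdb -k
    # try a smaller per-device queue depth without rebooting
    echo zfs_vdev_max_pending/W0t4 | mdb -kw
    # or persistently, in /etc/system:
    set zfs:zfs_vdev_max_pending = 4
)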

 p...@kraus-haus.org said:
The recommendation is to not go over 8 or so drives per vdev, but that is
 a performance issue NOT a reliability one. I have also not been able to
 duplicate others observations that 2^N drives per vdev is a magic number (4,
 8, 16, etc). As you can see from the above, even a 40 drive vdev works and is
 reliable, just (relatively) slow :-) 

Paul, I have a considerable amount of data that refutes your findings. Can we 
agree
that YMMV and varies dramatically, depending on your workload?

 
 Again, the performance issue you describe above is for the random-read
 case, not sequential.  If you rarely experience small-random-read workloads,
 then raidz* will perform just fine.  We often see 2000 MBytes/sec sequential
 read (and write) performance on a raidz3 pool consisting of 3, 12-disk vdev's
 (using 2TB drives).

Yes, this is relatively easy to see. I've seen 6GBytes/sec for large configs, but
that begins to push the system boundaries in many ways.

 
 However, when a disk fails and must be resilvered, that's when you will
 run into the slow performance of the small, random read workload.  This
 is why I use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives,
 especially of the 1TB+ size.  That way if it takes 200 hours to resilver,
 you've still got a lot of redundancy in place.
 
 Regards,
 
 Marion
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
DTrace Conference, April 3, 2012, 
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation

2012-03-21 Thread maillist reader
Thank you all for the information, I believe it is much clearer to me.
Sequential reads should scale with the number of disks in the entire
zpool (regardless of the number of vdevs), and random reads will scale with
just the number of vdevs (i.e. the idea I had before only applies to random
reads), which I am much happier with. Everything on my system should be
mostly sequential, as editing should not occur much (i.e. no virtual-machine
type things); when things get changed it usually means deleting the old
file and adding the updated file.

I read, though, that ZFS does not have a defragmentation tool - is this
still the case? It would seem that with such a performance difference between
sequential reads and random reads for raidzNs, a defragmentation tool
would be very high on ZFS's TODO list ;).

Is the enclosure just a JBOD? If it is not, can it present drives
 
I assume your other hardware won't be a bottleneck?
(PCI buses, disk controllers, RAM access, etc.)

A little more information about my system: it is a JBOD. The disks
go to a SAS-2 expander (RES2SV240), and that has a single connection to a
Tyan motherboard which has an LSI SAS 2008 controller built in. The CPU
is an i3 with a DMI of 5 GT/s (DMI is new to me vs. FSB). RAM is server-grade
unbuffered ECC DDR3-1333 8GB sticks. It is a dedicated machine which will
do nothing but serve files over the network.

To my understanding the network or the disks themselves should be my bottleneck;
the SAS-2 connection between SAS expander and mobo should be 24Gbit/s or
3GB/s (1.5GB/s if SAS-1), and 5 GT/s should provide ~20 GB/s max bandwidth
for the 64-bit machine from what I read online. I don't think this affects
me, but I was also curious: does anyone know if the mobo and SAS expander
will still establish a SAS-2 connection (if they both support SAS-2) if the
backplanes only support SAS-1 / SATA 3Gb/s? I never looked up the backplane
part numbers in my Norco, but I think they support SATA 6Gb/s, so assume they
support SAS-2. But in essence the SAS expander to HDD links won't be over 3Gb/s per
port, so as long as the SAS expander and mobo establish their own SAS-2
connection regardless of what the SAS expander and HDDs do, then I don't even
have to think about it. 1.5GB/s (SAS-1) is still above my optimal max
anyway though.

In essence, if the drives can provide it (and the network interface is ignored) I
think the theoretical limitation is 3GB/s. I mentioned 1.25GB/s for the
10GigE as the max I am looking at, but I'd be happy with anywhere between
500MB/s-1GB/s for sequential reads of large files (I don't really care about
any type of writes, and hopefully random reads do not happen too often *will
test with iopattern*).

 What is the uncorrectable
 error rate on these 3TB drives? What is the real random I/Ops
 capability of these 3TB drives?

I'm unsure of these myself; all the other parts have arrived or are en
route, but I have not actually bought the HDDs yet so can still choose
almost anything. It will probably be the cheapest consumer drives I can get
though (probably Seagate Barracuda Green ST3000DM001s or Hitachi 5K3000
3TBs). The 1.5TB drives I have in my old system are pretty much the same thing.

How much space do you _need_, including reasonable growth?

My old system is 9.55TB and almost full, and I have about 3TB spread out
elsewhere. This was set up about 5 years ago. With the 20-disk enclosure I'm
thinking about 30TB usable space (but maybe only using 15 disks at first),
and hoping it'll last for another 5 years.

How did you measure this?

ATTO Benchmark is what I used on the local machine for the 500MB/s number.
For small block sizes (1kB-16kB) it is small (50MB/s-150MB/s); for the larger
256kB+ sizes it reads ~550MB/s. This is hardware RAID5 though. Over the 1Gbit
network Windows 7 always gets up to ~100MB/s when writing/reading from the
RAID5 share.

What OS? I have a 16 CPU Solaris 10 SPARC server with 16 GB of RAM

The new ZFS system OS will probably be OpenIndiana with the v28 zpool. I
have been looking at FreeNAS (FreeBSD) and am a little up in the air on which
to choose.


Thank you all for the information. I will very likely create two zpools
(one for 1.5TB drives, and one for 3TB drives). Initially I thought that down
the road, if the pool ever fills up (probably 5+ years out), I would start
swapping the 1.5TB drives with 3TB drives to let the small vdev expand
after all were replaced, but I didn't realize there could potentially be
performance problems via block-size differences between 1.5TB (~5-year-old)
drives and 3TB+ drives (~5 years in the future).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss