Re: [zfs-discuss] One LUN per RAID group

2011-02-15 Thread Torrey McMahon


On 2/14/2011 10:37 PM, Erik Trimble wrote:
That said, given that SAN NVRAM caches are true write caches (and not 
a ZIL-like thing), it should be relatively simple to swamp one with 
write requests (most SANs have little more than 1GB of cache), at 
which point, the SAN will be blocking on flushing its cache to disk. 


Actually, most array controllers now have tens if not hundreds of GB of 
cache. The 6780 has 32GB, and the DMX-4 has, if I remember correctly, 256GB. 
The latest HDS box is probably close to that, if not more.


Of course you still have to flush to disk, and the cache-flush algorithms 
of the boxes themselves come into play, but 1GB was a long time ago.



Re: [zfs-discuss] One LUN per RAID group

2011-02-15 Thread Erik Trimble

On 2/15/2011 1:37 PM, Torrey McMahon wrote:


On 2/14/2011 10:37 PM, Erik Trimble wrote:
That said, given that SAN NVRAM caches are true write caches (and not 
a ZIL-like thing), it should be relatively simple to swamp one with 
write requests (most SANs have little more than 1GB of cache), at 
which point, the SAN will be blocking on flushing its cache to disk. 


Actually, most array controllers now have tens if not hundreds of GB of 
cache. The 6780 has 32GB, and the DMX-4 has, if I remember correctly, 256GB. 
The latest HDS box is probably close to that, if not more.


Of course you still have to flush to disk, and the cache-flush 
algorithms of the boxes themselves come into play, but 1GB was a long 
time ago.




The STK2540 and the STK6140 have at most 1GB of cache; the STK6180 has 4GB.


The move to large caches is fairly recent; only large setups (i.e. big 
arrays with a dedicated SAN head) have had multi-GB NVRAM caches for any 
length of time.


In particular, pretty much all base arrays still have 4GB or less on the 
enclosure controller; only in the SAN heads do you find big multi-GB 
caches. And lots (I'm going to be brave and say the vast majority) of 
ZFS deployments use direct-attach arrays or internal storage rather 
than large SAN configs. Lots of places with older SAN heads are also 
going to have much smaller caches. Given the price tag of most large 
SANs, I'm thinking that there are still huge numbers of 5+ year-old SANs 
out there, and practically all of them have only a dozen GB or less of 
cache.


So, yes, big modern SAN configurations have lots of cache. But they're 
also the ones most likely to be hammered with huge amounts of I/O from 
multiple machines, which makes it relatively easy to blow through the 
cache capacity and slow I/O back down to disk speed.
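
A rough way to see this from the host side (just a sketch, assuming a 
Solaris initiator; exact numbers will vary by array) is to watch the 
per-LUN service times:

   # Extended device stats every 5 seconds. While writes land in the
   # array's NVRAM cache, asvc_t stays down in the few-millisecond range;
   # once the cache is saturated, asvc_t and %b climb toward raw-disk numbers.
   iostat -xn 5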


Once you get back down to raw disk speed, having multiple LUNs per RAID 
array is almost certainly going to perform worse than a single LUN, due 
to thrashing.  That is, it would certainly be better (i.e. faster) for 
an array to have to commit one 128k slab than four 32k slabs.



So, the original recommendation is interesting, but it needs the caveat 
that you'd really only use it if you can either limit the amount of 
sustained I/O you generate, or are using a very-large-cache disk setup.


I would think the idea might also apply (i.e. be useful) to something 
like the F5100 or similar RAM/flash arrays.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] One LUN per RAID group

2011-02-14 Thread Paul Kraus
On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills mi...@cc.umanitoba.ca wrote:

 I realize that it is possible to configure more than one LUN per RAID
 group on the storage device, but doesn't ZFS assume that each LUN
 represents an independent disk, and schedule I/O accordingly?  In that
 case, wouldn't ZFS I/O scheduling interfere with the I/O scheduling
 already done by the storage device?

 Is there any reason not to use one LUN per RAID group?

My empirical testing confirms both the claim that ZFS random
read I/O (at the very least) scales linearly with the NUMBER of vdevs
and NOT the number of spindles, and the recommendation (I believe
from an Oracle white paper on using ZFS for Oracle DBs) that
if you are using a hardware RAID device (with NVRAM write cache),
you should configure one LUN per spindle in the backend RAID set.

In other words, if you build one zpool with a single 10GB vdev and
another with two 5GB vdevs (both coming from the same array and RAID
set), you get almost exactly twice the random read performance
from the 2x5 zpool vs. the 1x10 zpool.
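
For concreteness, the two layouts being compared look something like
this (a sketch only; the device names are hypothetical LUNs exported
from the same array RAID set):

   # Layout 1: one vdev backed by a single 10GB LUN
   zpool create tank1x10 c3t0d0

   # Layout 2: two vdevs, each backed by a 5GB LUN from the same RAID set
   zpool create tank2x5 c3t1d0 c3t2d0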

Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot
spares), you get substantially better random read performance using 10
LUNs vs. 1 LUN. While inconvenient, this just reflects ZFS scaling with
the number of vdevs and not the number of spindles.

I suggest performing your own testing to ensure you have the
performance to handle your specific application load.
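
One simple way to compare pools under your own workload (a rough sketch,
not a benchmark methodology) is to drive your application's random-read
load against each pool and watch per-vdev activity (pool names from the
hypothetical layouts above):

   # Per-vdev IOPS and throughput, sampled every 5 seconds; the 2x5
   # layout should show reads spread across both vdevs
   zpool iostat -v tank1x10 5
   zpool iostat -v tank2x5 5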

Now, as to reliability: the hardware RAID array cannot detect
silent corruption of data the way the end-to-end ZFS checksum can.
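
If you want to exercise that detection, a scrub forces ZFS to read and
verify every allocated block against its checksum (the pool name below
is illustrative):

   zpool scrub tank
   # The CKSUM column counts blocks whose checksums did not verify
   zpool status -v tank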

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, RPI Players


Re: [zfs-discuss] One LUN per RAID group

2011-02-14 Thread Gary Mills
On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote:
 On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills mi...@cc.umanitoba.ca wrote:
 
  Is there any reason not to use one LUN per RAID group?
[...]
 In other words, if you build one zpool with a single 10GB vdev and
 another with two 5GB vdevs (both coming from the same array and RAID
 set), you get almost exactly twice the random read performance
 from the 2x5 zpool vs. the 1x10 zpool.

This finding is surprising to me.  How do you explain it?  Is it
simply that you get twice as many outstanding I/O requests with two
LUNs?  Is it limited by the default I/O queue depth in ZFS?  After
all, all of the I/O requests must be handled by the same RAID group
once they reach the storage device.
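
For what it's worth, the per-vdev queue depth in question is the
zfs_vdev_max_pending tunable on Solaris-derived ZFS of this vintage;
a sketch of checking and changing it (the value 10 is purely
illustrative):

   # Read the current per-vdev queue depth
   echo "zfs_vdev_max_pending/D" | mdb -k
   # Change it on the running kernel (note -w for write access)
   echo "zfs_vdev_max_pending/W0t10" | mdb -kw
   # Or make it persistent in /etc/system:
   #   set zfs:zfs_vdev_max_pending = 10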

 Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot
 spares), you get substantially better random read performance using 10
 LUNs vs. 1 LUN. While inconvenient, this just reflects ZFS scaling with
 the number of vdevs and not the number of spindles.

-- 
-Gary Mills--Unix Group--Computer and Network Services-


Re: [zfs-discuss] One LUN per RAID group

2011-02-14 Thread Erik Trimble

On 2/14/2011 3:52 PM, Gary Mills wrote:

On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote:

On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills mi...@cc.umanitoba.ca wrote:

Is there any reason not to use one LUN per RAID group?

[...]

 In other words, if you build one zpool with a single 10GB vdev and
another with two 5GB vdevs (both coming from the same array and RAID
set), you get almost exactly twice the random read performance
from the 2x5 zpool vs. the 1x10 zpool.

This finding is surprising to me.  How do you explain it?  Is it
simply that you get twice as many outstanding I/O requests with two
LUNs?  Is it limited by the default I/O queue depth in ZFS?  After
all, all of the I/O requests must be handled by the same RAID group
once they reach the storage device.


 Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot
spares), you get substantially better random read performance using 10
LUNs vs. 1 LUN. While inconvenient, this just reflects ZFS scaling with
the number of vdevs and not the number of spindles.


I'm going to go out on a limb here and say that you get the extra 
performance under one condition:  you don't overwhelm the NVRAM write 
cache on the SAN device head.


So long as the SAN's NVRAM cache can acknowledge the write immediately 
(i.e. it isn't full of pending commits to backing store), then, yes, 
having write commits coming from multiple ZFS vdevs will obviously give 
more performance than a single ZFS vdev.


That said, given that SAN NVRAM caches are true write caches (and not a 
ZIL-like thing), it should be relatively simple to swamp one with write 
requests (most SANs have little more than 1GB of cache), at which point, 
the SAN will be blocking on flushing its cache to disk.


So, if you can arrange your workload so that it stays below the maximum 
write throughput of the SAN's RAID array over a defined period, then, 
yes, go with the multiple-LUNs-per-array setup.  In particular, I would 
think this would be excellent for small-write, latency-sensitive 
applications, where the total amount of data written (over several 
seconds) isn't large, but where latency is critical.  For larger I/O 
requests (or for consistent, sustained I/O of more than small amounts), 
all bets are off as far as any possible advantage of multiple LUNs per 
array.
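
As a sketch of that multiple-LUNs-per-array layout (device names are 
hypothetical; four LUNs carved from a single RAID group on the array):

   # ZFS stripes across the four top-level vdevs, giving it four
   # independent I/O queues against the same back-end RAID group
   zpool create fastpool c4t0d0 c4t1d0 c4t2d0 c4t3d0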



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
