Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)

2011-02-14 Thread Paul Kraus
On Mon, Feb 7, 2011 at 7:53 PM, Richard Elling richard.ell...@gmail.com wrote:
 On Feb 7, 2011, at 1:07 PM, Peter Jeremy wrote:

 On 2011-Feb-07 14:22:51 +0800, Matthew Angelo bang...@gmail.com wrote:
 I'm actually leaning more towards running a simple 7+1 RAIDZ1.
 Running this with 1TB is not a problem but I just wanted to
 investigate at what TB size the scales would tip.

 It's not that simple.  Whilst resilver time is proportional to device
 size, it's far more impacted by the degree of fragmentation of the
 pool.  And there's no 'tipping point' - it's a gradual slope so it's
 really up to you to decide where you want to sit on the probability
 curve.

 The tipping point won't occur between similar configurations; it
 occurs between different configurations. In particular, if the size of the
 N+M parity scheme is very large and the resilver times become
 very, very large (weeks) then a (M-1)-way mirror scheme can provide
 better performance and dependability. But I consider these to be
 extreme cases.

Empirically it seems that resilver time is related to the number of
objects as much as (if not more than) the amount of data. zpools (mirrors)
with similar amounts of data but radically different numbers of
objects take very different amounts of time to resilver. I have NOT
(yet) started actually measuring and tracking this, but the above is
based on casual observation.

P.S. I am measuring the number of objects via `zdb -d`, as that is faster
than trying to count files and directories, and I expect it is a much
better measure of what the underlying ZFS code is dealing with (a
particular dataset may have lots of snapshot data that does not
(easily) show up).
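
For example, a rough way to total the per-dataset object counts (a sketch
only -- it assumes zdb -d's usual summary lines of the form
"Dataset <name> ..., N objects", and the pool name "tank" is just an example):

   # zdb -d tank | awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^objects,?$/) total += $(i - 1) } END { print total " objects" }'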

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, RPI Players


Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)

2011-02-14 Thread Nico Williams
On Feb 14, 2011 6:56 AM, Paul Kraus p...@kraus-haus.org wrote:
 P.S. I am measuring number of objects via `zdb -d` as that is faster
 than trying to count files and directories and I expect is a much
 better measure of what the underlying zfs code is dealing with (a
 particular dataset may have lots of snapshot data that does not
 (easily) show up).

It's faster because: (a) no atime updates, and (b) no ZPL overhead.

Nico


[zfs-discuss] existing performance data for on-disk dedup?

2011-02-14 Thread Janice Chang


Hello.  I am looking to see if performance data exists for on-disk 
dedup.  I am currently in the process of setting up some tests based on 
input from Roch, but before I get started, thought I'd ask here.


Thanks for the help,
Janice


Re: [zfs-discuss] Very bad ZFS write performance. Ok Read.

2011-02-14 Thread ian W
Thanks for the responses. I found the issue: it was due to power management,
and probably a bug with event-driven power management states.

Changing

cpupm enable

to 

cpupm enable poll-mode

in /etc/power.conf fixed the issue for me. Back up to 110 MB/sec+ now.
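
For anyone wanting to try the same change, a minimal sketch of applying it
(assuming the stock Solaris tools and default paths):

   # vi /etc/power.conf          (change "cpupm enable" to "cpupm enable poll-mode")
   # pmconfig                    (re-reads /etc/power.conf and applies the new settings)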


Re: [zfs-discuss] how to destroy a pool by id?

2011-02-14 Thread chris
I have old pool skeletons with vdevs that no longer exist. Can't import them, 
can't destroy them, can't even rename them to something obvious like junk1. 
What do I do to clean up?


Re: [zfs-discuss] Very bad ZFS write performance. Ok Read.

2011-02-14 Thread Krunal Desai
On Sat, Feb 12, 2011 at 3:14 AM, ian W dropbears...@yahoo.com.au wrote:
 Thanks for the responses.. I found the issue. It was due to power management, 
 and a probably bug with event driven power management states,

 changing

 cpupm enable

 to

 cpupm enable poll-mode

 in /etc/power.conf fixed the issue for me. back up to 110MB/sec+ now..

Interesting - I have an E6600 also, and I will give this a try. I left
'cpupm enable' in /etc/power.conf because powertop/prtdiag properly
reported all the available P/C-states of my CPU, so I assumed that
power management was good to go. What do you have cpu-threshold set
to?

(This may be a moot point for me, because my CPU is littering fault
management with strings of L2 cache errors, so I might be upgrading to
Nehalem soon.)


Re: [zfs-discuss] existing performance data for on-disk dedup?

2011-02-14 Thread Jim Dunham
Hi Janice,

 Hello.  I am looking to see if performance data exists for on-disk dedup.  I 
 am currently in the process of setting up some tests based on input from 
 Roch, but before I get started, thought I'd ask here.

I find it somewhat interesting that you are asking this question on behalf of 
work you are doing for Roch, given that Roch posted the following blog entry, 
with references:

http://blogs.sun.com/roch/entry/dedup_performance_considerations1

That said, there is the ZFS page:

http://hub.opensolaris.org/bin/view/Community+Group+zfs/dedup

As for synthetic testing of dedup, I found that the latest version of 
VDBench supports dedup, and it is helpful for narrowing in on specific issues 
related to the size of the DDT, the ARC, and the L2ARC:

http://blogs.sun.com/henk/entry/first_beta_version_of_vdbench
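
A rough sketch of a vdbench parameter file for that kind of test (the
dedup-related parameter names, dedupratio and dedupunit, are from memory of
that beta and may differ; the LUN path and sizes are only illustrative):

   dedupratio=2
   dedupunit=128k
   sd=sd1,lun=/dev/rdsk/c1t1d0s0
   wd=wd1,sd=sd1,xfersize=8k,rdpct=0,seekpct=random
   rd=rd1,wd=wd1,iorate=max,elapsed=300,interval=5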

Jim


 
 Thanks for the help,
 Janice


Re: [zfs-discuss] how to destroy a pool by id?

2011-02-14 Thread Cindy Swearingen

Hi Chris,

Yes, this is a known problem and a CR is filed.

I haven't tried these in a while, but consider one of the following
workarounds below.

#1 is the most drastic, so make sure you've got the right device name: no
sanity checking is done by the dd command. Other experts can comment on a
better dd command.

Thanks,

Cindy

1. Wipe the disk label with a dd command. For example:

dd if=/dev/zero of=/dev/dsk/c1t0d0s0 count=100 bs=512k

(The following two probably won't work if you're getting device
in use messages.)

2. Force the creation of a new pool on the disk, like this:

# zpool create -f pool c1t0d0

Then, remove the new pool:

# zpool destroy pool

3. Put the opposite label on the disk. If the disk has an SMI
label, use format -e to force an EFI label. Or, vice versa.
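
A rough sketch of workaround #3 (interactive; the disk name and menu text
below are approximate and only illustrative):

   # format -e c1t0d0
   format> label
   [0] SMI Label
   [1] EFI Label
   Specify Label type[0]: 1

(Pick whichever label type the disk doesn't currently have, then label it
again with the type you actually want.)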


On 02/12/11 03:33, chris wrote:

I have old pool skeletons with vdevs that no longer exist. Can't import them, 
can't destroy them, can't even rename them to something obvious like junk1. 
What do I do to clean up?



[zfs-discuss] One LUN per RAID group

2011-02-14 Thread Gary Mills
With ZFS on a Solaris server using storage on a SAN device, is it
reasonable to configure the storage device to present one LUN for each
RAID group?  I'm assuming that the SAN and storage device are
sufficiently reliable that no additional redundancy is necessary on
the Solaris ZFS server.  I'm also assuming that all disk management is
done on the storage device.

I realize that it is possible to configure more than one LUN per RAID
group on the storage device, but doesn't ZFS assume that each LUN
represents an independent disk, and schedule I/O accordingly?  In that
case, wouldn't ZFS I/O scheduling interfere with I/O scheduling
already done by the storage device?

Is there any reason not to use one LUN per RAID group?

-- 
-Gary Mills--Unix Group--Computer and Network Services-


Re: [zfs-discuss] ZFS read/write fairness algorithm for single pool

2011-02-14 Thread Richard Elling
Hi Nathan,
comments below...

On Feb 13, 2011, at 8:28 PM, Nathan Kroenert wrote:

 On 14/02/2011 4:31 AM, Richard Elling wrote:
 On Feb 13, 2011, at 12:56 AM, Nathan Kroenert nat...@tuneunix.com wrote:
 
 Hi all,
 
 Exec summary: I have a situation where I'm seeing lots of large reads 
 starving writes from being able to get through to disk.
 
 snip
 What is the average service time of each disk? Multiply that by the average
 active queue depth. If that number is greater than, say, 100ms, then the ZFS
 I/O scheduler is not able to be very effective because the disks are too 
 slow.
 Reducing the active queue depth can help, see zfs_vdev_max_pending in the
 ZFS Evil Tuning Guide. Faster disks helps, too.
 
 NexentaStor fans, note that you can do this easily, on the fly, via the 
 Settings -> Preferences -> System web GUI.
   -- richard
 
 
 Hi Richard,
 
 Long time no speak! Anyhoo - See below.
 
 I'm unconvinced that faster disks would help. I think faster disks, at least 
 in what I'm observing, would make it suck just as bad, just reading faster... 
 ;) Maybe I'm missing something.

Faster disks always help :-)

 
 Queue depth is around 10 (default and unchanged since install), and average 
 service time is about 25ms... Below are 1 second samples with iostat - while 
 I have included only about 10 seconds, it's representative of what I'm seeing 
 all the time.
                      extended device statistics
  device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
  sd6        360.9   13.0  46190.5    351.4   0.0  10.0   26.7   1 100
  sd7        342.9   12.0  43887.3    329.9   0.0  10.0   28.1   1 100

ok, we'll take sd6 as an example (the math is easy :-) ...
actv = 10
svc_t = 26.7

actv * svc_t = 267 milliseconds

This is the queue at the disk. ZFS manages its own queue for the disk,
but once an I/O leaves ZFS, there is no way for ZFS to manage it. In the
case of the active queue, the I/Os have left the OS, so even the OS
is unable to change what is in the queue or directly influence when
the I/Os will be finished.

In ZFS, the queue has a priority scheduler and does place a higher
priority on async writes than on async reads (since b130 or so). But what
you can see is that the intermittent async writes get stuck behind those
267 milliseconds while the queue drains the reads. Put another way: if the
workload sends reads continuously and writes only occasionally, the reads
will appear to dominate. In older releases, where reads and writes had the
same priority, this looked even worse.
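
(For reference, zfs_vdev_max_pending can be inspected and changed on a live
system with mdb, along the lines of the Evil Tuning Guide; the value 4 below
is only an example:)

   # echo zfs_vdev_max_pending/D | mdb -k         (print the current value)
   # echo zfs_vdev_max_pending/W0t4 | mdb -kw     (set it to 4, decimal)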

 
                      extended device statistics
  device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
  sd6        422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
  sd7        422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
 
                      extended device statistics
  device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
  sd6        370.0   11.0  47360.4    342.0   0.0  10.0   26.2   1 100
  sd7        327.0   16.0  41856.4    632.0   0.0   9.6   28.0   1 100
 
                      extended device statistics
  device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
  sd6        388.0    7.0  49406.4    290.0   0.0   9.8   24.8   1 100
  sd7        409.0    1.0  52350.3      2.0   0.0   9.5   23.2   1  99
 
                      extended device statistics
  device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
  sd6        423.0    0.0  54148.6      0.0   0.0  10.0   23.6   1 100
  sd7        413.0    0.0  52868.5      0.0   0.0  10.0   24.2   1 100
 
                      extended device statistics
  device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
  sd6        400.0    2.0  51081.2      2.0   0.0  10.0   24.8   1 100
  sd7        384.0    4.0  49153.2      4.0   0.0  10.0   25.7   1 100
 
                      extended device statistics
  device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
  sd6        401.9    1.0  51448.9      8.0   0.0  10.0   24.8   1 100
  sd7        424.9    0.0  54392.4      0.0   0.0  10.0   23.5   1 100
 
                      extended device statistics
  device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
  sd6        215.1  208.1  26751.9  25433.5   0.0   9.3   22.1   1 100
  sd7        189.1  216.1  24199.1  26833.9   0.0   8.9   22.1   1  91
 
                      extended device statistics
  device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
  sd6        295.0  162.0  37756.8  20610.2   0.0  10.0   21.8   1 100
  sd7        307.0  150.0  39292.6  19198.4   0.0  10.0   21.8   1 100
 
                      extended device statistics
  device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
  sd6        405.0    2.0  51843.8      6.0   0.0  10.0   24.5   1 100
  sd7        408.0    3.0  52227.8     10.0   0.0  10.0   24.3   1 100
 
 Bottom line is that ZFS does not seem to care about getting my writes to 
 disk when there is a heavy read workload.
 
 I have also confirmed that it's not the RAID controller either - behaviour is 
 identical with direct-attach SATA.
 
 But - to your excellent theory: Setting zfs_vdev_max_pending to 1 causes 
 

Re: [zfs-discuss] One LUN per RAID group

2011-02-14 Thread Paul Kraus
On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills mi...@cc.umanitoba.ca wrote:

 I realize that it is possible to configure more than one LUN per RAID
 group on the storage device, but doesn't ZFS assume that each LUN
 represents an independant disk, and schedule I/O accordingly?  In that
 case, wouldn't ZFS I/O scheduling interfere with I/O scheduling
 already done by the storage device?

 Is there any reason not to use one LUN per RAID group?

My empirical testing confirms both the claim that ZFS random
read I/O (at the very least) scales linearly with the NUMBER of vdevs
and NOT the number of spindles, and the recommendation (I believe
from an Oracle white paper on using ZFS for Oracle databases) that
if you are using a hardware RAID device (with NVRAM write cache),
you should configure one LUN per spindle in the backend RAID set.

In other words, if you build one zpool with a single 10 GB vdev and
another with two 5 GB vdevs (both coming from the same array
and RAID set), you get almost exactly twice the random read performance
from the 2x5 zpool vs. the 1x10 zpool.
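
(For concreteness, a rough sketch of the two layouts being compared; the LUN
device names below are illustrative, not from my actual test rig:)

   # zpool create tank1 c4t0d0                (one 10 GB LUN from the RAID group)
   # zpool create tank2 c4t1d0 c4t2d0         (two 5 GB LUNs from the same group, striped)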

Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot
spares), you get substantially better random read performance using 10
LUNs vs. 1 LUN. While inconvenient, this just reflects the scaling of
ZFS with the number of vdevs rather than spindles.

I suggest performing your own testing to ensure you have the
performance to handle your specific application load.

Now, as to reliability, the hardware RAID array cannot detect
silent corruption of data the way the end-to-end ZFS checksum can.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, RPI Players


Re: [zfs-discuss] ACL for .zfs directory

2011-02-14 Thread Cindy Swearingen

Hi Ian,

You are correct.

Previous Solaris releases displayed older POSIX ACL info on this
directory. It was changed to the new ACL style with the integration of
this CR:

6792884 Vista clients cannot access .zfs

Thanks,

Cindy
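
(A quick way to check which ACL style a given release presents on that
directory -- the pool/file system path below is only an example:)

   # ls -dV /tank/home/.zfs/snapshot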

On 02/13/11 19:30, Ian Collins wrote:
 While scanning filesystems looking for who has read access to files, I 
see the ACL type of the .zfs/snapshot directory varies between releases 
(non-ZFS in Solaris 10, ZFS in Solaris 11 Express).


Is this documented anywhere?




Re: [zfs-discuss] ACL for .zfs directory

2011-02-14 Thread Ian Collins

 On 02/15/11 10:14 AM, Cindy Swearingen wrote:

Hi Ian,

You are correct.

Previous Solaris releases displayed older POSIX ACL info on this
directory. It was changed to the new ACL style from the integration of
this CR:

6792884 Vista clients cannot access .zfs

Thanks Cindy.  Unfortunately bugs.opensolaris.org appears to be FUBAR, 
so I couldn't look it up!


--
Ian.



[zfs-discuss] ZFS and Virtual Disks

2011-02-14 Thread Mark Creamer
Hi, I wanted to get some expert advice on this. I have an ordinary hardware
SAN from Promise Tech that presents the LUNs via iSCSI. I would like to use
that if possible with my VMware environment where I run several Solaris /
OpenSolaris virtual machines. My question is regarding the virtual disks.

1. Should I create individual iSCSI LUNs and present those to the VMware
ESXi host as iSCSI storage, and then create virtual disks from there on each
Solaris VM?

 - or -

2. Should I (assuming this is possible) let the Solaris VM mount the iSCSI
LUNs directly (that is, NOT show them as VMware storage, but let the VM
connect to the iSCSI target across the network)?

Part of the issue is that I have no idea whether having a hardware RAID 5 or 6
disk set will create a problem if I then create a bunch of virtual disks and
then use ZFS to create RAIDZ for the VM to use. Seems like that might be asking
for trouble.

This environment is completely available to mess with (no data at risk), so
I'm willing to try any option you guys would recommend.

Thanks!

-- 
Mark


Re: [zfs-discuss] ZFS and Virtual Disks

2011-02-14 Thread Fajar A. Nugraha
On Tue, Feb 15, 2011 at 5:47 AM, Mark Creamer white...@gmail.com wrote:
 Hi I wanted to get some expert advice on this. I have an ordinary hardware
 SAN from Promise Tech that presents the LUNs via iSCSI. I would like to use
 that if possible with my VMware environment where I run several Solaris /
 OpenSolaris virtual machines. My question is regarding the virtual disks.

 1. Should I create individual iSCSI LUNs and present those to the VMware
 ESXi host as iSCSI storage, and then create virtual disks from there on each
 Solaris VM?

  - or -

 2. Should I (assuming this is possible), let the Solaris VM mount the iSCSI
 LUNs directly (that is, NOT show them as VMware storage but let the VM
 connect to the iSCSI across the network.) ?

 Part of the issue is I have no idea if having a hardware RAID 5 or 6 disk
 set will create a problem if I then create a bunch of virtual disks and then
 use ZFS to create RAIDZ for the VM to use. Seems like that might be asking
 for trouble.

The ideal solution would be to present all disks directly as JBOD to
Solaris without any RAID/virtualization (either from the storage or from
VMware).

If you use (1), you've pretty much given up data integrity checking to the
lower layer (SAN + ESXi). In this case you'd probably be better off
simply using a stripe on the ZFS side (there's not much advantage to using
raidz if the block devices would reside on the same physical disks in
the SAN anyway).

If you use (2), you should have the option of exporting each raw disk
on the SAN as a LUN to Solaris, and you can create a mirror/raidz from
them. However, this setup is more complicated (e.g. you need to set up the
SAN in a specific way, which it may or may not be capable of), plus
there's a performance overhead from the VMware virtual network.

Personally I'd choose (1), and use ZFS simply for its
snapshot/clone/compression capabilities, not for its data integrity
checking.
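
(For what it's worth, a quick sketch of that usage -- the dataset names are
only examples:)

   # zfs set compression=on tank/vmdata
   # zfs snapshot tank/vmdata@pre-upgrade
   # zfs clone tank/vmdata@pre-upgrade tank/vmdata-test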

-- 
Fajar


Re: [zfs-discuss] ZFS and Virtual Disks

2011-02-14 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Mark Creamer
 
 1. Should I create individual iSCSI LUNs and present those to the VMware
 ESXi host as iSCSI storage, and then create virtual disks from there on
 each Solaris VM?
 
  - or -
 
 2. Should I (assuming this is possible), let the Solaris VM mount the
 iSCSI LUNs directly (that is, NOT show them as VMware storage but let the VM
 connect to the iSCSI across the network.) ?

If you do #1 you'll have a layer of vmware in between your guest machine and
the storage.  This will add a little overhead and possibly reduce
performance slightly.

If you do #2 you won't have access to snapshot features in vmware.
Personally I would recommend using #2 and rely on ZFS snapshots instead of
vmware snapshots.  But maybe you have a good reason for using vmware
snapshots... I don't want to make assumptions.


 Part of the issue is I have no idea if having a hardware RAID 5 or 6 disk
 set will create a problem if I then create a bunch of virtual disks and then
 use ZFS to create RAIDZ for the VM to use. Seems like that might be asking
 for trouble.

Where is there any hardware raid5 or raid6 in this system?  Whenever
possible, you want to allow ZFS to manage the raid: configure the
hardware to just pass-thru single-disk JBOD to the guest. When ZFS
detects disk errors, it can correct them if it has the redundancy.
But if there are disk problems on the hardware raid, the hardware raid
will never know about them, and they will never be correctable except by luck.
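
(A small sketch of what that detection and repair looks like in practice --
the pool name is illustrative:)

   # zpool scrub tank
   # zpool status -v tank       (shows per-device checksum error counts and whether the scrub repaired anything)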



Re: [zfs-discuss] One LUN per RAID group

2011-02-14 Thread Gary Mills
On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote:
 On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills mi...@cc.umanitoba.ca wrote:
 
  Is there any reason not to use one LUN per RAID group?
[...]
 In other words, if you build a zpool with one vdev of 10GB and
 another with two vdev's each of 5GB (both coming from the same array
 and raid set) you get almost exactly twice the random read performance
 from the 2x5 zpool vs. the 1x10 zpool.

This finding is surprising to me.  How do you explain it?  Is it
simply that you get twice as many outstanding I/O requests with two
LUNs?  Is it limited by the default I/O queue depth in ZFS?  After
all, all of the I/O requests must be handled by the same RAID group
once they reach the storage device.

 Also, using a 2540 disk array setup as a 10 disk RAID6 (with 2 hot
 spares), you get substantially better random read performance using 10
 LUNs vs. 1 LUN. While inconvenient, this just reflects the scaling of
 ZFS aith number of vdevs and not spindles.

-- 
-Gary Mills--Unix Group--Computer and Network Services-


Re: [zfs-discuss] ZFS read/write fairness algorithm for single pool

2011-02-14 Thread Nathan Kroenert

Thanks for all the thoughts, Richard.

One thing that still sticks in my craw is that I'm not wanting to write 
intermittently. I'm wanting to write flat out, and those writes are 
being held up... Seems to me that zfs should know and do something about 
that without me needing to tune zfs_vdev_max_pending...


Nonetheless, I'm now at a far more balanced point than when I started, 
so that's a good thing. :)


Cheers,

Nathan.

On 15/02/2011 6:44 AM, Richard Elling wrote:

[snip]

Re: [zfs-discuss] Very bad ZFS write performance. Ok Read.

2011-02-14 Thread ian W
Hello

my power.conf is as follows; any recommendations for improvement?


device-dependency-property removable-media /dev/fb
autopm enable
autoS3 enable
cpu-threshold 1s
# Auto-Shutdown Idle(min)   Start/Finish(hh:mm) Behavior
autoshutdown 30 0:00 0:00 noshutdown
S3-support enable
cpu_deep_idle enable
cpupm enable poll-mode


Re: [zfs-discuss] One LUN per RAID group

2011-02-14 Thread Erik Trimble

On 2/14/2011 3:52 PM, Gary Mills wrote:

On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote:

On Mon, Feb 14, 2011 at 2:38 PM, Gary Millsmi...@cc.umanitoba.ca  wrote:

Is there any reason not to use one LUN per RAID group?

[...]

 In other words, if you build a zpool with one vdev of 10GB and
another with two vdev's each of 5GB (both coming from the same array
and raid set) you get almost exactly twice the random read performance
from the 2x5 zpool vs. the 1x10 zpool.

This finding is surprising to me.  How do you explain it?  Is it
simply that you get twice as many outstanding I/O requests with two
LUNs?  Is it limited by the default I/O queue depth in ZFS?  After
all, all of the I/O requests must be handled by the same RAID group
once they reach the storage device.


 Also, using a 2540 disk array setup as a 10 disk RAID6 (with 2 hot
spares), you get substantially better random read performance using 10
LUNs vs. 1 LUN. While inconvenient, this just reflects the scaling of
ZFS with the number of vdevs and not spindles.


I'm going to go out on a limb here and say that you get the extra 
performance under one condition:  you don't overwhelm the NVRAM write 
cache on the SAN device head.


So long as the SAN's NVRAM cache can acknowledge the write immediately 
(i.e. it isn't full with pending commits to backing store), then, yes, 
having multiple write commits coming from different ZFS vdevs will 
obviously give more performance than a single ZFS vdev.


That said, given that SAN NVRAM caches are true write caches (and not a 
ZIL-like thing), it should be relatively simple to swamp one with write 
requests (most SANs have little more than 1GB of cache), at which point, 
the SAN will be blocking on flushing its cache to disk.


So, if you can arrange your workload so that it stays below the maximum 
write throughput of the SAN's raid array over a defined period, then, yes, go 
with the multiple LUN/array setup.  In particular, I would think this 
would be excellent for small-write, latency-sensitive applications, where 
the total amount of data written (over several seconds) isn't large, but 
where latency is critical.  For larger I/O requests (or for consistent, 
sustained I/O of more than small amounts), all bets are off as far as any 
possible advantage of multiple LUNs per array.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Very bad ZFS write performance. Ok Read.

2011-02-14 Thread Richard Elling
On Feb 14, 2011, at 4:49 PM, ian W wrote:

 Hello
 
 my power.conf is as follows; any recommendations for improvement?

For best performance, disable power management. For certain processors
and BIOSes, some combinations of power management (below the OS) are
also known to be toxic. At Nexenta, current best practice is to disable
C-states for Nehalems.
 -- richard
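
(A sketch of turning CPU power management off at the OS level, assuming the
stock power.conf mechanism; BIOS/C-state settings are vendor-specific:)

   In /etc/power.conf, replace the "cpupm enable ..." line with:

      cpupm disable

   then re-apply the configuration:

      # pmconfig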
