Re: [zfs-discuss] Performance problems with Thumper and 7TB ZFS pool using RAIDZ2

2009-10-26 Thread Marion Hakanson
opensolaris-zfs-disc...@mlists.thewrittenword.com said:
 Is it really pointless? Maybe they want the insurance RAIDZ2 provides. Given
 the choice between insurance and performance, I'll take insurance, though it
 depends on your use case. We're using 5-disk RAIDZ2 vdevs. 
 . . .
 Would love to hear other opinions on this. 

Hi again Albert,

On our Thumper, we use seven 6-disk raidz2 vdevs (750GB drives).  It seems a
good compromise between capacity, IOPS, and data protection.  Like you, we
are afraid of the possibility of a second disk failing during a resilver of
these large drives.  Our usage is a mix of disk-to-disk-to-tape backups,
archival, and multi-user (tens of users) NFS/SFTP service, in roughly that
order of load.  We have had no performance problems with this layout.
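
(For illustration only, a pool laid out this way might be created roughly as
follows; the pool and device names are hypothetical, and on a real Thumper
you would spread each vdev across the six controllers:

  zpool create tank \
    raidz2 c0t0d0 c1t0d0 c4t0d0 c5t0d0 c6t0d0 c7t0d0 \
    raidz2 c0t1d0 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 \
    raidz2 c0t2d0 c1t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0 \
    raidz2 c0t3d0 c1t3d0 c4t3d0 c5t3d0 c6t3d0 c7t3d0 \
    raidz2 c0t4d0 c1t4d0 c4t4d0 c5t4d0 c6t4d0 c7t4d0 \
    raidz2 c0t5d0 c1t5d0 c4t5d0 c5t5d0 c6t5d0 c7t5d0 \
    raidz2 c0t6d0 c1t6d0 c4t6d0 c5t6d0 c6t6d0 c7t6d0

leaving the remaining slots free for the boot disks and any spares.)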

Regards,

Marion




Re: [zfs-discuss] Performance problems with Thumper and 7TB ZFS pool using RAIDZ2

2009-10-25 Thread Jeff Savit

On 10/24/09 12:31 PM, Jim Mauro wrote:

Posting to zfs-discuss. There's no reason this needs to be
kept confidential.


okay.


5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
Seems pointless - they'd be much better off using mirrors,
which is a better choice for random IO...


Hmm, they're already giving up so much capacity as it is that they could just
as well give up some more and get better performance. Great idea!


--
Jeff Savit
Principal Field Technologist
Sun Microsystems, Inc.     Phone: 732-537-3451 (x63451)
2398 E Camelback Rd        Email: jeff.sa...@sun.com
Phoenix, AZ 85016          http://blogs.sun.com/jsavit/




Re: [zfs-discuss] Performance problems with Thumper and 7TB ZFS pool using RAIDZ2

2009-10-24 Thread Jim Mauro

Posting to zfs-discuss. There's no reason this needs to be
kept confidential.

5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
Seems pointless - they'd be much better off using mirrors,
which is a better choice for random IO...

Looking at this now...

/jim


Jeff Savit wrote:

Hi all,

I'm looking for suggestions for the following situation: I'm helping 
another SE with a customer whose Thumper has a large ZFS pool, mostly 
used as an NFS server, and who is disappointed with its performance. The 
storage is an intermediate holding place for data to be fed into a 
relational database, and the complaint is that the NFS side can't keep up 
with the data feeds written to it as flat files.


The ZFS pool has eight 5-disk RAIDZ2 groups, with 7.3TB used and 1.74TB 
available.  There is plenty of idle CPU as shown by vmstat and mpstat.  
iostat shows queued I/O, and I'm not happy about the total latencies: 
wsvc_t in excess of 75ms at times, with an average of ~60KB per read and 
only ~2.5KB per write.  The Evil Tuning Guide tells me that RAIDZ2 is 
happiest with long reads and writes, and that is not the use case here.
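
(For reference, those averages fall straight out of the iostat data below: 
roughly 1000 kr/s / 16 r/s = ~60KB per read, and 233 kw/s / 95 w/s = ~2.5KB 
per write, per disk.)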


I was surprised to see commands like tar, rm, and chown running 
locally on the NFS server, so it looks like they're doing file maintenance 
and pruning locally at the same time the pool is being accessed remotely. 
That makes sense to me given the short write lengths and the high 
ZFS ACL activity shown by DTrace. I wonder if there is a lot of sync 
I/O that would benefit from a separate intent log device (whether SSD 
or not), so I've asked them to look for fsync activity.
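
(One rough way to check, just a sketch rather than something they have run, 
is to count ZIL commits per process for a minute; zil_commit fires for 
fsync() as well as for synchronous NFS writes:

  dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); } tick-60s { exit(0); }'

A steady stream of commits from the local maintenance jobs or the feed 
writers would strengthen the case for a separate log device.)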


Data collected thus far is listed below. I've asked for verification 
of the Solaris 10 level (I believe it's S10u6) and ZFS recordsize.  
Any suggestions will be appreciated.


regards, Jeff

 stuff starts here 


zpool iostat -v gives figures like:

bash-3.00# zpool iostat -v
               capacity     operations    bandwidth
pool          used  avail   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
mdpool       7.32T  1.74T    290    455  1.57M  3.21M
  raidz2      937G   223G     36     56   201K   411K
    c0t0d0       -      -     18     40  1.13M   141K
    c1t0d0       -      -     18     40  1.12M   141K
    c4t0d0       -      -     18     40  1.13M   141K
    c6t0d0       -      -     18     40  1.13M   141K
    c7t0d0       -      -     18     40  1.13M   141K

---the other 7 raidz2 groups have almost identical numbers on their 
devices---


iostat -iDnxz looks like:

                    extended device statistics
    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0     0.0    0.0  0.0  0.0    0.0    0.1   0   0 c5t0d0
   15.8   95.9   996.9  233.1  4.3  1.3   38.2   12.0  20  37 c6t0d0
   16.1   95.6  1018.5  232.4  2.5  2.6   22.2   23.2  16  36 c7t0d0
   16.1   96.0  1012.5  232.8  2.8  2.9   24.5   26.1  19  38 c4t0d0
   16.0   93.1  1012.9  242.2  3.6  1.5   33.2   14.2  18  36 c5t1d0
   15.9   82.2  1000.5  235.0  1.9  1.6   19.2   16.0  12  31 c5t2d0
   16.6   95.6  1046.7  232.7  2.5  2.7   22.2   23.7  18  37 c0t0d0
   16.6   96.1  1042.4  232.8  4.7  0.6   42.0    5.2  19  38 c1t0d0
...snip...
   16.5   95.4  1027.2  263.0  5.9  0.4   53.0    3.6  26  40 c0t4d0
   16.6   95.4  1041.1  263.6  3.9  1.0   34.5    9.3  18  36 c1t4d0
   16.8   99.1  1060.6  248.6  7.2  0.7   62.0    6.0  32  45 c0t5d0
   16.5   99.6  1034.7  248.9  8.2  1.1   70.5    9.1  38  48 c1t5d0
   17.0   82.5  1072.9  219.8  4.8  0.5   48.4    4.7  21  38 c0t6d0


prstat  looks like:

bash-3.00# prstat
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
815 daemon 3192K 2560K sleep 60 -20 83:10:07 0.6% nfsd/24
27918 root 1092K 920K cpu2 37 4 0:01:37 0.2% rm/1
19142 root 248M 247M sleep 60 0 1:24:24 0.1% chown/1
28794 root 2552K 1304K sleep 59 0 0:00:00 0.1% tar/1
29957 root 1192K 908K sleep 59 0 0:57:30 0.1% find/1
14737 root 7620K 1964K sleep 59 0 0:03:56 0.0% sshd/1
...


prstat -Lm looks like:

bash-3.00# prstat -Lm
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
27918 root 0.0 0.9 0.0 0.0 0.0 0.0 99 0.0 194 7 2K 0 rm/1
28794 root 0.1 0.6 0.0 0.0 0.0 0.0 99 0.0 209 10 909 0 tar/1
19142 root 0.0 0.6 0.0 0.0 0.0 0.0 99 0.0 224 3 1K 0 chown/1
29957 root 0.0 0.4 0.0 0.0 0.0 0.0 100 0.0 213 6 420 0 find/1
815 daemon 0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 197 0 0 0 nfsd/28230
815 daemon 0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 191 0 0 0 nfsd/28222
815 daemon 0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 185 0 0 0 nfsd/28211
---many more nfsd lines of similar appearance---


A small DTrace script for ZFS gives me:

# dtrace -n 'fbt::zfs*:entry{@[pid,execname,probefunc] = count()} END 
{trunc(@,20); printa(@)}'

^C
...some lines trimmed...
28835 tar zfs_dirlook 67761
28835 tar zfs_lookup 67761
28835 tar zfs_zaccess 69166
28835 tar zfs_dirent_lock 71083
28835 tar zfs_dirent_unlock 71084
28835 tar zfs_zaccess_common
28835 tar zfs_acl_node_read 77251
28835 tar zfs_acl_node_read_internal 77251
28835 tar zfs_acl_alloc 78656
28835 tar zfs_acl_free 78656
27918 rm zfs_acl_alloc 85888

Re: [zfs-discuss] Performance problems with Thumper and 7TB ZFS pool using RAIDZ2

2009-10-24 Thread Albert Chin
On Sat, Oct 24, 2009 at 03:31:25PM -0400, Jim Mauro wrote:
 Posting to zfs-discuss. There's no reason this needs to be
 kept confidential.

 5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
 Seems pointless - they'd be much better off using mirrors,
 which is a better choice for random IO...

Is it really pointless? Maybe they want the insurance RAIDZ2 provides.
Given the choice between insurance and performance, I'll take insurance,
though it depends on your use case. We're using 5-disk RAIDZ2 vdevs.
While I want the performance a mirrored vdev would give, it scares me
that you're just one drive away from a failed pool. Of course, you could
have two mirrors in each vdev, but I don't want to sacrifice that much
space. However, over the last two years we haven't had any demonstrable
failures that would give us cause for concern. But it's still unsettling.
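
(To put rough numbers on the space trade-off, ignoring metadata overhead:

  5-disk raidz2 vdevs : 3 of every 5 disks usable  (60% of raw)
  2-way mirrors       : 1 of every 2 disks usable  (50% of raw)
  3-way mirrors       : 1 of every 3 disks usable  (33% of raw)

and a 3-way mirror is what it would take to match raidz2's ability to
survive two failures within a single vdev.)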

Would love to hear other opinions on this.

 Looking at this now...

 /jim


 Jeff Savit wrote:
 Hi all,

 I'm looking for suggestions for the following situation: I'm helping
 another SE with a customer whose Thumper has a large ZFS pool, mostly
 used as an NFS server, and who is disappointed with its performance. The
 storage is an intermediate holding place for data to be fed into a
 relational database, and the complaint is that the NFS side can't keep up
 with the data feeds written to it as flat files.

 The ZFS pool has eight 5-disk RAIDZ2 groups, with 7.3TB used and 1.74TB
 available.  There is plenty of idle CPU as shown by vmstat and mpstat.
 iostat shows queued I/O, and I'm not happy about the total latencies:
 wsvc_t in excess of 75ms at times, with an average of ~60KB per read and
 only ~2.5KB per write.  The Evil Tuning Guide tells me that RAIDZ2 is
 happiest with long reads and writes, and that is not the use case here.

 I was surprised to see commands like tar, rm, and chown running
 locally on the NFS server, so it looks like they're doing file maintenance
 and pruning locally at the same time the pool is being accessed remotely.
 That makes sense to me given the short write lengths and the high
 ZFS ACL activity shown by DTrace. I wonder if there is a lot of sync
 I/O that would benefit from a separate intent log device (whether SSD
 or not), so I've asked them to look for fsync activity.

 Data collected thus far is listed below. I've asked for verification  
 of the Solaris 10 level (I believe it's S10u6) and ZFS recordsize.   
 Any suggestions will be appreciated.

 regards, Jeff

-- 
albert chin (ch...@thewrittenword.com)


Re: [zfs-discuss] Performance problems with Thumper and 7TB ZFS pool using RAIDZ2

2009-10-24 Thread Bob Friesenhahn

On Sat, 24 Oct 2009, Albert Chin wrote:


5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
Seems pointless - they'd be much better off using mirrors,
which is a better choice for random IO...


Is it really pointless? Maybe they want the insurance RAIDZ2 
provides. Given the choice between insurance and performance, I'll 
take insurance, though it depends on your use case. We're using 
5-disk RAIDZ2 vdevs. While I want the performance a mirrored vdev 
would give, it scares me that you're just one drive away from a 
failed pool. Of course, you could have two mirrors in each vdev, but 
I don't want to sacrifice that much space. However, over the last 
two years we haven't had any demonstrable failures that would 
give us cause for concern. But it's still unsettling.


I am using two-way mirrors here, even though if a drive fails the pool 
is just one drive away from failure.  I do feel that it is safer than 
raidz1 because resilvering is much less complex, so there is less to go 
wrong, and the resilver time should be the best possible.


For heavy multi-user use (like this Sun customer has) it is impossible 
to beat a mirrored configuration for performance.  If the I/O load 
is heavy and the storage is an intermediate holding place for data, 
then it makes sense to use mirrors.  If it were for long-term archival 
storage, then raidz2 would make more sense.
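
(A sketch of what that might look like on hardware like this, with 
hypothetical device names and each pair spanning two controllers:

  zpool create tank \
    mirror c0t0d0 c1t0d0 \
    mirror c4t0d0 c6t0d0 \
    mirror c7t0d0 c0t1d0

and so on in pairs across the remaining data drives; half the raw capacity, 
but far more IOPS for random, multi-user loads.)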



iostat shows queued I/O and I'm not happy about the total latencies -
wsvc_t in excess of 75ms at times.  Average of ~60KB per read and only
~2.5KB per write. Evil Tuning guide tells me that RAIDZ2 is happiest
for long reads and writes, and this is not the use case here.


~2.5KB per write is definitely problematic.  NFS writes are usually 
synchronous, so this is using up the available IOPS, and consuming them 
at a 5X elevated rate with a 5-disk raidz2.  It seems that an SSD for 
the intent log would help quite a lot in this situation so that zfs 
can aggregate the writes.  If the typical writes are small, it would 
also help to reduce the filesystem recordsize to 8K.
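
(The changes themselves are small; something like the following, where the 
log device is a placeholder for whatever SSD they end up with, and the 
recordsize is set on whichever filesystem takes the small writes:

  zpool add mdpool log <ssd-device>
  zfs set recordsize=8k mdpool

One caveat, if I remember correctly: at that Solaris 10 level a log device 
cannot later be removed from the pool, so it is worth being sure about the 
SSD before adding it.  Also, the recordsize change only affects files 
written after the change.)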


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/