Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-28 Thread Aubrey Li
It looks like I was posting on the wrong mailing list. I thought
this mailing list included developers. The experiment I did is
not for a commercial purpose. The purpose of the comparison is to
find optimization opportunities in the entire software stack
on both Linux and Solaris.

As for this zfs root lock, a special re-entrant rwlock is
currently implemented here for ZFS only. The interesting thing is
that all of the HTTP and HTTPS requests in my benchmark perform
read ops on the Apache server, yet we still see lots of mutex spin
events for this rwlock in the lockstat report. I'll continue to
investigate whether this is a conflict of design philosophy. At the
least, this lock does not behave well for this kind of read-mostly,
write-rarely case.
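
A rough way to attribute that spin time to full kernel stacks is the
DTrace lockstat provider (just a sketch; the interval and stack depth
are arbitrary, and arg1 of adaptive-spin is the spin metric that
lockstat itself reports):

dtrace -n 'lockstat:::adaptive-spin { @[stack(8)] = sum(arg1); }
    tick-10sec { trunc(@, 5); printa(@); exit(0); }'

The heaviest spinners (rrw_enter_read/rrw_exit via zfs_root) should
then show up together with their full callers.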

If anyone is interested in this topic, I can send updates
offline.

Thanks,
-Aubrey



On Tue, Mar 27, 2012 at 9:42 PM, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.
laot...@gmail.com wrote:
 hi
 you did not answer the question: what is the RAM of the server? how many
 sockets and cores, etc.?
 what is the block size of zfs?
 what is the cache RAM of your SAN array?
 what is the block size/stripe size of the RAID in your SAN array? RAID 5 or
 what?
 what is your test program, and how is it driven (from what kind of client)?
 regards




 On 3/26/2012 11:13 PM, Aubrey Li wrote:

 On Tue, Mar 27, 2012 at 1:15 AM, Jim Klimov j...@cos.ru wrote:

 Well, as a further attempt down this road, is it possible for you to rule
 out
 ZFS from swapping - i.e. if RAM amounts permit, disable the swap at all
 (swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated slices of
 same or better yet separate disks?

 Thanks Jim for your suggestion!


 If you do have lots of swapping activity (that can be seen in vmstat 1
 as
 si/so columns) going on in a zvol, you're likely to get much
 fragmentation
 in the pool, and searching for contiguous stretches of space can become
 tricky (and time-consuming), or larger writes can get broken down into
 many smaller random writes and/or gang blocks, which is also slower.
 At least such waiting on disks can explain the overall large kernel
 times.

 I took swapping activity into account, even when the CPU% is 100%, si
 (swap-ins) and so (swap-outs) are always ZEROs.

 You can also see the disk wait times ratio in iostat -xzn 1 column %w
 and disk busy times ratio in %b (second and third from the right).
 I don't remember you posting that.

 If these are accounting in tens, or even close or equal to 100%, then
 your disks are the actual bottleneck. Speeding up that subsystem,
 including addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
 SSD/DDRDrive) and combatting fragmentation by moving swap and
 other scratch spaces to dedicated pools or raw slices might help.

 My storage system is not quite busy, and there are only read operations.
 =
 # iostat -xnz 3
                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   112.4    0.0 1691.5    0.0  0.0  0.5    0.0    4.8   0  41 c11t0d0
                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   118.7    0.0 1867.0    0.0  0.0  0.5    0.0    4.5   0  42 c11t0d0
                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   127.7    0.0 2121.6    0.0  0.0  0.6    0.0    4.7   0  44 c11t0d0
                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   141.3    0.0 2158.5    0.0  0.0  0.7    0.0    4.6   0  48 c11t0d0
 ==

 Thanks,
 -Aubrey
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


 --
 Hung-Sheng Tsao Ph D.
 Founder  Principal
 HopBit GridComputing LLC
 cell: 9734950840

 http://laotsao.blogspot.com/
 http://laotsao.wordpress.com/
 http://blogs.oracle.com/hstsao/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-27 Thread Phil Harman
One of the glories of Solaris is that it is so very observable. Then 
there are the many excellent blog posts, wiki entries, and books - some 
or which are authored by contributors to this very thread - explaining 
how Solaris works. But these virtues are also a snare to some, and it is 
not uncommon for even experienced practitioners to become fixated by 
this or that.


Aubrey, a 70:30 user to system ratio is pretty respectable. Running at 
100% is not so pretty (e.g. I would be surprised if you DIDN'T see many 
involuntary context switches (icsw) in such a scenario). Esteemed 
experts have assured you that ZFS lock contention is not your main 
concern. I would run with that.


You said it was a stress test. That raises many questions for me. I am 
much more interested in how systems perform in the real world. In my 
experience, many of the issues we find in production are not the ones we 
found in the lab. Indeed, I would argue that too many systems are tuned 
to run simplistic benchmarks instead of real workloads.


However, I don't think it's helpful to continue this discussion here. 
There are some esteemed experienced practitioners out there who would 
be only too happy to provide holistic systems performance tuning and 
health check services to you on a commercial basis (I hear that some may 
even accept PayPal).


Phil
(p...@harmanholistix.com)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-27 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.

hi
you did not answer the question: what is the RAM of the server? how many
sockets and cores, etc.?

what is the block size of zfs?
what is the cache RAM of your SAN array?
what is the block size/stripe size of the RAID in your SAN array? RAID 5 or
what?

what is your test program, and how is it driven (from what kind of client)?
regards
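
For reference, on Solaris 11 that information can be gathered with
something along these lines (export1 is the pool name from the zpool
list quoted elsewhere in this thread):

prtconf | grep 'Memory size'      # physical RAM
psrinfo -pv                       # sockets, cores, threads
zfs get recordsize,atime,primarycache export1
zpool status export1              # pool/vdev layout
echo ::memstat | mdb -k           # kernel vs. ARC vs. free memory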



On 3/26/2012 11:13 PM, Aubrey Li wrote:

On Tue, Mar 27, 2012 at 1:15 AM, Jim Klimov j...@cos.ru wrote:

Well, as a further attempt down this road, is it possible for you to rule
out
ZFS from swapping - i.e. if RAM amounts permit, disable the swap at all
(swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated slices of
same or better yet separate disks?


Thanks Jim for your suggestion!



If you do have lots of swapping activity (that can be seen in vmstat 1 as
si/so columns) going on in a zvol, you're likely to get much fragmentation
in the pool, and searching for contiguous stretches of space can become
tricky (and time-consuming), or larger writes can get broken down into
many smaller random writes and/or gang blocks, which is also slower.
At least such waiting on disks can explain the overall large kernel times.

I took swapping activity into account, even when the CPU% is 100%, si
(swap-ins) and so (swap-outs) are always ZEROs.


You can also see the disk wait times ratio in iostat -xzn 1 column %w
and disk busy times ratio in %b (second and third from the right).
I don't remember you posting that.

If these are accounting in tens, or even close or equal to 100%, then
your disks are the actual bottleneck. Speeding up that subsystem,
including addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
SSD/DDRDrive) and combatting fragmentation by moving swap and
other scratch spaces to dedicated pools or raw slices might help.

My storage system is not quite busy, and there are only read operations.
=
# iostat -xnz 3
                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   112.4    0.0 1691.5    0.0  0.0  0.5    0.0    4.8   0  41 c11t0d0
                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   118.7    0.0 1867.0    0.0  0.0  0.5    0.0    4.5   0  42 c11t0d0
                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   127.7    0.0 2121.6    0.0  0.0  0.6    0.0    4.7   0  44 c11t0d0
                     extended device statistics
     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   141.3    0.0 2158.5    0.0  0.0  0.7    0.0    4.6   0  48 c11t0d0
==

Thanks,
-Aubrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Hung-Sheng Tsao Ph D.
Founder  Principal
HopBit GridComputing LLC
cell: 9734950840

http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/

attachment: laotsao.vcf
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-26 Thread Aubrey Li
On Mon, Mar 26, 2012 at 4:33 PM,  casper@oracle.com wrote:

I'm migrating a webserver(apache+php) from RHEL to solaris. During the
stress testing comparison, I found under the same session number of client
request, CPU% is ~70% on RHEL while CPU% is full on solaris.

 Which version of Solaris is this?

This is Solaris 11.


The Apache document root is apparently in a zfs dataset. It looks like each
file system operation will run into zfs root mutex contention.

Is this an expected behavior? If so, is there any zfs tunable to
address this issue?

 The zfs_root() function is the function which is used when a
 mountpoint is transversed.  Where is the apache document root
 directory and under which mountpoints?

I have an external storage server and I created a zfs pool on it.

# zpool list
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
export1   800G   277G   523G  34%  1.00x  ONLINE  -
rpool     117G   109G  8.44G  92%  1.00x  ONLINE  -

This pool is mounted under /export; all of the Apache documents
are in this pool at /export/webserver/apache2/htdocs.
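
Note that because the document root lives on a dataset mounted under
/export, every path lookup that crosses that mount point goes through
traverse() -> fsop_root() -> zfs_root(), which is exactly where the
contended lock shows up in the stacks later in this thread. A quick
sketch (assuming the fbt provider is usable on this kernel) to see
which processes drive those mount-point traversals:

dtrace -n 'fbt::traverse:entry { @[execname] = count(); }
    tick-10sec { exit(0); }'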

The php temporary folder is set to /tmp, which is tmpfs.

Thanks,
-Aubrey


 Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-26 Thread Jim Klimov

2012-03-26 14:27, Aubrey Li wrote:

The php temporary folder is set to /tmp, which is tmpfs.



By the way, how much RAM does the box have available?
tmpfs in Solaris is backed by virtual memory.
It is like a RAM disk (although maybe slower than the ramdisk
FS seen in the livecd) as long as there is enough free
RAM, but tmpfs can then get swapped out to disk.

If your installation followed the default wizard, your
swap (or part of it) could be in rpool/swap and backed
by ZFS - leading to both many tmpfs accesses and many
ZFS accesses. (No idea about the mutex spinning part
though - if it is normal or not).

Also I'm not sure if tmpfs gets the benefits of caching,
and ZFS ARC cache can consume lots of RAM and thus push
tmpfs out to swap.
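
A few commands can confirm whether that is happening on this box
(a sketch; rpool/swap is the default zvol created by the installer):

swap -l                           # swap devices; look for /dev/zvol/dsk/rpool/swap
swap -s                           # allocated/reserved/available swap summary
df -h /tmp                        # how much tmpfs space is in use
zfs list rpool/swap               # size and usage of the swap zvol
kstat -p zfs:0:arcstats:size      # current ARC size in bytes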

As a random guess, try pointing PHP tmp directory to
/var/tmp (backed by zfs) and see if any behaviors change?

Good luck,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-26 Thread Jim Mauro

You care about #2 and #3 because you are fixated on a ZFS root
lock contention problem, and not open to a broader discussion
about what your real problem actually is. I am not saying there is
not lock contention, and I am not saying there is - I'll look at the
data carefully later when I have more time.

Your problem statement, which took 20 emails to glean, is that the 
Solaris system consumes more CPU than Linux on the same
hardware, doing roughly the same amount of work, and delivering
roughly the same level of performance - is that correct?

Please consider that, in Linux, you have no observability into
kernel lock statistics (at least, none that I know of) - Linux uses kernel
locks also, and for this workload, it seems likely to me that if you
could observe those statistics, you would see numbers that would 
lead you to conclude you have lock contention in Linux.

Let's talk about THE PROBLEM - Linux is 15% sys, 55% usr,
Solaris is 30% sys, 70% usr, running the same workload,
doing the same amount of work, delivering the same level
of performance. Please validate that problem statement.


On Mar 25, 2012, at 9:51 PM, Aubrey Li wrote:

 On Mon, Mar 26, 2012 at 4:18 AM, Jim Mauro james.ma...@oracle.com wrote:
 If you're chasing CPU utilization, specifically %sys (time in the kernel),
 I would start with a time-based kernel profile.
 
 #dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { 
 trunc(@, 20); printa(@); exit(0); }'
 
 I would be curious to see where the CPU cycles are being consumed first,
 before going down the lock path…
 
 This assume that most or all of CPU utilization is %sys. If it's %usr, we 
 take
 a different approach.
 
 
 Here is the output, I changed to tick-5sec and trunc(@, 5).
 No.2 and No.3 is what I care about.
 
 Thanks,
 -Aubrey
 
 21  80536   :tick-5sec
 == 1 =
  genunix`avl_walk+0x6a
  genunix`as_gap_aligned+0x2b7
  unix`map_addr_proc+0x179
  unix`map_addr+0x8e
  genunix`choose_addr+0x9e
  zfs`zfs_map+0x161
  genunix`fop_map+0xc5
  genunix`smmap_common+0x268
  genunix`smmaplf32+0xa2
  genunix`syscall_ap+0x92
  unix`_sys_sysenter_post_swapgs+0x149
 1427
 
 = 2 =
  unix`mutex_delay_default+0x7
  unix`mutex_vector_enter+0x2ae
  zfs`zfs_zget+0x46
  zfs`zfs_root+0x55
  genunix`fsop_root+0x2d
  genunix`traverse+0x65
  genunix`lookuppnvp+0x446
  genunix`lookuppnatcred+0x119
  genunix`lookupnameatcred+0x97
  genunix`lookupnameat+0x6b
  genunix`vn_openat+0x147
  genunix`copen+0x493
  genunix`openat64+0x2d
  unix`_sys_sysenter_post_swapgs+0x149
 2645
 
 == 3 =
  unix`mutex_delay_default+0x7
  unix`mutex_vector_enter+0x2ae
  zfs`zfs_zget+0x46
  zfs`zfs_root+0x55
  genunix`fsop_root+0x2d
  genunix`traverse+0x65
  genunix`lookuppnvp+0x446
  genunix`lookuppnatcred+0x119
  genunix`lookupnameatcred+0x97
  genunix`lookupnameat+0x6b
  genunix`cstatat_getvp+0x11e
  genunix`cstatat64_32+0x5d
  genunix`fstatat64_32+0x4c
  unix`_sys_sysenter_post_swapgs+0x149
 3201
 
  4 ===
  unix`i86_mwait+0xd
  unix`cpu_idle_mwait+0x154
  unix`idle+0x116
  unix`thread_start+0x8
 3559
 
 = 5 ==
  tmpfs`tmp_readdir+0x138
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 4589
 
 =  6 
  unix`strlen+0x3
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 5005
 
 === 7 =
  tmpfs`tmp_readdir+0xc7
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 9548
 
 
 = 8 ===
  unix`strlen+0x8
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
11166
 
 
 = 9 ===
  unix`strlen+0xe
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
14491
 
 =  10 ==
  tmpfs`tmp_readdir+0xbe
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  

Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-26 Thread Aubrey Li
On Mon, Mar 26, 2012 at 7:28 PM, Jim Klimov jimkli...@cos.ru wrote:
 2012-03-26 14:27, Aubrey Li wrote:

 The php temporary folder is set to /tmp, which is tmpfs.


 By the way, how much RAM does the box have available?
 tmpfs in Solaris is backed by virtual memory.
 It is like a RAM disk, although maybe slower than ramdisk
 FS as seen in livecd, as long as there is enough free
 RAM, but then tmpfs can get swapped out to disk.

 If your installation followed the default wizard, your
 swap (or part of it) could be in rpool/swap and backed
 by ZFS - leading to both many tmpfs accesses and many
 ZFS accesses. (No idea about the mutex spinning part
 though - if it is normal or not).

 Also I'm not sure if tmpfs gets the benefits of caching,
 and ZFS ARC cache can consume lots of RAM and thus push
 tmpfs out to swap.

 As a random guess, try pointing PHP tmp directory to
 /var/tmp (backed by zfs) and see if any behaviors change?

 Good luck,
 //Jim


Thanks for your suggestions. Actually the default PHP tmp directory
was /var/tmp, and I changed it to /tmp. This reduced the zfs
root lock contention significantly. However, I still see a bunch of lock
contention. So I'm here asking for help.

Thanks,
-Aubrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-26 Thread Aubrey Li
On Mon, Mar 26, 2012 at 8:24 PM, Jim Mauro james.ma...@oracle.com wrote:

 You care about #2 and #3 because you are fixated on a ZFS root
 lock contention problem, and not open to a broader discussion
 about what your real problem actually is. I am not saying there is
 not lock contention, and I am not saying there is - I'll look at the
 data later carefully later when I have more time.

 Your problem statement, which took 20 emails to glean, is the
 Solaris system consumes more CPU than Linux on the same
 hardware, doing roughly the same amount of work, and delivering
 roughly the same level of performance - is that correct?

 Please consider that, in Linux, you have no observability into
 kernel lock statistics (at least, known that I know of) - Linux uses kernel
 locks also, and for this workload, it seems likely to me that could
 you observe those statistics, you would see numbers that would
 lead you to conclude you have lock contention in Linux.

 Let's talk about THE PROBLEM - Linux is 15% sys, 55% usr,
 Solaris is 30% sys, 70% usr, running the same workload,
 doing the same amount of work. delivering the same level
 of performance. Please validate that problem statement.


You're definitely right.

I'm running the same workload, doing the same amount of work,
delivering the same level of performance on the same hardware
platform; even the root disk type is exactly the same.

So the userland software stack is exactly the same.
The difference is:
linux is 15% sys, 55% usr.
solaris is 30% sys, 70% usr.

Basically I agree with Fajar. This is probably not a fair comparison.
A robust system does not only deliver the highest performance; we should
also consider reliability, availability and serviceability, energy
efficiency, and other server-related features. No doubt ZFS is the
most excellent file system on the planet.

As Richard pointed out, if we look at mpstat output
==
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806  65  35   0   0  32

Including
smtx 256336: spins on mutex

I need to look at

icsw 30906:  involuntary context switches
migr 32919:  thread migration
syscl 967806: system calls

And given that Solaris consumes 70% CPU in userland, I probably need
to break down how much time is spent in apache, libphp, libc,
etc. (BTW, what's your approach for the %usr case you mentioned above?)
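
One way to get that breakdown with DTrace is to sample user PCs and
bucket them by module (a sketch; it assumes the Apache worker
processes are named httpd):

dtrace -n 'profile-997 /arg1 && execname == "httpd"/ { @[umod(arg1)] = count(); }
    tick-30sec { trunc(@, 10); exit(0); }'

umod() maps each sampled user PC to its module, so the counts come
out per library: httpd itself, libphp, libc and so on.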

I am not open to a broader discussion because this is a zfs mailing list.
The zfs root lock contention is what I have observed so far that I can post
on this forum. I can take care of other aspects and may ask for help elsewhere.

I admit I didn't dig into Linux, and I agree there could be lock contention
there as well. But since Solaris consumes more CPU% and hence more system
power than Linux, I think I have to look at Solaris first, to see if there
is any tuning work that needs to be done.

Do you agree this is the right way to go ahead?

Thanks,
-Aubrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-26 Thread Richard Elling
I see nothing unusual in the lockstat data. I think you're barking up
the wrong tree.
 -- richard

On Mar 25, 2012, at 10:51 PM, Aubrey Li wrote:

 On Mon, Mar 26, 2012 at 1:19 PM, Richard Elling
 richard.ell...@richardelling.com wrote:
 Apologies to the ZFSers, this thread really belongs elsewhere.
 
 Let me explain below:
 
 Root documentation path of apache is in zfs, you see
 it at No.3 at the above dtrace report.
 
 
 The sort is in reverse order. The large number you see below the
 stack trace is the number of times that stack was seen. By far the
 most frequently seen is tmpfs`tmp_readdir
 
 It's true, but I didn't see any potential issues there.
 
 
 
 tmpfs(/tmp) is the place where PHP place the temporary
 folders and files.
 
 
 bingo
 
 
 I have to paste the lock investigation here again,
 
 Firstly, you can see which lock is spinning:
 ===
 # lockstat -D 10 sleep 2
 
 Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)
 
 Count indv cuml rcnt nsec Lock   Caller
 ---
 829064  45%  45% 0.00    33280 0xff117624a5d0 rrw_enter_read+0x1b
 705001  38%  82% 0.00    30983 0xff117624a5d0 rrw_exit+0x1d
 140678   8%  90% 0.00    10546 0xff117624a6e0 zfs_zget+0x46
 37208   2%  92% 0.00 5403 0xff114b136840 vn_rele+0x1e
 33926   2%  94% 0.00 5417 0xff114b136840 lookuppnatcred+0xc5
 27188   1%  95% 0.00 1155 vn_vfslocks_buckets+0xd980
 vn_vfslocks_getlock+0x3b
 11073   1%  96% 0.00 1639 vn_vfslocks_buckets+0x4600
 vn_vfslocks_getlock+0x3b
 9321   1%  96% 0.00 1961 0xff114b82a680 dnlc_lookup+0x83
 6929   0%  97% 0.00 1590 0xff11573b8f28
 zfs_fastaccesschk_execute+0x6a
 5746   0%  97% 0.00 5935 0xff114b136840 lookuppnvp+0x566
 ---
 
 Then if you look at the caller of lock(0xff117624a5d0), you'll see
 it's ZFS, not tmpfs.
 
 Count indv cuml rcnt nsec Lock   Caller
 48494   6%  17% 0.00   145263 0xff117624a5d0 rrw_enter_read+0x1b
 
  nsec -- Time Distribution -- count Stack
   256 |   17rrw_enter+0x2c
   512 |   1120  zfs_root+0x3b
  1024 |@  1718  fsop_root+0x2d
  2048 |@@ 4834  traverse+0x65
  4096 |@@@18569 lookuppnvp+0x446
  8192 |   6620  lookuppnatcred+0x119
 16384 |@  2929  lookupnameatcred+0x97
 32768 |@  1635  lookupnameat+0x6b
 65536 |   894   cstatat_getvp+0x11e
131072 |   1249  cstatat64_32+0x5d
262144 |@  1620  fstatat64_32+0x4c
524288 |@  2474
 _sys_sysenter_post_swapgs+0x149
 
 
 That's why I post this subject here. Hope it's clear this time.
 
 Thanks,
 -Aubrey

--
DTrace Conference, April 3, 2012, 
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-26 Thread Jim Klimov
  As a random guess, try pointing PHP tmp directory to
  /var/tmp (backed by zfs) and see if any behaviors change?
 
  Good luck,
  //Jim
 
 
 Thanks for your suggestions. Actually the default PHP tmp directory
 was /var/tmp, and I changed /var/tmp to /tmp. This reduced zfs
 root lock contention significantly. However, I still see a bunch 
 of lock
 contention. So I'm here ask for help.

Well, as a further attempt down this road, is it possible for you to rule out
ZFS from swapping - i.e., if RAM amounts permit, disable swap entirely
(swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated slices on the
same or, better yet, separate disks?
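
For the record, moving swap off the zvol is only a couple of commands
(a sketch; the slice name below is purely an example, and the box needs
enough RAM or alternate swap while the zvol device is removed):

swap -l                               # confirm swap currently lives on the zvol
swap -d /dev/zvol/dsk/rpool/swap      # remove the zvol-backed swap device
swap -a /dev/dsk/c11t1d0s1            # add a dedicated raw slice instead (example device)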
 
If you do have lots of swapping activity (that can be seen in vmstat 1 as 
si/so columns) going on in a zvol, you're likely to get much fragmentation
in the pool, and searching for contiguous stretches of space can become
tricky (and time-consuming), or larger writes can get broken down into
many smaller random writes and/or gang blocks, which is also slower.
At least such waiting on disks can explain the overall large kernel times.
 
You can also see the disk wait times ratio in iostat -xzn 1 column %w
and disk busy times ratio in %b (second and third from the right).
I don't remember you posting that.
 
If these are accounting in tens, or even close or equal to 100%, then
your disks are the actual bottleneck. Speeding up that subsystem, 
including addition of cache (ARC RAM, L2ARC SSD, maybe ZIL 
SSD/DDRDrive) and combatting fragmentation by moving swap and 
other scratch spaces to dedicated pools or raw slices might help.

HTH,
//Jim
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-26 Thread Aubrey Li
On Tue, Mar 27, 2012 at 1:15 AM, Jim Klimov j...@cos.ru wrote:
 Well, as a further attempt down this road, is it possible for you to rule
 out
 ZFS from swapping - i.e. if RAM amounts permit, disable the swap at all
 (swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated slices of
 same or better yet separate disks?


Thanks Jim for your suggestion!


 If you do have lots of swapping activity (that can be seen in vmstat 1 as
 si/so columns) going on in a zvol, you're likely to get much fragmentation
 in the pool, and searching for contiguous stretches of space can become
 tricky (and time-consuming), or larger writes can get broken down into
 many smaller random writes and/or gang blocks, which is also slower.
 At least such waiting on disks can explain the overall large kernel times.

I took swapping activity into account; even when the CPU% is 100%, si
(swap-ins) and so (swap-outs) are always zero.


 You can also see the disk wait times ratio in iostat -xzn 1 column %w
 and disk busy times ratio in %b (second and third from the right).
 I don't remember you posting that.

 If these are accounting in tens, or even close or equal to 100%, then
 your disks are the actual bottleneck. Speeding up that subsystem,
 including addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
 SSD/DDRDrive) and combatting fragmentation by moving swap and
 other scratch spaces to dedicated pools or raw slices might help.

My storage system is not quite busy, and there are only read operations.
=
# iostat -xnz 3
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  112.4    0.0 1691.5    0.0  0.0  0.5    0.0    4.8   0  41 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  118.7    0.0 1867.0    0.0  0.0  0.5    0.0    4.5   0  42 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  127.7    0.0 2121.6    0.0  0.0  0.6    0.0    4.7   0  44 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  141.3    0.0 2158.5    0.0  0.0  0.7    0.0    4.6   0  48 c11t0d0
==

Thanks,
-Aubrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Richard Elling
On Mar 24, 2012, at 10:29 PM, Aubrey Li wrote:

 Hi,
 
 I'm migrating a webserver(apache+php) from RHEL to solaris. During the
 stress testing comparison, I found under the same session number of client
 request, CPU% is ~70% on RHEL while CPU% is full on solaris.
 
 After some investigation, zfs root lock emerges as a major doubtful point.

This data does not seem major, less than 1% of total walltime. 
What is the ratio of user/sys?
 -- richard

 
 Firstly, there are a bunch of mutex events.
 
 # lockstat -D 10 sleep 2
 
 Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)
 
 Count indv cuml rcnt nsec Lock   Caller
 ---
 829064  45%  45% 0.00    33280 0xff117624a5d0 rrw_enter_read+0x1b
 705001  38%  82% 0.00    30983 0xff117624a5d0 rrw_exit+0x1d
 140678   8%  90% 0.00    10546 0xff117624a6e0 zfs_zget+0x46
 37208   2%  92% 0.00 5403 0xff114b136840 vn_rele+0x1e
 33926   2%  94% 0.00 5417 0xff114b136840 lookuppnatcred+0xc5
 27188   1%  95% 0.00 1155 vn_vfslocks_buckets+0xd980
 vn_vfslocks_getlock+0x3b
 11073   1%  96% 0.00 1639 vn_vfslocks_buckets+0x4600
 vn_vfslocks_getlock+0x3b
 9321   1%  96% 0.00 1961 0xff114b82a680 dnlc_lookup+0x83
 6929   0%  97% 0.00 1590 0xff11573b8f28
 zfs_fastaccesschk_execute+0x6a
 5746   0%  97% 0.00 5935 0xff114b136840 lookuppnvp+0x566
 ---
 
 If we look at the lock caller's stack, the following two are the hotspots.
 ==
 # lockstat -D 10 -l 0xff117624a5d0 -s 15 sleep 2
 
 Adaptive mutex spin: 865609 events in 3.237 seconds (267444 events/sec)
 ---
 Count indv cuml rcnt nsec Lock   Caller
 51215   6%   6% 0.00   100023 0xff117624a5d0 rrw_exit+0x1d
 
 nsec -- Time Distribution -- count Stack
   256 |                         61    zfs_root+0x78
   512 |@                        1746  fsop_root+0x2d
  1024 |@@@                      5210  traverse+0x65
  2048 |@@@                      13464 lookuppnvp+0x446
  4096 |@@@                      13553 lookuppnatcred+0x119
  8192 |@@@                      5355  lookupnameatcred+0x97
 16384 |@                        2403  lookupnameat+0x6b
 32768 |                         1331  cstatat_getvp+0x11e
 65536 |                         673   cstatat64_32+0x5d
131072 |                         961   fstatat64_32+0x4c
262144 |                         1259  _sys_sysenter_post_swapgs+0x149
 ---
 Count indv cuml rcnt nsec Lock   Caller
 48494   6%  17% 0.00   145263 0xff117624a5d0 rrw_enter_read+0x1b
 
 nsec -- Time Distribution -- count Stack
   256 |                         17    rrw_enter+0x2c
   512 |                         1120  zfs_root+0x3b
  1024 |@                        1718  fsop_root+0x2d
  2048 |@@                       4834  traverse+0x65
  4096 |@@@                      18569 lookuppnvp+0x446
  8192 |                         6620  lookuppnatcred+0x119
 16384 |@                        2929  lookupnameatcred+0x97
 32768 |@                        1635  lookupnameat+0x6b
 65536 |                         894   cstatat_getvp+0x11e
131072 |                         1249  cstatat64_32+0x5d
262144 |@                        1620  fstatat64_32+0x4c
524288 |@                        2474  _sys_sysenter_post_swapgs+0x149
 
 
 The apache root documentation is apparently in zfs dataset. It looks like each
 file system operation will run into zfs root mutex contention.
 
 Is this an expected behavior? If so, is there any zfs tunable to
 address this issue?
 
 Thanks,
 -Aubrey
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
DTrace Conference, April 3, 2012, 
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Aubrey Li
On Sun, Mar 25, 2012 at 3:55 PM, Richard Elling
richard.ell...@richardelling.com wrote:
 On Mar 24, 2012, at 10:29 PM, Aubrey Li wrote:

 Hi,

 I'm migrating a webserver(apache+php) from RHEL to solaris. During the
 stress testing comparison, I found under the same session number of client
 request, CPU% is ~70% on RHEL while CPU% is full on solaris.

 After some investigation, zfs root lock emerges as a major doubtful point.


 This data does not seem major, less than 1% of total walltime.
 What is the ratio of user/sys?
  -- richard

Here is the vmstat output.
===
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s1 s2 s2   in   sy   cs us sy id
 37 0 0 25257648 34744756 26671 40446 0 0 0 0 0 0 0 258 0 54473 833569 100171 67 31 2
 16 0 0 25270872 34745544 26416 40427 0 0 0 0 0 0 0 161 0 53804 850816 100016 68 29 3
 80 0 0 25266716 34731880 27082 41500 0 0 0 0 0 0 0 145 0 56908 901578 98194 70 28 1
 81 0 0 25318348 34763484 26436 40384 0 0 0 0 0 0 0 173 0 59977 1014699 99273 66 33 0
 104 0 0 25359888 34787868 25539 39726 0 0 0 0 0 0 0 122 0 58175 989927 93106 68 32 0
 113 0 0 25359208 34779872 25746 39992 0 0 0 0 0 0 0 161 0 57212 916427 94401 67 33 0
 123 0 0 25409656 34809704 26228 41330 0 0 0 0 0 0 0 176 0 59332 890511 97127 69 31 0
 97 0 0 25419628 34807796 26508 42013 0 0 0 0 0 0 0 148 0 59736 864022 98918 71 29 0
 111 0 0 25468728 34834968 26477 40232 0 0 0 0 0 0 0 159 0 59270 867006 98308 67 33 0
 105 0 0 25501972 34860952 26776 42204 0 0 0 0 0 0 0 163 0 70868 867899 97981 71 29 0
 78 0 0 25526492 34865416 26489 43504 0 0 0 0 0 0 0 157 0 61110 952207 99806 67 33 0
==

and mpstat -a matches the above measurement: ~200,000 spins on mutexes (smtx).
==
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806  65  35   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 36044   0 3268 61910 19482 96287 32121 32844 245293 1122 954150  64  36   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 36516   0 3900 61459 19478 95502 31256 35345 221473 1106 932210  66  34   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 35791   0 3544 60751 19371 97982 31158 33920 209179 3115 859644  64  36   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 38317   0 4892 63046 19566 97316 31688 38100 220262 2058 91  65  35   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 36408   0 3154 60766 19393 94343 31244 35582 248314  811 980967  65  35   0   0  32
===

Thanks,
-Aubrey


 Firstly, there are a bunch of mutex events.
 
 # lockstat -D 10 sleep 2

 Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)

 Count indv cuml rcnt nsec Lock   Caller
 ---
 829064  45%  45% 0.00    33280 0xff117624a5d0 rrw_enter_read+0x1b
 705001  38%  82% 0.00    30983 0xff117624a5d0 rrw_exit+0x1d
 140678   8%  90% 0.00    10546 0xff117624a6e0 zfs_zget+0x46
 37208   2%  92% 0.00 5403 0xff114b136840 vn_rele+0x1e
 33926   2%  94% 0.00 5417 0xff114b136840 lookuppnatcred+0xc5
 27188   1%  95% 0.00 1155 vn_vfslocks_buckets+0xd980
 vn_vfslocks_getlock+0x3b
 11073   1%  96% 0.00 1639 vn_vfslocks_buckets+0x4600
 vn_vfslocks_getlock+0x3b
 9321   1%  96% 0.00 1961 0xff114b82a680 dnlc_lookup+0x83
 6929   0%  97% 0.00 1590 0xff11573b8f28
 zfs_fastaccesschk_execute+0x6a
 5746   0%  97% 0.00 5935 0xff114b136840 lookuppnvp+0x566
 ---

 If we look at the lock caller's stack, the following two are the hotspots.
 ==
 # lockstat -D 10 -l 0xff117624a5d0 -s 15 sleep 2

 Adaptive mutex spin: 865609 events in 3.237 seconds (267444 events/sec)
 ---
 Count indv cuml rcnt nsec Lock   Caller
 51215   6%   6% 0.00   100023 0xff117624a5d0 rrw_exit+0x1d

 nsec -- Time Distribution -- count Stack
  256 |   61    zfs_root+0x78
  512 |@  1746  fsop_root+0x2d
 1024 |@@@    5210  traverse+0x65
 2048 |@@@    13464 +0x446
 4096 |@@@    

Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Richard Elling
This is the wrong forum for general purpose performance tuning. So I won't 
continue this much farther.  Notice the huge number of icsw, that is a bigger
symptom than locks.
 -- richard

On Mar 25, 2012, at 6:24 AM, Aubrey Li wrote:

 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl 
 sze
  0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806
 65  35   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl 
 sze
  0 36044   0 3268 61910 19482 96287 32121 32844 245293 1122 954150
 64  36   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl 
 sze
  0 36516   0 3900 61459 19478 95502 31256 35345 221473 1106 932210
 66  34   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl 
 sze
  0 35791   0 3544 60751 19371 97982 31158 33920 209179 3115 859644
 64  36   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl 
 sze
  0 38317   0 4892 63046 19566 97316 31688 38100 220262 2058 91
 65  35   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl 
 sze
  0 36408   0 3154 60766 19393 94343 31244 35582 248314  811 980967
 65  35   0   0  32

--
DTrace Conference, April 3, 2012, 
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Aubrey Li
On Mon, Mar 26, 2012 at 12:48 AM, Richard Elling
richard.ell...@richardelling.com wrote:
 This is the wrong forum for general purpose performance tuning. So I won't
 continue this much farther.  Notice the huge number of icsw, that is a
 bigger
 symptom than locks.
  -- richard

Thanks anyway; the lock must be a problem. The scenario here is that
apache causes a bunch of stat() syscalls, which leads to a bunch of
zfs vnode accesses, beyond the normal read/write operations.

The problem is that every zfs vnode access needs the **same zfs root**
lock. When the number of httpd processes and the corresponding kernel
threads becomes large, this root lock contention becomes horrible.
This situation does not occur on Linux.
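
A rough way to quantify that stat() storm and see which processes
drive it (a sketch; the glob catches stat, stat64, fstatat64 and
friends):

dtrace -n 'syscall::*stat*:entry { @[probefunc, execname] = count(); }
    tick-10sec { trunc(@, 10); exit(0); }'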

Let me see if any others have a clue for me. Any suggestions will be
highly appreciated!

Regards,
-Aubrey


 On Mar 25, 2012, at 6:24 AM, Aubrey Li wrote:

 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806
 65  35   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 36044   0 3268 61910 19482 96287 32121 32844 245293 1122 954150
 64  36   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 36516   0 3900 61459 19478 95502 31256 35345 221473 1106 932210
 66  34   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 35791   0 3544 60751 19371 97982 31158 33920 209179 3115 859644
 64  36   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 38317   0 4892 63046 19566 97316 31688 38100 220262 2058 91
 65  35   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 36408   0 3154 60766 19393 94343 31244 35582 248314  811 980967
 65  35   0   0  32


 --
 DTrace Conference, April 3,
 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
 ZFS Performance and Training
 richard.ell...@richardelling.com
 +1-760-896-4422



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread zfs user

On 3/25/12 10:25 AM, Aubrey Li wrote:

On Mon, Mar 26, 2012 at 12:48 AM, Richard Elling
richard.ell...@richardelling.com  wrote:

This is the wrong forum for general purpose performance tuning. So I won't
continue this much farther.  Notice the huge number of icsw, that is a
bigger
symptom than locks.
  -- richard


thanks anyway, lock must be a problem. the scenario here is, apache
causes a bunch of stat()
syscall, which leads to a bunch of zfs vnode access, byond the normal
read/write operation.

The problem is, every zfs vnode access need the **same zfs root**
lock. When the number of
httpd processes and the corresponding kernel threads becomes large,
this root lock contention
becomes horrible. This situation does not occurs on linux.

Let me see if any others have a clue for me. Any suggestions will be
highly appreciated!

Regards,
-Aubrey


Is noatime turned on for all your zfs filesystems?
Have you tried the latest version of httpd?
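
For completeness, the property can be checked and, if need be, turned
off per dataset (export1 is the pool holding the document root in this
thread; children inherit the setting unless they override it):

zfs get -r atime export1
zfs set atime=off export1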
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Richard Elling
On Mar 25, 2012, at 10:25 AM, Aubrey Li wrote:

 On Mon, Mar 26, 2012 at 12:48 AM, Richard Elling
 richard.ell...@richardelling.com wrote:
 This is the wrong forum for general purpose performance tuning. So I won't
 continue this much farther.  Notice the huge number of icsw, that is a
 bigger
 symptom than locks.
  -- richard
 
 thanks anyway, lock must be a problem. the scenario here is, apache
 causes a bunch of stat()
 syscall, which leads to a bunch of zfs vnode access, byond the normal
 read/write operation.

That does not explain the high icsw. In a properly sized system icsw will be
on the order of 0 to 100, closer to 0.

If there are millions of files, then you should check the DNLC hit rate.
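
The DNLC hit rate can be read straight from the kstats, e.g. (a sketch):

kstat -n dnlcstats                 # raw hit/miss counters
vmstat -s | grep 'name lookups'    # summary line with the cache hit percentage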

 
 The problem is, every zfs vnode access need the **same zfs root**
 lock. When the number of
 httpd processes and the corresponding kernel threads becomes large,
 this root lock contention
 becomes horrible. This situation does not occurs on linux.

I disagree with your conclusion and I've seen ZFS systems do millions of stats()
per second without issue. What does prstat -Lm show?
 -- richard

 
 Let me see if any others have a clue for me. Any suggestions will be
 highly appreciated!
 
 Regards,
 -Aubrey
 
 
 On Mar 25, 2012, at 6:24 AM, Aubrey Li wrote:
 
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806
 65  35   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 36044   0 3268 61910 19482 96287 32121 32844 245293 1122 954150
 64  36   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 36516   0 3900 61459 19478 95502 31256 35345 221473 1106 932210
 66  34   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 35791   0 3544 60751 19371 97982 31158 33920 209179 3115 859644
 64  36   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 38317   0 4892 63046 19566 97316 31688 38100 220262 2058 91
 65  35   0   0  32
 SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 sze
  0 36408   0 3154 60766 19393 94343 31244 35582 248314  811 980967
 65  35   0   0  32
 
 
 --
 DTrace Conference, April 3,
 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
 ZFS Performance and Training
 richard.ell...@richardelling.com
 +1-760-896-4422
 
 
 

--
DTrace Conference, April 3, 2012, 
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Aubrey Li
On Mon, Mar 26, 2012 at 2:10 AM, zfs user zf...@itsbeen.sent.com wrote:
 On 3/25/12 10:25 AM, Aubrey Li wrote:

 On Mon, Mar 26, 2012 at 12:48 AM, Richard Elling
 richard.ell...@richardelling.com  wrote:

 This is the wrong forum for general purpose performance tuning. So I
 won't
 continue this much farther.  Notice the huge number of icsw, that is a
 bigger
 symptom than locks.
  -- richard


 thanks anyway, lock must be a problem. the scenario here is, apache
 causes a bunch of stat()
 syscall, which leads to a bunch of zfs vnode access, byond the normal
 read/write operation.

 The problem is, every zfs vnode access need the **same zfs root**
 lock. When the number of
 httpd processes and the corresponding kernel threads becomes large,
 this root lock contention
 becomes horrible. This situation does not occurs on linux.

 Let me see if any others have a clue for me. Any suggestions will be
 highly appreciated!

 Regards,
 -Aubrey


 Is noatime turned on for all your zfs filesystems?
 Have you tried the latest version of httpd?


Yeah, atime is turned off. It doesn't help reduce the zfs vnode accesses.
I didn't try the latest version of httpd; I want to get the same performance
with the same userland software stack as on Linux.

Thanks,
-Aubrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Aubrey Li
On Mon, Mar 26, 2012 at 2:58 AM, Richard Elling
richard.ell...@richardelling.com wrote:
 On Mar 25, 2012, at 10:25 AM, Aubrey Li wrote:

 On Mon, Mar 26, 2012 at 12:48 AM, Richard Elling
 richard.ell...@richardelling.com wrote:

 This is the wrong forum for general purpose performance tuning. So I won't

 continue this much farther.  Notice the huge number of icsw, that is a

 bigger

 symptom than locks.

  -- richard


 thanks anyway, lock must be a problem. the scenario here is, apache
 causes a bunch of stat()
 syscall, which leads to a bunch of zfs vnode access, byond the normal
 read/write operation.


 That does not explain the high icsw. In a properly sized system icsw will be
 on the order of 0 to 100, closer to 0.

 If there are millions of files, then you should check the DLNC hit rate.

Involuntary context switches might be another story I need to take care of.
But I'm not aware that they are controllable, are they?



 The problem is, every zfs vnode access need the **same zfs root**
 lock. When the number of
 httpd processes and the corresponding kernel threads becomes large,
 this root lock contention
 becomes horrible. This situation does not occurs on linux.


 I disagree with your conclusion and I've seen ZFS systems do millions of
 stats()
 per second without issue. What does prstat -Lm show?
  -- richard


I had never seen any issues until I did a comparison with Linux.

Thanks,
-Aubrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Fajar A. Nugraha
On Mon, Mar 26, 2012 at 2:13 AM, Aubrey Li aubrey...@gmail.com wrote:
 The problem is, every zfs vnode access need the **same zfs root**
 lock. When the number of
 httpd processes and the corresponding kernel threads becomes large,
 this root lock contention
 becomes horrible. This situation does not occurs on linux.


 I disagree with your conclusion and I've seen ZFS systems do millions of
 stats()
 per second without issue. What does prstat -Lm show?
  -- richard


 I have ever not seen any issues until I did a comparison with Linux.

So basically you're comparing linux + ext3/4 performance with solaris
+ zfs, on the same hardware? That's not really fair, is it?
If your load is I/O-intensive enough, ext3/4 will easily win since it
doesn't have to do things like calculate checksums.

Now if you compare it to .. say ... btrfs, or zfsonlinux, it'd be more
in the same ballpark.
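
If you want something closer to apples-to-apples, the relevant dataset
features can at least be inspected, and checksums disabled for the
duration of the benchmark only (a sketch; turning checksums off gives
up ZFS's data integrity checking, so it is not something to leave set
on a production pool):

zfs get checksum,compression,atime export1
zfs set checksum=off export1       # benchmark-only experiment; set back to on afterwards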

-- 
Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Jim Mauro
If you're chasing CPU utilization, specifically %sys (time in the kernel),
I would start with a time-based kernel profile.

#dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { 
trunc(@, 20); printa(@); exit(0); }'

I would be curious to see where the CPU cycles are being consumed first,
before going down the lock path…

This assumes that most or all of the CPU utilization is %sys. If it's %usr, we take
a different approach.
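
For the %usr case, a user-land analogue of the profile above might look
like this (just a sketch; arg1 is the user PC, and you would probably
also filter on execname):

dtrace -n 'profile-997hz /arg1/ { @[execname, ustack(5)] = count(); }
    tick-60sec { trunc(@, 20); exit(0); }'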

Thanks
/jim

On Mar 25, 2012, at 1:29 AM, Aubrey Li wrote:

 Hi,
 
 I'm migrating a webserver(apache+php) from RHEL to solaris. During the
 stress testing comparison, I found under the same session number of client
 request, CPU% is ~70% on RHEL while CPU% is full on solaris.
 
 After some investigation, zfs root lock emerges as a major doubtful point.
 
 Firstly, there are a bunch of mutex events.
 
 # lockstat -D 10 sleep 2
 
 Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)
 
 Count indv cuml rcnt nsec Lock   Caller
 ---
 829064  45%  45% 0.00    33280 0xff117624a5d0 rrw_enter_read+0x1b
 705001  38%  82% 0.00    30983 0xff117624a5d0 rrw_exit+0x1d
 140678   8%  90% 0.00    10546 0xff117624a6e0 zfs_zget+0x46
 37208   2%  92% 0.00 5403 0xff114b136840 vn_rele+0x1e
 33926   2%  94% 0.00 5417 0xff114b136840 lookuppnatcred+0xc5
 27188   1%  95% 0.00 1155 vn_vfslocks_buckets+0xd980
 vn_vfslocks_getlock+0x3b
 11073   1%  96% 0.00 1639 vn_vfslocks_buckets+0x4600
 vn_vfslocks_getlock+0x3b
 9321   1%  96% 0.00 1961 0xff114b82a680 dnlc_lookup+0x83
 6929   0%  97% 0.00 1590 0xff11573b8f28
 zfs_fastaccesschk_execute+0x6a
 5746   0%  97% 0.00 5935 0xff114b136840 lookuppnvp+0x566
 ---
 
 If we look at the lock caller's stack, the following two are the hotspots.
 ==
 # lockstat -D 10 -l 0xff117624a5d0 -s 15 sleep 2
 
 Adaptive mutex spin: 865609 events in 3.237 seconds (267444 events/sec)
 ---
 Count indv cuml rcnt nsec Lock   Caller
 51215   6%   6% 0.00   100023 0xff117624a5d0 rrw_exit+0x1d
 
 nsec -- Time Distribution -- count Stack
  256 |   61zfs_root+0x78
  512 |@  1746  fsop_root+0x2d
 1024 |@@@5210  traverse+0x65
 2048 |@@@13464 +0x446
 4096 |@@@13553 lookuppnatcred+0x119
 8192 |@@@5355  lookupnameatcred+0x97
 16384 |@  2403  lookupnameat+0x6b
 32768 |   1331  cstatat_getvp+0x11e
 65536 |   673   cstatat64_32+0x5d
 131072 |   961   fstatat64_32+0x4c
 262144 |   1259  
 _sys_sysenter_post_swapgs+0x149
 ---
 Count indv cuml rcnt nsec Lock   Caller
 48494   6%  17% 0.00   145263 0xff117624a5d0 rrw_enter_read+0x1b
 
 nsec -- Time Distribution -- count Stack
   256 |   17rrw_enter+0x2c
   512 |   1120  zfs_root+0x3b
  1024 |@  1718  fsop_root+0x2d
  2048 |@@ 4834  traverse+0x65
  4096 |@@@18569 lookuppnvp+0x446
  8192 |   6620  lookuppnatcred+0x119
 16384 |@  2929  lookupnameatcred+0x97
 32768 |@  1635  lookupnameat+0x6b
 65536 |   894   cstatat_getvp+0x11e
 131072 |   1249  cstatat64_32+0x5d
 262144 |@  1620  fstatat64_32+0x4c
 524288 |@  2474  
 _sys_sysenter_post_swapgs+0x149
 
 
 The apache root documentation is apparently in zfs dataset. It looks like each
 file system operation will run into zfs root mutex contention.
 
 Is this an expected behavior? If so, is there any zfs tunable to
 address this issue?
 
 Thanks,
 -Aubrey
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Aubrey Li
On Mon, Mar 26, 2012 at 3:22 AM, Fajar A. Nugraha w...@fajar.net wrote:

 I have ever not seen any issues until I did a comparison with Linux.

 So basically you're comparing linux + ext3/4 performance with solaris
 + zfs, on the same hardware? That's not really fair, is it?
 If your load is I/O-intensive enough, ext3/4 will easily win since it
 doesn't have to do things like calculating checksum.

 Now if you compare it to .. say ... btrfs, or zfsonlinux, it'd be more
 in the same ballpark.


I'd be more comfortable if I didn't see the lock contention. Are you indicating
that this (every file vnode access acquiring the same root mutex) is normal?
It would be fine if calculating checksums were what consumed the CPU.

Thanks,
-Aubrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Aubrey Li
On Mon, Mar 26, 2012 at 4:18 AM, Jim Mauro james.ma...@oracle.com wrote:
 If you're chasing CPU utilization, specifically %sys (time in the kernel),
 I would start with a time-based kernel profile.

 #dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { 
 trunc(@, 20); printa(@); exit(0); }'

 I would be curious to see where the CPU cycles are being consumed first,
 before going down the lock path…

 This assume that most or all of CPU utilization is %sys. If it's %usr, we take
 a different approach.


Here is the output; I changed it to tick-5sec and trunc(@, 5).
No. 2 and No. 3 are what I care about.

Thanks,
-Aubrey

21  80536   :tick-5sec
== 1 =
  genunix`avl_walk+0x6a
  genunix`as_gap_aligned+0x2b7
  unix`map_addr_proc+0x179
  unix`map_addr+0x8e
  genunix`choose_addr+0x9e
  zfs`zfs_map+0x161
  genunix`fop_map+0xc5
  genunix`smmap_common+0x268
  genunix`smmaplf32+0xa2
  genunix`syscall_ap+0x92
  unix`_sys_sysenter_post_swapgs+0x149
 1427

= 2 =
  unix`mutex_delay_default+0x7
  unix`mutex_vector_enter+0x2ae
  zfs`zfs_zget+0x46
  zfs`zfs_root+0x55
  genunix`fsop_root+0x2d
  genunix`traverse+0x65
  genunix`lookuppnvp+0x446
  genunix`lookuppnatcred+0x119
  genunix`lookupnameatcred+0x97
  genunix`lookupnameat+0x6b
  genunix`vn_openat+0x147
  genunix`copen+0x493
  genunix`openat64+0x2d
  unix`_sys_sysenter_post_swapgs+0x149
 2645

== 3 =
  unix`mutex_delay_default+0x7
  unix`mutex_vector_enter+0x2ae
  zfs`zfs_zget+0x46
  zfs`zfs_root+0x55
  genunix`fsop_root+0x2d
  genunix`traverse+0x65
  genunix`lookuppnvp+0x446
  genunix`lookuppnatcred+0x119
  genunix`lookupnameatcred+0x97
  genunix`lookupnameat+0x6b
  genunix`cstatat_getvp+0x11e
  genunix`cstatat64_32+0x5d
  genunix`fstatat64_32+0x4c
  unix`_sys_sysenter_post_swapgs+0x149
 3201

 4 ===
  unix`i86_mwait+0xd
  unix`cpu_idle_mwait+0x154
  unix`idle+0x116
  unix`thread_start+0x8
 3559

= 5 ==
  tmpfs`tmp_readdir+0x138
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 4589

=  6 
  unix`strlen+0x3
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 5005

=== 7 =
  tmpfs`tmp_readdir+0xc7
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 9548


= 8 ===
  unix`strlen+0x8
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
11166


= 9 ===
  unix`strlen+0xe
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
14491

=  10 ==
  tmpfs`tmp_readdir+0xbe
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
39540
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Richard Elling
On Mar 25, 2012, at 6:51 PM, Aubrey Li wrote:
 On Mon, Mar 26, 2012 at 4:18 AM, Jim Mauro james.ma...@oracle.com wrote:
 If you're chasing CPU utilization, specifically %sys (time in the kernel),
 I would start with a time-based kernel profile.
 
 #dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { 
 trunc(@, 20); printa(@0; }'
 
 I would be curious to see where the CPU cycles are being consumed first,
 before going down the lock path…
 
 This assume that most or all of CPU utilization is %sys. If it's %usr, we 
 take
 a different approach.
 
 
 Here is the output, I changed to tick-5sec and trunc(@, 5).
 No.2 and No.3 is what I care about.
 
 Thanks,
 -Aubrey
 
 21  80536   :tick-5sec
 == 1 =
  genunix`avl_walk+0x6a
  genunix`as_gap_aligned+0x2b7
  unix`map_addr_proc+0x179
  unix`map_addr+0x8e
  genunix`choose_addr+0x9e
  zfs`zfs_map+0x161
  genunix`fop_map+0xc5
  genunix`smmap_common+0x268
  genunix`smmaplf32+0xa2
  genunix`syscall_ap+0x92
  unix`_sys_sysenter_post_swapgs+0x149
 1427
 
 = 2 =
  unix`mutex_delay_default+0x7
  unix`mutex_vector_enter+0x2ae
  zfs`zfs_zget+0x46
  zfs`zfs_root+0x55
  genunix`fsop_root+0x2d
  genunix`traverse+0x65
  genunix`lookuppnvp+0x446
  genunix`lookuppnatcred+0x119
  genunix`lookupnameatcred+0x97
  genunix`lookupnameat+0x6b
  genunix`vn_openat+0x147
  genunix`copen+0x493
  genunix`openat64+0x2d
  unix`_sys_sysenter_post_swapgs+0x149
 2645
 
 == 3 =
  unix`mutex_delay_default+0x7
  unix`mutex_vector_enter+0x2ae
  zfs`zfs_zget+0x46
  zfs`zfs_root+0x55
  genunix`fsop_root+0x2d
  genunix`traverse+0x65
  genunix`lookuppnvp+0x446
  genunix`lookuppnatcred+0x119
  genunix`lookupnameatcred+0x97
  genunix`lookupnameat+0x6b
  genunix`cstatat_getvp+0x11e
  genunix`cstatat64_32+0x5d
  genunix`fstatat64_32+0x4c
  unix`_sys_sysenter_post_swapgs+0x149
 3201
 
  4 ===
  unix`i86_mwait+0xd
  unix`cpu_idle_mwait+0x154
  unix`idle+0x116
  unix`thread_start+0x8
 3559
 
 = 5 ==
  tmpfs`tmp_readdir+0x138
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 4589
 
 =  6 
  unix`strlen+0x3
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 5005
 
 === 7 =
  tmpfs`tmp_readdir+0xc7
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 9548
 
 
 = 8 ===
  unix`strlen+0x8
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
11166
 
 
 = 9 ===
  unix`strlen+0xe
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
14491
 
 =  10 ==
  tmpfs`tmp_readdir+0xbe
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
39540


Well, there you go. tmpfs is not zfs.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
SCALE 10x, Los Angeles, Jan 20-22, 2012

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Aubrey Li
On Mon, Mar 26, 2012 at 11:34 AM, Richard Elling
richard.ell...@gmail.com wrote:
 On Mar 25, 2012, at 6:51 PM, Aubrey Li wrote:
 On Mon, Mar 26, 2012 at 4:18 AM, Jim Mauro james.ma...@oracle.com wrote:
 If you're chasing CPU utilization, specifically %sys (time in the kernel),
 I would start with a time-based kernel profile.

 #dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { 
 trunc(@, 20); printa(@); }'

 I would be curious to see where the CPU cycles are being consumed first,
 before going down the lock path…

 This assumes that most or all of the CPU utilization is %sys. If it's %usr,
 we take a different approach.


 Here is the output; I changed to tick-5sec and trunc(@, 5).
 No.2 and No.3 are what I care about.

 Thanks,
 -Aubrey

 21  80536                       :tick-5sec
 == 1 =
              genunix`avl_walk+0x6a
              genunix`as_gap_aligned+0x2b7
              unix`map_addr_proc+0x179
              unix`map_addr+0x8e
              genunix`choose_addr+0x9e
              zfs`zfs_map+0x161
              genunix`fop_map+0xc5
              genunix`smmap_common+0x268
              genunix`smmaplf32+0xa2
              genunix`syscall_ap+0x92
              unix`_sys_sysenter_post_swapgs+0x149
             1427

 = 2 =
              unix`mutex_delay_default+0x7
              unix`mutex_vector_enter+0x2ae
              zfs`zfs_zget+0x46
              zfs`zfs_root+0x55
              genunix`fsop_root+0x2d
              genunix`traverse+0x65
              genunix`lookuppnvp+0x446
              genunix`lookuppnatcred+0x119
              genunix`lookupnameatcred+0x97
              genunix`lookupnameat+0x6b
              genunix`vn_openat+0x147
              genunix`copen+0x493
              genunix`openat64+0x2d
              unix`_sys_sysenter_post_swapgs+0x149
             2645

 == 3 =
              unix`mutex_delay_default+0x7
              unix`mutex_vector_enter+0x2ae
              zfs`zfs_zget+0x46
              zfs`zfs_root+0x55
              genunix`fsop_root+0x2d
              genunix`traverse+0x65
              genunix`lookuppnvp+0x446
              genunix`lookuppnatcred+0x119
              genunix`lookupnameatcred+0x97
              genunix`lookupnameat+0x6b
              genunix`cstatat_getvp+0x11e
              genunix`cstatat64_32+0x5d
              genunix`fstatat64_32+0x4c
              unix`_sys_sysenter_post_swapgs+0x149
             3201

  4 ===
              unix`i86_mwait+0xd
              unix`cpu_idle_mwait+0x154
              unix`idle+0x116
              unix`thread_start+0x8
             3559

 = 5 ==
              tmpfs`tmp_readdir+0x138
              genunix`fop_readdir+0xe8
              genunix`getdents64+0xd5
              unix`_sys_sysenter_post_swapgs+0x149
             4589

 =  6 
              unix`strlen+0x3
              genunix`fop_readdir+0xe8
              genunix`getdents64+0xd5
              unix`_sys_sysenter_post_swapgs+0x149
             5005

 === 7 =
              tmpfs`tmp_readdir+0xc7
              genunix`fop_readdir+0xe8
              genunix`getdents64+0xd5
              unix`_sys_sysenter_post_swapgs+0x149
             9548


 = 8 ===
              unix`strlen+0x8
              genunix`fop_readdir+0xe8
              genunix`getdents64+0xd5
              unix`_sys_sysenter_post_swapgs+0x149
            11166


 = 9 ===
              unix`strlen+0xe
              genunix`fop_readdir+0xe8
              genunix`getdents64+0xd5
              unix`_sys_sysenter_post_swapgs+0x149
            14491

 =  10 ==
              tmpfs`tmp_readdir+0xbe
              genunix`fop_readdir+0xe8
              genunix`getdents64+0xd5
              unix`_sys_sysenter_post_swapgs+0x149
            39540


 Well, there you go. tmpfs is not zfs.
  -- richard

Let me explain below:

The Apache document root is on ZFS; you can see it in stack No.3 of
the dtrace report above.

tmpfs (/tmp) is where PHP places its temporary folders and files.
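
As a quick cross-check (a sketch, assuming the fsinfo provider is available
on this Solaris build; fi_fs is the filesystem type and fi_mount the mount
point), readdir activity can be broken down by filesystem:

# dtrace -qn 'fsinfo:::readdir { @[args[0]->fi_fs, args[0]->fi_mount] = count(); }
    tick-10sec { printa("%-8s %-24s %@d\n", @); exit(0); }'

If most of the counts land on /tmp, that supports the PHP temp-file explanation.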

If you go back to my first post in this thread, you'll see the ZFS
root lock contention as well.

Hope we are on the same page now.

Thanks,
-Aubrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Richard Elling
Apologies to the ZFSers, this thread really belongs elsewhere.

On Mar 25, 2012, at 10:11 PM, Aubrey Li wrote:

 On Mon, Mar 26, 2012 at 11:34 AM, Richard Elling
 richard.ell...@gmail.com wrote:
 On Mar 25, 2012, at 6:51 PM, Aubrey Li wrote:
 On Mon, Mar 26, 2012 at 4:18 AM, Jim Mauro james.ma...@oracle.com wrote:
 If you're chasing CPU utilization, specifically %sys (time in the kernel),
 I would start with a time-based kernel profile.
 
 #dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { 
 trunc(@, 20); printa(@); }'
 
 I would be curious to see where the CPU cycles are being consumed first,
 before going down the lock path…
 
 This assumes that most or all of the CPU utilization is %sys. If it's %usr,
 we take a different approach.
 
 
 Here is the output; I changed to tick-5sec and trunc(@, 5).
 No.2 and No.3 are what I care about.
 
 Thanks,
 -Aubrey
 
 21  80536   :tick-5sec
 == 1 =
  genunix`avl_walk+0x6a
  genunix`as_gap_aligned+0x2b7
  unix`map_addr_proc+0x179
  unix`map_addr+0x8e
  genunix`choose_addr+0x9e
  zfs`zfs_map+0x161
  genunix`fop_map+0xc5
  genunix`smmap_common+0x268
  genunix`smmaplf32+0xa2
  genunix`syscall_ap+0x92
  unix`_sys_sysenter_post_swapgs+0x149
 1427
 
 = 2 =
  unix`mutex_delay_default+0x7
  unix`mutex_vector_enter+0x2ae
  zfs`zfs_zget+0x46
  zfs`zfs_root+0x55
  genunix`fsop_root+0x2d
  genunix`traverse+0x65
  genunix`lookuppnvp+0x446
  genunix`lookuppnatcred+0x119
  genunix`lookupnameatcred+0x97
  genunix`lookupnameat+0x6b
  genunix`vn_openat+0x147
  genunix`copen+0x493
  genunix`openat64+0x2d
  unix`_sys_sysenter_post_swapgs+0x149
 2645
 
 == 3 =
  unix`mutex_delay_default+0x7
  unix`mutex_vector_enter+0x2ae
  zfs`zfs_zget+0x46
  zfs`zfs_root+0x55
  genunix`fsop_root+0x2d
  genunix`traverse+0x65
  genunix`lookuppnvp+0x446
  genunix`lookuppnatcred+0x119
  genunix`lookupnameatcred+0x97
  genunix`lookupnameat+0x6b
  genunix`cstatat_getvp+0x11e
  genunix`cstatat64_32+0x5d
  genunix`fstatat64_32+0x4c
  unix`_sys_sysenter_post_swapgs+0x149
 3201
 
  4 ===
  unix`i86_mwait+0xd
  unix`cpu_idle_mwait+0x154
  unix`idle+0x116
  unix`thread_start+0x8
 3559
 
 = 5 ==
  tmpfs`tmp_readdir+0x138
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 4589
 
 =  6 
  unix`strlen+0x3
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 5005
 
 === 7 =
  tmpfs`tmp_readdir+0xc7
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
 9548
 
 
 = 8 ===
  unix`strlen+0x8
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
11166
 
 
 = 9 ===
  unix`strlen+0xe
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
14491
 
 =  10 ==
  tmpfs`tmp_readdir+0xbe
  genunix`fop_readdir+0xe8
  genunix`getdents64+0xd5
  unix`_sys_sysenter_post_swapgs+0x149
39540
 
 
 Well, there you go. tmpfs is not zfs.
  -- richard
 
 Let me explain below:
 
 The Apache document root is on ZFS; you can see it in stack No.3 of
 the dtrace report above.

The sort is in reverse order: the large number below each stack trace is the
number of times that stack was seen, so the hottest stacks appear last. By far
the most frequently seen is tmpfs`tmp_readdir.
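
As an aside, the same profile can be printed with the hottest stack first; a
sketch, assuming this DTrace build supports the aggsortrev option:

# dtrace -x aggsortrev -n 'profile-997hz /arg0/ { @[stack()] = count(); }
    tick-5sec { trunc(@, 5); printa(@); exit(0); }'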

 
 tmpfs (/tmp) is where PHP places its temporary folders and files.

bingo

 
 If you go back to my first post in this thread, you'll see the ZFS
 root lock contention as well.
 
 Hope we are on the same page now.
 
 Thanks,
 -Aubrey

 -- richard

--
DTrace Conference, April 3, 2012, 
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org

Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Fajar A. Nugraha
On Mon, Mar 26, 2012 at 12:19 PM, Richard Elling
richard.ell...@richardelling.com wrote:
 Apologies to the ZFSers, this thread really belongs elsewhere.

Some of the info in it is informative for other zfs users as well though :)

 Here is the output, I changed to tick-5sec and trunc(@, 5).

 No.2 and No.3 is what I care about.

 The sort is in reverse order: the large number below each stack trace is the
 number of times that stack was seen, so the hottest stacks appear last. By far
 the most frequently seen is tmpfs`tmp_readdir.

So to summarize: even though there is lock activity in ZFS, it is normal and
insignificant performance-wise (at least in this case), and if performance is
a problem, the cause is elsewhere.

Is that correct?

-- 
Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-25 Thread Aubrey Li
On Mon, Mar 26, 2012 at 1:19 PM, Richard Elling
richard.ell...@richardelling.com wrote:
 Apologies to the ZFSers, this thread really belongs elsewhere.

 Let me explain below:

 The Apache document root is on ZFS; you can see it in stack No.3 of
 the dtrace report above.


 The sort is in reverse order: the large number below each stack trace is the
 number of times that stack was seen, so the hottest stacks appear last. By far
 the most frequently seen is tmpfs`tmp_readdir.

It's true, but I didn't see any potential issues there.



 tmpfs (/tmp) is where PHP places its temporary folders and files.


 bingo


I have to paste the lock investigation here again.

Firstly, you can see which lock is spinning:
===
# lockstat -D 10 sleep 2

Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)

Count   indv cuml rcnt     nsec Lock                       Caller
-------------------------------------------------------------------------------
829064   45%  45% 0.00    33280 0xff117624a5d0             rrw_enter_read+0x1b
705001   38%  82% 0.00    30983 0xff117624a5d0             rrw_exit+0x1d
140678    8%  90% 0.00    10546 0xff117624a6e0             zfs_zget+0x46
 37208    2%  92% 0.00     5403 0xff114b136840             vn_rele+0x1e
 33926    2%  94% 0.00     5417 0xff114b136840             lookuppnatcred+0xc5
 27188    1%  95% 0.00     1155 vn_vfslocks_buckets+0xd980 vn_vfslocks_getlock+0x3b
 11073    1%  96% 0.00     1639 vn_vfslocks_buckets+0x4600 vn_vfslocks_getlock+0x3b
  9321    1%  96% 0.00     1961 0xff114b82a680             dnlc_lookup+0x83
  6929    0%  97% 0.00     1590 0xff11573b8f28             zfs_fastaccesschk_execute+0x6a
  5746    0%  97% 0.00     5935 0xff114b136840             lookuppnvp+0x566
-------------------------------------------------------------------------------

Then, if you look at the callers of lock 0xff117624a5d0, you'll see
they are in ZFS, not tmpfs.

Count indv cuml rcnt nsec Lock   Caller
48494   6%  17% 0.00   145263 0xff117624a5d0 rrw_enter_read+0x1b

  nsec -- Time Distribution -- count  Stack
   256 |         17     rrw_enter+0x2c
   512 |         1120   zfs_root+0x3b
  1024 |@        1718   fsop_root+0x2d
  2048 |@@       4834   traverse+0x65
  4096 |@@@      18569  lookuppnvp+0x446
  8192 |         6620   lookuppnatcred+0x119
 16384 |@        2929   lookupnameatcred+0x97
 32768 |@        1635   lookupnameat+0x6b
 65536 |         894    cstatat_getvp+0x11e
131072 |         1249   cstatat64_32+0x5d
262144 |@        1620   fstatat64_32+0x4c
524288 |@        2474   _sys_sysenter_post_swapgs+0x149


That's why I posted this subject here. Hope it's clearer this time.
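
A possible next step (a sketch, not something run here): the lockstat provider
can attribute spin time to full stacks directly. The adaptive-spin probe fires
when a thread finishes spinning on an adaptive mutex; arg0 is the lock and
arg1 the spin time (nanoseconds on recent builds; treat the unit as an
assumption), so a predicate on arg0 could single out the address reported above:

# dtrace -n 'lockstat:::adaptive-spin { @[stack(10)] = sum(arg1); }
    tick-10sec { trunc(@, 5); printa(@); exit(0); }'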

Thanks,
-Aubrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] webserver zfs root lock contention under heavy load

2012-03-24 Thread Aubrey Li
Hi,

I'm migrating a web server (Apache + PHP) from RHEL to Solaris. In a
stress-test comparison with the same number of client sessions, CPU
utilization is ~70% on RHEL while the CPU is fully saturated on Solaris.

After some investigation, the ZFS root lock emerged as a prime suspect.

Firstly, there are a bunch of mutex events.

# lockstat -D 10 sleep 2

Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)

Count   indv cuml rcnt     nsec Lock                       Caller
-------------------------------------------------------------------------------
829064   45%  45% 0.00    33280 0xff117624a5d0             rrw_enter_read+0x1b
705001   38%  82% 0.00    30983 0xff117624a5d0             rrw_exit+0x1d
140678    8%  90% 0.00    10546 0xff117624a6e0             zfs_zget+0x46
 37208    2%  92% 0.00     5403 0xff114b136840             vn_rele+0x1e
 33926    2%  94% 0.00     5417 0xff114b136840             lookuppnatcred+0xc5
 27188    1%  95% 0.00     1155 vn_vfslocks_buckets+0xd980 vn_vfslocks_getlock+0x3b
 11073    1%  96% 0.00     1639 vn_vfslocks_buckets+0x4600 vn_vfslocks_getlock+0x3b
  9321    1%  96% 0.00     1961 0xff114b82a680             dnlc_lookup+0x83
  6929    0%  97% 0.00     1590 0xff11573b8f28             zfs_fastaccesschk_execute+0x6a
  5746    0%  97% 0.00     5935 0xff114b136840             lookuppnvp+0x566
-------------------------------------------------------------------------------

If we look at the lock callers' stacks, the following two are the hotspots:
==
# lockstat -D 10 -l 0xff117624a5d0 -s 15 sleep 2

Adaptive mutex spin: 865609 events in 3.237 seconds (267444 events/sec)
---
Count indv cuml rcnt nsec Lock   Caller
51215   6%   6% 0.00   100023 0xff117624a5d0 rrw_exit+0x1d

  nsec -- Time Distribution -- count  Stack
   256 |         61     zfs_root+0x78
   512 |@        1746   fsop_root+0x2d
  1024 |@@@      5210   traverse+0x65
  2048 |@@@      13464  lookuppnvp+0x446
  4096 |@@@      13553  lookuppnatcred+0x119
  8192 |@@@      5355   lookupnameatcred+0x97
 16384 |@        2403   lookupnameat+0x6b
 32768 |         1331   cstatat_getvp+0x11e
 65536 |         673    cstatat64_32+0x5d
131072 |         961    fstatat64_32+0x4c
262144 |         1259   _sys_sysenter_post_swapgs+0x149
---
Count indv cuml rcnt nsec Lock   Caller
48494   6%  17% 0.00   145263 0xff117624a5d0 rrw_enter_read+0x1b

  nsec -- Time Distribution -- count  Stack
   256 |         17     rrw_enter+0x2c
   512 |         1120   zfs_root+0x3b
  1024 |@        1718   fsop_root+0x2d
  2048 |@@       4834   traverse+0x65
  4096 |@@@      18569  lookuppnvp+0x446
  8192 |         6620   lookuppnatcred+0x119
 16384 |@        2929   lookupnameatcred+0x97
 32768 |@        1635   lookupnameat+0x6b
 65536 |         894    cstatat_getvp+0x11e
131072 |         1249   cstatat64_32+0x5d
262144 |@        1620   fstatat64_32+0x4c
524288 |@        2474   _sys_sysenter_post_swapgs+0x149


The Apache document root is in a ZFS dataset, and it looks like every
file-system path lookup runs into contention on the ZFS root mutex.

Is this expected behavior? If so, is there any ZFS tunable to
address this issue?
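
In case it helps, one way to double-check which kernel structure that contended
lock address lives in is mdb's ::whatis (a sketch; the address is the one from
the lockstat output above, and the exact output depends on the kernel build):

# echo '0xff117624a5d0::whatis' | mdb -k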

Thanks,
-Aubrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss