Re: [dtrace-discuss] Web server(s) fails
... error, correction... I believe I was using the PID provider, not the syscall provider.
-- This message posted from opensolaris.org
___
dtrace-discuss mailing list
dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] Web server(s) fails
The syscall:::entry probe shows, is it stuck doing something? How can I dig deeper? All I get is this when doing "GET /index.htm HTTP/1.1" through telnet on port 80:

CPU     ID                    FUNCTION:NAME
  3  83394                   ___errno:entry
  3  80993           apr_atomic_dec32:entry
  3  83062           atomic_dec_32_nv:entry
  3  83394                   ___errno:entry
  3  83394                   ___errno:entry
  3  80853           apr_pollset_poll:entry
  3  83124                    __div64:entry
  3  83121                       UDiv:entry
  3  83125                    __rem64:entry
  3  83120                    UDivRem:entry
  3  80992           apr_atomic_inc32:entry
  3  83055           atomic_inc_32_nv:entry
  3  83401                  port_getn:entry
  3  85371                    _portfs:entry

(the same 14-probe cycle then repeats: twice each on CPUs 3, 1, and 0, and once on CPU 2)
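Tracing every pid-provider entry probe produces a firehose like the above; aggregating firings by function name makes it much easier to tell an idle event-port poll loop (port_getn / apr_pollset_poll cycling) from a process that is genuinely stuck somewhere. A minimal sketch, assuming you attach to the httpd worker's pid with -p (the script name and interval are my own choices):

```d
/* count.d - count pid-provider entry probes per function rather than
 * printing each firing.  Run as: dtrace -p <httpd-pid> -s count.d */
pid$target:::entry
{
        @calls[probefunc] = count();
}

tick-10sec
{
        /* keep only the 20 hottest functions, then reset for the next window */
        trunc(@calls, 20);
        printa(@calls);
        clear(@calls);
}
```

A healthy-but-idle worker should be dominated by the poll-loop functions; a wedged one will show whatever it is actually spinning in.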
[dtrace-discuss] Web server(s) fails
I have a couple of zones. On one zone, I have lighttpd. On another, a customer is running Apache. This is on a 2009.06 release. Every once in a while, maybe every other day, the server becomes unstable: HTTP requests no longer get a response. Sometimes it recovers after a while on its own, but sometimes a restart of the server is necessary. I am using Crossbow for these zones, with virtual links. I was doing a dtrace for all open* calls and I saw the output below. Could it be that there is something wrong with the /dev/urandom device in OpenSolaris? It is very peculiar and I can't figure it out. I find it strange that this same type of behavior happens in two different zones running two different applications. Do you guys have any pointers? Next time this happens, what can I do to capture more data? The customer wants to move away from OpenSolaris and go back to Solaris because of this. Thanks!

  2  62318  open:entry    postgres  pg_stat_tmp/pgstat.tmp
  2  62318  open:entry    postgres  pg_stat_tmp/pgstat.tmp
  2  62720  open64:entry  httpd     /dev/urandom
  2  62318  open:entry    httpd     /etc/crypto/pkcs11.conf
dtrace: error on enabled probe ID 2 (ID 62318: syscall::open:entry): invalid address (0xfd4abc63) in action #2 at DIF offset 28
  2  62318  open:entry    httpd     /usr/lib/security/pkcs11_softtoken.so
  2  62318  open:entry    httpd     /usr/lib/libsoftcrypto.so.1
  2  62318  open:entry    httpd     /usr/lib/libsoftcrypto/libsoftcrypto_hwcap1.so.1
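For what it's worth, the dtrace "invalid address" error in the output above is usually benign: copyinstr() was applied at open:entry to a pathname whose page had not yet been faulted into the process's address space. A common workaround is to save the pointer at entry and copy it at return, after the kernel has touched the path. A sketch of that approach (these probe clauses are my own, not the script the poster actually ran):

```d
/* Trace open(2)/open64(2), copying the pathname at return time so the
 * user page is guaranteed to be resident; arg1 at return is the fd. */
syscall::open:entry,
syscall::open64:entry
{
        self->path = arg0;      /* save the user address, don't copy yet */
}

syscall::open:return,
syscall::open64:return
/self->path/
{
        printf("%-12s %s -> %d\n", execname, copyinstr(self->path), (int)arg1);
        self->path = 0;
}
```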
Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
It would also be interesting to see some snapshots of the ZFS ARC kstats:

  kstat -n arcstats

Thanks

Jim Leonard wrote:
> Thanks for the awesome two-liner, I'd been struggling with 1-second intervals without a full-blown script. I modified it to output walltime so that I could zoom in on the problem, and here it is:
> [kernel cross-call stack trace snipped]
> Any ideas?
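For periodic snapshots rather than a one-shot, kstat(1M) accepts an interval argument; something like the following sketch prints the full arcstats set in parseable form every 10 seconds, with a timestamp per sample:

```
# -p: parseable module:instance:name:statistic=value output
# -T d: date-stamp each sample; trailing 10: repeat every 10 seconds
kstat -p -T d -n arcstats 10
```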
Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
As Dan said, it looks like ZFS is busy. How much RAM is on this system? What release of Solaris? Do you have any ZFS tweaks in /etc/system (like clamping the ARC size)? Is the system memory constrained? The xcalls are due to the page unmaps out of what I'm assuming is the ZFS ARC (although I'm not 100% sure about that - I need to check the source for segkmem_zio_free()). Can you gather some ZFS IO statistics, like "fsstat zfs 1" for a minute or so?

Thanks,
/jim

Jim Leonard wrote:
> Thanks for the awesome two-liner, I'd been struggling with 1-second intervals without a full-blown script. I modified it to output walltime so that I could zoom in on the problem, and here it is:
> [kernel cross-call stack trace snipped]
> Any ideas?
Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
ZFS is busy?

Jim Leonard wrote:
> Thanks for the awesome two-liner, I'd been struggling with 1-second intervals without a full-blown script. I modified it to output walltime so that I could zoom in on the problem, and here it is:
> [kernel cross-call stack trace snipped]
> Any ideas?
Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
Thanks for the awesome two-liner, I'd been struggling with 1-second intervals without a full-blown script. I modified it to output walltime so that I could zoom in on the problem, and here it is:

              unix`xc_do_call+0x8f
              unix`xc_wait_sync+0x36
              unix`x86pte_invalidate_pfn+0x135
              unix`hat_pte_unmap+0xa9
              unix`hat_unload_callback+0x109
              unix`hat_unload+0x2a
              unix`segkmem_free_vn+0x82
              unix`segkmem_zio_free+0x10
              genunix`vmem_xfree+0xee
              genunix`vmem_free+0x28
              genunix`kmem_slab_destroy+0x80
              genunix`kmem_slab_free+0x1be
              genunix`kmem_magazine_destroy+0x54
              genunix`kmem_depot_ws_reap+0x4d
              genunix`taskq_thread+0xbc
              unix`thread_start+0x8
           418560

Any ideas?
Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
Jim Mauro has provided an excellent starting point. Keep in mind that kernel threads will show up as pid 0, so you may be seeing a kernel thread causing the activity.

Jim L

--Original Message--
From: Jim Leonard
Sent: Tue, September 22, 2009 11:31 AM
To: dtrace-discuss@opensolaris.org
Subject: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)

We have a 16-core x86 system that, at seemingly random intervals, will completely stop responding for several seconds. Running mpstat 1 showed something horrifying:

CPU minf mjf    xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0 1004691  397  170   0    0    0    5   0     0   0 100  0   0
(rest of CPUs omitted)

That's over a million cross-calls a second. Seeing them on CPU 0 made me nervous that they were kernel-related, so I wrote a DTrace script to print out xcalls per second aggregated by PID to see if a specific process was the culprit. Here's the output during another random system outage:

2009 Sep 22 12:51:49, load average: 5.90, 5.35, 5.39   xcalls: 637511
  PID  XCALLCOUNT
 6164          15
 6165          15
28339          26
    0      637455

PID 0 is "sched" (aka the kernel). At this point I'm completely stumped as to what could be causing this. Any hints or ideas?
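The per-PID counter Jim Leonard describes might look something like this sketch (a hypothetical reconstruction, not his actual script):

```d
/* Count cross-calls per second, broken down by pid and execname.
 * sysinfo:::xcalls fires as the running CPU's cross-call counter is
 * incremented, so pid/execname identify the thread on CPU at the time. */
sysinfo:::xcalls
{
        @x[pid, execname] = count();
}

tick-1sec
{
        printf("%Y\n", walltimestamp);
        printa("  %5d %-16s %@8d\n", @x);
        clear(@x);
}
```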
Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
dtrace -n ':::xcalls { @s[stack()] = count() } tick-1sec { trunc(@s, 10); printa(@s); clear(@s); }'

That will tell us where the xcalls are coming from in the kernel, and we can go from there.

Thanks,
/jim

Jim Leonard wrote:
> We have a 16-core x86 system that, at seemingly random intervals, will completely stop responding for several seconds. Running mpstat 1 showed something horrifying:
>
> CPU minf mjf    xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
>   0    0   0 1004691  397  170   0    0    0    5   0     0   0 100  0   0
> (rest of CPUs omitted)
>
> That's over a million cross-calls a second. Seeing them on CPU 0 made me nervous that they were kernel-related, so I wrote a DTrace script to print out xcalls per second aggregated by PID to see if a specific process was the culprit. Here's the output during another random system outage:
>
> 2009 Sep 22 12:51:49, load average: 5.90, 5.35, 5.39   xcalls: 637511
>   PID  XCALLCOUNT
>  6164          15
>  6165          15
> 28339          26
>     0      637455
>
> PID 0 is "sched" (aka the kernel). At this point I'm completely stumped as to what could be causing this. Any hints or ideas?
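Jim Leonard's follow-up mentions modifying this one-liner to output walltime; one way to do that (my guess at the change, not his actual script) is to stamp each one-second window in the tick clause:

```d
sysinfo:::xcalls
{
        @s[stack()] = count();
}

tick-1sec
{
        /* stamp each one-second window with wall-clock time */
        printf("%Y\n", walltimestamp);
        trunc(@s, 10);
        printa(@s);
        clear(@s);
}
```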
[dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
We have a 16-core x86 system that, at seemingly random intervals, will completely stop responding for several seconds. Running mpstat 1 showed something horrifying:

CPU minf mjf    xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0 1004691  397  170   0    0    0    5   0     0   0 100  0   0
(rest of CPUs omitted)

That's over a million cross-calls a second. Seeing them on CPU 0 made me nervous that they were kernel-related, so I wrote a DTrace script to print out xcalls per second aggregated by PID to see if a specific process was the culprit. Here's the output during another random system outage:

2009 Sep 22 12:51:49, load average: 5.90, 5.35, 5.39   xcalls: 637511
  PID  XCALLCOUNT
 6164          15
 6165          15
28339          26
    0      637455

PID 0 is "sched" (aka the kernel). At this point I'm completely stumped as to what could be causing this. Any hints or ideas?