Re: [dtrace-discuss] Web server(s) fails

2009-09-22 Thread Anil
... error, correction... I believe I was using the pid provider, not the syscall provider.
-- 
This message posted from opensolaris.org
___
dtrace-discuss mailing list
dtrace-discuss@opensolaris.org


Re: [dtrace-discuss] Web server(s) fails

2009-09-22 Thread Anil
The probe output below shows the server looping -- is it stuck doing something? How can I dig 
deeper? All I get is this when doing "GET /index.htm HTTP/1.1" through telnet 
to port 80:

...
  3  83394            ___errno:entry
  3  80993    apr_atomic_dec32:entry
  3  83062    atomic_dec_32_nv:entry
  3  83394            ___errno:entry
  3  83394            ___errno:entry
  3  80853    apr_pollset_poll:entry
  3  83124             __div64:entry
  3  83121                UDiv:entry
  3  83125             __rem64:entry
  3  83120             UDivRem:entry
  3  80992    apr_atomic_inc32:entry
  3  83055    atomic_inc_32_nv:entry
  3  83401           port_getn:entry
  3  85371             _portfs:entry
  3  83394            ___errno:entry
  3  80993    apr_atomic_dec32:entry
  3  83062    atomic_dec_32_nv:entry
  3  83394            ___errno:entry
  3  83394            ___errno:entry
  3  80853    apr_pollset_poll:entry
  3  83124             __div64:entry
  3  83121                UDiv:entry
  3  83125             __rem64:entry
  3  83120             UDivRem:entry
  3  80992    apr_atomic_inc32:entry
  3  83055    atomic_inc_32_nv:entry
  3  83401           port_getn:entry
  3  85371             _portfs:entry
  1  83394            ___errno:entry
  1  80993    apr_atomic_dec32:entry
  1  83062    atomic_dec_32_nv:entry
  1  83394            ___errno:entry
  1  83394            ___errno:entry
  1  80853    apr_pollset_poll:entry
  1  83124             __div64:entry
  1  83121                UDiv:entry
  1  83125             __rem64:entry
  1  83120             UDivRem:entry
  1  80992    apr_atomic_inc32:entry
  1  83055    atomic_inc_32_nv:entry
  1  83401           port_getn:entry
  1  85371             _portfs:entry
  1  83394            ___errno:entry
  1  80993    apr_atomic_dec32:entry
  1  83062    atomic_dec_32_nv:entry
  1  83394            ___errno:entry
  1  83394            ___errno:entry
  1  80853    apr_pollset_poll:entry
  1  83124             __div64:entry
  1  83121                UDiv:entry
  1  83125             __rem64:entry
  1  83120             UDivRem:entry
  1  80992    apr_atomic_inc32:entry
  1  83055    atomic_inc_32_nv:entry
  1  83401           port_getn:entry
  1  85371             _portfs:entry
  0  83394            ___errno:entry
  0  80993    apr_atomic_dec32:entry
  0  83062    atomic_dec_32_nv:entry
  0  83394            ___errno:entry
  0  83394            ___errno:entry
  0  80853    apr_pollset_poll:entry
  0  83124             __div64:entry
  0  83121                UDiv:entry
  0  83125             __rem64:entry
  0  83120             UDivRem:entry
  0  80992    apr_atomic_inc32:entry
  0  83055    atomic_inc_32_nv:entry
  0  83401           port_getn:entry
  0  85371             _portfs:entry
  0  83394            ___errno:entry
  0  80993    apr_atomic_dec32:entry
  0  83062    atomic_dec_32_nv:entry
  0  83394            ___errno:entry
  0  83394            ___errno:entry
  0  80853    apr_pollset_poll:entry
  0  83124             __div64:entry
  0  83121                UDiv:entry
  0  83125             __rem64:entry
  0  83120             UDivRem:entry
  0  80992    apr_atomic_inc32:entry
  0  83055    atomic_inc_32_nv:entry
  0  83401           port_getn:entry
  0  85371             _portfs:entry
  2  83394            ___errno:entry
  2  80993    apr_atomic_dec32:entry
  2  83062    atomic_dec_32_nv:entry
  2  83394            ___errno:entry
  2  83394            ___errno:entry
  2  80853    apr_pollset_poll:entry
  2  83124             __div64:entry
  2  83121                UDiv:entry
  2  83125             __rem64:entry
  2  83120             UDivRem:entry
  2  80992    apr_atomic_inc32:entry
  2  83055    atomic_inc_32_nv:entry
  2  83401           port_getn:entry
  2  85371             _portfs:entry
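One way to dig deeper than raw probe firings (a sketch, not a definitive script; it assumes 
the event loop really is apr_pollset_poll, as the trace above suggests, and that you attach 
to the httpd worker with -p):

```d
/*
 * Sketch: measure how long each apr_pollset_poll() call takes in the
 * target httpd process, to distinguish "idle, waiting for events"
 * from "spinning through the loop".
 * Run as: dtrace -s poll.d -p <httpd-pid>
 */
pid$target::apr_pollset_poll:entry
{
        self->ts = timestamp;
}

pid$target::apr_pollset_poll:return
/self->ts/
{
        /* per-call latency distribution, in nanoseconds */
        @lat["apr_pollset_poll latency (ns)"] =
            quantize(timestamp - self->ts);
        self->ts = 0;
}
```

If the latencies are all near zero, the loop is spinning rather than blocking, which would 
point at the event port rather than the network.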


[dtrace-discuss] Web server(s) fails

2009-09-22 Thread Anil
I have a couple of zones. On one zone, I have lighttpd. On another, a 
customer is running Apache. This is on a 2009.06 release.

Every once in a while, maybe every other day, the server becomes unstable. HTTP 
requests no longer get a response. Sometimes it recovers after a while without a 
problem, but sometimes a restart of the server is necessary.

I am using Crossbow for these zones, with virtual links. I was tracing all open* 
calls with DTrace and saw the output quoted at the end of this message.

Could it be that there is something wrong with the /dev/urandom device in 
OpenSolaris? It is very peculiar and I can't figure it out. I find it strange 
that this same type of behavior happens in two different zones running two 
different applications.

Do you guys have any pointers? Next time this happens, what can I do to capture 
more data? The customer wants to move away from OpenSolaris and go back to 
Solaris because of this.

Thanks!

  2  62318   open:entry postgres pg_stat_tmp/pgstat.tmp
  2  62318   open:entry postgres pg_stat_tmp/pgstat.tmp
  2  62720 open64:entry httpd /dev/urandom
  2  62318   open:entry httpd /etc/crypto/pkcs11.conf
dtrace: error on enabled probe ID 2 (ID 62318: syscall::open:entry): invalid 
address (0xfd4abc63) in action #2 at DIF offset 28
  2  62318   open:entry httpd 
/usr/lib/security/pkcs11_softtoken.so
  2  62318   open:entry httpd /usr/lib/libsoftcrypto.so.1
  2  62318   open:entry httpd 
/usr/lib/libsoftcrypto/libsoftcrypto_hwcap1.so.1
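For reference, a minimal open(2) watcher of the kind described might look like this (a 
sketch, not the exact script used; the "invalid address" error above is typically 
copyinstr() being handed a user address whose page has not been faulted in yet):

```d
/*
 * Sketch: log every open()/open64() with process name and path.
 * copyinstr() can fail with "invalid address" when the page holding
 * the path string is not yet faulted in at probe time; DTrace reports
 * the error on that event and tracing simply continues.
 */
syscall::open:entry,
syscall::open64:entry
{
        printf("%s %s\n", execname, copyinstr(arg0));
}
```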


Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)

2009-09-22 Thread Jim Mauro

It would also be interesting to see some snapshots
of the ZFS ARC kstats:

kstat -n arcstats

Thanks

Jim Leonard wrote:

Thanks for the awesome two-liner, I'd been struggling with 1-second intervals 
without a full-blown script.

I modified it to output walltime so that I could zoom in on the problem, and 
here it is:

  unix`xc_do_call+0x8f
  unix`xc_wait_sync+0x36
  unix`x86pte_invalidate_pfn+0x135
  unix`hat_pte_unmap+0xa9
  unix`hat_unload_callback+0x109
  unix`hat_unload+0x2a
  unix`segkmem_free_vn+0x82
  unix`segkmem_zio_free+0x10
  genunix`vmem_xfree+0xee
  genunix`vmem_free+0x28
  genunix`kmem_slab_destroy+0x80
  genunix`kmem_slab_free+0x1be
  genunix`kmem_magazine_destroy+0x54
  genunix`kmem_depot_ws_reap+0x4d
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
   418560

Any ideas?
  



Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)

2009-09-22 Thread Jim Mauro

As Dan said, it looks like ZFS is busy.

How much RAM is on this system?
What release of Solaris?
Do you have any ZFS tweaks in /etc/system?
(like clamping the ARC size...)
Is the system memory constrained?

The xcalls are due to the page unmaps out of
what I'm assuming is the ZFS ARC (although
I'm not 100% sure about that - I need to check
the source for segkmem_zio_free())

Can you gather some ZFS I/O statistics, like
"fsstat zfs 1" for a minute or so?

Thanks,
/jim


Jim Leonard wrote:

Thanks for the awesome two-liner, I'd been struggling with 1-second intervals 
without a full-blown script.

I modified it to output walltime so that I could zoom in on the problem, and 
here it is:

  unix`xc_do_call+0x8f
  unix`xc_wait_sync+0x36
  unix`x86pte_invalidate_pfn+0x135
  unix`hat_pte_unmap+0xa9
  unix`hat_unload_callback+0x109
  unix`hat_unload+0x2a
  unix`segkmem_free_vn+0x82
  unix`segkmem_zio_free+0x10
  genunix`vmem_xfree+0xee
  genunix`vmem_free+0x28
  genunix`kmem_slab_destroy+0x80
  genunix`kmem_slab_free+0x1be
  genunix`kmem_magazine_destroy+0x54
  genunix`kmem_depot_ws_reap+0x4d
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
   418560

Any ideas?
  



Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)

2009-09-22 Thread Dan Mick

zfs is busy?

Jim Leonard wrote:

Thanks for the awesome two-liner, I'd been struggling with 1-second intervals 
without a full-blown script.

I modified it to output walltime so that I could zoom in on the problem, and 
here it is:

  unix`xc_do_call+0x8f
  unix`xc_wait_sync+0x36
  unix`x86pte_invalidate_pfn+0x135
  unix`hat_pte_unmap+0xa9
  unix`hat_unload_callback+0x109
  unix`hat_unload+0x2a
  unix`segkmem_free_vn+0x82
  unix`segkmem_zio_free+0x10
  genunix`vmem_xfree+0xee
  genunix`vmem_free+0x28
  genunix`kmem_slab_destroy+0x80
  genunix`kmem_slab_free+0x1be
  genunix`kmem_magazine_destroy+0x54
  genunix`kmem_depot_ws_reap+0x4d
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
   418560

Any ideas?




Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)

2009-09-22 Thread Jim Leonard
Thanks for the awesome two-liner, I'd been struggling with 1-second intervals 
without a full-blown script.

I modified it to output walltime so that I could zoom in on the problem, and 
here it is:

  unix`xc_do_call+0x8f
  unix`xc_wait_sync+0x36
  unix`x86pte_invalidate_pfn+0x135
  unix`hat_pte_unmap+0xa9
  unix`hat_unload_callback+0x109
  unix`hat_unload+0x2a
  unix`segkmem_free_vn+0x82
  unix`segkmem_zio_free+0x10
  genunix`vmem_xfree+0xee
  genunix`vmem_free+0x28
  genunix`kmem_slab_destroy+0x80
  genunix`kmem_slab_free+0x1be
  genunix`kmem_magazine_destroy+0x54
  genunix`kmem_depot_ws_reap+0x4d
  genunix`taskq_thread+0xbc
  unix`thread_start+0x8
   418560

Any ideas?


Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)

2009-09-22 Thread James Litchfield
Jim Mauro has provided an excellent starting point. Keep in mind that kernel 
threads will show up as pid 0, so you may be seeing a kernel thread
causing the activity.

Jim L

--Original Message--

From: Jim Leonard 
Sent: Tue, September 22, 2009 11:31 AM
To: dtrace-discuss@opensolaris.org
Subject: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? 
(output provided)


We have a 16-core x86 system that, at seemingly random intervals, will 
completely stop responding for several seconds.  Running mpstat 1 showed 
something horrifying:

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1004691 397 170 0 0 0 5 0 0 0 100 0 0
(rest of CPUs omitted)

That's over a million cross-calls a second.  Seeing them on CPU0 made me 
nervous that they were kernel-related, so I wrote a DTrace script to print out 
xcalls per second aggregated by PID, to see if a specific process was the culprit.  
Here's the output during another random system outage:

2009 Sep 22 12:51:49, load average: 5.90, 5.35, 5.39   xcalls: 637511

     PID   XCALLCOUNT
    6164           15
    6165           15
   28339           26
       0       637455

PID 0 is "sched" (aka the kernel).

At this point I'm completely stumped as to what could be causing this.  Any 
hints or ideas?


Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)

2009-09-22 Thread Jim Mauro


dtrace -n ':::xcalls { @s[stack()] = count() } tick-1sec { trunc(@s,10); 
printa(@s); clear(@s); }'



That will tell us where the xcalls are coming from in the kernel,
and we can go from there.
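For readability, the same one-liner as a script file (the bare :::xcalls matches any 
probe named xcalls; on Solaris that is the sysinfo provider's xcalls probe):

```d
/* Top 10 kernel stacks generating cross-calls, printed once a second. */
:::xcalls
{
        @s[stack()] = count();
}

tick-1sec
{
        trunc(@s, 10);
        printa(@s);
        clear(@s);
}
```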

Thanks,
/jim


Jim Leonard wrote:

We have a 16-core x86 system that, at seemingly random intervals, will 
completely stop responding for several seconds.  Running mpstat 1 showed 
something horrifying:

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1004691 397 170 0 0 0 5 0 0 0 100 0 0
(rest of CPUs omitted)

That's over a million cross-calls a second.  Seeing them on CPU0 made me 
nervous that they were kernel-related, so I wrote a DTrace script to print out 
xcalls per second aggregated by PID, to see if a specific process was the culprit.  
Here's the output during another random system outage:

2009 Sep 22 12:51:49, load average: 5.90, 5.35, 5.39   xcalls: 637511

     PID   XCALLCOUNT
    6164           15
    6165           15
   28339           26
       0       637455

PID 0 is "sched" (aka the kernel).

At this point I'm completely stumped as to what could be causing this.  Any 
hints or ideas?
  



[dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)

2009-09-22 Thread Jim Leonard
We have a 16-core x86 system that, at seemingly random intervals, will 
completely stop responding for several seconds.  Running mpstat 1 showed 
something horrifying:

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1004691 397 170 0 0 0 5 0 0 0 100 0 0
(rest of CPUs omitted)

That's over a million cross-calls a second.  Seeing them on CPU0 made me 
nervous that they were kernel-related, so I wrote a DTrace script to print out 
xcalls per second aggregated by PID, to see if a specific process was the culprit.  
Here's the output during another random system outage:

2009 Sep 22 12:51:49, load average: 5.90, 5.35, 5.39   xcalls: 637511

     PID   XCALLCOUNT
    6164           15
    6165           15
   28339           26
       0       637455

PID 0 is "sched" (aka the kernel).

At this point I'm completely stumped as to what could be causing this.  Any 
hints or ideas?
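The per-PID counter described above was not posted; a sketch of something similar, 
assuming the sysinfo provider's xcalls probe, might be:

```d
/*
 * Sketch: count cross-calls per PID, dumped once a second.
 * PID 0 entries are kernel (sched) threads.
 */
sysinfo:::xcalls
{
        @x[pid] = count();
}

tick-1sec
{
        printf("\n%Y\n", walltimestamp);
        printa("%8d %@u\n", @x);
        trunc(@x);
}
```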