On 3/15/06, Caldarale, Charles R <[EMAIL PROTECTED]> wrote:
> > From: Leon Rosenberg [mailto:[EMAIL PROTECTED]
> > Subject: Deadlock -> Out of Threads -> Strange Exception ->
> > OutOfMemory -> Server Death.
> >
> > I can't find a locked <0x61655498> entry nowhere.
> > But I should be able to see it to recognize the thread causing the
> > deadlock, shouldn't I?
>
> I would think so.  More questions:
>
> 1) Did the server exhibit any other symptoms before the apparent
> deadlock, such as slow response times?

Unfortunately our monitoring systems do not detect everything, but here is
what I can reconstruct:
The "All threads busy" log entry came at 20:53:03.

At 21:17:37,965 the server (or rather a monitoring thread in the server)
detected that it had not received any events for 2 minutes, and a mail
alert was generated. (The backend sends events with user logins, approx.
10 per second in the evening; if the events stop coming, that means the
server was slow or had other problems with the TCP/IP connection from the
backend and has been removed from the recipient list.)
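(Just to illustrate what that monitoring thread does, a minimal sketch; the
class and method names here are made up, it is not our real code:)

    public class EventWatchdog extends Thread {

        // alert if no backend event was seen for 2 minutes
        private static final long TIMEOUT_MS = 2 * 60 * 1000L;

        private volatile long lastEvent = System.currentTimeMillis();

        // called by the backend connector for every incoming login event
        public void eventReceived() {
            lastEvent = System.currentTimeMillis();
        }

        public void run() {
            while (true) {
                try {
                    Thread.sleep(10 * 1000L); // check every 10 seconds
                } catch (InterruptedException e) {
                    return;
                }
                if (System.currentTimeMillis() - lastEvent > TIMEOUT_MS) {
                    sendMailAlert("no backend events received for 2 minutes");
                }
            }
        }

        private void sendMailAlert(String message) {
            // mail sending omitted
        }
    }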

Two minutes later a Nagios alert came in:
Service: HTTP_RESPONSE
Host: xxx
Address: xxx
State: CRITICAL

Date/Time: Tue Mar 14 21:19:36 CET 2006

Additional Info:

CRITICAL - Socket timeout after 10 seconds

... somehow, I don't know why yet, the server recovered and I received the
next email at 21:24:
Notification Type: RECOVERY

Service: HTTP_RESPONSE
Host: xxx
Address: xxx
State: OK

Date/Time: Tue Mar 14 21:24:26 CET 2006

Additional Info:

HTTP OK HTTP/1.1 200 OK - 47065 bytes in 0.042 seconds

----------------------------------
At 21:36 the next problem mail came:
Notification Type: PROBLEM

Service: HTTP_RESPONSE
Host: xxx
Address: xxx
State: CRITICAL

Date/Time: Tue Mar 14 21:36:36 CET 2006

Additional Info:

CRITICAL - Socket timeout after 10 seconds

At approx. 21:45 I checked the mails (I'm not the support guy, just feeling
responsible, and therefore the one who watches over the system in the
evenings).

I created a thread dump and restarted the server at 21:50 (according to the
change.log entry I made).

I don't see anything in sar, but sar only shows averages over its sampling
interval, so if the CPU load was high for a short moment (or idle time was
zero) sar could have missed it.

>
> 2) What was the time interval between the all threads busy log entry and
> the OOME reports?

I don't have a timestamp for the first OOME, but we have marks written to
catalina.out every 5 minutes. The nearest one (before the OOME) is:
561146456 Tue Mar 14 21:13:18 CET 2006 -MARK-

21:13-21:15 seems to be the time in question to me.

I don't have it in digital form, but our sysadmin showed me a printed Excel
sheet today, according to which we lost about 500 MB of RAM in 20 minutes
(somewhere between 20:55 and 21:15). I will try to get the sheet tomorrow
and send it to you.


>
> 3) Do you have -verbose:gc or -XX:+PrintGCDetails on, or perhaps any
> other monitoring in place that might show the state of the heap during
> normal running and then leading up to the hang?

No -verbose:gc or other options; it would just spam the logs completely
(and the logs are large already). But we do have monitoring for the JVM's
freeMemory, totalMemory and availableMemory values, as well as for
/proc/meminfo.
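(The values come straight from java.lang.Runtime; roughly along these
lines, where I assume the "availableMemory" we log corresponds to
Runtime.maxMemory():)

    Runtime rt = Runtime.getRuntime();
    long free  = rt.freeMemory();   // unused space in the currently allocated heap
    long total = rt.totalMemory();  // heap currently allocated by the JVM
    long max   = rt.maxMemory();    // upper heap limit (the -mx value)
    long used  = total - free;      // what the application actually holds
    System.out.println("used=" + used + " free=" + free
            + " total=" + total + " max=" + max);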

>
> 4) What are your -Xmx and PermGen size values?

The server has a total of 8 GB of RAM, of which usually at least 4 GB are
free (since our standard servers are 2-4 GB 32-bit machines and we can't
use more).
The JVM parameters are:

export JAVA_OPTS="-server -mx1200M -ms1200M -Djacorb.config.dir=conf
-Djacorb.home=$JACORB_HOME -Dorg.omg.CORBA.ORBClass=org.jacorb.orb.ORB
-Dorg.omg.CORBA.ORBSingletonClass=org.jacorb.orb.ORBSingleton
-Djavax.net.ssl.trustStore=$HOME/.keystore
-Djavax.net.ssl.trustStorePassword=xxxxxx"

Until two weeks ago we ran with JDK 1.4. With JDK 1.4 normal memory
utilization was about 350 MB; we always had about 850 MB free. With JDK 1.5
the amount of free memory is mostly over 1 GB; under high load it's about
850 MB of free memory.

Btw, I noticed long ago that when something happens to Tomcat or the JVM
(infinite loops, very high load, and so on) the garbage collector seems to
lose some memory between the collections. Probably (just guessing, no real
knowledge behind it, but it looks this way) it tries not to stop all other
threads during the garbage collection and hands out the required memory
from free space (or allocates more if total < available), but afterwards
forgets about this memory. I know this sounds idiotic, but it's the
impression one gets looking at the memory usage under load.
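(A rough illustration of what I mean, just a toy example and not our code:
under allocation pressure totalMemory grows toward the -mx limit, so the
freeMemory curve jumps around even though almost nothing is retained:)

    Runtime rt = Runtime.getRuntime();
    java.util.List<byte[]> junk = new java.util.ArrayList<byte[]>();
    for (int i = 0; i < 200; i++) {
        junk.add(new byte[1024 * 1024]);   // allocate 1 MB
        if (i % 20 == 19) {
            junk.clear();                  // most of it becomes garbage again
        }
        System.out.println("total=" + rt.totalMemory()
                + " free=" + rt.freeMemory());
    }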

>
> I'm speculating (aka grasping at straws) that a heap full situation
> existed for a bit before the hang, but that some code somewhere ate the
> OOMEs and tried to continue.

Well, according to the log entries above it seems that we did in fact need
20 minutes to consume all of the memory, so I'd say some code just ate
memory and did nothing until the available memory was exhausted.

I almost forgot to mention: apparently the load balancer noticed the server
having problems before Nagios did; at the time of the OOME it only had 20
users left (I have to check the exact time tomorrow), and it takes some
time for an inactive user to be removed from the server.

>


>  - Chuck

Thanx again for your help

Leon
