On 3/15/06, Caldarale, Charles R <[EMAIL PROTECTED]> wrote:
> > From: Leon Rosenberg [mailto:[EMAIL PROTECTED]
> > Subject: Deadlock -> Out of Threads -> Strange Exception ->
> > OutOfMemory -> Server Death.
> >
> > I can't find a locked <0x61655498> entry anywhere.
> > But I should be able to see it to recognize the thread causing the
> > deadlock, shouldn't I?
>
> I would think so. More questions:
>
> 1) Did the server exhibit any other symptoms before the apparent
> deadlock, such as slow response times?
Unfortunately our monitoring systems do not detect everything, but here is what I can reconstruct:

The "All threads busy" log entry came at 20:53:03.

At 21:17:37,965 the server (or rather a monitoring thread in the server) detected that it hadn't received any events for 2 minutes, and a mail alert was generated. (The backend sends events with user logins, approx. 10 per second in the evening; if the events stop coming, that means the server was slow or had other problems with the TCP/IP connection from the backend and was removed from the recipient list.)

Two minutes later a Nagios alert came in:

Service: HTTP_RESPONSE
Host: xxx
Address: xxx
State: CRITICAL
Date/Time: Tue Mar 14 21:19:36 CET 2006
Additional Info: CRITICAL - Socket timeout after 10 seconds

Somehow, I don't know why yet, the server recovered and I received the next email at 21:24:

Notification Type: RECOVERY
Service: HTTP_RESPONSE
Host: xxx
Address: xxx
State: OK
Date/Time: Tue Mar 14 21:24:26 CET 2006
Additional Info: HTTP OK HTTP/1.1 200 OK - 47065 bytes in 0.042 seconds

----------------------------------

At 21:36 the next problem mail came:

Notification Type: PROBLEM
Service: HTTP_RESPONSE
Host: xxx
Address: xxx
State: CRITICAL
Date/Time: Tue Mar 14 21:36:36 CET 2006
Additional Info: CRITICAL - Socket timeout after 10 seconds

At approx. 21:45 I checked the mails (I'm not the support guy, but I feel responsible and am therefore the one who watches the system in the evenings), created a thread dump, and restarted the server at 21:50 (according to the change.log entry I made).

I don't see anything in sar, but sar only shows interval averages, so if the CPU load was high (or idle time was zero) for only a short moment, sar could have missed it.

> 2) What was the time interval between the all threads busy log entry and
> the OOME reports?

I don't have a timestamp for the first OOME, but we have marks written to catalina.out every 5 minutes. The nearest one (before the OOME) is:

561146456 Tue Mar 14 21:13:18 CET 2006 -MARK-

So 21:13-21:15 seems to be the time in question to me. I don't have it in digital form, but our sysadmin showed me a printed Excel sheet today, according to which we lost about 500 MB of RAM in 20 minutes (somewhere between 20:55 and 21:15). I will try to get the sheet tomorrow and send it to you.

> 3) Do you have -verbose:gc or -XX:+PrintGCDetails on, or perhaps any
> other monitoring in place that might show the state of the heap during
> normal running and then leading up to the hang?

No verbose:gc or other options, it would just spam the logs completely (and the logs are large already), but we have monitoring for the System freeMemory, totalMemory and availableMemory values (a simplified sketch of what I mean is below, after the JVM parameters), as well as for /proc/meminfo.

> 4) What are your -Xmx and PermGen size values?

The server has a total of 8 GB RAM, of which at least 4 are usually free (since our standard servers are 2-4 GB 32-bit machines and we can't use more). The JVM parameters are:

export JAVA_OPTS="-server -mx1200M -ms1200M
 -Djacorb.config.dir=conf
 -Djacorb.home=$JACORB_HOME
 -Dorg.omg.CORBA.ORBClass=org.jacorb.orb.ORB
 -Dorg.omg.CORBA.ORBSingletonClass=org.jacorb.orb.ORBSingleton
 -Djavax.net.ssl.trustStore=$HOME/.keystore
 -Djavax.net.ssl.trustStorePassword=xxxxxx"

Until two weeks ago we ran with JDK 1.4. With JDK 1.4 normal memory utilization was about 350 MB, and we always had about 850 MB free. With JDK 1.5 the amount of free memory is mostly over 1 GB; under high load it's about 850 MB free.
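To make clearer what that free/total memory monitoring looks like, here is a simplified sketch (illustrative only, not our actual code; the class name and the 5-minute interval are just examples):

import java.util.Timer;
import java.util.TimerTask;

// Sketch: log the JVM memory figures every 5 minutes.
public class MemoryLogger {

    public static void main(String[] args) {
        // Non-daemon timer thread keeps the JVM alive for this standalone demo;
        // in the server the same task would be started once at webapp startup.
        Timer timer = new Timer("memory-logger");
        timer.schedule(new TimerTask() {
            public void run() {
                Runtime rt = Runtime.getRuntime();
                long free  = rt.freeMemory();   // unused part of the currently allocated heap
                long total = rt.totalMemory();  // heap currently claimed by the JVM
                long max   = rt.maxMemory();    // the -Xmx ceiling (1200M in our case)
                System.out.println(System.currentTimeMillis()
                        + " free=" + (free / 1048576) + "M"
                        + " total=" + (total / 1048576) + "M"
                        + " max=" + (max / 1048576) + "M");
            }
        }, 0, 5 * 60 * 1000);
    }
}

The numbers our monitoring records are essentially these three values, sampled periodically and written to a separate file instead of catalina.out.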
Btw, I noticed long ago that when something happens to Tomcat or the JVM (infinite loops, very high load, and so on) the garbage collector seems to lose some memory between collections. Probably (just guessing, no real knowledge behind it, but it looks this way) it tries not to hold all the other threads during the garbage collection and hands out the required memory from free space (or allocates more if total < available), but afterwards forgets about this memory. I know this sounds idiotic, but it's the impression one gets looking at the memory usage under load.

> I'm speculating (aka grasping at straws) that a heap full situation
> existed for a bit before the hang, but that some code somewhere ate the
> OOMEs and tried to continue.

Well, according to the log entries above it seems we in fact needed 20 minutes to consume all of the memory, so I'd say some code just ate memory and did nothing until the available memory ran out.

Almost forgot to mention: apparently the load balancer noticed the server having problems before Nagios did; at the time of the OOME it only had 20 users left (I have to check the exact time tomorrow), and it takes some time for an inactive user to be removed from the server.

> - Chuck

Thanx again for your help
Leon
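P.S. Regarding the locked <0x61655498> entry: since we are on JDK 1.5 now, I could also ask the VM itself whether it sees a monitor deadlock. A minimal sketch (illustrative only, not something we currently run):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: report threads the JVM considers deadlocked on object monitors.
public class DeadlockCheck {

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads();  // JDK 1.5 API, returns null if none
        if (ids == null) {
            System.out.println("No monitor deadlock detected.");
            return;
        }
        ThreadInfo[] infos = mx.getThreadInfo(ids);
        for (int i = 0; i < infos.length; i++) {
            ThreadInfo info = infos[i];
            if (info == null) continue;  // thread may have died in the meantime
            System.out.println("Deadlocked: " + info.getThreadName()
                    + " waiting on " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
    }
}

Of course this only sees the VM it runs in, so it would have to be triggered from inside Tomcat (a monitoring thread or a small JSP), not as a separate process.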