This problem continues to plague me.

A quick recap so you don't have to search your memory or archives.

The 10,000-foot view: new Dell T105 and T110, Slackware 13.0 (64-bit), latest Java (64-bit) and latest Tomcat. The machines run only Tomcat and a small, special-purpose Java server (which I have also moved to another machine to make certain it wasn't causing any problems). Periodically, Tomcat just dies, leaving no tracks in any log that I have been able to find. The application ran on a Slackware 12.1 (32-bit) server for several years without problems (except for application bugs). I have run Memtest86 for 30 hours on the T110 with no problems reported.

More details: the Dell T105 has an AMD processor and (currently) 8 GB of memory. The T110 has a Xeon 3440 processor and 4 GB of memory. The current Java version is 1.6.0_18-b07. The current Tomcat version is 6.0.24.

The servers are lightly loaded, with fewer than 100 sessions active at any one time.

All of the following trials have produced the same results:

1.  Tried 64-bit openSUSE.

2.  Tried 32-bit Slackware 13.0.

3.  Increased the memory in the T105 from 4 GB to 6 GB and finally to 8 GB.

4.  Fiddled with the JAVA_OPTS settings in catalina.sh. The current settings are:

JAVA_OPTS="-Xms512m -Xmx512m -XX:PermSize=384m -XX:MaxPermSize=384m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/tomcat/logs"

I can see the incremental GC activity in both catalina.out and VisualVM. Note the fairly small (512 MB) heap, but watching VisualVM indicates this is sufficient (when a failure occurs, VisualVM reports the last amount of memory used, and this is always well under the max in both heap and permGen).
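
As an aside, to make sure the last heap and GC readings survive a kill, I am thinking of logging them to a separate file every few seconds. This is just a sketch, assuming the Tomcat PID can be picked up from jps; the log path is simply the one I already use for heap dumps:

# grab the Tomcat PID and append -gcutil samples every 10 seconds
JVM_PID=`jps -l | grep org.apache.catalina.startup.Bootstrap | cut -d' ' -f1`
jstat -gcutil $JVM_PID 10s >> /usr/local/tomcat/logs/jstat.log &
# alternatively, adding -Xloggc:/usr/local/tomcat/logs/gc.log to JAVA_OPTS keeps
# the GC lines out of catalina.out and in a file of their own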

More information about the failures:

1. They are clean kills: I can restart Tomcat immediately after a failure and there is no port conflict. As I understand it, this implies either that the Linux process was killed (I have manually killed the java process with kill -9 and seen the same result that I observe when the system fails), or that Tomcat was shut down normally, e.g., via shutdown.sh (a normal shutdown always leaves tracks in catalina.out, and I am not seeing any, so I do not believe this is the case). See the wrapper sketch after this list for how I plan to confirm which.

2. They appear to be load related. On heavy processing days the system might fail every 15 minutes, yet with lighter processing it can run for up to 10 days without a failure. I have found a way to force a more frequent failure. We have four WARs deployed (I will call them A, B, C and D). They are all the same application, but we use this arrangement to give access to different databases; a user reaches the right one via https://xx.com/A, /B, etc. A is used for production while the others have specific purposes, so A is always in use while the others are used periodically. If users start coming in on B, C and/or D, the failure occurs within hours (Tomcat shuts down, bringing all of the users down, of course). Note that even then the failure does not happen immediately.

3. They do not appear to be caused by memory restrictions: 1) the old server had only 2 GB of memory and ran well, 2) I have tried adding memory to the new servers with no change in behavior, and 3) the indications from top and the Slackware system monitor are that the system is not starved for memory. In fact, yesterday, running on the T105 with 8 GB of memory, top never reported more than 6 GB in use (and 0 swap in use), yet it failed at about 4:00 PM.

4. Most of the failures occur only after some amount of processing. We update the WARs and restart the Tomcats each morning at 1:00 AM. Most of the failures occur toward the end of the day, although heavy processing (or using multiple 'applications') can force one to happen earlier (the earliest failure has been around 1:00 PM, on the heaviest processing day ever). It is almost as if there is a bucket somewhere that gets filled up and, when full, causes the failure; the monitoring sketch after this list is one way I intend to look for that. (So there is no misunderstanding: there has never been an OOM condition reported anywhere that I can find.)
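
Regarding item 1, one way I plan to confirm that the JVM really is being killed by a signal (rather than exiting on its own) is a small wrapper around catalina.sh that records the exit status. This is only a sketch; wrapper.sh is my own name for it and the paths match my install:

#!/bin/sh
# wrapper.sh - run this instead of startup.sh so the JVM stays in the foreground
CATALINA_HOME=/usr/local/tomcat
$CATALINA_HOME/bin/catalina.sh run >> $CATALINA_HOME/logs/wrapper.out 2>&1
STATUS=$?
# exit status is 128 + signal number: 137 = SIGKILL (kill -9 or the OOM killer),
# 143 = SIGTERM; a clean stop via shutdown.sh should show up as 0
echo "`date`: Tomcat exited with status $STATUS" >> $CATALINA_HOME/logs/wrapper.out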
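
And for the bucket idea in item 4, I intend to log memory, file descriptor and thread counts for the Tomcat process once a minute, so whatever is growing shows up in the minutes before a failure. Again just a sketch, started in the background after the 1:00 AM restart; monitor.sh and the log path are my own:

#!/bin/sh
# monitor.sh - hypothetical periodic resource logger
LOG=/usr/local/tomcat/logs/resources.log
while true; do
    PID=`jps -l | grep org.apache.catalina.startup.Bootstrap | cut -d' ' -f1`
    FDS=`ls /proc/$PID/fd 2>/dev/null | wc -l`
    THREADS=`ls /proc/$PID/task 2>/dev/null | wc -l`
    echo "`date` pid=$PID fds=$FDS threads=$THREADS" >> $LOG
    free -m >> $LOG
    sleep 60
done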

Observations (or random musings):

The fact that the failures occur after some amount of processing suggests the issue is related to memory usage, potentially a memory leak in the application. However, 1) I have never seen (from VisualVM) any issue with either heap or permGen, and the incremental GCs reported in catalina.out look pretty normal, and 2) top, vmstat, the system monitor, etc. are not showing any issues with memory.
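
If there is a leak that the summary numbers are hiding, taking a class histogram with jmap every hour or so and diffing the counts might show it. Only a sketch, and it assumes the JDK's jps and jmap are on the PATH:

# hourly class histogram, written next to the other Tomcat logs
PID=`jps -l | grep org.apache.catalina.startup.Bootstrap | cut -d' ' -f1`
jmap -histo $PID > /usr/local/tomcat/logs/histo.`date +%Y%m%d%H%M`.txt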

The failures look a lot like the work of the Linux OOM killer (which Mark or Chris suggested back at the beginning, now 2-3 months ago). Does anyone have an idea where I could get information on tracking the Linux signals that could cause this?
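
So far the only places I have thought to look are the kernel logs for OOM-killer messages, plus (from some reading) an audit rule to catch user-space kill() callers. The audit rule assumes auditd is installed and is something I have only read about, not tested, so corrections are welcome:

# the OOM killer logs through the kernel, so it should leave traces here
dmesg | grep -i "out of memory"
grep -i oom /var/log/messages /var/log/syslog

# if nothing is there, an audit rule can record user-space kill() callers
# (it will not see the OOM killer itself, which kills from inside the kernel)
auditctl -a exit,always -F arch=b64 -S kill -k tomcat_kill
ausearch -k tomcat_kill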

Thanks,

Carl



