Carl,

A couple of random thoughts . . .

I'm not familiar with the Slackware monitoring tools, but I am with the various 
tools that come with Fedora / Red Hat. One of the things I've noticed with 
those GUI tools is that they add cache and buffers to the free memory total.

Tools like top and vmstat should give a more complete picture of your memory. 
With vmstat you can conveniently watch free memory, buffers, cache, and swap. 
With top, you can run a command-line monitor in batch mode and watch a 
particular PID.
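
For example, something along these lines (exact options may vary with your 
procps version; adjust the PID to your Tomcat JVM):

    # refresh every 5 seconds; watch the free, buff, cache, si and so columns
    vmstat 5

    # batch mode, every 5 seconds, for one particular PID
    top -b -d 5 -p <tomcat-pid>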

From the taroon-list: if you're running a 32-bit Linux and run out of low 
memory, it doesn't matter how much high memory you have; the OOM killer will 
start killing processes off. Since you're running a 64-bit Linux, this should 
not be the problem.

A discussion on stackoverflow.com may be more relevant to your situation. It 
turns out (according to the discussion) that calling 
Runtime.getRuntime().exec() on a busy system can lead to transient memory 
shortages which trigger the OOM killer.

If Runtime.getRuntime().exec() or similar calls do not exist in your 
application, then please skip the following speculation. I've made some 
comments concerning host resolution at the end of this message which might be 
helpful.

If Runtime.getRuntime().exec() is used, the scenario goes like this:

1. The application calls Runtime.getRuntime().exec().
2. fork() gets called and makes a copy of the parent (JVM) process.
3. For a moment the system is running two processes with largish memory
   requirements; this is the point where the OOM killer may get triggered.
4. exec() gets called in the child process and the memory requirements go
   back down.

At least that's how I read this reference:

http://stackoverflow.com/questions/209875/from-what-linux-kernel-libc-version-is-java-runtime-exec-safe-with-regards-to-m
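
In code terms, the kind of call to look for is anything like the following 
(a made-up, minimal illustration; the command itself doesn't matter):

    // Hypothetical example: any external command started this way goes
    // through fork()/exec() underneath on Linux.
    public class ExecExample {
        public static void main(String[] args) throws Exception {
            Process p = Runtime.getRuntime().exec("/bin/true"); // fork() happens here
            System.out.println("exit code: " + p.waitFor());    // reap the child
        }
    }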

Since processes that fork a lot of child processes rank high on the OOM 
killer's kill list, Tomcat becomes a likely target and gets killed.

See for example: 
http://prefetch.net/blog/index.php/2009/09/30/how-the-linux-oom-killer-works/
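
One way to confirm whether the OOM killer is actually the culprit is to look 
at the kernel log right after Tomcat disappears (the exact message wording 
varies by kernel version, so treat this as a sketch):

    # the kernel log usually records OOM kills
    dmesg | grep -i "out of memory"
    grep -i oom /var/log/messages   # or wherever Slackware sends kernel messages

    # how attractive the Tomcat JVM currently looks to the OOM killer
    cat /proc/<tomcat-pid>/oom_score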

As to why it would happen on the newer production systems and not the older 
system, my only idea concerns the kernel version you're using. Memory 
management was significantly reworked between the 2.4 and 2.6 kernels. If your 
older system runs a 2.4 kernel, that could explain some of the differences in 
memory allocation.

So, if Runtime.getRuntime().exec() is used, what are some possible solutions?

1. Reducing -Xms and -Xmx while adding physical memory

If you do this, the fork() call won't be as expensive in the window before 
exec() is called. Your application will be able to serve more clients without 
potentially triggering the OOM killer.

Garbage collection may become an issue with a smaller heap, so load testing 
and tuning with JMeter is probably a good idea.
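
The heap settings usually live in whatever script starts Tomcat (CATALINA_OPTS 
/ JAVA_OPTS in catalina.sh, or a setenv.sh if your Tomcat version reads one). 
The numbers here are placeholders, not recommendations:

    # e.g. in $CATALINA_HOME/bin/setenv.sh
    CATALINA_OPTS="-Xms512m -Xmx1024m"
    export CATALINA_OPTS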

2. Create a lightweight process that forks what Runtime.getRuntime().exec() 
currently calls, and communicate with that process over sockets.

This is pretty unpleasant, but you might be able to treat this as a remote 
process server. You could then end up using a custom object, JNDI lookups, and 
pooling, much like database pooling.
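
A very rough sketch of what I have in mind (port, protocol and names are all 
made up, and there's no error handling or security):

    import java.io.*;
    import java.net.*;

    // Tiny stand-alone "exec server": it is started separately with a small
    // heap, so its own fork()s are cheap. The webapp opens a socket and sends
    // one command line per connection instead of calling
    // Runtime.getRuntime().exec() from inside the big Tomcat JVM.
    public class ExecServer {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(9099);   // made-up port
            while (true) {
                Socket s = server.accept();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                String command = in.readLine();             // one command per connection
                Process p = Runtime.getRuntime().exec(command);
                // A real version would also drain the child's stdout/stderr
                // so the child can't block on a full pipe.
                out.println("exit=" + p.waitFor());
                s.close();
            }
        }
    }

On the Tomcat side the call then becomes a plain socket write/read instead of 
a fork() of the whole JVM.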

As I've said, this is all based on the assumption that the application is 
transiently requesting a large amount of memory because of 
Runtime.getRuntime().exec() or some similar action. If that is not the case, 
then the above arguments are null and void.

DNS Thoughts

As for the ideas concerning DNS - I've never seen DNS issues actually take 
down an environment. However, I have seen orders-of-magnitude performance 
problems caused by poorly configured DNS resolution and missing DNS entries.

One way to test for DNS performance issues is to set up a client with a static 
IP address but leave it out of your local DNS. Run JMeter on that client and 
stress your server. Then add the client to DNS and stress the server with 
JMeter again. If you notice a difference, there is an issue with how your 
server performs host resolution.
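
A quick sanity check from the server itself (the hostname and IP below are 
placeholders) is to time both the forward and the reverse lookup:

    time host testclient.example.com
    time host 192.168.1.50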

Make sure that address resolution services you don't actually run (nisplus, 
nis, hesiod) are not listed as sources on the hosts: line in /etc/nsswitch.conf 
(or wherever Slackware puts it). At the very least, put a [NOTFOUND=return] 
entry after dns but before all the other services listed on the hosts: line.
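
For example, assuming files and dns are the only sources you actually use:

    # preferred: only list what you really run
    hosts: files dns

    # at minimum: stop after dns instead of falling through to services
    # that aren't deployed
    hosts: files dns [NOTFOUND=return] nisplus nis hesiod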

So, here's a summary of all this rambling:

1. Monitor memory with vmstat and top to get a better picture of the 
   system memory.
2. If Runtime.getRuntime().exec() is used, then transient memory 
   allocations could trigger the OOM killer on a busy system.
3. Make sure host resolution works properly, and turn off DNS lookups in 
   server.xml (see the Connector snippet below).
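
For the server.xml part, the attribute is enableLookups on the Connector. Keep 
whatever attributes your existing Connector already has and just make sure 
lookups are off, e.g.:

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               enableLookups="false" />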

OK, enough rambling - hope this is useful.

/mde/

--- On Wed, 2/3/10, Carl <c...@etrak-plus.com> wrote:

> From: Carl <c...@etrak-plus.com>
> Subject: Re: Tomcat dies suddenly
> To: "Tomcat Users List" <users@tomcat.apache.org>
> Date: Wednesday, February 3, 2010, 5:07 PM
> Chris,
> 
> Interesting idea.  I tried over the weekend to force
> that situation with JMeter hitting a simple jsp that did
> some data stuff and created a small display.  I pushed
> it to the point that there were entries in the log stating
> it was out of memory (when attempting to GC, I think) but it
> just slowed way down and never crashed.  I could see
> from VisualJVM that it had used the entire heap but, again,
> I could never get it to crash.
> 
> Strange because it doesn't have the classic signs (slowing
> down or throwing out of memory exceptions or freezing), it
> just disappears without any tracks.  I am certain there
> is a reason somewhere, I just haven't found it yet.
> 
> Thanks for your suggestions,
> 
> Carl





