If anyone else runs into this issue: we resolved it by moving SSL termination to our load balancer. Something in the native SSL libraries appears to have been crashing the JVM under high load. Since we moved SSL termination to our BigIP, we have not had a single unexplained crash. I hope this eventually helps someone else with the same problem.
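For anyone making the same change, the resin.conf difference is roughly: drop the <openssl>-backed listener and let the load balancer own the certificates, with Resin listening on plain HTTP behind it. A hypothetical sketch only — element names are as I recall them from the Resin 3.0 docs, and the key-file/port values are placeholders; verify against your own config (the chain-file path is the one from the exception below):

```xml
<!-- Before: Resin terminates SSL itself via the native OpenSSL library -->
<http server-id="myserver1" host="*" port="443">
  <openssl>
    <certificate-chain-file>/nfs/certs/mysite.crt</certificate-chain-file>
    <!-- key file and password are placeholders, not from the thread -->
    <certificate-key-file>/nfs/certs/mysite.key</certificate-key-file>
  </openssl>
</http>

<!-- After: the BigIP terminates SSL and forwards plain HTTP to Resin -->
<http server-id="myserver1" host="*" port="8080"/>
```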
Thanks,
Shane

From: resin-interest-boun...@caucho.com On Behalf Of Shane Cruz
Sent: Thursday, February 17, 2011 6:18 PM
To: General Discussion for the Resin application server
Subject: Re: [Resin-interest] Best Way to Track Down Random Resin Restarts

I increased the file descriptors to be safe, but it doesn't appear to be the issue. The user was nowhere near the file descriptor limit on any of my checks, and I doubt it would be possible for there to be a sudden file descriptor spike that would open another 5000 files.

One thing that is interesting is that I have been connecting jstat to the process to see what the heaps look like right before the crash. The heap data looks fine, but the timestamp doesn't increment for about 10-12 consecutive checks. jstat gets disconnected when the process dies, but this data almost makes it seem like the process is running but unresponsive for about 10 seconds and then it gets killed. Would the wrapper process kill the JVM if it found it to be running but unresponsive? Is there anything else in Resin that would kill the Java process if it determined there was a deadlock or something (we are not using the <ping> check in resin.conf)?
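The stuck-timestamp pattern in the capture below can be checked mechanically. A small hypothetical helper (not part of jstat or Resin) that takes the Timestamp column from `jstat -gcutil -t <pid> 1000` samples and reports runs where the sampler's clock stopped advancing:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Detects stalls in jstat -gcutil -t output: if the Timestamp column
 * repeats across consecutive 1-second samples, the JVM was not
 * responding to the sampler during that window.
 */
public class JstatStallDetector {

    /** Returns the length (in samples) of each run of identical timestamps. */
    static List<Integer> stallRuns(double[] timestamps) {
        List<Integer> runs = new ArrayList<>();
        int run = 1;
        for (int i = 1; i < timestamps.length; i++) {
            if (timestamps[i] == timestamps[i - 1]) {
                run++;                     // clock did not advance
            } else {
                if (run > 1) runs.add(run);
                run = 1;
            }
        }
        if (run > 1) runs.add(run);
        return runs;
    }

    public static void main(String[] args) {
        // Timestamps from the capture below: steady 1s ticks, then the
        // same 40050.7 reading repeated while the JVM was unresponsive.
        double[] ts = {40044.1, 40045.1, 40046.1, 40047.1, 40048.1,
                       40049.1, 40050.1, 40050.7, 40050.7, 40050.7,
                       40050.7, 40050.7, 40050.7, 40050.7, 40050.7,
                       40050.7, 40050.7, 40050.7};
        System.out.println("stall runs: " + stallRuns(ts));
    }
}
```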
Timestamp    S0     S1     E      O      P     YGC    YGCT   FGC     FGCT      GCT
  40044.1  0.00   0.00  23.05  83.21  82.56   711  70.353   118  104.280  174.632
  40045.1  0.00   0.00  26.98  83.21  82.56   711  70.353   118  104.280  174.632
  40046.1  0.00   0.00  28.36  83.21  82.56   711  70.353   118  104.280  174.632
  40047.1  0.00   0.00  29.88  83.21  82.56   711  70.353   118  104.280  174.632
  40048.1  0.00   0.00  31.28  83.21  82.56   711  70.353   118  104.280  174.632
  40049.1  0.00   0.00  32.86  83.21  82.56   711  70.353   118  104.280  174.632
  40050.1  0.00   0.00  35.72  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632

On Thu, Feb 17, 2011 at 2:06 PM, Scott Ferguson <f...@caucho.com> wrote:

Shane Cruz wrote:
> So, with full debug logging turned on, I did see this exception in the
> logs right before the restart:
>
> [13:55:37.603] com.caucho.log.EnvironmentLogger.log
>     com.caucho.config.ConfigException: OpenSSL can't open
>     certificate-chain-file '/nfs/certs/mysite.crt'
> [13:55:37.603]   at com.caucho.vfs.OpenSSLFactory.open(Native Method)
> [13:55:37.603]   at com.caucho.vfs.OpenSSLFactory.accept(OpenSSLFactory.java:419)
> [13:55:37.603]   at com.caucho.server.port.Port.accept(Port.java:813)
> [13:55:37.603]   at com.caucho.server.port.TcpConnection.run(TcpConnection.java:495)
> [13:55:37.603]   at com.caucho.util.ThreadPool.runTasks(ThreadPool.java:520)
> [13:55:37.603]   at com.caucho.util.ThreadPool.run(ThreadPool.java:442)
> [13:55:37.603]   at java.lang.Thread.run(Thread.java:619)
> [13:55:37.603]
> [13:55:49.109] com.caucho.log.EnvironmentLogger.log Server[myserver1] starting
>
> That certificate is getting loaded over NFS. Is there a chance that a
> certificate loading failure due to an NFS issue could cause the JVM to
> exit? I thought the certificate would just be loaded one time at
> startup, but it looks like maybe it accesses it during runtime as well.

Possibly an issue running out of file descriptors?

That exception shouldn't cause a restart directly. It would cause that
thread to exit, but would also start up a new thread to listen to that
port (because it's assuming the current thread is broken for some
reason.) But you could get a "can't open" if you run out of file
descriptors, and running out of file descriptors can force a restart.

-- Scott

> Unfortunately, on a different JVM, there was a crash that doesn't seem
> to have the same exception:
>
> [13:36:03.102] com.caucho.log.EnvironmentLogger.log allocate
>     PoolItem[jdbc/db1,3340053,com.caucho.sql.ManagedConnectionImpl@744ab820]
> [13:36:03.102] com.caucho.log.EnvironmentLogger.log allocate
>     PoolItem[jdbc/db2,1020267,com.caucho.sql.ManagedConnectionImpl@2a121a07]
> [13:36:16.815] com.caucho.log.EnvironmentLogger.log Server[myserver2] starting
>
> Scott, what are your thoughts on the certificate issue? To be safe,
> we should probably start by not loading the certificate over an NFS share.
>
> Thanks,
> Shane
>
> On Fri, Feb 11, 2011 at 1:40 PM, Scott Ferguson <f...@caucho.com> wrote:
>
> Shane Cruz wrote:
> > We are running Resin Pro 3.0.25 on RHEL 5.5 and using 64-bit Sun JDK
> > 1.6.0_05. Recently, we have started seeing several incidents where
> > the Resin JVM seems to just randomly get restarted.
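Scott's file-descriptor theory can be tested from inside the JVM on the Sun JDKs discussed here: the com.sun.management.UnixOperatingSystemMXBean extension exposes the open and maximum FD counts. A sketch, not Resin code — the 90% threshold and three-sample loop are arbitrary choices for illustration:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/**
 * Periodically logs this JVM's open file-descriptor count so a
 * descriptor leak or spike is visible before a "can't open" failure.
 * Requires a Sun/Oracle JDK on a Unix platform.
 */
public class FdWatchdog {

    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof com.sun.management.UnixOperatingSystemMXBean)) {
            System.out.println("FD counts not available on this JVM/OS");
            return;
        }
        com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
        long max = unix.getMaxFileDescriptorCount();
        for (int i = 0; i < 3; i++) {          // in practice: loop forever
            long open = unix.getOpenFileDescriptorCount();
            System.out.printf("fd: %d / %d%n", open, max);
            if (open > max * 0.9) {            // arbitrary warning threshold
                System.err.println("WARNING: >90% of file descriptors in use");
            }
            Thread.sleep(1000);
        }
    }
}
```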
> > There is nothing in the logs to indicate that the JVM was shut down
> > cleanly or a restart was attempted; the log files just go from
> > displaying regular log lines to displaying the following:

The logging for 4.0 is much more informative. With 3.0 it's a bit
trickier.

> > [11:24:18.095] com.caucho.log.EnvironmentLogger.log Server[myserver] starting
> >
> > Things that have already been checked:
> >
> > 1. There doesn't appear to be a JVM crash, as no HotSpot error log
> >    files are created as they usually would be.
> >
> > 2. There are no signs in the sudo logs that anyone is manually
> >    restarting the JVM.
> >
> > 3. There are no signs in the logs that Resin is restarting itself,
> >    even though we have a "min-free-memory" setting of 1M. With higher
> >    values of that setting we have seen the JVM get restarted due to
> >    low memory, but I am pretty sure logging always indicated that the
> >    JVM was restarting when this happened before.
> >
> > 4. We are not using the Resin "ping" check that might restart the JVM
> >    if it is unresponsive.
> >
> > 5. Kernel logging is enabled, and it doesn't look like the kernel is
> >    killing it for any reason.
> >
> > It almost seems as if the JVM is just getting a kill -9 and then the
> > wrapper script is starting it back up. What is the best way to track
> > down what might be killing the JVM? We are in the process of testing
> > an upgrade to a newer version of the JDK, but I am not very confident
> > that will fix the problem. I am going to try to turn on full Resin
> > debug logging, but I thought I would reach out in case anyone else had
> > an idea of how to track this down. Is there a way to wrap the Linux
> > kill command to find out if that is being run? Any other suggestions
> > on where to look?

Since a phantom kill is pretty unlikely, I wouldn't spend too much time
on that theory.
Since you're not getting a hs_* error, the most likely cause would be
something calling System.exit or Runtime.halt, possibly Resin itself for
something like running out of threads or memory (although, as you
pointed out, that should be logged.)

Other than that, the restart should only happen if the config files
change (theoretically something like NFS or 'touch' could trigger that,
but I assume that's not happening.)

-- Scott
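To confirm or rule out the System.exit/Runtime.halt theory, a SecurityManager's checkExit hook can log the caller's stack trace before the exit proceeds; both Runtime.exit and Runtime.halt pass through it. A sketch for the JDK 1.6-era JVMs in this thread (the SecurityManager API is deprecated and disabled by default on modern JDKs); how it gets installed into Resin, e.g. from a startup servlet, is up to you:

```java
/**
 * Logs a stack trace whenever something asks the JVM to exit, so a
 * stray System.exit()/Runtime.halt() call can be traced to its caller.
 */
public class ExitTracer {

    public static void install() {
        try {
            System.setSecurityManager(new SecurityManager() {
                @Override
                public void checkExit(int status) {
                    // Both Runtime.exit() and Runtime.halt() call this.
                    // We log and return without throwing, so the exit
                    // still proceeds normally afterwards.
                    new Throwable("JVM exit requested, status=" + status)
                            .printStackTrace();
                }

                @Override
                public void checkPermission(java.security.Permission perm) {
                    // Allow everything else; we only want the exit hook.
                }
            });
        } catch (UnsupportedOperationException e) {
            // JDK 17+: SecurityManager is disabled unless the JVM is
            // started with -Djava.security.manager=allow.
            System.err.println("SecurityManager unavailable: " + e);
        }
    }

    public static void main(String[] args) {
        install();
        System.out.println("exit tracer installed");
        // Any later System.exit(n) or Runtime.getRuntime().halt(n) now
        // prints the calling stack trace to stderr before the JVM dies.
    }
}
```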
_______________________________________________
resin-interest mailing list
resin-interest@caucho.com
http://maillist.caucho.com/mailman/listinfo/resin-interest