If anyone else runs into this issue: we resolved it by moving SSL termination to our load balancer. Something in the native SSL libraries appears to have been crashing the JVM under high load. Since we moved SSL termination to our BigIP, we have not had a single unexplained crash. I hope this eventually helps someone else with the same problem.
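For anyone making the same change, the resin.conf difference is roughly: drop the <openssl>-backed listener and let the load balancer own the certificates, with Resin listening on plain HTTP behind it. A hypothetical sketch only — element names are as I recall them from the Resin 3.0 docs, and the key-file/port values are placeholders; verify against your own config (the chain-file path is the one from the exception below):

```xml
<!-- Before: Resin terminates SSL itself via the native OpenSSL library -->
<http server-id="myserver1" host="*" port="443">
  <openssl>
    <certificate-chain-file>/nfs/certs/mysite.crt</certificate-chain-file>
    <!-- key file and password are placeholders, not from the thread -->
    <certificate-key-file>/nfs/certs/mysite.key</certificate-key-file>
  </openssl>
</http>

<!-- After: the BigIP terminates SSL and forwards plain HTTP to Resin -->
<http server-id="myserver1" host="*" port="8080"/>
```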
Thanks,
Shane

From: resin-interest-boun...@caucho.com On Behalf Of Shane Cruz
Sent: Thursday, February 17, 2011 6:18 PM
To: General Discussion for the Resin application server
Subject: Re: [Resin-interest] Best Way to Track Down Random Resin Restarts

I increased the file descriptors to be safe, but it doesn't appear to be the issue. The user was nowhere near the file descriptor limit on any of my checks, and I doubt it would be possible for there to be a sudden file descriptor spike that would open another 5000 files.

One thing that is interesting is that I have been connecting jstat to the process to see what the heaps look like right before the crash. The heap data looks fine, but the timestamp doesn't increment for about 10-12 consecutive checks. jstat gets disconnected when the process dies, but this data almost makes it seem like the process is running but unresponsive for about 10 seconds and then it gets killed. Would the wrapper process kill the JVM if it found it to be running but unresponsive? Is there anything else in Resin that would kill the Java process if it determined there was a deadlock or something (we are not using the <ping> check in resin.conf)?
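The stuck-timestamp pattern in the capture below can be checked mechanically. A small hypothetical helper (not part of jstat or Resin) that takes the Timestamp column from `jstat -gcutil -t <pid> 1000` samples and reports runs where the sampler's clock stopped advancing:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Detects stalls in jstat -gcutil -t output: if the Timestamp column
 * repeats across consecutive 1-second samples, the JVM was not
 * responding to the sampler during that window.
 */
public class JstatStallDetector {

    /** Returns the length (in samples) of each run of identical timestamps. */
    static List<Integer> stallRuns(double[] timestamps) {
        List<Integer> runs = new ArrayList<>();
        int run = 1;
        for (int i = 1; i < timestamps.length; i++) {
            if (timestamps[i] == timestamps[i - 1]) {
                run++;                     // clock did not advance
            } else {
                if (run > 1) runs.add(run);
                run = 1;
            }
        }
        if (run > 1) runs.add(run);
        return runs;
    }

    public static void main(String[] args) {
        // Timestamps from the capture below: steady 1s ticks, then the
        // same 40050.7 reading repeated while the JVM was unresponsive.
        double[] ts = {40044.1, 40045.1, 40046.1, 40047.1, 40048.1,
                       40049.1, 40050.1, 40050.7, 40050.7, 40050.7,
                       40050.7, 40050.7, 40050.7, 40050.7, 40050.7,
                       40050.7, 40050.7, 40050.7};
        System.out.println("stall runs: " + stallRuns(ts));
    }
}
```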
Timestamp    S0     S1     E      O      P     YGC    YGCT   FGC     FGCT      GCT
  40044.1  0.00   0.00  23.05  83.21  82.56   711  70.353   118  104.280  174.632
  40045.1  0.00   0.00  26.98  83.21  82.56   711  70.353   118  104.280  174.632
  40046.1  0.00   0.00  28.36  83.21  82.56   711  70.353   118  104.280  174.632
  40047.1  0.00   0.00  29.88  83.21  82.56   711  70.353   118  104.280  174.632
  40048.1  0.00   0.00  31.28  83.21  82.56   711  70.353   118  104.280  174.632
  40049.1  0.00   0.00  32.86  83.21  82.56   711  70.353   118  104.280  174.632
  40050.1  0.00   0.00  35.72  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632
  40050.7  0.00   0.00  41.97  83.21  82.56   711  70.353   118  104.280  174.632

On Thu, Feb 17, 2011 at 2:06 PM, Scott Ferguson <f...@caucho.com> wrote:

Shane Cruz wrote:
> So, with full debug logging turned on, I did see this exception in the
> logs right before the restart:
>
> [13:55:37.603] com.caucho.log.EnvironmentLogger.log
>     com.caucho.config.ConfigException: OpenSSL can't open
>     certificate-chain-file '/nfs/certs/mysite.crt'
> [13:55:37.603]   at com.caucho.vfs.OpenSSLFactory.open(Native Method)
> [13:55:37.603]   at com.caucho.vfs.OpenSSLFactory.accept(OpenSSLFactory.java:419)
> [13:55:37.603]   at com.caucho.server.port.Port.accept(Port.java:813)
> [13:55:37.603]   at com.caucho.server.port.TcpConnection.run(TcpConnection.java:495)
> [13:55:37.603]   at com.caucho.util.ThreadPool.runTasks(ThreadPool.java:520)
> [13:55:37.603]   at com.caucho.util.ThreadPool.run(ThreadPool.java:442)
> [13:55:37.603]   at java.lang.Thread.run(Thread.java:619)
> [13:55:37.603]
> [13:55:49.109] com.caucho.log.EnvironmentLogger.log Server[myserver1] starting
>
> That certificate is getting loaded over NFS. Is there a chance that a
> certificate loading failure due to an NFS issue could cause the JVM to
> exit? I thought the certificate would just be loaded one time at
> startup, but it looks like maybe it accesses it during runtime as well.

Possibly an issue running out of file descriptors?

That exception shouldn't cause a restart directly. It would cause that
thread to exit, but would also start up a new thread to listen to that
port (because it's assuming the current thread is broken for some
reason.) But you could get a "can't open" if you run out of file
descriptors, and running out of file descriptors can force a restart.

-- Scott

> Unfortunately, on a different JVM, there was a crash that doesn't seem
> to have the same exception:
>
> [13:36:03.102] com.caucho.log.EnvironmentLogger.log allocate
>     PoolItem[jdbc/db1,3340053,com.caucho.sql.ManagedConnectionImpl@744ab820]
> [13:36:03.102] com.caucho.log.EnvironmentLogger.log allocate
>     PoolItem[jdbc/db2,1020267,com.caucho.sql.ManagedConnectionImpl@2a121a07]
> [13:36:16.815] com.caucho.log.EnvironmentLogger.log Server[myserver2] starting
>
> Scott, what are your thoughts on the certificate issue? To be safe,
> we should probably start by not loading the certificate over an NFS share.
>
> Thanks,
> Shane
>
> On Fri, Feb 11, 2011 at 1:40 PM, Scott Ferguson <f...@caucho.com> wrote:
>
> Shane Cruz wrote:
> > We are running Resin Pro 3.0.25 on RHEL 5.5 and using 64-bit Sun JDK
> > 1.6.0_05. Recently, we have started seeing several incidents where
> > the Resin JVM seems to just randomly get restarted.
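Scott's file-descriptor theory can be tested from inside the JVM on the Sun JDKs discussed here: the com.sun.management.UnixOperatingSystemMXBean extension exposes the open and maximum FD counts. A sketch, not Resin code — the 90% threshold and three-sample loop are arbitrary choices for illustration:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/**
 * Periodically logs this JVM's open file-descriptor count so a
 * descriptor leak or spike is visible before a "can't open" failure.
 * Requires a Sun/Oracle JDK on a Unix platform.
 */
public class FdWatchdog {

    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof com.sun.management.UnixOperatingSystemMXBean)) {
            System.out.println("FD counts not available on this JVM/OS");
            return;
        }
        com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
        long max = unix.getMaxFileDescriptorCount();
        for (int i = 0; i < 3; i++) {          // in practice: loop forever
            long open = unix.getOpenFileDescriptorCount();
            System.out.printf("fd: %d / %d%n", open, max);
            if (open > max * 0.9) {            // arbitrary warning threshold
                System.err.println("WARNING: >90% of file descriptors in use");
            }
            Thread.sleep(1000);
        }
    }
}
```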
> > There is nothing in the logs to indicate that the JVM was shut down
> > cleanly or a restart was attempted; the log files just go from
> > displaying regular log lines to displaying the following:

The logging for 4.0 is much more informative. With 3.0 it's a bit
trickier.

> > [11:24:18.095] com.caucho.log.EnvironmentLogger.log Server[myserver] starting
> >
> > Things that have already been checked:
> >
> > 1. There doesn't appear to be a JVM crash, as no HotSpot error log
> >    files are created as they usually would be.
> >
> > 2. There are no signs in the sudo logs that anyone is manually
> >    restarting the JVM.
> >
> > 3. There are no signs in the logs that Resin is restarting itself,
> >    even though we have a "min-free-memory" setting of 1M. With higher
> >    values of that setting we have seen the JVM get restarted due to
> >    low memory, but I am pretty sure logging always indicated that the
> >    JVM was restarting when this happened before.
> >
> > 4. We are not using the Resin "ping" check that might restart the JVM
> >    if it is unresponsive.
> >
> > 5. Kernel logging is enabled, and it doesn't look like the kernel is
> >    killing it for any reason.
> >
> > It almost seems as if the JVM is just getting a kill -9 and then the
> > wrapper script is starting it back up. What is the best way to track
> > down what might be killing the JVM? We are in the process of testing
> > an upgrade to a newer version of the JDK, but I am not very confident
> > that will fix the problem. I am going to try to turn on full Resin
> > debug logging, but I thought I would reach out in case anyone else had
> > an idea of how to track this down. Is there a way to wrap the Linux
> > kill command to find out if that is being run? Any other suggestions
> > on where to look?

Since a phantom kill is pretty unlikely, I wouldn't spend too much time
on that theory.
Since you're not getting a hs_* error, the most likely cause would be
something calling System.exit or Runtime.halt, possibly Resin itself for
something like running out of threads or memory (although, as you
pointed out, that should be logged.)

Other than that, the restart should only happen if the config files
change (theoretically something like NFS or 'touch' could trigger that,
but I assume that's not happening.)

-- Scott
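To confirm or rule out the System.exit/Runtime.halt theory, a SecurityManager's checkExit hook can log the caller's stack trace before the exit proceeds; both Runtime.exit and Runtime.halt pass through it. A sketch for the JDK 1.6-era JVMs in this thread (the SecurityManager API is deprecated and disabled by default on modern JDKs); how it gets installed into Resin, e.g. from a startup servlet, is up to you:

```java
/**
 * Logs a stack trace whenever something asks the JVM to exit, so a
 * stray System.exit()/Runtime.halt() call can be traced to its caller.
 */
public class ExitTracer {

    public static void install() {
        try {
            System.setSecurityManager(new SecurityManager() {
                @Override
                public void checkExit(int status) {
                    // Both Runtime.exit() and Runtime.halt() call this.
                    // We log and return without throwing, so the exit
                    // still proceeds normally afterwards.
                    new Throwable("JVM exit requested, status=" + status)
                            .printStackTrace();
                }

                @Override
                public void checkPermission(java.security.Permission perm) {
                    // Allow everything else; we only want the exit hook.
                }
            });
        } catch (UnsupportedOperationException e) {
            // JDK 17+: SecurityManager is disabled unless the JVM is
            // started with -Djava.security.manager=allow.
            System.err.println("SecurityManager unavailable: " + e);
        }
    }

    public static void main(String[] args) {
        install();
        System.out.println("exit tracer installed");
        // Any later System.exit(n) or Runtime.getRuntime().halt(n) now
        // prints the calling stack trace to stderr before the JVM dies.
    }
}
```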
_______________________________________________
resin-interest mailing list
resin-interest@caucho.com
http://maillist.caucho.com/mailman/listinfo/resin-interest