[Resin-interest] Resin cluster failure with a single node running out of heap space

2008-03-10 Thread Mike Wynholds
Scott and others-

 

My client has a five-node Resin Pro cluster, each running version 3.1.2.

 

Today one of the nodes ran out of heap and threw an OutOfMemoryError, which
did not bring Resin down but left it in a completely unresponsive state.

 

Within 10 minutes or so of that happening, the other four servers stopped
responding as well.  Their logs show that they are continuously getting
socket timeouts while trying to communicate with the first server for
session clustering (stack trace below).
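
The timeouts themselves are ordinary blocking-socket behavior: a read
against a wedged peer ties up the calling thread until the read timeout
expires.  A minimal standalone illustration; the host, port, and
five-second timeouts are made-up values, not anything from Resin's
internals:

    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // A blocking read against an unresponsive peer stalls the calling
    // thread until SO_TIMEOUT fires, which is why each session lookup
    // above hangs for the full timeout instead of failing fast.
    public class PeerReadTimeout {
        public static void main(String[] args) throws Exception {
            Socket s = new Socket();
            s.connect(new InetSocketAddress("10.0.0.1", 6802), 5000); // connect timeout
            s.setSoTimeout(5000); // without this, read() can block forever
            InputStream in = s.getInputStream();
            int b = in.read(); // throws java.net.SocketTimeoutException after ~5s
            System.out.println("read: " + b);
        }
    }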

 

To be fair, this is not the only exception being thrown.  We also see our
distributed EhCache system failing to replicate, and we *also* see the
occasional Hessian exception (also below).  Ultimately the server gets so
bogged down, it seems, that it needs to be restarted.

 

So my question is this:

Assuming a Resin node runs out of memory, is there a way for the other
Resin nodes to detect that and take the same action as if the node were
actually down?  I'm not sure this is really a bug, but it is probably a
super-edge-case scenario worth thinking about.

 

We are currently looking at our watchdog process config to see why it did
not auto-restart Resin.  I think we didn't give the watchdog enough memory
headroom to detect that a restart was needed, and our app lost
responsiveness before the watchdog could restart it.  But that's just a
theory.

 

I am interested in feedback from Scott and other Caucho developers about
this issue, as well as from other Resin users who may have run into
something like this before and have thoughts or suggestions on the matter.

 

Thanks.

..mike..

 

--- Socket Timeout stack trace (partial) ---

[14:47:10.389] java.net.SocketTimeoutException: Read timed out
[14:47:10.389]  at java.net.SocketInputStream.socketRead0(Native Method)
[14:47:10.389]  at java.net.SocketInputStream.read(SocketInputStream.java:129)
[14:47:10.389]  at com.caucho.vfs.TcpStream.read(TcpStream.java:163)
[14:47:10.389]  at com.caucho.vfs.ReadStream.readBuffer(ReadStream.java:1001)
[14:47:10.389]  at com.caucho.vfs.ReadStream.read(ReadStream.java:306)
[14:47:10.389]  at com.caucho.server.cluster.ClusterStore.updateAccess(ClusterStore.java:856)
[14:47:10.389]  at com.caucho.server.cluster.ClusterStore.accessServer(ClusterStore.java:823)
[14:47:10.389]  at com.caucho.server.cluster.ClusterStore.accessImpl(ClusterStore.java:804)
[14:47:10.389]  at com.caucho.server.cluster.ClusterObject.access(ClusterObject.java:337)
[14:47:10.389]  at com.caucho.server.session.SessionImpl.setAccess(SessionImpl.java:839)
[14:47:10.389]  at com.caucho.server.session.SessionManager.load(SessionManager.java:1477)
[14:47:10.389]  at com.caucho.server.session.SessionManager.getSession(SessionManager.java:1335)
[14:47:10.389]  at com.caucho.server.connection.AbstractHttpRequest.createSession(AbstractHttpRequest.java:1455)
[14:47:10.389]  at com.caucho.server.connection.AbstractHttpRequest.getSession(AbstractHttpRequest.java:1270)
[14:47:10.389]  at net.sf.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:172)
[14:47:10.389]  at net.sf.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:303)
[14:47:10.389]  at net.sf.acegisecurity.util.FilterChainProxy.doFilter(FilterChainProxy.java:173)
[14:47:10.389]  at net.sf.acegisecurity.util.FilterToBeanProxy.doFilter(FilterToBeanProxy.java:125)
[14:47:10.389]  at com.caucho.server.dispatch.FilterFilterChain.doFilter(FilterFilterChain.java:73)

 

--- Hessian failure stack trace ---

[14:15:00.065] Caused by: org.springframework.web.util.NestedServletException: Hessian skeleton invocation failed; nested exception is java.io.IOException: expected 'c' in hessian input at -1
[14:15:00.065]  at org.springframework.remoting.caucho.HessianServiceExporter.handleRequest(HessianServiceExporter.java:150)
[14:15:00.065]  at org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter.handle(HttpRequestHandlerAdapter.java:49)
[14:15:00.065]  at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:857)
[14:15:00.065]  at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:792)
[14:15:00.065]  at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:475)
[14:15:00.065]  at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:440)
[14:15:00.065]  at javax.servlet.http.HttpServlet.service(HttpServlet.java:153)
[14:15:00.065]  at javax.servlet.http.HttpServlet.service(HttpServlet.java:91)
[14:15:00.065]  at com.caucho.server.dispatch.ServletFilterChain.doFilter(ServletFilterChain.java:103)
[14:15:00.065]  at net.sf.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:292)
[14:15:00.065]  at taylor.tops.security.UserTrackerFilter.doFilter(UserTrackerFilter.java:27)


Re: [Resin-interest] Resin cluster failure with a single node running out of heap space

2008-03-10 Thread Scott Ferguson


On Mar 10, 2008, at 3:05 PM, Mike Wynholds wrote:


Scott and others-

My client has a five-node Resin Pro cluster, each running version 3.1.2.

Today one of the nodes ran out of heap and threw an OutOfMemoryError,
which did not bring Resin down but left it in a completely unresponsive
state.


Do you have the memory-free-min set?  Resin should restart itself before
it gets to OOM.

The problem with OOM is that errors and behavior start becoming undefined.
Basically, it's not possible to really handle OOM other than restarting
the system.  The memory-free-min makes sure Resin restarts before that
situation occurs.
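
For reference, a minimal sketch of one plausible place to set
memory-free-min in a Resin 3.1 resin.conf.  The element placement, heap
size, threshold, and server address are illustrative assumptions, not
values taken from this thread:

    <resin xmlns="http://caucho.com/ns/resin">
      <cluster id="app-tier">
        <server-default>
          <!-- illustrative heap ceiling; use your real size -->
          <jvm-arg>-Xmx1024m</jvm-arg>
          <!-- assumption: when free heap stays below this threshold,
               Resin exits so the watchdog can restart it before a
               hard OOM wedges the server -->
          <memory-free-min>1m</memory-free-min>
        </server-default>
        <server id="a" address="192.168.0.10" port="6800"/>
      </cluster>
    </resin>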


-- Scott



Re: [Resin-interest] Resin cluster failure with a single node running out of heap space

2008-03-10 Thread Sam
 We are currently looking at our watchdog process config to see why it did
 not auto-restart Resin.  I think we didn't give the watchdog enough memory
 headroom to detect that a restart was needed, and our app lost
 responsiveness before the watchdog could restart it.  But that's just a
 theory.

The low-memory detection happens within the server itself.  If the server
detects that memory is about to be exhausted, it exits.  The watchdog then
notices that the server did not exit cleanly and starts a new server to
replace it.
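
In code terms, the pattern Sam describes is roughly the sketch below.
This is a hypothetical illustration, not Resin's actual implementation;
the class name, polling interval, and exit status are all made up:

    // Hypothetical sketch of a low-memory guard: poll the JVM's free heap
    // and exit abnormally when it drops below a threshold, so an external
    // watchdog can start a replacement before a hard OOM wedges the server.
    public class LowMemoryGuard implements Runnable {
        private final long minFreeBytes;  // analogous to memory-free-min

        public LowMemoryGuard(long minFreeBytes) {
            this.minFreeBytes = minFreeBytes;
        }

        private static long estimatedFree() {
            Runtime rt = Runtime.getRuntime();
            // free heap now, plus room the heap may still grow into
            return rt.freeMemory() + (rt.maxMemory() - rt.totalMemory());
        }

        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                if (estimatedFree() < minFreeBytes) {
                    System.gc();  // one forced collection before giving up
                    if (estimatedFree() < minFreeBytes) {
                        // Unclean exit: the watchdog sees the nonzero
                        // status and starts a fresh server process.
                        System.exit(1);
                    }
                }
                try {
                    Thread.sleep(10000);  // poll every 10 seconds
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }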

-- Sam



___
resin-interest mailing list
resin-interest@caucho.com
http://maillist.caucho.com/mailman/listinfo/resin-interest