Hello all, I'm Mikel. My workmates and I have spent a while now testing our environment in order to set up an in-memory session replication cluster for our servers. The thing is that our servers often carry a thousand (1000) sessions or more (and we now have three Tomcat nodes on different machines behind a load balancer). At peak times I have counted up to 4500 sessions distributed across the three servers.
So, I'll describe the main configuration of our production environment.

Web servers: two Windows Server 2003 machines with IIS and the isapi_redirect.dll connector.

App servers:
nodes 1 & 2: IBM xSeries 3550, Intel Xeon CPU 5150 @ 2.66 GHz, 2.00 GB RAM, Windows Server 2003 R2
node 3: IBM xSeries 366, Intel Xeon CPU @ 3.20 GHz, 3.00 GB RAM, Windows Server 2003 R2

In our development environment, where we have been running our tests, we have two Tomcats on two different machines:
node 1: Intel Xeon CPU E5440 @ 2.83 GHz, 1.00 GB RAM, Windows Server 2003 R2
node 2: IBM xSeries 3550, Intel Xeon CPU E5440 @ 2.83 GHz, 2 GB RAM, Windows Server 2003 R2

We first tested the Tomcat 5.5.9 that we are using in production and ran into trouble, even after applying the clustering fix pack from https://issues.apache.org/bugzilla/show_bug.cgi?id=34389 . In the end we decided to upgrade to the latest available Tomcat 6, version 6.0.18, to see whether the announced refactoring of the cluster subsystem would solve our problems.

After this pretty long intro, here is the reason for my request. When we test the cluster (in the development environment) with both nodes running, we create up to a thousand sessions, keeping about 500 of them alive and modifying them, and we can see that all of them get replicated to the other node quickly. The trouble comes when, after shutting down one of the instances, we start it again (while half of the live sessions are still being modified by the JMeter test). In our tests, the starting Tomcat instance eventually hangs while receiving sessions from the live node.

These are the traces seen in the catalina.log of node 1 when starting it:

Jan 26, 2009 6:51:56 PM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: C:\tomcat-6.0.18\bin;.;C:\WINDOWS\system32;C:\WINDOWS;C:\Program Files\Serena\Dimensions 10.1\CM\prog;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\system32\WBEM;C:\Program Files\IBM\Director\bin;C:\Program Files\Common Files\IBM\ICC\cimom\bin;C:\Program Files\System Center Operations Manager 2007\
Jan 26, 2009 6:51:56 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-9080
Jan 26, 2009 6:51:56 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-9081
Jan 26, 2009 6:51:56 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 1928 ms
Jan 26, 2009 6:51:56 PM org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
Jan 26, 2009 6:51:56 PM org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.18
Jan 26, 2009 6:51:56 PM org.apache.catalina.ha.tcp.SimpleTcpCluster start
INFO: Cluster is about to start
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.transport.ReceiverBase bind
INFO: Receiver Server Socket bound to:/172.26.102.233:4009
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.membership.McastServiceImpl setupSocket
INFO: Attempting to bind the multicast socket to /228.0.0.9:45569
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.membership.McastServiceImpl setupSocket
INFO: Binding to multicast address, failed. Binding to port only.
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.membership.McastServiceImpl setupSocket
INFO: Setting multihome multicast interface to:/172.26.102.233
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.membership.McastServiceImpl setupSocket
INFO: Setting cluster mcast soTimeout to 1000
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.membership.McastServiceImpl waitForMembers
INFO: Sleeping for 2000 milliseconds to establish cluster membership, start level:4
Jan 26, 2009 6:51:57 PM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -60}:4009,{-84, 26, 102, -60},4009, alive=1938953,id={-57 67 34 -23 -38 83 74 68 -67 -87 -112 -94 13 102 -78 -20 }, payload={}, command={}, domain={}, ]
Jan 26, 2009 6:51:58 PM org.apache.catalina.tribes.membership.McastServiceImpl waitForMembers
INFO: Done sleeping, membership established, start level:4
Jan 26, 2009 6:51:58 PM org.apache.catalina.tribes.membership.McastServiceImpl waitForMembers
INFO: Sleeping for 2000 milliseconds to establish cluster membership, start level:8
Jan 26, 2009 6:51:58 PM org.apache.catalina.tribes.io.BufferPool getBufferPool
INFO: Created a buffer pool with max size:104857600 bytes of type:org.apache.catalina.tribes.io.BufferPool15Impl
Jan 26, 2009 6:52:00 PM org.apache.catalina.tribes.membership.McastServiceImpl waitForMembers
INFO: Done sleeping, membership established, start level:8
Jan 26, 2009 6:52:14 PM org.apache.catalina.ha.session.DeltaManager start
INFO: Register manager to cluster element Engine with name Catalina
Jan 26, 2009 6:52:14 PM org.apache.catalina.ha.session.DeltaManager start
INFO: Starting clustering manager at
Jan 26, 2009 6:52:14 PM org.apache.catalina.ha.session.DeltaManager getAllClusterSessions
WARNING: Manager [localhost#], requesting session state from org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -60}:4009,{-84, 26, 102, -60},4009, alive=1955953,id={-57 67 34 -23 -38 83 74 68 -67 -87 -112 -94 13 102 -78 -20 }, payload={}, command={}, domain={}, ]. This operation will timeout if no session state has been received within -1 seconds.
Jan 26, 2009 7:00:50 PM org.apache.catalina.startup.Catalina stopServer
SEVERE: Catalina.stop:
java.net.ConnectException: Connection refused: connect
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:364)
        at java.net.Socket.connect(Socket.java:507)
        at java.net.Socket.connect(Socket.java:457)
        at java.net.Socket.<init>(Socket.java:365)
        at java.net.Socket.<init>(Socket.java:178)
        at org.apache.catalina.startup.Catalina.stopServer(Catalina.java:421)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.catalina.startup.Bootstrap.stopServer(Bootstrap.java:337)
        at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:415)

The last trace appears because, after nearly ten minutes of waiting, we decided to shut the Tomcat instance down ourselves.
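(A side note on the "-1 seconds" in that WARNING: as far as we understand, stateTransferTimeout on the DeltaManager is given in seconds, and -1 disables the timeout, so the starting node blocks in getAllClusterSessions until the complete session state has arrived. A minimal sketch of the same Manager with a bounded wait instead; the 120 is only an illustrative value, not something we have tested:

<!-- Sketch only: same DeltaManager as in our config, but giving up on the
     state transfer after 120 seconds instead of waiting forever. -->
<Manager className="org.apache.catalina.ha.session.DeltaManager"
         name="clusterPruebas6"
         stateTransferTimeout="120"
         expireSessionsOnShutdown="false"
         notifyListenersOnReplication="true"/>

With a positive value the starting node would at least finish starting after that many seconds, although possibly with an incomplete session set.)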
In the running Tomcat (node 2; its JVM logs with a Spanish locale, so ADVERTENCIA means WARNING):

26-ene-2009 18:51:58 org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -23}:4009,{-84, 26, 102, -23},4009, alive=2047,id={-20 60 112 -113 35 -41 71 -5 -124 47 93 -37 117 -9 -9 29 }, payload={}, command={}, domain={}, ]
26-ene-2009 19:00:51 org.apache.catalina.tribes.transport.nio.NioReplicationTask run
ADVERTENCIA: IOException in replication worker, unable to drain channel. Probable cause: Keep alive socket closed[An existing connection was forcibly closed by the remote host].
26-ene-2009 19:00:53 org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -23}:4009,{-84, 26, 102, -23},4009, alive=533282,id={-20 60 112 -113 35 -41 71 -5 -124 47 93 -37 117 -9 -9 29 }, payload={}, command={}, domain={}, ]] message. Will verify.
26-ene-2009 19:00:54 org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Verification complete. Member disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -23}:4009,{-84, 26, 102, -23},4009, alive=533282,id={-20 60 112 -113 35 -41 71 -5 -124 47 93 -37 117 -9 -9 29 }, payload={}, command={}, domain={}, ]]
26-ene-2009 19:00:54 org.apache.catalina.ha.tcp.SimpleTcpCluster memberDisappeared
INFO: Received member disappeared:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -23}:4009,{-84, 26, 102, -23},4009, alive=533282,id={-20 60 112 -113 35 -41 71 -5 -124 47 93 -37 117 -9 -9 29 }, payload={}, command={}, domain={}, ]

Our cluster configuration on the nodes (as you can see, we configured the cluster at the Engine level):

Node 1:

<Engine name="Catalina" defaultHost="localhost" jvmRoute="worker62">
  <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
    <Manager className="org.apache.catalina.ha.session.DeltaManager"
             name="clusterPruebas6"
             stateTransferTimeout="-1"
             expireSessionsOnShutdown="false"
             notifyListenersOnReplication="true"/>
    <Channel className="org.apache.catalina.tribes.group.GroupChannel">
      <Membership className="org.apache.catalina.tribes.membership.McastService"
                  address="228.0.0.9"
                  bind="172.26.102.233"
                  port="45569"
                  frequency="1000"
                  dropTime="3000"/>
      <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
                address="172.26.102.233"
                port="4009"
                autoBind="100"
                selectorTimeout="5000"
                maxThreads="12"/>
      <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
        <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"/>
      </Sender>
      <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
      <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor"/>
    </Channel>
  </Cluster>
  ...
Node 2:

<Engine name="Catalina" defaultHost="localhost" jvmRoute="worker66">
  <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
    <Manager className="org.apache.catalina.ha.session.DeltaManager"
             name="clusterPruebas6"
             stateTransferTimeout="-1"
             expireSessionsOnShutdown="false"
             notifyListenersOnReplication="true"/>
    <Channel className="org.apache.catalina.tribes.group.GroupChannel">
      <Membership className="org.apache.catalina.tribes.membership.McastService"
                  address="228.0.0.9"
                  bind="172.26.102.196"
                  port="45569"
                  frequency="1000"
                  dropTime="5000"/>
      <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
                address="172.26.102.196"
                port="4009"
                autoBind="100"
                selectorTimeout="100"
                maxThreads="12"/>
      <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
        <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"
                   maxRetryAttempts="0"/>
      </Sender>
      <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
      <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor"/>
    </Channel>
  </Cluster>

So, after all this info, here are my questions.

1) Considering the high load on our servers, and even though we think in-memory replication matches our expectations better than database or file persistence, is in-memory replication feasible at this number of sessions, or is it discouraged?

2) In our cluster configuration we have been testing with stateTransferTimeout set to -1 to keep DeltaManager.getAllClusterSessions from giving up, because it is very important for us that all of the sessions get replicated to the starting node. Should we set some other value here?

3) Any other suggestions about our configuration?

Thank you very much.

Mikel Ibiricu
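P.S. One more thought related to question 1: as far as we understand, the DeltaManager replicates every session to every node in the cluster, while org.apache.catalina.ha.session.BackupManager replicates each session to a single backup node, which might scale better with thousands of sessions. Purely as a sketch of what we imagine trying, untested on our side:

<!-- Hypothetical alternative for question 1: BackupManager replicates each
     session to one backup node instead of to every node (untested by us). -->
<Manager className="org.apache.catalina.ha.session.BackupManager"
         notifyListenersOnReplication="true"/>

Would that be a more reasonable fit for our session counts?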