Hello all, I'm Mikel. My workmates and I have spent a while now testing our environment in order to set up an in-memory session replication cluster for our servers. The thing is that our servers often carry a thousand (1000) sessions or more (and we now have three Tomcat nodes on different machines behind a load balancer). At peak times I have counted up to 4500 sessions distributed across the three servers.
So, I'll describe the main configuration of our production environment.

Web servers: two Windows Server 2003 machines with IIS and the isapi_redirect.dll connector.

App servers:
nodes 1 & 2: IBM xSeries 3550, Intel Xeon CPU 5150 @ 2.66 GHz, 2.00 GB RAM, Windows Server 2003 R2
node 3: IBM xSeries 366, Intel Xeon CPU @ 3.20 GHz, 3.00 GB RAM, Windows Server 2003 R2

In our development environment, where we have been running our tests, we have two Tomcats on two different machines:
node 1: Intel Xeon CPU E5440 @ 2.83 GHz, 1.00 GB RAM, Windows Server 2003 R2
node 2: IBM xSeries 3550, Intel Xeon CPU E5440 @ 2.83 GHz, 2 GB RAM, Windows Server 2003 R2

We first tested the Tomcat 5.5.9 that we are using in production and ran into trouble, even after applying the clustering fix pack from https://issues.apache.org/bugzilla/show_bug.cgi?id=34389 . In the end we decided to upgrade to the latest available Tomcat 6, version 6.0.18, to see whether the announced refactoring of the cluster subsystem would solve our problems.

After this pretty long intro, here is the reason for my request. When we test the cluster (in the development environment) with both nodes running, we create up to a thousand sessions, keeping about 500 of them alive and modifying them, and we can see that all of them get replicated to the other node quickly. The trouble comes when, after shutting down one of the instances, we start it again (while half of the live sessions are still being modified by the JMeter test). In our tests, the starting Tomcat instance eventually hangs while receiving sessions from the live node.

These are the traces seen in the catalina.log of node 1 when starting it:

Jan 26, 2009 6:51:56 PM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: C:\tomcat-6.0.18\bin;.;C:\WINDOWS\system32;C:\WINDOWS;C:\Program Files\Serena\Dimensions 10.1\CM\prog;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\system32\WBEM;C:\Program Files\IBM\Director\bin;C:\Program Files\Common Files\IBM\ICC\cimom\bin;C:\Program Files\System Center Operations Manager 2007\
Jan 26, 2009 6:51:56 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-9080
Jan 26, 2009 6:51:56 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-9081
Jan 26, 2009 6:51:56 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 1928 ms
Jan 26, 2009 6:51:56 PM org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
Jan 26, 2009 6:51:56 PM org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.18
Jan 26, 2009 6:51:56 PM org.apache.catalina.ha.tcp.SimpleTcpCluster start
INFO: Cluster is about to start
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.transport.ReceiverBase bind
INFO: Receiver Server Socket bound to:/172.26.102.233:4009
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.membership.McastServiceImpl setupSocket
INFO: Attempting to bind the multicast socket to /228.0.0.9:45569
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.membership.McastServiceImpl setupSocket
INFO: Binding to multicast address, failed. Binding to port only.
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.membership.McastServiceImpl setupSocket
INFO: Setting multihome multicast interface to:/172.26.102.233
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.membership.McastServiceImpl setupSocket
INFO: Setting cluster mcast soTimeout to 1000
Jan 26, 2009 6:51:56 PM org.apache.catalina.tribes.membership.McastServiceImpl waitForMembers
INFO: Sleeping for 2000 milliseconds to establish cluster membership, start level:4
Jan 26, 2009 6:51:57 PM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -60}:4009,{-84, 26, 102, -60},4009, alive=1938953,id={-57 67 34 -23 -38 83 74 68 -67 -87 -112 -94 13 102 -78 -20 }, payload={}, command={}, domain={}, ]
Jan 26, 2009 6:51:58 PM org.apache.catalina.tribes.membership.McastServiceImpl waitForMembers
INFO: Done sleeping, membership established, start level:4
Jan 26, 2009 6:51:58 PM org.apache.catalina.tribes.membership.McastServiceImpl waitForMembers
INFO: Sleeping for 2000 milliseconds to establish cluster membership, start level:8
Jan 26, 2009 6:51:58 PM org.apache.catalina.tribes.io.BufferPool getBufferPool
INFO: Created a buffer pool with max size:104857600 bytes of type:org.apache.catalina.tribes.io.BufferPool15Impl
Jan 26, 2009 6:52:00 PM org.apache.catalina.tribes.membership.McastServiceImpl waitForMembers
INFO: Done sleeping, membership established, start level:8
Jan 26, 2009 6:52:14 PM org.apache.catalina.ha.session.DeltaManager start
INFO: Register manager to cluster element Engine with name Catalina
Jan 26, 2009 6:52:14 PM org.apache.catalina.ha.session.DeltaManager start
INFO: Starting clustering manager at
Jan 26, 2009 6:52:14 PM org.apache.catalina.ha.session.DeltaManager getAllClusterSessions
WARNING: Manager [localhost#], requesting session state from org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -60}:4009,{-84, 26, 102, -60},4009, alive=1955953,id={-57 67 34 -23 -38 83 74 68 -67 -87 -112 -94 13 102 -78 -20 }, payload={}, command={}, domain={}, ]. This operation will timeout if no session state has been received within -1 seconds.
Jan 26, 2009 7:00:50 PM org.apache.catalina.startup.Catalina stopServer
SEVERE: Catalina.stop:
java.net.ConnectException: Connection refused: connect
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:364)
        at java.net.Socket.connect(Socket.java:507)
        at java.net.Socket.connect(Socket.java:457)
        at java.net.Socket.<init>(Socket.java:365)
        at java.net.Socket.<init>(Socket.java:178)
        at org.apache.catalina.startup.Catalina.stopServer(Catalina.java:421)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.catalina.startup.Bootstrap.stopServer(Bootstrap.java:337)
        at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:415)

The last trace appears because, after nearly ten minutes of waiting, we decided to shut the Tomcat instance down ourselves.
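(A side note on the "-1 seconds" in that WARNING: as far as we understand, stateTransferTimeout on the DeltaManager is given in seconds, and -1 disables the timeout, so the starting node blocks in getAllClusterSessions until the complete session state has arrived. A minimal sketch of the same Manager with a bounded wait instead; the 120 is only an illustrative value, not something we have tested:

<!-- Sketch only: same DeltaManager as in our config, but giving up on the
     state transfer after 120 seconds instead of waiting forever. -->
<Manager className="org.apache.catalina.ha.session.DeltaManager"
         name="clusterPruebas6"
         stateTransferTimeout="120"
         expireSessionsOnShutdown="false"
         notifyListenersOnReplication="true"/>

With a positive value the starting node would at least finish starting after that many seconds, although possibly with an incomplete session set.)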
In the running Tomcat (node 2; its JVM logs with a Spanish locale, so ADVERTENCIA means WARNING):

26-ene-2009 18:51:58 org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -23}:4009,{-84, 26, 102, -23},4009, alive=2047,id={-20 60 112 -113 35 -41 71 -5 -124 47 93 -37 117 -9 -9 29 }, payload={}, command={}, domain={}, ]
26-ene-2009 19:00:51 org.apache.catalina.tribes.transport.nio.NioReplicationTask run
ADVERTENCIA: IOException in replication worker, unable to drain channel. Probable cause: Keep alive socket closed[An existing connection was forcibly closed by the remote host].
26-ene-2009 19:00:53 org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -23}:4009,{-84, 26, 102, -23},4009, alive=533282,id={-20 60 112 -113 35 -41 71 -5 -124 47 93 -37 117 -9 -9 29 }, payload={}, command={}, domain={}, ]] message. Will verify.
26-ene-2009 19:00:54 org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Verification complete. Member disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -23}:4009,{-84, 26, 102, -23},4009, alive=533282,id={-20 60 112 -113 35 -41 71 -5 -124 47 93 -37 117 -9 -9 29 }, payload={}, command={}, domain={}, ]]
26-ene-2009 19:00:54 org.apache.catalina.ha.tcp.SimpleTcpCluster memberDisappeared
INFO: Received member disappeared:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-84, 26, 102, -23}:4009,{-84, 26, 102, -23},4009, alive=533282,id={-20 60 112 -113 35 -41 71 -5 -124 47 93 -37 117 -9 -9 29 }, payload={}, command={}, domain={}, ]

Our cluster configuration on the nodes (as you can see, we configured the cluster at the Engine level):

Node 1:

<Engine name="Catalina" defaultHost="localhost" jvmRoute="worker62">
  <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
    <Manager className="org.apache.catalina.ha.session.DeltaManager"
             name="clusterPruebas6"
             stateTransferTimeout="-1"
             expireSessionsOnShutdown="false"
             notifyListenersOnReplication="true"/>
    <Channel className="org.apache.catalina.tribes.group.GroupChannel">
      <Membership className="org.apache.catalina.tribes.membership.McastService"
                  address="228.0.0.9"
                  bind="172.26.102.233"
                  port="45569"
                  frequency="1000"
                  dropTime="3000"/>
      <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
                address="172.26.102.233"
                port="4009"
                autoBind="100"
                selectorTimeout="5000"
                maxThreads="12"/>
      <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
        <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"/>
      </Sender>
      <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
      <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor"/>
    </Channel>
  </Cluster>
  ...
Node 2:

<Engine name="Catalina" defaultHost="localhost" jvmRoute="worker66">
  <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
    <Manager className="org.apache.catalina.ha.session.DeltaManager"
             name="clusterPruebas6"
             stateTransferTimeout="-1"
             expireSessionsOnShutdown="false"
             notifyListenersOnReplication="true"/>
    <Channel className="org.apache.catalina.tribes.group.GroupChannel">
      <Membership className="org.apache.catalina.tribes.membership.McastService"
                  address="228.0.0.9"
                  bind="172.26.102.196"
                  port="45569"
                  frequency="1000"
                  dropTime="5000"/>
      <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
                address="172.26.102.196"
                port="4009"
                autoBind="100"
                selectorTimeout="100"
                maxThreads="12"/>
      <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
        <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"
                   maxRetryAttempts="0"/>
      </Sender>
      <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
      <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor"/>
    </Channel>
  </Cluster>

So, after all this info, here are my questions.

1) Considering the high load on our servers, and even though we think in-memory replication matches our expectations better than database or file persistence, is in-memory replication feasible at this number of sessions, or is it discouraged?

2) In our cluster configuration we have been testing with stateTransferTimeout set to -1 to keep DeltaManager.getAllClusterSessions from giving up, because it is very important for us that all of the sessions get replicated to the starting node. Should we set some other value here?

3) Any other suggestions about our configuration?

Thank you very much.

Mikel Ibiricu
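P.S. One more thought related to question 1: as far as we understand, the DeltaManager replicates every session to every node in the cluster, while org.apache.catalina.ha.session.BackupManager replicates each session to a single backup node, which might scale better with thousands of sessions. Purely as a sketch of what we imagine trying, untested on our side:

<!-- Hypothetical alternative for question 1: BackupManager replicates each
     session to one backup node instead of to every node (untested by us). -->
<Manager className="org.apache.catalina.ha.session.BackupManager"
         notifyListenersOnReplication="true"/>

Would that be a more reasonable fit for our session counts?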