Re: Cluster setup stopped working after 3 months in production

Igor Cicimov Tue, 12 Aug 2014 17:28:26 -0700

On 12/08/2014 7:47 PM, "Krishna Saranathan" <krishna.saran...@gmail.com>
wrote:
>
> It's a Linux distro.
> Linux version 2.6.32-358.14.1.el6.x86_64 (
> mockbu...@x86-022.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313
> (Red Hat 4.4.7-3) (GCC) ) #1 SMP Mon Jun 17 15:54:20 EDT 2013
>
> Java version - 1.6 update 45.
>
> I doubt a security group change was suddenly applied to the port. I am
> able to telnet from the server where Tomcat is shut down to port 4444 on
> the currently running server. Yes, an OS restart was done for a hardware
> upgrade of RAM and disk volume.
>


Well, your logs clearly show the member can't establish a connection to
10.160.40.12:4444.
Did you telnet to that exact IP and port, or did you use something else,
like an internal DNS name? Note that some instances on AWS change some
parameters upon restart, so check in your console to confirm they have the
IPs you expect them to have.
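
If telnet is not available, the same check can be scripted. Below is a
minimal Java sketch (not part of Tomcat) that tries a plain TCP connection
to the exact IP and port from the log. The class name PortCheck, the method
canConnect and the 5 second timeout are made up for illustration; the
default address is the one the log reports as faulty.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {

    // Returns true if a TCP connection to host:port can be opened within
    // timeoutMs milliseconds, false on refusal, timeout or unknown host.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are the member address reported as faulty in the log;
        // pass a different host and port as arguments to override them.
        String host = args.length > 0 ? args[0] : "10.160.40.12";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 4444;

        boolean ok = canConnect(host, port, 5000);
        System.out.println(host + ":" + port
                + (ok ? " is reachable" : " is NOT reachable"));
    }
}
```

Run it on each node against the other node's receiver address, using the
literal IP from server.xml rather than a DNS name, so the test exercises the
same path Tribes uses.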

>
> On Tue, Aug 12, 2014 at 6:58 AM, Igor Cicimov <icici...@gmail.com> wrote:
>
> > On 12/08/2014 4:24 PM, "Krishna Saranathan" <krishna.saran...@gmail.com>
> > wrote:
> > >
> > > We have a J2EE WAR application deployed in a cluster setup with two
> > > nodes. Tomcat 6.0.39 is installed on both nodes, with an identical
> > > WAR deployed on each. It's deployed in the Amazon AWS environment, and the
> >
> > What distro? Win or linux? And if linux which one?
> >
> > > two EC2 nodes are behind an ELB, with session stickiness enabled for
> > > JSESSIONID. Session replication is also enabled on the two Tomcat
> > > nodes.
> > >
> > > The Cluster configuration from the updated server.xml file is as follows:
> > >
> > > =============================================================================
> > > <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"
> > >          channelSendOptions="6" channelStartOptions="3">
> > >
> > >   <Manager className="org.apache.catalina.ha.session.DeltaManager"
> > >            expireSessionsOnShutdown="false" notifyListenersOnReplication="true" />
> > >
> > >   <Channel className="org.apache.catalina.tribes.group.GroupChannel">
> > >
> > >     <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
> > >               autoBind="0" selectorTimeout="5000" maxThreads="6"
> > >               address="x.x.x.x" port="4444" />
> > >     <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
> > >       <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"
> > >                  timeout="60000"
> > >                  keepAliveTime="10"
> > >                  keepAliveCount="0" />
> > >     </Sender>
> > >     <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpPingInterceptor"
> > >                  staticOnly="true"/>
> > >     <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
> > >     <Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
> > >       <Member className="org.apache.catalina.tribes.membership.StaticMember"
> > >               host="x.x.x.x"
> > >               port="4444"
> > >               uniqueId="{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4}"/>
> > >     </Interceptor>
> > >   </Channel>
> > >   <Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter="" />
> > >   <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve" />
> > >   <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/>
> > >   <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener"/>
> > > </Cluster>
> > >
> > > ==========================================================================
> > >
> > > The receiver IP, static member IP and unique ID are different in the
> > > server.xml of the other node in the cluster.
> > >
> > > This ran fine in the production environment for 3 months. Then the
> > > following exception was suddenly logged, and it has kept repeating
> > > endlessly since:
> > >
> > >
> > > ==================================================
> > > Aug 6, 2014 12:00:39 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
> > > INFO: Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://10.160.40.12:4444,10.160.40.12,4444, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 }, payload={}, command={}, domain={}, ]] message. Will verify.
> > > Aug 6, 2014 12:00:39 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
> > > INFO: Verification complete. Member still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://10.160.40.12:4444,10.160.40.12,4444, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 }, payload={}, command={}, domain={}, ]]
> > > Aug 6, 2014 12:00:39 AM org.apache.catalina.ha.tcp.SimpleTcpCluster send
> > > SEVERE: Unable to send message through cluster sender.
> > > org.apache.catalina.tribes.ChannelException: Operation has timed out(60000 ms.).; Faulty members:tcp://10.160.40.12:4444;
> > >         at org.apache.catalina.tribes.transport.nio.ParallelNioSender.sendMessage(ParallelNioSender.java:97)
> > >         at org.apache.catalina.tribes.transport.nio.PooledParallelSender.sendMessage(PooledParallelSender.java:53)
> > >         at org.apache.catalina.tribes.transport.ReplicationTransmitter.sendMessage(ReplicationTransmitter.java:80)
> > >         at org.apache.catalina.tribes.group.ChannelCoordinator.sendMessage(ChannelCoordinator.java:76)
> > >         at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
> > >         at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
> > >         at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.sendMessage(TcpFailureDetector.java:88)
> > >         at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
> > >         at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelInterceptorBase.java:75)
> > >         at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:216)
> > >         at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:175)
> > >         at org.apache.catalina.ha.tcp.SimpleTcpCluster.send(SimpleTcpCluster.java:817)
> > >         at org.apache.catalina.ha.tcp.SimpleTcpCluster.sendClusterDomain(SimpleTcpCluster.java:791)
> > >         at org.apache.catalina.ha.tcp.ReplicationValve.send(ReplicationValve.java:553)
> > >         at org.apache.catalina.ha.tcp.ReplicationValve.sendMessage(ReplicationValve.java:537)
> > >         at org.apache.catalina.ha.tcp.ReplicationValve.sendSessionReplicationMessage(ReplicationValve.java:519)
> > >         at org.apache.catalina.ha.tcp.ReplicationValve.sendReplicationMessage(ReplicationValve.java:430)
> > >         at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:363)
> > >         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
> > >         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
> > >         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
> > >         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
> > >         at java.lang.Thread.run(Thread.java:662)
> > >
> > > ============================================================================
> > >
> > >
> > > After this, the web application is not accessible, and we have to
> > > manually kill the tomcat process in one node, thereby disabling the
> > > cluster.
> > >
> > >
> > > We are unsure why this started happening all of a sudden and why it
> > > disables application access altogether. If you have any suggestions
> > > for a remedy, please share them.
> >
> > Firewall???
> > Did you change something in the SecurityGroup the instances belong to
> > that might have affected port 4444? Can you telnet from the server where
> > you shut down Tomcat to port 4444 on the server where Tomcat is running?
> > Did you do a restart or an OS update/upgrade that might have pulled in
> > some firewall package and activated it afterwards?
> >
