Hi Trygve,
Thanks for reporting these problems with such comprehensive log
messages; they were very helpful for diagnosing the issues.
The bug causing the inability to reliably restart without downtime
has been identified and fixed. I just deployed new WADI 2.2-SNAPSHOT
artifacts, which can be found here:
http://snapshots.repository.codehaus.org/org/codehaus/wadi/
Can you please confirm that this is now working as expected within
your environment?
Regarding the warning message
16:22:43,524 WARN [UpdateReplicationCommand] Update has not been
properly cascaded due to a communication failure. If a targeted
node has been lost, state will be re-balanced automatically.
the timeout is not exposed via the API. Even when this warning message
is displayed, the update message has still been queued for delivery to
the relevant back-up nodes. However, there is no guarantee that, in
case of failure, the cluster will be able to restore the latest
session state, as the latest update message may not have actually been
delivered to the back-up nodes.
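For illustration only, here is a minimal Java sketch of that behaviour (a hypothetical class, not WADI's actual code or timeout values): the update is queued first no matter what, and a missed acknowledgement within the timeout only downgrades to a warning, so delivery of the latest update is never guaranteed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, NOT WADI's actual classes: an update is queued for
// the back-up peers regardless of outcome; the sender then waits a bounded
// time for an acknowledgement, and a timeout only produces a warning.
public class BestEffortReplication {
    private final LinkedBlockingQueue<String> outbound = new LinkedBlockingQueue<>();

    /** Queue an update, then wait up to timeoutMs for the back-up's ack. */
    public boolean replicate(String update, CompletableFuture<Void> ack, long timeoutMs) {
        outbound.add(update); // queued regardless of the outcome below
        try {
            ack.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true; // back-up confirmed within the timeout
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            // Corresponds to the WARN in the log: the message stays queued,
            // but it may never actually reach the back-up node.
            System.out.println("WARN: update not cascaded within " + timeoutMs + "ms");
            return false;
        }
    }

    public int queued() {
        return outbound.size();
    }
}
```

This is why the warning is benign when the peer is merely slow (the queued message is still delivered later) but can lose the latest session state if the peer is actually gone.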
I conducted load testing on a single computer. Since network traffic
going through the loopback interface does not actually touch the
network card, this is a problem I was never able to observe.
Do you have an idea of the traffic volume at which this warning
message starts to appear? Also, do you have an idea of the session size?
Thanks,
Gianny
On 31/10/2009, at 3:39 AM, Trygve Hardersen wrote:
Hello
We have been using Geronimo 2.2-SNAPSHOT in production for a good
month now, and I thought I'd share some experiences with the
community, and maybe get some help. We are an online backup
service; check out jottabackup.com if you're interested.
Generally our experience has been very positive. We're using the
GBean framework for custom server components, the DB connection
pools against MySQL databases, stateless service EJBs and various
MDBs, and of course the web tier (Jetty). Everything is running
smoothly and we're very much looking forward to 2.2 being released
so we can "release" our own software.
The issues we're having are related to WADI clustering with Jetty.
First, we can't use Jetty7 because of GERONIMO-4846, so we're using
Jetty6, which works fine. The more serious issue is that we often
cannot update our servers without downtime. This is what happens:
We have 2 application servers (AS-000 and AS-001) running dynamic
WADI HTTP session replication between them. When updating we first
stop one, AS-000 in this case. That works fine and the active
sessions are migrated to AS-001:
23:43:18,160 INFO [SimpleStateManager]
=============================
New Partition Balancing
Partition Balancing
Size [24]
Partition[0] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[1] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[2] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[3] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[4] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[5] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[6] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[7] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[8] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[9] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[10] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[11] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[12] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[13] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[14] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[15] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[16] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[17] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[18] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[19] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[20] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[21] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[22] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
Partition[23] owned by [TribesPeer [AS-001; tcp://10.0.10.101:4000]]; version [3]; mergeVersion [0]
=============================
23:43:28,539 INFO [TcpFailureDetector] Verification complete.
Member disappeared[org.apache.catalina.tribes.membership.MemberImpl
[tcp://{10, 0, 10, 100}:4000,{10, 0, 10, 100},4000,
alive=41104531,id={-4 -32 54 90 -109 -17 65 64 -117 40 -110 -14 36
93 -12 -118 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)},
command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={74 79 84 84 65
95 87 65 68 ...(10)}, ]]
23:43:28,540 INFO [ChannelInterceptorBase] memberDisappeared:tcp://{10, 0, 10, 100}:4000
We then update AS-000 and try to start it, but it fails to rejoin
the WADI cluster:
23:46:30,784 INFO [ReceiverBase] Receiver Server Socket bound to:/10.0.10.100:4000
23:46:30,864 INFO [ChannelInterceptorBase] memberStart
local:org.apache.catalina.tribes.membership.MemberImpl[tcp://
10.0.10.100:4000,10.0.10.100,4000, alive=0,id={-103 34 80 -19 68
-51 70 -91 -108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5
115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95
87 65 68 ...(10)}, ] notify:false peer:AS-000
23:46:30,868 INFO [McastService] Setting cluster mcast soTimeout
to 500
23:46:30,908 INFO [McastService] Sleeping for 1000 milliseconds to
establish cluster membership, start level:4
23:46:31,139 INFO [ChannelInterceptorBase] memberAdded:tcp://{10, 0, 10, 101}:4000
23:46:31,908 INFO [McastService] Done sleeping, membership
established, start level:4
23:46:31,912 INFO [McastService] Sleeping for 1000 milliseconds to
establish cluster membership, start level:8
23:46:31,927 INFO [BufferPool] Created a buffer pool with max size:
104857600 bytes of type:org.apache.catalina.tribes.io.BufferPool15Impl
23:46:32,912 INFO [McastService] Done sleeping, membership
established, start level:8
23:46:32,912 INFO [ChannelInterceptorBase] memberStart
local:org.apache.catalina.tribes.membership.MemberImpl[tcp://
10.0.10.100:4000,10.0.10.100,4000, alive=272,id={-103 34 80 -19 68
-51 70 -91 -108 39 -84 65 50 50 103 -107 }, payload={-84 -19 0 5
115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95
87 65 68 ...(10)}, ] notify:false peer:AS-000
23:46:37,848 INFO [DiscStore] Creating directory: /usr/lib/jotta/
jotta-as-prod-1.0-SNAPSHOT/var/temp/SessionStore
23:46:37,930 INFO [BasicSingletonServiceHolder] [TribesPeer
[AS-000; tcp://10.0.10.100:4000]] owns singleton service
[PartitionManager for ServiceSpace [/]]
23:46:37,964 INFO [BasicSingletonServiceHolder] [TribesPeer
[AS-000; tcp://10.0.10.100:4000]] resigns ownership of singleton
service [PartitionManager for ServiceSpace [/]]
23:47:40,065 ERROR [BasicServiceRegistry] Error while starting [Holder for service [org.codehaus.wadi.location.partitionmanager.simplepartitionmana...@7dc2445f] named [PartitionManager] in space [ServiceSpace [/]]]
org.codehaus.wadi.location.partitionmanager.PartitionManagerException: Partition [0] is unknown.
at
org.codehaus.wadi.location.partitionmanager.SimplePartitionManager.wai
tForBoot(SimplePartitionManager.java:248)
at
org.codehaus.wadi.location.partitionmanager.SimplePartitionManager.sta
rt(SimplePartitionManager.java:119)
at org.codehaus.wadi.servicespace.basic.BasicServiceHolder.start
(BasicServiceHolder.java:60)
at org.codehaus.wadi.servicespace.basic.BasicServiceRegistry.start
(BasicServiceRegistry.java:152)
at org.codehaus.wadi.servicespace.basic.BasicServiceSpace.start
(BasicServiceSpace.java:169)
at
org.apache.geronimo.clustering.wadi.BasicWADISessionManager.doStart
(BasicWADISessionManager.java:125)
at org.apache.geronimo.gbean.runtime.GBeanInstance.createInstance
(GBeanInstance.java:948)
at
org.apache.geronimo.gbean.runtime.GBeanInstanceState.attemptFullStart(
GBeanInstanceState.java:269)
at org.apache.geronimo.gbean.runtime.GBeanInstanceState.start
(GBeanInstanceState.java:103)
at
org.apache.geronimo.gbean.runtime.GBeanInstanceState.startRecursive
(GBeanInstanceState.java:125)
at org.apache.geronimo.gbean.runtime.GBeanInstance.startRecursive
(GBeanInstance.java:538)
at org.apache.geronimo.kernel.basic.BasicKernel.startRecursiveGBean
(BasicKernel.java:377)
at
org.apache.geronimo.kernel.config.ConfigurationUtil.startConfiguration
GBeans(ConfigurationUtil.java:456)
at
org.apache.geronimo.kernel.config.ConfigurationUtil.startConfiguration
GBeans(ConfigurationUtil.java:493)
at
org.apache.geronimo.kernel.config.KernelConfigurationManager.start
(KernelConfigurationManager.java:190)
at
org.apache.geronimo.kernel.config.SimpleConfigurationManager.startConf
iguration(SimpleConfigurationManager.java:546)
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke
(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.geronimo.gbean.runtime.ReflectionMethodInvoker.invoke
(ReflectionMethodInvoker.java:34)
at org.apache.geronimo.gbean.runtime.GBeanOperation.invoke
(GBeanOperation.java:130)
at org.apache.geronimo.gbean.runtime.GBeanInstance.invoke
(GBeanInstance.java:815)
at org.apache.geronimo.gbean.runtime.RawInvoker.invoke
(RawInvoker.java:57)
at org.apache.geronimo.kernel.basic.RawOperationInvoker.invoke
(RawOperationInvoker.java:35)
at
org.apache.geronimo.kernel.basic.ProxyMethodInterceptor.intercept
(ProxyMethodInterceptor.java:96)
at org.apache.geronimo.gbean.GBeanLifecycle$$EnhancerByCGLIB$
$628b9237.startConfiguration(<generated>)
at org.apache.geronimo.system.main.EmbeddedDaemon.doStartup
(EmbeddedDaemon.java:161)
at org.apache.geronimo.system.main.EmbeddedDaemon.execute
(EmbeddedDaemon.java:78)
at
org.apache.geronimo.kernel.util.MainConfigurationBootstrapper.main
(MainConfigurationBootstrapper.java:45)
at org.apache.geronimo.cli.AbstractCLI.executeMain
(AbstractCLI.java:65)
at org.apache.geronimo.cli.daemon.DaemonCLI.main(DaemonCLI.java:30)
23:47:40,078 ERROR [BasicWADISessionManager] Failed to stop
org.codehaus.wadi.servicespace.ServiceSpaceNotFoundException: ServiceSpaceName not found
at
org.codehaus.wadi.servicespace.basic.BasicServiceSpaceRegistry.unregis
ter(BasicServiceSpaceRegistry.java:55)
at
org.codehaus.wadi.servicespace.basic.BasicServiceSpace.unregisterServi
ceSpace(BasicServiceSpace.java:228)
at org.codehaus.wadi.servicespace.basic.BasicServiceSpace.stop
(BasicServiceSpace.java:175)
at
org.apache.geronimo.clustering.wadi.BasicWADISessionManager.doFail
(BasicWADISessionManager.java:134)
at org.apache.geronimo.gbean.runtime.GBeanInstance.createInstance
(GBeanInstance.java:978)
at
org.apache.geronimo.gbean.runtime.GBeanInstanceState.attemptFullStart(
GBeanInstanceState.java:269)
at org.apache.geronimo.gbean.runtime.GBeanInstanceState.start
(GBeanInstanceState.java:103)
at
org.apache.geronimo.gbean.runtime.GBeanInstanceState.startRecursive
(GBeanInstanceState.java:125)
at org.apache.geronimo.gbean.runtime.GBeanInstance.startRecursive
(GBeanInstance.java:538)
at org.apache.geronimo.kernel.basic.BasicKernel.startRecursiveGBean
(BasicKernel.java:377)
at
org.apache.geronimo.kernel.config.ConfigurationUtil.startConfiguration
GBeans(ConfigurationUtil.java:456)
at
org.apache.geronimo.kernel.config.ConfigurationUtil.startConfiguration
GBeans(ConfigurationUtil.java:493)
at
org.apache.geronimo.kernel.config.KernelConfigurationManager.start
(KernelConfigurationManager.java:190)
at
org.apache.geronimo.kernel.config.SimpleConfigurationManager.startConf
iguration(SimpleConfigurationManager.java:546)
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke
(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.geronimo.gbean.runtime.ReflectionMethodInvoker.invoke
(ReflectionMethodInvoker.java:34)
at org.apache.geronimo.gbean.runtime.GBeanOperation.invoke
(GBeanOperation.java:130)
at org.apache.geronimo.gbean.runtime.GBeanInstance.invoke
(GBeanInstance.java:815)
at org.apache.geronimo.gbean.runtime.RawInvoker.invoke
(RawInvoker.java:57)
at org.apache.geronimo.kernel.basic.RawOperationInvoker.invoke
(RawOperationInvoker.java:35)
at
org.apache.geronimo.kernel.basic.ProxyMethodInterceptor.intercept
(ProxyMethodInterceptor.java:96)
at org.apache.geronimo.gbean.GBeanLifecycle$$EnhancerByCGLIB$
$628b9237.startConfiguration(<generated>)
at org.apache.geronimo.system.main.EmbeddedDaemon.doStartup
(EmbeddedDaemon.java:161)
at org.apache.geronimo.system.main.EmbeddedDaemon.execute
(EmbeddedDaemon.java:78)
at
org.apache.geronimo.kernel.util.MainConfigurationBootstrapper.main
(MainConfigurationBootstrapper.java:45)
at org.apache.geronimo.cli.AbstractCLI.executeMain
(AbstractCLI.java:65)
at org.apache.geronimo.cli.daemon.DaemonCLI.main(DaemonCLI.java:30)
After this failure the server stops. On the running instance AS-001,
the following is logged:
23:46:31,909 INFO [ChannelInterceptorBase] memberAdded:tcp://{10, 0, 10, 100}:4000
23:46:37,929 INFO [BasicSingletonServiceHolder] [TribesPeer
[AS-001; tcp://10.0.10.101:4000]] owns singleton service
[PartitionManager for ServiceSpace [/]]
23:46:37,929 INFO [BasicPartitionBalancerSingletonService]
Queueing partition rebalancing
23:46:38,438 ERROR [BasicEnvelopeDispatcherManager] problem dispatching message
java.lang.IllegalArgumentException: org.codehaus.wadi.core.store.BasicStoreMotable is not a Session
at
org.codehaus.wadi.replication.manager.basic.SessionStateHandler.newExt
ractFullStateExternalizable(SessionStateHandler.java:105)
at
org.codehaus.wadi.replication.manager.basic.SessionStateHandler.extrac
tFullState(SessionStateHandler.java:53)
at
org.codehaus.wadi.replication.manager.basic.CreateStorageCommand.execu
te(CreateStorageCommand.java:45)
at
org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updat
eSecondaries(SyncSecondaryManager.java:169)
at
org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updat
eSecondaries(SyncSecondaryManager.java:114)
at
org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updat
eSecondaries(SyncSecondaryManager.java:103)
at
org.codehaus.wadi.replication.manager.basic.SyncSecondaryManager.updat
eSecondariesFollowingJoiningPeer(SyncSecondaryManager.java:75)
at
org.codehaus.wadi.replication.manager.basic.ReOrganizeSecondariesListe
ner.receive(ReOrganizeSecondariesListener.java:53)
at
org.codehaus.wadi.servicespace.basic.BasicServiceMonitor.notifyListene
rs(BasicServiceMonitor.java:124)
at
org.codehaus.wadi.servicespace.basic.BasicServiceMonitor.processLifecy
cleEvent(BasicServiceMonitor.java:141)
at org.codehaus.wadi.servicespace.basic.BasicServiceMonitor
$ServiceLifecycleEndpoint.dispatch(BasicServiceMonitor.java:148)
at org.codehaus.wadi.group.impl.ServiceEndpointWrapper.dispatch
(ServiceEndpointWrapper.java:50)
at org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager
$DispatchRunner.run(BasicEnvelopeDispatcherManager.java:121)
at org.codehaus.wadi.servicespace.basic.BasicServiceSpaceDispatcher
$ExecuteInThread.execute(BasicServiceSpaceDispatcher.java:102)
at
org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager.onEnvelope
(BasicEnvelopeDispatcherManager.java:100)
at org.codehaus.wadi.group.impl.AbstractDispatcher.doOnEnvelope
(AbstractDispatcher.java:104)
at org.codehaus.wadi.group.impl.AbstractDispatcher.onEnvelope
(AbstractDispatcher.java:100)
at
org.codehaus.wadi.servicespace.basic.ServiceSpaceEndpoint.dispatch
(ServiceSpaceEndpoint.java:49)
at org.codehaus.wadi.group.impl.ServiceEndpointWrapper.dispatch
(ServiceEndpointWrapper.java:50)
at org.codehaus.wadi.group.impl.BasicEnvelopeDispatcherManager
$DispatchRunner.run(BasicEnvelopeDispatcherManager.java:121)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run
(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
23:46:40,063 INFO [BasicPartitionBalancerSingletonService]
Queueing partition rebalancing
23:46:42,938 WARN [BasicPartitionBalancerSingletonService] Rebalancing has failed
org.codehaus.wadi.group.MessageExchangeException: No correlated messages received within [5000]ms
at
org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous
(AbstractDispatcher.java:174)
at
org.codehaus.wadi.location.balancing.BasicPartitionBalancer.fetchBalan
cingInfoState(BasicPartitionBalancer.java:85)
at
org.codehaus.wadi.location.balancing.BasicPartitionBalancer.balancePar
titions(BasicPartitionBalancer.java:69)
at
org.codehaus.wadi.location.balancing.BasicPartitionBalancerSingletonSe
rvice.run(BasicPartitionBalancerSingletonService.java:85)
at java.lang.Thread.run(Thread.java:619)
23:46:42,939 WARN [BasicPartitionBalancerSingletonService] Will
retry rebalancing in [500] ms
23:46:43,439 INFO [BasicPartitionBalancerSingletonService]
Queueing partition rebalancing
23:47:40,269 INFO [TcpFailureDetector] Verification complete.
Member disappeared[org.apache.catalina.tribes.membership.MemberImpl
[tcp://{10, 0, 10, 100}:4000,{10, 0, 10, 100},4000, alive=69401,id=
{-103 34 80 -19 68 -51 70 -91 -108 39 -84 65 50 50 103 -107 },
payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={66 65 66
89 45 65 76 69 88 ...(9)}, domain={74 79 84 84 65 95 87 65 68 ...
(10)}, ]]
23:47:40,271 INFO [ChannelInterceptorBase] memberDisappeared:tcp://{10, 0, 10, 100}:4000
23:47:40,271 INFO [BasicPartitionBalancerSingletonService]
Queueing partition rebalancing
If I try to start AS-000 again, the same thing happens. If we stop
AS-001, the following is logged:
23:49:18,695 INFO [SimpleStateManager] Evacuating partitions
23:49:18,699 INFO [BasicPartitionBalancerSingletonService]
Queueing partition rebalancing
23:49:23,698 WARN [SimpleStateManager] Partition balancer has
disappeared - backing off for [1000]ms
23:49:24,699 INFO [BasicPartitionBalancerSingletonService]
Queueing partition rebalancing
23:49:29,698 WARN [SimpleStateManager] Partition balancer has
disappeared - backing off for [1000]ms
23:49:30,699 INFO [BasicPartitionBalancerSingletonService]
Queueing partition rebalancing
23:49:35,699 WARN [SimpleStateManager] Partition balancer has
disappeared - backing off for [1000]ms
23:49:36,700 INFO [BasicPartitionBalancerSingletonService]
Queueing partition rebalancing
23:49:41,699 WARN [SimpleStateManager] Partition balancer has
disappeared - backing off for [1000]ms
23:49:42,701 INFO [BasicPartitionBalancerSingletonService]
Queueing partition rebalancing
23:49:47,700 WARN [SimpleStateManager] Partition balancer has
disappeared - backing off for [1000]ms
23:49:48,700 INFO [SimpleStateManager] Evacuated
23:49:48,808 INFO [AbstractExclusiveContextualiser] Unloaded
sessions=[36]
23:49:48,843 INFO [AbstractExclusiveContextualiser] Unloaded
sessions=[13]
23:49:58,852 INFO [BasicSingletonServiceHolder] [TribesPeer
[AS-001; tcp://10.0.10.101:4000]] resigns ownership of singleton
service [PartitionManager for ServiceSpace [/]]
However, AS-001 then just hangs, and we have to kill the process
to get it stopped. After this we can start AS-000, update AS-001,
and it then always seems to join the cluster without problems.
The strange thing is that this problem does not always occur;
sometimes everything goes fine. I can't find a consistent pattern,
but I've tried stopping AS-001 before AS-000, and I'm sure no
serializable object in the session has changed between the updated
and running instances.
My gut feeling is that this is either a concurrency-related bug in
WADI or a network-timeout-related problem. During normal operation
I sometimes see messages like this in the log files:
17:14:08,869 INFO [TcpFailureDetector] Received memberDisappeared
[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 0, 10,
101}:4000,{10, 0, 10, 101},4000, alive=95659954,id={-52 -76 98 22
10 71 76 -72 -122 -59 -21 -29 44 -86 38 114 }, payload={-84 -19 0 5
115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95
87 65 68 ...(10)}, ]] message. Will verify.
17:14:08,870 INFO [TcpFailureDetector] Verification complete.
Member still alive[org.apache.catalina.tribes.membership.MemberImpl
[tcp://{10, 0, 10, 101}:4000,{10, 0, 10, 101},4000,
alive=95659954,id={-52 -76 98 22 10 71 76 -72 -122 -59 -21 -29 44
-86 38 114 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)},
command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]
And lately, as traffic has increased, errors like this:
16:22:43,524 WARN [UpdateReplicationCommand] Update has not been
properly cascaded due to a communication failure. If a targeted
node has been lost, state will be re-balanced automatically.
org.codehaus.wadi.servicespace.ServiceInvocationException: org.codehaus.wadi.group.MessageExchangeException: No correlated messages received within [2000]ms
at org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory
$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)
at org.codehaus.wadi.replication.storage.ReplicaStorage$
$EnhancerByCGLIB$$a901e91b.mergeUpdate(<generated>)
at
org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.c
ascadeUpdate(UpdateReplicationCommand.java:93)
at
org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.r
un(UpdateReplicationCommand.java:86)
at
org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.upd
ate(SyncReplicationManager.java:138)
at
org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager.
update(LoggingReplicationManager.java:100)
at
org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessi
ng(AbstractReplicableSession.java:49)
at
org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProces
sing(AtomicallyReplicableSession.java:58)
at
org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess
(WADISessionAdaptor.java:77)
at
org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete
(ClusteredSessionManager.java:60)
at org.mortbay.jetty.servlet.SessionHandler.handle
(SessionHandler.java:198)
at
org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle
(ClusteredSessionHandler.java:59)
at org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler
$ActualHandler.handle(ClusteredSessionHandler.java:66)
at org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler
$WebClusteredInvocation.invokeLocally
(AbstractClusteredPreHandler.java:71)
at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler
$WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java:
52)
at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler
$WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java:
64)
at org.codehaus.wadi.web.impl.WebInvocation.invoke
(WebInvocation.java:116)
at
org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocal
ly(MemoryContextualiser.java:71)
at
org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser.
handle(AbstractExclusiveContextualiser.java:94)
at
org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.con
textualise(AbstractMotingContextualiser.java:37)
at org.codehaus.wadi.core.manager.StandardManager.processStateful
(StandardManager.java:150)
at org.codehaus.wadi.core.manager.StandardManager.contextualise
(StandardManager.java:142)
at org.codehaus.wadi.core.manager.ClusteredManager.contextualise
(ClusteredManager.java:81)
at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler
$WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)
at
org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle(
AbstractClusteredPreHandler.java:39)
at
org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle
(ClusteredSessionHandler.java:51)
at org.mortbay.jetty.handler.ContextHandler.handle
(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle
(WebAppContext.java:417)
at org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access
$101(TwistyWebAppContext.java:41)
at org.apache.geronimo.jetty6.handler.TwistyWebAppContext
$TwistyHandler.handle(TwistyWebAppContext.java:66)
at
org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle
(ThreadClassloaderHandler.java:46)
at org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle
(InstanceContextHandler.java:58)
at org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle
(UserTransactionHandler.java:48)
at
org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle
(ComponentContextHandler.java:47)
at org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle
(TwistyWebAppContext.java:60)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle
(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle
(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle
(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest
(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.content
(HttpConnection.java:879)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run
(SelectChannelEndPoint.java:409)
at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)
at org.apache.geronimo.pool.ThreadPool
$ContextClassLoaderRunnable.run(ThreadPool.java:344)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run
(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.codehaus.wadi.group.MessageExchangeException: No correlated messages received within [2000]ms
at
org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous
(AbstractDispatcher.java:174)
at
org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers
(BasicServiceInvoker.java:90)
at org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke
(BasicServiceInvoker.java:69)
at org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory
$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)
... 49 more
Does anyone have some insight into what might be causing this, or
where the timeouts, if there are any, can be increased?
I'm thinking that a static WADI configuration might be more stable
than the dynamic setup we have now, which relies on multicasting.
Does anyone have experience with similar setups?
Thanks!
Trygve Hardersen - Jotta