I have encountered an issue with ActiveMQ where the entire cluster fails when the master ZooKeeper node goes offline.
We have a 3-node ActiveMQ cluster set up in our development environment. Each node runs ActiveMQ 5.12.0 and ZooKeeper 3.4.6. (Note: we have done some testing with ZooKeeper 3.4.7, but this did not resolve the issue. Time constraints have so far prevented us from testing ActiveMQ 5.13.)

What we have found is that when we stop the master ZooKeeper process (via the "End process tree" command in Task Manager), the remaining two ZooKeeper nodes continue to function as normal. Sometimes the ActiveMQ cluster handles this, but sometimes it does not. When the cluster fails, we typically see this in the ActiveMQ log:

    2015-12-18 09:08:45,157 | WARN | Too many cluster members are connected. Expected at most 3 members but there are 4 connected. | org.apache.activemq.leveldb.replicated.MasterElector | WrapperSimpleAppMain-EventThread
    ...
    ...
    2015-12-18 09:27:09,722 | WARN | Session 0x351b43b4a560016 for server null, unexpected error, closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn | WrapperSimpleAppMain-SendThread(192.168.0.10:2181)
    java.net.ConnectException: Connection refused: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)[:1.7.0_79]
        at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)[:1.7.0_79]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)[zookeeper-3.4.6.jar:3.4.6-1569965]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)[zookeeper-3.4.6.jar:3.4.6-1569965]

We were immediately concerned by two things: (A) ActiveMQ seems to think there are four members in the cluster when it is only configured with 3, and (B) when the exception is raised, the server appears to be null.

We then increased ActiveMQ's logging level to DEBUG in order to display the list of members:

    2015-12-18 09:33:04,236 | DEBUG | ZooKeeper group changed: Map(localhost -> ListBuffer((0000000156,{"id":"localhost","container":null,"address":null,"position":-1,"weight":5,"elected":null}), (0000000157,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":null}), (0000000158,{"id":"localhost","container":null,"address":"tcp://192.168.0.11:61619","position":-1,"weight":10,"elected":null}), (0000000159,{"id":"localhost","container":null,"address":null,"position":-1,"weight":10,"elected":null}))) | org.apache.activemq.leveldb.replicated.MasterElector | ActiveMQ BrokerService[localhost] Task-14

Can anyone suggest why this may be happening and/or suggest a way to resolve this? Our configurations are shown below:

*ZooKeeper:*

    tickTime=2000
    dataDir=C:\\zookeeper-3.4.7\\data
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.1=192.168.0.10:2888:3888
    server.2=192.168.0.11:2888:3888
    server.3=192.168.0.12:2888:3888

*ActiveMQ (server.1):*

    <persistenceAdapter>
      <replicatedLevelDB
          directory="activemq-data"
          replicas="3"
          bind="tcp://0.0.0.0:61619"
          zkAddress="192.168.0.11:2181,192.168.0.10:2181,192.168.0.12:2181"
          zkPath="/activemq/leveldb-stores"
          hostname="192.168.0.10"
          weight="5"/>
    </persistenceAdapter>

(server.2 has a weight of 10, server.3 has a weight of 1.)
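
To make observation (A) easier to check against other log lines, here is a minimal Python sketch that counts the member entries in a "ZooKeeper group changed" DEBUG message and compares the count with the configured replicas value. It is not part of our setup. The function names and the regular expression are illustrative, and the pattern assumes every entry looks like the ones in the DEBUG line above (a sequence number followed by flat JSON with no nested braces).

```python
import json
import re

# One "(sequence,{json})" entry as printed by MasterElector at DEBUG level.
# Assumption: the member JSON is flat, as in the single log line we captured.
MEMBER_RE = re.compile(r"\((\d+),(\{[^{}]*\})\)")


def parse_members(log_message):
    """Return a list of (sequence, member_dict) pairs found in the message."""
    return [(seq, json.loads(body)) for seq, body in MEMBER_RE.findall(log_message)]


def check_membership(log_message, replicas):
    """Summarise the registered members relative to the configured replicas."""
    members = parse_members(log_message)
    return {
        "count": len(members),
        "excess": max(0, len(members) - replicas),
        "weights": [m["weight"] for _, m in members],
        "with_address": [seq for seq, m in members if m["address"]],
    }


# The message portion of the DEBUG line shown above.
SAMPLE_MESSAGE = (
    'ZooKeeper group changed: Map(localhost -> ListBuffer('
    '(0000000156,{"id":"localhost","container":null,"address":null,'
    '"position":-1,"weight":5,"elected":null}), '
    '(0000000157,{"id":"localhost","container":null,"address":null,'
    '"position":-1,"weight":1,"elected":null}), '
    '(0000000158,{"id":"localhost","container":null,'
    '"address":"tcp://192.168.0.11:61619","position":-1,"weight":10,'
    '"elected":null}), '
    '(0000000159,{"id":"localhost","container":null,"address":null,'
    '"position":-1,"weight":10,"elected":null})))'
)

if __name__ == "__main__":
    print(check_membership(SAMPLE_MESSAGE, replicas=3))
```

Run against the line above, this reports four entries for a cluster configured with replicas="3". Two of the entries carry weight 10, although only server.2 is configured with that weight.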