Hi! We've spent the past couple of weeks trying to set up an ActiveMQ cluster that has both high availability of brokers and high availability of messages (i.e. messages don't get stuck on crashed brokers).
It seems that this configuration is missing from the documentation and the mailing list (or we were just unable to find it), so I wanted to document it here in case it's helpful for others. Also, if you have any comments on what we could have done more simply, that would be helpful as well! (We're using ActiveMQ 5.5.)

We started with a pretty standard "Pure Master/Slave" configuration (http://activemq.apache.org/pure-master-slave.html). This allows a backup of the messages to exist if the Master dies (in our configuration, only the Master takes part in the network of brokers or in client communication).

One change we had to make, due to an open ActiveMQ bug (https://issues.apache.org/jira/browse/AMQ-3364), was to have the init script kill -9 the java process on "stop" instead of attempting a controlled shutdown. After this change, we no longer saw any missing messages on master shutdown.

Additionally, I wrote a restart script which inspects the states of the master and slave nodes, shuts everything down, copies the data directories over from the "more recently running" node, and restarts the master and slave in the right order (pausing for the master to start its transportConnectors so that the slave doesn't try to start up before the master has finished). A rough sketch of that ordering is included below, just before the config.

With the pure Master/Slave setup out of the way, we went on to try to set up a network of (pure master/slave) brokers. This configuration was much more difficult to get right. In most of the configurations we tried, after restarting random nodes in the cluster and testing various message delivery paths, messages would get stuck or lost, or networkConnections would reconnect but not pass messages.

The configuration we did get working (note that we will primarily be using STOMP on 61613 for client connections) was this:

- The loadbalancer VIP load balances over the Master01 and Master02 servers; slaves do not accept client connections (see the client sketch below).
- Master01 and Master02 have transportConnectors configured, but Slave01 and Slave02 do not have any transports in their configuration.
- The master connections from Slave01 -> Master01 and Slave02 -> Master02 are configured on a separate port (61618) from the network-of-brokers connections (61616). We're not sure this separation ended up being required for things to work, but it helped us identify traffic and failed reconnections during testing.
- Master01 <---> Master02: a duplex networkConnection between the master servers, configured using multicast discovery (explicit configuration of the servers seemed to result in stuck networkConnections on server failures).
- Slave01 --> Master02 and Slave02 --> Master01: a non-duplex networkConnection from each slave to the other pair's master, also configured using multicast discovery. This connection lets a slave drain the message queues it holds for its master if that master dies (clients never connect to the slaves, only to the masters, in our configuration).

We use puppet to configure activemq.xml, and the ERB templates use a naming convention in which (name)-mq(number)-(letter) identifies a queue cluster, e.g. clustername-mq01-a: -a indicates the master and -b the slave, so the configuration is 100% automated.
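Here's a rough sketch (in Python) of the ordering logic in that restart script. It's purely illustrative, not the script we actually run: the hostnames, data directory path, init script invocation, and the "most recently running" check (here just a comparison of data directory mtimes over ssh) are all hypothetical stand-ins.

#!/usr/bin/env python
# Illustrative sketch of the restart ordering -- not the real script.
# Hostnames, paths, and the "most recently running" check are hypothetical.
import subprocess
import time

MASTER = "clustername-mq01-a"    # hypothetical master hostname
SLAVE = "clustername-mq01-b"     # hypothetical slave hostname
DATA_DIR = "/opt/activemq/data"  # hypothetical dataDirectory location

def ssh(host, *cmd):
    """Run a command on a remote node and return its exit status."""
    return subprocess.call(["ssh", host] + list(cmd))

def data_dir_mtime(host):
    """Very rough proxy for which node was running most recently: the
    mtime of its data directory (the real script inspects broker state)."""
    out = subprocess.check_output(["ssh", host, "stat", "-c", "%Y", DATA_DIR])
    return int(out.strip())

def wait_for_port(host, port, timeout=120):
    """Poll until the master's transportConnector is actually listening."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if subprocess.call(["nc", "-z", host, str(port)]) == 0:
            return True
        time.sleep(2)
    return False

# 1. Shut everything down (our init script kill -9s the broker on "stop",
#    per AMQ-3364).
for host in (MASTER, SLAVE):
    ssh(host, "service", "activemq", "stop")

# 2. Copy the data directory from the more recently running node to the
#    other one, so neither side starts from a stale message store
#    (assumes the nodes can ssh/rsync to each other).
if data_dir_mtime(MASTER) >= data_dir_mtime(SLAVE):
    src, dst = MASTER, SLAVE
else:
    src, dst = SLAVE, MASTER
ssh(src, "rsync", "-a", "--delete", DATA_DIR + "/", "%s:%s/" % (dst, DATA_DIR))

# 3. Restart in the right order: master first, then wait for its
#    transportConnector to come up before starting the slave.
ssh(MASTER, "service", "activemq", "start")
if not wait_for_port(MASTER, 61616):
    raise SystemExit("master transport never came up")
ssh(SLAVE, "service", "activemq", "start")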
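And for reference, a minimal sketch of what a client connection looks like from the application side, assuming the stomp.py library (the VIP hostname, credentials, and queue name are made up). Clients only ever talk to the VIP, which fronts Master01 and Master02 on 61613:

import stomp  # assumes the stomp.py client library

# Hypothetical VIP hostname -- the load balancer only fronts Master01 and
# Master02; the slaves never accept client connections in our setup.
conn = stomp.Connection([("clustername-mq-vip.example.com", 61613)])
conn.connect("client_user", "client_password", wait=True)

conn.send(destination="/queue/test", body="hello through the VIP")
conn.disconnect()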
Here are the relevant portions of the activemq.xml.erb file:

[...snip...]

<!-- The <broker> element is used to configure the ActiveMQ broker.
     Note that for masters we have waitForSlave set to true, and we set
     a unique name for each brokerName. -->
<% hostname =~ /^([^.]*)-mq([0-9]+)-([ab])/
   group_name = $~[1]
   pair_name  = "#{$~[1]}-mq#{$~[2]}"
   letter     = $~[3]
-%>
<broker xmlns="http://activemq.apache.org/schema/core"
        <% if letter == "a" %>
        waitForSlave="true"
        shutdownOnSlaveFailure="false"
        <% end -%>
        brokerName="<%= hostname %>"
        dataDirectory="${activemq.base}/data"
        networkConnectorStartAsync="true">

[...snip...]

  <!-- For network connections, master-master is duplex and slave-master is
       non-duplex. The username/password is given an ACL for all queues and
       topics (">"). -->
  <networkConnectors>
    <networkConnector uri="multicast://224.1.2.3:6255?group=<%= group_name %>"
                      name="<%= hostname -%>-<%= group_name %>"
                      userName="<%= network_connection_username %>"
                      password="<%= network_connection_password %>"
                      networkTTL="3"
                      duplex="<%= letter == "a" ? "true" : "false" %>">
    </networkConnector>
  </networkConnectors>

[...snip...]

  <services>
    <% if letter == "b" %>
    <!-- slaves initiate a masterConnector to their master for replication -->
    <masterConnector remoteURI="nio://<%= pair_name -%>-a.<%= domain %>:61618"
                     userName="<%= network_connection_username %>"
                     password="<%= network_connection_password %>"/>
    <% end -%>
  </services>

  <transportConnectors>
    <% if letter == "a" %>
    <!-- note that we only start up transports for masters, not for slaves -->
    <transportConnector name="openwire" uri="nio://0.0.0.0:61616"
                        discoveryUri="multicast://224.1.2.3:6255?group=<%= group_name %>"/>
    <transportConnector name="replication" uri="nio://0.0.0.0:61618"/>
    <transportConnector name="stomp+nio" uri="stomp+nio://0.0.0.0:61613"/>
    <% end -%>
  </transportConnectors>

Hope this helps someone!!

Keith Minkler