Hi folks, this is along the lines of what Franz was asking in "Master/Slave Configuration With Non-Persistence - 2 Brokers Starting Problem", but I didn't want to hijack his thread in case it wasn't related.
What I'm looking for is:

* Pure master/slave, share nothing
* Transparent failover in the event of a broker failure
* Transparent failback when the master is restored (I realize this isn't in AMQ yet)
* Clean startup when bringing up servers while clients are trying to send

Client usage pattern is:

* Some persistent messages to certain queues (maybe 10-50 a minute)
* Lots of non-persistent messages with a 10 sec TTL on other queues (200 / second)

I realize I can do the pure master/slave bit by making the slave point to the master. Transparent failover works fine, even when there are temp queues in play (which is awesome given that there's only one other broker out there that can do this, afaik).

Restoring the master node / startup is where I'm concerned. Considering that the 200/sec messages run 24 hours a day, there's no window in which I can:

* stop the slave
* copy state to master
* restart the master
* restart the slave

all without a client trying to send messages.

If I read the docs correctly, there is no initial state sync at all between brokers, only a means of transferring state events as they occur. Given that, if I stop the slave, the clients will block waiting for a broker to come back up. Then I start the master, and the clients will connect there and immediately start sending messages. When the slave comes up, the master will start propagating changes, but the slave's store won't have any of the messages the master accepted between when it started accepting and when the slave came up. Is this assumption correct?

This isn't as critical for the non-persistent messages, as we don't expect a master to come up and fail again within 10 seconds (the TTL of those messages). The persistent messages, however, may live a very long time (14 hours in a queue until something reads them). And you could imagine that being an easy failure if you unknowingly have faulty hardware.
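To make the gap I'm worried about concrete, here's a toy Java model. It is not ActiveMQ code: ToyBroker, attachSlave and missingOnSlave are names I made up, and it simply assumes the behaviour I described above, i.e. attaching a slave only starts forwarding of new events and copies nothing that is already in the master's store.

```java
import java.util.ArrayList;
import java.util.List;

public class SyncGapDemo {

    /** Toy broker (hypothetical): a message store plus an optional attached slave. */
    static class ToyBroker {
        final List<String> store = new ArrayList<>();
        ToyBroker slave; // null until a slave attaches

        // ASSUMPTION under test: attaching only enables forwarding of
        // future events; nothing already in the store is copied over.
        void attachSlave(ToyBroker s) {
            slave = s;
        }

        // Store the message locally, then forward it if a slave is attached.
        void accept(String msg) {
            store.add(msg);
            if (slave != null) {
                slave.store.add(msg);
            }
        }
    }

    /** Messages the master holds that never reached the slave. */
    static List<String> missingOnSlave(ToyBroker master, ToyBroker slave) {
        List<String> missing = new ArrayList<>(master.store);
        missing.removeAll(slave.store);
        return missing;
    }

    public static void main(String[] args) {
        ToyBroker master = new ToyBroker();
        ToyBroker slave = new ToyBroker();

        // Master is restarted first; clients reconnect and send right away.
        master.accept("persistent-1");
        master.accept("persistent-2");

        // Slave comes up some time later.
        master.attachSlave(slave);
        master.accept("persistent-3");

        System.out.println("slave has:   " + slave.store);
        System.out.println("slave lacks: " + missingOnSlave(master, slave));
    }
}
```

If the real brokers behave like this model, the first two persistent messages exist only on the master, so a second master failure before they are consumed would lose them.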
I can imagine a few ways around this issue (if it is indeed an issue):

1) Start up brokers with acceptors 'disabled', then use JMX to enable acceptance on the master. (Is this possible?)
2) Create a second set of brokers for persistent messages, pointing to our RAC database (we don't want to run 200/sec through RAC, but 50 a min is fine). This is kind of a pain in that it's an EJB3 MDB-based app, and using multiple brokers requires extra configuration.
3) We add (and contribute back) state sync between brokers. The idea being that when a slave connects, we pause all connectors to transfer state and then resume the connectors. Probably a lot harder done than said, considering it's not already implemented that way.

Any suggestions? How are other people using AMQ for an HA lose-nothing / share-nothing solution?

Thanks,
-David
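
P.S. Here's a rough Java sketch of what I mean by option 3, in case it helps the discussion. Again this is a toy model and not AMQ code: ToyBroker, gate and attachSlaveWithSync are all made-up names. The idea is that senders share a read lock, and attaching a slave takes the write lock, which stands in for "pause all connectors" while the existing state is copied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PausedSyncDemo {

    /** Toy broker (hypothetical) that syncs existing state when a slave attaches. */
    static class ToyBroker {
        private final List<String> store = new ArrayList<>();
        // Senders hold the read lock; a state sync holds the write lock,
        // so no message can be accepted while the snapshot is transferred.
        private final ReentrantReadWriteLock gate = new ReentrantReadWriteLock();
        private volatile ToyBroker slave;

        void accept(String msg) {
            gate.readLock().lock();
            try {
                append(msg);
                ToyBroker s = slave;
                if (s != null) {
                    s.append(msg); // forward the event to the slave
                }
            } finally {
                gate.readLock().unlock();
            }
        }

        void attachSlaveWithSync(ToyBroker s) {
            gate.writeLock().lock(); // "pause all connectors"
            try {
                for (String msg : snapshot()) {
                    s.append(msg); // transfer existing state
                }
                slave = s; // from here on, new events are forwarded
            } finally {
                gate.writeLock().unlock(); // "resume the connectors"
            }
        }

        synchronized void append(String msg) {
            store.add(msg);
        }

        synchronized List<String> snapshot() {
            return new ArrayList<>(store);
        }
    }

    public static void main(String[] args) {
        ToyBroker master = new ToyBroker();
        ToyBroker slave = new ToyBroker();

        master.accept("persistent-1");
        master.accept("persistent-2");
        master.attachSlaveWithSync(slave);
        master.accept("persistent-3");

        System.out.println("master: " + master.snapshot());
        System.out.println("slave:  " + slave.snapshot());
    }
}
```

A real implementation would also have to cover acknowledgements, transactions and the on-disk store, which is presumably where the "harder done than said" part comes in.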