Re: a question about the FailoverManager

Gordon Sim Mon, 16 Mar 2009 12:32:17 -0700

Shan Wang wrote:

As I understand it, FailoverManager::execute() will try to reconnect if
no open connection is available, by using the last known brokers
list (because it uses connect() which has an empty brokers url list?),
and if that succeeded, a new session will be created and
Command::execute() will be called. And if the reconnect failed, it will
throw an exception.

That is correct; when the current connection fails it will try to reconnect using the list of known urls retrieved when connected. It will try each of these in turn, then fail if it can't connect to any of them.
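
To illustrate the "try each in turn, then fail" behaviour described above, here is a minimal sketch. It is not the FailoverManager source: reconnectToAny and the TryConnect callback are hypothetical names standing in for whatever actually opens the connection, and the error text simply reuses the message quoted later in this thread.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for "open a connection to this URL".
// Returns true if the connection attempt succeeded.
using TryConnect = std::function<bool(const std::string&)>;

// Walk the list of known URLs in order and return the first one that
// accepts a connection. If none of them can be reached, throw.
std::string reconnectToAny(const std::vector<std::string>& knownUrls,
                           const TryConnect& tryConnect) {
    for (const std::string& url : knownUrls) {
        if (tryConnect(url)) {
            return url;  // connected; the caller would now open a new session
        }
    }
    throw std::runtime_error("Cannot establish a connection");
}
```

With a two-broker list where only the second broker is up, this returns the second URL; with both down it throws. Note that each attempt can itself block for as long as the underlying connect takes to give up, so the total time depends on how quickly a dead broker is detected.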

If the above is correct, then when broker1 in my cluster is dead, the
messages sent by sender should immediately failover to broker2, and the
client shouldn't notice anything. But in my test, what actually happened
is it took about 6-7 minutes for the clients to failover, during that
time both the sender and receiver clients were frozen.

That is much slower than I would expect from my testing. If I kill -9 a broker, failover takes no more than a second or so. Are you killing the broker in your simulations? Or turning off the machine? Or something else?

And if I kill both of the brokers in the cluster, it also takes about
6-7 minutes for the clients to notice that there are no connections
available. Even if I restart the brokers within that frozen time,
clients still won't be able to connect to the newly started broker but will
eventually fail with a "Cannot establish a connection" exception.

My test clients are almost the same as replay_sender and resume_receiver
examples from M4; the only variation is that at the beginning they use
connect( url ) to connect to the cluster rather than a specific host:port
string. Also I put a sleep between each send so I can manage the test
more easily.


Do you see the same slowness in failover with those original examples?

So am I doing something wrong here? Can anyone suggest what I need to
do in my application to have maximum resilience besides relying on
FailoverManager?

I would certainly expect to see _much_ faster failover than that. I'm not sure what is causing the delay; you could try turning on heartbeats in case it is somehow slow to detect the closed socket.
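
As a rough picture of why heartbeats can help here: if the peer disappears without closing the socket, the client may see no error for a long time, whereas a heartbeat check notices the silence. The sketch below is only an illustration of that idea, not the Qpid client's implementation; HeartbeatMonitor is a hypothetical helper, and the "two missed intervals" threshold is an assumption chosen for the example.

```cpp
#include <chrono>

// Hypothetical helper, not part of the Qpid client API.
// Tracks when traffic was last seen from the peer and reports the peer
// as presumed dead once more than two heartbeat intervals pass in silence.
class HeartbeatMonitor {
public:
    using Clock = std::chrono::steady_clock;

    HeartbeatMonitor(std::chrono::seconds interval, Clock::time_point now)
        : interval_(interval), lastSeen_(now) {}

    // Call whenever any frame (data or heartbeat) arrives from the peer.
    void onTraffic(Clock::time_point now) { lastSeen_ = now; }

    // True once the silence exceeds two heartbeat intervals (assumed threshold).
    bool peerPresumedDead(Clock::time_point now) const {
        return now - lastSeen_ > 2 * interval_;
    }

private:
    std::chrono::seconds interval_;
    Clock::time_point lastSeen_;
};
```

With a short interval, a check like this would flag a vanished broker within seconds rather than waiting for the operating system to report the dead socket, at which point the client can start its reconnect attempts.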


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]
