Hi Gordon,

Sorry I didn't get back to you yesterday; I needed a chance to start a separate broker, one that I could stop, on one of our systems.
We had seen these different results with our deployed system, and I wanted to make sure that this test client behaved the same way. (On our deployed system, most of the failover testing had been done with a kill on the broker, and we would see the client detect the lost connection. Later, when a host died, we found this other condition, where the client did not detect it.)

So, I ran the test:
* started a separate broker (see below)
* started the same test client as before, with the larger msg size
* did a kill on the broker: kill -STOP <pid>
* saw the same results as you did: the client detected the lost connection in about 2x the heartbeat interval

Then, to verify my earlier results, I ran the exact same test, except this time I pulled the network cable:
* started a separate broker (same as previous run above)
* started the same test client as before, with the larger msg size
* pulled the network cable
* saw the same results as in my previous tests: the client continued to "send" well past when the heartbeat timeout should have fired (seeing the same trace messages) until, about 80 seconds later, it locked up

Note: this result (the client sending the larger msg and missing the lost connection) also happens if the broker host abruptly dies, which is how we first detected the problem.

Note: I used the following to start the broker:
/usr/sbin/qpidd -p 18102 --log-to-syslog no --log-to-file /export/hps/dda/qpidd_x/log/qpidd_x.log --worker-threads 3 --data-dir /export/hps/dda/qpidd_x/data --pid-dir /export/hps/dda/qpidd_x/pid-dir --auth no --config /dev/null

So, please let me know if you can run the test again, but pulling the network cable. (I'm pulling the cable between broker and switch, but I'm pretty sure I've seen the same when pulling it between switch and client.)

thanks,
Tom

On Thu, Jan 12, 2012 at 10:05 AM, Gordon Sim wrote:
> On 01/07/2012 01:26 AM, Tom M wrote:
>
>> I’ve created a simpler test client (based on our deployed application)
>> to test this problem.
>>
>
> I ran your test client against the same package versions you listed for
> qpidd and the client lib. I didn't pull a cable (as I was testing on remote
> boxes) but instead issued a kill -STOP against the broker, which should be
> similar from the perspective of the client (i.e. it will miss two
> heartbeats and abort the connection).
>
> However in all my attempts it did correctly detect the closed connection
> and issue an exception. The output was of the following form:
>
> 01_12 15:51:34 TstConn: sending msg: 45
> 01_12 15:51:34 TstConn: msg sent
> 01_12 15:51:35 TstConn: sending msg: 46
> 01_12 15:51:35 TstConn: msg sent
> 2012-01-12 10:51:35 warning Connection [42787 mrg11:5672] closed
>
> 01_12 15:51:36 TstConn: connection_.isOpen() detected lost connection
> note: detected with isOpen() call, not an exception....
>
> 01_12 15:51:36 TstConn: sending msg: 47
> 01_12 15:51:36 TstConn: qpid::Exception: Connection [42787 mrg11:5672]
> closed
>
>
> ...waiting on user (allow user to reconnect cable, so can attempt to
> close connection on broker)
>
> To continue shutdown, enter: 1
>
> This is the same whichever size I choose. Would you mind verifying if you
> see your problem with a kill -STOP in place of the network cable removal?
> If not, I'll try and get a setup to test on where I can pull a physical
> cable; if you can then it confirms there is something else different in our
> test setup.
>
> ---------------------------------------------------------------------
> Apache Qpid - AMQP Messaging Implementation
> Project: http://qpid.apache.org
> Use/Interact: mailto:users-subscribe@qpid.apache.org
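
For reference, here is a minimal sketch of the detection rule both test runs are being measured against: the client should give up once roughly two heartbeat intervals pass without hearing from the broker. This is only an illustration of that timing rule, not the qpid client's actual implementation. The function names, the 5-second interval and the timestamps are all assumptions made up for the example.

```python
def heartbeat_deadline(last_heard: float, heartbeat_s: float,
                       missed_allowed: int = 2) -> float:
    """Time at which the peer should be declared lost.

    last_heard     -- time (seconds) the last traffic arrived from the broker
    heartbeat_s    -- negotiated heartbeat interval (hypothetical value below)
    missed_allowed -- missed heartbeats tolerated before aborting (two, per
                      the description in this thread)
    """
    return last_heard + missed_allowed * heartbeat_s


def connection_lost(now: float, last_heard: float, heartbeat_s: float,
                    missed_allowed: int = 2) -> bool:
    """True once the deadline has been reached with nothing heard."""
    return now >= heartbeat_deadline(last_heard, heartbeat_s, missed_allowed)


if __name__ == "__main__":
    heartbeat_s = 5.0    # assumed interval, not taken from the real test
    last_heard = 100.0   # assumed timestamp of the last broker traffic

    # 104 and 109 are still inside the window; 110 and 180 are past it.
    for now in (104.0, 109.0, 110.0, 180.0):
        print(f"t={now:6.1f}  lost={connection_lost(now, last_heard, heartbeat_s)}")
```

With these assumed numbers the deadline falls 10 seconds after the last traffic, which matches the "about 2x heartbeat" detection seen in the kill -STOP run. In the cable-pull run the client kept "sending" for about 80 seconds, well past where such a deadline would sit.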
