Hi Gordon (and team),

Have you had a chance to look at this any more? In particular, running the test while pulling the network cable (or abruptly shutting down the broker host)?

thanks,
Tom
On Sat, Jan 14, 2012 at 6:07 PM, Tom M <[email protected]> wrote:
> I don't have root privileges on this system to run the script from Alan.
> But I did again verify that I get the same results just by disconnecting the
> network cable from the client host (while running my test client).
> (Incidentally, I don't remember if I mentioned it, but we first detected this
> problem when our broker host crashed (so neither the OS, NIC, nor switch link
> had an opportunity to close gracefully), which, I assume, is a case that is
> intended to be covered; just pulling the network cable appears to give the
> same result, yet is obviously a less radical test to work with.)
>
> Also, a few more notes from the original trace logs (which you may have
> already noticed):
> * As seen in the failed case 01_03d (which actually had trace logging; I
> mislabeled it in my email description),
> with the larger message size, the underlying qpid client had stopped
> sending msgs before the heartbeat timeout would have occurred.
> The heartbeat rate was set to 8.
> The cable was pulled at about "msg: 30", and the last "trace SENT" was for
> msg:41 (one msg per sec).
> Then, 4 more messages were "sent" by the application (via
> MessageReplayTracker), but there are no traces from the qpid client code.
> Then, almost 16 seconds after the net disconnect, there is an indication
> of "Traffic timeout", but there is no action by the client code after this.
>
> In the good cases (i.e. case 01_03b with the smaller msg, again with the net
> cable pulled at about msg:30), we continue to see the qpid client performing
> "SENT"s for all messages up to the "Traffic timeout". Then the timeout
> occurs, which is followed by the "Exception constructed" for the close
> (which does not happen in the failed case).
>
> I'm wondering if the outgoing send buffer filling up is somehow blocking
> the logic that acts on the timeout.
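The send-buffer hypothesis above can be sketched in a few lines. This is a minimal single-threaded simulation (plain Python, not the actual qpid C++ I/O layer, so everything here is an assumption about the mechanism, not the real code); `FakeTransport` stands in for a socket whose peer has vanished:

```python
class FakeTransport:
    """Stand-in for a TCP socket after a cable pull: writes succeed while
    the kernel send buffer has room, then block forever (no FIN/RST ever
    arrives, so the blocked send never returns)."""

    def __init__(self, buffer_bytes):
        self.free = buffer_bytes

    def send(self, payload):
        if len(payload) > self.free:
            return False          # a real blocking socket would hang here
        self.free -= len(payload)
        return True


def run_client(msg_size, msgs=50, buffer_bytes=64 * 1024):
    """If the heartbeat/traffic deadline is only checked between sends,
    a send that never returns also silences the timeout logic."""
    transport = FakeTransport(buffer_bytes)
    for n in range(msgs):
        if not transport.send(b"x" * msg_size):
            return f"stuck in send() at msg {n}; timeout check never runs"
        # timeout logic would run here -- but only because send() returned
    return f"sent {msgs} msgs; timeout logic ran after each one"
```

Under this model, small messages never fill the buffer and the timeout logic keeps running until it fires, while large messages fill the buffer a few messages after the disconnect and the client goes silent, which would match the difference between the 01_03b and 01_03d traces.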
>
> I don't know if this is somehow affected by the actual link level getting
> disconnected (which, if somehow related, might help to explain the
> differing results when doing a STOP on the broker as opposed to the broker
> host going down or the net being pulled).
> Also, I'm wondering if just dropping the network packets will have the same
> or a different effect (particularly if the condition requires the send path
> backing up).
>
> thanks,
> Tom
>
>
> On Fri, Jan 13, 2012 at 8:34 AM, Alan Conway <[email protected]> wrote:
>
>> On 01/13/2012 09:07 AM, Tom M wrote:
>>
>>> Hi Gordon,
>>> Sorry I didn't get back to you yesterday, but I needed to get an
>>> opportunity to start a separate broker that I could stop on one of our
>>> systems.
>>>
>>> We had seen these different results with our deployed system, and I
>>> wanted to make sure that this test client acted the same way.
>>> (On our deployed system, most of the failover testing had been done with
>>> a kill on the broker, and we would see the client detect the lost
>>> connection. But later, when a host died, we found this other condition,
>>> where the client did not detect it.)
>>>
>>> So, I ran the test:
>>> * started a separate broker (see below)
>>> * started the same test client as before, with the larger msg size
>>> * stopped the broker: kill -STOP <pid>
>>> * saw the same results as you did: the client detected the lost
>>> connection in about 2x the heartbeat interval
>>>
>>> Then, to verify my earlier results, I ran the exact same test, except
>>> this time I pulled the network cable:
>>> * started a separate broker (same as the previous run above)
>>> * started the same test client as before, with the larger msg size
>>> * pulled the network cable
>>> * saw the same results as in my previous tests: the client continued to
>>> "send" well past where the heartbeat timeout should have been (seeing the
>>> same trace messages) until, about 80 seconds later, it locked up.
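One plausible reason STOP and a cable pull differ at the TCP level (an assumption worth verifying, not something the traces prove): a STOPped broker's kernel still ACKs segments, so the client's socket keeps draining, whereas after a cable pull nothing is ACKed and the kernel send buffer fills. A quick way to watch a send buffer fill up is with plain Python sockets (not qpid); here a `socketpair` whose peer never reads stands in for an unACKed connection:

```python
import socket

# Peer 'b' never reads, so writes to 'a' pile up in kernel buffers,
# mimicking a connection that stops being ACKed after a cable pull.
a, b = socket.socketpair()
a.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 16384)
a.setblocking(False)   # non-blocking so we can observe instead of hanging

sent = 0
try:
    while True:
        sent += a.send(b"x" * 4096)
except BlockingIOError:
    # A *blocking* socket would simply never return from send() here.
    pass

print(f"send buffer filled after {sent} bytes; a blocking send() would now hang")
```

The same backlog should be visible on the real client in the Send-Q column of `netstat -tn` (or `ss -tn`) while the large-message test is wedged.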
>>>
>>> Note: I have seen this result (the client sending the larger msg missing
>>> the detection of the lost connection) also happen when the broker host
>>> abruptly dies (which is how we first detected the problem).
>>>
>>> Note: I used the following to start the broker:
>>> /usr/sbin/qpidd -p 18102 --log-to-syslog no --log-to-file
>>> /export/hps/dda/qpidd_x/log/qpidd_x.log --worker-threads 3 --data-dir
>>> /export/hps/dda/qpidd_x/data --pid-dir /export/hps/dda/qpidd_x/pid-dir
>>> --auth no --config /dev/null
>>>
>>> So, please let me know if you can run the test again, but pulling the
>>> network cable. (I'm pulling the net between the broker and the switch,
>>> but I'm pretty sure I've seen the same when pulling it between the switch
>>> and the client.)
>>> thanks,
>>> Tom
>>>
>>
>> You can simulate a network cable pull by telling iptables to drop
>> packets. Attached is an old script, no warranty.
>> WARNING: if you've got remote access only to the machine in question, be
>> careful you don't lock yourself out! The attached script only drops
>> corosync/openais packets, so you can still ssh etc.
>>
>
>
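Alan's iptables approach could be adapted to this broker; the following is a hypothetical sketch (the port 18102 is taken from the qpidd command line above; Alan's actual attached script is not reproduced here). The important detail is DROP rather than REJECT: REJECT sends back an RST/ICMP error that the client notices immediately, while DROP is silent, like a pulled cable. Printed as a dry run so nothing changes by accident:

```shell
#!/bin/sh
# Dry run: print the iptables commands instead of executing them.
# Drop the "echo" in rule() and run as root to actually apply them.
BROKER_PORT=18102   # from the qpidd command line above

rule() { echo "iptables $*"; }

# Silently drop broker traffic in both directions, like a pulled cable.
rule -I OUTPUT -p tcp --dport "$BROKER_PORT" -j DROP
rule -I INPUT  -p tcp --sport "$BROKER_PORT" -j DROP

# When the test is done, the same rules with -D instead of -I remove them.
```

Because the rules match only the broker port, ssh and other remote access stay usable, which addresses the lock-out warning above.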
