On Tue, Dec 28, 2010 at 4:00 PM, Mark Moseley <[email protected]> wrote:
> Sorry in advance that this is long. I've tried to explain it as
> succinctly but thoroughly as possible.
>
> I've got a 2-node qpid test cluster at each of 2 datacenters,
> federated together with a single durable static route between them.
> Qpid is version 0.8. Corosync and openais are stock Squeeze (1.2.1-3
> and 1.1.2-2, respectively). OS is Squeeze, 32-bit, on Dell PowerEdge
> 1950s, kernel 2.6.36. The static route is durable and is set up over
> SSL.
>
> This is quite possibly just a conceptual problem with how I'm setting
> this up, so if anyone has a 'right way' to do it, I'm all ears :)
>
> Just a prelim: call them cluster A with nodes A1 and A2, and cluster
> B with nodes B1 and B2. The static route is defined as A1->B1 for an
> exchange on cluster B (call it exchangeB), and the other route is
> B1->A1 for an exchange on cluster A (call it exchangeA). After
> setting this up, things seem to work pretty well. I can send from any
> node in cluster A to exchangeB and it's received by any receiving
> node in cluster B. Running "qpid-config ... exchanges --bindings" on
> cluster A nodes shows the route to cluster B for exchangeB, and vice
> versa. That seems to be good.
>
> The trouble I'm having concerns failover. If I take the cluster down
> starting with the node the route was created on:
>
> * Kill A1, kill A2, start A2, start A1 -> the bindings on cluster B
>   for exchangeA get set back up automatically.
>
> Also, after I kill A1, the route seems to fail over correctly to A2,
> i.e. with A1 dead and A2 still alive, qpid-route on B1 or B2 says:
>
>   Exchange 'exchangeA' (direct)
>     bind [mytopic] => bridge_queue_1_f6d80145-67d2-4659-b26e-80c4da3ae85b
>
> If I stop the cluster in this order:
>
> * Kill A2, kill A1, start A1, start A2 -> the bindings on cluster B
>   for exchangeA don't get set up, i.e. on B1 or B2, qpid-route says:
>
>   Exchange 'exchangeA' (direct)
>
> Am I doing something wrong, or is this a known limitation? I'd expect
> that regardless of ordering, a durable route would come back up on
> its own, on either node. I'd also think that if it were a limitation,
> it'd happen in the other order, when A2 was the last node standing,
> considering the route was created for A1.
>
> I had tried earlier to use source routes for my routing, and they
> seemed to do better at coming back after failover. But on the source
> cluster's side, the non-primary node (A2) would often blow up when
> cluster B was down and a node in cluster B came back online, always
> logging this in A2's qpid logs (10.1.58.3 is A1, 10.1.58.4 is A2):
>
> 2010-12-28 17:19:37 info ACL Allow id:walcl...@qpid action:create ObjectType:link Name:
> 2010-12-28 17:19:37 info Connection is a federation link
> 2010-12-28 17:19:39 error Channel exception: not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
> 2010-12-28 17:19:39 critical cluster(10.1.58.4:3128 READY/error) local error 3054 did not occur on member 10.1.58.3:3369: not-attached: Channel 1 is not)
> 2010-12-28 17:19:39 critical Error delivering frames: local error did not occur on all cluster members : not-attached: Channel 1 is not attached (qpid/a)
> 2010-12-28 17:19:39 notice cluster(10.1.58.4:3128 LEFT/error) leaving cluster walclust
> 2010-12-28 17:19:39 notice Shut down
>
> I'm pushing my luck with an email this long, but I'll mention one
> other weirdness. I was working on another test cluster where the IPs
> were 10.1.1.246 and 10.1.1.247. In the qpid logs, they were fairly
> consistently referred to as 10.1.1.118 and 10.1.1.119, almost like
> the 8th bit was being cleared. Could be some localized bizarreness
> (though dns and nsswitch both reported the IPs correctly), but I
> thought I'd mention it. I haven't tried it out with other IPs where
> the 4th octet (or any octet) is over 128.
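For anyone reproducing the setup above, the durable static routes described would have been created with qpid-route along these lines. This is only a sketch: the hostnames, port, and binding key are placeholders from the message, and the actual invocations would have used the cluster's SSL URLs and credentials.

```shell
# Durable push route A1 -> B1 for exchangeB, binding key 'mytopic'
# (placeholder hosts/port; real setup was over SSL)
qpid-route --durable route add B1:5672 A1:5672 exchangeB mytopic

# The route in the other direction, B1 -> A1 for exchangeA
qpid-route --durable route add A1:5672 B1:5672 exchangeA mytopic

# Inspect the resulting links/bridges and bindings from cluster B's side
qpid-route route list B1:5672
qpid-config -a B1:5672 exchanges --bindings
```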
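On the odd IP addresses: the arithmetic is consistent with the top bit of a 32-bit node ID being cleared. If the nodeid is the IPv4 address read in host byte order on a little-endian x86 box (my assumption, not stated in the thread), the IP's last octet becomes the nodeid's most significant byte, so clearing bit 31 turns .246 into .118 and .247 into .119. A quick plain-Python sketch:

```python
import socket
import struct

def nodeid_from_ip(ip: str) -> int:
    """IPv4 address as a host-byte-order (little-endian) 32-bit int.

    Assumption: corosync on a little-endian host derives its nodeid
    from the bound IP this way; that is what makes the last octet the
    most significant byte.
    """
    return struct.unpack('<I', socket.inet_aton(ip))[0]

def clear_high_bit(nodeid: int) -> int:
    """What clear_node_high_bit does: zero bit 31 of the nodeid."""
    return nodeid & 0x7FFFFFFF

def ip_from_nodeid(nodeid: int) -> str:
    """Turn the (possibly masked) nodeid back into dotted-quad form."""
    return socket.inet_ntoa(struct.pack('<I', nodeid))

for ip in ('10.1.1.246', '10.1.1.247'):
    cleared = ip_from_nodeid(clear_high_bit(nodeid_from_ip(ip)))
    print(ip, '->', cleared)   # .246 -> .118, .247 -> .119
```

Octets of 128 or above would be the only ones affected, which matches the observation that .246/.247 changed while 10, 1, and 1 did not.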
Ignore this last thing. It's probably just from having "clear_node_high_bit: yes" in corosync.conf. Everything besides the odd IP thing is still pertinent, though.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation Project: http://qpid.apache.org
Use/Interact: mailto:[email protected]
