Re: Cluster failing to resurrect durable static route

Mark Moseley Fri, 07 Jan 2011 16:21:45 -0800

On Fri, Jan 7, 2011 at 3:50 PM, Mark Moseley <[email protected]> wrote:
> On Thu, Jan 6, 2011 at 12:55 PM, Alan Conway <[email protected]> wrote:
>> On 12/28/2010 07:00 PM, Mark Moseley wrote:
>>>
>>> Sorry in advance that this is long. I've tried to explain it as
>>> succinctly and as thoroughly as possible.
>>>
>>> I've got a 2-node qpid test cluster at each of 2 datacenters, which
>>> are federated together with a single durable static route between
>>> each. Qpid is version 0.8. Corosync and openais are stock Squeeze
>>> (1.2.1-3 and 1.1.2-2, respectively). OS is Squeeze, 32-bit, on Dell
>>> PowerEdge 1950s, kernel 2.6.36. The static route is durable and is set
>>> up over SSL.
>>>
>>> This is quite possibly just a conceptual problem with how I'm setting
>>> this up, so if anyone has a 'right way' to do it, I'm all ears :)
>>>
>>> Just a prelim: Call them cluster A with nodes A1 and A2, and cluster B
>>> with nodes B1 and B2. The static route is defined as A1->B1 for an
>>> exchange on cluster B (call it exchangeB), and the other route is
>>> B1->A1 for an exchange on cluster A (call it exchangeA). After setting
>>> this up, things seem to work pretty well. I can send from any node in
>>> cluster A to exchangeB and it's received by any receiving node in
>>> cluster B. Running "qpid-config ... exchanges --bindings" on cluster A
>>> nodes shows the route to cluster B for exchangeB and vice versa. That
>>> seems to be good.
>>>
>>> The trouble I'm having is with failover. If I fail the cluster in
>>> this order, taking down A1 (the node the route was created for) first:
>>>
>>> * Kill A1, kill A2, start A2, start A1  ->  The bindings on cluster B
>>> for exchangeA get set back up automatically
>>>
>>> Also, after I kill A1, the route seems to fail over correctly to A2,
>>> i.e. with A1 dead and A2 still alive, looking at qpid-route on B1 or
>>> B2 says:
>>> Exchange 'exchangeA' (direct)
>>>     bind [mytopic] =>  bridge_queue_1_f6d80145-67d2-4659-b26e-80c4da3ae85b
>>>
>>> If I stop the cluster in this order:
>>>
>>> * Kill A2, kill A1, start A1, start A2  ->  The bindings on cluster B
>>> for exchangeA don't get set up, i.e. on B1 or B2, qpid-route says:
>>> Exchange 'exchangeA' (direct)
>>>
>>> Am I doing something wrong or is this a known limitation? I'd expect
>>> that regardless of ordering, a durable route would come back up on its
>>> own, on either node. I'd also think that if it was a limitation, it'd
>>> happen in the other order, when A2 was the last node standing,
>>> considering the route was created for A1.
>>>
>>
>> I think you have uncovered a bug. Can you create a JIRA for it and assign it
>> to me initially? Detailed instructions on how to reproduce are greatly
>> appreciated.
>
> I've created this as JIRA 2992. I wasn't quite clever enough to figure
> out how to assign it to you :) Sorry to be daft, but I can't seem to
> find any link/button that looks like it'd let me do that.
>
>
>>> I had tried earlier to use source routes for my routing and they
>>> seemed to do better at coming back after failover, but on the source
>>> cluster's side, the non-primary node (A2) would often blow up when
>>> cluster B was down and a node in cluster B came back online, always
>>> saying this in A2's qpid logs (10.1.58.3 is A1, 10.1.58.4 is A2):
>>>
>>> 2010-12-28 17:19:37 info ACL Allow id:walcl...@qpid action:create
>>> ObjectType:link Name:
>>> 2010-12-28 17:19:37 info Connection is a federation link
>>> 2010-12-28 17:19:39 error Channel exception: not-attached: Channel 1
>>> is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
>>> 2010-12-28 17:19:39 critical cluster(10.1.58.4:3128 READY/error) local
>>> error 3054 did not occur on member 10.1.58.3:3369: not-attached:
>>> Channel 1 is not)
>>> 2010-12-28 17:19:39 critical Error delivering frames: local error did
>>> not occur on all cluster members : not-attached: Channel 1 is not
>>> attached (qpid/a)
>>> 2010-12-28 17:19:39 notice cluster(10.1.58.4:3128 LEFT/error) leaving
>>> cluster walclust
>>> 2010-12-28 17:19:39 notice Shut down
>>>
>>
>> This also sounds like a bug. Can you create a separate JIRA for it? Assign
>> it to me as well.
>
> In debugging this, I figured I really ought to upgrade
> corosync/openais for the heck of it. I just did that a couple of hours
> ago and now I'm going to re-test the source route case. If upgrading
> corosync/openais doesn't fix it, I'll open up another JIRA.
>
> Thanks!
>


The source-local route issue still persists with a newer
corosync/openais. I've opened JIRA 2993 for it. Still haven't figured out
how to actually assign it to you, though.
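
For anyone trying to reproduce this, the setup described in this thread can
be sketched roughly as below. This is only a sketch under assumptions: the
host names, the port, and the routing key used for exchangeB are
placeholders, and the "-d" (durable) flag and the dest-then-source argument
order are my understanding of qpid-route's usage, so check them against
"qpid-route --help" on your version. The SSL transport settings are left
out. The script only builds the command lines and does not run them.

```python
# Sketch only: builds (does not run) the qpid-route commands for the two
# durable static routes described in this thread.
# Host names, port, and the routing key for exchangeB are placeholders.

def route_add_cmd(dest_broker, src_broker, exchange, routing_key):
    """Return the argv for a durable static route from src_broker to dest_broker."""
    # Assumed usage: qpid-route -d route add <dest> <src> <exchange> <key>
    return ["qpid-route", "-d", "route", "add",
            dest_broker, src_broker, exchange, routing_key]

# A1 -> B1 for exchangeB, and B1 -> A1 for exchangeA.
ROUTES = [
    route_add_cmd("b1.example:5672", "a1.example:5672", "exchangeB", "mytopic"),
    route_add_cmd("a1.example:5672", "b1.example:5672", "exchangeA", "mytopic"),
]

# The two restart orders compared in the report.
RESTART_ORDERS = {
    "bindings come back": ["kill A1", "kill A2", "start A2", "start A1"],
    "bindings stay missing": ["kill A2", "kill A1", "start A1", "start A2"],
}

if __name__ == "__main__":
    for cmd in ROUTES:
        print(" ".join(cmd))
```

After each restart order, running qpid-route or "qpid-config ... exchanges
--bindings" on B1 or B2 shows whether the binding on exchangeA came back.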

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]
