If you give me a patched build in RPM form I'll test it for you.
On Mar 12, 2010, at 11:50 AM, Kim van der Riet wrote:
Thanks for the detail. I had thought that you had suffered a recovery failure in which phase 1 of recovery had failed - i.e. the store was unable to analyze the stored messages on disk owing to some sort of disk corruption or similar. But much of the detail here is as you have already described it - sorry for being vague in my request.
I have looked closely at the code, and believe that there may be a bug in the recovery section of the code. The recovery code does not appear to enforce queue policy during recovery, allowing (as I believe you have observed) the loading of messages to exceed policy. I already have a code fix for this - all I lack at the moment is a test case. The problem is that I can't run huge tests like yours; I need to expose this behavior using a more modest case, and that cannot be done from the client alone. Whether or not message content is in fact being released/discarded from memory requires broker access to ascertain. Perhaps I can find another way to test that this works from the client alone.
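For reference, the kind of modest case I have in mind looks roughly like this - a sketch only, with an illustrative queue name and a deliberately tiny limit, assuming the qpid-config and qpid-stat tools from your installation:

    # small durable queue with a tiny size limit and a flow-to-disk policy
    qpid-config add queue policy-test --durable --max-queue-size 1048576 --limit-policy flow-to-disk

    # publish durable messages well past the 1 MB limit, restart the broker,
    # then check whether the recovered depth still honours the policy
    qpid-stat -q

The part that still needs broker access is confirming whether the recovered message content is actually released from memory, rather than just counted correctly.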
I plan to open a Jira for this on Monday.
Thanks once again for your help.
On Fri, 2010-03-12 at 10:41 -0800, Charles Woerner wrote:
Sure. As I mentioned, I was running a test where I shut down the consumers and enqueued large amounts of data. I was running these tests in EC2 in a store-and-forward (src-local queue route) topology. The local s&f broker had a smallish (10 GB) store and a single durable queue with a default max-queue-size limit and a flow-to-disk policy. The s&f broker had a queue route to a durable queue on the central broker, which had a 100 MB store, a max-queue-size limit of 1 GB, and a flow-to-disk policy.

Although the max-queue-size was only 1 GB, qpid continued to accept messages and acquire memory beyond the physical memory limit and into swap - at that point it died. I tried to restart it, but qpid kept complaining that it had lost contact with a child process. So I allocated more swap and tried to restart it, and although this got me past the initial critical error, after 2 hours the broker was still unresponsive (and deep into swap) and had not yet bound to the AMQP port. I then killed the process with a "kill <pid>", detached the storage device, attached it to a larger machine, and restarted qpid there; it was able to start up cleanly. However, the messages which had begun to build up on the s&f broker's queue while the central broker was down were not automatically delivered.
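For context, the two queues and the route were set up with commands roughly equivalent to the following - this is from memory, so treat the host names, queue name, and exchange as placeholders rather than the exact invocation:

    # central broker: durable queue, 1 GB limit, flow-to-disk
    qpid-config -a centralhost add queue events --durable --max-queue-size 1073741824 --limit-policy flow-to-disk

    # s&f broker: local durable queue plus a src-local queue route to the central broker
    qpid-config -a sfhost add queue events --durable --limit-policy flow-to-disk
    qpid-route --durable --src-local queue add centralhost:5672 sfhost:5672 amq.direct events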
I'm not claiming there was nothing I could have done to re-establish the message flow, but it did not appear to re-establish itself on its own. I did try deleting the link and re-creating it, but this did not work. I also tried purging the central broker's queue using qpid-tool, thinking that maybe there was a corrupt message keeping it from being able to accept new ones, but this did not work either. For what it's worth, the IP address of the central broker changed between the initial small central broker and the upgraded larger central broker, so I imagine this didn't help things. But that's why I destroyed the link and re-created it.
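In case it's useful, the link deletion and re-creation was done along these lines (approximate; the broker addresses below are placeholders following qpid-route's <dest-broker> <src-broker> ordering):

    # drop and re-create the federation link, then see what the broker reports
    qpid-route link del <dest-broker> <src-broker>
    qpid-route link add <dest-broker> <src-broker>
    qpid-route link list <dest-broker>
    qpid-route route list <dest-broker>

The purge was done through qpid-tool against the central broker's queue object, but as noted, neither step got messages flowing again.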
When I cleared the slate and re-ran the enqueue tests between the s&f broker and the larger central broker, at some point long into the process (when about 2x the policy limit had been reached and messages had been "flowing to disk" for some time) the central broker crashed without any error messages. Again messages started building up on the s&f queue, but this time when I restarted qpid on the central broker the link was automatically re-established and messages from the s&f queue were transferred to the central broker properly.
<shrugs>
On Mar 12, 2010, at 5:26 AM, Kim van der Riet wrote:
On Thu, 2010-03-11 at 18:53 -0800, Charles Woerner wrote:
Wow, and never mind. As I wrote that, the queue stats updated; apparently the link was re-established and the entire contents of the store and forward queue were flushed to the destination broker properly. Seems to work as designed! The only real problem I can report is that when a destination broker dies due to memory starvation, the store may be left in a state which makes it subsequently unrecoverable. But I am unable to reproduce this on the new larger machine. Case closed. Thanks, and sorry for the noise.
Can you provide further details on the store unrecoverability you
encountered?
--
Charles Woerner | [email protected] | demandbase | 415.683.2669