Thanks for the detail. I had thought that you had suffered a recovery failure in which phase 1 recovery had failed - i.e. the store was unable to analyze the stored messages on disk owing to some sort of disk corruption or similar. But much of the detail here is as you have already described it - sorry for being vague in my request.
I have looked closely at the code, and believe that there may be a bug in the recovery section of the code. The recovery code does not appear to enforce queue policy during recovery, allowing (as I believe you have observed) the loading of messages to exceed policy. I already have a code fix for this - all I lack at the moment is a test case. The problem is that I can't run huge tests like yours; I need to expose this behavior using a more modest case, and that cannot be done from the client alone. Whether or not message content is in fact being released/discarded from memory requires broker access to ascertain. Perhaps I can find another way to test that this works from the client alone (a rough sketch of one such client-side check is appended below the quoted thread). I plan to open a Jira for this on Monday. Thanks once again for your help.

On Fri, 2010-03-12 at 10:41 -0800, Charles Woerner wrote:
> Sure. As I mentioned, I was running a test where I shut down the consumers and enqueued large amounts of data. I was running these tests in EC2 in a store and forward (src-local queue route) topology. The local s&f broker had a small-ish (10 GB) store and a single durable queue with a default max-queue-size limit and a flow-to-disk policy. The s&f broker had a queue route to a durable queue on the central broker with a 100 MB store, a max-queue-size limit of 1 GB, and a flow-to-disk policy. Although the max-queue-size was only 1 GB, qpid continued to accept messages and acquire memory beyond the physical memory limit and into swap - at this point it died. So I tried to restart it, but qpid kept complaining that it lost contact with a child process. So I allocated more swap and tried to restart it and, although this helped me overcome the initial critical error, after 2 hours it was still unresponsive (and deep into swap) and had not yet bound to the amqp port. I then killed the process with a "kill <pid>". I then detached the storage device, attached it to a larger machine and restarted qpid, and it was able to start up cleanly. However, the messages which had begun to build up on the s&f broker's queue while the central broker was down were not automatically delivered.
>
> I don't claim that there is nothing I could have done to re-establish the message flow, but it did not appear to re-establish itself on its own. I did try deleting the link and re-creating it, but this did not work. I also tried purging the central broker's queue using qpid-tool, thinking that maybe there was a corrupt message keeping it from being able to accept new ones, but this did not work. For what it's worth, the IP address of the central broker changed between the initial small central broker and the upgraded larger central broker, so I imagine this didn't help things. But that's why I destroyed the link and re-created it.
>
> When I cleared the slate and re-ran the enqueue tests between the s&f broker and the larger central broker again, at some point long into the process (when about 2x the policy limit had been reached and messages had been "flowing-to-disk" for some time) the central broker crashed without any error messages. Again messages started building up on the s&f queue, but this time when I restarted qpid on the central broker the link was automatically re-established and messages from the s&f queue were transferred to the central broker properly.
>
> <shrugs>
>
> On Mar 12, 2010, at 5:26 AM, Kim van der Riet wrote:
>
> > On Thu, 2010-03-11 at 18:53 -0800, Charles Woerner wrote:
> >> Wow, and nevermind.
> >> As I wrote that, the queue stats updated and apparently the link was re-established, and the entire contents of the store and forward queue were now flushed to the destination broker properly. Seems to work as designed! The only real problem I can report is that when a destination broker dies due to memory starvation, the store may be left in a state which makes it subsequently unrecoverable. But I am unable to reproduce on this new larger machine. Case closed. Thanks, and sorry for the noise.
> >
> > Can you provide further details on the store unrecoverability you encountered?
>
> __
> Charles Woerner | [email protected] | demandbase | 415.683.2669

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project: http://qpid.apache.org
Use/Interact: mailto:[email protected]
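
For reference, here is a minimal sketch of the kind of modest, client-driven reproduction described above, assuming the Python qpid.messaging client: create a small durable queue with a low max-queue-size and a flow-to-disk policy, then enqueue well past the limit before killing and restarting the broker. The broker URL, queue name, and sizes are placeholders, and the 'qpid.max_size'/'qpid.policy_type' queue arguments are the usual knobs for these policies on the C++ broker, though exact names can vary between releases.

    #!/usr/bin/env python
    # Sketch only: enqueue past a small max-queue-size limit on a durable,
    # flow-to-disk queue, so that a subsequent broker kill/restart exercises
    # recovery against the policy. Names and sizes are placeholders.
    from qpid.messaging import Connection, Message

    BROKER = "localhost:5672"   # placeholder broker address
    # Durable queue with a ~1 MB limit and flow-to-disk policy; the
    # qpid.max_size / qpid.policy_type arguments are assumed to match
    # the broker release in use.
    ADDRESS = ("recovery-test; {create: always, node: {durable: True, "
               "x-declare: {arguments: {'qpid.max_size': 1048576, "
               "'qpid.policy_type': 'flow_to_disk'}}}}")

    conn = Connection(BROKER)
    conn.open()
    try:
        session = conn.session()
        sender = session.sender(ADDRESS)
        payload = "x" * 1024            # 1 KB per message
        for i in range(10000):          # ~10 MB total, well past the 1 MB limit
            sender.send(Message(content=payload, durable=True))
        session.sync()
    finally:
        conn.close()

After this has run, killing and restarting the broker and then comparing the recovered queue's depth and the broker's memory use against the 1 MB limit (e.g. with qpid-tool) would show whether recovery is honoring the policy - though, as noted above, confirming that message content is actually released from memory still needs broker-side inspection.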
