Answers are inline.
________________________________________
From: Gordon Sim [[email protected]]
Sent: Wednesday, March 14, 2012 6:39 AM
To: [email protected]
Subject: Re: Major slowdown on Qpid broker

On 03/13/2012 10:35 PM, Jeff Armstrong wrote:
> I have a situation where the Qpid broker slows down tremendously, to the 
> point where enqueues stop altogether for long periods of time and dequeuing 
> is also quite slow. When I look at htop, there are 2 qpid threads running at 
> 100% CPU. When debugging in gdb, I see that every time I do a backtrace, 
> these two threads are somewhere in RingQueuePolicy::find(), and further up 
> the stack they are in DeliveryRecord::accept().
>
> Qpid broker/client setup:
> - Ubuntu 10.04
> - 0.12 C++ clients and brokers
> - Exchange options: direct
> - Queue options: Ring policy, max size ~400MB
> - Subscriber options: autoAck = 0, acceptMode = ACCEPT_MODE_EXPLICIT, 
> completionMode = COMPLETE_ON_ACCEPT

I assume the flow control is the default, i.e. unlimited?

Jeff: Yes, all other settings I didn't mention would be defaults.
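
For reference, a minimal sketch of how those subscriber options map onto the 
0.12 C++ client API, including the flow control field in question (the 
messageWindow value is purely illustrative, not part of this setup):

    #include <qpid/client/SubscriptionSettings.h>
    #include <qpid/client/FlowControl.h>

    using namespace qpid::client;

    SubscriptionSettings settings;
    settings.acceptMode = ACCEPT_MODE_EXPLICIT;       // broker holds messages until accept()
    settings.completionMode = COMPLETE_ON_ACCEPT;
    settings.autoAck = 0;                             // no automatic accepts
    settings.flowControl = FlowControl::unlimited();  // the default in question
    // A bounded alternative that caps the broker's in-doubt set:
    // settings.flowControl = FlowControl::messageWindow(1000);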

> - Message options: Delivery mode PERSISTENT
>
> I have 2 blades in a chassis, each with a broker running and a single sender 
> client that enqueues different messages to several bindings on the local 
> broker. Each blade also has two receiver clients that dequeue messages, but 
> only one blade's receiver clients are "active", meaning they connect to the 
> brokers on both blades, whereas the receiver clients on the "standby" blade 
> do nothing.

What does 'do nothing' mean in this context? Have they subscribed to the
queues without actually processing messages? Or have they not even
connected?

Jeff: The clients in standby are not even connected to the broker.

Do the receivers on the active blade dequeue from both brokers?

Jeff: Yes. In our case, the broker on blade 2 (whether active or on standby) 
only ever receives a few messages (about 15 in every 15-minute interval), so 
generally all activity occurs between the clients and the broker on blade 1.

> If an active blade goes down, the standby blade becomes active, the receiver 
> clients there will now connect to the brokers and start dequeuing,

How do the published messages get to the other broker?

Jeff: In this case, the unacquired messages on the broker of the blade that 
goes down will be lost, which is expected at this point.

> while the other blade will eventually reboot into standby mode.
>
> The two receiver clients each subscribe to their own single queue, both of 
> which are attached to the same binding on the same exchange. The clients' 
> normal behaviour is to dequeue messages for 15 minutes, finish processing 
> them, then send an accept() on the subscription for all the processed ids. 
> The idea is
> that if the active blade goes down, all of the messages that were not 
> accept()ed will be lost, so the clients on the standby blade will then 
> connect and should get these messages redelivered. This seems to have worked 
> in the past.
>
> The following events occurred (note that only blade 1 is actually enqueuing 
> to its broker; blade 2 has no enqueuing going on, which is on purpose):
> - blade 1 (active) and blade 2 (standby)
> - blade 1 reboots, so blade 2 becomes active then blade 1 comes up into 
> standby
> - blade 2 then reboots, so blade 1 becomes active then blade 2 goes into 
> standby
>
> We then made the following observations:
> - When blade 2 reboots, and blade 1 becomes active, the receiver clients 
> never output any of the expected redelivered messages. We think that the 
> redelivery never took place.
> - When inspecting the 'unacked' queue in SemanticState (and also the queue in 
> RingQueuePolicy) in gdb, we noticed about 100,000 messages in each client's 
> queue with old sequence numbers that correspond to 2/3 of the messages that 
> we never saw redelivered
> - The first 1/3 or so of the messages we expected to be redelivered weren't 
> in those queues
> - When we finally stopped one of the receiver clients, it cored (aborted), 
> the other receiver client died, and the qpid broker also cored
> - There was a logged qpid::TransportFailure exception that happened right 
> before all of these crashed
>
> Here are some of our thoughts/questions:
> - We think the 1/3 of the messages that vanished might have been because the 
> queue filled, and the ring policy caused them to be deleted
> - We think that the 2/3 of the messages we expected to be redelivered might 
> not have been redelivered because the session on the new clients might have 
> been started before the sessions of the clients that went down with a reboot 
> were ended. Is there some sort of session timeout that must occur before the 
> new session gets these redelivered? What happens in this case?

I'm still not quite clear on what exactly your clients do.

Jeff: A receiver client subscribes to a single queue and, using a 
qpid::client::LocalQueue, gets messages, does some processing on them, and 
writes the processed messages to an open temporary file. On a 15-minute 
interval, the client moves the temporary file to a permanent output 
directory and then sends an accept() to the broker for all the messages that 
were in that file, since they have now been fully processed.
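
To make that concrete, a hedged sketch of the receiver pattern described 
above (host, port, queue name, and the loop's exit condition are all 
assumptions; real code would run the fetch loop on a 15-minute timer):

    #include <qpid/client/Connection.h>
    #include <qpid/client/Session.h>
    #include <qpid/client/SubscriptionManager.h>
    #include <qpid/client/LocalQueue.h>
    #include <qpid/client/Message.h>
    #include <qpid/framing/SequenceSet.h>
    #include <qpid/sys/Time.h>

    using namespace qpid::client;
    using qpid::framing::SequenceSet;

    int main() {
        Connection connection;
        connection.open("localhost", 5672);   // hypothetical broker address
        Session session = connection.newSession();

        SubscriptionSettings settings;
        settings.acceptMode = ACCEPT_MODE_EXPLICIT;
        settings.completionMode = COMPLETE_ON_ACCEPT;
        settings.autoAck = 0;

        SubscriptionManager subs(session);
        LocalQueue incoming;
        Subscription sub = subs.subscribe(incoming, "events", settings); // hypothetical queue

        SequenceSet processed;   // ids of messages written to the temp file
        Message msg;
        while (incoming.get(msg, qpid::sys::TIME_SEC)) {
            // ... process msg and append it to the open temporary file ...
            processed.add(msg.getId());
        }
        // After moving the temp file to the permanent output directory,
        // acknowledge the whole batch in one call:
        sub.accept(processed);
        session.sync();
        connection.close();
        return 0;
    }

The same loop could just as well call sub.accept() every few thousand 
messages, which is the direction Gordon suggests below.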

> - We think the slowdown is because of the 100,000 unaccepted messages at the 
> front of the RingQueuePolicy's queue. We send about 200,000 ids to accept 
> after a 15-minute period, so for each of these messages, it will have to 
> traverse the 100,000 unaccepted ids. Could this account for such a huge 
> slowdown and 100% CPU usage on the accept() with 200,000 ids?

Yes, it could, especially if some of those messages have been removed
from the ring queue already to make room for newer messages.
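
As a simplified illustration (not the broker's actual code) of why that can 
hurt: if the ring policy tracks enqueued messages in a deque and locates each 
dequeued message by linear scan, then accepting ~200,000 ids while ~100,000 
unaccepted entries sit at the front of the deque costs on the order of 
200,000 x 100,000 comparisons:

    #include <deque>
    #include <stdint.h>

    struct QueuedMsg { uint32_t position; /* payload, etc. */ };

    // O(n) lookup, invoked once per accepted/dequeued message.
    bool find(const std::deque<QueuedMsg>& ring, uint32_t position) {
        for (std::deque<QueuedMsg>::const_iterator i = ring.begin();
             i != ring.end(); ++i) {
            if (i->position == position) return true;
        }
        return false;  // may already have been pushed out of the ring
    }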

Jeff: After looking through the code and doing some debugging on the broker, 
it's still not clear to me how the messages are stored. There seem to be 
several deques that correspond to a single queue on the broker. Are you saying 
that if a message is removed from the ring queue, the broker still maintains a 
reference to that message if it was unacked? If so, is this a leak, or does a 
copy of the actual message still exist somewhere else?

Can you send accepts more frequently? Batching accepts is good to some
extent, but if you can reduce the set of in-doubt messages held by the
broker you will likely improve the performance.

Jeff: I can configure the time interval to be a bit lower. I think the 
performance is only affected if the broker keeps a bunch of unacked messages 
that it also never redelivers (which sounds like a bug). The other strange 
thing is that I tried to simulate this same scenario by acquiring 100k 
messages and never accepting them, then continually acquiring and accepting 
batches of 200k messages, and the performance was still very fast. The 
difference was that it never seemed to call into the RingQueuePolicy, since the 
policy pointer was null on the queue. I guess this means that most of the work 
is actually done in RingQueuePolicy::dequeued()/RingQueuePolicy::find(), 
which matches the fact that every backtrace I got was somewhere in there. On a 
side note, I'm not sure why the queue in my attempt at simulating the problem 
didn't have a policy, since I also set the queue options to use a ring policy. 
Any ideas there?
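
On the side note: queue arguments generally only take effect when the queue 
is first created, so one possibility is that the simulated queue already 
existed (without a policy) before it was redeclared with the ring options. A 
minimal sketch of the declaration via the 0.12 client, assuming an open 
Session named session and a hypothetical queue name (the 400MB limit mirrors 
the setup described earlier):

    #include <qpid/client/QueueOptions.h>

    using namespace qpid::client;

    QueueOptions options;
    // RING policy, ~400MB byte limit, no message-count limit
    options.setSizePolicy(RING, 400*1024*1024, 0);
    session.queueDeclare(arg::queue="events", arg::arguments=options);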

> - No ideas about the crashes that occurred when we tried to stop one of the 
> receiver clients

If you have the core files, can you get backtraces for them to give some
clues as to where the crash occurred?

Jeff: I haven't had a chance to take a look at these yet; I will let you know 
what I find when I do.
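
For reference, a typical way to pull those backtraces once the cores are 
located (binary and core paths are illustrative):

    gdb /usr/sbin/qpidd /path/to/core.qpidd.<pid>
    (gdb) thread apply all bt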

Thanks for the help.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

