On 03/13/2012 10:35 PM, Jeff Armstrong wrote:
I have a situation where the Qpid broker slows down tremendously, to the point 
where enqueues stop altogether for long periods of time and dequeuing is also 
quite slow. When I look at htop, there are 2 qpid threads running at 100% CPU. 
When debugging in gdb, every backtrace I take shows these two threads 
somewhere in RingQueuePolicy::find(), called further up the stack from 
DeliveryRecord::accept().

Qpid broker/client setup:
- Ubuntu 10.04
- 0.12 C++ clients and brokers
- Exchange options: direct
- Queue options: Ring policy, max size ~400MB
- Subscriber options: autoAck = 0, acceptMode = ACCEPT_MODE_EXPLICIT, 
completionMode = COMPLETE_ON_ACCEPT

I assume the flow control is the default, i.e. unlimited?

- Message options: Delivery mode PERSISTENT

I have 2 blades in a chassis, each with a broker running and a single sender client that enqueues 
different messages to several bindings on the local broker. Each blade also has two receiver 
clients that dequeue messages, but only one blade's receiver clients are "active", 
meaning they connect to the brokers on both blades, whereas the receiver clients on the 
"standby" blade do nothing.

What does 'do nothing' mean in this context? Have they subscribed to the queues without actually processing messages? Or have they not even connected?

Do the receivers on the active blade dequeue from both brokers?

If an active blade goes down, the standby blade becomes active, the receiver 
clients there will now connect to the brokers and start dequeuing,

How do the published messages get to the other broker?

while the other blade will eventually reboot into standby mode.

The two receiver clients each subscribe to their own single queue, which are 
attached to the same binding on the same exchange. The clients' normal 
behaviour is to dequeue messages for 15 minutes, finish processing them, then 
send an accept() on the subscription for all the processed ids. The idea is 
that if the active blade goes down, all of the messages that were not 
accept()ed will be lost, so the clients on the standby blade will then connect 
and should get these messages redelivered. This seems to have worked in the 
past.

The following events occurred (note that only blade 1 is actually enqueuing to 
its broker, blade 2 has no enqueuing going on, this is on purpose):
- blade 1 (active) and blade 2 (standby)
- blade 1 reboots, so blade 2 becomes active then blade 1 comes up into standby
- blade 2 then reboots, so blade 1 becomes active then blade 2 goes into standby

We then made the following observations:
- When blade 2 reboots, and blade 1 becomes active, the receiver clients never 
output any of the expected redelivered messages. We think that the redelivery 
never took place.
- When inspecting the 'unacked' queue in SemanticState (and also the queue in 
RingQueuePolicy) in gdb, we noticed about 100,000 messages in each client's 
queue with old sequence numbers that correspond to 2/3 of the messages that we 
never saw redelivered
- The first 1/3 or so of the messages we expected to be redelivered weren't in 
those queues
- When we finally stopped one of the receiver clients, it cored (aborted), the 
other receiver client died, and the qpid broker also cored
- There was a logged qpid::TransportFailure exception that happened right 
before all of these crashed

Here are some of our thoughts/questions:
- We think the 1/3 of the messages that vanished might have been because the 
queue filled, and the ring policy caused them to be deleted
- We think that the 2/3 of the messages we expected to be redelivered might 
not have been redelivered because the sessions of the new clients might have 
started before the sessions of the clients that went down with the reboot had 
ended. Is there some sort of session timeout that must occur before the new 
session gets these redelivered? What happens in this case?

I'm still not quite clear on what exactly your clients do.

- We think the slowdown is because of the 100,000 unaccepted messages on the 
front of the RingQueuePolicy's queue. We send about 200,000 ids to accept after 
a 15 minute period, so for each of these messages, it will have to traverse 
over the 100,000 unaccepted ids. Could this account for such a huge slowdown 
and 100% cpu usage on the accept() with 200,000 ids?

Yes, it could, especially if some of those messages have been removed from the ring queue already to make room for newer messages.
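
To put a rough number on it, here is a toy model (not the broker's actual code; the function name and the linear-scan assumption are mine, based on what you describe seeing in RingQueuePolicy::find()). It counts the comparisons made when every accepted id is found by scanning from the front of the queue, past a block of stale entries that are never removed:

```cpp
#include <cstdint>
#include <deque>

// Hypothetical model, not Qpid code: `stale` old entries sit at the front of
// the queue, followed by `accepted` entries that are then accepted in order.
// Each accept scans from the front until it finds its id, then removes it.
std::uint64_t comparisonsForAccepts(std::uint64_t stale, std::uint64_t accepted)
{
    std::deque<std::uint64_t> queue;
    for (std::uint64_t id = 0; id < stale + accepted; ++id) queue.push_back(id);

    std::uint64_t comparisons = 0;
    for (std::uint64_t id = stale; id < stale + accepted; ++id) {
        for (auto it = queue.begin(); it != queue.end(); ++it) {
            ++comparisons;
            if (*it == id) {
                queue.erase(it);  // accepted entry leaves the queue
                break;
            }
        }
    }
    return comparisons;
}
```

Under this model the count is accepted * (stale + 1). With your figures (100,000 stale entries, 200,000 accepted ids) that is on the order of 2 x 10^10 comparisons for a single accept() call, so evaluate the formula rather than running the simulation at that size.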

Can you send accepts more frequently? Batching accepts is good to some extent, but if you can reduce the set of in-doubt messages held by the broker you will likely improve the performance.
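
One way to do that on the client side is sketched below. AcceptBatcher is a hypothetical helper of my own, not part of the Qpid API, and the batch size is an assumption you would need to tune. It collects processed ids and hands them to a callback once a limit is reached; in your client that callback would make the same accept() call on the subscription that you make today, just with far fewer ids each time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the Qpid API: collects processed message
// ids and passes them to `flush` once `maxBatch` ids have accumulated.
class AcceptBatcher {
public:
    using Flush = std::function<void(const std::vector<std::uint32_t>&)>;

    AcceptBatcher(std::size_t maxBatch, Flush flush)
        : maxBatch_(maxBatch), flush_(std::move(flush))
    {
        assert(maxBatch_ > 0);
    }

    // Record one processed id; flushes automatically when the batch is full.
    void processed(std::uint32_t id)
    {
        pending_.push_back(id);
        if (pending_.size() >= maxBatch_) flushNow();
    }

    // Send whatever is pending, e.g. from a timer or before shutdown.
    void flushNow()
    {
        if (pending_.empty()) return;
        flush_(pending_);
        pending_.clear();
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t maxBatch_;
    Flush flush_;
    std::vector<std::uint32_t> pending_;
};
```

Calling processed() after each message is handled, plus flushNow() on a short timer, keeps the set of in-doubt messages bounded by the batch size instead of growing for the full 15 minutes.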

- No ideas about the crashes that occurred when we tried to stop one of the 
receiver clients

If you have the core files, can you get backtraces for them to give some clues as to where the crash occurred?
