On Wed, Apr 29, 2015 at 3:51 PM, Gordon Sim <[email protected]> wrote:

> On 04/29/2015 08:03 PM, Matt Broadstone wrote:
>
>> On Wed, Apr 29, 2015 at 3:01 PM, Matt Broadstone <[email protected]>
>> wrote:
>>
>>> On Wed, Apr 29, 2015 at 2:55 PM, Gordon Sim <[email protected]> wrote:
>>>
>>>> On 04/29/2015 05:46 PM, Matt Broadstone wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a service using the C++ Messaging API which connects to a
>>>>> single instance of qpidd (currently on the same machine), which seems
>>>>> to crash out with this exception every couple of days under moderate
>>>>> load:
>>>>>
>>>>> qpidd[68257]: 2015-04-28 11:56:38 [Broker] error
>>>>> qpid.192.168.2.225:5672-192.168.2.148:60492: resource-limit-exceeded:
>>>>> Maximum depth exceeded on
>>>>> b1386bee-a36c-449d-953f-c25f4842e76d_hive.guest.metadata_7bf9355b-524b-4853-89bd-1848366cd21f:
>>>>> current=[count: 389438, size: 104857546], max=[size: 104857600]
>>>>> (/build/buildd/qpid-cpp-0.28/src/qpid/broker/Queue.cpp:1575)
>>>>>
>>>>> Using qpid-stat I don't see the queue depth ever increase from 0
>>>>> (which I gather, from reading the code, is why the exception is
>>>>> thrown), however I -do- notice that the "acquired" count is increasing
>>>>> with every message with no corresponding "release" (release count is
>>>>> always 0).
>>>>>
>>>>>
>>>> That's actually 'expected', in terms of the code. It only increments
>>>> the released count when a message is released back to the queue,
>>>> rather than being acknowledged and dequeued. Also, there is nothing at
>>>> present that decrements the acquired count, so it would be expected to
>>>> keep going up.
>>>>
>>>>
>>> Okay, good to know; just making sure I wasn't seeing a huge problem
>>> with improperly handled messages here.
>>>
>>>
>>>> The exception above is indeed a result of the queue backing up,
>>>> apparently reaching a depth of 389438 messages. What address options,
>>>> if any, are used for the receiver consuming from that queue? Is there
>>>> anything to indicate whether that receiver was behaving normally just
>>>> before the point at which the error occurred?
>>>>
>>>>
>>> I'm using no address options at all. The two programs I submitted
>>> earlier (mqget/mqsend) are reduced examples of what we're using (except
>>> that the receiver in my case uses the "multiple receivers"
>>> session.nextReceiver().fetch() pattern). Aside from that it's very
>>> "vanilla" right now. AFAICT everything was fine, until it wasn't. The
>>> original bug occurred with version 0.28, so maybe the issue is that it
>>> was still using the legacy store? However, everything I see here
>>> indicates nothing ever touched the disk (these are just messages being
>>> published to a topic). As for the receiving side, each receiver (and
>>> this one in particular) is set to a prefetch (capacity) of 10.
>>>
>>> What seems particularly strange to me is that the backlog is hundreds
>>> of thousands of messages; how could that even be possible? Right now we
>>> have about 10 producers publishing every ~6 seconds.
>>>
>>> Matt
>>>
>>>
>>>
>> Also, what is the recommended failover scenario for this situation?
>> Basically what happened for us is that this "situation" occurred, and
>> then we were no longer receiving ANY messages on that receiver, which
>> took our whole system down. The "workaround" was to simply restart the
>> qpidd process.
>>
>
> Did you restart the clients (do you use auto reconnect)? Did you run
> qpid-stat at the time the incident occurred? Did you try restarting the
> receiver before restarting qpidd?
>
>
The clients do use auto reconnect; however, when this happened there was
no error on the client side, IIRC. I was actually not available when the
issue occurred, and the guys working on the system at the time were in QA
and needed to fix it immediately, so unfortunately I have limited
knowledge of the statistics/state at the time of failure.
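For reference, this is roughly how the clients set up the connection. It's
a minimal sketch rather than our exact code: the broker address is a
placeholder and the option values are just illustrative, though the
options themselves are the standard qpid::messaging connection options.

#include <qpid/messaging/Connection.h>

using namespace qpid::messaging;

int main() {
    // Placeholder URL; ours points at the local qpidd instance.
    // With reconnect enabled the client retries transparently, which
    // may be why nobody saw an error on the client side.
    Connection connection("localhost:5672",
                          "{reconnect: true, reconnect_interval: 5}");
    connection.open();

    // ... create sessions and receivers here ...

    connection.close();
    return 0;
}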

I'm going to try to reproduce the problem locally with the 0.32 fix you
mentioned above (thanks for that). Do you have any ideas why this issue
might occur (i.e. what would cause the queue depth to start increasing),
so I can try to speed up my testing?
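In case it's useful, here is a reduced sketch of the consuming side I'll
be testing with: a couple of receivers on one session, each with a
capacity (prefetch) of 10, serviced with nextReceiver()/fetch(). The
addresses are simplified stand-ins for our real topics.

#include <qpid/messaging/Connection.h>
#include <qpid/messaging/Session.h>
#include <qpid/messaging/Receiver.h>
#include <qpid/messaging/Message.h>
#include <iostream>

using namespace qpid::messaging;

int main() {
    Connection connection("localhost:5672", "{reconnect: true}");
    try {
        connection.open();
        Session session = connection.createSession();

        // Stand-in addresses; the real ones are our topic exchanges.
        Receiver metadata = session.createReceiver("hive.guest.metadata");
        Receiver status = session.createReceiver("hive.guest.status");
        metadata.setCapacity(10);
        status.setCapacity(10);

        while (true) {
            // Blocks until one of this session's receivers has a message.
            Receiver next = session.nextReceiver();
            Message message = next.fetch();
            std::cout << message.getContent() << std::endl;
            session.acknowledge(message);
        }
    } catch (const std::exception& error) {
        std::cerr << error.what() << std::endl;
        connection.close();
        return 1;
    }
}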

Matt