Hi Ramayan,

We have been discussing your problem here and have a couple of questions.
I have been experimenting with use cases based on your descriptions above but, so far, have been unsuccessful in reproducing a "java.lang.OutOfMemoryError: Direct buffer memory" condition. The direct memory usage reflects the expected model: it levels off when the flow to disk threshold is reached, and direct memory is released as messages are consumed until the minimum size of the direct memory cache is reached.

1] For clarity, let me check: we believe that when you say "patch to use MultiQueueConsumer" you are referring to the patch attached to QPID-7462 "Add experimental "pull" consumers to the broker", and that you are using a combination of this "x-pull-only" with the standard "x-multiqueue" feature. Is this correct?

2] One idea we had here relates to the size of the virtualhost IO pool. As you know from the documentation, the Broker caches/reuses direct memory internally, but the documentation fails to mention that each pooled virtualhost IO thread also grabs a chunk (256K) of direct memory from this cache. By default the virtualhost IO pool is sized Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so if you have a machine with a very large number of cores, you may have a surprisingly large amount of direct memory assigned to virtualhost IO threads (a short back-of-the-envelope sketch of this calculation appears further down this mail). Check the value of connectionThreadPoolSize on the virtualhost (http://<server>:<port>/api/latest/virtualhost/<virtualhostnodename>/<virtualhostname>) to see what value is in force. What is it? It is possible to tune the pool size using the context variable virtualhost.connectionThreadPool.size.

3] Tell me if you are tuning the Broker in any way beyond the direct/heap memory settings you have told us about already. For instance, are you changing any of the direct memory pooling settings (broker.directByteBufferPoolSize), the default network buffer size (qpid.broker.networkBufferSize), or applying any other non-standard settings?

4] How many virtual hosts do you have on the Broker?

5] What is the consumption pattern of the messages? Do you consume in a strictly FIFO fashion, or are you making use of message selectors and/or any of the out-of-order queue types (LVQs, priority queues or sorted queues)?

6] Is it just the 0.16 client involved in the application? Can I check that you are not using any of the AMQP 1.0 clients (org.apache.qpid:qpid-jms-client or org.apache.qpid:qpid-amqp-1-0-client) in the software stack, as either consumers or producers?

Hopefully the answers to these questions will get us closer to a reproduction. If you are able to reliably reproduce it, please share the steps with us.

Kind regards, Keith.

On 20 April 2017 at 10:21, Ramayan Tiwari <[email protected]> wrote:
> After a lot of log mining, we might have a way to explain the sustained increase in DirectMemory allocation; the correlation seems to be with the growth in the size of a queue that is getting consumed, but at a much slower rate than producers putting messages on this queue.
>
> The pattern we see is that in each instance of broker crash, there is at least one queue (usually 1 queue) whose size kept growing steadily. It’d be of significant size but not the largest queue -- usually there are multiple larger queues -- but it was different from other queues in that its size was growing steadily. The queue would also be moving, but its processing rate was not keeping up with the enqueue rate.
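Referring back to question 2 above: here is a minimal back-of-the-envelope sketch, in plain Java, of what the default pool-size formula implies for direct memory held by virtualhost IO threads. It only uses the formula and the 256K-per-thread figure quoted in that question; the class and variable names are illustrative and not part of the Broker's API.

    public class IoPoolFootprintEstimate
    {
        public static void main(String[] args)
        {
            // Default virtualhost IO pool size, as quoted in question 2.
            int poolSize = Math.max(Runtime.getRuntime().availableProcessors() * 2, 64);

            // Each pooled IO thread grabs a 256K chunk of direct memory from the Broker's cache.
            long chunkBytes = 256L * 1024;
            long directBytes = poolSize * chunkBytes;

            System.out.println("default connectionThreadPoolSize = " + poolSize);
            System.out.println("direct memory held by IO threads ~= " + directBytes / (1024 * 1024) + " MB");
        }
    }

On a 64-core host, for example, the default works out to 128 IO threads and roughly 32 MB of direct memory before any messages or connections are counted; the pool can be shrunk with the virtualhost.connectionThreadPool.size context variable mentioned above.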
>
> Our theory that might be totally wrong: If a queue is moving the entire time, maybe then the broker would keep reusing the same buffer in direct memory for the queue, and keep on adding onto it at the end to accommodate new messages. But because it’s active all the time and we’re pointing to the same buffer, space allocated for messages at the head of the queue/buffer doesn’t get reclaimed, even long after those messages have been processed. Just a theory.
>
> We are also trying to reproduce this using some perf tests to enqueue with same pattern, will update with the findings.
>
> Thanks
> Ramayan
>
> On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari <[email protected]> wrote:
>
>> Another issue that we noticed is when broker goes OOM due to direct memory, it doesn't create heap dump (specified by "-XX:+HeapDumpOnOutOfMemoryError"), even when the OOM error is same as what is mentioned in the oracle JVM docs ("java.lang.OutOfMemoryError").
>>
>> Has anyone been able to find a way to get to heap dump for DM OOM?
>>
>> - Ramayan
>>
>> On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari <[email protected]> wrote:
>>
>>> Alex,
>>>
>>> Below are the flow to disk logs from broker having 3million+ messages at this time. We only have one virtual host. Time is in GMT. Looks like flow to disk is active on the whole virtual host and not a queue level.
>>>
>>> When the same broker went OOM yesterday, I did not see any flow to disk logs from when it was started until it crashed (crashed twice within 4hrs).
>>>
>>> 4/19/17 4:17:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356539KB exceeds threshold 3355443KB
>>> 4/19/17 2:31:13.502 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB within threshold 3355443KB
>>> 4/19/17 2:28:43.511 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358509KB exceeds threshold 3355443KB
>>> 4/19/17 2:20:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB within threshold 3355443KB
>>> 4/19/17 2:18:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357544KB exceeds threshold 3355443KB
>>> 4/19/17 2:08:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB within threshold 3355443KB
>>> 4/19/17 2:08:13.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356704KB exceeds threshold 3355443KB
>>> 4/19/17 2:00:43.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB within threshold 3355443KB
>>> 4/19/17 2:00:13.504 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357948KB exceeds threshold 3355443KB
>>> 4/19/17 1:50:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB within threshold 3355443KB
>>> 4/19/17 1:47:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3365624KB exceeds threshold 3355443KB
>>> 4/19/17 1:43:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB within threshold 3355443KB
>>> 4/19/17 1:31:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358683KB exceeds threshold 3355443KB
>>>
>>> After production release (2 days back), we have seen 4 crashes in 3 different brokers, this is the most pressing concern for us in decision if we should roll back to 0.32. Any help is greatly appreciated.
>>>
>>> Thanks
>>> Ramayan
>>>
>>> On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[email protected]> wrote:
>>>
>>>> Ramayan,
>>>> Thanks for the details. I would like to clarify whether flow to disk was triggered today for 3 million messages?
>>>>
>>>> The following logs are issued for flow to disk:
>>>> BRK-1014 : Message flow to disk active : Message memory use {0,number,#}KB exceeds threshold {1,number,#.##}KB
>>>> BRK-1015 : Message flow to disk inactive : Message memory use {0,number,#}KB within threshold {1,number,#.##}KB
>>>>
>>>> Kind Regards,
>>>> Alex
>>>>
>>>> On 19 April 2017 at 17:10, Ramayan Tiwari <[email protected]> wrote:
>>>>
>>>> > Hi Alex,
>>>> >
>>>> > Thanks for your response, here are the details:
>>>> >
>>>> > We use "direct" exchange, without persistence (we specify NON_PERSISTENT while sending from client) and use BDB store. We use JSON virtual host type. We are not using SSL.
>>>> >
>>>> > When the broker went OOM, we had around 1.3 million messages with 100 bytes average message size. Direct memory allocation (value read from MBean) kept going up, even though it wouldn't need more DM to store these many messages. DM allocated persisted at 99% for about 3 and half hours before crashing.
>>>> >
>>>> > Today, on the same broker we have 3 million messages (same message size) and DM allocated is only at 8%. This seems like there is some issue with de-allocation or a leak.
>>>> >
>>>> > I have uploaded the memory utilization graph here:
>>>> > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
>>>> > Blue line is DM allocated, Yellow is DM Used (sum of queue payload) and Red is heap usage.
>>>> >
>>>> > Thanks
>>>> > Ramayan
>>>> >
>>>> > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy <[email protected]> wrote:
>>>> >
>>>> > > Hi Ramayan,
>>>> > >
>>>> > > Could please share with us the details of messaging use case(s) which ended up in OOM on broker side?
>>>> > > I would like to reproduce the issue on my local broker in order to fix it.
>>>> > > I would appreciate if you could provide as much details as possible, including messaging topology, message persistence type, message sizes, volumes, etc.
>>>> > >
>>>> > > Qpid Broker 6.0.x uses direct memory for keeping message content and receiving/sending data. Each plain connection utilizes 512K of direct memory. Each SSL connection uses 1M of direct memory. Your memory settings look Ok to me.
>>>> > > Kind Regards,
>>>> > > Alex
>>>> > >
>>>> > > On 18 April 2017 at 23:39, Ramayan Tiwari <[email protected]> wrote:
>>>> > >
>>>> > > > Hi All,
>>>> > > >
>>>> > > > We are using Java broker 6.0.5, with patch to use MultiQueueConsumer feature. We just finished deploying to production and saw couple of instances of broker OOM due to running out of DirectMemory buffer (exceptions at the end of this email).
>>>> > > >
>>>> > > > Here is our setup:
>>>> > > > 1. Max heap 12g, max direct memory 4g (this is opposite of what the recommendation is, however, for our use cause message payload is really small ~400bytes and is way less than the per message overhead of 1KB). In perf testing, we were able to put 2 million messages without any issues.
>>>> > > > 2. ~400 connections to broker.
>>>> > > > 3. Each connection has 20 sessions and there is one multi queue consumer attached to each session, listening to around 1000 queues.
>>>> > > > 4. We are still using 0.16 client (I know).
>>>> > > >
>>>> > > > With the above setup, the baseline utilization (without any messages) for direct memory was around 230mb (with 410 connection each taking 500KB).
>>>> > > >
>>>> > > > Based on our understanding of broker memory allocation, message payload should be the only thing adding to direct memory utilization (on top of baseline), however, we are experiencing something completely different. In our last broker crash, we see that broker is constantly running with 90%+ direct memory allocated, even when message payload sum from all the queues is only 6-8% (these % are against available DM of 4gb). During these high DM usage period, heap usage was around 60% (of 12gb).
>>>> > > >
>>>> > > > We would like some help in understanding what could be the reason of these high DM allocations. Are there things other than message payload and AMQP connection, which use DM and could be contributing to these high usage?
>>>> > > >
>>>> > > > Another thing where we are puzzled is the de-allocation of DM byte buffers. From log mining of heap and DM utilization, de-allocation of DM doesn't correlate with heap GC. If anyone has seen any documentation related to this, it would be very helpful if you could share that.
>>>> > > >
>>>> > > > Thanks
>>>> > > > Ramayan
>>>> > > >
>>>> > > > *Exceptions*
>>>> > > >
>>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
>>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
>>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
>>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
>>>> > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.restoreApplicationBufferForWrite(NonBlockingConnectionPlainDelegate.java:93) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:60) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:506) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:285) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:124) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:504) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:337) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:87) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
>>>> > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
>>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
>>>> > > >
>>>> > > > *Second exception*
>>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
>>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
>>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
>>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
>>>> > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.<init>(NonBlockingConnectionPlainDelegate.java:45) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.setTransportEncryption(NonBlockingConnection.java:625) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.<init>(NonBlockingConnection.java:117) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.NonBlockingNetworkTransport.acceptSocketChannel(NonBlockingNetworkTransport.java:158) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask$1.run(SelectorThread.java:191) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>>> > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
>>>> > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
>>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
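Since much of the discussion above hinges on watching the "DM allocated" number, here is a minimal sketch using only the standard JDK BufferPoolMXBean (Java 7+, nothing Qpid-specific, and not necessarily the MBean Ramayan was reading) that prints the JVM's own view of direct buffer usage; the same values are available over JMX under the object name java.nio:type=BufferPool,name=direct.

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class DirectBufferPoolProbe
    {
        public static void main(String[] args)
        {
            // The platform exposes one BufferPoolMXBean per pool; the "direct" pool
            // covers all direct ByteBuffers allocated in the JVM, including the Broker's.
            for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class))
            {
                if ("direct".equals(pool.getName()))
                {
                    System.out.println("direct buffers: count=" + pool.getCount()
                            + ", used=" + pool.getMemoryUsed()
                            + " bytes, capacity=" + pool.getTotalCapacity() + " bytes");
                }
            }
        }
    }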
