After a lot of log mining, we might have a way to explain the sustained increase in DirectMemory allocation: it appears to correlate with the growth in size of a queue that is being consumed, but at a much slower rate than producers are putting messages onto it.
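
To mimic that load shape in a perf test, we are thinking of something along these lines: producers hammering a single queue while one deliberately throttled consumer drains it. A minimal sketch is below (plain JMS against the old Qpid client; the class name, connection URL, queue name, message size and sleep are illustrative placeholders, not our production values):

import javax.jms.BytesMessage;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Session;

import org.apache.qpid.client.AMQConnectionFactory;

public class SlowConsumerRepro
{
    public static void main(String[] args) throws Exception
    {
        // Placeholder connection URL; point it at the broker under test.
        ConnectionFactory factory = new AMQConnectionFactory(
                "amqp://guest:guest@clientid/test?brokerlist='tcp://localhost:5672'");
        Connection connection = factory.createConnection();
        connection.start();

        // Producer thread: enqueue small non-persistent messages as fast as possible.
        new Thread(() ->
        {
            try
            {
                Session producerSession = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer =
                        producerSession.createProducer(producerSession.createQueue("repro.slow.queue"));
                producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
                byte[] payload = new byte[400]; // roughly our production message size
                while (true)
                {
                    BytesMessage message = producerSession.createBytesMessage();
                    message.writeBytes(payload);
                    producer.send(message);
                }
            }
            catch (JMSException e)
            {
                e.printStackTrace();
            }
        }).start();

        // Consumer: drain the same queue, but throttled so the queue depth grows steadily.
        Session consumerSession = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer =
                consumerSession.createConsumer(consumerSession.createQueue("repro.slow.queue"));
        while (true)
        {
            Message received = consumer.receive();
            if (received != null)
            {
                Thread.sleep(5); // artificial processing delay
            }
        }
    }
}

The exact rates don't matter much for the experiment; the point is only that enqueue consistently outpaces dequeue on one always-busy queue.
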
The pattern we see is that in each instance of a broker crash, there is at least one queue (usually exactly one) whose size kept growing steadily. It would be of significant size but not the largest queue -- usually there are multiple larger queues -- what set it apart was that its size grew steadily. The queue would also be moving, but its processing rate was not keeping up with the enqueue rate.

Our theory, which might be totally wrong: if a queue is moving the entire time, the broker might keep reusing the same buffer in direct memory for that queue, and keep appending to the end of it to accommodate new messages. But because the queue is active all the time and we keep pointing to the same buffer, the space allocated for messages at the head of the queue/buffer never gets reclaimed, even long after those messages have been processed. Just a theory; a rough illustration of the behaviour we suspect is at the very bottom of this mail, below the quoted thread. We are also trying to reproduce this using perf tests that enqueue with the same pattern (along the lines of the sketch above), and will update with the findings.

Thanks
Ramayan

On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari <[email protected]> wrote:

> Another issue that we noticed: when the broker goes OOM due to direct memory, it doesn't create a heap dump (requested via "-XX:+HeapDumpOnOutOfMemoryError"), even though the OOM error is the same "java.lang.OutOfMemoryError" mentioned in the Oracle JVM docs.
>
> Has anyone been able to find a way to get a heap dump for a DM OOM?
>
> - Ramayan
>
> On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari <[email protected]> wrote:
>
>> Alex,
>>
>> Below are the flow to disk logs from the broker holding 3 million+ messages at this time. We only have one virtual host. Times are in GMT. It looks like flow to disk is activated for the whole virtual host and not at a queue level.
>>
>> When the same broker went OOM yesterday, I did not see any flow to disk logs from when it was started until it crashed (it crashed twice within 4 hours).
>>
>> 4/19/17 4:17:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356539KB exceeds threshold 3355443KB
>> 4/19/17 2:31:13.502 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB within threshold 3355443KB
>> 4/19/17 2:28:43.511 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358509KB exceeds threshold 3355443KB
>> 4/19/17 2:20:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB within threshold 3355443KB
>> 4/19/17 2:18:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357544KB exceeds threshold 3355443KB
>> 4/19/17 2:08:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB within threshold 3355443KB
>> 4/19/17 2:08:13.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356704KB exceeds threshold 3355443KB
>> 4/19/17 2:00:43.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB within threshold 3355443KB
>> 4/19/17 2:00:13.504 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357948KB exceeds threshold 3355443KB
>> 4/19/17 1:50:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB within threshold 3355443KB
>> 4/19/17 1:47:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3365624KB exceeds threshold 3355443KB
>> 4/19/17 1:43:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB within threshold 3355443KB
>> 4/19/17 1:31:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358683KB exceeds threshold 3355443KB
>>
>> Since the production release (2 days back), we have seen 4 crashes across 3 different brokers; this is the most pressing concern for us in deciding whether we should roll back to 0.32. Any help is greatly appreciated.
>>
>> Thanks
>> Ramayan
>>
>> On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[email protected]> wrote:
>>
>>> Ramayan,
>>> Thanks for the details. I would like to clarify: was flow to disk triggered today for the 3 million messages?
>>>
>>> The following logs are issued for flow to disk:
>>> BRK-1014 : Message flow to disk active : Message memory use {0,number,#}KB exceeds threshold {1,number,#.##}KB
>>> BRK-1015 : Message flow to disk inactive : Message memory use {0,number,#}KB within threshold {1,number,#.##}KB
>>>
>>> Kind Regards,
>>> Alex
>>>
>>> On 19 April 2017 at 17:10, Ramayan Tiwari <[email protected]> wrote:
>>>
>>> > Hi Alex,
>>> >
>>> > Thanks for your response, here are the details:
>>> >
>>> > We use a "direct" exchange, without persistence (we specify NON_PERSISTENT while sending from the client), and use a BDB store. We use the JSON virtual host type. We are not using SSL.
>>> >
>>> > When the broker went OOM, we had around 1.3 million messages with 100 bytes average message size. Direct memory allocation (value read from the MBean) kept going up, even though it shouldn't need more DM to store that many messages. DM allocated stayed at 99% for about three and a half hours before the crash.
>>> >
>>> > Today, on the same broker, we have 3 million messages (same message size) and DM allocated is only at 8%. This looks like there is some issue with de-allocation, or a leak.
>>> >
>>> > I have uploaded the memory utilization graph here:
>>> > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
>>> > The blue line is DM allocated, yellow is DM used (sum of queue payload) and red is heap usage.
>>> >
>>> > Thanks
>>> > Ramayan
>>> >
>>> > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy <[email protected]> wrote:
>>> >
>>> > > Hi Ramayan,
>>> > >
>>> > > Could you please share with us the details of the messaging use case(s) which ended up in OOM on the broker side?
>>> > > I would like to reproduce the issue on my local broker in order to fix it.
>>> > > I would appreciate it if you could provide as much detail as possible, including messaging topology, message persistence type, message sizes, volumes, etc.
>>> > >
>>> > > Qpid Broker 6.0.x uses direct memory for keeping message content and for receiving/sending data. Each plain connection utilizes 512K of direct memory. Each SSL connection uses 1M of direct memory. Your memory settings look OK to me.
>>> > >
>>> > > Kind Regards,
>>> > > Alex
>>> > >
>>> > > On 18 April 2017 at 23:39, Ramayan Tiwari <[email protected]> wrote:
>>> > >
>>> > > > Hi All,
>>> > > >
>>> > > > We are using Java broker 6.0.5, with a patch to use the MultiQueueConsumer feature. We just finished deploying to production and saw a couple of instances of broker OOM due to running out of DirectMemory buffer (exceptions at the end of this email).
>>> > > >
>>> > > > Here is our setup:
>>> > > > 1. Max heap 12g, max direct memory 4g (this is the opposite of the recommendation; however, for our use case the message payload is really small, ~400 bytes, which is way less than the per-message overhead of 1KB). In perf testing, we were able to put 2 million messages without any issues.
>>> > > > 2. ~400 connections to the broker.
>>> > > > 3. Each connection has 20 sessions and there is one multi-queue consumer attached to each session, listening to around 1000 queues.
>>> > > > 4. We are still using the 0.16 client (I know).
>>> > > >
>>> > > > With the above setup, the baseline utilization (without any messages) for direct memory was around 230mb (with 410 connections each taking 500KB).
>>> > > >
>>> > > > Based on our understanding of broker memory allocation, message payload should be the only thing adding to direct memory utilization (on top of the baseline); however, we are experiencing something completely different. In our last broker crash, we see that the broker is constantly running with 90%+ direct memory allocated, even when the message payload sum across all the queues is only 6-8% (these percentages are against the available DM of 4gb). During these high DM usage periods, heap usage was around 60% (of 12gb).
>>> > > >
>>> > > > We would like some help in understanding what could be the reason for these high DM allocations. Are there things other than message payload and AMQP connections that use DM and could be contributing to this high usage?
>>> > > >
>>> > > > Another thing that puzzles us is the de-allocation of DM byte buffers. From log mining of heap and DM utilization, de-allocation of DM doesn't correlate with heap GC. If anyone has seen any documentation related to this, it would be very helpful if you could share it.
>>> > > >
>>> > > > Thanks
>>> > > > Ramayan
>>> > > >
>>> > > > *Exceptions*
>>> > > >
>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
>>> > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.restoreApplicationBufferForWrite(NonBlockingConnectionPlainDelegate.java:93) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:60) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:506) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:285) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:124) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:504) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:337) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:87) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
>>> > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
>>> > > >
>>> > > > *Second exception*
>>> > > >
>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
>>> > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.<init>(NonBlockingConnectionPlainDelegate.java:45) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.setTransportEncryption(NonBlockingConnection.java:625) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.<init>(NonBlockingConnection.java:117) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.NonBlockingNetworkTransport.acceptSocketChannel(NonBlockingNetworkTransport.java:158) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask$1.run(SelectorThread.java:191) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
>>> > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
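
P.S. To make the theory above concrete without pointing at actual broker code (we have not confirmed the broker does anything like this), here is a tiny, self-contained illustration of the behaviour we suspect: a queue backed by a single direct buffer that is only ever appended to (and copied into a bigger direct buffer when full), so the memory occupied by already-consumed messages at the head is never reclaimed while the queue stays active. This is purely an illustration of the hypothesis, not Qpid code.

import java.nio.ByteBuffer;

// Hypothetical illustration only -- NOT the broker's implementation.
public class AppendOnlyDirectQueue
{
    private ByteBuffer buffer = ByteBuffer.allocateDirect(1024);
    private int head; // offset of the first byte that has not been consumed yet

    public void enqueue(byte[] message)
    {
        if (buffer.remaining() < message.length)
        {
            // Grow by allocating a larger direct buffer and copying everything across,
            // including the already-consumed region before 'head'.
            ByteBuffer bigger = ByteBuffer.allocateDirect(
                    Math.max(buffer.capacity() * 2, buffer.position() + message.length));
            buffer.flip();
            bigger.put(buffer);
            buffer = bigger;
        }
        buffer.put(message);
    }

    public byte[] dequeue(int length)
    {
        // Read from the head without freeing or compacting it: while the queue is
        // continuously active, the direct-memory footprint only ever grows, even
        // though everything before 'head' has already been processed.
        byte[] out = new byte[length];
        ByteBuffer view = buffer.duplicate();
        view.position(head);
        view.get(out, 0, length);
        head += length;
        return out;
    }
}

If the real buffer handling behaves anything like this for a queue that is never idle, it would explain direct memory that keeps climbing even while the live payload across all queues stays small.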
