Hi all,
We are running Apache ActiveMQ Artemis 2.39.0 on OpenJDK 17 in a Kubernetes 
environment and recently hit an OutOfMemoryError on our production broker. We 
suspect this may be a bug. We have been investigating the root cause and would 
appreciate the community's input on our findings and open questions.


Environment

  *   Artemis version: 2.39.0
  *   Java: OpenJDK 17 (G1 GC)
  *   JVM heap: -Xms4g -Xmx9g
  *   global-max-size: 800M
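
For completeness, the paging-related parts of the broker.xml reduce to roughly 
this sketch (the 800M value is from the list above; the catch-all PAGE policy is 
the Artemis default, which we assume is in effect):

```xml
<core xmlns="urn:activemq:core">
   <!-- Once the estimated memory of all messages exceeds this,
        addresses start paging to disk (value from our environment). -->
   <global-max-size>800M</global-max-size>
   <address-settings>
      <address-setting match="#">
         <!-- Assumed default policy: page to disk rather than
              blocking producers or dropping messages. -->
         <address-full-policy>PAGE</address-full-policy>
      </address-setting>
   </address-settings>
</core>
```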


Situation
Our setup uses a software product that embeds an Artemis broker in each 
endpoint. The broker handles message routing between a Business Application 
(BA), the endpoint itself, and a central broker.
A feature called "AMQP Send Handler" writes a SendEvent into the queue 
`ecp.endpoint.send.event` for every message the endpoint sends. This handler 
was enabled, but no consumer was ever connected to this queue.
Over approximately 1.5 years of continuous operation, this queue accumulated 
22,240,016 messages with 0 consumers and 0 acknowledgements.


The OOME
The JVM heap showed a sawtooth pattern consistently reaching ~95%, with the GC 
managing to recover each time. Eventually a single spike pushed usage to ~99.9% 
and triggered the OutOfMemoryError.

Heap analysis
We ran `jcmd 1 GC.class_histogram` on the production broker and found the 
following top heap consumers:
  Class                      Instances     Bytes
  -----------------------    ----------    -----------------------
  PageTransactionInfoImpl    22,132,340    1,062,352,320 (~1 GB)
  ConcurrentHashMap$Node     22,160,468    709,134,976   (~676 MB)
  JournalRecord              22,256,890    534,165,360   (~509 MB)
  Long                       22,132,609    531,182,616   (~506 MB)
The instance counts correlate almost exactly with the 22M stuck messages in 
`ecp.endpoint.send.event`. These four object types alone consumed approximately 
2.8 GB of heap.
All other objects (AMQPStandardMessage, MessageReferenceImpl, etc.) had normal 
counts (~130K instances), consistent with the actively processed queues.
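
In case anyone wants to redo the arithmetic: the ~2.8 GB figure can be 
recomputed from the saved histogram with a one-liner like the following (the 
file name and the class filter are our own choices, nothing Artemis-specific):

```shell
# Histogram saved with: jcmd <pid> GC.class_histogram > histogram.txt
# Columns in that output: rank, #instances, #bytes, class name.
# Sum the bytes of the four suspect classes; report decimal GB.
awk '/PageTransactionInfoImpl|JournalRecord|ConcurrentHashMap[$]Node|java\.lang\.Long$/ { total += $3 }
     END { printf "%.2f GB\n", total / 1e9 }' histogram.txt
```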


Resolution
We purged the 22M messages from the queue using `removeMessages` with a low 
flushLimit. The heap usage dropped significantly after the purge. We also 
disabled the Send Handler to prevent re-accumulation.

Reproduction attempt (ACCE environment)
We attempted to reproduce this on a test broker with the same Artemis version 
and identical broker.xml configuration, but with -Xmx1g. We sent >10M messages 
to the same queue (no consumer). However, the heap histogram showed a very 
different picture:
  Class                      PROD (22M msgs)    ACCE (10M+ msgs)
  -----------------------    ---------------    ----------------
  PageTransactionInfoImpl    22,132,340         188,264
  JournalRecord              22,256,890         315,352
  MessageReferenceImpl       124,518            127,035
Despite having millions of paged messages, the ACCE broker held only ~188K 
PageTransactionInfoImpl objects on the heap (vs. 22M in PROD), and heap usage 
stayed stable at around 50%.

Questions for the community

1. Can someone confirm that Artemis keeps a PageTransactionInfoImpl and 
JournalRecord in heap for each paged message as long as the message is not 
consumed/acknowledged? Is this by design?
2. Why is there such a large discrepancy between PROD and ACCE? Both have the 
same broker configuration, both had millions of paged messages with 0 
consumers. Our hypothesis is that the long-running production environment (1.5 
years, continuous message flow across other queues) leads to journal 
fragmentation/accumulation that prevents journal compaction from cleaning up 
the PageTransactionInfoImpl records, whereas in the short-lived test scenario 
the compaction process works efficiently. Is this plausible?
3. Shouldn't the paging mechanism prevent exactly this scenario, i.e. the heap 
filling up because of a large message backlog?
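
For context on question 2, these are the compaction knobs we are now looking at 
(a sketch with the default values, as we understand the documentation; not our 
verified production config):

```xml
<core xmlns="urn:activemq:core">
   <!-- Compaction is only considered once at least this many
        journal files exist... -->
   <journal-compact-min-files>10</journal-compact-min-files>
   <!-- ...and, as we read the docs, only runs when the share of
        live data in the journal drops below this percentage. -->
   <journal-compact-percentage>30</journal-compact-percentage>
</core>
```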
Thanks in advance for any insights.

Best regards,

Frederik Fournes
Associmates GmbH
Poppelsdorfer Allee 106
53115 Bonn
Germany
+49 228 3040 6375

Managing Directors: Tobias Berger, Alexander Fournes
Registered Offices: Bonn, Germany - Registration number: HRB 25008
