Hello,

    thank you for the answer!

    In this case TOMEE and AMQ are in the same process, on the same machine, communicating through 127.0.0.1, so the network between AMQ and TOMEE shouldn't be an issue.

    In our case, writing to JMS keeps working, but consumers don't get notified. I'm not sure whether there are two separate communication channels for that?

    I'm not sure what you mean by backpressure, but we did disable flow control (which should only affect writes though, not notifying consumers) -- were you referring to something like that?

    I also don't think it's a disk issue -- the persistent queue keeps filling up on disk, and I see no exceptions in the logs.
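
    (For reference, what we disabled is producer flow control, through a
destination policy in activemq.xml along these lines -- the catch-all
queue pattern is illustrative, not necessarily our exact entry:)

<destinationPolicy>
    <policyMap>
        <policyEntries>
            <!-- don't throttle producers when destination memory fills up -->
            <policyEntry queue=">" producerFlowControl="false"/>
        </policyEntries>
    </policyMap>
</destinationPolicy>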

    When you talk about batch size, do you mean the acknowledge optimization? ("ActiveMQ can acknowledge receipt of messages back to the broker in batches (to improve performance). The batch size is 65% of the prefetch limit for the Consumer"). This sounds like it could be related... If that acknowledgement mechanism breaks down, AMQ would wait for the consumers to complete, while the consumers have in fact completed and are waiting for new messages. I already had the idea of checking the JMX "InFlightMessages" info during such an incident, to confirm whether AMQ thinks the consumers are busy. But even if it turns out it does, that doesn't really help me short-term.
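
    A minimal sketch of that check over JMX, assuming a standard RMI
connector on port 1099 and placeholder broker/queue names (on the 5.x
MBean layout the attribute is called InFlightCount):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class InFlightCheck {
    public static void main(String[] args) throws Exception {
        // placeholder JMX port, broker name and queue name -- adjust to the real setup
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:1099/jmxrmi");
        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmx.getMBeanServerConnection();
            ObjectName queue = new ObjectName(
                    "org.apache.activemq:type=Broker,brokerName=localhost,"
                    + "destinationType=Queue,destinationName=MY.QUEUE");
            // InFlightCount = messages dispatched to consumers but not yet
            // acknowledged; if it stays pinned near consumers * prefetch while
            // the consumers are idle, the broker still believes they are busy
            System.out.println("InFlightCount = " + mbs.getAttribute(queue, "InFlightCount"));
            System.out.println("QueueSize     = " + mbs.getAttribute(queue, "QueueSize"));
        }
    }
}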

    In this case client=server (we get messages over HTTP, write them to a queue on the ActiveMQ broker which runs in the same process as TOMEE, and consume them from the same TOMEE instance), so the thread dump I took covers both client & server.

    A backend other than KahaDB could be interesting, but there's a lot of traffic, and validating configuration changes for the server is expensive. I'm not sure that's really workable, especially since the chances of this fixing the issue are not that high.

    Regards,

Emmanuel

On 24/10/2018 08:39, Romain Manni-Bucau wrote:
Hello Emmanuel

It can be a lot of things like a network breakdown behind a proxy (so AMQ
does not see it in some cases and a restart recreates the connection), some
backpressure (exponential), some disk issue etc...

It can be interesting to check your config for healthchecks, batch sizes,
and dump the threads in the server and client when hanging. Also testing
with a backend other than KahaDB can be interesting, depending on your
workload.

Le mer. 24 oct. 2018 07:59, Emmanuel Touzery <
emmanuel.touz...@lit-transit.com> a écrit :

Hello,

      no one has any suggestion?

      Regards,

emmanuel

On 22/10/2018 16:04, Emmanuel Touzery wrote:
Hello,

     we have a TomEE+ 7.0.3 installation with ActiveMQ, using KahaDB as
persistent message storage. We have an activemq.xml, which we plugged
in via:

BrokerXmlConfig = xbean:file:/opt/app/tomee/conf/activemq.xml

     in tomee.xml. The ActiveMQ broker runs within TOMEE:

ServerUrl       =  tcp://127.0.0.1:61616
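
(Together these live in the resource adapter declaration in tomee.xml,
roughly as below -- the resource id is a placeholder, not our real one:)

<Resource id="MyJmsResourceAdapter" type="ActiveMQResourceAdapter">
    BrokerXmlConfig = xbean:file:/opt/app/tomee/conf/activemq.xml
    ServerUrl       = tcp://127.0.0.1:61616
</Resource>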

     We have a prefetch of 2000:

<transportConnector name="nio"
uri="nio://0.0.0.0:61616?jms.prefetchPolicy.all=2000"/>

     We use mKahaDB. We disabled flow control.
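
(For context, the mKahaDB persistence adapter is declared in activemq.xml
roughly as below; the directory and the per-destination split here are
illustrative, not necessarily our exact settings:)

<persistenceAdapter>
    <mKahaDB directory="${activemq.data}/kahadb">
        <filteredPersistenceAdapters>
            <!-- one KahaDB journal per destination -->
            <filteredKahaDB perDestination="true">
                <persistenceAdapter>
                    <kahaDB/>
                </persistenceAdapter>
            </filteredKahaDB>
        </filteredPersistenceAdapters>
    </mKahaDB>
</persistenceAdapter>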

     So that everything would work, we had to add a couple of JARs in
the TOMEE lib folder:

activemq-spring-5.14.3.jar
spring-beans-3.2.9.RELEASE.jar
spring-context-3.2.9.RELEASE.jar
spring-core-3.2.9.RELEASE.jar
spring-expression-3.2.9.RELEASE.jar
spring-web-3.2.9.RELEASE.jar
xbean-spring-3.9.jar

     We are "reading" from JMS through message-driven beans,
implementing MessageListener and with @MessageDriven annotations.
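
Roughly like this (bean and queue names are placeholders, not our real ones):

import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;

@MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationType",
                                  propertyValue = "javax.jms.Queue"),
        @ActivationConfigProperty(propertyName = "destination",
                                  propertyValue = "INCOMING.DATA")
})
public class IncomingDataBean implements MessageListener {
    @Override
    public void onMessage(Message message) {
        // parse the JSON payload and store it to SQL through Hibernate
    }
}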

     The application is pretty simple: receive data over HTTP/JSON and
store it in SQL (through Hibernate).

     Everything works fine as long as the traffic is normal. However,
when there is a surge of incoming traffic, the JMS consumers sometimes
stop getting called, and the queue only grows. The issue is not fixed
until TOMEE is restarted, and we've then seen it reappear maybe 40
minutes later. After a while, the server clears the queue and
everything is fine again.

     We took a jstack thread dump of the application when it's in that
"hung" state:
https://www.dropbox.com/s/p8wy7uz6inzsmlj/jstack.txt?dl=0

     What's interesting is that writes fall quite fast, and in steps --
in general not all at once, but not slowly either:
https://www.dropbox.com/s/nhm5s2zc7r9mk9z/graph_writes.png?dl=0

     After a restart things are fine again immediately.

     We're not sure what the cause is. From what we can tell from the
thread dump, the consumers are idle; they just don't get notified that
work is available. The server is certainly aware there are items in
the queue: we monitor it through JMX and the queue size keeps growing
during these episodes. We don't see anything out of the ordinary in
the logs. We looked at the thread IDs of the consumers just before the
issue; it doesn't look like the consumers hit some deadlock one after
the other, for instance -- a bunch of them were still being called in
the last minute before the dropoff. Also, during a blackout the JDBC
pool usage is at 0 according to our JMX monitoring, so it doesn't seem
to be a deadlocked JDBC connection.

     We did notice the following ActiveMQ warnings in the log file, but
the timestamps don't match any particular events, and from what we
found out they don't seem particularly worrying or likely to be
related to the issue:

WARNING [ActiveMQ Journal Checkpoint Worker]
org.apache.activemq.store.kahadb.MessageDatabase.getNextLocationForAckForward
Failed to load next journal location: null

WARNING [ActiveMQ NIO Worker 6]
org.apache.activemq.broker.TransportConnection.serviceTransportException
Transport Connection to: tcp://127.0.0.1:37024 failed: java.io.EOFException

     Do you have any suggestions for fixing this issue (which we sadly
can't reproduce at will, and which only happens pretty rarely)? Or
should we rather ask on the ActiveMQ mailing list?

     Regards,

emmanuel



