On Wed, Oct 24, 2018 at 09:23, Emmanuel Touzery <emmanuel.touz...@lit-transit.com> wrote:
> Hello,
>
> thank you for the answer!
>
> In this case TOMEE and AMQ are in the same process, on the same
> machine, communicating through 127.0.0.1, so the network between AMQ and
> TOMEE shouldn't be an issue.

You are a client of yourself? I was more thinking about client <-> server:
typically if you have sender ------ some proxy ------- message driven bean,
then you can get these "fake states" on the network layer.

> In our case, writing to JMS keeps working, but consumers don't get
> notified. I'm not sure if there are two separate communication channels
> for that?

Normally it is the same, but depending on the queue state it can block.
Did you check your DLQ too?

> I'm not sure what you mean by backpressure, but we did disable flow
> control (which should only affect writes though, not notifying
> consumers) -- were you referring to something like that?

Yep.

> Also, I don't know about a disk issue -- the persistent queue keeps
> filling up on disk, and I see no exceptions in the logs.

Maybe next time take a quick peek at the disk state and whether the
partition is full; it can also be good to activate AMQ debug logs in such
cases (if you can).

> When you talk about batch size, do you mean the acknowledge
> optimization? ("ActiveMQ can acknowledge receipt of messages back to the
> broker in batches (to improve performance). The batch size is 65% of the
> prefetch limit for the Consumer"). This sounds like it could be
> related... If acknowledgement breaks down, AMQ would wait on consumers to
> complete, while the consumers did complete and are waiting for new
> messages. I already had the idea to check the JMX "InFlightMessages"
> info during such an incident to confirm whether AMQ thinks that the
> consumers are busy. But even if it turns out it does, that doesn't
> really help me short-term.

Yep, here you can test with a batch size of 1 (it will be "slow", but each
message is considered alone).
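To make that concrete, the minimal change is to drop the prefetch to 1 on the
connector you pasted, and optionally disable the optimized (batched)
acknowledge on the client URL. Untested, and double check the second option
name (I believe it is jms.optimizeAcknowledge on the ActiveMQConnectionFactory
URL):

    <transportConnector name="nio"
        uri="nio://0.0.0.0:61616?jms.prefetchPolicy.all=1"/>

    ServerUrl = tcp://127.0.0.1:61616?jms.prefetchPolicy.all=1&amp;jms.optimizeAcknowledge=false

(the &amp; because tomee.xml is XML). With a prefetch of 1 every message is
dispatched and acknowledged on its own, so if the hang is on the batched
acknowledge side it should disappear, at the cost of throughput.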
> In this case client=server (we get messages over HTTP, write them
> to a queue on the ActiveMQ which runs in the same process as TOMEE, and
> consume them from the same TOMEE instance), so the thread dump I did
> covers both client & server.

That answers the first question, so if you don't have any dynamic network
(like you can have with docker or so) it is likely not the issue.

> Another backend than kahadb could be interesting, but there's a lot
> of traffic and validation of configuration changes for the server is
> expensive. I'm not sure that's really workable, especially since the
> chances of this fixing the issue are not that high.

Well, first you need an environment reproducing the issue, then you can
iterate to identify it (in no particular order):

0. DLQ state (ensure it is monitored -- see the sketch below)
1. network (if relevant)
2. backend
3. (potentially) transport
4. AMQ version
...
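For 0. (and the "InFlightMessages" idea you had), a small JMX sketch you can
run against the broker. The object name pattern is the ActiveMQ 5.x one you
see in jconsole, but brokerName "localhost", the JMX URL and "myQueue" are
placeholders to adapt to your setup:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class QueueStats {
        public static void main(String[] args) throws Exception {
            // placeholder: reuse whatever JMX url your existing queue-size monitoring uses
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // ActiveMQ.DLQ is the default dead letter queue (its MBean only
                // appears once something has been dead-lettered); "myQueue" stands
                // for your application queue
                for (String queue : new String[] { "ActiveMQ.DLQ", "myQueue" }) {
                    ObjectName name = new ObjectName(
                            "org.apache.activemq:type=Broker,brokerName=localhost,"
                                    + "destinationType=Queue,destinationName=" + queue);
                    System.out.println(queue
                            + " size=" + mbs.getAttribute(name, "QueueSize")
                            + " inFlight=" + mbs.getAttribute(name, "InFlightCount")
                            + " consumers=" + mbs.getAttribute(name, "ConsumerCount"));
                }
            } finally {
                connector.close();
            }
        }
    }

If InFlightCount stays above 0 during an incident while the consumer threads
look idle in the jstack, the broker thinks the consumers are still busy and
the acknowledge path is the suspect; if the DLQ grows, the messages are not
going where you think.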
> Regards,
>
> Emmanuel
>
> On 24/10/2018 08:39, Romain Manni-Bucau wrote:
> > Hello Emmanuel
> >
> > It can be a lot of things like a network breakdown behind a proxy (so AMQ
> > does not see it in some cases and a restart recreates the connection),
> > some backpressure (exponential), some disk issue etc...
> >
> > It can be interesting to check your config for healthchecks, batch sizes,
> > and dump the threads in the server and client when hanging. Also testing
> > with another backend than kahadb can be interesting depending on your
> > workload.
> >
> > On Wed, Oct 24, 2018 at 07:59, Emmanuel Touzery <
> > emmanuel.touz...@lit-transit.com> wrote:
> >
> >> Hello,
> >>
> >> no one has any suggestions?
> >>
> >> Regards,
> >>
> >> emmanuel
> >>
> >> On 22/10/2018 16:04, Emmanuel Touzery wrote:
> >>> Hello,
> >>>
> >>> we have a TomEE+ 7.0.3 installation with ActiveMQ, using kahadb as
> >>> a persistent message storage. We have an activemq.xml; we plugged it in
> >>> through:
> >>>
> >>> BrokerXmlConfig = xbean:file:/opt/app/tomee/conf/activemq.xml
> >>>
> >>> in the tomee.xml. The ActiveMQ broker runs within TOMEE:
> >>>
> >>> ServerUrl = tcp://127.0.0.1:61616
> >>>
> >>> We have a prefetch of 2000:
> >>>
> >>> <transportConnector name="nio"
> >>> uri="nio://0.0.0.0:61616?jms.prefetchPolicy.all=2000"/>
> >>>
> >>> We use mKahaDB. We disabled flow control.
> >>>
> >>> So that everything would work, we had to add a couple of JARs to
> >>> the TOMEE lib folder:
> >>>
> >>> activemq-spring-5.14.3.jar
> >>> spring-beans-3.2.9.RELEASE.jar
> >>> spring-context-3.2.9.RELEASE.jar
> >>> spring-core-3.2.9.RELEASE.jar
> >>> spring-expression-3.2.9.RELEASE.jar
> >>> spring-web-3.2.9.RELEASE.jar
> >>> xbean-spring-3.9.jar
> >>>
> >>> We are "reading" from JMS through message-driven beans,
> >>> implementing MessageListener and with @MessageDriven annotations.
> >>>
> >>> The application is pretty simple... Receive the data from
> >>> HTTP/JSON, and store it to SQL (through hibernate).
> >>>
> >>> Everything works fine as long as the traffic is normal. However,
> >>> when there is a surge of incoming traffic, sometimes the JMS consumers
> >>> stop getting called, and the queue only grows. The issue does not get
> >>> fixed until TOMEE is restarted. And then we've seen the issue
> >>> re-appear maybe 40 minutes later. After a while, the server
> >>> clears the queue and everything is fine again.
> >>>
> >>> We took a jstack thread dump of the application when it's in that
> >>> "hung" state:
> >>> https://www.dropbox.com/s/p8wy7uz6inzsmlj/jstack.txt?dl=0
> >>>
> >>> What's interesting is that writes fall quite fast, and in steps:
> >>> in general not all at once, but not slowly either:
> >>> https://www.dropbox.com/s/nhm5s2zc7r9mk9z/graph_writes.png?dl=0
> >>>
> >>> After a restart things are fine again immediately.
> >>>
> >>> We're not sure what the cause is. From what we can tell from the
> >>> thread dump, the consumers are idle; they just don't get notified that
> >>> work is available. The server is certainly aware there are items in
> >>> the queue: we monitor the queue through JMX and the queue size keeps
> >>> growing during these episodes. We don't see anything out of the
> >>> ordinary in the logs. We looked at the thread IDs of the consumers just
> >>> before the issue; it doesn't look like the consumers get deadlocked
> >>> one after the other, for instance. It seems like a bunch of them are
> >>> still called in the last minute before the drop-off. Also, during a
> >>> blackout the JDBC pool usage is at 0 according to our JMX monitoring,
> >>> so it doesn't seem to be about a deadlocked JDBC connection.
> >>>
> >>> We did notice the following ActiveMQ warnings in the log file, but
> >>> the timestamps don't match any particular events and, from what we
> >>> found out, they don't seem to be particularly worrying or likely to be
> >>> related to the issue:
> >>>
> >>> WARNING [ActiveMQ Journal Checkpoint Worker]
> >>> org.apache.activemq.store.kahadb.MessageDatabase.getNextLocationForAckForward
> >>> Failed to load next journal location: null
> >>>
> >>> WARNING [ActiveMQ NIO Worker 6]
> >>> org.apache.activemq.broker.TransportConnection.serviceTransportException
> >>> Transport Connection to: tcp://127.0.0.1:37024 failed:
> >>> java.io.EOFException
> >>>
> >>> Do you have any suggestions for trying to fix this issue (which we
> >>> sadly can't reproduce at will... and it only happens pretty rarely)?
> >>> Should we rather ask on the activemq mailing list?
> >>>
> >>> Regards,
> >>>
> >>> emmanuel
> >>>
> >>
> >