Le mer. 24 oct. 2018 à 09:23, Emmanuel Touzery <
emmanuel.touz...@lit-transit.com> a écrit :

> Hello,
>
>      thank you for the answer!
>
>      In this case TOMEE and AMQ are in the same process, on the same
> machine, communicating through 127.0.0.1, so the network between AMQ and
> TOMEE shouldn't be an issue.
>

You are a client of yourself? I was thinking more about client <-> server.

Typically, if you have

sender ------ some proxy ------- message driven bean

then you can end up with this kind of "fake state" at the network layer
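
(for reference, if a proxy or dynamic network ever ends up in between, the
usual suspect is a half-open connection; the OpenWire inactivity monitor can
help surface that. A sketch, assuming the option is set on the client broker
URL -- "broker-host" is a placeholder and 30s is only an example value:

ServerUrl = tcp://broker-host:61616?wireFormat.maxInactivityDuration=30000

not needed for your 127.0.0.1 setup, only for the proxied case)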


>      In our case, writing to JMS keeps working, but consumers don't get
> notified. I'm not sure if there are two separate communication channels
> for that?
>

Normally it is the same, but depending on the queue state it can block.

Did you check your DLQ too?
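
If you want to peek at it without the console, a minimal sketch with plain JMS
(it assumes the default ActiveMQ.DLQ name and your local broker URL):

import java.util.Enumeration;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.QueueBrowser;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class DlqPeek {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://127.0.0.1:61616");
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            // ActiveMQ.DLQ is the default shared dead letter queue name;
            // adjust it if you configured an individualDeadLetterStrategy
            QueueBrowser browser = session.createBrowser(session.createQueue("ActiveMQ.DLQ"));
            int count = 0;
            for (Enumeration<?> e = browser.getEnumeration(); e.hasMoreElements(); e.nextElement()) {
                count++;
            }
            System.out.println("Messages sitting in ActiveMQ.DLQ: " + count);
        } finally {
            connection.close();
        }
    }
}

If that count keeps growing during an incident, the consumers are failing
messages rather than simply not being notified.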


>
>      I'm not sure what you mean by backpressure, but we did disable flow
> control (which should only affect writes though, not notifying
> consumers) -- were you referring to something like that?
>

Yep
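
(for reference, the kind of policy I had in mind -- a sketch of what I assume
you already have in activemq.xml, the memoryLimit value being only an example:

<destinationPolicy>
  <policyMap>
    <policyEntries>
      <!-- ">" matches all queues; producerFlowControl="false" means producers
           are never throttled, so the disk store absorbs the surge instead -->
      <policyEntry queue=">" producerFlowControl="false" memoryLimit="64mb"/>
    </policyEntries>
  </policyMap>
</destinationPolicy>

as you said, that only affects the producer side; it does not change how
messages get dispatched to consumers)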


>      Also don't know about a disk issue -- the persistent queue keeps
> filling up on disk, and I see no exceptions in the logs.
>

Maybe next time take a quick peek at the disk state and check whether the
partition is full; it can also be good to activate AMQ debug logs in such
cases (if you can).
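
With the broker embedded in TomEE the ActiveMQ logs usually end up in JUL, so
a sketch, assuming the default conf/logging.properties and the bundled
slf4j-to-JUL binding:

# conf/logging.properties -- raise ActiveMQ verbosity
org.apache.activemq.level = FINE
# the handlers must also let FINE through, e.g.:
java.util.logging.ConsoleHandler.level = FINE

(revert it afterwards, debug level can be quite verbose)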


>
>      When you talk about batch size, do you mean acknowledge
> optimization? ("ActiveMQ can acknowledge receipt of messages back to the
> broker in batches (to improve performance). The batch size is 65% of the
> prefetch limit for the Consumer"). This sounds like it could be
> related... If acknowledge breaks down, AMQ would wait on consumers to
> complete, while the consumers did complete and are waiting for new
> messages. I already had the idea to check on the JMX "InFlightMessages"
> info during such an incident to confirm whether AMQ thinks that
> consumers are busy. But even if it turns out it does, that doesn't
> really help me, short-term.
>

Yep, here you can test with a batch size of 1 (it will be "slow", but each
message is considered alone).
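
Concretely, since the ServerUrl in tomee.xml is the client side of your
embedded broker, the test could look like this (a sketch; jms.* options are
standard ActiveMQ connection URI options, values only for the experiment):

ServerUrl = tcp://127.0.0.1:61616?jms.prefetchPolicy.all=1&jms.optimizeAcknowledge=false

Prefetch 1 removes the batching effect entirely, and optimizeAcknowledge=false
(it is off by default, but it does not hurt to be explicit for the test) keeps
acks per message, so if the hang disappears you know where to look.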


>
>      In this case client=server (we get messages over HTTP, write them
> to a queue on the ActiveMQ broker running in the same process as TOMEE, and
> consume them from the same TOMEE instance) so the thread dump I did
> covers both client & server.
>

That answers the first question, so if you don't have any dynamic networking
(as can happen with Docker or similar) it is likely not the issue.


>
>      A backend other than kahadb could be interesting, but there's a lot
> of traffic and validation of configuration changes for the server is
> expensive. I'm not sure that's really workable, especially since the
> chances of this fixing the issue are not that high.
>

Well, first you need an environment reproducing that issue, then you can
iterate to identify it:

(no particular order)

0. DLQ state (ensure it is monitored; see the JMX sketch below)
1. network (if relevant)
2. backend
3. (potentially) transport
4. AMQ version

...
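
For the DLQ and "InFlightMessages" checks you mentioned, a small JMX sketch --
the JMX URL, broker name ("localhost") and queue name ("MY.QUEUE") are
placeholders for whatever your setup exposes:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class QueueCheck {
    public static void main(String[] args) throws Exception {
        // assumes remote JMX is enabled on the TomEE JVM on port 1099; adjust to your setup
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:1099/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection server = connector.getMBeanServerConnection();
            ObjectName queue = new ObjectName(
                    "org.apache.activemq:type=Broker,brokerName=localhost,"
                    + "destinationType=Queue,destinationName=MY.QUEUE");
            // QueueSize should grow during an incident; if InFlightCount stays high
            // while your consumers are idle, the broker thinks they are still busy
            System.out.println("QueueSize     = " + server.getAttribute(queue, "QueueSize"));
            System.out.println("InFlightCount = " + server.getAttribute(queue, "InFlightCount"));
            System.out.println("DequeueCount  = " + server.getAttribute(queue, "DequeueCount"));
        } finally {
            connector.close();
        }
    }
}

The same attributes exist for ActiveMQ.DLQ, so the same snippet can cover
point 0.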


>
>      Regards,
>
> Emmanuel
>
> On 24/10/2018 08:39, Romain Manni-Bucau wrote:
> > Hello Emmanuel
> >
> > It can be a lot of things like a network breakdown behind a proxy (so AMQ
> > does not see it in some cases and a restart recreates the connection),
> > some backpressure (exponential), some disk issue, etc.
> >
> > It can be interesting to check your config for healthchecks, batch sizes,
> > and dump the threads in the server and client when hanging. Also testing
> > with a backend other than kahadb can be interesting depending on your
> > workload.
> >
> > Le mer. 24 oct. 2018 07:59, Emmanuel Touzery <
> > emmanuel.touz...@lit-transit.com> a écrit :
> >
> >> Hello,
> >>
> >>       no one has any suggestion?
> >>
> >>       Regards,
> >>
> >> emmanuel
> >>
> >> On 22/10/2018 16:04, Emmanuel Touzery wrote:
> >>> Hello,
> >>>
> >>> we have a tomee+ 7.0.3 installation with activemq, using kahadb as
> >>> a persistent message store. We have an activemq.xml, which we plugged
> >>> in through:
> >>>
> >>> BrokerXmlConfig = xbean:file:/opt/app/tomee/conf/activemq.xml
> >>>
> >>>      in the tomee.xml. The activemq broker runs within TOMEE:
> >>>
> >>> ServerUrl       =  tcp://127.0.0.1:61616
> >>>
> >>>      We have a prefetch of 2000:
> >>>
> >>> <transportConnector name="nio"
> >>> uri="nio://0.0.0.0:61616?jms.prefetchPolicy.all=2000"/>
> >>>
> >>>      We use mKaha. We disabled flow control.
> >>>
> >>>      So that everything would work, we had to add a couple of JARs in
> >>> the TOMEE lib folder:
> >>>
> >>> activemq-spring-5.14.3.jar
> >>> spring-beans-3.2.9.RELEASE.jar
> >>> spring-context-3.2.9.RELEASE.jar
> >>> spring-core-3.2.9.RELEASE.jar
> >>> spring-expression-3.2.9.RELEASE.jar
> >>> spring-web-3.2.9.RELEASE.jar
> >>> xbean-spring-3.9.jar
> >>>
> >>>      We are "reading" from JMS through message-driven beans,
> >>> implementing MessageListener and with @MessageDriven annotations.
> >>>
> >>>      The application is pretty simple... Receive the data from
> >>> HTTP/JSON, and store it to SQL (through hibernate).
> >>>
> >>>      Everything works fine as long as the traffic is normal. However
> >>> when there is a surge of incoming traffic, sometimes the JMS consumers
> >>> stop getting called, and the queue only grows. The issue does not get
> >>> fixed until TOMEE is restarted. And then we've seen the issue
> >>> re-appear maybe 40 minutes later. After a while, the server
> >>> clears the queue and everything is fine again.
> >>>
> >>>      We took a jstack thread dump of the application when it's in that
> >>> "hung" state:
> >>> https://www.dropbox.com/s/p8wy7uz6inzsmlj/jstack.txt?dl=0
> >>>
> >>>      What's interesting is that writes fall quite fast, and in steps:
> >>> in general not all at once, but not slowly either:
> >>> https://www.dropbox.com/s/nhm5s2zc7r9mk9z/graph_writes.png?dl=0
> >>>
> >>>      After a restart things are fine again immediately.
> >>>
> >>>      We're not sure what the cause is. From what we can tell from the
> >>> thread dump, the consumers are idle; they just don't get notified that
> >>> work is available. The server is certainly aware there are items in
> >>> the queue, we monitor the queue through JMX and the queue size keeps
> >>> growing during these episodes. We don't see anything out of the
> >>> ordinary in the logs. We looked at thread IDs for consumers just
> >>> before the issue; it doesn't look like the consumers get deadlocked
> >>> one after the other, for instance. It seems like a bunch of them are
> >>> still called in the last minute before the dropoff. Also,
> >>> during a blackout the JDBC pool usage is at 0 according to our JMX
> >>> monitoring, so it doesn't seem to be about a deadlocked JDBC
> connection.
> >>>
> >>>      We did notice the following activemq warnings in the log file, but
> >>> the timestamps don't match with any particular events and from what we
> >>> found out, they don't seem to be particularly worrying or likely to be
> >>> related to the issue:
> >>>
> >>> WARNING [ActiveMQ Journal Checkpoint Worker]
> >>> org.apache.activemq.store.kahadb.MessageDatabase.getNextLocationForAckForward
> >>> Failed to load next journal location: null
> >>>
> >>> WARNING [ActiveMQ NIO Worker 6]
> >>> org.apache.activemq.broker.TransportConnection.serviceTransportException
> >>> Transport Connection to: tcp://127.0.0.1:37024 failed:
> >>> java.io.EOFException
> >>>
> >>>      Do you have any suggestions to try to fix this issue (which we
> >>> sadly can't reproduce at will... and it only happens pretty rarely)?
> >>> Should we rather ask on the activemq mailing list?
> >>>
> >>>      Regards,
> >>>
> >>> emmanuel
> >>>
> >>>
> >>
>
>
