Artemis cluster - Messages stuck in Delivering state

AntonR Fri, 17 Apr 2020 03:34:19 -0700

Hi,

I have an issue with the Artemis broker that I am having trouble solving
and have not been able to reproduce outside of my testing environment.


The setup is the following: 3 Artemis brokers running on separate servers,
clustered in an active-active fashion with static connectors.

The clients are running JBoss 6 with the ActiveMQ 5 RA.

Messages are processed in XA transactions with MDBs.

All clients (16 of them, multiple queues each, no topics) use one separate
RA per broker (both old and new RA versions tested). Each RA uses the
failover protocol with priorityBackup=true and randomize=false, connects to
server 1, 2 or 3 as its primary, and is set to fail over to the next broker
in line if its broker becomes unavailable. This is done in order to achieve
both load balancing and redundancy.
The environment is set up like this because it used to run with ActiveMQ 5
brokers as well, and this made sense at the time.
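To make the connection layout concrete, here is a rough sketch of how the three failover URIs could be built, one per RA. The host names and port are placeholders (not the real ones), and the exact option set on the real RAs may differ; the only options taken from the setup above are priorityBackup=true and randomize=false. With those two options the first URI in the list is treated as the preferred broker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FailoverUris {
    // Hypothetical broker addresses; substitute the real hosts and ports.
    static final List<String> BROKERS = Arrays.asList(
            "tcp://broker1:61616", "tcp://broker2:61616", "tcp://broker3:61616");

    // Builds the failover URI for the RA whose primary broker sits at primaryIndex.
    // The list is rotated so the primary comes first and the "next in line" follows.
    static String failoverUri(List<String> brokers, int primaryIndex) {
        List<String> ordered = new ArrayList<>();
        for (int i = 0; i < brokers.size(); i++) {
            ordered.add(brokers.get((primaryIndex + i) % brokers.size()));
        }
        return "failover:(" + String.join(",", ordered)
                + ")?priorityBackup=true&randomize=false";
    }

    public static void main(String[] args) {
        // Print one URI per RA, each with a different primary broker.
        for (int i = 0; i < BROKERS.size(); i++) {
            System.out.println("RA " + (i + 1) + ": " + failoverUri(BROKERS, i));
        }
    }
}
```

Each client then holds three such connections, so under normal operation its consumers are spread over all three brokers.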

The problem I am seeing with the Artemis brokers is that after a
failover-failback scenario, i.e. when a broker goes down and later comes back
up, messages get stuck in the "Delivering" state and the only way to get
them to roll back is to restart the broker. After a restart though, this
problem persists, so the clients will "prefetch" up to their limit again and
then stop.

There is no timeout happening, messages stay like this forever and the only
solution to this state is to either restart the clients or stop all Artemis
brokers, start an ActiveMQ 5 broker for ~10 seconds and then start the
Artemis brokers again. This happens on all broker restarts, but not to all
clients at once, so I would guess this is some sort of a timing issue.

I have tried changing every config setting I can think of without any
effect and have yet to be able to reproduce this issue outside of this
(legacy) test environment. I run Artemis in several other environments with
newer clients (most of which also use ActiveMQ 5 client libraries, but
without JBoss, MDBs and XA) and have zero issues.

Some things I have noticed but have yet to piece together:

The connectionID for the consumer that holds the messages in "Delivering"
does not exist. In Hawtio I can trace the messages to a consumer, and that
consumer has a corresponding session, but the session does not have an
associated connection. (There is a connectionID reported, but if I click on
it or search for it, it does not exist.)

The DeliveringCount goes to 1000 messages for each consumer, which is the
OpenWire default for prefetched messages, but most clients use
prefetchPolicy.all=100, which is otherwise respected.
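For anyone who wants to watch the same counter outside Hawtio, below is a minimal sketch that reads DeliveringCount over JMX. It assumes remote JMX is enabled on the broker and that the queue MBean follows the object-name layout documented for Artemis 2.x, with an anycast queue named the same as its address; the JMX URL, broker name and queue name are all supplied by the caller and are not taken from my environment.

```java
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DeliveringCountProbe {
    // Assumed Artemis 2.x object-name layout for a queue control MBean.
    static ObjectName queueName(String broker, String address, String queue) {
        try {
            return new ObjectName("org.apache.activemq.artemis:broker=" + ObjectName.quote(broker)
                    + ",component=addresses,address=" + ObjectName.quote(address)
                    + ",subcomponent=queues,routing-type=\"anycast\",queue=" + ObjectName.quote(queue));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: DeliveringCountProbe <jmx-service-url> <broker-name> <queue>");
            return;
        }
        // args[0] is a standard JMX service URL for the broker's JMX connector.
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(args[0]))) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Assumes address and queue share the same name.
            Object count = mbs.getAttribute(queueName(args[1], args[2], args[2]), "DeliveringCount");
            System.out.println(args[2] + " DeliveringCount=" + count);
        }
    }
}
```

Polling this before and after a broker restart should show whether the count climbs back to 1000 per consumer rather than the configured 100.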

Artemis reports "Error during buffering operation", see attached file 
artemis_stacktrace.txt
<http://activemq.2283324.n4.nabble.com/file/t378961/artemis_stacktrace.txt>  

A thread dump on the clients shows that basically all JMS-related threads
are stuck at the same place; see attached file client_threads.txt
<http://activemq.2283324.n4.nabble.com/file/t378961/client_threads.txt>  

Br,
Anton



