I had an incident last night. I can share the actual log files with anyone
interested in seeing the full picture. Below is a summary. I also have OS
resource tracking data.
I have a pair of Artemis servers using a shared store -- an NFS-mounted
folder. My use case is enterprise integration: I use Artemis to pass
messages between interfaces. For example, the website publishes an order to
an address that has several queues, and every interface that needs that
message subscribes to its own queue.
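For context, the client-side pattern is roughly this (a minimal JMS 2.0
sketch; the broker hosts, the "orders" address, and the "billing"
subscription name are placeholders, not my real names):

import javax.jms.ConnectionFactory;
import javax.jms.JMSConsumer;
import javax.jms.JMSContext;
import javax.jms.Topic;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class OrderPubSubSketch {
    public static void main(String[] args) {
        // HA connection URL listing both shared-store servers (placeholder hosts).
        ConnectionFactory cf = new ActiveMQConnectionFactory(
                "(tcp://primary:61616,tcp://secondary:61616)?ha=true&reconnectAttempts=30");
        try (JMSContext ctx = cf.createContext()) {
            Topic orders = ctx.createTopic("orders"); // the address
            // Each interface holds its own shared durable subscription,
            // which Artemis backs with its own queue under the address.
            JMSConsumer billing = ctx.createSharedDurableConsumer(orders, "billing");
            ctx.createProducer().send(orders, "order-123");
            System.out.println(billing.receiveBody(String.class, 5_000));
        }
    }
}

Each shared durable subscription shows up in Artemis as its own queue under
the address, which is how every interface gets its own copy of the message.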
It looks to me like an issue with the page file. Does "page file" in this
case mean virtual memory, or does it refer to the paging folder inside the
data folder?
This morning I could not bring up the console, but I was able to send and
receive messages using JMS. I tried restarting the primary -- which did not
help the console, and after that I could no longer get a JMS connection at
all.
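(The JMS check was nothing fancy -- a send and receive along these lines,
with the broker URL and queue name as placeholders:)

import javax.jms.ConnectionFactory;
import javax.jms.JMSContext;
import javax.jms.Queue;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class BrokerPing {
    public static void main(String[] args) {
        // Placeholder broker URL and probe queue name.
        ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://primary:61616");
        try (JMSContext ctx = cf.createContext()) {
            Queue probe = ctx.createQueue("healthcheck");
            ctx.createProducer().send(probe, "ping");
            // Wait up to five seconds for the message to come back.
            String reply = ctx.createConsumer(probe).receiveBody(String.class, 5_000);
            System.out.println("ping".equals(reply) ? "broker OK" : "no reply within 5s");
        }
    }
}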
After shutting down the primary, the secondary did not take over. From the
logs it looked like it did (the objects were deployed), but the whole
Artemis tab disappeared from the console and I was not able to make a JMS
connection.
So I shut both servers down, waited a few minutes, and started the primary.
It came up fine. Then I started the secondary, waited a little while, and
shut down the primary, just as a test. This time the secondary took over
quickly. I restarted the primary and it happily took control back.
Any advice would be much appreciated.
Thanks!
Ben
*** Log Summary ***
At 2021-01-19 14:58:51 UTC this appears in the log:
ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with
id 33193877: java.lang.IllegalStateException: Cannot find tx with id
33193877
Along with a Java stack dump, directly followed by a few of these:
WARN [org.apache.activemq.artemis.core.server] AMQ222104: IO Error
completing the transaction, code = 6, message = Cannot find tx with id
33193877
At 2021-01-19 15:41:10 the configuration reloads and the objects all
redeploy. But then:
2021-01-19 15:52:51 ERROR
[org.apache.activemq.artemis.core.journal.impl.JournalImpl]
appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with
id 33318276: java.lang.IllegalStateException: Cannot find tx with id
33318276
Then several of these:
2021-01-19 15:52:51,666 WARN [org.apache.activemq.artemis.core.server]
AMQ222104: IO Error completing the transaction, code = 6, message = Cannot
find tx with id 33318276
And a few minutes later I see about 1 billion of these:
2021-01-19 16:01:21,164 WARN [org.apache.activemq.artemis.core.server]
AMQ222033: Page file 000000008.page had incomplete records at position
8,081,827 at record number 6,504
2021-01-19 16:01:21,165 WARN [org.apache.activemq.artemis.core.server]
AMQ222288: Page 8, message 6,504 could not be found on offset 8,081,827,
with starting message 6,504. This represents a logic error or inconsistency
on the data, and the system will try once again from the beggining of the
page file.
These finally end the following day, when several of these appear:
2021-01-20 03:10:22,764 WARN [org.apache.activemq.artemis.core.client]
AMQ212037: Connection failure to /13.68.117.113:42767 has been detected:
AMQ229014: Did not receive data from /{IP}:{port} within the 60,000ms
connection TTL. The connection will now be closed.
[code=CONNECTION_TIMEDOUT]
This repeats for several different IP addresses and ports. Then I see both
of these errors happening together, until I get a nice:
Java heap space: java.lang.OutOfMemoryError: Java heap space
Followed by a lot of interesting-looking Java stack dump, and then a bunch
of these protocol handshake timeouts, repeated for many different client
ports:
ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
seconds) on acceptor "artemis" during protocol handshake with
/52.179.166.52:61538 has occurred.
Then this one appears, which looks like it might be useful:
2021-01-20 03:30:48,627 ERROR
[io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to
submit a listener notification task. Event loop shut down?:
java.util.concurrent.RejectedExecutionException: event executor terminated
More heap-space complaints and Java stack dumps... then numerous of these:
WARN [io.netty.channel.AbstractChannel] Force-closing a channel whose
registration task was not accepted by an event loop: [id: 0x6be932f5, L:/
10.40.50.17:61616 - R:/52.179.166.52:45604]:
java.util.concurrent.RejectedExecutionException: event executor terminated
And that is the state I found it in this morning.