Hi Justin,

There is 16GB of memory on the VM, and I have the JVM memory set to
-Xms512M -Xmx2G in the artemis.profile file, so I can increase those.
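For reference, the line I'd be changing in artemis.profile looks roughly
like this (a sketch from memory -- the real file carries more flags on
JAVA_ARGS that I'd leave in place):

```shell
# artemis.profile (sketch): JVM memory args for the broker process.
# Current values are -Xms512M -Xmx2G; raising the ceiling to 4G would be:
JAVA_ARGS="-Xms512M -Xmx4G"
```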
I don't have max-size-bytes or global-max-size set. However, according to
the comments in the broker.xml file, global-max-size defaults to half of
-Xmx, so shouldn't Artemis start paging after it uses 1G? I've been
tracking heap usage, and it did spike a few times when the problem
started, so you're probably right about this.

Ben

On Wed, Jan 20, 2021 at 12:03 PM Justin Bertram <[email protected]> wrote:

> > java.lang.OutOfMemoryError: Java heap space
>
> My guess is that you were approaching your heap limit, which caused the
> preceding issues. Once you hit an OutOfMemoryError then, in my
> experience, all bets are off, so to speak, and you need to restart the
> broker.
>
> It's not clear why you couldn't connect to the slave.
>
> I recommend you take a look at your global-max-size as well as your
> max-size-bytes configuration, as these are designed to help prevent OOME
> when receiving lots of messages.
>
>
> Justin
>
> On Wed, Jan 20, 2021 at 9:12 AM Ben Warrick <[email protected]>
> wrote:
>
> > I had an incident last night. I can share the actual log files with
> > anyone interested in seeing the full picture. Below is a summary. I
> > also have OS resource tracking data.
> >
> > I have a pair of Artemis servers using a shared store -- an
> > NFS-mounted folder. My use case is enterprise integration. I use
> > Artemis to pass messages between interfaces. For example, the website
> > publishes an order to an address which has several queues. All the
> > interfaces that need that message subscribe to their queue.
> >
> > It looks to me like an issue with the page file. Does "page file" in
> > this case mean virtual memory, or the paging folder inside the data
> > folder?
> >
> > This morning I could not bring up the console, but I was able to send
> > and receive messages using JMS. I tried restarting the primary --
> > which did not help the console, but after that I could no longer get
> > a JMS connection.
> >
> > After shutting down the primary, the secondary did not take over.
> > From the logs it looked like it did -- the objects were deployed --
> > but the whole Artemis tab disappeared from the console and I was not
> > able to make a JMS connection.
> >
> > So then I shut both down, waited a few minutes, and started the
> > primary. It came up fine. Then I started the secondary, waited a
> > little while, and shut down the primary, just as a test. This time
> > the secondary took over quickly. I restarted the primary and it
> > happily took control back.
> >
> > Any advice would be much appreciated.
> >
> > Thanks!
> >
> > Ben
> >
> >
> > *** Log Summary ***
> >
> > 2021-01-19 14:58:51 UTC this appears in the log:
> >
> > ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> > appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx
> > with id 33193877: java.lang.IllegalStateException: Cannot find tx
> > with id 33193877
> >
> > Along with some Java dump, directly followed by a few of these:
> >
> > WARN [org.apache.activemq.artemis.core.server] AMQ222104: IO Error
> > completing the transaction, code = 6, message = Cannot find tx with
> > id 33193877
> >
> > 2021-01-19 15:41:10 The configuration reloads and the objects all
> > redeploy. But then:
> >
> > 2021-01-19 15:52:51 ERROR
> > [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> > appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx
> > with id 33318276: java.lang.IllegalStateException: Cannot find tx
> > with id 33318276
> >
> > Then several of these:
> >
> > 2021-01-19 15:52:51,666 WARN [org.apache.activemq.artemis.core.server]
> > AMQ222104: IO Error completing the transaction, code = 6, message =
> > Cannot find tx with id 33318276
> >
> > And a few minutes later I see about 1 billion of these:
> >
> > 2021-01-19 16:01:21,164 WARN [org.apache.activemq.artemis.core.server]
> > AMQ222033: Page file 000000008.page had incomplete records at
> > position 8,081,827 at record number 6,504
> > 2021-01-19 16:01:21,165 WARN [org.apache.activemq.artemis.core.server]
> > AMQ222288: Page 8, message 6,504 could not be found on offset
> > 8,081,827, with starting message 6,504. This represents a logic error
> > or inconsistency on the data, and the system will try once again from
> > the beggining of the page file.
> >
> > These end the following day with several of these:
> >
> > 2021-01-20 03:10:22,764 WARN [org.apache.activemq.artemis.core.client]
> > AMQ212037: Connection failure to /13.68.117.113:42767 has been
> > detected: AMQ229014: Did not receive data from /{IP}:{port} within
> > the 60,000ms connection TTL. The connection will now be closed.
> > [code=CONNECTION_TIMEDOUT]
> >
> > This happens for several different IP addresses and ports. Then I see
> > these two errors happening together, until I get a nice
> >
> > Java heap space: java.lang.OutOfMemoryError: Java heap space
> >
> > Followed by a lot of interesting-looking Java dump, and a bunch of
> > these:
> >
> > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout
> > (10 seconds) on acceptor "artemis" during protocol handshake with
> > /52.179.166.52:61538 has occurred.
> >
> > Then a bunch of handshake errors:
> >
> > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout
> > (10 seconds) on acceptor "artemis" during protocol handshake with
> > /52.179.166.52:5972 has occurred.
> >
> > A bunch more of these:
> >
> > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout
> > (10 seconds) on acceptor "artemis" during protocol handshake with
> > /{IP}.52:22481 has occurred.
> >
> > This one appears and looks useful, maybe:
> >
> > 2021-01-20 03:30:48,627 ERROR
> > [io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to
> > submit a listener notification task. Event loop shut down?:
> > java.util.concurrent.RejectedExecutionException: event executor
> > terminated
> >
> > More heap space complaints and Java dump... Then many of these:
> >
> > WARN [io.netty.channel.AbstractChannel] Force-closing a channel whose
> > registration task was not accepted by an event loop: [id: 0x6be932f5,
> > L:/10.40.50.17:61616 - R:/52.179.166.52:45604]:
> > java.util.concurrent.RejectedExecutionException: event executor
> > terminated
> >
> > And that's the state where I found it this morning.
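PS, for anyone searching the archives later: the settings Justin refers
to go in broker.xml. A rough sketch of setting them explicitly -- the
values here are illustrative, not a tuned recommendation:

```xml
<core xmlns="urn:activemq:core">
  <!-- Cap on memory used across all addresses before overflow handling
       kicks in. My understanding is that when unset it defaults to half
       of -Xmx (so 1G with -Xmx2G). -->
  <global-max-size>1G</global-max-size>

  <address-settings>
    <address-setting match="#">
      <!-- Per-address cap before the full policy applies. -->
      <max-size-bytes>100MB</max-size-bytes>
      <!-- PAGE spills messages to the paging folder under the broker's
           data directory rather than holding them all in heap. -->
      <address-full-policy>PAGE</address-full-policy>
    </address-setting>
  </address-settings>
</core>
```

As I understand it, this also answers my own "page file" question above:
Artemis paging is the paging folder inside the data folder, not OS
virtual memory.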
