Is your system clock in sync? Just by coincidence I ran into some issues with a user the other day, and it turned out to be exactly that. Check the time on your systems.
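If you want to check it from the JVM side, here is a minimal sketch (assuming the Apache Commons Net library is on the classpath; pool.ntp.org is just a placeholder time source). Run it on each node and compare the offsets:

    import java.net.InetAddress;
    import org.apache.commons.net.ntp.NTPUDPClient;
    import org.apache.commons.net.ntp.TimeInfo;

    public class ClockSkewCheck {
        public static void main(String[] args) throws Exception {
            NTPUDPClient client = new NTPUDPClient();
            client.setDefaultTimeout(5000); // don't hang if UDP/123 is blocked
            // placeholder NTP server; point this at your own time source if you have one
            TimeInfo info = client.getTime(InetAddress.getByName("pool.ntp.org"));
            info.computeDetails(); // fills in the offset and round-trip delay
            System.out.println("clock offset (ms): " + info.getOffset());
            client.close();
        }
    }

If the offsets differ by more than a few hundred milliseconds across the 3 pods, fix that first before digging further into the credits issue.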
On Fri, Aug 25, 2023 at 3:22 PM Stefano Mazzocchi <stefano.mazzoc...@gmail.com> wrote:

> Hi Justin, thanks for your response!
>
> Find my answers inline below.
>
> On Thu, Aug 24, 2023 at 8:43 PM Justin Bertram <jbert...@apache.org>
> wrote:
>
> > A couple of questions:
> >
> > - What high availability configuration are you using, and at what point
> > does split brain occur?
>
> We don't have HA enabled. Artemis is used as an asynchronous, ephemeral
> control plane sending messages between software modules. If it goes down
> for a little while, or some messages are lost, that's OK for our needs.
>
> The split brain occurs when that log event is emitted. We have not been
> able to identify what is causing it.
>
> > - Is JGroups w/TCP really viable in AWS? I assumed it would be onerous
> > to configure in a cloud environment since it requires a static list of
> > IP addresses (i.e. no dynamic discovery).
>
> Our cluster uses Kubernetes to manage 3 different Artemis "pods" living
> in 3 different availability zones. We configured it using JGroups with
> TCP because it's not possible to do IP multicast across AZs in AWS.
>
> > - What metric exactly are you looking at for the cluster-connection's
> > credits?
>
> We are scraping the balance="" value out of the DEBUG logs.
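Side note: if you're already scraping those DEBUG lines, a tiny parser like this (an illustrative sketch, not your exact setup) can pull the balance out and feed it to a metrics backend instead of eyeballing the logs:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CreditBalanceScraper {
        // matches the balance=<n> field in the AsynchronousProducerCreditsImpl DEBUG line
        private static final Pattern BALANCE = Pattern.compile("balance=(\\d+)");

        public static void main(String[] args) {
            String line = "[org.apache.activemq.artemis.core.client.impl."
                    + "AsynchronousProducerCreditsImpl] releaseOutstanding credits,"
                    + " balance=0, callback=class org.apache.activemq.artemis.core"
                    + ".server.cluster.impl.ClusterConnectionBridge";
            Matcher m = BALANCE.matcher(line);
            if (m.find()) {
                long balance = Long.parseLong(m.group(1));
                System.out.println("credit balance: " + balance); // export this as a gauge
            }
        }
    }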
> > - Have you considered using the connection router functionality [1] to
> > pin relevant producers and consumers to the same node to avoid moving
> > messages around the cluster? Moving messages might be neutralizing the
> > benefits of clustering [2].
>
> We are using Artemis to create an asynchronous, ephemeral control plane
> between a few thousand software modules. We designed the system to be
> resilient to latency and temporary failures, and we didn't expect our
> load (600 msg/sec) to be enough to justify investing in this kind of
> broker affinity. What we did NOT expect is this kind of "wedged" state,
> where Artemis cannot recover until we physically kill the instance that
> is accumulating messages. Our modules are designed to wait and reconnect
> if communication with the broker goes down, but they have no way of
> telling the difference between a valid connection that is not receiving
> messages because there aren't any to be received and a valid connection
> that is not receiving messages because they are stuck in transit between
> brokers.
>
> We could limp along indefinitely like this (automating the termination
> of any Artemis pod that shows any accumulation of messages), or we could
> abandon the entire concept of a multi-pod Artemis configuration, run a
> single pod, and tolerate it going down once in a while (the rest of our
> system is designed to withstand that). But before giving up we wanted to
> understand why this is happening and whether there is something we can
> do to prevent it (or whether it's a bug in Artemis).
>
> > Justin
> >
> > [1]
> > https://activemq.apache.org/components/artemis/documentation/latest/connection-routers.html
> > [2]
> > https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#performance-considerations
> >
> > On Thu, Aug 24, 2023 at 7:49 PM Stefano Mazzocchi <stef...@apache.org>
> > wrote:
> >
> > > Hi there,
> > >
> > > At $day_job we are running an Artemis 2.30 cluster in production with
> > > 3 nodes, using JGroups over TCP for broadcast and discovery. We are
> > > using it over MQTT and things are working well.
> > >
> > > Every couple of days, messages stop flowing across nodes (causing
> > > issues with the rest of our cluster that directly impact our
> > > customers).
> > >
> > > The smoking gun seems to be this log message:
> > >
> > > [org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl]
> > > releaseOutstanding credits, balance=0, callback=class
> > > org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge
> > >
> > > Every time this message appears, messages stop being routed across
> > > Artemis instances and end up piling up in internal queues instead of
> > > being delivered.
> > >
> > > We have tried configuring "producer-window-size" to be -1 in the
> > > cluster connector, but that caused even more problems so we had to
> > > revert it. Our production environment is therefore operating with the
> > > default value, which we believe to be 1 MB.
> > >
> > > We have also created a Grafana dashboard to look at the value of the
> > > "credits" for each cluster connector over time, and they oscillate
> > > consistently between 1 MB and 600 kB. The ONLY time the balance dips
> > > below 600 kB is when it goes straight to zero; it then bounces right
> > > back, but the messages remain stuck in a queue.
> > >
> > > There is no indication of reconnection or anything else in the logs.
> > >
> > > Unfortunately we have been unable to reproduce this with artificial
> > > load tests. It seems to be something very specific to how our
> > > production cluster is operating (in AWS).
> > >
> > > Has anyone experienced anything like this before? Do you have any
> > > suggestions on what we could try to prevent this from happening?
> > >
> > > Thank you very much in advance for any suggestions you can give us.
> > >
> > > --
> > > Stefano.
>
> --
> Stefano.

--
Clebert Suconic
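P.S. Since you mentioned automating the termination of any pod that accumulates messages: here is a rough sketch of a watchdog that polls the depth of the internal store-and-forward queues over JMX. The host, port, and kill policy are placeholders, and the ObjectName layout below is the usual Artemis 2.x pattern, so verify it against your broker before relying on it:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class SfQueueWatchdog {
        public static void main(String[] args) throws Exception {
            // placeholder JMX endpoint; point it at the pod you want to watch
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                // wildcard pattern over all queue MBeans (Artemis 2.x layout)
                ObjectName pattern = new ObjectName(
                        "org.apache.activemq.artemis:broker=*,component=addresses,"
                        + "address=*,subcomponent=queues,routing-type=*,queue=*");
                for (ObjectName name : mbs.queryNames(pattern, null)) {
                    String queue = name.getKeyProperty("queue");
                    // internal store-and-forward queues are named $.artemis.internal.sf.<cluster>.<nodeID>
                    if (queue != null && queue.contains("$.artemis.internal.sf.")) {
                        long depth = (Long) mbs.getAttribute(name, "MessageCount");
                        System.out.println(queue + " depth=" + depth);
                        // e.g. exit non-zero here once the depth stays above a
                        // threshold, so a Kubernetes probe restarts the pod
                    }
                }
            }
        }
    }

Crude, but it would at least turn "messages are stuck in the sf queue" into a signal you can act on automatically while the root cause is being chased.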