On Fri, Aug 25, 2023 at 9:08 PM Justin Bertram <jbert...@apache.org> wrote:

> > We don't have HA enabled.
>
> In ActiveMQ Artemis the idea of "split-brain" [1] is predicated on an HA
> configuration. In short, it's what we call what happens when both a primary
> and backup server are active at the same time serving the same messages to
> clients. Given that you're not using HA then "split-brain" doesn't seem to
> apply here.
>
> What specifically do you mean when you use the term "split-brain"? Are you
> talking about the situation where 2 active nodes in the cluster are not
> communicating properly?
>

I'm sorry I used the term improperly.

Yes, I'm referring to a situation in which our cluster of 3 brokers gets
into a state where the brokers can no longer talk to each other and
messages stop flowing between them.


> > We configured it using JGroups with TCP because it's not possible to do
> IP multicast across AZs in AWS.
>
> Why not just use the "static" connector configuration offered via the
> normal configuration? Typically folks who configure JGroups use something
> more exotic for cloud-based use-cases like S3_PING [2] or KUBE_PING [3].
>

Yeah, we might end up resorting to that. We originally planned on using
KUBE_PING but stopped once DNS_PING worked for us.
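
For reference, if we do go the static route, I assume it would look
roughly like this in each node's broker.xml (connector names and
hostnames below are made up for illustration):

   <connectors>
      <!-- this node's own connector, announced to the others -->
      <connector name="node0">tcp://artemis-0.example.internal:61616</connector>
      <!-- the other cluster members -->
      <connector name="node1">tcp://artemis-1.example.internal:61616</connector>
      <connector name="node2">tcp://artemis-2.example.internal:61616</connector>
   </connectors>

   <cluster-connections>
      <cluster-connection name="my-cluster">
         <connector-ref>node0</connector-ref>
         <message-load-balancing>ON_DEMAND</message-load-balancing>
         <static-connectors>
            <connector-ref>node1</connector-ref>
            <connector-ref>node2</connector-ref>
         </static-connectors>
      </cluster-connection>
   </cluster-connections>

Is that roughly right, or is there more to it?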


> > ...we didn't expect our load (600 msg/sec) to be enough to justify
> investing in this kind of broker affiliation.
>
> Fair enough. I brought up the connection router because many folks seem to
> be under the impression that clustering is just a silver bullet for more
> performance without understanding the underlying implications of
> clustering.
>

Yeah, we understand that.


>
> > What we did NOT expect is this kind of "wedged" behavior in which Artemis
> finds itself and is not able to recover until we physically kill the
> instance that is accumulating messages.
>
> That's certainly problematic and not something I would expect either.
> Occasional administrative restarts seems like a viable short-term
> work-around, but the goal would be to identify the root cause of the
> problem so it can either be addressed via configuration or code (i.e. a bug
> fix). At this point I can't say what the root cause is.
>

Yes, it's very puzzling.

We are 99% sure the problem happens when this method gets invoked:

https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientProducerCreditManagerImpl.java#L161

There are only two other methods calling this one:

ClientProducerCreditManagerImpl.getCredits()
https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientProducerCreditManagerImpl.java#L55

and

ClientProducerCreditManagerImpl.returnCredits()
https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientProducerCreditManagerImpl.java#L105

It seems that the internal address between brokers is treated just the
same as any other address in terms of flow control: when the entry is
removed, it ends up "blocked", but nothing ever unblocks it afterwards.
Honestly, it feels like a bug. Or it could be that whatever is supposed to
do the unblocking is never invoked because of some misconfiguration on our
part.

> You said that you tried using -1 as the producer-window-size on the
> cluster-connection and that it caused even more problems. What were those
> problems?


Our entire cluster went bad because messages stopped going through, so we
had to quickly revert the configuration. A lot of messages failed to be
delivered, but we don't know whether that was caused by a load spike or
something else.

Our biggest problem is that the only way to reproduce this issue is under
load in our production environment, which impacts our customers, so
experimenting with it is a very slow and risky process.


> Did you try any other values greater than the default (i.e.
> 1048576 - 1MiB)? If not, could you?
>

Yes, we could, but see above.
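
If we do try a larger value, I assume it just goes on the
cluster-connection in broker.xml, something like this (the 4 MiB value is
an arbitrary example and the element order may need adjusting against the
schema):

   <cluster-connection name="my-cluster">
      <connector-ref>node0</connector-ref>
      <!-- default is 1048576 (1 MiB); example of a larger value (4 MiB) -->
      <producer-window-size>4194304</producer-window-size>
      <discovery-group-ref discovery-group-name="my-discovery-group"/>
   </cluster-connection>

Presumably we could also set it very low in a dev cluster to try to
trigger the problem more often.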


> How long has this deployment been running before you saw this issue?


Well, we just launched our service a few weeks ago.


> Has
> anything changed recently? Version 2.30.0 was only recently released. Did
> you use another version previously? If so, did you see this problem in the
> previous version?
>

We launched with 2.28.0 and had the same problem. We upgraded to 2.30.0
hoping it would go away but it didn't.


> How large are the messages that you are sending?
>

Pretty small, a few KB at most.


> Instead of restarting the entire broker have you tried stopping and
> starting the cluster connection via the management API? If so, what
> happened? If not, could you?
>

We did not. How would you do this?
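
For my own notes, I'm guessing it would be something along these lines
over JMX, using the stop()/start() operations on the cluster connection's
control MBean (the JMX URL, broker name and cluster-connection name below
are placeholders I made up, so please correct me if the ObjectName pattern
is off):

   import javax.management.MBeanServerConnection;
   import javax.management.ObjectName;
   import javax.management.remote.JMXConnector;
   import javax.management.remote.JMXConnectorFactory;
   import javax.management.remote.JMXServiceURL;

   public class RestartClusterConnection {
      public static void main(String[] args) throws Exception {
         // placeholder JMX endpoint -- replace with the broker's actual one
         JMXServiceURL url = new JMXServiceURL(
               "service:jmx:rmi:///jndi/rmi://broker-host:1099/jmxrmi");
         try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbean = connector.getMBeanServerConnection();
            // "my-broker" and "my-cluster" are placeholders; the exact
            // ObjectName can be checked in the web console for our version
            ObjectName clusterConnection = new ObjectName(
                  "org.apache.activemq.artemis:broker=\"my-broker\","
                  + "component=cluster-connections,name=\"my-cluster\"");
            // stop and then restart the cluster connection
            mbean.invoke(clusterConnection, "stop", null, null);
            mbean.invoke(clusterConnection, "start", null, null);
         }
      }
   }

If that's roughly right, we could wire it into our automation instead of
killing the whole pod.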


> When you attempt to reproduce the issue do you still see the
> "releaseOutstanding" log message at some point? In your reproducer
> environment have you tried lowering the producer-window-size as a way to
> potentially make the error more likely?
>

Ah, that's a good suggestion. We did not, but we could try that to see if
it helps us reproduce the issue in dev.


>
> > ...we could just abandon the entire concept of multi-pod Artemis
> configuration and just have one and tolerate it going down once in a
> while...
>
> Generally speaking I think this is a viable strategy and one I recommend to
> folks often (which goes back to the fact that lots of folks deploy a
> cluster without any real need). You could potentially configure HA to
> mitigate the risk of the broker going down, although that has caveats of
> its own.
>

We tested a single Artemis instance today and it handled enough load to
satisfy our needs, so we will probably go with that for now.

Still, I can't shake the feeling that an internal queue between brokers
getting wedged like that is not a good thing, and I would like to
understand why it's happening because we might need to cluster again in
the future.

Thx for all your help and suggestions.


>
>
> Justin
>
> [1]
>
> https://activemq.apache.org/components/artemis/documentation/latest/network-isolation.html
> [2] http://www.jgroups.org/javadoc/org/jgroups/protocols/S3_PING.html
> [3] http://www.jgroups.org/manual5/index.html#_kube_ping
>
> On Fri, Aug 25, 2023 at 2:22 PM Stefano Mazzocchi <
> stefano.mazzoc...@gmail.com> wrote:
>
> > Hi Justin, thx for your response!
> >
> > Find my answers inline below.
> >
> > On Thu, Aug 24, 2023 at 8:43 PM Justin Bertram <jbert...@apache.org>
> > wrote:
> >
> > > Couple of questions:
> > >
> > >  - What high availability configuration are you using and at what point
> > > does split brain occur?
> > >
> >
> > We don't have HA enabled. Artemis is used as an asynchronous ephemeral
> > control plane sending messages between software modules. If it does go
> down
> > for a little while, or some messages are lost, it's ok for our needs.
> >
> > The split brain occurs when that log event is emitted. We have not been
> > able to identify what is causing that to happen.
> >
> >
> > >  - Is JGroups w/TCP really viable in AWS? I assumed it would be onerous
> > to
> > > configure in a cloud environment since it requires a static list of IP
> > > addresses (i.e. no dynamic discovery).
> > >
> >
> > Our cluster uses kubernetes to manage 3 different artemis "pods" living
> in
> > 3 different availability zones. We configured it using JGroups with TCP
> > because it's not possible to do IP multicast across AZs in AWS.
> >
> >
> > >  - What metric exactly are you looking at for the cluster-connection's
> > > credits?
> > >
> >
> > We are scraping the balance="" value out of DEBUG logs.
> >
> >
> > >  - Have you considered using the connection router functionality [1] to
> > pin
> > > relevant producers and consumers to the same node to avoid moving
> > messages
> > > around the cluster? Moving messages might be neutralizing the benefits
> of
> > > clustering [2].
> > >
> >
> > We are using Artemis to create an asynchronous and ephemeral control
> plane
> > between a few thousands of software modules and we designed the system to
> > be resilient to latency and temporary failures and we didn't expect our
> > load (600 msg/sec) to be enough to justify investing in this kind of
> broker
> > affiliation. What we did NOT expect is this kind of "wedged" behavior in
> > which Artemis finds itself and is not able to recover until we physically
> > kill the instance that is accumulating messages. Our modules are designed
> > to wait and reconnect if communication to the broker goes down, but they
> > have no way of telling the difference between a valid connection that is
> > not receiving messages because there aren't any to be received or a valid
> > connection that is not receiving messages because they are stuck in
> transit
> > between brokers.
> >
> > We could limp along indefinitely like this (automating the termination of
> > any artemis pod that shows any accumulation of messages) or we could just
> > abandon the entire concept of multi-pod Artemis configuration and just
> have
> > one and tolerate it going down once in a while (the rest of our system is
> > designed to withstand that) but before giving up we wanted to understand
> > why this is happening and if there was something we can do to prevent it.
> > (or if it's a bug in Artemis)
> >
> >
> > >
> > > Justin
> > >
> > > [1]
> > >
> > >
> >
> https://activemq.apache.org/components/artemis/documentation/latest/connection-routers.html
> > > [2]
> > >
> > >
> >
> https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#performance-considerations
> > >
> > > On Thu, Aug 24, 2023 at 7:49 PM Stefano Mazzocchi <stef...@apache.org>
> > > wrote:
> > >
> > > > Hi there,
> > > >
> > > > at $day_job we are running in production an Artemis 2.30 cluster
> with 3
> > > > nodes using jgroups over TCP for broadcast and discovery. We are
> using
> > it
> > > > over MQTT and things are working well.
> > > >
> > > > Every couple of days, messages stop flowing across nodes (causing
> > > negative
> > > > issues with the rest of our cluster which directly impact our
> > customers).
> > > >
> > > > The smoking gun seems to be this log message:
> > > >
> > > >
> > > >
> > >
> >
> [org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl]
> > > > releaseOutstanding credits, balance=0, callback=class
> > > >
> > > >
> > >
> >
> org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge
> > > >
> > > > Every time this message appears, messages stop being routed across
> > > Artemis
> > > > instances and end up piling up in internal queues instead of being
> > > > delivered.
> > > >
> > > > We have tried configuring "producer-window-size" to be -1 in the
> > cluster
> > > > connector but that has caused even more problems so we had to revert
> > it.
> > > > Our production environment is therefore operating with the default
> > value
> > > > which we believe to be 1Mb.
> > > >
> > > > We have also created a grafana dashboard to look at the value of the
> > > > "credits" for each cluster connector over time and they oscillate
> > > > consistently between the "1mb" and 600kb range. The ONLY time it dips
> > > below
> > > > 600kb is when it goes straight to zero and then it bounces right
> back,
> > > but
> > > > the messages continue to be stuck in a queue.
> > > >
> > > > There is no indication of reconnection or anything else in the logs.
> > > >
> > > > Unfortunately we have been unable to reproduce this with artificial
> > load
> > > > tests. It seems to be something very specific to how our production
> > > cluster
> > > > is operating (in AWS).
> > > >
> > > > Has anyone experienced anything like this before? Do you have any
> > > > suggestions on what we could try to prevent this from happening?
> > > >
> > > > Thank you very much in advance for any suggestion you could give us.
> > > >
> > > > --
> > > > Stefano.
> > > >
> > >
> >
> >
> > --
> > Stefano.
> >
>


-- 
Stefano.
