Is your system clock in sync? Just by coincidence I ran into some issues with a user the other day, and it turned out to be exactly that. Check the time on your systems.
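If you want to check it from the JVM side, here is a minimal sketch (assuming the Apache Commons Net library is on the classpath; pool.ntp.org is just a placeholder time source). Run it on each node and compare the offsets:

    import java.net.InetAddress;
    import org.apache.commons.net.ntp.NTPUDPClient;
    import org.apache.commons.net.ntp.TimeInfo;

    public class ClockSkewCheck {
        public static void main(String[] args) throws Exception {
            NTPUDPClient client = new NTPUDPClient();
            client.setDefaultTimeout(5000); // don't hang if UDP/123 is blocked
            // placeholder NTP server; point this at your own time source if you have one
            TimeInfo info = client.getTime(InetAddress.getByName("pool.ntp.org"));
            info.computeDetails(); // fills in the offset and round-trip delay
            System.out.println("clock offset (ms): " + info.getOffset());
            client.close();
        }
    }

If the offsets differ by more than a few hundred milliseconds across the 3 pods, fix that first before digging further into the credits issue.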
On Fri, Aug 25, 2023 at 3:22 PM Stefano Mazzocchi <stefano.mazzoc...@gmail.com> wrote:

> Hi Justin, thanks for your response!
>
> Find my answers inline below.
>
> On Thu, Aug 24, 2023 at 8:43 PM Justin Bertram <jbert...@apache.org>
> wrote:
>
> > A couple of questions:
> >
> > - What high availability configuration are you using, and at what point
> > does split brain occur?
>
> We don't have HA enabled. Artemis is used as an asynchronous, ephemeral
> control plane sending messages between software modules. If it goes down
> for a little while, or some messages are lost, that's OK for our needs.
>
> The split brain occurs when that log event is emitted. We have not been
> able to identify what is causing it.
>
> > - Is JGroups w/TCP really viable in AWS? I assumed it would be onerous
> > to configure in a cloud environment since it requires a static list of
> > IP addresses (i.e. no dynamic discovery).
>
> Our cluster uses Kubernetes to manage 3 different Artemis "pods" living
> in 3 different availability zones. We configured it using JGroups with
> TCP because it's not possible to do IP multicast across AZs in AWS.
>
> > - What metric exactly are you looking at for the cluster-connection's
> > credits?
>
> We are scraping the balance="" value out of the DEBUG logs.
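Side note: if you're already scraping those DEBUG lines, a tiny parser like this (an illustrative sketch, not your exact setup) can pull the balance out and feed it to a metrics backend instead of eyeballing the logs:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CreditBalanceScraper {
        // matches the balance=<n> field in the AsynchronousProducerCreditsImpl DEBUG line
        private static final Pattern BALANCE = Pattern.compile("balance=(\\d+)");

        public static void main(String[] args) {
            String line = "[org.apache.activemq.artemis.core.client.impl."
                    + "AsynchronousProducerCreditsImpl] releaseOutstanding credits,"
                    + " balance=0, callback=class org.apache.activemq.artemis.core"
                    + ".server.cluster.impl.ClusterConnectionBridge";
            Matcher m = BALANCE.matcher(line);
            if (m.find()) {
                long balance = Long.parseLong(m.group(1));
                System.out.println("credit balance: " + balance); // export this as a gauge
            }
        }
    }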
> > - Have you considered using the connection router functionality [1] to
> > pin relevant producers and consumers to the same node to avoid moving
> > messages around the cluster? Moving messages might be neutralizing the
> > benefits of clustering [2].
>
> We are using Artemis to create an asynchronous, ephemeral control plane
> between a few thousand software modules. We designed the system to be
> resilient to latency and temporary failures, and we didn't expect our
> load (600 msg/sec) to be enough to justify investing in this kind of
> broker affinity. What we did NOT expect is this kind of "wedged" state,
> where Artemis cannot recover until we physically kill the instance that
> is accumulating messages. Our modules are designed to wait and reconnect
> if communication with the broker goes down, but they have no way of
> telling the difference between a valid connection that is not receiving
> messages because there aren't any to be received and a valid connection
> that is not receiving messages because they are stuck in transit between
> brokers.
>
> We could limp along indefinitely like this (automating the termination
> of any Artemis pod that shows any accumulation of messages), or we could
> abandon the entire concept of a multi-pod Artemis configuration, run a
> single pod, and tolerate it going down once in a while (the rest of our
> system is designed to withstand that). But before giving up we wanted to
> understand why this is happening and whether there is something we can
> do to prevent it (or whether it's a bug in Artemis).
>
> > Justin
> >
> > [1]
> > https://activemq.apache.org/components/artemis/documentation/latest/connection-routers.html
> > [2]
> > https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#performance-considerations
> >
> > On Thu, Aug 24, 2023 at 7:49 PM Stefano Mazzocchi <stef...@apache.org>
> > wrote:
> >
> > > Hi there,
> > >
> > > At $day_job we are running an Artemis 2.30 cluster in production with
> > > 3 nodes, using JGroups over TCP for broadcast and discovery. We are
> > > using it over MQTT and things are working well.
> > >
> > > Every couple of days, messages stop flowing across nodes (causing
> > > issues with the rest of our cluster that directly impact our
> > > customers).
> > >
> > > The smoking gun seems to be this log message:
> > >
> > > [org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl]
> > > releaseOutstanding credits, balance=0, callback=class
> > > org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge
> > >
> > > Every time this message appears, messages stop being routed across
> > > Artemis instances and end up piling up in internal queues instead of
> > > being delivered.
> > >
> > > We have tried configuring "producer-window-size" to be -1 in the
> > > cluster connector, but that caused even more problems so we had to
> > > revert it. Our production environment is therefore operating with the
> > > default value, which we believe to be 1 MB.
> > >
> > > We have also created a Grafana dashboard to look at the value of the
> > > "credits" for each cluster connector over time, and they oscillate
> > > consistently between 1 MB and 600 kB. The ONLY time the balance dips
> > > below 600 kB is when it goes straight to zero; it then bounces right
> > > back, but the messages remain stuck in a queue.
> > >
> > > There is no indication of reconnection or anything else in the logs.
> > >
> > > Unfortunately we have been unable to reproduce this with artificial
> > > load tests. It seems to be something very specific to how our
> > > production cluster is operating (in AWS).
> > >
> > > Has anyone experienced anything like this before? Do you have any
> > > suggestions on what we could try to prevent this from happening?
> > >
> > > Thank you very much in advance for any suggestions you can give us.
> > >
> > > --
> > > Stefano.
>
> --
> Stefano.

--
Clebert Suconic
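P.S. Since you mentioned automating the termination of any pod that accumulates messages: here is a rough sketch of a watchdog that polls the depth of the internal store-and-forward queues over JMX. The host, port, and kill policy are placeholders, and the ObjectName layout below is the usual Artemis 2.x pattern, so verify it against your broker before relying on it:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class SfQueueWatchdog {
        public static void main(String[] args) throws Exception {
            // placeholder JMX endpoint; point it at the pod you want to watch
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                // wildcard pattern over all queue MBeans (Artemis 2.x layout)
                ObjectName pattern = new ObjectName(
                        "org.apache.activemq.artemis:broker=*,component=addresses,"
                        + "address=*,subcomponent=queues,routing-type=*,queue=*");
                for (ObjectName name : mbs.queryNames(pattern, null)) {
                    String queue = name.getKeyProperty("queue");
                    // internal store-and-forward queues are named $.artemis.internal.sf.<cluster>.<nodeID>
                    if (queue != null && queue.contains("$.artemis.internal.sf.")) {
                        long depth = (Long) mbs.getAttribute(name, "MessageCount");
                        System.out.println(queue + " depth=" + depth);
                        // e.g. exit non-zero here once the depth stays above a
                        // threshold, so a Kubernetes probe restarts the pod
                    }
                }
            }
        }
    }

Crude, but it would at least turn "messages are stuck in the sf queue" into a signal you can act on automatically while the root cause is being chased.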