Hi, while the error is the same, I don't believe it is the same issue as described in ARTEMIS-3640. In this case the DNS names in the broker configuration and the client configuration match. The UnknownHostException is most probably due to the way Kubernetes works: if a Pod is down, Kubernetes removes its hostname from the internal DNS completely. When the Pod is started, the DNS record is created again.
Even though I always advise against running clustered services such as Artemis on Kubernetes (unless you are trying to solve a particular problem), the last time I tried, HA worked fine. Pedro, could you provide your complete replication or shared-storage configuration? For now it looks like you are using replication, but the primary configuration is empty. Per the notice in https://activemq.apache.org/components/artemis/documentation/latest/ha.html#replication, automatic failover won't work for such a configuration. If you really want to use replication, you should read https://activemq.apache.org/components/artemis/documentation/latest/network-isolation.html first.

-- Vilius

-----Original Message-----
From: Domenico Francesco Bruscino <[email protected]>
Sent: Monday, September 29, 2025 7:52 PM
To: [email protected]
Cc: [email protected]; Teixeira Pedro (BT-VS/ESW-CSA4) <[email protected]>
Subject: Re: Non working Artemis HA on Kubernetes

Hi Pedro,

when the backup broker announces itself, the core clients receive the connector URL defined in the cluster-connection. After a failover, the core clients try to connect to the connector URL of the backup broker they received. This works fine if the brokers and the clients are in the same network. When the brokers are deployed in a Kubernetes cluster and the clients are external, they fail to connect to the received connector URL of the backup broker, see ARTEMIS-3640 [1]. You can fix this issue by setting the connector name in the connection URL of the clients to override the connector URL received from the backup broker; for further details see testJMSConsumerAfterFailover [2].
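To make this concrete, the idea is to name each transport in the client URL so it matches the connector names the brokers announce. As a hedged sketch only (the `name` query parameter and its matching behaviour are assumptions drawn from the test referenced above, and the names `primary`/`backup` are hypothetical placeholders that must match the connector names in each broker.xml), the producer command from the original post might become:

```
./bin/artemis producer \
  --url "(tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616?name=primary,tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616?name=backup)?ha=true&retryInterval=1000&reconnectAttempts=-1" \
  --destination queue://... --user ... --password ...
```

With matching names, the client can substitute its own reachable URL for the unreachable in-cluster connector URL after failover; verify the exact syntax against testJMSConsumerAfterFailover [2] before relying on it.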
Alternatively, I would suggest taking a look at the leader-follower solution in the ArkMQ Operator test-suite: https://github.com/arkmq-org/activemq-artemis-operator/blob/main/controllers/activemqartemis_rwm_pvc_ha_test.go

[1] https://issues.apache.org/jira/browse/ARTEMIS-3640
[2] https://github.com/apache/activemq-artemis/blob/2.42.0/tests/integration-tests/src/test/java/org/apache/activemq/artemis/tests/integration/cluster/failover/ClientConnectorFailoverTest.java#L313

Regards,
Domenico

On Mon, 29 Sept 2025 at 15:34, Teixeira Pedro (BT-VS/ESW-CSA4) <[email protected]> wrote:
> Hello Artemis community!
>
> We are trying to create a solution using Artemis within Kubernetes.
> Our requirement is that if a broker is offline for some reason -
> network issue or the instance shuts down either gracefully or suddenly
> - the clients can still produce and consume messages as usual.
>
> To do that we tried a High Availability solution with Primary + Backup
> brokers (both with replicated and shared storage).
> Each broker is a Stateful Set and there's a Kubernetes Service to ensure
> connectivity to the brokers.
> To connect to the primary and backup we use Headless Services
> (https://kubernetes.io/docs/concepts/services-networking/service/#headless-services).
> For example, artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local is
> the Headless Service name to access Pod artemis-ha-primary-0 through
> Service svc-artemis-ha-primary.
>
> We tested both on Artemis v2.38.0 and Artemis v2.42.0.
>
> We tested the solution by:
>
> 1. Producing messages using the artemis producer command:
>
> ./bin/artemis producer --url "(tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616,tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616)?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1" --destination queue://... --user ... --password ... --message-count 10000 --sleep 2 --verbose
>
> 2.
> Crashing the Primary instance (through deletion of the pod or
> scaling down of the Stateful Set we have Artemis within).
>
> We expected the client to be able to recover.
> However, we were not able to see that.
>
> Analyzing the broker logs and console, there is connectivity between
> the primary and the backup:
>
> (primary)
> INFO [org.apache.activemq.artemis.core.server] AMQ221035: Primary Server Obtained primary lock
>
> (backup)
> INFO [org.apache.activemq.artemis.core.server] AMQ221033: ** got backup lock
> INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
> DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Connected with the currentConnectorConfig=TransportConfiguration(name=primary, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory)?port=61616&host=svc-artemis-ha-primary-svc-cluster-local
>
> But when we crash the primary broker we can see that the backup instance enters a loop of:
>
> ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to create netty connection
> java.net.UnknownHostException: svc-artemis-ha-primary.svc.cluster.local
>
> And never promotes itself to a live broker.
> We were able to make it work only when stopping the primary broker gracefully (artemis stop):
>
> INFO [org.apache.activemq.artemis.core.client] AMQ214036: Connection closure to svc-artemis-ha-primary.svc.cluster.local/....:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
> INFO [org.apache.activemq.artemis.core.client] AMQ214036: Connection closure to svc-artemis-ha-primary.svc.cluster.local/...:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
> INFO [org.apache.activemq.artemis.core.server] AMQ221010: Backup Server is now active
>
> We'd like to ask if what we are trying to do in the context of Kubernetes is possible, if we have stumbled upon a bug or limitation, or if we did something wrong.
>
> Thank you very much in advance for the help.
> I'm available for additional questions if needed.
>
> Best regards,
> *Pedro Teixeira*
>
> PS: For reference, our broker configurations are as follows:
>
> -- Primary
>
> <connectors>
>    <connector name="self">tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616</connector>
>    <connector name="backup">tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616</connector>
> </connectors>
>
> <ha-policy>
>    <replication>
>       <primary>
>       </primary>
>    </replication>
> </ha-policy>
>
> <cluster-user>ACTIVEMQ.CLUSTER.ADMIN.USER</cluster-user>
> <cluster-password>....</cluster-password>
> <cluster-connections>
>    <cluster-connection name="artemis-cluster">
>       <!-- All addresses -->
>       <address></address>
>       <connector-ref>self</connector-ref>
>       <retry-interval>500</retry-interval>
>       <use-duplicate-detection>true</use-duplicate-detection>
>       <message-load-balancing>OFF</message-load-balancing>
>       <max-hops>1</max-hops>
>       <static-connectors>
>          <connector-ref>backup</connector-ref>
>       </static-connectors>
>    </cluster-connection>
> </cluster-connections>
>
> -- Backup
>
> <ha-policy>
>    <replication>
>       <backup>
>          <!-- ensure the backup that has become active never stops,
>               so it's ready to be a backup again; otherwise it will
>               automatically stop and enter CrashLoopBackOff, per
>               https://activemq.apache.org/components/artemis/documentation/latest/ha.html#failback-with-shared-store -->
>          <allow-failback>true</allow-failback>
>          <restart-backup>true</restart-backup>
>       </backup>
>    </replication>
> </ha-policy>
>
> <!-- Configure the cluster connection -->
> <cluster-user>ACTIVEMQ.CLUSTER.ADMIN.USER</cluster-user>
> <cluster-password>...</cluster-password>
> <cluster-connections>
>    <cluster-connection name="artemis-cluster">
>       <!-- All addresses -->
>       <address></address>
>       <connector-ref>self</connector-ref>
>       <retry-interval>500</retry-interval>
>       <use-duplicate-detection>true</use-duplicate-detection>
>       <!-- The goal is to have a primary / backup solution, not load balancing -->
>       <message-load-balancing>OFF</message-load-balancing>
>       <max-hops>1</max-hops>
>       <!-- load balancing only to other Artemis brokers directly connected to this server -->
>       <static-connectors>
>          <connector-ref>primary</connector-ref>
>       </static-connectors>
>    </cluster-connection>
> </cluster-connections>
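Regarding the empty <primary> element in the quoted primary configuration above: the replication notice in the HA documentation is exactly about this case, since a bare replicating primary has no quorum to decide failover safely. As a hedged sketch only (element names and values are assumptions recalled from the HA and network-isolation docs, and must be verified against the 2.42 reference; the quorum here also presupposes a cluster of at least three broker pairs), a non-empty primary policy might look like:

```xml
<ha-policy>
   <replication>
      <primary>
         <!-- assumption: let the primary call a quorum vote instead of
              shutting down blindly when replication to the backup fails,
              as discussed in the network-isolation document -->
         <vote-on-replication-failure>true</vote-on-replication-failure>
         <quorum-size>2</quorum-size>
      </primary>
   </replication>
</ha-policy>
```

For a two-node setup like the one quoted, the docs instead point toward shared storage or the pluggable lock-manager (e.g. ZooKeeper-based) flavour of replication, which delegates the quorum to an external coordinator.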
