Hi,

while the error is the same, I don't believe it is the same issue as described 
in ARTEMIS-3640. In this case the DNS names in the broker configuration and the 
client configuration match. The UnknownHostException is most probably due to the 
way Kubernetes works: if a Pod is down, Kubernetes removes its hostname from the 
internal DNS completely, and the DNS entry is only created again once the Pod is 
restarted.

Even though I always advise against running clustered services such as Artemis 
on Kubernetes (unless you are trying to solve a particular problem), the last 
time I tried it, HA worked fine.

Pedro, could you provide your complete configuration, either replication or 
shared-storage? For now it looks like you are using replication, but the primary 
configuration is empty. As per the notice in 
https://activemq.apache.org/components/artemis/documentation/latest/ha.html#replication
automatic failover will not work for such a configuration. If you really want to 
use replication you should read 
https://activemq.apache.org/components/artemis/documentation/latest/network-isolation.html
first.
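
For illustration, a non-empty primary policy using the pluggable ZooKeeper lock 
manager could look roughly like the sketch below. Treat it only as a sketch: the 
lock manager class name and the connect-string are placeholders, and the exact 
element names can differ between Artemis versions, so verify everything against 
the documentation for your release:

<ha-policy>
    <replication>
        <primary>
            <!-- Pluggable lock manager backed by ZooKeeper, to avoid split brain
                 with a single primary/backup pair. The class name and the
                 connect-string below are placeholders; verify them for your
                 Artemis version. -->
            <manager>
                <class-name>org.apache.activemq.artemis.lockmanager.zookeeper.ZooKeeperDistributedLockManager</class-name>
                <properties>
                    <property key="connect-string" value="zk-0:2181,zk-1:2181,zk-2:2181"/>
                </properties>
            </manager>
        </primary>
    </replication>
</ha-policy>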

-- 
    Vilius

-----Original Message-----
From: Domenico Francesco Bruscino <[email protected]> 
Sent: Monday, September 29, 2025 7:52 PM
To: [email protected]
Cc: [email protected]; Teixeira Pedro (BT-VS/ESW-CSA4) 
<[email protected]>
Subject: Re: Non working Artemis HA on Kubernetes

Hi Pedro,

when the backup broker announces itself, the core clients receive the connector 
URL defined in the cluster-connection. After a failover, the core clients try 
to connect to the connector URL of the backup broker that they received.
This works fine if the brokers and the clients are in the same network.
When the brokers are deployed in a Kubernetes cluster and the clients are 
external, the clients fail to connect to the received connector URL of the 
backup broker, see ARTEMIS-3640[1]. You can fix this issue by setting the 
connector name in the connection URL of the clients to override the received 
connector URL of the backup broker; for further details see 
testJMSConsumerAfterFailover[2].
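
For illustration only, reusing the hostnames from your producer command and the 
connector names "self" and "backup" from your broker configuration, the client 
URL could look roughly like this. The per-endpoint name parameter here is an 
assumption, so please check the referenced test for the exact syntax:

(tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616?name=self,tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616?name=backup)?ha=true&retryInterval=1000&reconnectAttempts=-1

With matching connector names, the clients keep using their own addresses 
instead of the connector URL announced by the backup broker.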

Alternatively, I would suggest taking a look at the leader-follower solution in 
the ArkMQ Operator test-suite:
https://github.com/arkmq-org/activemq-artemis-operator/blob/main/controllers/activemqartemis_rwm_pvc_ha_test.go


[1] https://issues.apache.org/jira/browse/ARTEMIS-3640
[2] https://github.com/apache/activemq-artemis/blob/2.42.0/tests/integration-tests/src/test/java/org/apache/activemq/artemis/tests/integration/cluster/failover/ClientConnectorFailoverTest.java#L313

Regards,
Domenico

On Mon, 29 Sept 2025 at 15:34, Teixeira Pedro (BT-VS/ESW-CSA4) 
<[email protected]> wrote:

> Hello Artemis community!
>
>
>
> We are trying to create a solution using Artemis within Kubernetes.
>
> Our requirement is that if a broker is offline for some reason - 
> network issue or the instance shuts down either gracefully or suddenly 
> - the clients can still produce and consume messages as usual.
>
>
>
> To do that we tried a High Availability solution with Primary + Backup 
> brokers (we tested both replicated and shared storage).
>
> Each broker is a StatefulSet and there's a Kubernetes Service to ensure 
> connectivity to the brokers.
>
> To connect to the primary and backup we use Headless Services
> (https://kubernetes.io/docs/concepts/services-networking/service/#headless-services).
>
> For example,  artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local is 
> the Headless Service to access Pod artemis-ha-primary-0 through 
> Service svc-artemis-ha-primary.
>
>
>
> We tested both on Artemis v2.38.0 and Artemis v2.42.0.
>
>
>
> We tested the solution by
>
>    1. Producing messages using the artemis producer command:
>
>       ./bin/artemis producer --url "(tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616,tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616)?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1" --destination queue://... --user ... --password ... --message-count 10000 --sleep 2 --verbose
>
>
>    2. Crashing the Primary instance (through deletion of the Pod or by
>       scaling down the StatefulSet that runs Artemis)
>
> We expected the client to be able to recover, but that did not happen.
>
>
>
> Analyzing the brokers' logs and console, we can see that there is 
> connectivity between the primary and the backup:
>
>
>
> (primary)
>
> INFO [org.apache.activemq.artemis.core.server] AMQ221035: Primary 
> Server Obtained primary lock
>
>
>
>
>
> (backup)
>
> INFO [org.apache.activemq.artemis.core.server] AMQ221033: ** got 
> backup lock
>
> INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup 
> announced
>
>
>
> DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl]
> Connected with the currentConnectorConfig=TransportConfiguration(name=primary,
> factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory)?port=61616&host=svc-artemis-ha-primary-svc-cluster-local
>
>
>
> But when we crash the primary broker we can see that the backup 
> instance enters a loop of
>
>
>
> ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to 
> create netty connection
>
> java.net.UnknownHostException: svc-artemis-ha-primary.svc.cluster.local
>
>
>
> And never promotes itself to a live broker.
>
>
>
> We were able to make it work only when stopping the primary broker 
> gracefully (artemis stop)
>
>
>
> INFO  [org.apache.activemq.artemis.core.client] AMQ214036: Connection 
> closure to svc-artemis-ha-primary.svc.cluster.local/....:61616 has 
> been detected: AMQ219015: The connection was disconnected because of 
> server shutdown [code=DISCONNECTED]
>
> INFO  [org.apache.activemq.artemis.core.client] AMQ214036: Connection 
> closure to svc-artemis-ha-primary.svc.cluster.local/...:61616 has been 
> detected: AMQ219015: The connection was disconnected because of server 
> shutdown [code=DISCONNECTED]
>
> INFO  [org.apache.activemq.artemis.core.server] AMQ221010: Backup 
> Server is now active
>
>
>
> We'd like to ask if what we are trying to do in the context of 
> Kubernetes is possible, if we have stumbled upon a bug or limitation 
> or if we did something wrong.
>
>
>
> Thank you very much in advance for the help.
>
> I'm available for additional questions if needed.
>
>
>
> Best regards,
>
> *Pedro Teixeira*
>
>
>
> PS: For reference, our broker configurations are as follows:
>
> -- Primary
>
> <connectors>
>     <connector name="self">tcp://artemis-ha-primary-0.svc-artemis-ha.svc.cluster.local:61616</connector>
>     <connector name="backup">tcp://artemis-ha-backup-0.svc-artemis-ha.svc.cluster.local:61616</connector>
> </connectors>
>
>
>
> <ha-policy>
>     <replication>
>         <primary>
>         </primary>
>     </replication>
> </ha-policy>
>
>
>
> <cluster-user>ACTIVEMQ.CLUSTER.ADMIN.USER</cluster-user>
>
> <cluster-password>....</cluster-password>
>
> <cluster-connections>
>
>     <cluster-connection name="artemis-cluster">
>
>         <!-- All addresses -->
>
>         <address></address>
>
>         <connector-ref>self</connector-ref>
>
>
>         <retry-interval>500</retry-interval>
>
>         <use-duplicate-detection>true</use-duplicate-detection>
>
>         <message-load-balancing>OFF</message-load-balancing>
>
>         <max-hops>1</max-hops>
>
>         <static-connectors>
>
>             <connector-ref>backup</connector-ref>
>
>         </static-connectors>
>
>     </cluster-connection>
>
> </cluster-connections>
>
>
>
> -- Backup
>
> <ha-policy>
>     <replication>
>         <backup>
>             <!-- Ensure the backup that has become active restarts as a backup
>                  again after failback; otherwise it will automatically stop and
>                  the Pod will enter CrashLoopBackOff, per
>                  https://activemq.apache.org/components/artemis/documentation/latest/ha.html#failback-with-shared-store -->
>             <allow-failback>true</allow-failback>
>             <restart-backup>true</restart-backup>
>         </backup>
>     </replication>
> </ha-policy>
>
>
>
> <!-- Configure the cluster connection -->
>
> <cluster-user>ACTIVEMQ.CLUSTER.ADMIN.USER</cluster-user>
>
> <cluster-password>...</cluster-password>
>
> <cluster-connections>
>
>     <cluster-connection name="artemis-cluster">
>
>         <!-- All addresses -->
>
>         <address></address>
>
>         <connector-ref>self</connector-ref>
>
>
>         <retry-interval>500</retry-interval>
>
>         <use-duplicate-detection>true</use-duplicate-detection>
>
>         <!-- The goal is to have a primary / backup solution, not load 
> balancing -->
>
>         <message-load-balancing>OFF</message-load-balancing>
>
>         <max-hops>1</max-hops>
>
>         <!-- load balancing only to other Artemis brokers directly connected
>              to this server -->
>
>         <static-connectors>
>
>             <connector-ref>primary</connector-ref>
>
>         </static-connectors>
>
>     </cluster-connection>
>
> </cluster-connections>
>