Thank you for your response.
At the moment we are not looking to change our cluster topology to a mirror.
Primary/backup with shared storage has been sufficient for a very long time. We
understand that this doesn't protect against a full shared-storage failure. We
just want to understand how to configure it correctly to protect against
cluster node or network failures.
--
Best Regards,
Vilius
From: Clebert Suconic <[email protected]>
Sent: Monday, March 9, 2026 5:28 AM
To: [email protected]
Cc: Vilius Šumskas <[email protected]>
Subject: Re: error indication when cluster shared storage is not available
Could using a mirror and a lock coordinator be an alternative here?
Clebert Suconic
On Sun, Mar 8, 2026 at 5:41 PM Vilius Šumskas via users
<[email protected]<mailto:[email protected]>> wrote:
Yes. I spent a couple of hours over the weekend testing Artemis behaviour with
soft vs. hard NFS mounts, plus some other scenarios. Here are the results.
The first test changed the NFS IP allowlist by modifying the exports file.
Since this was done on the NFS server side, the test affected both the primary
and the backup Artemis nodes. When the IP allowlist changes, a modern NFS
client reacts by disconnecting the mount point completely. On the primary node
such a disconnect correctly resulted in a Critical IO Error and an automatic
broker shutdown. The backup didn't react to this at all. It only detected that
the primary's connector was down, but it didn't start serving clients, nor did
it produce a Critical IO Error. I'm not sure whether this is correct behaviour.
I suspect it didn't produce IO errors because the broker was not started in
primary mode, but why didn't it even TRY to take over?
You can see the full logs of this test in the attached files
"primary_nfsleveldeny.txt" and "backup_nfsleveldeny.txt".
The second test was performed with the hard NFS mount option, denying outbound
traffic to the NFS server's IP address on the primary node with "firewall-cmd
--direct --add-rule ipv4 filter OUTPUT 0 -d <nfs_server_ip>/32 -j REJECT". The
full logs are in the attached "primary_nfshard.txt" and "backup_nfshard.txt"
files.
* At ~22:13 I logged in to the Artemis Console on the primary.
* At ~22:14 traffic to the shared storage on the primary was denied.
As during our incident, this didn't produce any logs on the primary. No client
login errors, no IO errors, nothing. In the Artemis Console I could even send a
message to ExpiredQueue (though I could not browse it). I could also create a
new address, search for the new address in the Console, and delete it. How is
this even possible? Should I assume some broker operations happen completely in
memory and are not written to the journal until needed? After a while, though,
address search in the Console stopped working. I assume some kind of Console
cache expired. But I could still list consumers and producers.
* At around the same time (22:14) the backup node took over. This is probably
expected even with the hard NFS option, because the Artemis lock on the shared
volume had expired?
* The cluster stayed this way, with the primary producing no errors, until
~22:22, when I removed the firewall rule and outbound traffic from the primary
to the NFS storage could flow again.
This resulted in a lost lock and Critical IO Errors, and the primary finally
shut down.
Maybe it's just me, but I think there should be a way to detect such a stalled
NFS mount and shut down the broker sooner.
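In the meantime, a stalled mount can at least be detected from outside the broker. A minimal watchdog sketch (my own workaround idea, not an Artemis feature; the monitored path and the 5-second bound are placeholders):

```shell
# probe_mount: print "mount OK" if the path answers within 5 seconds,
# otherwise print "mount STALLED or missing" and fail. stat(2) against a
# stalled hard NFS mount would normally block forever; wrapping it in
# timeout(1) bounds the probe so the stall becomes observable.
probe_mount() {
  if timeout 5 stat "$1" > /dev/null 2>&1; then
    echo "mount OK"
  else
    echo "mount STALLED or missing"
    return 1
  fi
}

# e.g. run from cron or a systemd timer against the shared journal directory
probe_mount "${SHARED_MOUNT:-/tmp}"
```

On failure such a probe could raise an alert or stop the broker; that action is left out of the sketch.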
Now to the last test. It was performed with the soft NFS mount option, denying
outbound traffic to the NFS server's IP address on the primary node using the
same firewalld rule. Logs are in the "primary_nfssoft.txt" file.
* At ~22:36 traffic to the shared storage on the primary was denied.
* At ~22:37 this resulted in a lock failure; however, the broker didn't produce
a Critical IO Error and didn't try to shut down automatically until 22:40. I'm
not sure why. On one hand this matches our timeo=600,retrans=2 mount options,
but shouldn't the broker try to shut down right away with "Lost NodeManager
lock" (just like in the previous test, after the mount point came back)? I
could even use the Artemis Console to create and delete an address, which is,
again, strange for a cluster node that has just lost the lock.
* Anyway, at 22:40 it started shutting down: it produced further Critical IO
errors and the Critical Analyzer also kicked in. The broker also produced a
thread dump during the process (attached separately); however, it never
actually shut down completely. I could see the java process still trying to do
something.
* I waited more than 15 minutes, but a full shutdown never occurred.
* In parallel, the backup took over at ~22:36, so no surprises there. However,
since the primary was never fully down, the Artemis client didn't fail over to
the backup.
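Since the Critical Analyzer came up in these tests: its check period, timeout, and action on detection are tunable in broker.xml. A sketch with illustrative values, not our actual configuration (we run the defaults):

```xml
<!-- inside <core> of broker.xml; values are illustrative -->
<critical-analyzer>true</critical-analyzer>
<!-- how often critical components are checked, in ms -->
<critical-analyzer-check-period>60000</critical-analyzer-check-period>
<!-- how long a critical component may stall before action is taken, in ms -->
<critical-analyzer-timeout>120000</critical-analyzer-timeout>
<!-- action on detection: HALT, SHUTDOWN or LOG -->
<critical-analyzer-policy>HALT</critical-analyzer-policy>
```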
Summarizing all the tests, my main concern is data integrity during NFS
incidents, whether on one or both nodes. If the NFS mount is not available,
where does all the message and topology data go? Memory? If so, what happens if
the NFS mount point doesn't come back in time and the broker is killed? Do we
lose all addresses and queues created during that time? This incident and these
tests tell me that the journal is not what I expected it to be.
The tests were performed with the latest Artemis version, 2.52.0, on Rocky
Linux 9.7 with NetApp NFS cloud storage.
P.S. I didn't include the intr mount option in my tests, as it is deprecated
and has been completely ignored by kernels since version 2.6.25. I will prepare
a documentation PR about this shortly.
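For reference, the hard and soft variants I tested correspond to fstab entries roughly like these (server name, export path, and mount point are placeholders):

```
# hard: I/O blocks and retries indefinitely while the server is unreachable
nfsserver:/artemis  /var/lib/artemis/journal  nfs4  hard,sync,timeo=600,retrans=2  0 0
# soft: I/O fails with an error after the retransmission budget is exhausted;
# timeo is in tenths of a second (600 = 60 s), retrans is the retry count
nfsserver:/artemis  /var/lib/artemis/journal  nfs4  soft,sync,timeo=600,retrans=2  0 0
```

With soft,timeo=600,retrans=2 an unreachable server surfaces an I/O error after roughly three minutes of retries, which seems consistent with the ~22:37 to 22:40 gap in the soft-mount test above.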
--
Vilius
-----Original Message-----
From: Justin Bertram <[email protected]>
Sent: Wednesday, March 4, 2026 8:50 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: error indication when cluster shared storage is not available
Any results to share from your testing?
Justin
On Tue, Mar 3, 2026 at 9:27 AM Vilius Šumskas via users
<[email protected]<mailto:[email protected]>> wrote:
>
> Thank you Justin for the explanation.
>
> I guess the ActiveMQBasicSecurityManager part explains why we didn't see any
> "unknown user" errors in the logs. All users were loaded into memory in
> advance. It's strange, though, that clients were trying to connect and could
> not, yet the broker didn't print anything about connection resets or other
> related information. We are using the default logging level, btw. It was not
> the same network issue, because the clients live on a different subnet than
> the storage. We could ping the Artemis nodes from the clients successfully
> during the incident, and fast (millisecond) reconnections at the Qpid level
> indicate that it was not a TCP-level issue.
>
> I just checked the NFS mount recommendations. We are using
> timeo=600,retrans=2; however, we are indeed using the hard option instead of
> soft. I'm going to try to reproduce the issue with both settings to see how
> it behaves. Could you elaborate a bit on why the documentation says that the
> usual NFS advice regarding the soft option and data corruption can be safely
> ignored?
>
> --
> Vilius
>
> -----Original Message-----
> From: Justin Bertram <[email protected]>
> Sent: Tuesday, March 3, 2026 4:43 PM
> To: [email protected]<mailto:[email protected]>
> Subject: Re: error indication when cluster shared storage is not
> available
>
> Since Artemis 2.11.0 [1] the broker will periodically evaluate the shared
> journal file-lock to ensure it hasn't been lost and/or the backup hasn't
> activated. Assuming proper configuration, I would have expected this
> component to shut down the broker in your situation.
> Since it didn't shut down the broker my hunch is that your NFS mount is not
> configured properly. Can you confirm that you're following the NFS mount
> recommendations [2]? I'm specifically thinking about using soft vs. hard.
>
> It's worth noting that the ActiveMQBasicSecurityManager accesses the journal
> only when the broker starts. It reads all user/role information from the
> journal and loads it into memory. The only exception is if an administrator
> uses the management API to add, remove, or update a user, role, etc. at which
> point the broker will write to the journal.
>
> Also, if there is no activity on the broker, the critical analyzer has no
> chance to detect problems.
>
> Based on your description, it sounds like the same network problem that
> caused an issue with NFS might also have prevented clients from connecting to
> the broker.
>
>
> Justin
>
> [1] https://issues.apache.org/jira/browse/ARTEMIS-2421
> [2]
> https://artemis.apache.org/components/artemis/documentation/latest/ha.
> html#nfs-mount-recommendations
>
> On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users
> <[email protected]<mailto:[email protected]>> wrote:
> >
> > Hello,
> >
> >
> >
> > we have a pretty straightforward Artemis HA cluster consisting of 2
> > nodes, a primary and a backup. The cluster uses NFS 4.1 shared storage to
> > store the journal. In addition, we are using ActiveMQBasicSecurityManager
> > for authentication, which means information about Artemis users is kept on
> > the same shared storage.
> >
> >
> >
> > A couple of days ago we had an incident with our shared storage provider.
> > During the incident the storage was fully unreachable network-wise. The
> > interesting part is that during the incident Artemis didn't print any
> > exceptions or errors in the logs. No messages that the journal could not be
> > reached, no messages about failing to reach the backup, even though the
> > backup was experiencing the same issue with the storage. External AMQP
> > client connections also didn't result in the usual "unknown user" warnings
> > in the logs, even though on the client side the Qpid clients constantly
> > printed "cannot connect" errors. It was as if the broker instances were
> > unreachable by the clients, while inside the broker all processes had
> > simply stopped, hanging and waiting for the storage.
> >
> > The critical analyzer also didn't kick in for some reason. It usually
> > works very well for us when the same NFS storage slows down considerably,
> > but not this time.
> >
> >
> >
> > Only after I completely restarted the primary VM node, and it could not
> > mount the NFS storage at all (after waiting 3 minutes for a timeout during
> > the restart), did Artemis boot and start producing IOExceptions, "unknown
> > user" errors, "connection failed to backup node" errors, and every other
> > possible error related to an unreachable journal, as expected.
> >
> >
> >
> > Is the silence in the logs due to unreachable NFS storage a bug? If so,
> > what would the developers need for a reproducible case? As I said, there is
> > nothing in the logs at the moment, but I could try to reproduce it in a
> > testing environment with any combination of debugging properties if needed.
> >
> >
> >
> > If it's not a bug, how should we ensure proper alerting (and possibly an
> > automatic Artemis shutdown) in case the shared storage is down? Are we
> > missing some NFS mount option or critical analyzer setting, maybe?
> > Currently we are using the defaults.
> >
> >
> >
> > Any pointers are much appreciated!
> >
> >
> >
> > --
> >
> > Best Regards,
> >
> >
> >
> > Vilius Šumskas
> >
> > Rivile
> >
> > IT manager
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]