Thank you for your response.

At the moment we are not looking to change our cluster topology to mirroring. 
Primary/backup with shared storage has been enough for a very long time. We 
understand that this doesn't protect against a full shared-storage failure. We 
just want to understand how to configure it correctly, to protect against 
cluster node or network failures.
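
For context, the relevant part of our setup is the standard shared-store HA 
policy. A minimal sketch of the broker.xml sections involved (element names as 
in the current Artemis HA documentation; the values here are illustrative, not 
our exact configuration):

```xml
<!-- primary node: shared-store HA; shut down cleanly so the backup can activate -->
<ha-policy>
   <shared-store>
      <primary>
         <failover-on-shutdown>true</failover-on-shutdown>
      </primary>
   </shared-store>
</ha-policy>

<!-- backup node: let the primary take over again once it returns -->
<ha-policy>
   <shared-store>
      <backup>
         <allow-failback>true</allow-failback>
      </backup>
   </shared-store>
</ha-policy>
```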

--
   Best Regards,
    Vilius

From: Clebert Suconic <[email protected]>
Sent: Monday, March 9, 2026 5:28 AM
To: [email protected]
Cc: Vilius Šumskas <[email protected]>
Subject: Re: error indication when cluster shared storage is not available

Using mirror and lock coordinator could be an alternative here?

Clebert Suconic


On Sun, Mar 8, 2026 at 5:41 PM Vilius Šumskas via users 
<[email protected]> wrote:
Yes. I spent a couple of hours over the weekend testing Artemis behaviour with 
soft vs. hard NFS mounts, plus some other scenarios. These are the results.

The first test was done by changing the NFS IP allowlist in the exports file. 
Since this was done on the NFS server side, the test affected both the primary 
and the backup Artemis node. When the IP allowlist is changed, modern NFS 
clients react by disconnecting the mount point completely. On the primary node 
such a disconnect correctly resulted in a Critical IO Error and an automatic 
broker shutdown. The backup didn't react to this at all. It only detected that 
the primary's connector was down, but it didn't start serving clients, nor did 
it produce a Critical IO Error. I'm not sure if this is correct behaviour. I 
suspect it didn't produce IO errors because the broker was not started in 
primary mode, but why didn't it at least TRY to take over?
You can see the full logs of this test in the attached files 
"primary_nfsleveldeny.txt" and "backup_nfsleveldeny.txt".

The second test was performed with the hard NFS mount option by denying 
outbound traffic to the NFS server's IP address on the primary node with 
"firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d <nfs_server_ip>/32 
-j REJECT". Full logs are in the attached "primary_nfshard.txt" and 
"backup_nfshard.txt" files.
* At ~22:13 I logged in to the Artemis Console on the primary.
* At ~22:14 traffic to the shared storage on the primary was denied.
As during our incident, this produced no log entries on the primary. No client 
login errors, no IO errors, nothing. In the Artemis Console I could even send a 
message to ExpiredQueue (though I could not browse it). I could also create a 
new address, search for it in the Console, and delete it. How is this even 
possible? Should I assume some broker operations happen completely in memory 
and are not written to the journal until needed? After a while, though, address 
search in the Console stopped working. I assume some kind of Console cache 
expired. But I could still list consumers and producers.
* At around the same time (22:14) the backup node took over. This is probably 
expected even with the hard NFS option, because the Artemis lock on the shared 
volume had expired?
* The cluster stayed this way, with the primary producing no errors, until 
~22:22, when I removed the firewall rule and outbound traffic from the primary 
to the NFS storage could flow again.
This resulted in a lost lock and Critical IO Errors, and the primary finally 
shut down.
Maybe it's just me, but I think there should be a way to detect such a stalled 
NFS mount and shut the broker down sooner.

Now to the last test. It was performed with the soft NFS mount option by 
denying outbound traffic to the NFS server's IP address on the primary node, 
using the same firewalld rule. Logs are in the "primary_nfssoft.txt" file.
* At ~22:36 traffic to the shared storage on the primary was denied.
* At ~22:37 this resulted in a lock failure; however, the broker didn't produce 
a Critical IO Error and didn't try to shut down automatically until 22:40. I'm 
not sure why. On one hand this matches our timeo=600,retrans=2 mount options, 
but shouldn't the broker try to shut down right away with "Lost NodeManager 
lock" (just like in the previous test, after the mount point came back)? I 
could even use the Artemis Console and create or delete an address, which is, 
again, strange for a cluster node that has just lost the lock.
* Anyway, at 22:40 it started shutting down; it produced other Critical IO 
errors, and the Critical Analyzer also kicked in. The broker also produced a 
thread dump during the process (attached separately); however, it never 
actually shut down completely. I could see the java process still trying to do 
something.
* I waited more than 15 minutes, but a full shutdown never occurred.
* In parallel, the backup took over at ~22:36, so no surprises there. However, 
since the primary was not fully down, the Artemis client didn't fail over to 
the backup.
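
As a back-of-the-envelope check on that 22:37 → 22:40 gap: if I read nfs(5) 
correctly, a soft mount retries with linear backoff and declares a major 
timeout after retrans retransmissions. A small sketch of what that gives for 
our options (my reading of the backoff rule is an assumption, so treat the 
result as approximate):

```python
# Rough estimate of how long a soft NFS mount retries before failing I/O.
# Assumption: linear backoff (each retry waits one timeo interval longer than
# the previous one), with a major timeout after `retrans` retransmissions.
timeo_ds = 600   # our timeo mount option, in tenths of a second
retrans = 2      # our retrans mount option

timeo_s = timeo_ds / 10                                       # 60.0 seconds
total_s = sum(timeo_s * (i + 1) for i in range(retrans + 1))  # 60 + 120 + 180
print(total_s / 60)  # prints 6.0 (minutes)
```

That's in the same ballpark as (though not identical to) the few minutes we 
observed, so the delayed shutdown being driven by the mount options seems 
plausible.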

Summarizing all the tests, my main concern is data integrity during NFS 
incidents, be it on one or both nodes. If the NFS mount is unavailable, where 
does all the message and topology data go? Memory? If so, what happens if the 
NFS mount point doesn't come back in time and the broker is killed? Do we lose 
all addresses or queues created during that time? This incident and these 
tests tell me that the journal is not what I expected it to be.

The tests were performed with the latest Artemis version, 2.52.0, on Rocky 
Linux 9.7 with NetApp NFS cloud storage.

P.S. I didn't include the intr mount option in my tests, as it is deprecated 
and completely ignored by kernels above version 2.6.25. I will prepare a 
documentation PR about this shortly.
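
For anyone following along, the two mount configurations I compared boil down 
to the following (hypothetical commands; the server name and paths are 
placeholders, the options are the ones we actually use):

```shell
# hard: I/O blocks and retries indefinitely while the server is unreachable
mount -t nfs4 -o hard,timeo=600,retrans=2 nfs.example.com:/artemis /var/lib/artemis/data

# soft: I/O fails with an error after `retrans` retries spaced by `timeo` (deciseconds)
mount -t nfs4 -o soft,timeo=600,retrans=2 nfs.example.com:/artemis /var/lib/artemis/data
```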

--
    Vilius

-----Original Message-----
From: Justin Bertram <[email protected]>
Sent: Wednesday, March 4, 2026 8:50 PM
To: [email protected]
Subject: Re: error indication when cluster shared storage is not available

Any results to share from your testing?


Justin

On Tue, Mar 3, 2026 at 9:27 AM Vilius Šumskas via users 
<[email protected]> wrote:
>
> Thank you Justin for the explanation.
>
> I guess the ActiveMQBasicSecurityManager part explains why we didn't see any 
> "unknown user" errors in the logs. All users were loaded into memory in 
> advance. It's strange, though, that clients were trying to connect, could not 
> do so, and yet the broker didn't print anything about connection resets or 
> other related information. We are using the default logging level, btw. It 
> was not the same network issue, because the clients live on a different 
> subnet than the storage. We could ping the Artemis nodes from the clients 
> successfully during the incident, and the fast (millisecond) reconnections at 
> the Qpid level indicate that it was not a TCP-level issue.
>
> I just checked the NFS mount recommendations. We are using 
> timeo=600,retrans=2; however, we are indeed using the hard option instead of 
> soft. I'm going to try to reproduce the issue with both settings to see how 
> it behaves. Could you elaborate a bit on why the documentation says the NFS 
> warning about the soft option and data corruption can be safely ignored?
>
> --
>     Vilius
>
> -----Original Message-----
> From: Justin Bertram <[email protected]>
> Sent: Tuesday, March 3, 2026 4:43 PM
> To: [email protected]
> Subject: Re: error indication when cluster shared storage is not available
>
> Since Artemis 2.11.0 [1] the broker will periodically evaluate the shared 
> journal file-lock to ensure it hasn't been lost and/or the backup hasn't 
> activated. Assuming proper configuration, I would have expected this 
> component to shut down the broker in your situation.
> Since it didn't shut down the broker my hunch is that your NFS mount is not 
> configured properly. Can you confirm that you're following the NFS mount 
> recommendations [2]? I'm specifically thinking about using soft vs. hard.
>
> It's worth noting that the ActiveMQBasicSecurityManager accesses the journal 
> only when the broker starts. It reads all user/role information from the 
> journal and loads it into memory. The only exception is if an administrator 
> uses the management API to add, remove, or update a user, role, etc. at which 
> point the broker will write to the journal.
>
> Also, if there is no activity on the broker, the critical analyzer has no 
> chance to detect problems.
>
> Based on your description, it sounds like the same network problem that 
> caused an issue with NFS might also have prevented clients from connecting to 
> the broker.
>
>
> Justin
>
> [1] https://issues.apache.org/jira/browse/ARTEMIS-2421
> [2]
> https://artemis.apache.org/components/artemis/documentation/latest/ha.
> html#nfs-mount-recommendations
>
> On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users 
> <[email protected]> wrote:
> >
> > Hello,
> >
> >
> >
> > we have a pretty straightforward Artemis HA cluster consisting of 2 
> > nodes, a primary and a backup. The cluster uses NFS 4.1 shared storage to 
> > store the journal. In addition, we are using ActiveMQBasicSecurityManager 
> > for authentication, which means information about Artemis users is on the 
> > same shared storage.
> >
> >
> >
> > A couple of days ago we had an incident with our shared storage provider. 
> > During the incident the storage was fully unreachable network-wise. The 
> > interesting part is that during the incident Artemis didn't print any 
> > exceptions or errors in the logs. No messages that the journal could not 
> > be reached, no messages about failure to reach the backup, even though the 
> > backup was also experiencing the same issue with the storage. External 
> > AMQP client connections also didn't result in the usual "unknown user" 
> > warnings in the logs, even though on the client side the Qpid clients 
> > constantly printed "cannot connect" errors. It was as if the broker 
> > instances were unreachable by the clients, while inside the broker all 
> > processes had just stopped, hanging and waiting for the storage.
> >
> > The critical analyzer also didn't kick in for some reason. It usually 
> > works very well for us when the same NFS storage slows down considerably, 
> > but not this time.
> >
> >
> >
> > Only after I completely restarted the primary VM node, and it could not 
> > mount the NFS storage at all (after waiting 3 minutes for a timeout during 
> > the restart), did Artemis boot and start producing IOExceptions, "unknown 
> > user" errors, "connection failed to backup node" errors, and every other 
> > possible error related to an unreachable journal, as expected.
> >
> >
> >
> > Is the silence in the logs due to unreachable NFS storage a bug? If so, 
> > what do the developers need for a reproducible case? As I said, there is 
> > nothing in the logs at the moment, but I could try to reproduce it in a 
> > testing environment with any combination of debugging properties if 
> > needed.
> >
> >
> >
> > If it's not a bug, how should we ensure proper alerting (and possibly an 
> > automatic Artemis shutdown) in case the shared storage is down? Are we 
> > missing some NFS mount option or critical analyzer setting, maybe? 
> > Currently we are using the defaults.
> >
> >
> >
> > Any pointers are much appreciated!
> >
> >
> >
> > --
> >
> >    Best Regards,
> >
> >
> >
> >     Vilius Šumskas
> >
> >     Rivile
> >
> >     IT manager
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
