> First test I made was by changing NFS IP allowlist by modifying exports 
> file...I suspect it didn't produce IO errors since the broker was not started 
> in primary mode, but why it didn't TRY to take over?

I imagine it didn't take over because it couldn't acquire the lock.
This is the correct behavior in my opinion.

> Second test was performed with hard NFS mount option...Maybe it's just me, 
> but I think there should be a way to detect such a stalled NFS mount and 
> shut down the broker sooner.

What you described in this test is more or less what I would expect --
bizarre behavior. Due to the hard mount all disk operations will
simply freeze. Operations happening in memory will continue to work
fine. When NFS connectivity is restored the primary will realize that
it has lost the lock and shut itself down. Again, this is why we
strongly recommend against using hard mounts. It might be possible to
detect such a stalled NFS mount, but since we don't recommend using a
hard mount in the first place there is little motivation to do so. The
whole point of the hard mount is to block the process using it until
functionality is restored.
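
For reference, a soft mount using the timeo/retrans values mentioned later in
this thread might look like the fstab entry below. This is illustrative only:
the server name and mount point are placeholders, and the full recommended
option list is in the NFS mount recommendations section of the HA docs.

```
# /etc/fstab (illustrative): soft mount so I/O fails with an error
# instead of blocking the broker forever
nfs-server:/artemis-journal  /var/lib/artemis  nfs4  soft,timeo=600,retrans=2  0 0
```

With soft, a stalled server eventually surfaces as an I/O error the broker can
act on, rather than an indefinitely blocked syscall.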

> Now to the last test. It was performed with soft NFS mount option...

A bug appears to exist here as the broker should have shut down at the
first IOException when attempting to access the lock file. I need to
see a thread-dump from the broker to investigate why it didn't shut
down completely.
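
In case it helps for next time, a thread dump can be captured from a running
(or hung) broker with standard JDK tooling; `<pid>` below is a placeholder for
the broker's Java process ID:

```
jstack -l <pid> > artemis-threads.txt   # or: kill -3 <pid>, which prints the
                                        # dump to the broker's console/log
```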

> If NFS mount is not available, where does all the message and topology data 
> go?

For producers, the broker will attempt to write durable messages to
disk. If it cannot, it will return an error to the client so the
client knows the message needs to be sent again.

For consumers, the broker will attempt to write acknowledgements for
durable messages to disk. If it cannot, it will return an error to the
client so the client knows the message needs to be consumed again.

If the broker is killed during operations involving durable messages,
the client will receive an error, indicating that the operation it was
performing was likely unsuccessful. I say "likely" because the
operation might have succeeded, and the broker could have been killed
before responding to the client. Of course, there are ways to deal
with this situation as well.
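
One common way to deal with that ambiguity is duplicate detection: the client
attaches a stable ID to every message and blindly resends when in doubt, and
the broker drops IDs it has already processed (Artemis itself supports this
via a duplicate-detection message property). The sketch below is illustrative
only; `DedupSketch` and `accept` are hypothetical names, not broker API:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of broker-side duplicate detection: each message carries a stable
// ID chosen by the client; a resent duplicate is accepted exactly once.
// Hypothetical names, not the Artemis API.
public class DedupSketch {
    private final Set<String> seenIds = new HashSet<>();

    // Returns true if this ID is new (message processed), false if duplicate.
    boolean accept(String duplicateId) {
        return seenIds.add(duplicateId);
    }

    public static void main(String[] args) {
        DedupSketch broker = new DedupSketch();
        System.out.println(broker.accept("order-42")); // prints: true
        System.out.println(broker.accept("order-42")); // prints: false (blind resend is safe)
    }
}
```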

Any operation involving non-durable messages occurs in memory unless
paging is involved. The guarantees for non-durable messages are, of
course, completely different from those for durable messages. They
are, by definition, volatile.

I'm not really sure what you mean by "topology data," but if you're
using "topology" to mean the runtime information about cluster nodes,
as the broker does, then all that data exists only in memory. Please
clarify if that's not what you mean.

In any case, if the broker cannot reach the journal, for any reason,
there should be no data integrity issues assuming the clients are
written properly.

> Do we lose all addresses or queues created during that time?

Any operation that writes data to the journal should wait until the
write successfully finishes before returning a result to the client.
If a client doesn't receive a successful result, it can safely
conclude the operation was unsuccessful. The client can handle this
result in whatever way best suits its use-case.
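
That write-then-reply ordering can be illustrated with a tiny sketch.
`JournalSketch` and `createQueue` are hypothetical, not broker internals, but
`FileChannel.force(true)` is the standard JDK way to block until data is
durably on disk:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of "write to the journal, wait for it to hit disk, THEN answer the
// client". Hypothetical names; only the ordering matters.
public class JournalSketch {

    // Append one journal record and force it to disk before returning a result.
    static String createQueue(Path journal, String queueName) throws IOException {
        try (FileChannel ch = FileChannel.open(journal,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            byte[] record = ("createQueue:" + queueName + "\n")
                    .getBytes(StandardCharsets.UTF_8);
            ch.write(ByteBuffer.wrap(record));
            ch.force(true); // block until the bytes are durably on disk
        }
        return "OK " + queueName; // the client only ever sees this AFTER the force
    }

    public static void main(String[] args) throws IOException {
        Path journal = Files.createTempFile("journal", ".log");
        System.out.println(createQueue(journal, "orders")); // prints: OK orders
    }
}
```

Because the result is only produced after `force` returns, a client that never
receives a result can safely treat the operation as having failed.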

> This incident and these tests tell me that the journal is not what I 
> expected it to be.

I'm not really sure what you mean by that.


Justin


On Sun, Mar 8, 2026 at 4:40 PM Vilius Šumskas via users
<[email protected]> wrote:
>
> Yes. I spent a couple of hours over the weekend testing Artemis behaviour 
> with soft vs. hard NFS mounts, plus some other tests. These are the results.
>
> The first test I made was changing the NFS IP allowlist by modifying the 
> exports file. Since this was done on the NFS server side, the test affected 
> both the primary and backup Artemis nodes. When the IP allowlist is changed, 
> modern NFS clients react by disconnecting the mount point completely. On the 
> primary node such a disconnect correctly resulted in a Critical IO Error and 
> an automatic broker shutdown. The backup didn't react to this at all. It 
> only detected that the primary's connector was down, but didn't start 
> serving clients, nor did it produce a Critical IO Error. Not sure if this is 
> correct behaviour. I suspect it didn't produce IO errors since the broker 
> was not started in primary mode, but why didn't it TRY to take over?
> You can see the full logs of this test in the attached files 
> "primary_nfsleveldeny.txt" and "backup_nfsleveldeny.txt".
>
> The second test was performed with the hard NFS mount option by denying 
> outbound traffic to the NFS server's IP address on the primary node with 
> "firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d <nfs_server_ip>/32  -j REJECT". 
> Full logs are in the attached "primary_nfshard.txt" and "backup_nfshard.txt" 
> files.
> * At ~22:13 I logged in to the Artemis Console on the primary.
> * At ~22:14 the traffic to the shared storage on the primary was denied.
> As in our incident, this didn't produce any logs on the primary. No client 
> login errors, no IO errors, nothing. In the Artemis Console I could even 
> send a message to ExpiredQueue (though I could not browse it). I could also 
> create a new address, search for the new address in the Console, and delete 
> it. How is this even possible? Should I assume some broker operations happen 
> completely in memory and are not written to the journal until needed? 
> Though, after a while, address search in the Console stopped working. I 
> assume some kind of Console cache expired. But I could still list consumers 
> and producers.
> * At around the same time (22:14) the backup node took over. This is 
> probably expected even with the hard NFS option because the Artemis lock on 
> the shared volume had expired?
> * The cluster stayed this way, with the primary producing no errors, until 
> ~22:22 when I removed the firewall rule and outbound traffic from the 
> primary to the NFS storage could flow again.
> This resulted in a lost lock and Critical IO Errors, and the primary finally 
> shut down.
> Maybe it's just me, but I think there should be a way to detect such a 
> stalled NFS mount and shut down the broker sooner.
>
> Now to the last test. It was performed with soft NFS mount option by denying 
> outbound traffic to NFS server's IP address on primary node using the same 
> firewalld rule. Logs are in "primary_nfssoft.txt" file.
> * At ~22:36 the traffic to shared storage on primary was denied.
> * At ~22:37 this resulted in a lock failure, however the broker didn't 
> produce a Critical IO Error and didn't try to shut down automatically until 
> 22:40. Not sure why. On one hand, this matches our timeo=600,retrans=2 mount 
> options, but shouldn't the broker try to shut down right away with "Lost 
> NodeManager lock" (just like in the previous test after the mount point came 
> back)? I could even use the Artemis Console and create or delete an address. 
> Which is, again, strange for a cluster node which had just lost the lock.
> * Anyway, at 22:40 it was shutting down, producing other Critical IO 
> errors, and the Critical Analyzer also kicked in. The broker also produced a 
> thread dump during the process (attached separately), however it never 
> actually shut down completely. I could see the java process trying to do 
> something.
> * I waited more than 15 minutes, but a full shutdown never occurred.
> * In parallel, the backup took over at ~22:36, so no surprises there. 
> However, since the primary was not fully down, the Artemis client didn't 
> fail over to the backup.
>
> Summarizing all the tests, my main concern is data integrity during NFS 
> incidents, be it on one or both nodes. If the NFS mount is not available, 
> where does all the message and topology data go? Memory? If yes, what 
> happens if the NFS mount point doesn't come back in time and the broker is 
> killed? Do we lose all addresses or queues created during that time? This 
> incident and these tests tell me that the journal is not what I expected it 
> to be.
>
> Tests were performed with the latest Artemis version, 2.52.0, on Rocky 
> Linux 9.7 and NetApp NFS cloud storage.
>
> P.S. I didn't include the intr mount option in my tests, as it is 
> deprecated and completely ignored by kernels above version 2.6.25. I will 
> prepare a PR about this for the documentation shortly.
>
> --
>     Vilius
>
> -----Original Message-----
> From: Justin Bertram <[email protected]>
> Sent: Wednesday, March 4, 2026 8:50 PM
> To: [email protected]
> Subject: Re: error indication when cluster shared storage is not available
>
> Any results to share from your testing?
>
>
> Justin
>
> On Tue, Mar 3, 2026 at 9:27 AM Vilius Šumskas via users 
> <[email protected]> wrote:
> >
> > Thank you Justin for the explanation.
> >
> > I guess the ActiveMQBasicSecurityManager part explains why we didn't see 
> > any "unknown user" errors in the logs. All users were loaded into memory 
> > in advance. It's strange though that clients were trying to connect and 
> > could not, but the broker didn't print anything about connection resets 
> > or other related information. We are using the default logging level, 
> > btw. It was not the same network issue, because the clients live on a 
> > different subnet than the storage. We could ping the Artemis nodes from 
> > the clients successfully during the incident, and the fast (millisecond) 
> > reconnections at the Qpid level indicate that it was not a TCP-level 
> > issue.
> >
> > I just checked the NFS mount recommendations. We are using 
> > timeo=600,retrans=2, however we are indeed using the hard instead of the 
> > soft option. I'm going to try to reproduce the issue with both settings 
> > to see how it behaves. Could you elaborate a bit on why the documentation 
> > says that the usual NFS recommendation regarding the soft option and data 
> > corruption can be safely ignored?
> >
> > --
> >     Vilius
> >
> > -----Original Message-----
> > From: Justin Bertram <[email protected]>
> > Sent: Tuesday, March 3, 2026 4:43 PM
> > To: [email protected]
> > Subject: Re: error indication when cluster shared storage is not
> > available
> >
> > Since Artemis 2.11.0 [1] the broker will periodically evaluate the shared 
> > journal file-lock to ensure it hasn't been lost and/or the backup hasn't 
> > activated. Assuming proper configuration, I would have expected this 
> > component to shut down the broker in your situation.
> > Since it didn't shut down the broker my hunch is that your NFS mount is not 
> > configured properly. Can you confirm that you're following the NFS mount 
> > recommendations [2]? I'm specifically thinking about using soft vs. hard.
> >
> > It's worth noting that the ActiveMQBasicSecurityManager accesses the 
> > journal only when the broker starts. It reads all user/role information 
> > from the journal and loads it into memory. The only exception is if an 
> > administrator uses the management API to add, remove, or update a user, 
> > role, etc. at which point the broker will write to the journal.
> >
> > Also, if there is no activity on the broker, the critical analyzer has no 
> > chance to detect problems.
> >
> > Based on your description, it sounds like the same network problem that 
> > caused an issue with NFS might also have prevented clients from connecting 
> > to the broker.
> >
> >
> > Justin
> >
> > [1] https://issues.apache.org/jira/browse/ARTEMIS-2421
> > [2]
> > https://artemis.apache.org/components/artemis/documentation/latest/ha.
> > html#nfs-mount-recommendations
> >
> > On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users 
> > <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > >
> > >
> > > we have a pretty straightforward Artemis HA cluster consisting of 2 
> > > nodes, a primary and a backup. The cluster uses NFS 4.1 shared storage 
> > > to store the journal. In addition, we are using 
> > > ActiveMQBasicSecurityManager for authentication, which means 
> > > information about Artemis users is on the same shared storage.
> > >
> > >
> > >
> > > A couple of days ago we had an incident with our shared storage 
> > > provider. During the incident the storage was fully unreachable 
> > > network-wise. The interesting part is that during the incident Artemis 
> > > didn’t print any exceptions or any errors in the logs. No messages that 
> > > the journal could not be reached, no messages about failure to reach 
> > > the backup, even though the backup was also experiencing the same issue 
> > > with the storage. External AMQP client connections also didn’t result 
> > > in the usual “unknown users” warning in the logs, even though on the 
> > > client side the Qpid clients constantly printed “cannot connect” 
> > > errors. As if the broker instances were unreachable by the clients, 
> > > while inside the broker all processes just stopped, hanging and 
> > > waiting for the storage.
> > >
> > > The critical analyzer also didn’t kick in for some reason. Usually it 
> > > works very well for us when the same NFS storage slows down 
> > > considerably, but not this time.
> > >
> > >
> > >
> > > Only after I completely restarted the primary VM node, and it could not 
> > > mount the NFS storage at all (after waiting 3 minutes for the timeout 
> > > during restart), did Artemis boot and start producing IOExceptions, 
> > > “unknown user” errors, “connection failed to backup node” errors, and 
> > > every other possible error related to the unreachable journal, as 
> > > expected.
> > >
> > >
> > >
> > > Is the silence in the logs due to unreachable NFS storage a bug? If 
> > > so, what do developers need for a reproducible case? As I said, there 
> > > is nothing in the logs at the moment, but I could try to reproduce it 
> > > in a testing environment with any combination of debugging properties 
> > > if needed.
> > >
> > >
> > >
> > > If it’s not a bug, how should we ensure proper alerting (and possibly 
> > > automatic Artemis shutdown) in case the shared storage is down? Are we 
> > > missing some NFS mount option or critical analyzer setting, maybe? 
> > > Currently we are using the defaults.
> > >
> > >
> > >
> > > Any pointers are much appreciated!
> > >
> > >
> > >
> > > --
> > >
> > >    Best Regards,
> > >
> > >
> > >
> > >     Vilius Šumskas
> > >
> > >     Rivile
> > >
> > >     IT manager
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
> >
>
>
>

