RE: error indication when cluster shared storage is not available

Vilius Šumskas via users Tue, 03 Mar 2026 07:27:21 -0800

Thank you Justin for the explanation.

I guess the ActiveMQBasicSecurityManager part explains why we didn't see any 
"unknown user" errors in the logs: all users were loaded into memory in 
advance. It's strange, though, that clients were trying to connect and could 
not, yet the broker didn't print anything about connection resets or other 
related information. We are using the default logging level, btw. It was not 
the same network issue, because the clients live on a different subnet than 
the storage. We could ping the Artemis nodes from the clients successfully 
during the incident, and the fast (millisecond) reconnections at the Qpid 
level indicate that it was not a TCP-level issue.

I just checked the NFS mount recommendations. We are using timeo=600,retrans=2; 
however, we are indeed using the hard option instead of soft. I'm going to try 
to reproduce the issue with both settings to see how the broker behaves. Could 
you elaborate a bit on why the documentation says that the NFS recommendation 
regarding the soft option and data corruption can be safely ignored?
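
To compare the options actually in effect on each node (rather than what is 
in fstab), something like the following could help. It is only a minimal 
Python sketch, not anything from Artemis: it assumes a Linux host where the 
text comes from /proc/mounts, and the function name, sample line, and mount 
point /mnt/artemis-data are made up for illustration.

```python
def nfs_mount_options(mounts_text, mount_point):
    """Return the options of an NFS mount as a dict, or None if not found.

    mounts_text is expected to look like /proc/mounts on Linux, where each
    line has the form: device mount_point fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        _device, point, fstype, options = fields[:4]
        if point != mount_point or not fstype.startswith("nfs"):
            continue
        parsed = {}
        for opt in options.split(","):
            key, _, value = opt.partition("=")
            # Flags such as "hard" have no value; store True for them.
            parsed[key] = value or True
        return parsed
    return None


# Illustrative sample line (hypothetical server and mount point).
sample = (
    "storage:/export /mnt/artemis-data nfs4 "
    "rw,relatime,vers=4.1,hard,timeo=600,retrans=2 0 0\n"
)
opts = nfs_mount_options(sample, "/mnt/artemis-data")
mode = "soft" if "soft" in opts else "hard" if "hard" in opts else "unknown"
print(mode, opts.get("timeo"), opts.get("retrans"))
```

On a real node the text would be read from /proc/mounts instead of the 
sample string.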

-- 
    Vilius

-----Original Message-----
From: Justin Bertram <[email protected]> 
Sent: Tuesday, March 3, 2026 4:43 PM
To: [email protected]
Subject: Re: error indication when cluster shared storage is not available

Since Artemis 2.11.0 [1] the broker will periodically evaluate the shared 
journal file-lock to ensure it hasn't been lost and/or the backup hasn't 
activated. Assuming proper configuration, I would have expected this component 
to shut down the broker in your situation.
Since it didn't shut down the broker, my hunch is that your NFS mount is not 
configured properly. Can you confirm that you're following the NFS mount 
recommendations [2]? I'm specifically thinking about using soft vs. hard.

It's worth noting that the ActiveMQBasicSecurityManager accesses the journal 
only when the broker starts. It reads all user/role information from the 
journal and loads it into memory. The only exception is if an administrator 
uses the management API to add, remove, or update a user, role, etc. at which 
point the broker will write to the journal.

Also, if there is no activity on the broker, the critical analyzer has no 
chance to detect problems.
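
For alerting that does not depend on broker activity, one option is an 
external probe that touches the shared storage on a schedule. The following 
is only a rough Python sketch of that idea, not an Artemis feature: the 
function name, the probe file name, and the timeout are assumptions. It also 
assumes the probe runs on the broker host against the journal's mount point.

```python
import os
import threading
import time


def probe_storage(directory, timeout_s=10.0):
    """Try a small write+fsync in `directory`; True if it finishes in time.

    The write runs in a daemon thread because a thread blocked on a hard NFS
    mount may never return; the caller waits at most `timeout_s` seconds.
    """
    result = {"ok": False}

    def _write():
        # Hypothetical probe file name, chosen for illustration only.
        path = os.path.join(directory, ".storage-probe")
        try:
            with open(path, "w") as f:
                f.write(str(time.time()))
                f.flush()
                os.fsync(f.fileno())  # force the write out to the storage
            result["ok"] = True
        except OSError:
            result["ok"] = False

    worker = threading.Thread(target=_write, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # A still-running worker means the I/O is hanging, which counts as failure.
    return (not worker.is_alive()) and result["ok"]
```

A scheduler (cron, a systemd timer, a monitoring agent) could call 
probe_storage() and raise an alert when it returns False. Note that on a hard 
mount each hung probe leaves a stuck thread behind, so the calling process 
should be short-lived.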

Based on your description, it sounds like the same network problem that caused 
an issue with NFS might also have prevented clients from connecting to the 
broker.

Justin

[1] https://issues.apache.org/jira/browse/ARTEMIS-2421
[2] 
https://artemis.apache.org/components/artemis/documentation/latest/ha.html#nfs-mount-recommendations

On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users 
<[email protected]> wrote:
>
> Hello,
>
>
>
> we have a pretty straightforward Artemis HA cluster consisting of 2 nodes, 
> a primary and a backup. The cluster uses NFSv4.1 shared storage to store the 
> journal. In addition, we are using ActiveMQBasicSecurityManager for 
> authentication, which means information about Artemis users is on the same 
> shared storage.
>
>
>
> A couple of days ago we had an incident with our shared storage provider. 
> During the incident the storage was fully unreachable network-wise. The 
> interesting part is that during the incident Artemis didn’t print any 
> exceptions or errors in the logs. No messages that the journal could not be 
> reached, no messages about failure to reach the backup, even though the 
> backup was also experiencing the same issue with the storage. External AMQP 
> client connections also didn’t result in the usual warning in the logs for 
> “unknown users”, even though on the client side Qpid clients constantly 
> printed “cannot connect” errors. It was as if the broker instances were 
> unreachable by the clients while inside the broker all processes just 
> stopped, hanging and waiting for the storage.
>
> The critical analyzer also didn’t kick in for some reason. Usually it works 
> very well for us when the same NFS storage slows down considerably, but not 
> this time.
>
>
>
> Only after I completely restarted the primary VM node, and it could not mount 
> the NFS storage at all (after waiting 3 minutes to time out during restart), 
> did Artemis boot and start producing IOExceptions, “unknown user” errors, 
> “connection failed to backup node” errors, and every other possible error 
> related to an unreachable journal, as expected.
>
>
>
> Is the silence in the logs when NFS storage is unreachable a bug? If so, what 
> do the developers need for a reproducible case? As I said, there is nothing in 
> the logs at the moment, but I could try to reproduce it in a testing 
> environment with any combination of debugging properties if needed.
>
>
>
> If it’s not a bug, how should we ensure proper alerting (and possibly 
> automatic Artemis shutdown) in case the shared storage is down? Are we 
> missing some NFS mount option or critical analyzer setting, maybe? Currently 
> we are using defaults.
>
>
>
> Any pointers are much appreciated!
>
>
>
> --
>
>    Best Regards,
>
>
>
>     Vilius Šumskas
>
>     Rivile
>
>     IT manager
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
