Re: error indication when cluster shared storage is not available

Justin Bertram Tue, 03 Mar 2026 06:43:41 -0800

Since Artemis 2.11.0 [1] the broker will periodically evaluate the
shared journal file-lock to ensure it hasn't been lost and/or the
backup hasn't activated. Assuming proper configuration, I would have
expected this component to shut down the broker in your situation.
Since it didn't shut down the broker my hunch is that your NFS mount
is not configured properly. Can you confirm that you're following the
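NFS mount recommendations [2]? I'm specifically thinking about whether
the mount uses soft or hard. With a hard mount, I/O against an
unreachable server blocks indefinitely instead of returning an error,
which would be consistent with the silence you saw.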
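
If it helps, here is a rough Python sketch for checking which mode a
mount is actually using. It assumes a Linux host where /proc/mounts
lists the effective NFS options. The function name, the sample line,
and the /mnt/artemis-data path are made up for illustration and are
not anything Artemis provides.

```python
# Sample line in /proc/mounts format (hypothetical server and path).
SAMPLE = (
    "nfs.example.com:/export/artemis /mnt/artemis-data nfs4 "
    "rw,relatime,vers=4.1,hard,proto=tcp,timeo=600,retrans=2 0 0\n"
)


def nfs_mount_mode(mounts_text, mount_point):
    """Return 'soft', 'softerr' or 'hard' for an NFS mount, else None.

    mounts_text is the content of /proc/mounts; mount_point is compared
    literally, so paths containing escaped characters are not handled.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        path, fs_type, options = fields[1], fields[2], fields[3]
        if path != mount_point or not fs_type.startswith("nfs"):
            continue
        opts = options.split(",")
        for mode in ("soft", "softerr", "hard"):
            if mode in opts:
                return mode
        # Assumption: per nfs(5), hard is the default when none is listed.
        return "hard"
    return None


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print(nfs_mount_mode(f.read(), "/mnt/artemis-data"))
```

Run against the sample line above, the function reports "hard", which
is the case where a dead NFS server makes I/O block rather than fail.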


It's worth noting that the ActiveMQBasicSecurityManager accesses the
journal only when the broker starts. It reads all user/role
information from the journal and loads it into memory. The only
exception is if an administrator uses the management API to add,
remove, or update a user, role, etc. at which point the broker will
write to the journal.

Also, if there is no activity on the broker, the critical analyzer has
no chance to detect problems.
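
For alerting that doesn't depend on broker activity, one option is an
external probe run from your own monitoring. The following is only a
sketch of that idea, not an Artemis feature: storage_responds and its
parameters are hypothetical names, and os.stat may be answered from
the NFS attribute cache, so a real probe might need to read or write a
small file instead.

```python
import os
import threading


def storage_responds(path, timeout_s=5.0, probe=os.stat):
    """Return True if probe(path) finishes within timeout_s without OSError.

    The probe runs in a daemon thread so that a call blocked on an
    unreachable hard-mounted NFS server cannot hang the caller.
    """
    result = {}

    def worker():
        try:
            probe(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    # If the thread is still blocked, result is empty and we report failure.
    return result.get("ok", False)


if __name__ == "__main__":
    # Hypothetical journal location; substitute the real shared directory.
    if not storage_responds("/mnt/artemis-data/journal", timeout_s=5.0):
        print("ALERT: shared storage did not respond in time")
```

A blocked probe thread is simply abandoned here, so a long outage would
accumulate one stuck thread per check if this ran in a long-lived
process. Running it as a short-lived scheduled job avoids that.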

Based on your description, it sounds like the same network problem
that caused an issue with NFS might also have prevented clients from
connecting to the broker.


Justin

[1] https://issues.apache.org/jira/browse/ARTEMIS-2421
[2] https://artemis.apache.org/components/artemis/documentation/latest/ha.html#nfs-mount-recommendations

On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users
<[email protected]> wrote:
>
> Hello,
>
>
>
> we have a pretty straightforward Artemis HA cluster consisting of 2 nodes,
> a primary and a backup. The cluster uses NFS4.1 shared storage to store the
> journal. In addition, we are using ActiveMQBasicSecurityManager for
> authentication, which means information about Artemis users is on the same
> shared storage.
>
>
>
> A couple of days ago we had an incident with our shared storage provider.
> During the incident the storage was fully unreachable network-wise. The
> interesting part is that during the incident Artemis didn’t print any
> exceptions or errors in the logs. No messages that the journal was
> unreachable, no messages about failure to reach the backup, even though the
> backup was also experiencing the same issue with the storage. External AMQP
> client connections also didn’t result in the usual warning in the logs for
> “unknown users”, even though on the client side Qpid clients constantly
> printed “cannot connect” errors. It was as if the broker instances were
> unreachable by the clients while inside the broker all processes had just
> stopped, hanging and waiting for the storage.
>
> Critical analyzer also didn’t kick in for some reason. Usually it works very 
> well for us, when the same NFS storage slows down considerably, but not this 
> time.
>
>
>
> Only after I restarted the primary VM node, and it failed to mount the NFS
> storage (after waiting 3 minutes for the mount to time out during restart),
> did Artemis boot and start producing IOExceptions, “unknown user” errors,
> “connection failed to backup node” errors, and every other possible error
> related to an unreachable journal, as expected.
>
>
>
> Is the silence in the logs due to unreachable NFS storage a bug? If so, what
> do developers need for a reproducible case? As I said, there is nothing in the
> logs at the moment, but I could try to reproduce it in a testing environment
> with any combination of debugging properties if needed.
>
>
>
> If it’s not a bug, how should we ensure proper alerting (and possibly
> automatic Artemis shutdown) in case shared storage is down? Are we missing some
> NFS mount option or critical analyzer setting, maybe? Currently we are using
> defaults.
>
>
>
> Any pointers are much appreciated!
>
>
>
> --
>
>    Best Regards,
>
>
>
>     Vilius Šumskas
>
>     Rivile
>
>     IT manager
>
>

