Any results to share from your testing?

Justin

On Tue, Mar 3, 2026 at 9:27 AM Vilius Šumskas via users
<[email protected]> wrote:
>
> Thank you Justin for the explanation.
>
> I guess the ActiveMQBasicSecurityManager part explains why we didn't see any 
> "unknown user" errors in the logs. All users were loaded into memory in 
> advance. It's strange, though, that clients were trying to connect, could not 
> do so, and yet the broker didn't print anything about connection resets or other 
> related information. We are using the default logging level, btw. It was not the 
> same network issue, because the clients live on a different subnet than the 
> storage. We could ping the Artemis nodes from the clients successfully during 
> the incident, and the fast (millisecond) reconnections at the Qpid level 
> indicate that it was not a TCP-level issue.
>
> I just checked the NFS mount recommendations. We are using timeo=600,retrans=2; 
> however, we are indeed using the hard option instead of soft. I'm going to try 
> to reproduce the issue with both settings to see how it behaves. Could you 
> elaborate a bit on why the documentation says that the usual NFS caveat about 
> the soft option and data corruption can be safely ignored?
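>
> To illustrate the change I'm going to test (this is just a sketch of the 
> shape; the export path and mount point are made up, and the option values are 
> from our current setup, not recommendations), the hard-to-soft switch would 
> look something like this in /etc/fstab:
>
> ```
> # current: hard mount - I/O against the journal blocks indefinitely
> # while the NFS server is unreachable
> storage:/artemis  /var/lib/artemis  nfs4  hard,timeo=600,retrans=2  0 0
>
> # candidate: soft mount - I/O fails with an error after the retries
> # are exhausted instead of hanging forever
> storage:/artemis  /var/lib/artemis  nfs4  soft,timeo=600,retrans=2  0 0
> ```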
>
> --
>     Vilius
>
> -----Original Message-----
> From: Justin Bertram <[email protected]>
> Sent: Tuesday, March 3, 2026 4:43 PM
> To: [email protected]
> Subject: Re: error indication when cluster shared storage is not available
>
> Since Artemis 2.11.0 [1] the broker periodically evaluates the shared journal 
> file lock to ensure it hasn't been lost and that the backup hasn't activated. 
> Assuming proper configuration, I would have expected this component to shut 
> down the broker in your situation. Since it didn't, my hunch is that your NFS 
> mount is not configured properly. Can you confirm that you're following the NFS 
> mount recommendations [2]? I'm specifically thinking about the soft vs. hard 
> mount option.
>
> It's worth noting that the ActiveMQBasicSecurityManager accesses the journal 
> only when the broker starts: it reads all user/role information from the 
> journal and loads it into memory. The only exception is when an administrator 
> uses the management API to add, remove, or update a user, role, etc., at which 
> point the broker will write to the journal.
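>
> For context, enabling it looks roughly like this in bootstrap.xml (the 
> property values here are illustrative; the bootstrap credentials are only used 
> to seed the journal on first start):
>
> ```xml
> <!-- bootstrap.xml: store users/roles in the broker journal
>      instead of the properties-file login module -->
> <security-manager class-name="org.apache.activemq.artemis.spi.core.security.ActiveMQBasicSecurityManager">
>    <property key="bootstrapUser" value="admin"/>
>    <property key="bootstrapPassword" value="changeme"/>
>    <property key="bootstrapRole" value="amq"/>
> </security-manager>
> ```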
>
> Also, if there is no activity on the broker, the critical analyzer has no 
> chance to detect problems.
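>
> If you want the analyzer to act more aggressively when it does detect a 
> stall, these are the relevant broker.xml knobs (the values shown are 
> illustrative, not necessarily the defaults for your version):
>
> ```xml
> <!-- broker.xml: critical analyzer settings -->
> <critical-analyzer>true</critical-analyzer>
> <!-- how long a critical component may be stuck before action is taken (ms) -->
> <critical-analyzer-timeout>120000</critical-analyzer-timeout>
> <!-- how often to check (ms) -->
> <critical-analyzer-check-period>60000</critical-analyzer-check-period>
> <!-- what to do on detection: HALT, SHUTDOWN, or LOG -->
> <critical-analyzer-policy>HALT</critical-analyzer-policy>
> ```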
>
> Based on your description, it sounds like the same network problem that 
> caused an issue with NFS might also have prevented clients from connecting to 
> the broker.
>
>
> Justin
>
> [1] https://issues.apache.org/jira/browse/ARTEMIS-2421
> [2] 
> https://artemis.apache.org/components/artemis/documentation/latest/ha.html#nfs-mount-recommendations
>
> On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users 
> <[email protected]> wrote:
> >
> > Hello,
> >
> >
> >
> > we have a pretty straightforward Artemis HA cluster consisting of 2 nodes, 
> > a primary and a backup. The cluster uses NFSv4.1 shared storage to store 
> > the journal. In addition, we are using ActiveMQBasicSecurityManager for 
> > authentication, which means the information about Artemis users is on the 
> > same shared storage.
> >
> >
> >
> > A couple of days ago we had an incident with our shared storage provider. 
> > During the incident the storage was completely unreachable network-wise. The 
> > interesting part is that during the incident Artemis didn’t print any 
> > exceptions or errors in the logs: no messages that the journal was 
> > unreachable, no messages about failure to reach the backup, even though the 
> > backup was experiencing the same issue with the storage. External AMQP 
> > client connections also didn’t produce the usual “unknown user” warnings in 
> > the logs, even though on the client side the Qpid clients constantly 
> > printed “cannot connect” errors. It was as if the broker instances were 
> > unreachable by the clients, while inside the broker all processes simply 
> > hung waiting for the storage.
> >
> > The critical analyzer also didn’t kick in, for some reason. It usually 
> > works very well for us when the same NFS storage slows down considerably, 
> > but not this time.
> >
> >
> >
> > Only after I completely restarted the primary VM node, and it could not 
> > mount the NFS storage at all (after waiting 3 minutes for the timeout 
> > during restart), did Artemis boot and start producing IOExceptions, 
> > “unknown user” errors, “connection failed to backup node” errors, and every 
> > other possible error related to an unreachable journal, as expected.
> >
> >
> >
> > Is the silence in the logs due to unreachable NFS storage a bug? If so, 
> > what do the developers need for a reproducible case? As I said, there is 
> > nothing in the logs at the moment, but I could try to reproduce it in a 
> > testing environment with any combination of debugging properties if needed.
> >
> >
> >
> > If it’s not a bug, how should we ensure proper alerting (and possibly 
> > automatic Artemis shutdown) in case the shared storage is down? Are we 
> > missing some NFS mount option or critical analyzer setting, maybe? 
> > Currently we are using the defaults.
> >
> >
> >
> > Any pointers are much appreciated!
> >
> >
> >
> > --
> >
> >    Best Regards,
> >
> >
> >
> >     Vilius Šumskas
> >
> >     Rivile
> >
> >     IT manager
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
