Any results to share from your testing?

Justin

On Tue, Mar 3, 2026 at 9:27 AM Vilius Šumskas via users <[email protected]> wrote:
>
> Thank you Justin for the explanation.
>
> I guess the ActiveMQBasicSecurityManager part explains why we didn't see
> any "unknown user" errors in the logs. All users were loaded into memory in
> advance. It's strange though that clients were trying to connect and could
> not do so, but the broker didn't print anything about connection resets or
> other related information. We are using the default logging level btw. It
> was not the same network issue because clients live on a different subnet
> than the storage. We could ping Artemis nodes from clients successfully
> during the incident, and fast (millisecond) reconnections at the Qpid level
> indicate that it was not a TCP level issue.
>
> I just checked the NFS mount recommendations. We are using
> timeo=600,retrans=2, however indeed we are using the hard instead of the
> soft option. I'm going to try to reproduce the issue with both settings to
> see how it behaves. Could you elaborate a bit on why the documentation says
> that the NFS recommendation regarding the soft option and data corruption
> can be safely ignored?
>
> --
> Vilius
>
> -----Original Message-----
> From: Justin Bertram <[email protected]>
> Sent: Tuesday, March 3, 2026 4:43 PM
> To: [email protected]
> Subject: Re: error indication when cluster shared storage is not available
>
> Since Artemis 2.11.0 [1] the broker will periodically evaluate the shared
> journal file-lock to ensure it hasn't been lost and/or the backup hasn't
> activated. Assuming proper configuration, I would have expected this
> component to shut down the broker in your situation. Since it didn't shut
> down the broker my hunch is that your NFS mount is not configured properly.
> Can you confirm that you're following the NFS mount recommendations [2]?
> I'm specifically thinking about using soft vs. hard.
>
> It's worth noting that the ActiveMQBasicSecurityManager accesses the journal
> only when the broker starts. It reads all user/role information from the
> journal and loads it into memory. The only exception is if an administrator
> uses the management API to add, remove, or update a user, role, etc. at
> which point the broker will write to the journal.
>
> Also, if there is no activity on the broker, the critical analyzer has no
> chance to detect problems.
>
> Based on your description, it sounds like the same network problem that
> caused an issue with NFS might also have prevented clients from connecting
> to the broker.
>
>
> Justin
>
> [1] https://issues.apache.org/jira/browse/ARTEMIS-2421
> [2] https://artemis.apache.org/components/artemis/documentation/latest/ha.html#nfs-mount-recommendations
>
> On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users
> <[email protected]> wrote:
> >
> > Hello,
> >
> > we have a pretty straightforward Artemis HA cluster consisting of 2
> > nodes, a primary and a backup. The cluster uses NFS4.1 shared storage to
> > store the journal. In addition, we are using
> > ActiveMQBasicSecurityManager for authentication, which means information
> > about Artemis users is on the same shared storage.
> >
> > A couple of days ago we had an incident with our shared storage
> > provider. During the incident the storage was fully unreachable network
> > wise. The interesting part is that during the incident Artemis didn’t
> > print any exceptions or any errors in the logs. No messages that the
> > journal could not be reached, no messages about failure to reach the
> > backup, even though the backup was also experiencing the same issue with
> > the storage. External AMQP client connections also didn’t result in the
> > usual warning in the logs for “unknown users”, even though on the client
> > side Qpid clients constantly printed “cannot connect” errors. As if
> > broker instances were unreachable by the clients but inside the broker
> > all processes just stopped, hanging and waiting for the storage.
> >
> > The critical analyzer also didn’t kick in for some reason. Usually it
> > works very well for us when the same NFS storage slows down
> > considerably, but not this time.
> >
> > Only after I completely restarted the primary VM node, and it could not
> > mount the NFS storage at all (after waiting 3 minutes to time out during
> > restart), did Artemis boot and start producing IOExceptions, “unknown
> > user” errors, “connection failed to backup node” errors, and every other
> > possible error related to an unreachable journal, as expected.
> >
> > Is the silence in the logs due to unreachable NFS storage a bug? If so,
> > what do developers need for a reproducible case? As I said, there is
> > nothing in the logs at the moment, but I could try to reproduce it in a
> > testing environment with any combination of debugging properties if
> > needed.
> >
> > If it’s not a bug, how should we ensure proper alerting (and possibly
> > automatic Artemis shutdown) in case shared storage is down? Are we
> > missing some NFS mount option or critical analyzer setting, maybe?
> > Currently we are using defaults.
> >
> > Any pointers are much appreciated!
> >
> > --
> >
> > Best Regards,
> >
> > Vilius Šumskas
> >
> > Rivile
> >
> > IT manager
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
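
For anyone who wants external alerting for the situation discussed in this thread, here is a minimal sketch of a watchdog, not part of Artemis itself. It assumes a Linux host where /proc/mounts lists the NFS mount together with its options (including "hard" or "soft"). The function names and the mount point in the usage comment are hypothetical. The stat probe runs in a daemon thread because a call against an unreachable hard mount may block indefinitely; note that with attribute caching enabled a stat may be answered from the client cache, so this is only a rough signal.

```python
import os
import threading


def nfs_mount_options(mounts_text, mount_point):
    """Return the option list of the NFS mount at mount_point, or None.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if (len(fields) >= 4 and fields[1] == mount_point
                and fields[2].startswith("nfs")):
            return fields[3].split(",")
    return None


def storage_responds(path, timeout_seconds=5.0):
    """Return True if os.stat(path) succeeds within timeout_seconds."""
    result = {}

    def probe():
        try:
            os.stat(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Daemon thread: a probe stuck on a hung mount will not block exit.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    # No entry in result means the probe is still blocked.
    return result.get("ok", False)


def check_shared_storage(mount_point, mounts_path="/proc/mounts"):
    """Return a list of human-readable warnings for the given mount."""
    warnings = []
    with open(mounts_path) as f:
        options = nfs_mount_options(f.read(), mount_point)
    if options is None:
        warnings.append("no NFS mount found at " + mount_point)
    elif "hard" in options:
        warnings.append("mounted with 'hard': I/O may block instead of failing")
    if not storage_responds(mount_point):
        warnings.append("storage did not respond in time")
    return warnings


# Usage (hypothetical mount point), e.g. from cron or a monitoring agent:
#   for warning in check_shared_storage("/mnt/artemis-shared"):
#       print(warning)
```

Running the check from outside the broker process is a deliberate choice here: if the broker's own threads are blocked on storage, as described above, an in-process check could hang along with them.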
