My bad, must have selected the wrong email (which I did, obviously).
I replied on the correct thread this time.

-----Original Message-----
From: Justin Bertram <[email protected]> 
Sent: Friday, March 13, 2026 10:47 AM
To: [email protected]
Subject: EXTERNAL: Re: EXTERNAL: Re: error indication when cluster shared 
storage is not available

I'm confused. Did you really mean to reply to this thread about NFS started by 
Vilius? Perhaps you meant to reply to the other thread that you started about 
configuring broker.xml?


Justin

On Fri, Mar 13, 2026 at 10:37 AM [email protected] 
<[email protected]> wrote:
>
> Finally found the bug in one of my primary clients.
> A simple "!= null" check that should have been "== null" was preventing 
> creation of the Destination instance.
> Once that was fixed, messages began flowing to that client as expected.
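For illustration, the inverted guard described above tends to look like the sketch below. The thread doesn't show the actual client code, so all names here are hypothetical:

```java
// Minimal sketch of the reported bug: a lazy-creation guard with the
// comparison inverted. Class and method names are hypothetical; a plain
// Object stands in for javax.jms.Destination.
public class DestinationCache {
    private Object destination;

    // Buggy version: with "!= null" the Destination is never created
    // when the cache starts out empty.
    public Object getBuggy() {
        if (destination != null) {   // should be == null
            destination = create();
        }
        return destination;          // stays null forever
    }

    // Fixed version: create the instance only when it is missing.
    public Object getFixed() {
        if (destination == null) {
            destination = create();
        }
        return destination;
    }

    private Object create() {
        return new Object();
    }

    public static void main(String[] args) {
        System.out.println(new DestinationCache().getBuggy() == null); // true: never created
        System.out.println(new DestinationCache().getFixed() != null); // true: created on demand
    }
}
```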
>
> What is the policy for sharing a broker.xml file on this forum?  I would be 
> happy to share it for other developers.
>
> -----Original Message-----
> From: Justin Bertram <[email protected]>
> Sent: Wednesday, March 4, 2026 12:50 PM
> To: [email protected]
> Subject: EXTERNAL: Re: error indication when cluster shared storage is 
> not available
>
> Any results to share from your testing?
>
>
> Justin
>
> On Tue, Mar 3, 2026 at 9:27 AM Vilius Šumskas via users 
> <[email protected]> wrote:
> >
> > Thank you Justin for the explanation.
> >
> > I guess ActiveMQBasicSecurityManager part explains why we didn't see any 
> > "unknown user" errors in the logs. All users were loaded into memory in 
> > advance. It's strange though that clients were connecting, could not do so, 
> > but the broker didn't print anything about connection resets or other 
> > related information. We are using default logging level btw. It was not the 
> > same network issue because clients live on the different subnet than 
> > storage. We could ping Artemis nodes from clients successfully during 
> > incident, and fast (millisecond) reconnections at Qpid level indicates that 
> > it was not TCP level issue.
> >
> > I just checked the NFS mount recommendations. We are using timeo=600,retrans=2; 
> > however, we are indeed using the hard option instead of soft. I'm going to try 
> > to reproduce the issue with both settings to see how it behaves. Could you 
> > elaborate a bit on why the documentation says that the NFS warning about the 
> > soft option and data corruption can be safely ignored?
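For concreteness, the two variants being compared could look like this on the mount command line. The server name and paths are placeholders; timeo/retrans are the values already mentioned above:

```shell
# Current mount: "hard" blocks I/O indefinitely while the NFS server is
# unreachable, so the broker's writes simply hang.
mount -t nfs4 -o hard,timeo=600,retrans=2 storage.example.com:/artemis /mnt/artemis-journal

# Variant to test: "soft" makes the NFS client return an I/O error to the
# application once the retries are exhausted, instead of hanging forever.
mount -t nfs4 -o soft,timeo=600,retrans=2 storage.example.com:/artemis /mnt/artemis-journal
```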
> >
> > --
> >     Vilius
> >
> > -----Original Message-----
> > From: Justin Bertram <[email protected]>
> > Sent: Tuesday, March 3, 2026 4:43 PM
> > To: [email protected]
> > Subject: Re: error indication when cluster shared storage is not 
> > available
> >
> > Since Artemis 2.11.0 [1] the broker will periodically evaluate the shared 
> > journal file-lock to ensure it hasn't been lost and/or the backup hasn't 
> > activated. Assuming proper configuration, I would have expected this 
> > component to shut down the broker in your situation.
> > Since it didn't shut down the broker my hunch is that your NFS mount is not 
> > configured properly. Can you confirm that you're following the NFS mount 
> > recommendations [2]? I'm specifically thinking about using soft vs. hard.
> >
> > It's worth noting that the ActiveMQBasicSecurityManager accesses the 
> > journal only when the broker starts. It reads all user/role information 
> > from the journal and loads it into memory. The only exception is if an 
> > administrator uses the management API to add, remove, or update a user, 
> > role, etc. at which point the broker will write to the journal.
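For anyone following along, the ActiveMQBasicSecurityManager is wired up in bootstrap.xml rather than broker.xml. A rough sketch per the Artemis docs (property values are placeholders; verify the element against your version):

```xml
<!-- bootstrap.xml: sketch of an ActiveMQBasicSecurityManager configuration.
     The bootstrap* properties seed the journal on the broker's first start;
     afterwards users/roles live in the journal and are managed via the
     management API. Values here are placeholders. -->
<security-manager class-name="org.apache.activemq.artemis.spi.core.security.ActiveMQBasicSecurityManager">
   <property key="bootstrapUser" value="admin"/>
   <property key="bootstrapPassword" value="changeme"/>
   <property key="bootstrapRole" value="amq"/>
</security-manager>
```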
> >
> > Also, if there is no activity on the broker, the critical analyzer has no 
> > chance to detect problems.
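For reference, the critical analyzer is tunable in broker.xml. A sketch of the relevant elements (the values shown are illustrative, not necessarily the shipped defaults; check the Artemis documentation for your version):

```xml
<!-- Illustrative critical-analyzer settings, inside <core> in broker.xml.
     Timeouts are in milliseconds; values here are examples only. -->
<critical-analyzer>true</critical-analyzer>
<critical-analyzer-timeout>120000</critical-analyzer-timeout>
<critical-analyzer-check-period>60000</critical-analyzer-check-period>
<!-- Policy when a critical failure is detected: HALT, SHUTDOWN, or LOG -->
<critical-analyzer-policy>HALT</critical-analyzer-policy>
```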
> >
> > Based on your description, it sounds like the same network problem that 
> > caused an issue with NFS might also have prevented clients from connecting 
> > to the broker.
> >
> >
> > Justin
> >
> > [1] https://issues.apache.org/jira/browse/ARTEMIS-2421
> > [2] https://artemis.apache.org/components/artemis/documentation/latest/ha.html#nfs-mount-recommendations
> >
> > On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users 
> > <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > >
> > >
> > > we have a pretty straightforward Artemis HA cluster consisting of 2 
> > > nodes, a primary and a backup. The cluster uses NFS 4.1 shared storage to 
> > > store the journal. In addition, we are using ActiveMQBasicSecurityManager 
> > > for authentication, which means the information about Artemis users is on 
> > > the same shared storage.
> > >
> > >
> > >
> > > A couple of days ago we had an incident with our shared storage provider. 
> > > During the incident the storage was fully unreachable network-wise. The 
> > > interesting part is that during the incident Artemis didn’t print any 
> > > exceptions or errors in the logs: no messages that the journal could not 
> > > be reached, no messages about failure to reach the backup, even though 
> > > the backup was experiencing the same issue with the storage. 
> > > External AMQP client connections also didn’t result in the usual “unknown 
> > > user” warning in the logs, even though on the client side the Qpid 
> > > clients constantly printed “cannot connect” errors. It is as if the broker 
> > > instances were unreachable by the clients while inside the broker all 
> > > processes just stopped, hanging and waiting for the storage.
> > >
> > > The critical analyzer also didn’t kick in, for some reason. It usually 
> > > works very well for us when the same NFS storage slows down considerably, 
> > > but not this time.
> > >
> > >
> > >
> > > Only after I completely restarted the primary VM node, and it could not 
> > > mount the NFS storage at all (after waiting 3 minutes for a timeout during 
> > > the restart), did Artemis boot and start producing IOExceptions, 
> > > “unknown user” errors, “connection failed to backup node” errors, and 
> > > every other possible error related to an unreachable journal, as expected.
> > >
> > >
> > >
> > > Is the silence in the logs due to unreachable NFS storage a bug? If so, 
> > > what do the developers need for a reproducible case? As I said, there is 
> > > nothing in the logs at the moment, but I could try to reproduce it in a 
> > > testing environment with any combination of debugging properties if 
> > > needed.
> > >
> > >
> > >
> > > If it’s not a bug, how should we ensure proper alerting (and possibly an 
> > > automatic Artemis shutdown) in case the shared storage is down? Are we 
> > > missing some NFS mount option or critical analyzer setting, maybe? 
> > > Currently we are using the defaults.
> > >
> > >
> > >
> > > Any pointers are much appreciated!
> > >
> > >
> > >
> > > --
> > >
> > >    Best Regards,
> > >
> > >
> > >
> > >     Vilius Šumskas
> > >
> > >     Rivile
> > >
> > >     IT manager
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
>
