Was the broker completely idle? I would need some thread dumps taken during the incident to avoid having to guess.
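For next time (or for a reproduction attempt), a minimal way to capture those thread dumps while the hang is in progress is a small loop around the JDK's jcmd; the pgrep pattern below is an assumption about how the broker process appears and may need adjusting for your install:

```shell
# Capture three thread dumps, 10 s apart, from the running Artemis JVM.
# Assumes a JDK (jcmd) on PATH; the pgrep pattern is a guess at the
# broker's main class and may differ in your environment.
PID=$(pgrep -f 'org.apache.activemq.artemis' | head -n1)
if [ -n "$PID" ]; then
  for i in 1 2 3; do
    jcmd "$PID" Thread.print > "/tmp/artemis-threads-$i.txt"
    sleep 10
  done
else
  echo "no Artemis process found"
fi
```

Threads stuck in NFS writes typically show up in these dumps as blocked in native file I/O, which would confirm (or rule out) the "everything hung on storage" theory.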
But if I had to guess, I would say the broker was idle. And no load was being pushed to storage during your outage?

Clebert Suconic

On Mon, Mar 2, 2026 at 5:11 PM Vilius Šumskas via users <[email protected]> wrote:

> Hello,
>
> we have a pretty straightforward Artemis HA cluster consisting of 2
> nodes, a primary and a backup. The cluster uses NFS 4.1 shared storage to
> store the journal. In addition, we are using ActiveMQBasicSecurityManager
> for authentication, which means the information about Artemis users is on
> the same shared storage.
>
> A couple of days ago we had an incident with our shared storage provider.
> During the incident the storage was fully unreachable network-wise. The
> interesting part is that during the incident Artemis didn't print any
> exceptions or errors in the logs. No messages that the journal could not
> be reached, no messages about failure to reach the backup, even though the
> backup was experiencing the same issue with the storage. External AMQP
> client connections also didn't produce the usual "unknown user" warnings
> in the logs, even though on the client side the Qpid clients constantly
> printed "cannot connect" errors. It was as if the broker instances were
> unreachable by the clients, while inside the broker all processes had
> simply stopped, hanging and waiting for the storage.
>
> The critical analyzer also didn't kick in for some reason. Usually it
> works very well for us when the same NFS storage slows down considerably,
> but not this time.
>
> Only after I completely restarted the primary VM node, and it failed to
> mount the NFS storage at all (after waiting 3 minutes for the timeout
> during restart), did Artemis boot and start producing IOExceptions,
> "unknown user" errors, "connection failed to backup node" errors, and
> every other possible error related to an unreachable journal, as expected.
>
> Is the silence in the logs due to unreachable NFS storage a bug? If so,
> what would the developers need for a reproducible case?
> As I said, there is nothing in the logs at the moment, but I could try to
> reproduce it in a testing environment with any combination of debugging
> properties if needed.
>
> If it's not a bug, how should we ensure proper alerting (and possibly an
> automatic Artemis shutdown) in case the shared storage is down? Are we
> missing some NFS mount option or critical analyzer setting, maybe?
> Currently we are using the defaults.
>
> Any pointers are much appreciated!
>
> --
> Best Regards,
>
> Vilius Šumskas
> Rivile
> IT manager
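Regarding the critical analyzer defaults mentioned above: one plausible explanation for the silence is that NFS "hard" mounts (the Linux default) make blocked writes wait indefinitely instead of returning errors, so nothing ever fails loudly enough to log. A sketch of the relevant broker.xml settings, with illustrative timings rather than recommendations, would be:

```xml
<!-- broker.xml, inside <core>. Values are examples, not recommendations. -->
<critical-analyzer>true</critical-analyzer>
<!-- how long a measured component may be stalled before acting (ms) -->
<critical-analyzer-timeout>120000</critical-analyzer-timeout>
<!-- how often the analyzer checks (ms) -->
<critical-analyzer-check-period>60000</critical-analyzer-check-period>
<!-- LOG only writes a log entry; HALT or SHUTDOWN actually stops the
     broker so monitoring and the backup can react -->
<critical-analyzer-policy>SHUTDOWN</critical-analyzer-policy>
```

On the mount side, a "soft" NFS mount with short timeo/retrans values makes I/O calls return errors to the broker instead of hanging forever, which gives the analyzer (and the logs) something to see; whether soft mounts are safe for your journal is a trade-off worth checking against the Artemis shared-store documentation before changing.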
