Was the broker completely idle? I would need some thread dumps taken during the
incident to avoid guessing.

If I had to guess, I would say the broker was idle. Was there no load pushed
to storage during your outage?

Clebert Suconic


On Mon, Mar 2, 2026 at 5:11 PM Vilius Šumskas via users <
[email protected]> wrote:

> Hello,
>
>
>
> we have a pretty straightforward Artemis HA cluster consisting of 2
> nodes, primary and a backup. Cluster uses NFS4.1 shared storage to store
> the journal. In addition, we are using ActiveMQBasicSecurityManager for
> authentication, which means information about Artemis users is on the same
> shared storage.
>
>
>
> A couple of days ago we had an incident with our shared storage provider.
> During the incident the storage was completely unreachable over the network. The
> interesting part is that during the incident Artemis didn’t print any
> exceptions or any errors in the logs. No messages that the journal could not be
> reached, no messages about failure to reach the backup, even though the
> backup was also experiencing the same issue with the storage. External AMQP
> client connections also didn’t result in the usual warning in the logs for
> “unknown users”, even though on the client side Qpid clients constantly
> printed “cannot connect” errors. It was as if the broker instances were unreachable by
> the clients, while inside the broker all processes had simply stopped, hanging and
> waiting for the storage.
>
> The critical analyzer also didn’t kick in for some reason. It usually works
> very well for us when the same NFS storage slows down considerably, but
> not this time.
>
>
>
> Only after I fully restarted the primary VM node, and it failed to mount the
> NFS storage (after waiting 3 minutes for the mount to time out during restart),
> did Artemis boot and start producing IOExceptions, “unknown user”
> errors, “connection failed to backup node” errors, and every other possible
> error related to an unreachable journal, as expected.
>
>
>
> Is the silence in the logs while the NFS storage is unreachable a bug? If so,
> what do the developers need for a reproducible case? As I said, there is nothing
> in the logs at the moment, but I could try to reproduce it in a testing
> environment with any combination of debugging properties if needed.
>
>
>
> If it’s not a bug, how should we ensure proper alerting (and possibly
> automatic Artemis shutdown) in case shared storage is down? Are we missing some
> NFS mount option or critical analyzer setting, maybe? Currently we are
> using defaults.
>
>
>
> Any pointers are much appreciated!
>
>
>
> --
>
>    Best Regards,
>
>
>
>     Vilius Šumskas
>
>     Rivile
>
>     IT manager
>
>
>
