Hello,

We have a fairly straightforward Artemis HA cluster consisting of 2 nodes, a
primary and a backup. The cluster uses NFSv4.1 shared storage to store the
journal. In addition, we are using ActiveMQBasicSecurityManager for
authentication, which means the Artemis user information is kept on the same
shared storage.

A couple of days ago we had an incident with our shared storage provider. During
the incident the storage was completely unreachable over the network. The
interesting part is that during the incident Artemis didn’t log any exceptions
or errors. There were no messages that the journal was unreachable and no
messages about failure to reach the backup, even though the backup was
experiencing the same storage issue. External AMQP client connections also
didn’t produce the usual “unknown users” warnings in the logs, even though on
the client side Qpid clients constantly printed “cannot connect” errors. It was
as if the broker instances were unreachable by the clients while, inside the
broker, all processes simply hung, waiting for the storage.
The critical analyzer also didn’t kick in for some reason. It usually works
very well for us when the same NFS storage slows down considerably, but not
this time.

Only after I fully restarted the primary VM node, and it failed to mount the
NFS storage (after waiting 3 minutes for the mount to time out during boot),
did Artemis start and begin producing IOExceptions, “unknown user” errors,
“connection failed to backup node” errors, and every other error one would
expect with an unreachable journal.

Is the silence in the logs while the NFS storage is unreachable a bug? If so,
what do the developers need for a reproducible case? As I said, there is
nothing in the logs at the moment, but I could try to reproduce it in a test
environment with any combination of debugging properties if needed.

If it’s not a bug, how should we ensure proper alerting (and possibly automatic
Artemis shutdown) when the shared storage is down? Are we missing some NFS
mount option or critical analyzer setting, perhaps? We are currently using the
defaults.
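
To make the alerting question concrete, here is a minimal sketch of the kind
of external probe that could run next to the broker. It is only an
illustration, not something Artemis provides: the function name
storage_reachable, the timeout value and the mount path are all hypothetical.
It assumes that a call such as os.stat() on a hung NFS mount blocks instead of
failing, so the check runs in a daemon thread and the caller gives up after a
timeout.

```python
import os
import threading


def storage_reachable(path, timeout_s=5.0, probe=os.stat):
    """Return True if probe(path) finishes without OSError within timeout_s.

    Hypothetical helper: a blocked call on a hung mount cannot be cancelled,
    so it is left behind in a daemon thread and reported as unreachable.
    """
    result = {}

    def worker():
        try:
            probe(path)
            result["ok"] = True
        except OSError:
            # Path missing, stale handle, I/O error and similar failures.
            result["ok"] = False

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    # No entry in result means the probe is still blocked after the timeout.
    return result.get("ok", False)


if __name__ == "__main__":
    # "/mnt/artemis-shared" is a placeholder for the real journal mount point.
    ok = storage_reachable("/mnt/artemis-shared", timeout_s=5.0)
    print("shared storage reachable" if ok else "shared storage NOT reachable")
    raise SystemExit(0 if ok else 1)
```

The non-zero exit status could be picked up by a monitoring agent to raise an
alert or to stop the broker, but whether that is the recommended approach is
exactly what we would like to know.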

Any pointers are much appreciated!

--
   Best Regards,

    Vilius Šumskas
    Rivile
    IT manager
