Thank you, Wei, as always.

This is a glass-half-empty versus glass-half-full question.

Based on our experience, there is more to lose than gain. I would suggest
setting the default to
reboot.host.and.alert.management.on.heartbeat.timeout=false.
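
For anyone following along, here is a minimal sketch of how one might apply this on a KVM host. It assumes the standard agent.properties location (/etc/cloudstack/agent/agent.properties) and a systemd-managed agent; adjust paths and service names for your install.

```shell
# Sketch, assuming the standard cloudstack-agent layout on a KVM host.
# Append the setting that disables the automatic host reboot when the
# storage heartbeat times out:
echo 'reboot.host.and.alert.management.on.heartbeat.timeout=false' \
    | sudo tee -a /etc/cloudstack/agent/agent.properties

# Restart the agent so the new property takes effect:
sudo systemctl restart cloudstack-agent
```

Repeat on each host in the zone; the property is read per-agent, not centrally.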

Regards,
Antoine



Antoine Boucher
antoi...@haltondc.com
[o] +1-226-505-9734
www.haltondc.com



Confidentiality Warning: This message and any attachments are intended only for 
the use of the intended recipient(s), are confidential, and may be privileged. 
If you are not the intended recipient, you are hereby notified that any review, 
retransmission, conversion to hard copy, copying, circulation or other use of 
this message and any attachments is strictly prohibited. If you are not the 
intended recipient, please notify the sender immediately by return e-mail, and 
delete this message and any attachments from your system.


> On Mar 28, 2025, at 3:22 AM, Wei ZHOU <ustcweiz...@gmail.com> wrote:
> 
> Hi,
> 
> Yes, this is the default behavior: the host is rebooted when the NFS
> storage heartbeat fails.
> 
> You can add the following line to agent.properties and restart
> cloudstack-agent for it to take effect.
> 
> reboot.host.and.alert.management.on.heartbeat.timeout=false
> 
> 
> 
> -Wei
> 
> On Fri, Mar 28, 2025 at 5:06 AM Antoine Boucher
> <antoi...@haltondc.com.invalid> wrote:
> 
>> We experienced unexpected cascading reboots across all hosts, followed by
>> HA kicking in and migrating VMs. Amid the chaos, we discovered that a newly
>> added zone-wide NFS server, used only by one stopped test VM, had gone
>> offline. Once we disabled that NFS server in the UI, everything slowly
>> stabilized.
>> 
>> We have a large number of NFS servers online in the zone. Is this expected
>> behavior? Can one NFS server going offline with just a single stopped VM
>> trigger mass host reboots? This feels like operational madness.
>> 
>> Regards, Antoine
>> 
>> Antoine Boucher
>> antoi...@haltondc.com
>> [o] +1-226-505-9734
>> www.haltondc.com
>> 
