Re: NFS Server Failure Caused Cascading Reboots and HA Events

Wei ZHOU Fri, 28 Mar 2025 10:28:15 -0700

Hi Antoine,

There was a pull request to change the default value:
https://github.com/apache/cloudstack/pull/10111


I personally agree with the change, but it would be better to discuss it
with a wider group of users.
You can share your opinion on GitHub.


-Wei


On Fri, Mar 28, 2025 at 5:17 PM Antoine Boucher <[email protected]>
wrote:

> Thank you, Wei, as always.
>
> This is a half-empty versus half-full glass issue.
>
> Based on our experience, there is more to lose than gain.  I would suggest
> setting the default to
> reboot.host.and.alert.management.on.heartbeat.timeout=false.
>
> Regards,
> Antoine
>
>
>
> *Antoine Boucher*
> [email protected]
> [o] +1-226-505-9734
> www.haltondc.com
>
>
>
>
> On Mar 28, 2025, at 3:22 AM, Wei ZHOU <[email protected]> wrote:
>
> Hi,
>
> Currently, the default behavior is that the host is rebooted in case of
> NFS failure.
>
> You can add the following line to agent.properties and restart
> cloudstack-agent for it to take effect.
>
> reboot.host.and.alert.management.on.heartbeat.timeout=false
>
>
>
> -Wei
>
> On Fri, Mar 28, 2025 at 5:06 AM Antoine Boucher
> <[email protected]> wrote:
>
> We experienced unexpected cascading reboots across all hosts, followed by
> HA kicking in and migrating VMs. Amid the chaos, we discovered that a newly
> added zone-wide NFS server, used only by one stopped test VM, had gone
> offline. Once we disabled that NFS server in the UI, everything slowly
> stabilized.
>
> We have a large number of NFS servers online in the zone. Is this expected
> behavior? Can one NFS server going offline with just a single stopped VM
> trigger mass host reboots? This feels like operational madness.
>
> Regards, Antoine
>
> Antoine Boucher
> [email protected]
> [o] +1-226-505-9734
> www.haltondc.com
>
>
>
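
The workaround described above (set the property in agent.properties, then restart cloudstack-agent) can be scripted when many hosts need the same change. The following is only a minimal sketch: the helper names `set_property` and `update_agent_properties` are hypothetical, and the path /etc/cloudstack/agent/agent.properties is an assumption about a typical KVM agent install that should be verified on your hosts.

```python
from pathlib import Path

# Property name taken from the thread above.
KEY = "reboot.host.and.alert.management.on.heartbeat.timeout"

# Assumption: usual location of the agent config; adjust for your install.
DEFAULT_PATH = Path("/etc/cloudstack/agent/agent.properties")


def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with key=value set exactly once.

    An existing uncommented definition is replaced in place, later
    duplicates are dropped, and commented-out lines are left untouched.
    If the key is absent, it is appended at the end.
    """
    out = []
    replaced = False
    for line in text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith(("#", "!"))
        name = stripped.split("=", 1)[0].strip()
        if not is_comment and name == key:
            if not replaced:
                out.append(f"{key}={value}")
                replaced = True
            continue  # skip the old definition and any duplicates
        out.append(line)
    if not replaced:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


def update_agent_properties(path: Path = DEFAULT_PATH) -> None:
    """Rewrite the agent config so the host is not rebooted on heartbeat timeout."""
    path.write_text(set_property(path.read_text(), KEY, "false"))
```

Calling `update_agent_properties()` as root on a host edits the file in place. The agent still has to be restarted afterwards, for example with `systemctl restart cloudstack-agent` on systemd-based distributions (an assumption about the host OS), before the new value takes effect.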
