Re: All cluster reboot when a Primary storage fails

Daniel Augusto Veronezi Salvador Fri, 15 Oct 2021 20:40:33 -0700

Hi Mauro,

On KVM's heartbeat monitor, when the heartbeat file is inconsistent or the heartbeat timeout is exceeded several times, the host is restarted by default.

The PR 4586 (https://github.com/apache/cloudstack/pull/4586) already addressed this issue by externalizing a property, which allows the operator to decide if the host must be restarted or not (default is 'true', meaning that the host will be restarted). However, this feature will be available only after release 4.16.
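
To make the behavior above concrete, here is a minimal Python sketch of the decision it describes. It is only an illustration, not the agent's actual code: the names `HeartbeatPolicy`, `max_failures`, `reboot_on_timeout` and `decide_action`, as well as the threshold value, are hypothetical. The real property name and its semantics should be taken from the PR itself.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatPolicy:
    # Hypothetical threshold: how many consecutive heartbeat failures
    # are tolerated before the monitor gives up retrying.
    max_failures: int = 5
    # Mirrors the default described above ('true' means the host is restarted).
    reboot_on_timeout: bool = True


def decide_action(consecutive_failures: int, policy: HeartbeatPolicy) -> str:
    """Return what the monitor would do after a failed heartbeat write."""
    if consecutive_failures < policy.max_failures:
        return "retry"
    if policy.reboot_on_timeout:
        return "reboot-host"
    # With the flag disabled, the operator is alerted and the host stays up.
    return "alert-only"


if __name__ == "__main__":
    print(decide_action(2, HeartbeatPolicy()))
    print(decide_action(5, HeartbeatPolicy()))
    print(decide_action(5, HeartbeatPolicy(reboot_on_timeout=False)))
```

With the flag disabled in this sketch, a single failed storage would no longer take down a host that is still serving VMs from healthy storages, which is the scenario you describe.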



Best regards,

Daniel Salvador


On 15/10/2021 20:43, Mauro Ferraro - G2K Hosting wrote:

Hi guys, how are you?

We are having this problem with ACS when a primary storage fails.
We have several primary storage servers running Linux with an NFS server, serving KVM images. Every host has all the NFS servers mounted, because a single host can be running VMs from different storages. The main problem is that when one storage fails for any reason, the whole cluster goes crazy and starts rebooting the hosts to reconnect to that storage, and all the VMs on the cluster (including the VMs that were working fine) go down because the connection to one storage failed. If the problem with the storage is permanent, the cluster never starts again and the hosts reboot indefinitely.
When this problem appears, the logs say this:
host heartbeat: kvmheartbeat.sh will reboot system because it was unable to write the heartbeat to the storage.
Many users edit the script kvmheartbeat.sh to avoid the host reboot, or restart the agent on the host, but I am not really sure that this is the real solution.
Can someone help propose a better solution to this high-risk problem?

Regards,

Mauro
