Hi Mauro, that is really interesting to hear - I haven't been dealing with CloudStack for very long, so this is quite new to me. However, reading through the admin guide http://docs.cloudstack.apache.org/en/latest/adminguide/reliability.html?highlight=Storage%20Outage#primary-storage-outage-and-data-loss the described behaviour does not seem "normal" for the hosts.
Did you already take a look at the issues on GitHub? Restarting all hosts of the cluster sounds like a bug to me - so it might be worth opening a new issue for further investigation?

On Sat., 16 Oct. 2021 at 01:43, Mauro Ferraro - G2K Hosting <mferr...@g2khosting.com> wrote:

> Hi guys, how are you?
>
> We are having problems with ACS when a primary storage fails.
>
> We have several primary storages, each a Linux NFS server serving KVM
> images. Every host has all of the NFS servers mounted, because a single
> host can be running VMs from different storages. The main problem is
> that when one storage fails for any reason, the whole cluster goes
> crazy and starts rebooting the hosts to reconnect to that storage, and
> all the VMs in the cluster (including the VMs that were working fine)
> go down because the connection to one storage failed.
> If the problem with the storage is permanent, the cluster never starts
> again and the hosts reboot indefinitely.
>
> When this problem appears, the logs say this:
>
> host heartbeat: kvmheartbeat.sh will reboot system because it was unable
> to write the heartbeat to the storage.
>
> Many users edit the script kvmheartbeat.sh to avoid the host reboot, or
> restart the agent on the host, but I am really not sure that this is
> the real solution.
>
> Can someone help propose a better solution to this high-risk problem?
>
> Regards,
>
> Mauro