Hi All,

Having been smashed by the unexpected behaviour of the KVM heartbeat / HA process, we've been working through the logic of the process, and I now believe the intent of the process is summarised by:
=================
The heartbeat process consists of 3 parts:

1. A shell script that's distributed to each of the hypervisors during the CloudStack installation process:
   /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh

2. Two Java classes, built into CloudStack:
   com.cloud.hypervisor.kvm.resource.KVMHAMonitor
   com.cloud.hypervisor.kvm.resource.KVMHAChecker

Behaviour

Each of the classes periodically calls the kvmheartbeat.sh script with different arguments. The script is used to confirm the existence of NFS mounts, remount any that are missing, clean up (i.e. kill) VMs in an indeterminate state, read and write heartbeats to NFS volumes, and force the host hypervisor to reboot (as part of a "shoot the node in the head" approach to restoring sanity to the cluster).

KVMHAMonitor writes a timestamp to each of the NFS volumes (pools) every minute; if this write times out 4 times, it calls the script once more to force a spontaneous reboot of the host (via: echo b > /proc/sysrq-trigger).

KVMHAChecker is responsible for triggering the script to read the heartbeat value and compare it with the current timestamp. Where ALL NFS volumes are determined to be "DEAD" (i.e. the timestamp is older than 60 seconds), the host is, presumably, reported as down so that HA can restart its VMs elsewhere.
================

Is my understanding correct?

The problem is, when testing this logic in my test lab (currently 4.4.4, but there have been no significant updates committed to these files since), I've been unable to see any evidence of KVMHAChecker actually executing! I see plenty of evidence of heartbeat writes (and of hypervisor reboots triggered when that process times out).

Thanks,
Rohan
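P.S. To make the above concrete, here's a minimal shell sketch of the write/check logic as I understand it. This is NOT the real kvmheartbeat.sh (which takes mount-point/host arguments and also handles remounting and VM cleanup); the file path and threshold below are illustrative stand-ins only.

```shell
# Hypothetical, simplified sketch of the heartbeat write/check logic.
# HB_FILE stands in for the per-host heartbeat file on an NFS pool.

HB_FILE="${HB_FILE:-/tmp/KVMHA/hb-myhost}"   # illustrative path, not the real one
TIMEOUT=60                                    # staleness threshold in seconds

write_heartbeat() {
    # KVMHAMonitor path: stamp the current epoch time onto the pool every minute
    mkdir -p "$(dirname "$HB_FILE")"
    date +%s > "$HB_FILE"
}

check_heartbeat() {
    # KVMHAChecker path: read the stamp and compare against "now"
    local now hb
    now=$(date +%s)
    hb=$(cat "$HB_FILE" 2>/dev/null || echo 0)
    if [ $((now - hb)) -gt "$TIMEOUT" ]; then
        echo "DEAD"     # stamp older than 60s => this pool is considered dead
    else
        echo "ALIVE"
    fi
}

# The "shoot the node in the head" step, taken after repeated write timeouts,
# would then be (obviously not run here):
#   echo b > /proc/sysrq-trigger
```

In the real code the reboot is only triggered after the monitor's write has timed out several times, and (as I read it) the checker only declares the host dead when every pool's stamp is stale, not just one.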