Dear all,

We are using CloudStack 4.2.0 with the KVM hypervisor and Ceph RBD storage. We have been hitting a specific problem for quite some time (possibly since the first day we used CloudStack), which we suspect is related to HA.
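For reference, the excerpts below were collected roughly as follows. The log paths assume the default packaging layout, and the instance name and hostname are the same examples that appear in the logs further down; we use virsh on the host to confirm the guests are still running while the agent is disconnected:

=====
# On the disconnected KVM host: confirm that the guests are in fact still
# running, even while the management server starts reporting them as Stopped.
virsh list --all | grep i-1082-3086-VM

# Pull the per-VM history out of the management server and agent logs
# (default log locations; adjust if your layout differs).
grep 'vm-hostname' /var/log/cloudstack/management/management-server.log
grep 'i-1082-3086-VM' /var/log/cloudstack/agent/agent.log
=====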
When a CloudStack agent gets disconnected from the management server for any reason, CloudStack gradually marks some or all of the VMs on the disconnected host as "Stopped", even though they are actually still running on that host. When we try to reconnect the agent, CloudStack seems to instruct the agent to stop those VMs first: while the host is in the "Connecting" state it is busy shutting down the VMs one by one, before the host can reach the "Up" state. This brings down all the VMs on the host with the disconnected agent unnecessarily, even though technically they could stay up while the agent reconnects to the management server.

Is there a way we can prevent CloudStack from shutting down the VMs during agent re-connection? Relevant logs from the management server and the agent are below; it seems HA is the culprit. Any advice is appreciated.

Excerpts from the management server logs -- in the example below, the hostname of the affected VM on the disconnected host is "vm-hostname", and the following is the result of grepping "vm-hostname" from the logs:

=====
2016-04-30 23:24:32,680 INFO [cloud.ha.HighAvailabilityManagerImpl] (Timer-1:null) Schedule vm for HA: VM[User|vm-hostname]
2016-04-30 23:24:35,565 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) HA on VM[User|vm-hostname]
2016-04-30 23:24:35,571 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-1:work-11007) Unable to reach the agent for VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with specified id is not in the right state: Disconnected
2016-04-30 23:24:35,571 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) SimpleInvestigator found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,571 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) XenServerInvestigator found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,571 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) testing if VM[User|vm-hostname] is alive
2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) VM[User|vm-hostname] could not be pinged, returning that it is unknown
2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-1:work-11007) Returning null since we're unable to determine state of VM[User|vm-hostname]
2016-04-30 23:24:35,581 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Not a System Vm, unable to determine state of VM[User|vm-hostname] returning null
2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Testing if VM[User|vm-hostname] is alive
2016-04-30 23:24:35,586 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-1:work-11007) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|vm-hostname] returning null
2016-04-30 23:24:35,586 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,588 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) KVMInvestigator found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,592 DEBUG [cloud.ha.KVMFencer] (HA-Worker-1:work-11007) Unable to fence off VM[User|vm-hostname] on Host[-34-Routing]
2016-04-30 23:24:35,592 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-11007) We were unable to fence off the VM VM[User|vm-hostname]
2016-04-30 23:24:35,592 WARN [apache.cloudstack.alerts] (HA-Worker-1:work-11007) alertType:: 8 // dataCenterId:: 6 // podId:: 6 // clusterId:: null // message:: Unable to restart vm-hostname which was running on host name: hypervisor-host(id:34), availability zone: xxxxxxxxxx-Singapore-01, pod: xxxxxxxxxx-Singapore-Pod-01
2016-04-30 23:24:41,028 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-4:null) Both states are Running for VM[User|vm-hostname]
=====

The above keeps looping until, at some point, the management server decides to do a force stop, as follows:

=====
2016-05-01 00:30:23,305 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) HA on VM[User|vm-hostname]
2016-05-01 00:30:23,311 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-3:work-11249) Unable to reach the agent for VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with specified id is not in the right state: Disconnected
2016-05-01 00:30:23,311 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) SimpleInvestigator found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:23,311 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) XenServerInvestigator found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:23,311 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) testing if VM[User|vm-hostname] is alive
2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) VM[User|vm-hostname] could not be pinged, returning that it is unknown
2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-3:work-11249) Returning null since we're unable to determine state of VM[User|vm-hostname]
2016-05-01 00:30:35,499 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Not a System Vm, unable to determine state of VM[User|vm-hostname] returning null
2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Testing if VM[User|vm-hostname] is alive
2016-05-01 00:30:35,505 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-3:work-11249) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|vm-hostname] returning null
2016-05-01 00:30:35,505 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,558 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11249) KVMInvestigator found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,688 WARN [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Unable to actually stop VM[User|vm-hostname] but continue with release because it's a force stop
2016-05-01 00:30:35,693 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) VM[User|vm-hostname] is stopped on the host. Proceeding to release resource held.
2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Successfully released network resources for the vm VM[User|vm-hostname]
2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-3:work-11249) Successfully released storage resources for the vm VM[User|vm-hostname]
2016-05-01 00:31:38,426 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11183) HA on VM[User|vm-hostname]
2016-05-01 00:31:38,426 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:work-11183) VM VM[User|vm-hostname] has been changed. Current State = Stopped Previous State = Running last updated = 113 previous updated = 111
=====

Below are excerpts from the corresponding agent.log; note that i-1082-3086-VM is the instance name of the example VM "vm-hostname" above:

=====
2016-04-30 23:24:36,592 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-1082-3086-VM
=====

After the management server marks the VM as stopped, the agent tries to shut down the VM once it reconnects to the management server:

=====
2016-05-01 00:32:32,029 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) Processing command: com.cloud.agent.api.StopCommand
2016-05-01 00:32:32,063 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py destroy_network_rules_for_vm --vmname i-1082-3086-VM --vif vnet11
2016-05-01 00:32:32,195 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Execution is successful.
2016-05-01 00:32:32,196 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Try to stop the vm at first
=====

and

=====
2016-05-01 00:33:04,835 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) successfully shut down vm i-1082-3086-VM
2016-05-01 00:33:04,836 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) Executing: /bin/bash -c ls /sys/class/net/breth1-8/brif | grep vnet
2016-05-01 00:33:04,847 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) Execution is successful.
=====

Our questions:

- Is there a way for us to prevent the above scenario from happening?
- Is disabling HA on the VMs the only way to prevent it?
- We understand that disabling HA requires applying a new service offering to each VM and restarting the VM for the change to take effect (a rough sketch of what we mean is in the PS below). Is there a way to disable HA globally without changing the service offering for each VM?
- Is it possible to avoid the above scenario without disabling HA and losing the HA features and functionality?

Any advice is greatly appreciated. Looking forward to your reply, thank you.

Cheers.
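PS: To clarify the third question above, this is the kind of per-VM change we were hoping to avoid scripting for every instance. It is only a sketch based on our reading of the API docs, driven through the cloudmonkey CLI; the offering parameters and the IDs are placeholders, not our real values:

=====
# Create an equivalent compute offering with HA disabled (offerha=false),
# then move each VM onto it -- which, as far as we understand, requires the
# VM to be stopped first. All values and IDs below are placeholders.
cloudmonkey create serviceoffering name=2vCPU-4GB-noHA displaytext=2vCPU-4GB-noHA \
    cpunumber=2 cpuspeed=2000 memory=4096 offerha=false

cloudmonkey stop virtualmachine id=<vm-uuid>
cloudmonkey change serviceforvirtualmachine id=<vm-uuid> serviceofferingid=<new-offering-uuid>
cloudmonkey start virtualmachine id=<vm-uuid>
=====

Doing this across every HA-enabled VM means a stop/start per VM, which is exactly the disruption we are trying to avoid; hence the question about a global switch.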