Hi all,

I am testing HA from CS. It is not enabled at the XenServer cluster level.
This is just a simple setup, with the cluster containing two hosts, com7 and com8. Everything is working fine in the zone except HA.

com7 has only 2 VRs and 1 VM (created from an SO that has HA checked/set to yes). I go to IPMI and power off com7 manually. After 7-9 minutes CS marks the host as down, but the VR then gets stuck in a starting loop. The GUI still shows com7 as its host. The VR never comes up (and thus neither does the VM behind it, which was on com7).

My understanding is that VRs have HA enabled by default, and with HA they should be restarted on another host, which is com8 here.

I dug through the log and found the following:

2018-07-31 14:57:10,893 WARN [o.a.c.alerts] (Work-Job-Executor-46:ctx-61eadbe7 job-575/job-576 ctx-33dce069) alertType:: 9 // dataCenterId:: 1 // podId:: 1 // clusterId:: null // message:: Command: com.cloud.agent.api.check.CheckSshCommand failed while starting virtual router
2018-07-31 14:57:10,901 WARN [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-46:ctx-61eadbe7 job-575/job-576 ctx-33dce069) Command: com.cloud.agent.api.check.CheckSshCommand failed while starting virtual router
2018-07-31 14:57:10,901 INFO [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-46:ctx-61eadbe7 job-575/job-576 ctx-33dce069) The guru did not like the answers so stopping VM[DomainRouter|r-37-VM]
2018-07-31 14:57:10,903 DEBUG [c.c.a.m.AgentManagerImpl] (Work-Job-Executor-46:ctx-61eadbe7 job-575/job-576 ctx-33dce069) Can not send command com.cloud.agent.api.StopCommand due to Host 1 is not up
2018-07-31 14:57:10,904 WARN [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-46:ctx-61eadbe7 job-575/job-576 ctx-33dce069) Unable to stop VM[DomainRouter|r-37-VM] due to no answers
2018-07-31 14:57:10,916 DEBUG [c.c.h.HighAvailabilityManagerImpl] (Work-Job-Executor-46:ctx-61eadbe7 job-575/job-576 ctx-33dce069) Scheduled HAWork[134-ForceStop-37-Starting-Scheduled]
2018-07-31 14:57:10,918 ERROR [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-46:ctx-61eadbe7 job-575/job-576 ctx-33dce069) Failed to start instance VM[DomainRouter|r-37-VM]
2018-07-31 14:57:10,928 DEBUG [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-46:ctx-61eadbe7 job-575/job-576 ctx-33dce069) Cleaning up resources for the vm VM[DomainRouter|r-37-VM] in Starting state
.
.
2018-07-31 14:57:11,110 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-74:ctx-8ee12938 job-575) Complete async job-575, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Job failed due to exception Resource [Host:1] is unreachable: Host 1: Unable to start instance due to Unable to stop VM[DomainRouter|r-37-VM] so we are unable to retry the start operation"}

Host 1 in the log above is com7, and it is obviously down; I powered it off for the HA test. Why does the start just fail and give that as the reason? That seems odd. It should have restarted the VRs on com8.

Anyone?

--
Makrand