Hi all.

My environment is CloudStack 4.1 on KVM, CentOS 6.4 (management + node1 +
node2), with an OpenIndiana NFS server as primary and secondary storage.

I have the following problem: if I power off one hypervisor node via IPMI
(to simulate a server crash), it never goes to the Disconnected status in
management. As a result, HA-enabled VMs are not restarted on another
hypervisor node, because management believes that the dead node is still
online.

I get the following in the management server logs:

2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing:  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"Unable to ping computing host, exiting","wait":0}}] }
2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received:  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host.  Moving on.
2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
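
My reading of these log lines, as a rough Python sketch. This is not
CloudStack's actual code: the names Status, investigate and next_status are
made up for illustration, and the only thing taken from the logs is that
every investigator answers "I don't know" and the manager then does nothing.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class Status(Enum):
    UP = "Up"
    DOWN = "Down"
    DISCONNECTED = "Disconnected"


# An investigator returns a Status when it is sure, or None for "I don't know".
Investigator = Callable[[str], Optional[Status]]


def ping_based_investigator(host_ip: str) -> Optional[Status]:
    # Hypothetical stand-in for the check seen in the logs: the ping failed,
    # and a failed ping is treated as "no information", not as "host is down".
    return None


def investigate(host_ip: str, investigators: Iterable[Investigator]) -> Optional[Status]:
    # Ask each investigator in turn; the first definite answer wins.
    for check in investigators:
        result = check(host_ip)
        if result is not None:
            return result
    return None


def next_status(current: Status, host_ip: str,
                investigators: Iterable[Investigator]) -> Status:
    determined = investigate(host_ip, investigators)
    if determined is None:
        # Corresponds to "Agent state cannot be determined, do nothing".
        return current
    return determined
```

If this reading is right, a host that was Up before the crash stays Up for
as long as no investigator gives a definite answer, which matches what I see.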


If I power the dead node back on, it goes to the "Connecting" state and then
to "Up" in the management interface. Note that the old status logged at
reconnect is still Up:

2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:59:57,239 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = AgentConnected, Host id = 12, name = ad112.colobridge.net]
2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Up; event = AgentConnected; new status = Connecting; old update count = 1285; new update count = 1286]
2013-07-11 14:00:50,611 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Connecting; event = Ready; new status = Up; old update count = 1286; new update count = 1287]
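
To check when a host actually changed state, the "Agent status update"
entries can be pulled out of the log. A minimal Python sketch, assuming the
entries look exactly like the ones above; the function name status_updates
is my own, and the regular expression is only fitted to these samples.

```python
import re

# Fitted to the "Agent status update" entries quoted in this post.
STATUS_UPDATE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) DEBUG "
    r"\[cloud\.host\.Status\] \(\S+\) Agent status update: "
    r"\[id = (?P<id>\d+); name = (?P<name>[^;]+); "
    r"old status = (?P<old>\w+); event = (?P<event>\w+); "
    r"new status = (?P<new>\w+);"
)


def status_updates(log_text):
    """Return (timestamp, host id, old status, event, new status) tuples."""
    # Collapse all whitespace first, so entries that were hard-wrapped
    # (as in an email) still match.
    flat = " ".join(log_text.split())
    return [
        (m["ts"], int(m["id"]), m["old"], m["event"], m["new"])
        for m in STATUS_UPDATE.finditer(flat)
    ]
```

Calling status_updates() on the contents of the management server log gives
one tuple per status change. For host 12 the first one has an old status of
Up, so the host was never marked Disconnected or Down while it was off.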


If I restart the cloud-management service, the dead node goes to the
"Disconnected" state in the management interface.
(There is nothing special in the logs in this case.)

If I do nothing, the dead node stays in the "Up" state in the management
interface indefinitely (I waited for 12 hours), while the logs keep
repeating "Agent state cannot be determined, do nothing".

I would appreciate it if someone could help or suggest how to deal with
this problem.

-- 
Regards,
Valery

http://protocol.by/slayer
