Hi Vadim,

I'm not sure what the problem is. I do know, however, that when shared storage is 
used, both CloudStack and XenServer will fence (reboot) the box to prevent 
corruption when it loses access to the network or the storage. What 
storage do you use?

What does this return on a XenServer?
xe pool-list params=all | grep -E "ha-enabled|ha-config"

HA should be on; otherwise the pool will not recover properly from a hypervisor crash.
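If you want to script that check, a small helper can interpret the ha-enabled value. A minimal sketch -- the helper name is made up here, and the commented-out xe call is just the usual XenServer 6.x way to fetch the flag:

```shell
# check_ha: interpret the pool's ha-enabled flag ("true"/"false").
check_ha() {
    if [ "$1" = "true" ]; then
        echo "HA enabled"
    else
        echo "HA disabled: a crashed host will not fail over cleanly"
    fi
}

# On a XenServer host you would feed it the real value, e.g.:
# check_ha "$(xe pool-param-get uuid="$(xe pool-list --minimal)" param-name=ha-enabled)"
check_ha false
```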

If you search the logs for "Fence" or "reboot", does anything come back?
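For example, something like this on each host (the log paths are the usual XenServer 6.x locations; adjust to your setup):

```shell
# Search the XenServer logs for traces of fencing around the reboot time.
# /var/log/xensource.log and /var/log/messages are the usual suspects.
grep -iE "fence|reboot" /var/log/xensource.log /var/log/messages 2>/dev/null | tail -n 50
```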

The logs you mention are nothing to worry about.

Can you tell us in some more detail what happens and how we can reproduce it?

Regards,
Remi


-----Original Message-----
From: Vadim Kimlaychuk [mailto:va...@kickcloud.net] 
Sent: Sunday, 13 September 2015 9:32
To: users@cloudstack.apache.org
Cc: Remi Bergsma
Subject: Re: CS 4.5.2: all hosts reboot after 3 days at production

Hello Remi,

         This issue has nothing to do with CS 4.5.2. We got a host reboot after 
precisely one week with the previous version of CS (4.5.1), which had been 
running without a restart for 106 days before that. So it is not a software issue.

         What really makes me unhappy is that an accidental host reboot made the 
entire cluster unusable. The CloudStack management server was up and running, the 
second cluster node was up and running the whole time, and VMs were transferred 
to the second host, but the System VMs were not rebooted properly by CS and half 
of the network was down. The SSVM and CPVM were in "disconnected" status. Client 
VMs were up, but couldn't connect to storage, because the VRs were offline. An 
entire mess.

         I have used planned maintenance mode before and the cluster worked just 
fine. We didn't have a single second of downtime. But with an accidental reboot 
the clustering is of no use. :(

Vadim.

On 2015-09-08 09:35, Vadim Kimlaychuk wrote:

> Hello Remi,
> 
> First of all, I don't have a /var/log/xha.log file. I have examined the logs 
> in detail and haven't found any trace that the heartbeat failed. The only 
> serious problem I found in the management logs before the restart is this 
> error, repeated many times:
> 
> ------------------------------------------------------------
> 2015-09-06 00:47:21,591 DEBUG [c.c.a.m.AgentManagerImpl] (RouterMonitor-1:ctx-2d67d422) Details from executing class com.cloud.agent.api.NetworkUsageCommand: Exception: java.lang.Exception
> Message: vpc network usage plugin call failed
> Stack: java.lang.Exception: vpc network usage plugin call failed
> at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.VPCNetworkUsage(XenServer56Resource.java:172)
> at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.execute(XenServer56Resource.java:195)
> at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:62)
> at com.cloud.hypervisor.xenserver.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:87)
> at com.cloud.hypervisor.xenserver.resource.XenServer620SP1Resource.executeRequest(XenServer620SP1Resource.java:65)
> at com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:302)
> ...
> 
> ------------------------------------------------------------
> 
> Just a couple of seconds before the xcp2 host restart:
> 
> ------------------------------------------------------------
> 2015-09-06 00:48:27,884 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-83:ctx-ff822baf) Ping from 2(xcp1)
> 2015-09-06 00:48:27,884 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Process host VM state report from ping process. host: 2
> 2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Process VM state report. host: 2, number of records in report: 6
> 2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 85, power state: PowerOn
> 2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 85
> 2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 1, power state: PowerOn
> 2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 1
> 2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 2, power state: PowerOn
> 2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 2
> 2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 82, power state: PowerOn
> 2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 82
> 2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 94, power state: PowerOn
> 2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 94
> 2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 90, power state: PowerOn
> 2015-09-06 00:48:27,922 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 90
> 2015-09-06 00:48:27,928 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Done with process of VM state report. host: 2
> 2015-09-06 00:48:27,940 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-154:ctx-2e8a5911) Ping from 1(xcp2)
> 2015-09-06 00:48:27,940 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Process host VM state report from ping process. host: 1
> 2015-09-06 00:48:27,951 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Process VM state report. host: 1, number of records in report: 4
> 2015-09-06 00:48:27,951 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 100, power state: PowerOn
> 2015-09-06 00:48:27,954 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 100
> 2015-09-06 00:48:27,954 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 33, power state: PowerOn
> 2015-09-06 00:48:27,957 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 33
> 2015-09-06 00:48:27,957 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 89, power state: PowerOn
> 2015-09-06 00:48:27,960 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 89
> 2015-09-06 00:48:27,961 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 88, power state: PowerOn
> 2015-09-06 00:48:27,963 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 88
> 2015-09-06 00:48:27,968 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Done with process of VM state report. host: 1
> ------------------------------------------------------------
> 
> Vadim.
> 
> On 2015-09-07 23:18, Remi Bergsma wrote:
> 
> Hi Vadim,
> 
> What kind of storage do you use? Can you show /var/log/xha.log (I 
> think that is the name) please? It could be xen-ha that fences the box 
> if the heartbeat cannot be written.
> 
> You suggest it is CloudStack. Did you see anything in the mgt logs?
> 
> Regards, Remi
> 
> Sent from my iPhone
> 
> On 07 Sep 2015, at 08:26, Vadim Kimlaychuk <va...@kickcloud.net> wrote:
> 
> Hello all,
> 
> I have experienced an accidental cluster reboot 3 days after updating to CS 
> 4.5.2. The cluster is XenServer 6.5 with SP1. The reboot started from the 
> slave node and then the master.
> The syslog on the slave shows only this:
> 
> Sep 6 00:47:05 xcp2 last message repeated 3 times
> Sep 6 00:47:15 xcp2 xenstored: D12 write data/meminfo_free 713732
> Sep 6 00:47:15 xcp2 xenstored: A1564203 w event /local/domain/12/data/meminfo_free /local/domain/12/data/meminfo_free
> Sep 6 00:47:15 xcp2 xenstored: D12 write data/updated Sun Sep 6 00:48:55 EEST 2015
> Sep 6 00:47:15 xcp2 xenstored: A6 w event /local/domain/12/data/updated /local/domain/12/data/updated
> Sep 6 00:47:15 xcp2 xenstored: A10 w event /local/domain/12/data/updated /local/domain/12/data/updated
> Sep 6 00:47:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
> Sep 6 00:47:27 xcp2 xapi: [ info|xcp2|462044 INET 0.0.0.0:80|dispatch:host.call_plugin D:7593b578fada|taskhelper] task host.call_plugin R:ddd3cc399f86 forwarded (trackid=407f6adaa118a34f19eb1e29cd68a0e8)
> Sep 6 00:47:36 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
> Sep 6 00:48:18 xcp2 last message repeated 4 times
> Sep 6 00:48:25 xcp2 xenstored: D1 write data/meminfo_free 1740496
> Sep 6 00:48:25 xcp2 xenstored: A1564203 w event /local/domain/1/data/meminfo_free /local/domain/1/data/meminfo_free
> Sep 6 00:48:25 xcp2 xenstored: D1 write data/updated Sat Sep 5 21:50:07 EEST 2015
> Sep 6 00:48:25 xcp2 xenstored: A6 w event /local/domain/1/data/updated /local/domain/1/data/updated
> Sep 6 00:48:25 xcp2 xenstored: A10 w event /local/domain/1/data/updated /local/domain/1/data/updated
> Sep 6 00:48:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
> Sep 6 00:48:27 xcp2 xapi: [ info|xcp2|462044 INET 0.0.0.0:80|dispatch:host.call_plugin D:f2c8987bc0ff|taskhelper] task host.call_plugin R:b62d2d4f58eb forwarded (trackid=e3d4ea00c96194830a7dbbfc35563a3c)
> Sep 6 00:48:38 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
> Sep 06 00:48:48 xcp2 syslogd 1.4.1: restart.
> Sep 6 00:48:48 xcp2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
> Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Initializing cgroup subsys cpuset
> Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Initializing cgroup subsys cpu
> Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
> Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Linux version 3.10.0+2
> 
> Can anyone help with diagnostics?
> 
> Thank you,
> 
> Vadim.
