Re: [ovirt-users] [hosted-engine] engine VM doesn't respawn when its host was killed (poweroff)

2016-05-04 Thread Wee Sritippho


On 4 May 2016 18:48:25 GMT+07:00, Martin Sivak wrote:
>Please check https://bugzilla.redhat.com/show_bug.cgi?id=1332813 to
>see if it might be the same issue.

Yes, exactly the same issue.

Thank you.


Re: [ovirt-users] [hosted-engine] engine VM doesn't respawn when its host was killed (poweroff)

2016-05-04 Thread Martin Sivak
Hi,

you have an ISO domain inside the hosted engine VM, don't you?

MainThread::INFO::2016-05-04
12:28:47,090::ovf_store::109::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2016-05-04
12:38:47,504::ovf_store::116::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
OVF_STORE volume path:
/rhev/data-center/mnt/blockSD/d2dad0e9-4f7d-41d6-b61c-487d44ae6d5d/images/157b67ef-1a29-4e51-9396-79d3425b7871/a394b440-91bb-4c7c-b344-146240d66a43

There is a 10-minute gap between those two log lines; normally we log
something every 10 seconds.
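
A rough sketch to spot such gaps (it assumes the timestamp is the third
'::'-separated field, as in the lines above, and ignores midnight rollover):

awk -F'::' '/^MainThread::/ {
    split($3, dt, " "); split(dt[2], t, "[:,]")   # dt[2] is HH:MM:SS,mmm
    now = t[1]*3600 + t[2]*60 + t[3]
    if (prev != "" && now - prev > 60)            # flag gaps over a minute
        print "gap of " (now - prev) "s before: " $0
    prev = now
}' /var/log/ovirt-hosted-engine-ha/agent.log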

Please check https://bugzilla.redhat.com/show_bug.cgi?id=1332813 to
see if it might be the same issue.

Regards

--
Martin Sivak
SLA / oVirt



Re: [ovirt-users] [hosted-engine] engine VM doesn't respawn when its host was killed (poweroff)

2016-05-04 Thread Wee Sritippho

I've tried again and made sure all hosts have the same clock.

After adding all 3 hosts, I tested by shutting down host01. The engine
was restarted on host02 in less than 2 minutes. I then enabled and tested
power management on all hosts (using ilo4) and disabled host02's network
to test fencing. I waited about 5 minutes and saw in the console that
host02 wasn't fenced, so I assumed fencing didn't work and re-enabled the
network. host02 was then fenced immediately after the network came back
(I don't know why), and the engine was never restarted, even after host02
was up and running again. I had to start the engine VM manually by
running "hosted-engine --vm-start" on host02.

I thought it might have something to do with ilo4, so I disabled power
management on all hosts and powered off host02 again. After about 10
minutes the engine still hadn't started, so I started it manually on
host01 instead.


Here are my recent actions:

2016-05-04 12:25:51 ICT - ran hosted-engine --vm-status on host01; the VM is running on host01
2016-05-04 12:28:32 ICT - ran reboot on host01; the engine VM is down
2016-05-04 12:34:57 ICT - ran hosted-engine --vm-status on host01; engine status on every host is "unknown stale-data", host01's score=0, stopped=true
2016-05-04 12:37:30 ICT - host01 is pingable
2016-05-04 12:41:09 ICT - ran hosted-engine --vm-status on host02; engine status on every host is "unknown stale-data", all hosts' score=3400, stopped=false
2016-05-04 12:43:29 ICT - ran hosted-engine --vm-status on host02; the VM is running on host01


Log files: https://app.box.com/s/jjgn14onv19e1qi82mkf24jl2baa2l9s
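
A rough one-liner to keep an eye on the state from each host while testing:

watch -n 10 "hosted-engine --vm-status | egrep 'up-to-date|Hostname|Engine status|Score'"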


Re: [ovirt-users] [hosted-engine] engine VM doesn't respawn when its host was killed (poweroff)

2016-05-01 Thread Yedidyah Bar David
It's very hard to understand your flow when time moves backwards.

Please try again from a clean state. Make sure all hosts have the same clock.
Then document the exact time you do stuff - starting/stopping a host,
checking status, etc.

Some things to check from your logs:

in agent.host01.log:

MainThread::INFO::2016-04-25
15:32:41,370::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down and local host has best score (3400), attempting to start
engine VM
...
MainThread::INFO::2016-04-25
15:32:44,276::hosted_engine::1147::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
Engine VM started on localhost
...
MainThread::INFO::2016-04-25
15:32:58,478::states::672::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Score is 0 due to unexpected vm shutdown at Mon Apr 25 15:32:58 2016

Why?

Also, in agent.host03.log:

MainThread::INFO::2016-04-25
15:29:53,218::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down and local host has best score (3400), attempting to start
engine VM
MainThread::INFO::2016-04-25
15:29:53,223::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1461572993.22 type=state_transition
detail=EngineDown-EngineStart hostname='host03.ovirt.forest.go.th'
MainThread::ERROR::2016-04-25
15:30:23,253::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate)
Connection closed: Connection timed out

Why?

Also, in addition to the actions you stated, you changed maintenance mode a lot.

You can try something like this to get some interesting lines from agent.log:

egrep -i 'start eng|shut|vm started|vm running|vm is running on|
maintenance detected|migra'
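
For example, against the agent log on each host (assuming the default log
location):

egrep -i 'start eng|shut|vm started|vm running|vm is running on|maintenance detected|migra' \
    /var/log/ovirt-hosted-engine-ha/agent.log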

Best,


Re: [ovirt-users] [hosted-engine] engine VM doesn't respawn when its host was killed (poweroff)

2016-04-25 Thread Wee Sritippho

The hosted engine storage is located in an external Fibre Channel SAN.


Re: [ovirt-users] [hosted-engine] engine VM doesn't respawn when its host was killed (poweroff)

2016-04-25 Thread Martin Sivak
Hi,

it seems that all nodes lost access to storage for some reason after
the host was killed. Where is your hosted engine storage located?
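
A quick sanity check you can run on each host, since block storage appears
under /rhev/data-center/mnt/blockSD (a sketch, not a full diagnosis):

ls /rhev/data-center/mnt/blockSD/    # the hosted-engine SD UUID should appear here
sanlock client status                # lockspaces this host currently holds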

Regards

--
Martin Sivak
SLA / oVirt



[ovirt-users] [hosted-engine] engine VM doesn't respawn when its host was killed (poweroff)

2016-04-25 Thread Wee Sritippho

Hi,

From the hosted-engine FAQ, the engine VM should be up and running in
about 5 minutes after its host is forcibly powered off. However, after
updating oVirt from 3.6.4 to 3.6.5, the engine VM won't restart
automatically even after 10+ minutes (I already made sure that global
maintenance mode is set to none). I initially thought it was a time-sync
issue, so I installed and enabled NTP on the hosts and the engine, but
the issue still persists.
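
Roughly the checks involved (a sketch; assuming CentOS 7 hosts with ntpd
rather than chrony):

# global maintenance: must be 'none' for the agents to restart the VM
hosted-engine --set-maintenance --mode=none
hosted-engine --vm-status

# time sync on every host and on the engine VM
yum install -y ntp
systemctl enable ntpd && systemctl start ntpd
ntpq -p    # verify peers are reachable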


###Versions:
[root@host01 ~]# rpm -qa | grep ovirt
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-vmconsole-1.0.0-1.el7.centos.noarch
ovirt-vmconsole-host-1.0.0-1.el7.centos.noarch
ovirt-hosted-engine-ha-1.3.5.3-1.el7.centos.noarch
ovirt-host-deploy-1.4.1-1.el7.centos.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7.centos.noarch
ovirt-hosted-engine-setup-1.3.5.0-1.el7.centos.noarch
ovirt-release36-007-1.noarch
ovirt-setup-lib-1.0.1-1.el7.centos.noarch
[root@host01 ~]# rpm -qa | grep vdsm
vdsm-infra-4.17.26-0.el7.centos.noarch
vdsm-jsonrpc-4.17.26-0.el7.centos.noarch
vdsm-gluster-4.17.26-0.el7.centos.noarch
vdsm-python-4.17.26-0.el7.centos.noarch
vdsm-yajsonrpc-4.17.26-0.el7.centos.noarch
vdsm-4.17.26-0.el7.centos.noarch
vdsm-cli-4.17.26-0.el7.centos.noarch
vdsm-xmlrpc-4.17.26-0.el7.centos.noarch
vdsm-hook-vmfex-dev-4.17.26-0.el7.centos.noarch

###Log files:
https://app.box.com/s/fkurmwagogwkv5smkwwq7i4ztmwf9q9r

###After host02 was killed:
[root@host03 wees]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : host01.ovirt.forest.go.th
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 396766e0
Host timestamp                     : 4391


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : host02.ovirt.forest.go.th
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 3a345b65
Host timestamp                     : 1458


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : host03.ovirt.forest.go.th
Host ID                            : 3
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 4c34b0ed
Host timestamp                     : 11958

###After host02 was killed for a while:
[root@host03 wees]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : host01.ovirt.forest.go.th
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 72e4e418
Host timestamp                     : 4415


--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : host02.ovirt.forest.go.th
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 3a345b65
Host timestamp                     : 1458


--== Host 3 status ==--

Status up-to-date                  : False
Hostname                           : host03.ovirt.forest.go.th
Host ID                            : 3
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 4c34b0ed
Host timestamp                     : 11958

###After host02 was up again completely:
[root@host03 wees]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : host01.ovirt.forest.go.th
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : f5728fca
Host timestamp                     : 


--== Host 2 status ==--