Hi,

you have an ISO domain inside the hosted engine VM, don't you?

MainThread::INFO::2016-05-04 12:28:47,090::ovf_store::109::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2016-05-04 12:38:47,504::ovf_store::116::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/blockSD/d2dad0e9-4f7d-41d6-b61c-487d44ae6d5d/images/157b67ef-1a29-4e51-9396-79d3425b7871/a394b440-91bb-4c7c-b344-146240d66a43

There is a 10-minute gap between these two log lines, yet we log something every 10 seconds. Please check https://bugzilla.redhat.com/show_bug.cgi?id=1332813 to see if it might be the same issue.
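If you want to scan the whole agent log for similar silent periods, something like this should work with GNU awk (a rough sketch, untested; it assumes every line starts with the usual "MainThread::LEVEL::<timestamp>::" prefix shown above):

  # report any gap longer than 60s between consecutive agent.log entries
  gawk -F'::' '{
      ts = $3; sub(/,.*/, "", ts); gsub(/[-:]/, " ", ts)
      now = mktime(ts)
      if (now < 0) next                # skip lines without a timestamp
      if (prev && now - prev > 60)
          print "gap of " (now - prev) "s before: " $0
      prev = now
  }' agent.log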
Regards

--
Martin Sivak
SLA / oVirt

On Wed, May 4, 2016 at 8:34 AM, Wee Sritippho <we...@forest.go.th> wrote:
> I've tried again and made sure all hosts have the same clock.
>
> After adding all 3 hosts, I tested the setup by shutting down host01. The
> engine was restarted on host02 in less than 2 minutes. I enabled and
> tested power management on all hosts (using iLO4), then tried disabling
> host02's network to test the fencing. I waited for about 5 minutes and saw
> in the console that host02 wasn't fenced. I thought the fencing didn't
> work and enabled the network again. host02 was then fenced immediately
> after the network was enabled (I don't know why) and the engine was never
> restarted, even when host02 was up and running again. I had to start the
> engine VM manually by running "hosted-engine --vm-start" on host02.
>
> I thought it might have something to do with iLO4, so I disabled power
> management for all hosts and tried to power off host02 again. After about
> 10 minutes, the engine still wouldn't start, so I manually started it on
> host01 instead.
>
> Here are my recent actions:
>
> 2016-05-04 12:25:51 ICT - ran hosted-engine --vm-status on host01; the VM
> is running on host01
> 2016-05-04 12:28:32 ICT - ran reboot on host01; the engine VM is down
> 2016-05-04 12:34:57 ICT - ran hosted-engine --vm-status on host01; engine
> status on every host is "unknown stale-data", host01's score=0,
> stopped=true
> 2016-05-04 12:37:30 ICT - host01 is pingable
> 2016-05-04 12:41:09 ICT - ran hosted-engine --vm-status on host02; engine
> status on every host is "unknown stale-data", all hosts' score=3400,
> stopped=false
> 2016-05-04 12:43:29 ICT - ran hosted-engine --vm-status on host02; the VM
> is running on host01
>
> Log files: https://app.box.com/s/jjgn14onv19e1qi82mkf24jl2baa2l9s
>
> On 1/5/2559 19:32, Yedidyah Bar David wrote:
>>
>> It's very hard to understand your flow when time moves backwards.
>>
>> Please try again from a clean state. Make sure all hosts have the same
>> clock. Then document the exact time you do things - starting/stopping a
>> host, checking status, etc.
>>
>> Some things to check from your logs:
>>
>> In agent.host01.log:
>>
>> MainThread::INFO::2016-04-25 15:32:41,370::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
>> Engine down and local host has best score (3400), attempting to start
>> engine VM
>> ...
>> MainThread::INFO::2016-04-25 15:32:44,276::hosted_engine::1147::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>> Engine VM started on localhost
>> ...
>> MainThread::INFO::2016-04-25 15:32:58,478::states::672::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>> Score is 0 due to unexpected vm shutdown at Mon Apr 25 15:32:58 2016
>>
>> Why?
>>
>> Also, in agent.host03.log:
>>
>> MainThread::INFO::2016-04-25 15:29:53,218::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
>> Engine down and local host has best score (3400), attempting to start
>> engine VM
>> MainThread::INFO::2016-04-25 15:29:53,223::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>> Trying: notify time=1461572993.22 type=state_transition
>> detail=EngineDown-EngineStart hostname='host03.ovirt.forest.go.th'
>> MainThread::ERROR::2016-04-25 15:30:23,253::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate)
>> Connection closed: Connection timed out
>>
>> Why?
>>
>> Also, in addition to the actions you stated, you changed the maintenance
>> mode a lot.
>>
>> You can try something like this to get the interesting lines from
>> agent.log:
>>
>> egrep -i 'start eng|shut|vm started|vm running|vm is running on| maintenance detected|migra' agent.log
>>
>> Best,
>>
>> On Mon, Apr 25, 2016 at 12:27 PM, Wee Sritippho <we...@forest.go.th> wrote:
>>>
>>> The hosted engine storage is located on an external Fibre Channel SAN.
>>>
>>> On 25/4/2559 16:19, Martin Sivak wrote:
>>>>
>>>> Hi,
>>>>
>>>> it seems that all nodes lost access to the storage for some reason
>>>> after the host was killed. Where is your hosted engine storage located?
>>>>
>>>> Regards
>>>>
>>>> --
>>>> Martin Sivak
>>>> SLA / oVirt
>>>>
>>>> On Mon, Apr 25, 2016 at 10:58 AM, Wee Sritippho <we...@forest.go.th> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> According to the hosted-engine FAQ, the engine VM should be up and
>>>>> running within about 5 minutes after its host is forcibly powered
>>>>> off. However, after updating oVirt from 3.6.4 to 3.6.5, the engine VM
>>>>> won't restart automatically even after 10+ minutes (I already made
>>>>> sure that global maintenance mode is set to none). I initially
>>>>> thought it was a time-sync issue, so I installed and enabled ntp on
>>>>> the hosts and the engine. However, the issue still persists.
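>>>>>
>>>>> (A quick way to compare the hosts' clocks from one machine - just a
>>>>> sketch, assuming passwordless SSH and the host names used here:)
>>>>>
>>>>> for h in host01 host02 host03; do
>>>>>     # print each host's name next to its epoch time
>>>>>     ssh "$h" 'echo "$(hostname): $(date +%s)"'
>>>>> done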
>>>>>
>>>>> ###Versions:
>>>>> [root@host01 ~]# rpm -qa | grep ovirt
>>>>> libgovirt-0.3.3-1.el7_2.1.x86_64
>>>>> ovirt-vmconsole-1.0.0-1.el7.centos.noarch
>>>>> ovirt-vmconsole-host-1.0.0-1.el7.centos.noarch
>>>>> ovirt-hosted-engine-ha-1.3.5.3-1.el7.centos.noarch
>>>>> ovirt-host-deploy-1.4.1-1.el7.centos.noarch
>>>>> ovirt-engine-sdk-python-3.6.5.0-1.el7.centos.noarch
>>>>> ovirt-hosted-engine-setup-1.3.5.0-1.el7.centos.noarch
>>>>> ovirt-release36-007-1.noarch
>>>>> ovirt-setup-lib-1.0.1-1.el7.centos.noarch
>>>>> [root@host01 ~]# rpm -qa | grep vdsm
>>>>> vdsm-infra-4.17.26-0.el7.centos.noarch
>>>>> vdsm-jsonrpc-4.17.26-0.el7.centos.noarch
>>>>> vdsm-gluster-4.17.26-0.el7.centos.noarch
>>>>> vdsm-python-4.17.26-0.el7.centos.noarch
>>>>> vdsm-yajsonrpc-4.17.26-0.el7.centos.noarch
>>>>> vdsm-4.17.26-0.el7.centos.noarch
>>>>> vdsm-cli-4.17.26-0.el7.centos.noarch
>>>>> vdsm-xmlrpc-4.17.26-0.el7.centos.noarch
>>>>> vdsm-hook-vmfex-dev-4.17.26-0.el7.centos.noarch
>>>>>
>>>>> ###Log files:
>>>>> https://app.box.com/s/fkurmwagogwkv5smkwwq7i4ztmwf9q9r
>>>>>
>>>>> ###After host02 was killed:
>>>>> [root@host03 wees]# hosted-engine --vm-status
>>>>>
>>>>> --== Host 1 status ==--
>>>>>
>>>>> Status up-to-date          : True
>>>>> Hostname                   : host01.ovirt.forest.go.th
>>>>> Host ID                    : 1
>>>>> Engine status              : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>> Score                      : 3400
>>>>> stopped                    : False
>>>>> Local maintenance          : False
>>>>> crc32                      : 396766e0
>>>>> Host timestamp             : 4391
>>>>>
>>>>> --== Host 2 status ==--
>>>>>
>>>>> Status up-to-date          : True
>>>>> Hostname                   : host02.ovirt.forest.go.th
>>>>> Host ID                    : 2
>>>>> Engine status              : {"health": "good", "vm": "up", "detail": "up"}
>>>>> Score                      : 0
>>>>> stopped                    : True
>>>>> Local maintenance          : False
>>>>> crc32                      : 3a345b65
>>>>> Host timestamp             : 1458
>>>>>
>>>>> --== Host 3 status ==--
>>>>>
>>>>> Status up-to-date          : True
>>>>> Hostname                   : host03.ovirt.forest.go.th
>>>>> Host ID                    : 3
>>>>> Engine status              : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>> Score                      : 3400
>>>>> stopped                    : False
>>>>> Local maintenance          : False
>>>>> crc32                      : 4c34b0ed
>>>>> Host timestamp             : 11958
>>>>>
>>>>> ###After host02 was killed for a while:
>>>>> [root@host03 wees]# hosted-engine --vm-status
>>>>>
>>>>> --== Host 1 status ==--
>>>>>
>>>>> Status up-to-date          : False
>>>>> Hostname                   : host01.ovirt.forest.go.th
>>>>> Host ID                    : 1
>>>>> Engine status              : unknown stale-data
>>>>> Score                      : 3400
>>>>> stopped                    : False
>>>>> Local maintenance          : False
>>>>> crc32                      : 72e4e418
>>>>> Host timestamp             : 4415
>>>>>
>>>>> --== Host 2 status ==--
>>>>>
>>>>> Status up-to-date          : False
>>>>> Hostname                   : host02.ovirt.forest.go.th
>>>>> Host ID                    : 2
>>>>> Engine status              : unknown stale-data
>>>>> Score                      : 0
>>>>> stopped                    : True
>>>>> Local maintenance          : False
>>>>> crc32                      : 3a345b65
>>>>> Host timestamp             : 1458
>>>>>
>>>>> --== Host 3 status ==--
>>>>>
>>>>> Status up-to-date          : False
>>>>> Hostname                   : host03.ovirt.forest.go.th
>>>>> Host ID                    : 3
>>>>> Engine status              : unknown stale-data
>>>>> Score                      : 3400
>>>>> stopped                    : False
>>>>> Local maintenance          : False
>>>>> crc32                      : 4c34b0ed
>>>>> Host timestamp             : 11958
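>>>>>
>>>>> (If I understand correctly, "unknown stale-data" means that host's
>>>>> metadata on the shared storage is no longer being refreshed; whether
>>>>> the HA services are still alive can be checked on each host with:)
>>>>>
>>>>> systemctl status ovirt-ha-agent ovirt-ha-broker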
{"reason": "vm not running on this >>>>> host", "health": "bad", "vm": "down", "detail": "unknown"} >>>>> Score : 0 >>>>> stopped : False >>>>> Local maintenance : False >>>>> crc32 : f5728fca >>>>> Host timestamp : 5555 >>>>> >>>>> >>>>> --== Host 2 status ==-- >>>>> >>>>> Status up-to-date : True >>>>> Hostname : host02.ovirt.forest.go.th >>>>> Host ID : 2 >>>>> Engine status : {"health": "good", "vm": "up", >>>>> "detail": "up"} >>>>> Score : 3400 >>>>> stopped : False >>>>> Local maintenance : False >>>>> crc32 : e5284763 >>>>> Host timestamp : 715 >>>>> >>>>> >>>>> --== Host 3 status ==-- >>>>> >>>>> Status up-to-date : True >>>>> Hostname : host03.ovirt.forest.go.th >>>>> Host ID : 3 >>>>> Engine status : {"reason": "vm not running on this >>>>> host", "health": "bad", "vm": "down", "detail": "unknown"} >>>>> Score : 3400 >>>>> stopped : False >>>>> Local maintenance : False >>>>> crc32 : bc10c7fc >>>>> Host timestamp : 13119 >>>>> >>>>> -- >>>>> Wee >>>>> >>>>> _______________________________________________ >>>>> Users mailing list >>>>> Users@ovirt.org >>>>> http://lists.ovirt.org/mailman/listinfo/users >>> >>> >>> -- >>> วีร์ ศรีทิพโพธิ์ >>> นักวิชาการคอมพิวเตอร์ปฏิบัติการ >>> ศูนย์สารสนเทศ กรมป่าไม้ >>> โทร. 025614292-3 ต่อ 5621 >>> มือถือ. 0864678919 >>> >>> >>> _______________________________________________ >>> Users mailing list >>> Users@ovirt.org >>> http://lists.ovirt.org/mailman/listinfo/users >> >> >> > > -- > Wee > _______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users