On 4 May 2016 18:48:25 GMT+07:00, Martin Sivak <msi...@redhat.com> wrote:
>Hi,
>
>you have an ISO domain inside the hosted engine VM, don't you?
>
>MainThread::INFO::2016-05-04 12:28:47,090::ovf_store::109::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>Extracting Engine VM OVF from the OVF_STORE
>MainThread::INFO::2016-05-04 12:38:47,504::ovf_store::116::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>OVF_STORE volume path: /rhev/data-center/mnt/blockSD/d2dad0e9-4f7d-41d6-b61c-487d44ae6d5d/images/157b67ef-1a29-4e51-9396-79d3425b7871/a394b440-91bb-4c7c-b344-146240d66a43
>
>There is a 10-minute gap between those two log lines, yet the agent normally logs something every 10 seconds.
>
>Please check https://bugzilla.redhat.com/show_bug.cgi?id=1332813 to see if it might be the same issue.
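A quick way to spot such stalls (a minimal sketch, assuming the default agent log location /var/log/ovirt-hosted-engine-ha/agent.log and the timestamp format shown in the excerpt above) is to flag consecutive log entries that are more than a minute apart:

    # Print a warning for every pair of consecutive agent.log entries
    # more than 60 seconds apart (a stuck agent shows up as such a gap).
    grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' \
        /var/log/ovirt-hosted-engine-ha/agent.log |
    while read -r d t; do
        now=$(date -d "$d $t" +%s)           # timestamp -> epoch seconds
        [ -n "$prev" ] && [ $((now - prev)) -gt 60 ] &&
            echo "gap of $((now - prev))s before $d $t"
        prev=$now
    done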
Yes, exactly the same issue. Thank you.

>Regards
>
>--
>Martin Sivak
>SLA / oVirt
>
>
>On Wed, May 4, 2016 at 8:34 AM, Wee Sritippho <we...@forest.go.th> wrote:
>> I've tried again and made sure all hosts have the same clock.
>>
>> After adding all 3 hosts, I tested the setup by shutting down host01. The engine was restarted on host02 in less than 2 minutes. I enabled and tested power management on all hosts (using iLO 4), then tried disabling host02's network to test fencing. I waited about 5 minutes and saw in the console that host02 wasn't fenced. I thought fencing didn't work, so I enabled the network again. host02 was then fenced immediately after the network came back (I don't know why), and the engine was never restarted, even after host02 was up and running again. I had to start the engine VM manually by running "hosted-engine --vm-start" on host02.
>>
>> I thought it might have something to do with iLO 4, so I disabled power management on all hosts and tried powering off host02 again. After about 10 minutes the engine still wouldn't start, so I started it manually on host01 instead.
>>
>> Here are my recent actions:
>>
>> 2016-05-04 12:25:51 ICT - ran hosted-engine --vm-status on host01; the VM is running on host01
>> 2016-05-04 12:28:32 ICT - ran reboot on host01; the engine VM is down
>> 2016-05-04 12:34:57 ICT - ran hosted-engine --vm-status on host01; engine status on every host is "unknown stale-data", host01's score=0, stopped=true
>> 2016-05-04 12:37:30 ICT - host01 is pingable
>> 2016-05-04 12:41:09 ICT - ran hosted-engine --vm-status on host02; engine status on every host is "unknown stale-data", all hosts' score=3400, stopped=false
>> 2016-05-04 12:43:29 ICT - ran hosted-engine --vm-status on host02; the VM is running on host01
>>
>> Log files: https://app.box.com/s/jjgn14onv19e1qi82mkf24jl2baa2l9s
>>
>>
>> On 1/5/2016 19:32, Yedidyah Bar David wrote:
>>>
>>> It's very hard to understand your flow when time moves backwards.
>>>
>>> Please try again from a clean state. Make sure all hosts have the same clock. Then document the exact time you do things - starting/stopping a host, checking status, etc.
>>>
>>> Some things to check from your logs:
>>>
>>> In agent.host01.log:
>>>
>>> MainThread::INFO::2016-04-25 15:32:41,370::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
>>> Engine down and local host has best score (3400), attempting to start engine VM
>>> ...
>>> MainThread::INFO::2016-04-25 15:32:44,276::hosted_engine::1147::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>>> Engine VM started on localhost
>>> ...
>>> MainThread::INFO::2016-04-25 15:32:58,478::states::672::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>>> Score is 0 due to unexpected vm shutdown at Mon Apr 25 15:32:58 2016
>>>
>>> Why?
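One low-effort way to document the exact time of each action, as suggested above (a sketch; he-status.log is just an arbitrary scratch file, and the egrep fields are the interesting ones from the --vm-status dumps later in this thread):

    # Stamp every status check with the wall clock, appending both to a log.
    date '+%F %T %Z' | tee -a he-status.log
    hosted-engine --vm-status |
        egrep 'Hostname|Engine status|Score|stopped' | tee -a he-status.log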
>>> Also, in agent.host03.log:
>>>
>>> MainThread::INFO::2016-04-25 15:29:53,218::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
>>> Engine down and local host has best score (3400), attempting to start engine VM
>>> MainThread::INFO::2016-04-25 15:29:53,223::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Trying: notify time=1461572993.22 type=state_transition detail=EngineDown-EngineStart hostname='host03.ovirt.forest.go.th'
>>> MainThread::ERROR::2016-04-25 15:30:23,253::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate)
>>> Connection closed: Connection timed out
>>>
>>> Why?
>>>
>>> Also, in addition to the actions you stated, you changed maintenance mode a lot.
>>>
>>> You can try something like this to pull the interesting lines out of agent.log:
>>>
>>> egrep -i 'start eng|shut|vm started|vm running|vm is running on|maintenance detected|migra' agent.log
>>>
>>> Best,
>>>
>>> On Mon, Apr 25, 2016 at 12:27 PM, Wee Sritippho <we...@forest.go.th> wrote:
>>>>
>>>> The hosted engine storage is located on an external Fibre Channel SAN.
>>>>
>>>>
>>>> On 25/4/2016 16:19, Martin Sivak wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> it seems that all nodes lost access to storage for some reason after the host was killed. Where is your hosted engine storage located?
>>>>>
>>>>> Regards
>>>>>
>>>>> --
>>>>> Martin Sivak
>>>>> SLA / oVirt
>>>>>
>>>>>
>>>>> On Mon, Apr 25, 2016 at 10:58 AM, Wee Sritippho <we...@forest.go.th> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> According to the hosted-engine FAQ, the engine VM should be up and running within about 5 minutes after its host is forcibly powered off. However, after updating oVirt 3.6.4 to 3.6.5, the engine VM won't restart automatically even after 10+ minutes (I already made sure that global maintenance mode is set to none). I initially thought it was a time-sync issue, so I installed and enabled NTP on the hosts and the engine, but the issue still persists.
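To verify that the clocks really agree after enabling NTP (a sketch, assuming you can ssh from one machine to the three hosts named in this thread):

    # Print each host's idea of "now"; the timestamps should agree
    # to within a second or two.
    for h in host01.ovirt.forest.go.th host02.ovirt.forest.go.th host03.ovirt.forest.go.th; do
        ssh "$h" 'echo "$(hostname): $(date "+%F %T %Z")"'
    done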
>>>>>> ###Versions:
>>>>>> [root@host01 ~]# rpm -qa | grep ovirt
>>>>>> libgovirt-0.3.3-1.el7_2.1.x86_64
>>>>>> ovirt-vmconsole-1.0.0-1.el7.centos.noarch
>>>>>> ovirt-vmconsole-host-1.0.0-1.el7.centos.noarch
>>>>>> ovirt-hosted-engine-ha-1.3.5.3-1.el7.centos.noarch
>>>>>> ovirt-host-deploy-1.4.1-1.el7.centos.noarch
>>>>>> ovirt-engine-sdk-python-3.6.5.0-1.el7.centos.noarch
>>>>>> ovirt-hosted-engine-setup-1.3.5.0-1.el7.centos.noarch
>>>>>> ovirt-release36-007-1.noarch
>>>>>> ovirt-setup-lib-1.0.1-1.el7.centos.noarch
>>>>>>
>>>>>> [root@host01 ~]# rpm -qa | grep vdsm
>>>>>> vdsm-infra-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-jsonrpc-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-gluster-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-python-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-yajsonrpc-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-cli-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-xmlrpc-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-hook-vmfex-dev-4.17.26-0.el7.centos.noarch
>>>>>>
>>>>>> ###Log files:
>>>>>> https://app.box.com/s/fkurmwagogwkv5smkwwq7i4ztmwf9q9r
>>>>>>
>>>>>> ###After host02 was killed:
>>>>>> [root@host03 wees]# hosted-engine --vm-status
>>>>>>
>>>>>> --== Host 1 status ==--
>>>>>>
>>>>>> Status up-to-date      : True
>>>>>> Hostname               : host01.ovirt.forest.go.th
>>>>>> Host ID                : 1
>>>>>> Engine status          : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>>> Score                  : 3400
>>>>>> stopped                : False
>>>>>> Local maintenance      : False
>>>>>> crc32                  : 396766e0
>>>>>> Host timestamp         : 4391
>>>>>>
>>>>>> --== Host 2 status ==--
>>>>>>
>>>>>> Status up-to-date      : True
>>>>>> Hostname               : host02.ovirt.forest.go.th
>>>>>> Host ID                : 2
>>>>>> Engine status          : {"health": "good", "vm": "up", "detail": "up"}
>>>>>> Score                  : 0
>>>>>> stopped                : True
>>>>>> Local maintenance      : False
>>>>>> crc32                  : 3a345b65
>>>>>> Host timestamp         : 1458
>>>>>>
>>>>>> --== Host 3 status ==--
>>>>>>
>>>>>> Status up-to-date      : True
>>>>>> Hostname               : host03.ovirt.forest.go.th
>>>>>> Host ID                : 3
>>>>>> Engine status          : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>>> Score                  : 3400
>>>>>> stopped                : False
>>>>>> Local maintenance      : False
>>>>>> crc32                  : 4c34b0ed
>>>>>> Host timestamp         : 11958
>>>>>>
>>>>>> ###After host02 was killed for a while:
>>>>>> [root@host03 wees]# hosted-engine --vm-status
>>>>>>
>>>>>> --== Host 1 status ==--
>>>>>>
>>>>>> Status up-to-date      : False
>>>>>> Hostname               : host01.ovirt.forest.go.th
>>>>>> Host ID                : 1
>>>>>> Engine status          : unknown stale-data
>>>>>> Score                  : 3400
>>>>>> stopped                : False
>>>>>> Local maintenance      : False
>>>>>> crc32                  : 72e4e418
>>>>>> Host timestamp         : 4415
>>>>>>
>>>>>> --== Host 2 status ==--
>>>>>>
>>>>>> Status up-to-date      : False
>>>>>> Hostname               : host02.ovirt.forest.go.th
>>>>>> Host ID                : 2
>>>>>> Engine status          : unknown stale-data
>>>>>> Score                  : 0
>>>>>> stopped                : True
>>>>>> Local maintenance      : False
>>>>>> crc32                  : 3a345b65
>>>>>> Host timestamp         : 1458
>>>>>>
>>>>>> --== Host 3 status ==--
>>>>>>
>>>>>> Status up-to-date      : False
>>>>>> Hostname               : host03.ovirt.forest.go.th
>>>>>> Host ID                : 3
>>>>>> Engine status          : unknown stale-data
>>>>>> Score                  : 3400
>>>>>> stopped                : False
>>>>>> Local maintenance      : False
>>>>>> crc32                  : 4c34b0ed
>>>>>> Host timestamp         : 11958
>>>>>>
>>>>>> ###After host02 was up again completely:
>>>>>> [root@host03 wees]# hosted-engine --vm-status
>>>>>>
>>>>>> --== Host 1 status ==--
>>>>>>
>>>>>> Status up-to-date      : True
>>>>>> Hostname               : host01.ovirt.forest.go.th
>>>>>> Host ID                : 1
>>>>>> Engine status          : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>>> Score                  : 0
>>>>>> stopped                : False
>>>>>> Local maintenance      : False
>>>>>> crc32                  : f5728fca
>>>>>> Host timestamp         : 5555
>>>>>>
>>>>>> --== Host 2 status ==--
>>>>>>
>>>>>> Status up-to-date      : True
>>>>>> Hostname               : host02.ovirt.forest.go.th
>>>>>> Host ID                : 2
>>>>>> Engine status          : {"health": "good", "vm": "up", "detail": "up"}
>>>>>> Score                  : 3400
>>>>>> stopped                : False
>>>>>> Local maintenance      : False
>>>>>> crc32                  : e5284763
>>>>>> Host timestamp         : 715
>>>>>>
>>>>>> --== Host 3 status ==--
>>>>>>
>>>>>> Status up-to-date      : True
>>>>>> Hostname               : host03.ovirt.forest.go.th
>>>>>> Host ID                : 3
>>>>>> Engine status          : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>>> Score                  : 3400
>>>>>> stopped                : False
>>>>>> Local maintenance      : False
>>>>>> crc32                  : bc10c7fc
>>>>>> Host timestamp         : 13119
>>>>>>
>>>>>> --
>>>>>> Wee
>>>>
>>>> --
>>>> Wee Sritippho
>>>> Computer Technical Officer
>>>> Information Center, Royal Forest Department
>>>> Tel. 025614292-3 ext. 5621
>>>> Mobile 0864678919
>>
>> --
>> Wee

--
Wee Sritippho
Computer Technical Officer
Information Center, Royal Forest Department
Tel. 025614292-3 ext. 5621
Mobile 0864678919
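When re-testing, a simple way to watch all three agents converge out of "unknown stale-data" (a sketch; the egrep fields come straight from the status dumps above):

    # Refresh the interesting status fields every 10 seconds.
    watch -n 10 "hosted-engine --vm-status | egrep 'status ==|up-to-date|Engine status|Score'"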