Hi Daniel,

As per my original post, each host believed the *other* was the better candidate, with the result that neither would start the engine. As you may have read by now, the bug has been confirmed and a fix has been proposed.

Your claim that HA is working is incorrect. A system that requires manual intervention when something goes wrong is not HA.

regards,
John

On 18/08/14 19:18, Daniel Helgenberger wrote:
> Hello John,
>
>
> On Mi, 2014-07-23 at 19:47 -0400, Jason Brooks wrote:
>> ----- Original Message -----
>>> From: "John Gardeniers" <[email protected]>
>>> To: "users" <[email protected]>
>>> Sent: Wednesday, July 23, 2014 4:29:45 PM
>>> Subject: [ovirt-users] Self-hosted engine won't start
>>>
>>> Hi All,
>>>
>>> I have created a lab with 2 hypervisors and a self-hosted engine. Today
>>> I followed the upgrade instructions as described in
>>> http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I
>>> didn't really do an upgrade but simply wanted to test what would happen
>>> when the engine was rebooted.
>>>
>>> When the engine didn't restart I re-ran hosted-engine
>>> --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and
>>> ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't
>>> restarted, so I then tried rebooting both hypervisors. After an hour
>>> there was still no sign of the engine starting. The agent logs don't
>>> help me much. The following bits are repeated over and over.
>>>
>>> ovirt1 (192.168.19.20):
>>>
>>> MainThread::INFO::2014-07-24
>>> 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Trying: notify time=1406157520.27 type=state_transition
>>> detail=EngineDown-EngineDown hostname='ovirt1.om.net'
>>> MainThread::INFO::2014-07-24
>>> 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Success, was notification of state_transition (EngineDown-EngineDown)
>>> sent? ignored
>>> MainThread::INFO::2014-07-24
>>> 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>> Current state EngineDown (score: 2400)
>>> MainThread::INFO::2014-07-24
>>> 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>> Best remote host 192.168.19.21 (id: 2, score: 2400)
>>>
>>> ovirt2 (192.168.19.21):
>>>
>>> MainThread::INFO::2014-07-24
>>> 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Trying: notify time=1406157484.01 type=state_transition
>>> detail=EngineDown-EngineDown hostname='ovirt2.om.net'
>>> MainThread::INFO::2014-07-24
>>> 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Success, was notification of state_transition (EngineDown-EngineDown)
>>> sent? ignored
>>> MainThread::INFO::2014-07-24
>>> 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>> Current state EngineDown (score: 2400)
>>> MainThread::INFO::2014-07-24
>>> 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>> Best remote host 192.168.19.20 (id: 1, score: 2400)
>>>
>>> From the above information I decided to simply shut down one hypervisor
>>> and see what happens. The engine did start back up again a few minutes
>>> later.
>> I've seen this behavior, too.
>>
>> Jason
>>
>>> The interesting part is that each hypervisor seems to think the other is
>>> a better host.
> Where do you get this from? From the line:
> 'Best remote host 192.168.19.20 (id: 1, score: 2400)' ?
>
> I assume this is not the case; the HA broker is just looking for the best
> remote candidate.
>
> But I also have trouble with this behavior, esp. when I had the cluster
> in global maintenance.
> I resolve this by starting the hosted engine manually in global
> maintenance and waiting for {"health": "good", "vm": "up", "detail":
> "up"} and disabling global maintenance afterwards.
>
> I found the HA feature is indeed working, and it is best tried out by
> manually stopping the engine service (service hosted-engine stop). IIRC
> this should trigger a failover and reboot of the engine.
>
>
>>> The two machines are identical, so there's no reason I
>>> can see for this odd behaviour. In a lab environment this is little more
>>> than an annoying inconvenience. In a production environment it would be
>>> completely unacceptable.
>>>
>>> May I suggest that this issue be looked into and some means found to
>>> eliminate this kind of mutual exclusion? e.g. After a few minutes of
>>> such an issue one hypervisor could be randomly given a slightly higher
>>> weighting, which should result in it being chosen to start the engine.
>>>
>>> regards,
>>> John
>>> _______________________________________________
>>> Users mailing list
>>> [email protected]
>>> http://lists.ovirt.org/mailman/listinfo/users
>>>
>
> Cheers,
> Daniel
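
The logs quoted above show both hosts in state EngineDown with the same score (2400), each reporting the other as its best remote host. A minimal sketch of how such a tie could stall, and of one possible tie-break, might look like the following. This is not the ovirt-hosted-engine-ha code: the `Host` class and both decision functions are hypothetical, the "defer unless strictly better" rule is only an assumption that reproduces the reported symptom, and the lowest-host-id tie-break is a deterministic stand-in for the random weighting suggested in the original post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical view of one hypervisor as seen by an HA agent."""
    host_id: int
    score: int


def starts_engine_naive(local: Host, best_remote: Host) -> bool:
    # Assumed rule (not taken from the real agent): start the engine only
    # when the local score is strictly better than the best remote score.
    # With equal scores, every host defers to the other.
    return local.score > best_remote.score


def starts_engine_with_tiebreak(local: Host, best_remote: Host) -> bool:
    # Same rule, but equal scores are resolved by host id, so exactly one
    # of two tied hosts decides to start the engine.
    if local.score != best_remote.score:
        return local.score > best_remote.score
    return local.host_id < best_remote.host_id


# Values taken from the quoted logs: ids 1 and 2, both with score 2400.
ovirt1 = Host(host_id=1, score=2400)
ovirt2 = Host(host_id=2, score=2400)

for name, local, remote in (("ovirt1", ovirt1, ovirt2), ("ovirt2", ovirt2, ovirt1)):
    print(name,
          "naive:", starts_engine_naive(local, remote),
          "tiebreak:", starts_engine_with_tiebreak(local, remote))
```

Under the assumed naive rule neither host starts the engine, which matches what was observed; with the tie-break only the host with id 1 does. A deterministic tie-break needs no coordination between the hosts, whereas a random weighting would need some way to ensure both hosts do not draw the same bonus.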

