Re: [ovirt-users] Decrease downtime for HA

Daniel Menzel Fri, 06 Apr 2018 04:34:21 -0700

Hi Michal,

(sorry for misspelling your name in my first mail).


The settings for the VMs are the following (oVirt 4.2):

1. HA checkbox enabled of course
2. "Target Storage Domain for VM Lease" -> left empty
3. "Resume Behavior" -> AUTO_RESUME
4. Priority for Migration -> High
5. "Watchdog Model" -> No-Watchdog

For testing we did not kill any VM but the host. So basically we simulated an instantaneous crash by manually turning the machine off via the IPMI interface (not via the operating system!) and pinged the guest(s). What happens then?


1. 2-3 seconds after we press the host's shutdown button we lose
   ping contact to the VM(s).
2. After another 20s oVirt changes the host's status to "connecting",
   the VM's status is set to a question mark.
3. After ~1:30 the host is flagged as "non responsive"
4. After ~2:10 the host's reboot is initiated by oVirt, 5-10s later the
   guest is back online.
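
To make the phases easier to compare, the timeline above can be put into a small Python sketch. It assumes that the "~1:30" and "~2:10" marks are counted from the moment the host is powered off, and it reduces the ranges (2-3s, 5-10s) to their midpoints; the labels and function names are only illustrative.

```python
# Timeline from the list above, in seconds after the host was powered off.
# Assumption: "~1:30" and "~2:10" are measured from power-off, and the
# ranges "2-3s" and "5-10s" are reduced to their midpoints.
events = [
    ("ping to VM lost", 2.5),
    ("host 'connecting', VM status '?'", 22.5),   # "another 20s" after step 1
    ("host 'non responsive'", 90.0),
    ("host reboot initiated by oVirt", 130.0),
    ("guest back online", 137.5),
]


def phase_durations(events):
    """Return (from_label, to_label, seconds) for each consecutive pair."""
    return [
        (a_label, b_label, b_time - a_time)
        for (a_label, a_time), (b_label, b_time) in zip(events, events[1:])
    ]


def guest_downtime(events):
    """Seconds between losing ping and the guest answering again."""
    return events[-1][1] - events[0][1]


for start, end, secs in phase_durations(events):
    print(f"{secs:6.1f}s  {start} -> {end}")
print(f"guest unreachable for about {guest_downtime(events):.0f}s")
```

Under these assumptions, most of the time passes between "connecting" and the reboot being initiated, not during the VM restart itself.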

So, there seems to be one mistake I made in the first mail: The downtime is "only" 2.5min. But still I think this time can be decreased, as for some services it is still quite a long time.


Best
Daniel


On 06.04.2018 12:49, Michal Skrivanek wrote:

On 6 Apr 2018, at 12:45, Daniel Menzel <[email protected]> wrote:

Hi Michael,
thanks for your mail. Sorry, I forgot to write that. Yes, we have power 
management and fencing enabled on all hosts. We also tested this and found out 
that it works perfectly. So this cannot be the reason I guess.

Hi Daniel,
ok, then it’s worth looking into details. Can you describe in more detail what 
happens? What exact settings are you using for such a VM? Are you killing the HE 
VM or other VMs or both? It would be good to narrow it down a bit and then review 
the exact flow.

Thanks,
michal

Daniel



On 06.04.2018 11:11, Michal Skrivanek wrote:

On 4 Apr 2018, at 15:36, Daniel Menzel <[email protected]> wrote:

Hello,

we're successfully using a setup with 4 Nodes and a replicated Gluster for 
storage. The engine is self hosted. What we're dealing with at the moment is 
the high availability: If a node fails (for example simulated by a forced power 
loss) the engine comes back up online within ~2min. But guests (having the HA 
option enabled) come back online only after a very long grace time of ~5min. As 
we have a reliable network (40 GbE) and reliable servers I think that the 
default grace times are way too high for us - is there any possibility to 
change those values?

And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? 
Otherwise we have to resort to relatively long timeouts to make sure the host 
is really dead.
Thanks,
michal

Thanks in advance!
Daniel

_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users
