On Mon, Apr 23, 2018 at 8:06 PM, Michal Skrivanek <michal.skriva...@redhat.com> wrote:
> On 23 Apr 2018, at 10:52, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>
> Hi Michal,
>
> in your last mail you wrote that the values can be turned down - how can this be done?

AFAIK, there is no point in changing the fencing vdc_options values in that case (assuming no kdump is configured here ...).

The fencing mechanism puts the host into the "connecting" state for a grace period that depends on the number of VMs it runs and on whether it serves as SPM or not. Once the host becomes non-responsive, we first try a soft fence (restarting VDSM via ssh); this also takes time. After that, if the soft fence fails, the host is rebooted via the fencing script, and how long that takes depends entirely on the host.

If you have something to look at, it is your host's reboot time: if you can make the host reboot faster, the whole process will take less time ...

Regards,
Eli

> Best
> Daniel
>
> On 12.04.2018 20:29, Michal Skrivanek wrote:
>
>>> On 12 Apr 2018, at 13:13, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>>
>>> Hi there,
>>>
>>> does anyone have an idea how to decrease a virtual machine's downtime?
>>>
>>> Best
>>> Daniel
>>>
>>> On 06.04.2018 13:34, Daniel Menzel wrote:
>>>
>>> Hi Michal,
>>
>> Hi Daniel,
>> adding Martin to review fencing behavior
>>
>>> (sorry for misspelling your name in my first mail).
>>
>> that's not the reason I'm replying late! :-)
>>
>>> The settings for the VMs are the following (oVirt 4.2):
>>>
>>> 1. HA checkbox enabled of course
>>> 2. "Target Storage Domain for VM Lease" -> left empty
>>
>> if you need faster reactions then try to use VM Leases as well. It won't make a difference in this case, but it will help in case of network issues. E.g. if you use iSCSI and the storage connection breaks while the host connection still works, it would restart the VM in about 80s; otherwise it would take >5 mins.
>>
>>> 3. "Resume Behavior" -> AUTO_RESUME
>>> 4. Priority for Migration -> High
>>> 5. "Watchdog Model" -> No-Watchdog
>>>
>>> For testing we did not kill any VM but the host. So basically we simulated an instantaneous crash by manually turning the machine off via the IPMI interface (not via the operating system!) and pinged the guest(s). What happens then?
>>>
>>> 1. 2-3 seconds after we press the host's shutdown button we lose ping contact to the VM(s).
>>> 2. After another 20s oVirt changes the host's status to "connecting"; the VM's status is set to a question mark.
>>> 3. After ~1:30 the host is flagged as "non responsive".
>>
>> that sounds about right. Now a fencing action should have been initiated; if you can share the engine logs we can confirm that. IIRC we first try soft fencing - try to ssh to that host - and that might take some time to time out I guess. Martin?
>>
>>> 4. After ~2:10 the host's reboot is initiated by oVirt; 5-10s later the guest is back online.
>>>
>>> So, there seems to be one mistake I made in the first mail: the downtime is "only" 2.5 min. But still I think this time can be decreased, as for some services it is still quite a long time.
>>
>> these values can be tuned down, but then you may be more susceptible to fencing power cycling a host in case of shorter network outages. It may be ok…depending on your requirements.
>>
>>> Best
>>> Daniel
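Side note: the VM Lease setting Michal suggests above can also be enabled programmatically instead of through the UI. A minimal sketch using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, VM name and storage domain name are placeholders, not values from this thread:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Connect to the engine API (URL, credentials and CA file are placeholders).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='...',
    ca_file='ca.pem',
)

system_service = connection.system_service()
vms_service = system_service.vms_service()
sds_service = system_service.storage_domains_service()

# Look up the VM and the storage domain that should hold its lease.
vm = vms_service.list(search='name=myvm')[0]
sd = sds_service.list(search='name=mydata')[0]

# Enable the HA flag and place the VM lease on that storage domain.
vms_service.vm_service(vm.id).update(
    types.Vm(
        high_availability=types.HighAvailability(enabled=True),
        lease=types.StorageDomainLease(
            storage_domain=types.StorageDomain(id=sd.id),
        ),
    )
)

connection.close()

The lease lives on the storage domain, which is what lets the engine restart the VM elsewhere as soon as the lease expires, rather than waiting out the full fencing flow when only storage connectivity is in question.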
>>> On 06.04.2018 12:49, Michal Skrivanek wrote:
>>>
>>>> On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> thanks for your mail. Sorry, I forgot to write that. Yes, we have power management and fencing enabled on all hosts. We also tested this and found out that it works perfectly. So this cannot be the reason, I guess.
>>>>
>>>> Hi Daniel,
>>>> ok, then it's worth looking into the details. Can you describe in more detail what happens? What exact settings are you using for such a VM? Are you killing the HE VM, other VMs, or both? It would be good to narrow it down a bit and then review the exact flow.
>>>>
>>>> Thanks,
>>>> michal
>>>>
>>>>> Daniel
>>>>>
>>>>> On 06.04.2018 11:11, Michal Skrivanek wrote:
>>>>>
>>>>>> On 4 Apr 2018, at 15:36, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> we're successfully using a setup with 4 nodes and a replicated Gluster for storage. The engine is self-hosted. What we're dealing with at the moment is the high availability: if a node fails (for example, simulated by a forced power loss), the engine comes back online within ~2 min, but guests (with the HA option enabled) come back online only after a very long grace time of ~5 min. As we have a reliable network (40 GbE) and reliable servers, I think the default grace times are way too high for us - is there any possibility to change those values?
>>>>>>
>>>>>> And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? Otherwise we have to resort to relatively long timeouts to make sure the host is really dead.
>>>>>>
>>>>>> Thanks,
>>>>>> michal
>>>>>>
>>>>>>> Thanks in advance!
>>>>>>> Daniel
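Side note: a simple way to measure guest downtime in a failover test like the one Daniel describes is to probe the guest continuously and time the gap. A sketch assuming a Linux machine with the standard iputils ping; the guest address and the 1-second probe interval are arbitrary placeholders:

#!/usr/bin/env python3
# Probe a guest once per second and report how long it stayed unreachable.
# Stop with Ctrl+C.
import subprocess
import time

GUEST = 'vm.example.com'  # placeholder: address of the HA guest under test

def is_up(host):
    # One ICMP echo request with a 1-second timeout; returncode 0 means a reply.
    return subprocess.run(
        ['ping', '-c', '1', '-W', '1', host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0

down_since = None
while True:
    if is_up(GUEST):
        if down_since is not None:
            print(f'{GUEST} back online, downtime was {time.time() - down_since:.0f}s')
            down_since = None
    elif down_since is None:
        down_since = time.time()
        print(f'{GUEST} stopped responding')
    time.sleep(1)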
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users