On Mon, Apr 23, 2018 at 8:06 PM, Michal Skrivanek <michal.skriva...@redhat.com> wrote:
> On 23 Apr 2018, at 10:52, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>
> Hi Michal,
>
> in your last mail you wrote that the values can be turned down - how can this be done?

AFAIK, there is no point in changing the fencing vdc_options values in that case (assuming no kdump is configured here ...).

The fencing mechanism puts the host into the "connecting" state for a grace period that depends on the number of VMs it runs and on whether it serves as SPM or not. Once the host becomes non-responsive, we first try a soft fence (restarting VDSM via ssh); this also takes time. After that, if the soft fence fails, the host is rebooted via the fencing script, and how long that takes depends entirely on the host.

If you have something to look at, it is your host's reboot time: if you can make the host reboot faster, the whole process will take less time ...

Regards,
Eli

> Best
> Daniel
>
> On 12.04.2018 20:29, Michal Skrivanek wrote:
>
>>> On 12 Apr 2018, at 13:13, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>>
>>> Hi there,
>>>
>>> does anyone have an idea how to decrease a virtual machine's downtime?
>>>
>>> Best
>>> Daniel
>>>
>>> On 06.04.2018 13:34, Daniel Menzel wrote:
>>>
>>> Hi Michal,
>>
>> Hi Daniel,
>> adding Martin to review fencing behavior
>>
>>> (sorry for misspelling your name in my first mail).
>>
>> that's not the reason I'm replying late! :-)
>>
>>> The settings for the VMs are the following (oVirt 4.2):
>>>
>>> 1. HA checkbox enabled of course
>>> 2. "Target Storage Domain for VM Lease" -> left empty
>>
>> if you need faster reactions then try to use VM Leases as well. It won't make a difference in this case, but it will help in case of network issues. E.g. if you use iSCSI and the storage connection breaks while the host connection still works, it would restart the VM in about 80s; otherwise it would take >5 mins.
>>
>>> 3. "Resume Behavior" -> AUTO_RESUME
>>> 4. Priority for Migration -> High
>>> 5. "Watchdog Model" -> No-Watchdog
>>>
>>> For testing we did not kill any VM but the host. So basically we simulated an instantaneous crash by manually turning the machine off via the IPMI interface (not via the operating system!) and pinged the guest(s). What happens then?
>>>
>>> 1. 2-3 seconds after we press the host's shutdown button we lose ping contact to the VM(s).
>>> 2. After another 20s oVirt changes the host's status to "connecting"; the VM's status is set to a question mark.
>>> 3. After ~1:30 the host is flagged as "non responsive".
>>
>> that sounds about right. Now a fencing action should have been initiated; if you can share the engine logs we can confirm that. IIRC we first try soft fencing - try to ssh to that host - and that might take some time to time out I guess. Martin?
>>
>>> 4. After ~2:10 the host's reboot is initiated by oVirt; 5-10s later the guest is back online.
>>>
>>> So, there seems to be one mistake I made in the first mail: the downtime is "only" 2.5 min. But still I think this time can be decreased, as for some services it is still quite a long time.
>>
>> these values can be tuned down, but then you may be more susceptible to fencing power cycling a host in case of shorter network outages. It may be ok…depending on your requirements.
>>
>>> Best
>>> Daniel
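Side note: the VM Lease setting Michal suggests above can also be enabled programmatically instead of through the UI. A minimal sketch using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, VM name and storage domain name are placeholders, not values from this thread:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Connect to the engine API (URL, credentials and CA file are placeholders).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='...',
    ca_file='ca.pem',
)

system_service = connection.system_service()
vms_service = system_service.vms_service()
sds_service = system_service.storage_domains_service()

# Look up the VM and the storage domain that should hold its lease.
vm = vms_service.list(search='name=myvm')[0]
sd = sds_service.list(search='name=mydata')[0]

# Enable the HA flag and place the VM lease on that storage domain.
vms_service.vm_service(vm.id).update(
    types.Vm(
        high_availability=types.HighAvailability(enabled=True),
        lease=types.StorageDomainLease(
            storage_domain=types.StorageDomain(id=sd.id),
        ),
    )
)

connection.close()

The lease lives on the storage domain, which is what lets the engine restart the VM elsewhere as soon as the lease expires, rather than waiting out the full fencing flow when only storage connectivity is in question.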
>>> On 06.04.2018 12:49, Michal Skrivanek wrote:
>>>
>>>> On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> thanks for your mail. Sorry, I forgot to write that. Yes, we have power management and fencing enabled on all hosts. We also tested this and found out that it works perfectly. So this cannot be the reason, I guess.
>>>>
>>>> Hi Daniel,
>>>> ok, then it's worth looking into the details. Can you describe in more detail what happens? What exact settings are you using for such a VM? Are you killing the HE VM, other VMs, or both? It would be good to narrow it down a bit and then review the exact flow.
>>>>
>>>> Thanks,
>>>> michal
>>>>
>>>>> Daniel
>>>>>
>>>>> On 06.04.2018 11:11, Michal Skrivanek wrote:
>>>>>
>>>>>> On 4 Apr 2018, at 15:36, Daniel Menzel <daniel.men...@hhi.fraunhofer.de> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> we're successfully using a setup with 4 nodes and a replicated Gluster for storage. The engine is self-hosted. What we're dealing with at the moment is the high availability: if a node fails (for example, simulated by a forced power loss), the engine comes back online within ~2 min, but guests (with the HA option enabled) come back online only after a very long grace time of ~5 min. As we have a reliable network (40 GbE) and reliable servers, I think the default grace times are way too high for us - is there any possibility to change those values?
>>>>>>
>>>>>> And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? Otherwise we have to resort to relatively long timeouts to make sure the host is really dead.
>>>>>>
>>>>>> Thanks,
>>>>>> michal
>>>>>>
>>>>>>> Thanks in advance!
>>>>>>> Daniel
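Side note: a simple way to measure guest downtime in a failover test like the one Daniel describes is to probe the guest continuously and time the gap. A sketch assuming a Linux machine with the standard iputils ping; the guest address and the 1-second probe interval are arbitrary placeholders:

#!/usr/bin/env python3
# Probe a guest once per second and report how long it stayed unreachable.
# Stop with Ctrl+C.
import subprocess
import time

GUEST = 'vm.example.com'  # placeholder: address of the HA guest under test

def is_up(host):
    # One ICMP echo request with a 1-second timeout; returncode 0 means a reply.
    return subprocess.run(
        ['ping', '-c', '1', '-W', '1', host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0

down_since = None
while True:
    if is_up(GUEST):
        if down_since is not None:
            print(f'{GUEST} back online, downtime was {time.time() - down_since:.0f}s')
            down_since = None
    elif down_since is None:
        down_since = time.time()
        print(f'{GUEST} stopped responding')
    time.sleep(1)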
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users