Hello, Murilo

NFSv3 has this behaviour, and it is a known issue; see this thread for reference [1]. This situation does not occur with version 4 of the protocol, as it changed the way locking works. However, you mentioned you changed the storage to NFSv4; did you also change the lock lease time of the NFS protocol? This situation should not happen with NFSv4, as the lock is removed if the client does not renew the lease. You can refer to [2] for more details on the differences between the locking mechanisms of NFSv3 and NFSv4.
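
If it helps, on a Linux NFS server the v4 lease time can be inspected and shortened as below (a sketch; the 10-second value is only an example, and it must be set before nfsd starts):

    # current NFSv4 lease time, in seconds (the default is usually 90)
    cat /proc/fs/nfsd/nfsv4leasetime

    # shorten it via /etc/nfs.conf, then restart the server:
    #   [nfsd]
    #   lease-time=10
    systemctl restart nfs-server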

Best regards,
Bryan

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1547095#c43
[2]: https://community.netapp.com/t5/Tech-ONTAP-Blogs/NFSv3-and-NFSv4-What-s-the-difference/ba-p/441316#toc-hId-2029730352

On 15/04/2024 14:16, Murilo Moura wrote:
Complementing the case... I ran a new test and found that the problem really lies in the lock generated by NFS.

When the KVM host has a virtualized instance running (with NFS primary storage) and this host suddenly loses power, the HA scheme only works if, before marking the host as degraded, I restart the NFS service to release the lock on the instance's virtual disk file. Doing it this way, I restart the NFS server (which releases the file locks) and then, when I mark the KVM host as degraded, all the failover happens automatically.
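
For reference, the sequence that makes the failover work here looks like this (a sketch; it assumes CloudMonkey (cmk) is configured, and the host UUID is a placeholder):

    # on the NFS server: restarting it releases the dead client's locks
    systemctl restart nfs-server

    # then mark the failed KVM host as degraded so the failover kicks in
    cmk declare hostasdegraded id=<host-uuid>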

In conclusion, my problem is in fact this lock generated on the volume shared via NFS...

All tests were done in ACS version 4.19.0.1


regards,

Murilo Moura



On Mon, Apr 15, 2024 at 2:07 PM Murilo Moura <a...@bigsys.com.br> wrote:

    Correct, the user instance and the compute offering have HA enabled.

    regards,

    Murilo Moura


    On Mon, Apr 15, 2024 at 4:00 AM <m...@swen.io> wrote:

        Hi Murilo,

        Just checking: the user instances you are talking about are
        using a service offering with HA enabled, correct?

        Regards,
        Swen

        -----Original Message-----
        From: Murilo Moura <a...@bigsys.com.br>
        Sent: Sunday, 14 April 2024 06:31
        To: users@cloudstack.apache.org
        Subject: Re: AW: Manual fence KVM Host

        Hello Guto!


        I carefully checked the instructions that you and Daniel left
        in this thread I opened, but one point is not working, and I
        would like to know whether you have experienced something
        similar...

        By putting the host in the "Disconnected" state, I can trigger
        the API to mark the host as degraded; so far, everything is OK.
        Right after this action, I see that the system VMs are recreated
        on the node that remained active, but the user instances (user
        VMs) are not.

        Checking the NFS host where this VM's image is located, I
        noticed that I cannot read the instance's volume file with the
        "qemu-img info" command (error: Failed to get shared "write"
        lock).

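        (Side note: when another process holds the lock, qemu-img can
        still read the metadata if told not to take the lock; this is
        safe for read-only inspection:

            qemu-img info -U /mnt/primary/<volume-uuid>.qcow2

        The -U/--force-share flag is part of qemu-img itself; the path
        here is just a placeholder.)
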
        Is there any way to release the lock, or even a parameter that
        makes KVM start a VM without locking the volume on the NFS
        primary storage? (I tried switching the NFS storage to version
        4, but it still had no effect...)
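
        (For context: QEMU does expose a per-file locking switch on the
        command line, e.g.

            qemu-system-x86_64 ... \
                -drive file=/mnt/primary/<volume>.qcow2,format=qcow2,file.locking=off

        though disabling the lock also removes the protection against
        two hosts writing to the same image, so it may just trade one
        failure mode for a worse one.)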


        regards,

        Murilo Moura


        On Wed, Apr 10, 2024 at 2:38 PM Guto Veronezi
        <gutoveron...@apache.org>
        wrote:

        > Hello Murilo,
        >
        > Complementing Swen's answer, if your host is still up and you
        > can manage it, then you could also put your host in
        > maintenance mode in ACS. This process will evacuate (migrate
        > to another host) every VM from the host (not only the ones
        > that have HA enabled). Is this your situation? If not, could
        > you provide more details about your configurations and the
        > environment state?
        >
        > Depending on what you have in your setup, the HA might not
        > work as expected. For VMware and XenServer, the process is
        > expected to happen at the hypervisor level. For KVM, ACS does
        > not support HA; what ACS supports is failover (it is named HA
        > in ACS, though), and this process will work only when certain
        > criteria are met. Furthermore, we have two ways to implement
        > the failover for ACS + KVM: the VM's failover and the host's
        > failover. In both cases, when it is identified that a host
        > crashed or a VM suddenly stopped working, ACS will start the
        > VM on another host.
        >
        > In ACS + KVM, to work with the VM's failover, at least one
        > NFS primary storage is necessary; the KVM Agent of every host
        > writes the heartbeat to it. The VM's failover is triggered
        > only if the VM's compute offering has the property "Offer HA"
        > enabled OR the global setting "force.ha" is enabled. VRs have
        > failover triggered independently of the offering or the
        > global setting. In this approach, ACS will check the VM state
        > periodically (sending commands to the KVM Agent) and it will
        > trigger the failover if the VM meets the previously mentioned
        > criteria AND the determined limit (defined by the global
        > settings "ping.interval" and "ping.timeout") has elapsed.
        > Bear in mind that, if you lose your host, ACS will trigger
        > the failover; however, if you gracefully shut down the KVM
        > Agent or the host, the Agent will send a disconnect command
        > to the Management Server and ACS will not check the VM state
        > anymore for that host. Therefore, if you lose your host while
        > the service is down, the failover will not be triggered.
        > Also, if a host loses access to the NFS primary storage used
        > for heartbeat and the VM uses some other primary storage, ACS
        > might trigger the failover too. As we do not have
        > STONITH/fencing in this scenario, it is possible for the VM
        > to still be running on the host and for ACS to try to start
        > it on another host.
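        >
        > To illustrate, those settings can be inspected and changed
        > via CloudMonkey (a sketch; it assumes cmk is configured
        > against the Management Server):
        >
        >     cmk list configurations name=force.ha
        >     cmk list configurations name=ping.interval
        >     cmk update configuration name=force.ha value=true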
        >
        > In ACS + KVM, to work with the host's failover, it is
        > necessary to configure the host's OOBM (of each host desired
        > to trigger the failover) in ACS. In this approach, ACS
        > monitors the Agent's state and triggers the failover in case
        > it cannot establish the connection again. In this scenario,
        > ACS will shut down the host via OOBM and will start the VMs
        > on another host; therefore, it is not dependent on an NFS
        > primary storage. This behavior is driven by the "kvm.ha.*"
        > global settings. Furthermore, one has to be aware that
        > stopping the Agent might trigger the failover; therefore, it
        > is recommended to disable the failover feature while doing
        > operations on the host (like upgrading the packages or some
        > other maintenance processes).
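        >
        > For illustration, OOBM can be configured and enabled per host
        > through the API (a sketch; the host UUID and IPMI credentials
        > are hypothetical):
        >
        >     cmk configure outofbandmanagement hostid=<host-uuid> \
        >         driver=ipmitool address=10.1.1.10 port=623 \
        >         username=ADMIN password=<secret>
        >     cmk enable outofbandmanagementforhost hostid=<host-uuid>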
        >
        > Best regards,
        > Daniel Salvador (gutoveronezi)
        >
        > On 10/04/2024 03:52, m...@swen.io wrote:
        > > What exactly do you mean? In which state is the host?
        > > If a host is in state "Disconnected" or "Alert", you can
        > > declare a host as degraded via API
        > > (https://cloudstack.apache.org/api/apidocs-4.19/apis/declareHostAsDegraded.html)
        > > or UI (icon).
        > > CloudStack will then start all VMs with HA enabled on other
        > > hosts, if storage is accessible.
        > >
        > > Regards,
        > > Swen
        > >
        > > -----Original Message-----
        > > From: Murilo Moura <a...@bigsys.com.br>
        > > Sent: Wednesday, 10 April 2024 02:10
        > > To: users@cloudstack.apache.org
        > > Subject: Manual fence KVM Host
        > >
        > > Hey guys!
        > >
        > > Is there any way to manually fence a KVM host and then
        > > automatically start the migration of VMs that have HA
        > > enabled?
        > >
        > >
        >
