On 26/08/2021 10:35, Klaus Wenninger wrote:


On Thu, Aug 26, 2021 at 11:13 AM lejeczek via Users <users@clusterlabs.org <mailto:users@clusterlabs.org>> wrote:

    Hi guys.

    I sometimes - and I think I can see a pattern to when -
    get resources stuck on one node (two-node cluster), with
    these in libvirtd's logs:
    ...
    Cannot start job (query, none, none) for domain c8kubermaster1; current job is (modify, none, none) owned by (192261 qemuProcessReconnect, 0 <null>, 0 <null> (flags=0x0)) for (1093s, 0s, 0s)
    Cannot start job (query, none, none) for domain ubuntu-tor; current job is (modify, none, none) owned by (192263 qemuProcessReconnect, 0 <null>, 0 <null> (flags=0x0)) for (1093s, 0s, 0s)
    Timed out during operation: cannot acquire state change lock (held by monitor=qemuProcessReconnect)
    Timed out during operation: cannot acquire state change lock (held by monitor=qemuProcessReconnect)
    ...
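
    For what it's worth, this is roughly how I poke at the
    stuck domains with virsh (just a sketch; 'c8kubermaster1'
    is one of my domains, and virsh itself may hang while
    libvirtd holds the lock):

      # list domains and their state as libvirt sees them
      virsh list --all
      # show the job currently running for a given domain
      virsh domjobinfo c8kubermaster1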

    When this happens, and if the resource is meant to run on
    the other node, I have to disable the resource first; the
    node on which the resource is stuck will then shut down
    the VM, and only after I re-enable the resource will it
    start on that other, second node.
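
    In pcs terms the workaround is roughly this (a sketch; I
    am assuming here that 'c8kubermaster1' is also the name of
    the VirtualDomain resource):

      pcs resource disable c8kubermaster1
      # wait until the stuck node has shut the VM down, then:
      pcs resource enable c8kubermaster1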

    I think this problem occurs if I restart 'libvirtd'
    via systemd.

    Any thoughts on this, guys?


What are the logs on the pacemaker-side saying?
An issue with migration?
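
Something like this should show them (assuming the default
log locations on CentOS):

  journalctl -u pacemaker
  # or grep the plain-text log:
  grep -i error /var/log/pacemaker/pacemaker.log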

Klaus

I'll have to tidy up the "protocol" around my setup before I can call it all reproducible; at the moment it only feels reproducible.

I'm on CentOS Stream with a 2-node cluster running KVM resources, backed by a 2-node glusterfs cluster on the same hosts (physically it is all just two machines).

1) I power down one node in an orderly manner and the other node is last man standing.
2) After a while (not sure if the time period is also a key here) I bring that first node back up.
3) libvirtd on the last-man-standing node becomes unresponsive to virsh commands and probably to everything else (I don't know yet if that happens only after the first node comes back up); the pacemaker log says:
...
pacemaker-controld[2730]:  error: Result of probe operation for c8kubernode2 on dzien: Timed Out
...
and the libvirtd log does not really say anything (with the default debug levels).
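
For the next attempt I suppose I can raise libvirtd's log level; a sketch, using the documented log_filters/log_outputs settings in /etc/libvirt/libvirtd.conf (with the caveat that applying them needs a libvirtd restart, which is itself my suspect):

  log_filters="1:qemu 1:libvirt"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"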

4) Could glusterfs play any role? Healing of the volume(s) has finished by this time, completed successfully.
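
(I check that with something like the following; 'VOLNAME' stands in for my actual volume name:)

  gluster volume heal VOLNAME info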

This is the moment where I manually 'systemctl restart libvirtd' on that unresponsive node (the ex-last-man-standing) and get the original error messages.
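
Concretely, that is something like this (the virsh call is only there as a cheap check that libvirtd responds again):

  systemctl restart libvirtd
  virsh version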

There is plenty of room for anybody to make guesses, obviously.
Is it libvirtd going haywire because the glusterfs volume is in an unhealthy state and needs healing? Is it pacemaker's last-man-standing situation that makes libvirtd go haywire?
etc...

I can't add much concrete stuff at this moment but will appreciate any thoughts you want to share.
thanks, L

    many thanks, L.


_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
