On 26/08/2021 10:35, Klaus Wenninger wrote:


On Thu, Aug 26, 2021 at 11:13 AM lejeczek via Users <users@clusterlabs.org <mailto:users@clusterlabs.org>> wrote:

    Hi guys.

    I sometimes - and I think I can see a pattern to when -
    get resources stuck on one node (two-node cluster), with
    these in libvirtd's logs:
    ...
    Cannot start job (query, none, none) for domain c8kubermaster1; current job is (modify, none, none) owned by (192261 qemuProcessReconnect, 0 <null>, 0 <null> (flags=0x0)) for (1093s, 0s, 0s)
    Cannot start job (query, none, none) for domain ubuntu-tor; current job is (modify, none, none) owned by (192263 qemuProcessReconnect, 0 <null>, 0 <null> (flags=0x0)) for (1093s, 0s, 0s)
    Timed out during operation: cannot acquire state change lock (held by monitor=qemuProcessReconnect)
    Timed out during operation: cannot acquire state change lock (held by monitor=qemuProcessReconnect)
    ...
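
    For what it's worth, this is roughly how I poke at the
    stuck domains with virsh (just a sketch; 'c8kubermaster1'
    is one of my domains, and virsh itself may hang while
    libvirtd holds the lock):

      # list domains and their state as libvirt sees them
      virsh list --all
      # show the job currently running for a given domain
      virsh domjobinfo c8kubermaster1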

    When this happens, and if the resource is meant to run on
    the other node, I have to disable the resource first; the
    node on which the resource is stuck will then shut down
    the VM, and only after I re-enable the resource will it
    start on that other, second node.
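
    In pcs terms the workaround is roughly this (a sketch; I
    am assuming here that 'c8kubermaster1' is also the name of
    the VirtualDomain resource):

      pcs resource disable c8kubermaster1
      # wait until the stuck node has shut the VM down, then:
      pcs resource enable c8kubermaster1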

    I think this problem occurs if I restart 'libvirtd'
    via systemd.

    Any thoughts on this, guys?


What are the logs on the pacemaker-side saying?
An issue with migration?
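
Something like this should show them (assuming the default
log locations on CentOS):

  journalctl -u pacemaker
  # or grep the plain-text log:
  grep -i error /var/log/pacemaker/pacemaker.log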

Klaus

I'll have to tidy up the "protocol" around my setup before I can call it all reproducible; at the moment it only feels reproducible.

I'm on CentOS Stream with a 2-node cluster running KVM resources, backed by a 2-node glusterfs cluster on the same hosts (physically it is all just two machines).

1) I power down one node in an orderly manner and the other node is last man standing.
2) After a while (not sure if the time period is also a key here) I bring that first node back up.
3) libvirtd on the last-man-standing node becomes unresponsive to virsh commands and probably to everything else (I don't know yet if that happens only after the first node comes back up); the pacemaker log says:
...
pacemaker-controld[2730]:  error: Result of probe operation for c8kubernode2 on dzien: Timed Out
...
and the libvirtd log does not really say anything (with the default debug levels).
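
For the next attempt I suppose I can raise libvirtd's log level; a sketch, using the documented log_filters/log_outputs settings in /etc/libvirt/libvirtd.conf (with the caveat that applying them needs a libvirtd restart, which is itself my suspect):

  log_filters="1:qemu 1:libvirt"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"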

4) Could glusterfs play any role? Healing of the volume(s) has finished by this time, completed successfully.
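
(I check that with something like the following; 'VOLNAME' stands in for my actual volume name:)

  gluster volume heal VOLNAME info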

This is the moment where I manually 'systemctl restart libvirtd' on that unresponsive node (the ex-last-man-standing) and get the original error messages.
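
Concretely, that is something like this (the virsh call is only there as a cheap check that libvirtd responds again):

  systemctl restart libvirtd
  virsh version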

There is plenty of room for anybody to make guesses, obviously.
Is it libvirtd going haywire because the glusterfs volume is in an unhealthy state and needs healing? Is it pacemaker's last-man-standing situation that makes libvirtd go haywire?
etc...

I can't add much concrete stuff at this moment but will appreciate any thoughts you want to share.
thanks, L

    many thanks, L.


_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
