On 11.04.2021 21:38, lejeczek wrote:
> Hi guys.
>
> I'm experiencing weird "handling" of VirtualDomain by the cluster. It
> seems that the cluster sometimes fails to report the real state of a VM,
> which sometimes results in trouble - like when the cluster thinks a VM is
> not running while it is, and then starts it on another node, which
> corrupts the qcow image.
> Right now, for example, I'm looking at the cluster reporting the VM as up
> & okay while it is not running on either of the nodes (because the VM was
> powered off from within itself).
> So I:
>
> -> $ pcs resource refresh c8kubermaster1
> Cleaned up c8kubermaster1 on swir
> Cleaned up c8kubermaster1 on dzien
> Waiting for 2 replies from the controller
> ... got reply
> ... got reply (done)
>
> In the logs on the node where the VM is supposed to be running,
> according to the cluster:
> ..
> notice: Requesting local execution of probe operation for
> c8kubermaster1 on swir
> notice: Result of probe operation for c8kubermaster1 on swir: ok
> notice: Requesting local execution of monitor operation for
> c8kubermaster1 on swir
> notice: Result of monitor operation for c8kubermaster1 on swir: ok
>
> and on the second node (2-node cluster) in the logs:
> ..
> notice: State transition S_IDLE -> S_POLICY_ENGINE
> notice: Ignoring expired c8kubernode1_migrate_to_0 failure on dzien
> notice: * Start c8kubermaster1 ( swir )
> notice: Calculated transition 42, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-2655.bz2
> notice: Initiating monitor operation c8kubermaster1_monitor_0 on swir
> notice: Initiating monitor operation c8kubermaster1_monitor_0 locally
> on dzien
> notice: Requesting local execution of probe operation for
> c8kubermaster1 on dzien
> notice: Result of probe operation for c8kubermaster1 on dzien: not running
> notice: Transition 42 aborted by operation c8kubermaster1_monitor_0
> 'modify' on swir: Event failed
> notice: Transition 42 action 11 (c8kubermaster1_monitor_0 on swir):
> expected 'not running' but got 'ok'
>
You need to debug whether virsh returns correct information that is then
misinterpreted by the agent/pacemaker, or whether virsh itself returns
incorrect information. As far as I can tell, all the VirtualDomain
monitor operation does is run "virsh domstate $DOMAIN".

> -> $ pcs resource config c8kubermaster1
> Resource: c8kubermaster1 (class=ocf provider=heartbeat type=VirtualDomain)
> Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml
> hypervisor=qemu:///system migration_transport=ssh
> Meta Attrs: allow-migrate=true failure-timeout=120s
> Operations: migrate_from interval=0s timeout=180s
> (c8kubermaster1-migrate_from-interval-0s)
> migrate_to interval=0s timeout=180s
> (c8kubermaster1-migrate_to-interval-0s)
> monitor interval=30s (c8kubermaster1-monitor-interval-30s)
> start interval=0s timeout=90s
> (c8kubermaster1-start-interval-0s)
> stop interval=0s timeout=90s
> (c8kubermaster1-stop-interval-0s)
>
> Disabling and re-enabling the resource 'fixes' the glitch but, naturally,
> the obvious question is: why is that even allowed to happen?
> many thanks, L.
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
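
To separate the two cases (virsh correct but misread, or virsh itself wrong), it may help to ask libvirt directly on each node and compare the answer with what the cluster reports. Below is a minimal sketch of that check. The helper names `classify_domstate` and `check_domain` are made up for illustration, and the mapping from state strings to a verdict is my own assumption, not the agent's actual logic. The hypervisor URI is taken from the resource config quoted above.

```shell
#!/bin/sh
# Sketch: ask libvirt directly for a domain's state, bypassing the cluster.
# The default URI matches the hypervisor= attribute of the resource.
HYPERVISOR="${HYPERVISOR:-qemu:///system}"

# Map a raw "virsh domstate" string to a coarse verdict.
# ASSUMPTION: this mapping is illustrative only, not the agent's real logic.
classify_domstate() {
    case "$1" in
        running|paused|idle|blocked|"in shutdown") echo "running" ;;
        "shut off"|crashed) echo "not running" ;;
        *) echo "unknown" ;;
    esac
}

# Print "<domain>: <raw state> -> <verdict>" for one domain.
check_domain() {
    # LANG=C keeps virsh output untranslated so the patterns above match.
    state=$(LANG=C virsh -c "$HYPERVISOR" domstate "$1" 2>&1 | head -n 1)
    echo "$1: $state -> $(classify_domstate "$state")"
}

# Only run when a domain name is passed on the command line.
if [ "$#" -gt 0 ]; then
    check_domain "$1"
fi
```

Saved under some name of your choosing (say check-domstate.sh, a hypothetical name), it could be run on both swir and dzien as "sh check-domstate.sh c8kubermaster1". If virsh says "shut off" on swir while the monitor result in the logs is still "ok", the problem is on the agent/pacemaker side; if virsh itself says "running", the problem is in libvirt.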