Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-18 Thread Ken Gaillot
On Fri, 2022-02-18 at 16:00 +0100, Lentes, Bernd wrote:
> 
> - On Feb 17, 2022, at 4:25 PM, kgaillot kgail...@redhat.com
> wrote:
> > > So for me the big question is:
> > > When a transition is happening, and there is a change in the
> > > cluster,
> > > is the transition "aborted"
> > > (delayed or interrupted would be better) or not ?
> > > Is this behaviour consistent ? If no, from what does it depend ?
> > > 
> > > Bernd
> > 
> > Yes, anytime the DC sees a change that could affect resources, it
> > will
> > abort the current transition and calculate a new one. Aborting
> > means
> > not initiating any new actions from the transition -- but any
> > actions
> > currently in flight must complete before the new transition can be
> > calculated.
> > 
> > Changes that abort a transition include configuration changes, a
> > node
> > joining or leaving, an unexpected action result being received, a
> > node
> > attribute changing, the cluster-recheck-interval passing since the
> > last
> > transition, or a timer popping for a time-based event (failure
> > timeout,
> > rule, etc.). I may be forgetting some, but you get the idea.
> > --
> 
> Hi Ken,
> 
> thanks for your explanation. 
> Now i try to resume if i understood everything correctly:
> I started the shutdown of several VirtualDomains with "crm resource
> vm_xxx stop".
> Not concurrently, one by one with some delay of about 30 sec.
> But there was already one VirtualDomain shutting down before.
> Cluster said this transition is aborted, but in real it couldn't be
> aborted. How to abort a running shutdown ?

The "transition" (i.e. a plan of actions to take) is aborted, not
running actions. The wording is awkward and I hope to find the time to
change it at some point (references are scattered throughout the code,
and we have to think about people who may have scripts that parse logs
or whatnot).

There's no way from within the cluster to abort a running action.
However kill -9 on the agent works :) (the cluster will consider the
action failed)

> So we had to wait for the shutdown of that domain.
> It has been switched off by libvirt with "virsh destroy" after 10
> minutes.
> After that the shutdown of the other domains was initiated, and the
> domains shutdown cleanly.
> 
> So, to conclude:
> I forgot that i had already one domain in shutdown. I should have
> waited for this to finish before starting the stop of the other
> resources.
> Cluster tried to "abort" the shutdown, but shutdown can't be aborted.
> And i had bad luck that the shutdown of this domain took so long.
> 
> Correct ?
> 
> Bernd
> 

Yes, except that the cluster isn't trying to abort the shutdown; it's
just discarding any actions that were planned after it in the same
transition.
-- 
Ken Gaillot 
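
To illustrate the distinction made above between aborting a transition and
aborting a running action, here is a toy Python model. It is only a sketch
and not Pacemaker code: the Transition class, its method names and the
second action name are hypothetical. Only vm_pathway_stop_0 is taken from
the logs in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transition:
    """Toy model of a transition: an ordered plan of actions."""
    number: int
    planned: List[str]
    in_flight: List[str] = field(default_factory=list)
    completed: List[str] = field(default_factory=list)
    aborted: bool = False

    def initiate_next(self) -> Optional[str]:
        """Start the next planned action unless the transition was aborted."""
        if self.aborted or not self.planned:
            return None
        action = self.planned.pop(0)
        self.in_flight.append(action)
        return action

    def abort(self) -> List[str]:
        """Discard actions not yet initiated; running actions are untouched."""
        self.aborted = True
        discarded, self.planned = self.planned, []
        return discarded

    def finish(self, action: str) -> None:
        """Record that an in-flight action has returned a result."""
        self.in_flight.remove(action)
        self.completed.append(action)

    def can_recalculate(self) -> bool:
        """A new transition is possible only once nothing is in flight."""
        return self.aborted and not self.in_flight


# "hypothetical_later_action" is invented for illustration only.
t = Transition(128, planned=["vm_pathway_stop_0", "hypothetical_later_action"])
t.initiate_next()                  # the stop is now in flight
print(t.abort())                   # only the not-yet-started action is discarded
print(t.can_recalculate())         # the stop is still running
t.finish("vm_pathway_stop_0")
print(t.can_recalculate())         # now a new transition can be calculated
```

In this model a long-running stop blocks the next transition in the same way
the vm_pathway shutdown did here, even though the transition itself was
marked as aborted long before.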

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-18 Thread Lentes, Bernd


- On Feb 17, 2022, at 4:25 PM, kgaillot kgail...@redhat.com wrote:
>> So for me the big question is:
>> When a transition is happening, and there is a change in the cluster,
>> is the transition "aborted"
>> (delayed or interrupted would be better) or not ?
>> Is this behaviour consistent ? If no, from what does it depend ?
>> 
>> Bernd
> 
> Yes, anytime the DC sees a change that could affect resources, it will
> abort the current transition and calculate a new one. Aborting means
> not initiating any new actions from the transition -- but any actions
> currently in flight must complete before the new transition can be
> calculated.
> 
> Changes that abort a transition include configuration changes, a node
> joining or leaving, an unexpected action result being received, a node
> attribute changing, the cluster-recheck-interval passing since the last
> transition, or a timer popping for a time-based event (failure timeout,
> rule, etc.). I may be forgetting some, but you get the idea.
> --

Hi Ken,

thanks for your explanation.
Let me summarize to check that I understood everything correctly:
I started the shutdown of several VirtualDomains with "crm resource stop
vm_xxx".
Not concurrently, but one by one, about 30 seconds apart.
But one VirtualDomain was already shutting down before that.
The cluster said this transition was aborted, but in reality it couldn't be
aborted. How would one abort a running shutdown?
So we had to wait for the shutdown of that domain.
It was switched off by libvirt with "virsh destroy" after 10 minutes.
After that the shutdown of the other domains was initiated, and those domains
shut down cleanly.

So, to conclude:
I forgot that I already had one domain shutting down. I should have waited for
it to finish before stopping the other resources.
The cluster tried to "abort" the shutdown, but a shutdown can't be aborted.
And I had bad luck that the shutdown of this domain took so long.

Correct?

Bernd





Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-17 Thread Ken Gaillot
On Thu, 2022-02-17 at 14:05 +0100, Lentes, Bernd wrote:
> - On Feb 16, 2022, at 6:48 PM, arvidjaar arvidj...@gmail.com
> wrote:
> > 
> > Splitting logs between different messages does not really help in
> > interpreting
> > them.
> 
> I agree.
> Here is the complete excerpt from the respective time:
> https://nc-mcd.helmholtz-muenchen.de/nextcloud/s/eY8SA8pe4HZBBc8
> 
> > I guess the real question here is why "Transition aborted" is
> > logged although
> > transition apparently continues. Transition 128 started at 20:54:30
> > and
> > completed
> > at 21:04:26, but there were multiple "Transition 128 aborted"
> > messages in
> > between
> 
> That's correct. The shutdown_timeout for the domain is set with 600
> sec. in the CIB.
> The RA says:
> # The "shutdown_timeout" we use here is the operation
> # timeout specified in the CIB, minus 5 seconds
> And between 20:54:30 and 21:04:26 we have very close 595 sec.
> 
> > It looks like "Transition aborted" is more "we try to abort this
> > transition if
> > possible". My guess is that pacemaker must wait for currently
> > running action(s)
> > which can take quite some time when stopping virtual domain.
> > Transition 128
> > was initiated when stopping vm_pathway, but we have no idea when it
> > was stopped.
> 
> We have:
> Feb 15 21:04:26 [15370] ha-idg-2   crmd:   notice:
> run_graph:   Transition 128 (Complete=1, Pending=0, Fired=0,
> Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-
> 3548.bz2): Complete
> 
> and the log from libvirt confirms it:
> /var/log/libvirtd/qemu/vm_pathway.log:
> 2022-02-15T20:04:26.569471Z qemu-system-x86_64: terminating on signal
> 15 from pid 7368 (/usr/sbin/libvirtd)
> 2022-02-15 20:04:26.769+: shutting down, reason=destroyed
> 
> Time in libvirt logs is UTC, and in Munich we have currently UTC+1,
> so the time differs in the logs.
> We see that the domain is "switched off" via libvirt exactly at
> 21:04:26.
> 
> So for me the big question is:
> When a transition is happening, and there is a change in the cluster,
> is the transition "aborted"
> (delayed or interrupted would be better) or not ?
> Is this behaviour consistent ? If no, from what does it depend ?
> 
> Bernd

Yes, anytime the DC sees a change that could affect resources, it will
abort the current transition and calculate a new one. Aborting means
not initiating any new actions from the transition -- but any actions
currently in flight must complete before the new transition can be
calculated.

Changes that abort a transition include configuration changes, a node
joining or leaving, an unexpected action result being received, a node
attribute changing, the cluster-recheck-interval passing since the last
transition, or a timer popping for a time-based event (failure timeout,
rule, etc.). I may be forgetting some, but you get the idea.
-- 
Ken Gaillot 



Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-17 Thread Lentes, Bernd

- On Feb 16, 2022, at 6:48 PM, arvidjaar arvidj...@gmail.com wrote:
> 
> 
> Splitting logs between different messages does not really help in interpreting
> them.

I agree.
Here is the complete excerpt from the respective time:
https://nc-mcd.helmholtz-muenchen.de/nextcloud/s/eY8SA8pe4HZBBc8

> 
> I guess the real question here is why "Transition aborted" is logged although
> transition apparently continues. Transition 128 started at 20:54:30 and
> completed
> at 21:04:26, but there were multiple "Transition 128 aborted" messages in
> between

That's correct. The shutdown_timeout for the domain is set to 600 sec. in the
CIB.
The RA says:
# The "shutdown_timeout" we use here is the operation
# timeout specified in the CIB, minus 5 seconds
And between 20:54:30 and 21:04:26 we have very nearly 595 sec.
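
The arithmetic above can be checked with a few lines of Python. This is only
a sketch using the two timestamps and the 600 sec. value quoted in this
message; the variable names are made up, and the year is assumed to be 2022
as in the libvirt log.

```python
from datetime import datetime

# Timestamps from the Pacemaker log lines quoted in this thread.
stop_initiated = datetime(2022, 2, 15, 20, 54, 30)
transition_complete = datetime(2022, 2, 15, 21, 4, 26)

# Operation timeout from the CIB as stated above; per the RA comment,
# shutdown_timeout is that value minus 5 seconds.
op_timeout_s = 600
shutdown_timeout_s = op_timeout_s - 5

elapsed_s = int((transition_complete - stop_initiated).total_seconds())
print(elapsed_s, shutdown_timeout_s)  # 596 595
```

The one second of difference is consistent with the domain being destroyed
right after the shutdown_timeout expired.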

> It looks like "Transition aborted" is more "we try to abort this transition if
> possible". My guess is that pacemaker must wait for currently running 
> action(s)
> which can take quite some time when stopping virtual domain. Transition 128
> was initiated when stopping vm_pathway, but we have no idea when it was 
> stopped.

We have:
Feb 15 21:04:26 [15370] ha-idg-2   crmd:   notice: run_graph:   Transition 128 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3548.bz2): Complete

and the log from libvirt confirms it:
/var/log/libvirtd/qemu/vm_pathway.log:
2022-02-15T20:04:26.569471Z qemu-system-x86_64: terminating on signal 15 from pid 7368 (/usr/sbin/libvirtd)
2022-02-15 20:04:26.769+: shutting down, reason=destroyed

Times in the libvirt logs are UTC, and Munich is currently at UTC+1, so the
timestamps differ between the logs.
We see that the domain is "switched off" via libvirt at exactly 21:04:26.
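
The offset between the two logs can be reproduced with a short Python sketch.
It assumes a fixed UTC+1 offset, as stated above; a real tool would more
likely use zoneinfo.ZoneInfo("Europe/Berlin") to follow daylight saving
time. The function name is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+1 offset, as stated in the message for Munich in February.
LOCAL = timezone(timedelta(hours=1))


def libvirt_utc_to_local(stamp: str) -> datetime:
    """Convert a qemu log timestamp like 2022-02-15T20:04:26.569471Z."""
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return utc.replace(tzinfo=timezone.utc).astimezone(LOCAL)


local = libvirt_utc_to_local("2022-02-15T20:04:26.569471Z")
print(local.strftime("%H:%M:%S"))  # 21:04:26
```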

So for me the big question is:
When a transition is in progress and there is a change in the cluster, is the
transition "aborted"
("delayed" or "interrupted" would be better terms) or not?
Is this behaviour consistent? If not, what does it depend on?

Bernd






Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-16 Thread Ken Gaillot
On Wed, 2022-02-16 at 18:09 +0100, Lentes, Bernd wrote:
> 
> - On Feb 16, 2022, at 12:52 AM, kgaillot kgail...@redhat.com
> wrote:
> 
> > A transition is the set of actions that need to be taken in
> > response to
> > current conditions. A transition is aborted any time conditions
> > change
> > (here, the target-role being changed in the configuration), so that
> > a
> > new set of actions can be calculated.
> > 
> > Someone once defined a transition as an "action plan", and I'm
> > tempted
> > to use that instead. Plus maybe replace "aborted" with
> > "interrupted",
> > so then we'd have "Action plan interrupted" which is maybe a little
> > more understandable.
> 
> These "transition  aborted" happen quite often.

Yes, they're quite normal. Basically they just mean that conditions
changed, so we need to check if anything needs to be done differently.

> 
> Feb 15 20:53:25 [15370] ha-idg-2   crmd:   notice:
> abort_transition_graph:  Transition 126 aborted by vm_documents-oo-
> meta_attributes-target-role doing modify target-role=Stopped:
> Configuration change | cib=7.27453.0 source=te_update
> _diff_v2:483
> path=/cib/configuration/resources/primitive[@id='vm_documents-
> oo']/meta_attributes[@id='vm_documents-oo-
> meta_attributes']/nvpair[@id='vm_documents-oo-meta_attributes-target-
> role'] complete=false
>   
> Feb 15 20:53:00 [15370] ha-idg-2   crmd: info:
> abort_transition_graph:  Transition 125 aborted by vm_amok-
> meta_attributes-target-role doing modify target-role=Stopped:
> Configuration change | cib=7.27452.0 source=te_update_diff_v2
> :483
> path=/cib/configuration/resources/primitive[@id='vm_amok']/meta_attri
> butes[@id='vm_amok-meta_attributes']/nvpair[@id='vm_amok-
> meta_attributes-target-role'] complete=true
> 
> Why is there sometimes "complete=true" and sometimes "complete=false"
> ?
> What does that mean ?
> 
> Bernd

"Complete" is whether all actions originally planned in the transition
were completed. For complete=true, the log is basically just a heads-up 
that the cluster needs to recheck things, since there's nothing to
actually abort.
-- 
Ken Gaillot 
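
As an illustration of reading the complete= field explained above, here is a
small Python sketch that pulls the transition number, the reason and the
complete flag out of the two log lines quoted in this message. The regular
expression and the function name are assumptions written for these two
lines only; they are not a Pacemaker-provided log format specification.

```python
import re

# Written against the two abort_transition_graph lines from this thread.
ABORT_RE = re.compile(
    r"Transition (?P<number>\d+) aborted by (?P<source>\S+)"
    r".*?: (?P<reason>[^|:]+) \|"
    r".*\bcomplete=(?P<complete>true|false)"
)

LINE_126 = (
    "Feb 15 20:53:25 [15370] ha-idg-2   crmd:   notice: abort_transition_graph:  "
    "Transition 126 aborted by vm_documents-oo-meta_attributes-target-role doing "
    "modify target-role=Stopped: Configuration change | cib=7.27453.0 "
    "source=te_update_diff_v2:483 "
    "path=/cib/configuration/resources/primitive[@id='vm_documents-oo']"
    "/meta_attributes[@id='vm_documents-oo-meta_attributes']"
    "/nvpair[@id='vm_documents-oo-meta_attributes-target-role'] complete=false"
)
LINE_125 = (
    "Feb 15 20:53:00 [15370] ha-idg-2   crmd: info: abort_transition_graph:  "
    "Transition 125 aborted by vm_amok-meta_attributes-target-role doing modify "
    "target-role=Stopped: Configuration change | cib=7.27452.0 "
    "source=te_update_diff_v2:483 "
    "path=/cib/configuration/resources/primitive[@id='vm_amok']"
    "/meta_attributes[@id='vm_amok-meta_attributes']"
    "/nvpair[@id='vm_amok-meta_attributes-target-role'] complete=true"
)


def parse_abort(line):
    """Return a small dict for an abort line, or None if it does not match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    return {
        "transition": int(m["number"]),
        "reason": m["reason"].strip(),
        # complete=true: everything originally planned had already finished.
        "all_actions_completed": m["complete"] == "true",
    }


for line in (LINE_126, LINE_125):
    print(parse_abort(line))
```

Filtering for all_actions_completed being False would leave only those aborts
where planned actions were still outstanding.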



Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-16 Thread Ken Gaillot
On Wed, 2022-02-16 at 21:47 +0300, Andrei Borzenkov wrote:
> On 16.02.2022 20:48, Andrei Borzenkov wrote:
> > I guess the real question here is why "Transition aborted" is
> > logged although
> > transition apparently continues. Transition 128 started at 20:54:30
> > and completed
> > at 21:04:26, but there were multiple "Transition 128 aborted"
> > messages in between
> > (unfortunately one needs now to hunt for another mail to put them
> > together).
> > 
> > It looks like "Transition aborted" is more "we try to abort this
> > transition if
> > possible". My guess is that pacemaker must wait for currently
> > running action(s)
> > which can take quite some time when stopping virtual domain.
> > Transition 128
> > was initiated when stopping vm_pathway, but we have no idea when it
> > was stopped.
> > 
> 
> Yes, when code logs "Transition aborted", nothing is really aborted.
> It just tells
> pacemaker to not start any further actions which are part of this
> transition. But
> for all I can tell it does not affect currently running action.

Exactly, any actions already initiated must finish before the next
transition can be calculated, because their results can affect what
needs to be done.

We don't kill actions in flight because it's perfectly reasonable for
actions to be split across multiple transitions. Often when some event
is happening, lots of micro-conditions (action results, node attribute
changes, etc.) change in a short time, and you'll see a new transition
after each one.
-- 
Ken Gaillot 



Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-16 Thread Andrei Borzenkov
On 16.02.2022 20:48, Andrei Borzenkov wrote:
> 
> I guess the real question here is why "Transition aborted" is logged although
> transition apparently continues. Transition 128 started at 20:54:30 and 
> completed
> at 21:04:26, but there were multiple "Transition 128 aborted" messages in 
> between
> (unfortunately one needs now to hunt for another mail to put them together).
> 
> It looks like "Transition aborted" is more "we try to abort this transition if
> possible". My guess is that pacemaker must wait for currently running 
> action(s)
> which can take quite some time when stopping virtual domain. Transition 128
> was initiated when stopping vm_pathway, but we have no idea when it was 
> stopped.
> 

Yes, when the code logs "Transition aborted", nothing is really aborted. It just
tells pacemaker not to start any further actions which are part of this
transition. But as far as I can tell it does not affect currently running actions.


Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-16 Thread Andrei Borzenkov
On 16.02.2022 14:35, Lentes, Bernd wrote:
> 
> 
> - On Feb 16, 2022, at 12:52 AM, kgaillot kgail...@redhat.com wrote:
> 
> 
>>> Any idea ?
>>> What is about that transition 128, which is aborted ?
>>
>> A transition is the set of actions that need to be taken in response to
>> current conditions. A transition is aborted any time conditions change
>> (here, the target-role being changed in the configuration), so that a
>> new set of actions can be calculated.
>>
>> Someone once defined a transition as an "action plan", and I'm tempted
>> to use that instead. Plus maybe replace "aborted" with "interrupted",
>> so then we'd have "Action plan interrupted" which is maybe a little
>> more understandable.
>>
>>>
>>> Transition 128 is finished:
>>> Feb 15 21:04:26 [15370] ha-idg-2   crmd:   notice:
>>> run_graph:   Transition 128 (Complete=1, Pending=0, Fired=0,
>>> Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-
>>> 3548.bz2): Complete
>>>
>>> And one second later the shutdown starts. Is that normal that there
>>> is such a big time gap ?
>>>
>>
>> No, there should be another transition calculated (with a "saving
>> input" message) immediately after the original transition is aborted.
>> What's the timestamp on that?
>> --
> 
> Hi Ken,
> 
> this is what i found:
> 
> Feb 15 20:54:30 [15369] ha-idg-2pengine:   notice: process_pe_message:
>   Calculated transition 128, saving inputs in 
> /var/lib/pacemaker/pengine/pe-input-3548.bz2
> Feb 15 20:54:30 [15370] ha-idg-2   crmd: info: do_state_transition:   
>   State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | 
> input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
> Feb 15 20:54:30 [15370] ha-idg-2   crmd:   notice: do_te_invoke:
> Processing graph 128 (ref=pe_calc-dc-1644954870-403) derived from 
> /var/lib/pacemaker/pengine/pe-input-3548.bz2
> Feb 15 20:54:30 [15370] ha-idg-2   crmd:   notice: te_rsc_command:  
> Initiating stop operation vm_pathway_stop_0 locally on ha-idg-2 | action 76
> 
> Feb 15 21:04:26 [15369] ha-idg-2pengine:   notice: process_pe_message:
>   Calculated transition 129, saving inputs in 
> /var/lib/pacemaker/pengine/pe-input-3549.bz2
> Feb 15 21:04:26 [15370] ha-idg-2   crmd: info: do_state_transition:   
>   State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | 
> input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
> Feb 15 21:04:26 [15370] ha-idg-2   crmd:   notice: do_te_invoke:
> Processing graph 129 (ref=pe_calc-dc-1644955466-405) derived from 
> /var/lib/pacemaker/pengine/pe-input-3549.bz2
> 


Splitting logs between different messages does not really help in interpreting
them.

I guess the real question here is why "Transition aborted" is logged although the
transition apparently continues. Transition 128 started at 20:54:30 and completed
at 21:04:26, but there were multiple "Transition 128 aborted" messages in between
(unfortunately one now needs to hunt for another mail to put them together).

It looks like "Transition aborted" is more "we try to abort this transition if
possible". My guess is that pacemaker must wait for currently running action(s),
which can take quite some time when stopping a virtual domain. Transition 128
was initiated when stopping vm_pathway, but we have no idea when it was stopped.



Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-16 Thread Lentes, Bernd


- On Feb 16, 2022, at 12:52 AM, kgaillot kgail...@redhat.com wrote:

> A transition is the set of actions that need to be taken in response to
> current conditions. A transition is aborted any time conditions change
> (here, the target-role being changed in the configuration), so that a
> new set of actions can be calculated.
> 
> Someone once defined a transition as an "action plan", and I'm tempted
> to use that instead. Plus maybe replace "aborted" with "interrupted",
> so then we'd have "Action plan interrupted" which is maybe a little
> more understandable.

These "Transition aborted" messages happen quite often.

Feb 15 20:53:25 [15370] ha-idg-2   crmd:   notice: abort_transition_graph:  Transition 126 aborted by vm_documents-oo-meta_attributes-target-role doing modify target-role=Stopped: Configuration change | cib=7.27453.0 source=te_update_diff_v2:483 path=/cib/configuration/resources/primitive[@id='vm_documents-oo']/meta_attributes[@id='vm_documents-oo-meta_attributes']/nvpair[@id='vm_documents-oo-meta_attributes-target-role'] complete=false

Feb 15 20:53:00 [15370] ha-idg-2   crmd: info: abort_transition_graph:  Transition 125 aborted by vm_amok-meta_attributes-target-role doing modify target-role=Stopped: Configuration change | cib=7.27452.0 source=te_update_diff_v2:483 path=/cib/configuration/resources/primitive[@id='vm_amok']/meta_attributes[@id='vm_amok-meta_attributes']/nvpair[@id='vm_amok-meta_attributes-target-role'] complete=true

Why is there sometimes "complete=true" and sometimes "complete=false"?
What does that mean?

Bernd



Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-16 Thread Lentes, Bernd


- On Feb 16, 2022, at 12:52 AM, kgaillot kgail...@redhat.com wrote:


>> Any idea ?
>> What is about that transition 128, which is aborted ?
> 
> A transition is the set of actions that need to be taken in response to
> current conditions. A transition is aborted any time conditions change
> (here, the target-role being changed in the configuration), so that a
> new set of actions can be calculated.
> 
> Someone once defined a transition as an "action plan", and I'm tempted
> to use that instead. Plus maybe replace "aborted" with "interrupted",
> so then we'd have "Action plan interrupted" which is maybe a little
> more understandable.
> 
>> 
>> Transition 128 is finished:
>> Feb 15 21:04:26 [15370] ha-idg-2   crmd:   notice:
>> run_graph:   Transition 128 (Complete=1, Pending=0, Fired=0,
>> Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-
>> 3548.bz2): Complete
>> 
>> And one second later the shutdown starts. Is that normal that there
>> is such a big time gap ?
>>
> 
> No, there should be another transition calculated (with a "saving
> input" message) immediately after the original transition is aborted.
> What's the timestamp on that?
> --

Hi Ken,

this is what I found:

Feb 15 20:54:30 [15369] ha-idg-2    pengine:   notice: process_pe_message:  Calculated transition 128, saving inputs in /var/lib/pacemaker/pengine/pe-input-3548.bz2
Feb 15 20:54:30 [15370] ha-idg-2   crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
Feb 15 20:54:30 [15370] ha-idg-2   crmd:   notice: do_te_invoke:    Processing graph 128 (ref=pe_calc-dc-1644954870-403) derived from /var/lib/pacemaker/pengine/pe-input-3548.bz2
Feb 15 20:54:30 [15370] ha-idg-2   crmd:   notice: te_rsc_command:  Initiating stop operation vm_pathway_stop_0 locally on ha-idg-2 | action 76

Feb 15 21:04:26 [15369] ha-idg-2    pengine:   notice: process_pe_message:  Calculated transition 129, saving inputs in /var/lib/pacemaker/pengine/pe-input-3549.bz2
Feb 15 21:04:26 [15370] ha-idg-2   crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
Feb 15 21:04:26 [15370] ha-idg-2   crmd:   notice: do_te_invoke:    Processing graph 129 (ref=pe_calc-dc-1644955466-405) derived from /var/lib/pacemaker/pengine/pe-input-3549.bz2

Bernd
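
To put numbers on the delay, the following Python sketch compares the
.bash_history timestamps from the first message of this thread with the time
transition 129 was calculated (21:04:26, from the log above). It assumes that
transition 129 is the one that picked up all of these stop requests, which
the excerpts suggest but do not prove; the names used are made up.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Time transition 129 was calculated, from the log excerpt above.
transition_129 = datetime.strptime("2022-02-15 21:04:26", FMT)

# "crm resource stop" commands from the .bash_history excerpt in this thread.
stop_commands = {
    "vm_greensql": "2022-02-15 20:55:44",
    "vm_ssh": "2022-02-15 20:56:34",
    "vm_sim": "2022-02-15 20:57:23",
    "vm_mouseidgenes": "2022-02-15 20:58:38",
    "vm_genetrap": "2022-02-15 21:00:24",
    "vm_severin": "2022-02-15 21:01:25",
    "vm_idcc_devel": "2022-02-15 21:01:34",
}


def seconds_waited(commands, calculated_at):
    """Seconds between each stop command and the next calculated transition."""
    return {
        name: int((calculated_at - datetime.strptime(stamp, FMT)).total_seconds())
        for name, stamp in commands.items()
    }


for name, waited in seconds_waited(stop_commands, transition_129).items():
    print(f"{name}: waited {waited} s")
```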



Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-15 Thread Ken Gaillot
On Tue, 2022-02-15 at 22:24 +0100, Lentes, Bernd wrote:
> Hi,
> 
> i have a weird behaviour in my 2-node-cluster.
> I stopped several VirtualDomains via "crm resource stop
> VirtualDomain", but the respective shutdown starts minutes later.
> All on the same host.
> 
> .bash_history:
>  
>  3520  2022-02-15 20:55:44 crm resource stop vm_greensql
>  3521  2022-02-15 20:56:34 crm resource stop vm_ssh
>  3522  2022-02-15 20:57:23 crm resource stop vm_sim
>  3523  2022-02-15 20:58:38 crm resource stop vm_mouseidgenes
>  3524  2022-02-15 21:00:24 crm resource stop vm_genetrap
>  3525  2022-02-15 21:01:25 crm resource stop vm_severin
>  3526  2022-02-15 21:01:34 crm resource stop vm_idcc_devel
> 
> /var/log/cluster/corosync.log:
> 
> Feb 15 20:55:45 [15365] ha-idg-2cib: info:
> cib_perform_op:  Diff: --- 7.27455.0 2
> Feb 15 20:55:45 [15365] ha-idg-2cib: info:
> cib_perform_op:  Diff: +++ 7.27456.0 138c70d41548c4cb1d767dd578a98b8f
> Feb 15 20:55:45 [15365] ha-idg-2cib: info:
> cib_perform_op:  +  /cib:  @epoch=27456
> Feb 15 20:55:45 [15365] ha-idg-2cib: info:
> cib_perform_op:  +  /cib/configuration/resources/primitive[@id='vm_gr
> eensql']/meta_attributes[@id='vm_greensql-
> meta_attributes']/nvpair[@id='vm_greensql-meta_attributes-target-
> role']:  @value=Stopped
> Feb 15 20:55:45 [15365] ha-idg-2cib: info:
> cib_process_request: Completed cib_apply_diff operation for
> section 'all': OK (rc=0, origin=ha-idg-1/cibadmin/2,
> version=7.27456.0)
> Feb 15 20:55:45 [15370] ha-idg-2   crmd: info:
> abort_transition_graph:  Transition 128 aborted by vm_greensql-
> meta_attributes-target-role doing modify target-role=Stopped:
> Configuration change | cib=7.27456.0 source=te_update_diff_v2:483
> path=/cib/configuration/resources/primitive[@id='vm_greensql']/meta_a
> tt
> ributes[@id='vm_greensql-meta_attributes']/nvpair[@id='vm_greensql-
> meta_attributes-target-role'] complete=false
>  ...
> Feb 15 20:56:35 [15365] ha-idg-2cib: info:
> cib_perform_op:  Diff: --- 7.27456.0 2
> Feb 15 20:56:35 [15365] ha-idg-2cib: info:
> cib_perform_op:  Diff: +++ 7.27457.0 (null)
> Feb 15 20:56:35 [15365] ha-idg-2cib: info:
> cib_perform_op:  +  /cib:  @epoch=27457
> Feb 15 20:56:35 [15365] ha-idg-2cib: info:
> cib_perform_op:  +  /cib/configuration/resources/primitive[@id='vm_ss
> h']/meta_attributes[@id='vm_ssh-meta_attributes']/nvpair[@id='vm_ssh-
> meta_attributes-target-role']:  @value=Stopped
> Feb 15 20:56:35 [15365] ha-idg-2cib: info:
> cib_process_request: Completed cib_apply_diff operation for
> section 'all': OK (rc=0, origin=ha-idg-1/cibadmin/2,
> version=7.27457.0)
> Feb 15 20:56:35 [15370] ha-idg-2   crmd: info:
> abort_transition_graph:  Transition 128 aborted by vm_ssh-
> meta_attributes-target-role doing modify target-role=Stopped:
> Configuration change | cib=7.27457.0 source=te_update_diff_v2:483
> path=/cib/configuration/resources/primitive[@id='vm_ssh']/meta_attrib
> utes[@i
> d='vm_ssh-meta_attributes']/nvpair[@id='vm_ssh-meta_attributes-
> target-role'] complete=false
>  ...
> Feb 15 20:57:24 [15365] ha-idg-2cib: info:
> cib_perform_op:  Diff: --- 7.27457.0 2
> Feb 15 20:57:24 [15365] ha-idg-2cib: info:
> cib_perform_op:  Diff: +++ 7.27458.0 7f91d8e52c8ff0887916ad921703fadd
> Feb 15 20:57:24 [15365] ha-idg-2cib: info:
> cib_perform_op:  +  /cib:  @epoch=27458
> Feb 15 20:57:24 [15365] ha-idg-2cib: info:
> cib_perform_op:  +  /cib/configuration/resources/primitive[@id='vm_si
> m']/meta_attributes[@id='vm_sim-meta_attributes']/nvpair[@id='vm_sim-
> meta_attributes-target-role']:  @value=Stopped
> Feb 15 20:57:24 [15365] ha-idg-2cib: info:
> cib_process_request: Completed cib_apply_diff operation for
> section 'all': OK (rc=0, origin=ha-idg-1/cibadmin/2,
> version=7.27458.0)
> Feb 15 20:57:24 [15370] ha-idg-2   crmd: info:
> abort_transition_graph:  Transition 128 aborted by vm_sim-
> meta_attributes-target-role doing modify target-role=Stopped:
> Configuration change | cib=7.27458.0 source=te_update_diff_v2:483
> path=/cib/configuration/resources/primitive[@id='vm_sim']/meta_attrib
> utes[@i
> d='vm_sim-meta_attributes']/nvpair[@id='vm_sim-meta_attributes-
> target-role'] complete=false
>  ...
> Feb 15 20:58:39 [15365] ha-idg-2cib: info:
> cib_perform_op:  Diff: --- 7.27458.0 2
> Feb 15 20:58:39 [15365] ha-idg-2cib: info:
> cib_perform_op:  Diff: +++ 7.27459.0 727c5953b33542602028bf903b0578bc
> Feb 15 20:58:39 [15365] ha-idg-2cib: info:
> cib_perform_op:  +  /cib:  @epoch=27459
> Feb 15 20:58:39 [15365] ha-idg-2cib: info:
> cib_perform_op:  +  /cib/configuration/resources/primitive[@id='vm_mo
> useidgenes']/meta_attributes[@id='vm_mouseidgenes-
> meta_attributes']/nvpair[@id='vm_mouseidgenes-meta_attributes-target-
> role']:  @value=Stopped
> Feb 15