On 18. 01. 21 at 20:08, Digimer wrote:
On 2021-01-18 4:49 a.m., Tomas Jelinek wrote:
Hi Digimer,

Regarding pcs behavior:

When deleting a resource, pcs first sets its target-role to Stopped,
pushes the change into pacemaker and waits for the resource to stop.
Once the resource stops, pcs removes the resource from CIB. If pcs
simply removed the resource from CIB without stopping it first, the
resource would be running as orphaned (until pacemaker stops it if
configured to do so). We want to avoid that.

If the resource cannot be stopped for whatever reason, pcs reports this
and advises running the delete command with --force. Running 'pcs
resource delete --force' skips the part where pcs sets target role and
waits for the resource to stop, making pcs simply remove the resource
from CIB.
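
The two paths described above can be sketched as a manual command sequence. This only approximates what pcs does internally: it assumes that `pcs resource disable --wait` is a fair manual stand-in for "set target-role to Stopped and wait for the stop". The `delete_resource` function name is hypothetical and used for illustration only.

```shell
#!/bin/sh
# Approximation of the two deletion paths described above.
# usage: delete_resource <resource-id> [--force]
delete_resource() {
    rsc="$1"
    if [ "${2:-}" = "--force" ]; then
        # Forced path: skip the stop and simply remove the resource from the CIB.
        pcs resource delete "$rsc" --force
    else
        # Normal path: set target-role=Stopped and wait for the stop,
        # then remove the resource from the CIB.
        pcs resource disable "$rsc" --wait &&
            pcs resource delete "$rsc"
    fi
}
```

Calling `delete_resource srv01-test` would follow the normal path; adding `--force` as the second argument takes the forced path and may leave the resource orphaned.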

I agree that pcs should handle deleting unmanaged resources in a better
way. We plan to address that, but it's not at the top of the priority list.
Our plan is actually to prevent deleting unmanaged resources (or require
--force to be specified to do so) based on the following scenario:

If a resource is deleted while in unmanaged state, it ends up in
ORPHANED state - it is removed from CIB but still present in the running
configuration. This can cause various issues, e.g. when an unmanaged
resource is stopped manually outside of the cluster, there might be
problems with stopping the resource upon deletion (while unmanaged),
which may end up with stonith being initiated - this is not desired.
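
Until pcs handles this itself, a wrapper along the following lines could act as a guard in the spirit of the plan above. It is a rough sketch, not pcs behavior: `is_unmanaged`, `guarded_delete` and the `--allow-unmanaged` flag are hypothetical names, and the check only looks at the resource's own `is-managed` meta attribute with the literal value `false` (it does not detect maintenance mode, resource defaults, or other boolean spellings).

```shell
#!/bin/sh
# Returns 0 if the resource's is-managed meta attribute is literally "false".
# If the attribute is not set, crm_resource fails and the resource is
# treated as managed.
is_unmanaged() {
    value="$(crm_resource --resource "$1" --meta --get-parameter is-managed 2>/dev/null)"
    [ "$value" = "false" ]
}

# Hypothetical guard: refuse to delete an unmanaged resource unless the
# caller explicitly acknowledges the risk.
# usage: guarded_delete <resource-id> [--allow-unmanaged]
guarded_delete() {
    rsc="$1"
    if is_unmanaged "$rsc"; then
        if [ "${2:-}" != "--allow-unmanaged" ]; then
            echo "refusing to delete unmanaged resource: $rsc" >&2
            return 1
        fi
        # Acknowledged: the resource becomes an orphan. As this thread shows,
        # a failed stop of the orphan can still end in fencing.
        pcs resource delete "$rsc" --force
    else
        pcs resource delete "$rsc"
    fi
}
```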


Regards,
Tomas

This logic makes sense. If I may propose a reason for an alternative method:

In my case, the idea I was experimenting with was to remove a running
server from cluster management, without actually shutting down the
server. This is somewhat contrived, I freely admit, but the idea of
taking a server out of the config entirely without shutting it down
could be useful in some cases.

In my case, I didn't worry about the orphaned state and the risk of it
trying to start elsewhere as there are additional safeguards in place to
prevent this (both in our software and in that DRBD is not set to
dual-primary, so the VM simply can't start elsewhere while it's running
somewhere).

Totally understand it's not a priority, but when this is addressed, some
special mechanism to say "I know this will leave it orphaned and that's
OK" would be nice to have.

You can do it even now with "pcs resource delete --force". I admit it's not the best way and an extra flag (--dont-stop or similar) would be better. I wrote the idea into our notes so it doesn't get forgotten.

Tomas


digimer

On 18. 01. 21 at 3:11, Digimer wrote:
Hi all,

    Mind the slew of questions, well into testing now and finding lots of
issues. This one is two questions... :)

    I set a server to be unmanaged in pacemaker while the server was
running. Then I tried to remove the resource, and it refused, saying it
couldn't stop it, and to use '--force'. So I did, and the node got
fenced. Now, the resource was set up with:

pcs resource create srv07-el6 ocf:alteeve:server name="srv07-el6" \
   meta allow-migrate="true" target-role="started" \
   op monitor interval="60" start timeout="INFINITY" \
   on-fail="block" stop timeout="INFINITY" on-fail="block" \
   migrate_to timeout="INFINITY"

    I would have expected the 'stop timeout="INFINITY" on-fail="block"' to
prevent fencing if the server failed to stop (question 1) and that if a
resource was unmanaged, that the resource wouldn't even try to stop
(question 2).

    Can someone help me understand what happened here?

digimer

More below;

====
[root@el8-a01n01 ~]# pcs resource remove srv01-test
Attempting to stop: srv01-test... Warning: 'srv01-test' is unmanaged
Error: Unable to stop: srv01-test before deleting (re-run with --force
to force deletion)
[root@el8-a01n01 ~]# pcs resource remove srv01-test --force
Deleting Resource - srv01-test
[root@el8-a01n01 ~]# client_loop: send disconnect: Broken pipe
====

    As you can see, the node was fenced. The logs on that node were:

====
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-execd[1872]:  warning:
srv01-test_stop_0 process (PID 113779) timed out
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-execd[1872]:  warning:
srv01-test_stop_0[113779] timed out after 20000ms
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-controld[1875]:  error:
Result of stop operation for srv01-test on el8-a01n01: Timed Out
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-controld[1875]:  notice:
el8-a01n01-srv01-test_stop_0:37 [ The server: [srv01-test] is indeed
running. It will be shut down now.\n ]
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> INFINITY
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) -> 1610935435
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> (unset)
Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker-attrd[1873]:  notice:
Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 -> (unset)
client_loop: send disconnect: Broken pipe
====

On the peer node, the logs showed:

====
Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Calculated transition 58, saving inputs in
/var/lib/pacemaker/pengine/pe-input-100.bz2
Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 58 (Complete=0, Pending=0, Fired=0, Skipped=0,
Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-100.bz2):
Complete
Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Calculated transition 59, saving inputs in
/var/lib/pacemaker/pengine/pe-input-101.bz2
Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 59 (Complete=0, Pending=0, Fired=0, Skipped=0,
Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-101.bz2):
Complete
Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Detected active orphan srv01-test running on el8-a01n01
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Clearing failure of srv01-test on el8-a01n02 because resource
parameters have changed
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Removing srv01-test from el8-a01n01
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Removing srv01-test from el8-a01n02
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Stop       srv01-test             (               el8-a01n01
)   due to node availability
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Calculated transition 60, saving inputs in
/var/lib/pacemaker/pengine/pe-input-102.bz2
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Initiating stop operation srv01-test_stop_0 on el8-a01n01
Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 60 aborted by deletion of
lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation removal
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 60 action 11 (srv01-test_stop_0 on el8-a01n01):
expected 'ok' but got 'error'
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 60 (Complete=2, Pending=0, Fired=0, Skipped=0,
Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-102.bz2):
Complete
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> INFINITY
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) -> 1610935435
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Unexpected result (error) was recorded for stop of srv01-test
on el8-a01n01 at Jan 18 02:03:35 2021
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Unexpected result (error) was recorded for stop of srv01-test
on el8-a01n01 at Jan 18 02:03:35 2021
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Cluster node el8-a01n01 will be fenced: srv01-test failed there
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Detected active orphan srv01-test running on el8-a01n01
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Scheduling Node el8-a01n01 for STONITH
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Stop of failed resource srv01-test is implicit after el8-a01n01
is fenced
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Fence (reboot) el8-a01n01 'srv01-test failed there'
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Move       virsh_node2_pulsar     ( el8-a01n01 -> el8-a01n02 )
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Stop       srv01-test             (               el8-a01n01
)   due to node availability
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Calculated transition 61 (with warnings), saving inputs in
/var/lib/pacemaker/pengine/pe-warn-1.bz2
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Unexpected result (error) was recorded for stop of srv01-test
on el8-a01n01 at Jan 18 02:03:35 2021
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Unexpected result (error) was recorded for stop of srv01-test
on el8-a01n01 at Jan 18 02:03:35 2021
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Cluster node el8-a01n01 will be fenced: srv01-test failed there
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Detected active orphan srv01-test running on el8-a01n01
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Forcing srv01-test away from el8-a01n01 after 1000000 failures
(max=1000000)
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Clearing failure of srv01-test on el8-a01n01 because it is
orphaned
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Scheduling Node el8-a01n01 for STONITH
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Stop of failed resource srv01-test is implicit after el8-a01n01
is fenced
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Fence (reboot) el8-a01n01 'srv01-test failed there'
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Move       virsh_node2_pulsar     ( el8-a01n01 -> el8-a01n02 )
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice:  * Stop       srv01-test             (               el8-a01n01
)   due to node availability
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
warning: Calculated transition 62 (with warnings), saving inputs in
/var/lib/pacemaker/pengine/pe-warn-2.bz2
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Requesting fencing (reboot) of node el8-a01n01
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Initiating start operation virsh_node2_pulsar_start_0 locally on
el8-a01n02
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Client pacemaker-controld.490050.72911c98 wants to fence (reboot)
'el8-a01n01' with device '(any)'
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Requesting peer fencing (reboot) targeting el8-a01n01
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> (unset)
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 -> (unset)
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
virsh_node2_pulsar is not eligible to fence (reboot) el8-a01n01:
static-list
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
virsh_node1_pulsar is eligible to fence (reboot) el8-a01n01: static-list
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 62 aborted by deletion of
lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation removal
Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Requesting that el8-a01n02 perform 'reboot' action targeting el8-a01n01
using 'virsh_node1_pulsar'
Jan 18 02:03:56 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Result of start operation for virsh_node2_pulsar on
el8-a01n02: ok
Jan 18 02:03:57 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Operation 'reboot' [646769] (call 4 from pacemaker-controld.490050) for
host 'el8-a01n01' with device 'virsh_node1_pulsar' returned: 0 (OK)
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Node el8-a01n01 state is now lost
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Removing all el8-a01n01 attributes for peer loss
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Node el8-a01n01 state is now lost
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-based[490045]:  notice:
Node el8-a01n01 state is now lost
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-based[490045]:  notice:
Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Node el8-a01n01 state is now lost
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-attrd[490048]:  notice:
Purged 1 peer with id=1 and/or uname=el8-a01n01 from the membership cache
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Action 'reboot' targeting el8-a01n01 using virsh_node1_pulsar on behalf
of pacemaker-controld.490050@el8-a01n02: OK
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-fenced[490046]:  notice:
Operation 'reboot' targeting el8-a01n01 on el8-a01n02 for
pacemaker-controld.490050@el8-a01n02.8ff64dd6: OK
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Stonith operation 4/2:62:0:e827eea0-dedc-4200-a207-c4095621b3c6:
OK (0)
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Peer el8-a01n01 was terminated (reboot) by el8-a01n02 on behalf
of pacemaker-controld.490050: OK
Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 62 (Complete=5, Pending=0, Fired=0, Skipped=1,
Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Removing srv01-test from el8-a01n02
Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Calculated transition 63, saving inputs in
/var/lib/pacemaker/pengine/pe-input-103.bz2
Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Initiating monitor operation virsh_node2_pulsar_monitor_60000
locally on el8-a01n02
Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Initiating delete operation srv01-test_delete_0 locally on
el8-a01n02
Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 63 aborted by deletion of
lrm_resource[@id='srv01-test']: Resource state removal
Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Result of monitor operation for virsh_node2_pulsar on
el8-a01n02: ok
Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 63 (Complete=2, Pending=0, Fired=0, Skipped=0,
Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-103.bz2):
Complete
Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-schedulerd[490049]:
notice: Calculated transition 64, saving inputs in
/var/lib/pacemaker/pengine/pe-input-104.bz2
Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0,
Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-104.bz2):
Complete
Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]:
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
====
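
To pick the stop failure and the fencing decision out of logs like the ones above, a small filter can help. This is a minimal sketch: the `fence_events` function name is hypothetical, and the patterns are taken only from the messages quoted in this thread, so other Pacemaker versions may word them differently. It expects one log entry per line, as in the journal, not the wrapped form shown in this email.

```shell
#!/bin/sh
# Print only the lines about the timed-out stop and the resulting fencing.
# Reads log text on stdin.
fence_events() {
    grep -E 'timed out|Timed Out|will be fenced|Requesting fencing|was terminated'
}
```

On a systemd-based node, something like `journalctl -u pacemaker --since "2021-01-18 02:03" | fence_events` would be one way to feed it, assuming the daemons log under the `pacemaker` unit.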


_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


