Re: [ClusterLabs] (Live) Migration failure results in a stop operation
On Tue, 2018-02-20 at 02:13 -0500, Digimer wrote:
> On 2018-02-20 12:07 AM, Digimer wrote:
> > Hi all,
> >
> > Is there a way to tell pacemaker that, if a migration operation
> > fails, to just leave the service on the host node? The service
> > being hosted is a VM and a migration failure that triggers a shut
> > down and reboot is very disruptive. I'd rather just leave it alone
> > (and let a human fix the underlying problem).
> >
> > Thanks!
>
> I should mention; I tried setting the 'on-fail' for the 'migrate_to'
> and 'migrate_from' operations;
>
> pcs resource create srv01-c7 ocf:alteeve:server name="srv01-c7" \
>    meta allow-migrate="true" op monitor interval="60" \
>    op stop on-fail="block" op migrate_to on-fail="ignore" \
>    op migrate_from on-fail="ignore" \

I think you want "block" (don't take any further action) rather than
"ignore" (proceed as if the action succeeded). With "ignore", you should
see log messages like "Pretending the failure of ... succeeded".
"ignore" is rarely useful, mainly when debugging a resource agent that
is wrongly returning an error.
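In practice that advice amounts to switching the two migration operations from "ignore" to "block". A sketch with pcs, assuming the srv01-c7 resource from this thread (exact pcs syntax can vary between versions, so treat this as illustrative rather than authoritative):

```shell
# Change the on-fail policy of both migration operations so that a
# failed migration leaves the VM where it is instead of being treated
# as a success (which cascades into migrate_from and a stop).
pcs resource update srv01-c7 \
    op migrate_to on-fail="block" \
    op migrate_from on-fail="block"

# Confirm the operations now carry on-fail=block:
pcs resource show srv01-c7
```

Note that "block" also leaves the resource unmanaged until the failure is cleared (e.g. with `pcs resource cleanup srv01-c7`), which matches the "let a human fix the underlying problem" goal.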
> meta allow-migrate="true" failure-timeout="75"
>
> [root@m3-a02n01 ~]# pcs config
> Cluster Name: m3-anvil-02
> Corosync Nodes:
>  m3-a02n01.alteeve.com m3-a02n02.alteeve.com
> Pacemaker Nodes:
>  m3-a02n01.alteeve.com m3-a02n02.alteeve.com
>
> Resources:
>  Clone: hypervisor-clone
>   Meta Attrs: clone-max=2 notify=false
>   Resource: hypervisor (class=systemd type=libvirtd)
>    Operations: monitor interval=60 (hypervisor-monitor-interval-60)
>                start interval=0s timeout=100 (hypervisor-start-interval-0s)
>                stop interval=0s timeout=100 (hypervisor-stop-interval-0s)
>  Resource: srv01-c7 (class=ocf provider=alteeve type=server)
>   Attributes: name=srv01-c7
>   Meta Attrs: allow-migrate=true failure-timeout=75
>   Operations: migrate_from interval=0s on-fail=ignore (srv01-c7-migrate_from-interval-0s)
>               migrate_to interval=0s on-fail=ignore (srv01-c7-migrate_to-interval-0s)
>               monitor interval=60 (srv01-c7-monitor-interval-60)
>               start interval=0s timeout=30 (srv01-c7-start-interval-0s)
>               stop interval=0s on-fail=block (srv01-c7-stop-interval-0s)
>
> Stonith Devices:
>  Resource: virsh_node1 (class=stonith type=fence_virsh)
>   Attributes: delay=15 ipaddr=10.255.255.250 login=root passwd="secret" pcmk_host_list=m3-a02n01.alteeve.com port=m3-a02n01
>   Operations: monitor interval=60 (virsh_node1-monitor-interval-60)
>  Resource: virsh_node2 (class=stonith type=fence_virsh)
>   Attributes: ipaddr=10.255.255.250 login=root passwd="secret" pcmk_host_list=m3-a02n02.alteeve.com port=m3-a02n02
>   Operations: monitor interval=60 (virsh_node2-monitor-interval-60)
> Fencing Levels:
>
> Location Constraints:
>   Resource: srv01-c7
>     Enabled on: m3-a02n02.alteeve.com (score:50) (id:location-srv01-c7-m3-a02n02.alteeve.com-50)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
>
> Alerts:
>  No alerts defined
>
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: m3-anvil-02
>  dc-version: 1.1.16-12.el7_4.7-94ff4df
>  have-watchdog: false
>  last-lrm-refresh: 1518584295
>
> Quorum:
>   Options:
>
> When I tried to migrate (with the RA set to fail on purpose), I got:
>
> Node 1
> Feb 20 07:06:40 m3-a02n01.alteeve.com crmd[1865]: notice: Result of migrate_to operation for srv01-c7 on m3-a02n01.alteeve.com: 1 (unknown error)
> Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3440]: 167; ocf:alteeve:server invoked.
> Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3442]: 1360; Command line switch: [stop] -> [#!SET!#]
>
> Node 2
> Feb 20 07:05:37 m3-a02n02.alteeve.com crmd[2394]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]: notice:  * Migrate srv01-c7 ( m3-a02n01.alteeve.com -> m3-a02n02.alteeve.com )
> Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]: notice: Calculated transition 756, saving inputs in /var/lib/pacemaker/pengine/pe-input-172.bz2
> Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]: notice: Initiating migrate_to operation srv01-c7_migrate_to_0 on m3-a02n01.alteeve.com
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: notice: Initiating migrate_from operation srv01
Re: [ClusterLabs] (Live) Migration failure results in a stop operation
On 2018-02-20 12:07 AM, Digimer wrote:
> Hi all,
>
> Is there a way to tell pacemaker that, if a migration operation fails,
> to just leave the service on the host node? The service being hosted is
> a VM and a migration failure that triggers a shut down and reboot is
> very disruptive. I'd rather just leave it alone (and let a human fix the
> underlying problem).
>
> Thanks!

I should mention; I tried setting the 'on-fail' for the 'migrate_to' and
'migrate_from' operations;

pcs resource create srv01-c7 ocf:alteeve:server name="srv01-c7" \
   meta allow-migrate="true" op monitor interval="60" \
   op stop on-fail="block" op migrate_to on-fail="ignore" \
   op migrate_from on-fail="ignore" \
   meta allow-migrate="true" failure-timeout="75"

[root@m3-a02n01 ~]# pcs config
Cluster Name: m3-anvil-02
Corosync Nodes:
 m3-a02n01.alteeve.com m3-a02n02.alteeve.com
Pacemaker Nodes:
 m3-a02n01.alteeve.com m3-a02n02.alteeve.com

Resources:
 Clone: hypervisor-clone
  Meta Attrs: clone-max=2 notify=false
  Resource: hypervisor (class=systemd type=libvirtd)
   Operations: monitor interval=60 (hypervisor-monitor-interval-60)
               start interval=0s timeout=100 (hypervisor-start-interval-0s)
               stop interval=0s timeout=100 (hypervisor-stop-interval-0s)
 Resource: srv01-c7 (class=ocf provider=alteeve type=server)
  Attributes: name=srv01-c7
  Meta Attrs: allow-migrate=true failure-timeout=75
  Operations: migrate_from interval=0s on-fail=ignore (srv01-c7-migrate_from-interval-0s)
              migrate_to interval=0s on-fail=ignore (srv01-c7-migrate_to-interval-0s)
              monitor interval=60 (srv01-c7-monitor-interval-60)
              start interval=0s timeout=30 (srv01-c7-start-interval-0s)
              stop interval=0s on-fail=block (srv01-c7-stop-interval-0s)

Stonith Devices:
 Resource: virsh_node1 (class=stonith type=fence_virsh)
  Attributes: delay=15 ipaddr=10.255.255.250 login=root passwd="secret" pcmk_host_list=m3-a02n01.alteeve.com port=m3-a02n01
  Operations: monitor interval=60 (virsh_node1-monitor-interval-60)
 Resource: virsh_node2 (class=stonith type=fence_virsh)
  Attributes: ipaddr=10.255.255.250 login=root passwd="secret" pcmk_host_list=m3-a02n02.alteeve.com port=m3-a02n02
  Operations: monitor interval=60 (virsh_node2-monitor-interval-60)
Fencing Levels:

Location Constraints:
  Resource: srv01-c7
    Enabled on: m3-a02n02.alteeve.com (score:50) (id:location-srv01-c7-m3-a02n02.alteeve.com-50)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: m3-anvil-02
 dc-version: 1.1.16-12.el7_4.7-94ff4df
 have-watchdog: false
 last-lrm-refresh: 1518584295

Quorum:
  Options:

When I tried to migrate (with the RA set to fail on purpose), I got:

Node 1
Feb 20 07:06:40 m3-a02n01.alteeve.com crmd[1865]: notice: Result of migrate_to operation for srv01-c7 on m3-a02n01.alteeve.com: 1 (unknown error)
Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3440]: 167; ocf:alteeve:server invoked.
Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3442]: 1360; Command line switch: [stop] -> [#!SET!#]

Node 2
Feb 20 07:05:37 m3-a02n02.alteeve.com crmd[2394]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]: notice:  * Migrate srv01-c7 ( m3-a02n01.alteeve.com -> m3-a02n02.alteeve.com )
Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]: notice: Calculated transition 756, saving inputs in /var/lib/pacemaker/pengine/pe-input-172.bz2
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]: notice: Initiating migrate_to operation srv01-c7_migrate_to_0 on m3-a02n01.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: notice: Initiating migrate_from operation srv01-c7_migrate_from_0 locally on m3-a02n02.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3396]: 167; ocf:alteeve:server invoked.
Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3398]: 1360; Command line switch: [migrate_from] -> [#!SET!#]
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: notice: Result of migrate_from operation for srv01-c7 on m3-a02n02.alteeve.com: 1 (unknown error)
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: warning: Action 23 (srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: warning: Act
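The sequence in these logs (failed migrate_to treated as success, then migrate_from attempted on the target, then recovery) can be summarized as a toy model. This is not Pacemaker code; the function name and return strings are made up purely to illustrate the semantics of the three on-fail policies discussed in this thread:

```python
# Toy model (not Pacemaker's actual implementation) of how the cluster
# reacts to a failed live-migration operation under different on-fail
# settings. After a failed migrate_to, the VM's state is uncertain, so
# the default policy recovers with a full stop/start.

def recovery_action(on_fail: str) -> str:
    """Return the follow-up the cluster takes for a failed migrate_to."""
    if on_fail == "ignore":
        # Treated as if it succeeded: the cluster proceeds to
        # migrate_from on the target, which can also fail (as in the
        # logs above), ultimately ending in a stop.
        return "pretend-success"
    if on_fail == "block":
        # No further action: the resource is left where it is and
        # marked unmanaged until an operator intervenes.
        return "leave-alone"
    # Default ("restart"): stop on the source, then start elsewhere,
    # i.e. the disruptive shutdown/reboot described in the question.
    return "stop-then-start"

print(recovery_action("block"))    # leave-alone
print(recovery_action("ignore"))   # pretend-success
print(recovery_action("restart"))  # stop-then-start
```

This is why "block" on migrate_to/migrate_from matches the original goal: the failed migration is simply frozen for a human to investigate, rather than "succeeding" its way into a stop.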