Re: [ClusterLabs] (Live) Migration failure results in a stop operation

2018-02-20 Thread Ken Gaillot
On Tue, 2018-02-20 at 02:13 -0500, Digimer wrote:
> On 2018-02-20 12:07 AM, Digimer wrote:
> > Hi all,
> > 
> >   Is there a way to tell pacemaker that, if a migration operation
> > fails, to just leave the service on the host node? The service being
> > hosted is a VM and a migration failure that triggers a shut down and
> > reboot is very disruptive. I'd rather just leave it alone (and let a
> > human fix the underlying problem).
> > 
> > Thanks!
> > 
> 
> I should mention; I tried setting the 'on-fail' for the 'migrate_to'
> and 'migrate_from' operations;
> 
> pcs resource create srv01-c7 ocf:alteeve:server name="srv01-c7" \
>        op monitor interval="60" \
>        op stop on-fail="block" op migrate_to on-fail="ignore" \
>        op migrate_from on-fail="ignore" \
>        meta allow-migrate="true" failure-timeout="75"

I think you want "block" (don't take any further action) rather than
"ignore" (proceed as if the action succeeded).

With "ignore", you should see log messages like "Pretending the failure
of ... succeeded". "ignore" is rarely useful, mainly when debugging a
resource agent that is wrongly returning an error.
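
A rough sketch of that change with pcs (untested; resource and operation
names taken from your config, and "pcs resource update" should overwrite
the existing on-fail values for those operations):

pcs resource update srv01-c7 \
       op migrate_to on-fail="block" \
       op migrate_from on-fail="block"

With "block", a failed migrate_to/migrate_from leaves the resource where
it is, and the cluster takes no further action on it until the failure
is cleaned up (e.g. "pcs resource cleanup srv01-c7").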

> meta allow-migrate="true" failure-timeout="75"
> 
>  [root@m3-a02n01 ~]# pcs config
> Cluster Name: m3-anvil-02
> Corosync Nodes:
>  m3-a02n01.alteeve.com m3-a02n02.alteeve.com
> Pacemaker Nodes:
>  m3-a02n01.alteeve.com m3-a02n02.alteeve.com
> 
> Resources:
>  Clone: hypervisor-clone
>   Meta Attrs: clone-max=2 notify=false
>   Resource: hypervisor (class=systemd type=libvirtd)
>    Operations: monitor interval=60 (hypervisor-monitor-interval-60)
>    start interval=0s timeout=100 (hypervisor-start-
> interval-0s)
>    stop interval=0s timeout=100 (hypervisor-stop-
> interval-0s)
>  Resource: srv01-c7 (class=ocf provider=alteeve type=server)
>   Attributes: name=srv01-c7
>   Meta Attrs: allow-migrate=true failure-timeout=75
>   Operations: migrate_from interval=0s on-fail=ignore
> (srv01-c7-migrate_from-interval-0s)
>   migrate_to interval=0s on-fail=ignore
> (srv01-c7-migrate_to-interval-0s)
>   monitor interval=60 (srv01-c7-monitor-interval-60)
>   start interval=0s timeout=30 (srv01-c7-start-interval-
> 0s)
>   stop interval=0s on-fail=block (srv01-c7-stop-interval-
> 0s)
> 
> Stonith Devices:
>  Resource: virsh_node1 (class=stonith type=fence_virsh)
>   Attributes: delay=15 ipaddr=10.255.255.250 login=root
> passwd="secret"
> pcmk_host_list=m3-a02n01.alteeve.com port=m3-a02n01
>   Operations: monitor interval=60 (virsh_node1-monitor-interval-60)
>  Resource: virsh_node2 (class=stonith type=fence_virsh)
>   Attributes: ipaddr=10.255.255.250 login=root passwd="secret"
> pcmk_host_list=m3-a02n02.alteeve.com port=m3-a02n02
>   Operations: monitor interval=60 (virsh_node2-monitor-interval-60)
> Fencing Levels:
> 
> Location Constraints:
>   Resource: srv01-c7
> Enabled on: m3-a02n02.alteeve.com (score:50)
> (id:location-srv01-c7-m3-a02n02.alteeve.com-50)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
> 
> Alerts:
>  No alerts defined
> 
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
> 
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: m3-anvil-02
>  dc-version: 1.1.16-12.el7_4.7-94ff4df
>  have-watchdog: false
>  last-lrm-refresh: 1518584295
> 
> Quorum:
>   Options:
> 
> 
> When I tried to migrate (with the RA set to fail on purpose), I got:
> 
>  Node 1
> Feb 20 07:06:40 m3-a02n01.alteeve.com crmd[1865]:   notice: Result of
> migrate_to operation for srv01-c7 on m3-a02n01.alteeve.com: 1
> (unknown
> error)
> Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3440]: 167;
> ocf:alteeve:server invoked.
> Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3442]: 1360;
> Command line switch: [stop] -> [#!SET!#]
> 
> 
>  Node 2
> Feb 20 07:05:37 m3-a02n02.alteeve.com crmd[2394]:   notice: State
> transition S_TRANSITION_ENGINE -> S_IDLE
> Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice: State
> transition S_IDLE -> S_POLICY_ENGINE
> Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]:   notice:  *
> Migratesrv01-c7( m3-a02n01.alteeve.com ->
> m3-a02n02.alteeve.com )
> Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]:   notice:
> Calculated transition 756, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-172.bz2
> Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice:
> Initiating
> migrate_to operation srv01-c7_migrate_to_0 on m3-a02n01.alteeve.com
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22
> (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0
> vs.
> rc: 1): Error
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22
> (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0
> vs.
> rc: 1): Error
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice:
> Initiating
> migrate_from operation srv01

Re: [ClusterLabs] (Live) Migration failure results in a stop operation

2018-02-19 Thread Digimer
On 2018-02-20 12:07 AM, Digimer wrote:
> Hi all,
> 
>   Is there a way to tell pacemaker that, if a migration operation fails,
> to just leave the service on the host node? The service being hosted is
> a VM and a migration failure that triggers a shut down and reboot is
> very disruptive. I'd rather just leave it alone (and let a human fix the
> underlying problem).
> 
> Thanks!
> 

I should mention; I tried setting the 'on-fail' for the 'migrate_to' and
'migrate_from' operations;

pcs resource create srv01-c7 ocf:alteeve:server name="srv01-c7" \
       op monitor interval="60" \
       op stop on-fail="block" op migrate_to on-fail="ignore" \
       op migrate_from on-fail="ignore" \
       meta allow-migrate="true" failure-timeout="75"

 [root@m3-a02n01 ~]# pcs config
Cluster Name: m3-anvil-02
Corosync Nodes:
 m3-a02n01.alteeve.com m3-a02n02.alteeve.com
Pacemaker Nodes:
 m3-a02n01.alteeve.com m3-a02n02.alteeve.com

Resources:
 Clone: hypervisor-clone
  Meta Attrs: clone-max=2 notify=false
  Resource: hypervisor (class=systemd type=libvirtd)
   Operations: monitor interval=60 (hypervisor-monitor-interval-60)
               start interval=0s timeout=100 (hypervisor-start-interval-0s)
               stop interval=0s timeout=100 (hypervisor-stop-interval-0s)
 Resource: srv01-c7 (class=ocf provider=alteeve type=server)
  Attributes: name=srv01-c7
  Meta Attrs: allow-migrate=true failure-timeout=75
  Operations: migrate_from interval=0s on-fail=ignore (srv01-c7-migrate_from-interval-0s)
              migrate_to interval=0s on-fail=ignore (srv01-c7-migrate_to-interval-0s)
              monitor interval=60 (srv01-c7-monitor-interval-60)
              start interval=0s timeout=30 (srv01-c7-start-interval-0s)
              stop interval=0s on-fail=block (srv01-c7-stop-interval-0s)

Stonith Devices:
 Resource: virsh_node1 (class=stonith type=fence_virsh)
  Attributes: delay=15 ipaddr=10.255.255.250 login=root passwd="secret"
              pcmk_host_list=m3-a02n01.alteeve.com port=m3-a02n01
  Operations: monitor interval=60 (virsh_node1-monitor-interval-60)
 Resource: virsh_node2 (class=stonith type=fence_virsh)
  Attributes: ipaddr=10.255.255.250 login=root passwd="secret"
              pcmk_host_list=m3-a02n02.alteeve.com port=m3-a02n02
  Operations: monitor interval=60 (virsh_node2-monitor-interval-60)
Fencing Levels:

Location Constraints:
  Resource: srv01-c7
    Enabled on: m3-a02n02.alteeve.com (score:50) (id:location-srv01-c7-m3-a02n02.alteeve.com-50)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: m3-anvil-02
 dc-version: 1.1.16-12.el7_4.7-94ff4df
 have-watchdog: false
 last-lrm-refresh: 1518584295

Quorum:
  Options:
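
For reference, the operation settings above are stored in the CIB as
"op" elements; a trimmed sketch of the XML (op ids as reported by pcs
above, other ids and attributes illustrative; the real thing can be
dumped with "pcs cluster cib"):

<primitive id="srv01-c7" class="ocf" provider="alteeve" type="server">
  <meta_attributes id="srv01-c7-meta_attributes">
    <nvpair id="srv01-c7-allow-migrate" name="allow-migrate" value="true"/>
    <nvpair id="srv01-c7-failure-timeout" name="failure-timeout" value="75"/>
  </meta_attributes>
  <operations>
    <op id="srv01-c7-migrate_to-interval-0s" name="migrate_to"
        interval="0s" on-fail="ignore"/>
    <op id="srv01-c7-migrate_from-interval-0s" name="migrate_from"
        interval="0s" on-fail="ignore"/>
    <op id="srv01-c7-stop-interval-0s" name="stop" interval="0s"
        on-fail="block"/>
  </operations>
</primitive>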


When I tried to migrate (with the RA set to fail on purpose), I got:

 Node 1
Feb 20 07:06:40 m3-a02n01.alteeve.com crmd[1865]:   notice: Result of
migrate_to operation for srv01-c7 on m3-a02n01.alteeve.com: 1 (unknown error)
Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3440]: 167;
ocf:alteeve:server invoked.
Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3442]: 1360;
Command line switch: [stop] -> [#!SET!#]


 Node 2
Feb 20 07:05:37 m3-a02n02.alteeve.com crmd[2394]:   notice: State
transition S_TRANSITION_ENGINE -> S_IDLE
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice: State
transition S_IDLE -> S_POLICY_ENGINE
Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]:   notice:
 * Migrate    srv01-c7    ( m3-a02n01.alteeve.com -> m3-a02n02.alteeve.com )
Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]:   notice:
Calculated transition 756, saving inputs in
/var/lib/pacemaker/pengine/pe-input-172.bz2
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating
migrate_to operation srv01-c7_migrate_to_0 on m3-a02n01.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22
(srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22
(srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating
migrate_from operation srv01-c7_migrate_from_0 locally on
m3-a02n02.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3396]: 167;
ocf:alteeve:server invoked.
Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3398]: 1360;
Command line switch: [migrate_from] -> [#!SET!#]
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Result of
migrate_from operation for srv01-c7 on m3-a02n02.alteeve.com: 1 (unknown error)
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 23
(srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Act
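
For anyone retracing this later: the transition input saved by pengine
(pe-input-172.bz2 in the log above) can be replayed offline to see
exactly which recovery actions were scheduled, along the lines of:

crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-172.bz2

This only simulates the transition and prints the resulting actions and
cluster status; it does not touch the live cluster.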