On 03/10/2016 09:49 AM, Lorand Kelemen wrote:
> Dear List,
>
> After the creation and testing of a simple 2-node active-passive
> drbd+postfix cluster, nearly everything works flawlessly (standby, failure
> of a filesystem resource + failover, split-brain + manual recovery).
> However, when deliberately killing the postfix instance, after reaching the
> migration threshold failover does not occur, and the resources revert to
> the Stopped state (except the master-slave drbd resource, which works as
> expected).
>
> Ordering and colocation are configured, STONITH and quorum are disabled;
> the goal is to always have one node running all the resources, and at any
> sign of error it should fail over to the passive node, nothing fancy.
>
> Is my configuration wrong or am I hitting a bug?
>
> All software is from the CentOS 7 + ELRepo repositories.

With these versions, you can set "two_node: 1" in /etc/corosync/corosync.conf
(which will be done automatically if you used "pcs cluster setup" initially),
and then you don't need to ignore quorum in pacemaker.
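For reference, that's just the quorum section of corosync.conf -- a minimal
sketch:

    quorum {
        provider: corosync_votequorum
        # two_node implies wait_for_all: both nodes must be seen once at
        # startup, but quorum is kept when one node later fails
        two_node: 1
    }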
> Regarding STONITH: the machines are running on free ESXi instances on
> separate machines, so the VMware fencing agents won't work, because in the
> free version the API is read-only.
> Still trying to figure out a way to go; until then, manual recovery + huge
> ARP cache times on the upstream firewall...
>
> Please find the pe-input*.bz2 files attached, logs and config below. The
> situation: on node mail1 postfix was killed 3 times (migration-threshold);
> it should have failed over to mail2.
> When killing a filesystem resource three times, this happens flawlessly.
>
> Thanks for your input!
>
> Best regards,
> Lorand
>
>
> Cluster Name: mailcluster
> Corosync Nodes:
>  mail1 mail2
> Pacemaker Nodes:
>  mail1 mail2
>
> Resources:
>  Group: network-services
>   Resource: virtualip-1 (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=10.20.64.10 cidr_netmask=24 nic=lan0
>    Operations: start interval=0s timeout=20s (virtualip-1-start-interval-0s)
>                stop interval=0s timeout=20s (virtualip-1-stop-interval-0s)
>                monitor interval=30s (virtualip-1-monitor-interval-30s)
>  Master: spool-clone
>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>   Resource: spool (class=ocf provider=linbit type=drbd)
>    Attributes: drbd_resource=spool
>    Operations: start interval=0s timeout=240 (spool-start-interval-0s)
>                promote interval=0s timeout=90 (spool-promote-interval-0s)
>                demote interval=0s timeout=90 (spool-demote-interval-0s)
>                stop interval=0s timeout=100 (spool-stop-interval-0s)
>                monitor interval=10s (spool-monitor-interval-10s)
>  Master: mail-clone
>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>   Resource: mail (class=ocf provider=linbit type=drbd)
>    Attributes: drbd_resource=mail
>    Operations: start interval=0s timeout=240 (mail-start-interval-0s)
>                promote interval=0s timeout=90 (mail-promote-interval-0s)
>                demote interval=0s timeout=90 (mail-demote-interval-0s)
>                stop interval=0s timeout=100 (mail-stop-interval-0s)
>                monitor interval=10s (mail-monitor-interval-10s)
>  Group: fs-services
>   Resource: fs-spool (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=/dev/drbd0 directory=/var/spool/postfix fstype=ext4 options=nodev,nosuid,noexec
>    Operations: start interval=0s timeout=60 (fs-spool-start-interval-0s)
>                stop interval=0s timeout=60 (fs-spool-stop-interval-0s)
>                monitor interval=20 timeout=40 (fs-spool-monitor-interval-20)
>   Resource: fs-mail (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=/dev/drbd1 directory=/var/spool/mail fstype=ext4 options=nodev,nosuid,noexec
>    Operations: start interval=0s timeout=60 (fs-mail-start-interval-0s)
>                stop interval=0s timeout=60 (fs-mail-stop-interval-0s)
>                monitor interval=20 timeout=40 (fs-mail-monitor-interval-20)
>  Group: mail-services
>   Resource: postfix (class=ocf provider=heartbeat type=postfix)
>    Operations: start interval=0s timeout=20s (postfix-start-interval-0s)
>                stop interval=0s timeout=20s (postfix-stop-interval-0s)
>                monitor interval=45s (postfix-monitor-interval-45s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
>   start network-services then promote mail-clone (kind:Mandatory) (id:order-network-services-mail-clone-mandatory)
>   promote mail-clone then promote spool-clone (kind:Mandatory) (id:order-mail-clone-spool-clone-mandatory)
>   promote spool-clone then start fs-services (kind:Mandatory) (id:order-spool-clone-fs-services-mandatory)
>   start fs-services then start mail-services (kind:Mandatory) (id:order-fs-services-mail-services-mandatory)
> Colocation Constraints:
>   network-services with spool-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-network-services-spool-clone-INFINITY)
>   network-services with mail-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-network-services-mail-clone-INFINITY)
>   network-services with fs-services (score:INFINITY) (id:colocation-network-services-fs-services-INFINITY)
>   network-services with mail-services (score:INFINITY) (id:colocation-network-services-mail-services-INFINITY)

I'm not sure whether it's causing your issue, but I would make the
constraints reflect the logical relationships better. For example,
network-services only needs to be colocated with mail-services logically;
it's mail-services that needs to be with fs-services, and fs-services that
needs to be with the spool-clone/mail-clone masters. In other words, don't
make the highest-level resource depend on everything else; make each level
depend on the level below it.

Also, I would guess that the virtual IP only needs to be ordered before
mail-services, and mail-clone and spool-clone could both be ordered before
fs-services, rather than ordering mail-clone before spool-clone.
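Untested, but with your resource names that restructuring could look
something like this in pcs (after removing the existing constraints with
"pcs constraint remove <id>", using the ids shown above):

    # Each level is colocated with the level below it
    pcs constraint colocation add fs-services with master spool-clone INFINITY
    pcs constraint colocation add fs-services with master mail-clone INFINITY
    pcs constraint colocation add mail-services with fs-services INFINITY
    pcs constraint colocation add network-services with mail-services INFINITY

    # Promote both DRBD masters before mounting the filesystems, mount
    # before starting postfix, and bring the IP up before the mail services
    pcs constraint order promote spool-clone then start fs-services
    pcs constraint order promote mail-clone then start fs-services
    pcs constraint order start fs-services then start mail-services
    pcs constraint order start network-services then start mail-services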
> Resources Defaults:
>  migration-threshold: 3
> Operations Defaults:
>  on-fail: restart
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: mailcluster
>  cluster-recheck-interval: 5min
>  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>  default-resource-stickiness: infinity
>  have-watchdog: false
>  last-lrm-refresh: 1457613674
>  no-quorum-policy: ignore
>  pe-error-series-max: 1024
>  pe-input-series-max: 1024
>  pe-warn-series-max: 1024
>  stonith-enabled: false
>
>
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: --- 0.197.15 2
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: +++ 0.197.16 (null)
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib: @num_updates=16
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_failure_0']: @transition-key=4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, @transition-magic=0:7;4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, @call-id=1274, @last-rc-change=1457613440
> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info: abort_transition_graph: Transition aborted by postfix_monitor_45000 'modify' on mail1: Inactive graph (magic=0:7;4:1234:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, cib=0.197.16, source=process_graph_event:598, 1)
> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info: update_failcount: Updating failcount for postfix on mail1 after failed monitor: rc=7 (update=value++, time=1457613440)
> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: attrd_client_update: Expanded fail-count-postfix=value++ to 3
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=mail1/crmd/196, version=0.197.16)
> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: attrd_peer_update: Setting fail-count-postfix[mail1]: 2 -> 3 from mail2
> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: write_attribute: Sent update 400 with 2 changes for fail-count-postfix, id=<n/a>, set=(null)
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/attrd/400)
> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info: process_graph_event: Detected action (1234.4) postfix_monitor_45000.1274=not running: failed
> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: attrd_peer_update: Setting last-failure-postfix[mail1]: 1457613347 -> 1457613440 from mail2
> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: write_attribute: Sent update 401 with 2 changes for last-failure-postfix, id=<n/a>, set=(null)
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: --- 0.197.16 2
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: +++ 0.197.17 (null)
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib: @num_updates=17
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-postfix']: @value=3
> Mar 10 13:37:20 [7415] HWJ-626.domain.local cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=mail2/attrd/400, version=0.197.17)
> Mar 10 13:37:20 [7420] HWJ-626.domain.local crmd: info: abort_transition_graph: Transition aborted by status-1-fail-count-postfix, fail-count-postfix=3: Transient attribute change (modify cib=0.197.17, source=abort_unless_down:319, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-fail-count-postfix'], 1)
> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: attrd_cib_callback: Update 400 for fail-count-postfix: OK (0)
> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: attrd_cib_callback: Update 400 for fail-count-postfix[mail1]=3: OK (0)
> Mar 10 13:37:20 [7418] HWJ-626.domain.local attrd: info: attrd_cib_callback: Update 400 for fail-count-postfix[mail2]=(null): OK (0)
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/attrd/401)
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: --- 0.197.17 2
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: +++ 0.197.18 (null)
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib: @num_updates=18
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix']: @value=1457613440
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=mail2/attrd/401, version=0.197.18)
> Mar 10 13:37:21 [7418] HWJ-626.domain.local attrd: info: attrd_cib_callback: Update 401 for last-failure-postfix: OK (0)
> Mar 10 13:37:21 [7418] HWJ-626.domain.local attrd: info: attrd_cib_callback: Update 401 for last-failure-postfix[mail1]=1457613440: OK (0)
> Mar 10 13:37:21 [7418] HWJ-626.domain.local attrd: info: attrd_cib_callback: Update 401 for last-failure-postfix[mail2]=1457610376: OK (0)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: abort_transition_graph: Transition aborted by status-1-last-failure-postfix, last-failure-postfix=1457613440: Transient attribute change (modify cib=0.197.18, source=abort_unless_down:319, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-last-failure-postfix'], 1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_online_status: Node mail1 is online
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_online_status: Node mail2 is online
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource mail:0 active in master mode on mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource spool:0 active in master mode on mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource fs-spool active on mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource fs-mail active on mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for postfix on mail1: not running (7)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource spool:1 active in master mode on mail2
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource mail:1 active in master mode on mail2
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: group_print: Resource Group: network-services
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: native_print: virtualip-1 (ocf::heartbeat:IPaddr2): Started mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: clone_print: Master/Slave Set: spool-clone [spool]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: short_print: Masters: [ mail1 ]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: short_print: Slaves: [ mail2 ]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: clone_print: Master/Slave Set: mail-clone [mail]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: short_print: Masters: [ mail1 ]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: short_print: Slaves: [ mail2 ]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: group_print: Resource Group: fs-services
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: native_print: fs-spool (ocf::heartbeat:Filesystem): Started mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: native_print: fs-mail (ocf::heartbeat:Filesystem): Started mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: group_print: Resource Group: mail-services
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: native_print: postfix (ocf::heartbeat:postfix): FAILED mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: get_failcount_full: postfix has failed 3 times on mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning: common_apply_stickiness: Forcing postfix away from mail1 after 3 failures (max=3)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: master_color: Promoting mail:0 (Master mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: master_color: mail-clone: Promoted 1 instances of a possible 1 to master
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: master_color: Promoting spool:0 (Master mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: master_color: spool-clone: Promoted 1 instances of a possible 1 to master
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: rsc_merge_weights: postfix: Rolling back scores from virtualip-1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: native_color: Resource virtualip-1 cannot run anywhere
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: RecurringOp: Start recurring monitor (45s) for postfix on mail2
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: LogActions: Stop virtualip-1 (mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: LogActions: Leave spool:0 (Master mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: LogActions: Leave spool:1 (Slave mail2)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: LogActions: Leave mail:0 (Master mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: LogActions: Leave mail:1 (Slave mail2)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: LogActions: Stop fs-spool (Started mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: LogActions: Stop fs-mail (Started mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: LogActions: Stop postfix (Started mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: process_pe_message: Calculated Transition 1235: /var/lib/pacemaker/pengine/pe-input-302.bz2
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: handle_response: pe_calc calculation pe_calc-dc-1457613441-3756 is obsolete
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_online_status: Node mail1 is online
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_online_status: Node mail2 is online
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource mail:0 active in master mode on mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource spool:0 active in master mode on mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource fs-spool active on mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource fs-mail active on mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for postfix on mail1: not running (7)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource spool:1 active in master mode on mail2
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: determine_op_status: Operation monitor found resource mail:1 active in master mode on mail2
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: group_print: Resource Group: network-services
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: native_print: virtualip-1 (ocf::heartbeat:IPaddr2): Started mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: clone_print: Master/Slave Set: spool-clone [spool]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: short_print: Masters: [ mail1 ]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: short_print: Slaves: [ mail2 ]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: clone_print: Master/Slave Set: mail-clone [mail]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: short_print: Masters: [ mail1 ]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: short_print: Slaves: [ mail2 ]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: group_print: Resource Group: fs-services
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: native_print: fs-spool (ocf::heartbeat:Filesystem): Started mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: native_print: fs-mail (ocf::heartbeat:Filesystem): Started mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: group_print: Resource Group: mail-services
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: native_print: postfix (ocf::heartbeat:postfix): FAILED mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: get_failcount_full: postfix has failed 3 times on mail1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: warning: common_apply_stickiness: Forcing postfix away from mail1 after 3 failures (max=3)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: master_color: Promoting mail:0 (Master mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: master_color: mail-clone: Promoted 1 instances of a possible 1 to master
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: master_color: Promoting spool:0 (Master mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: master_color: spool-clone: Promoted 1 instances of a possible 1 to master
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: rsc_merge_weights: postfix: Rolling back scores from virtualip-1
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: native_color: Resource virtualip-1 cannot run anywhere
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: RecurringOp: Start recurring monitor (45s) for postfix on mail2
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: LogActions: Stop virtualip-1 (mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: LogActions: Leave spool:0 (Master mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: LogActions: Leave spool:1 (Slave mail2)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: LogActions: Leave mail:0 (Master mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: info: LogActions: Leave mail:1 (Slave mail2)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: LogActions: Stop fs-spool (Started mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: LogActions: Stop fs-mail (Started mail1)
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: LogActions: Stop postfix (Started mail1)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Mar 10 13:37:21 [7419] HWJ-626.domain.local pengine: notice: process_pe_message: Calculated Transition 1236: /var/lib/pacemaker/pengine/pe-input-303.bz2
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: do_te_invoke: Processing graph 1236 (ref=pe_calc-dc-1457613441-3757) derived from /var/lib/pacemaker/pengine/pe-input-303.bz2
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: te_rsc_command: Initiating action 12: stop virtualip-1_stop_0 on mail1
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: te_rsc_command: Initiating action 5: stop postfix_stop_0 on mail1
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: --- 0.197.18 2
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: +++ 0.197.19 (null)
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib: @num_updates=19
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='virtualip-1']/lrm_rsc_op[@id='virtualip-1_last_0']: @operation_key=virtualip-1_stop_0, @operation=stop, @transition-key=12:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, @transition-magic=0:0;12:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, @call-id=1276, @last-run=1457613441, @last-rc-change=1457613441, @exec-time=66
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=mail1/crmd/197, version=0.197.19)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: match_graph_event: Action virtualip-1_stop_0 (12) confirmed on mail1 (rc=0)
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: --- 0.197.19 2
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: +++ 0.197.20 (null)
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib: @num_updates=20
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='postfix']/lrm_rsc_op[@id='postfix_last_0']: @operation_key=postfix_stop_0, @operation=stop, @transition-key=5:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, @transition-magic=0:0;5:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, @call-id=1278, @last-run=1457613441, @last-rc-change=1457613441, @exec-time=476
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: match_graph_event: Action postfix_stop_0 (5) confirmed on mail1 (rc=0)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: te_rsc_command: Initiating action 79: stop fs-mail_stop_0 on mail1
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=mail1/crmd/198, version=0.197.20)
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: --- 0.197.20 2
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: +++ 0.197.21 (null)
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib: @num_updates=21
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='fs-mail']/lrm_rsc_op[@id='fs-mail_last_0']: @operation_key=fs-mail_stop_0, @operation=stop, @transition-key=79:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, @transition-magic=0:0;79:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, @call-id=1280, @last-run=1457613441, @last-rc-change=1457613441, @exec-time=88, @queue-time=1
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=mail1/crmd/199, version=0.197.21)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: match_graph_event: Action fs-mail_stop_0 (79) confirmed on mail1 (rc=0)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: te_rsc_command: Initiating action 77: stop fs-spool_stop_0 on mail1
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: --- 0.197.21 2
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: Diff: +++ 0.197.22 (null)
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib: @num_updates=22
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='fs-spool']/lrm_rsc_op[@id='fs-spool_last_0']: @operation_key=fs-spool_stop_0, @operation=stop, @transition-key=77:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, @transition-magic=0:0;77:1236:0:ae755a85-c250-498f-9c94-ddd8a7e2788a, @call-id=1282, @last-run=1457613441, @last-rc-change=1457613441, @exec-time=86
> Mar 10 13:37:21 [7415] HWJ-626.domain.local cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=mail1/crmd/200, version=0.197.22)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: match_graph_event: Action fs-spool_stop_0 (77) confirmed on mail1 (rc=0)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: warning: run_graph: Transition 1236 (Complete=11, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-303.bz2): Terminated
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: warning: te_graph_trigger: Transition failed: terminated
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_graph: Graph 1236 with 12 actions: batch-limit=12 jobs, network-delay=0ms
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 16]: Completed pseudo op network-services_stopped_0 on N/A (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 15]: Completed pseudo op network-services_stop_0 on N/A (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 12]: Completed rsc op virtualip-1_stop_0 on mail1 (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 84]: Completed pseudo op fs-services_stopped_0 on N/A (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 83]: Completed pseudo op fs-services_stop_0 on N/A (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 77]: Completed rsc op fs-spool_stop_0 on mail1 (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 79]: Completed rsc op fs-mail_stop_0 on mail1 (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 90]: Completed pseudo op mail-services_stopped_0 on N/A (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 89]: Completed pseudo op mail-services_stop_0 on N/A (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 86]: Pending rsc op postfix_monitor_45000 on mail2 (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: * [Input 85]: Unresolved dependency rsc op postfix_start_0 on mail2
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 5]: Completed rsc op postfix_stop_0 on mail1 (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: print_synapse: [Action 8]: Completed pseudo op all_stopped on N/A (priority: 0, waiting: none)
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> Mar 10 13:37:21 [7420] HWJ-626.domain.local crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Mar 10 13:37:26 [7415] HWJ-626.domain.local cib: info: cib_process_ping: Reporting our current digest to mail2: 3896ee29cdb6ba128330b0ef6e41bd79 for 0.197.22 (0x1544a30 0)

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
