Hi everyone, and thank you, Matsushima-san, for your response.

By examining the logs, I have found the reason for the restart. The systemd service registered as a cluster resource was also enabled as a plain systemd unit. As a consequence: (1) the service starts automatically during the OS boot sequence, (2) Pacemaker detects the service running on both nodes, and (3) Pacemaker recovers by stopping the service on both nodes and restarting it on the active one, so the other node becomes passive again; this is the restart we observed.
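A quick way to confirm this condition on each node (a minimal check, assuming the unit name httpd as in our setup; "enabled" means systemd will auto-start the unit at boot, behind Pacemaker's back):

# systemctl is-enabled httpd
enabled
# systemctl is-active httpd
active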
Here is the log of (2) and (3):

May 2 02:51:41 node-1 pengine[1111]: error: Resource apache-httpd (systemd::httpd) is active on 2 nodes attempting recovery
May 2 02:51:41 node-1 pengine[1111]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
May 2 02:51:41 node-1 pengine[1111]: notice: Restart apache-httpd#011(Started node-1)
May 2 02:51:41 node-1 pengine[1111]: error: Calculated transition 48 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-53.bz2
May 2 02:51:41 node-1 crmd[1112]: notice: Initiating stop operation apache-httpd_stop_0 on node-2
May 2 02:51:41 node-1 crmd[1112]: notice: Initiating stop operation apache-httpd_stop_0 locally on node-1
May 2 02:51:41 node-1 systemd: Reloading.
May 2 02:51:41 node-1 systemd: Stopping The Apache HTTP Server...
May 2 02:51:42 node-1 systemd: Stopped The Apache HTTP Server.
May 2 02:51:43 node-1 crmd[1112]: notice: Result of stop operation for apache-httpd on node-1: 0 (ok)
May 2 02:51:43 node-1 crmd[1112]: notice: Initiating start operation apache-httpd_start_0 locally on node-1
May 2 02:51:43 node-1 systemd: Reloading.
May 2 02:51:44 node-1 systemd: Starting Cluster Controlled httpd...

The fix is therefore obvious: the systemd service registered as a cluster resource should be disabled as a systemd unit (on both nodes), so that it is started by Pacemaker only.

# systemctl disable httpd
Removed symlink /etc/systemd/system/multi-user.target.wants/httpd.service.

Here is the log on node-1 during node-2's boot-up, after httpd was disabled as a systemd service:

May 2 04:08:51 node-1 corosync[1057]: [TOTEM ] A new membership (192.168.1.201:720) was formed. Members joined: 2
May 2 04:08:51 node-1 corosync[1057]: [QUORUM] Members[2]: 1 2
May 2 04:08:51 node-1 corosync[1057]: [MAIN ] Completed service synchronization, ready to provide service.
May 2 04:08:51 node-1 pacemakerd[1064]: notice: Node node-2 state is now member
May 2 04:08:51 node-1 crmd[1074]: notice: Node node-2 state is now member
May 2 04:08:52 node-1 attrd[1072]: notice: Node node-2 state is now member
May 2 04:08:52 node-1 stonith-ng[1070]: notice: Node node-2 state is now member
May 2 04:08:53 node-1 cib[1069]: notice: Node node-2 state is now member
May 2 04:08:53 node-1 crmd[1074]: notice: State transition S_IDLE -> S_INTEGRATION
May 2 04:08:56 node-1 pengine[1073]: notice: On loss of CCM Quorum: Ignore
May 2 04:08:56 node-1 pengine[1073]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-232.bz2
May 2 04:08:56 node-1 crmd[1074]: notice: Initiating monitor operation ClusterIP_monitor_0 on node-2
May 2 04:08:56 node-1 crmd[1074]: notice: Initiating monitor operation apache-httpd_monitor_0 on node-2
May 2 04:08:56 node-1 crmd[1074]: notice: Transition 2 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-232.bz2): Complete
May 2 04:08:56 node-1 crmd[1074]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE

This time only probe (monitor) operations are initiated for node-2; no stop or start of apache-httpd occurs.

Have a nice day.
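P.S. To audit both nodes at once after the change, something like this works (a sketch; the node names and unit name are from our setup, and it assumes root ssh access between the nodes):

# for n in node-1 node-2; do ssh "$n" systemctl is-enabled httpd; done
disabled
disabled

Afterwards, crm_mon -1 should show apache-httpd started on exactly one node.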
> On 2017/05/01 at 19:03, Takehiro Matsushima <[email protected]> wrote:
>
> Hello Ishii-san,
>
> I could not reproduce the issue in my environment, CentOS 7 with Pacemaker 1.1.15.
> The following configuration works fine when rebooting a passive node
> (lighttpd is just an example of a systemd resource):
>
> ---- %< ----
> primitive ipaddr IPaddr2 \
>   params nic=enp0s10 ip=172.22.23.254 cidr_netmask=24 \
>   op start interval=0 timeout=20 on-fail=restart \
>   op stop interval=0 timeout=20 on-fail=ignore \
>   op monitor interval=10 timeout=20 on-fail=restart
> primitive lighttpd systemd:lighttpd \
>   op start interval=0 timeout=20 on-fail=restart \
>   op stop interval=0 timeout=20 on-fail=ignore \
>   op monitor interval=10 timeout=20 on-fail=restart
> colocation vip-colocation inf: ipaddr lighttpd
> order web-order inf: lighttpd ipaddr
> property cib-bootstrap-options: \
>   have-watchdog=false \
>   dc-version=1.1.15-1.el7-e174ec8 \
>   cluster-infrastructure=corosync \
>   no-quorum-policy=ignore \
>   startup-fencing=no \
>   stonith-enabled=no \
>   cluster-recheck-interval=1m
> rsc_defaults rsc-options: \
>   resource-stickiness=infinity \
>   migration-threshold=1
> ---- %< ----
>
> I made sure the resources did not restart and did not move by changing
> resource-stickiness to various values such as 10, 100 and 0.
> It also works when the colocation and order constraints are replaced by a
> "group" constraint.
>
> If you are watching the cluster's status with crm_mon, please run it with
> the "-t" option and watch "last-run" on the line of the "start" operation
> for each resource.
> If that time does not change when you reboot the passive node, the
> resource was not actually restarted.
>
> Thanks,
>
> Takehiro Matsushima
>
> 2017-04-30 19:32 GMT+09:00 石井 俊直 <[email protected]>:
>> Hi.
>>
>> We have a 2-node active/passive cluster on CentOS 7 with two cluster
>> resources: one is ocf:heartbeat:IPaddr2 and the other is a systemd-based
>> service. They have a colocation constraint.
>> The configuration seems mostly correct, and the resources normally run
>> without problems.
>>
>> When one of the nodes reboots, something happens that we do not want,
>> namely 5) below.
>> Suppose the nodes are node-1 and node-2, the cluster resources are running
>> on node-1, and we reboot node-2.
>> The following is the sequence of events:
>>
>> 1) node-2 shuts down
>> 2) node-1 detects node-2 is OFFLINE
>> 3) node-2 boots up
>> 4) node-1 detects node-2 is online; node-2 detects both are online
>> 5) the cluster resources running on node-1 stop
>> 6) the cluster resources start on node-1
>>
>> 6) follows from our setting resource-stickiness to something like 100.
>> Given that the service does not move to node-2 anyway, we do not want it
>> stopped even for a short while.
>>
>> If someone knows how to configure Pacemaker not to behave like 5), please
>> let us know.
>>
>> Thank you.
_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
