Hi everyone, and thank you, Matsushima-san, for your response.

By examining the logs, I have found the reason for the restart. The systemd service registered as a cluster resource was also enabled as a plain systemd unit. As a consequence: (1) the service starts automatically during the OS boot sequence, (2) Pacemaker detects the service running on both nodes, and (3) Pacemaker recovers by stopping the service on both nodes and restarting it on the active one, so the other node becomes passive again; this is the restart we observed.
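A quick way to confirm this condition on each node (a minimal check, assuming the unit name httpd as in our setup; "enabled" means systemd will auto-start the unit at boot, behind Pacemaker's back):

# systemctl is-enabled httpd
enabled
# systemctl is-active httpd
active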
Here is the log of (2) and (3):

May 2 02:51:41 node-1 pengine[1111]: error: Resource apache-httpd (systemd::httpd) is active on 2 nodes attempting recovery
May 2 02:51:41 node-1 pengine[1111]: warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
May 2 02:51:41 node-1 pengine[1111]: notice: Restart apache-httpd#011(Started node-1)
May 2 02:51:41 node-1 pengine[1111]: error: Calculated transition 48 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-53.bz2
May 2 02:51:41 node-1 crmd[1112]: notice: Initiating stop operation apache-httpd_stop_0 on node-2
May 2 02:51:41 node-1 crmd[1112]: notice: Initiating stop operation apache-httpd_stop_0 locally on node-1
May 2 02:51:41 node-1 systemd: Reloading.
May 2 02:51:41 node-1 systemd: Stopping The Apache HTTP Server...
May 2 02:51:42 node-1 systemd: Stopped The Apache HTTP Server.
May 2 02:51:43 node-1 crmd[1112]: notice: Result of stop operation for apache-httpd on node-1: 0 (ok)
May 2 02:51:43 node-1 crmd[1112]: notice: Initiating start operation apache-httpd_start_0 locally on node-1
May 2 02:51:43 node-1 systemd: Reloading.
May 2 02:51:44 node-1 systemd: Starting Cluster Controlled httpd...

The fix is therefore obvious: the systemd service registered as a cluster resource should be disabled as a systemd unit (on both nodes), so that it is started by Pacemaker only.

# systemctl disable httpd
Removed symlink /etc/systemd/system/multi-user.target.wants/httpd.service.

Here is the log on node-1 during node-2's boot-up, after httpd was disabled as a systemd service:

May 2 04:08:51 node-1 corosync[1057]: [TOTEM ] A new membership (192.168.1.201:720) was formed. Members joined: 2
May 2 04:08:51 node-1 corosync[1057]: [QUORUM] Members[2]: 1 2
May 2 04:08:51 node-1 corosync[1057]: [MAIN ] Completed service synchronization, ready to provide service.
May 2 04:08:51 node-1 pacemakerd[1064]: notice: Node node-2 state is now member
May 2 04:08:51 node-1 crmd[1074]: notice: Node node-2 state is now member
May 2 04:08:52 node-1 attrd[1072]: notice: Node node-2 state is now member
May 2 04:08:52 node-1 stonith-ng[1070]: notice: Node node-2 state is now member
May 2 04:08:53 node-1 cib[1069]: notice: Node node-2 state is now member
May 2 04:08:53 node-1 crmd[1074]: notice: State transition S_IDLE -> S_INTEGRATION
May 2 04:08:56 node-1 pengine[1073]: notice: On loss of CCM Quorum: Ignore
May 2 04:08:56 node-1 pengine[1073]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-232.bz2
May 2 04:08:56 node-1 crmd[1074]: notice: Initiating monitor operation ClusterIP_monitor_0 on node-2
May 2 04:08:56 node-1 crmd[1074]: notice: Initiating monitor operation apache-httpd_monitor_0 on node-2
May 2 04:08:56 node-1 crmd[1074]: notice: Transition 2 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-232.bz2): Complete
May 2 04:08:56 node-1 crmd[1074]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE

This time only probe (monitor) operations are initiated for node-2; no stop or start of apache-httpd occurs.

Have a nice day.
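P.S. To audit both nodes at once after the change, something like this works (a sketch; the node names and unit name are from our setup, and it assumes root ssh access between the nodes):

# for n in node-1 node-2; do ssh "$n" systemctl is-enabled httpd; done
disabled
disabled

Afterwards, crm_mon -1 should show apache-httpd started on exactly one node.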
> On 2017/05/01 at 19:03, Takehiro Matsushima <[email protected]> wrote:
>
> Hello Ishii-san,
>
> I could not reproduce the issue in my environment, CentOS 7 with Pacemaker 1.1.15.
> The following configuration works fine when rebooting a passive node
> (lighttpd is just an example of a systemd resource):
>
> ---- %< ----
> primitive ipaddr IPaddr2 \
>   params nic=enp0s10 ip=172.22.23.254 cidr_netmask=24 \
>   op start interval=0 timeout=20 on-fail=restart \
>   op stop interval=0 timeout=20 on-fail=ignore \
>   op monitor interval=10 timeout=20 on-fail=restart
> primitive lighttpd systemd:lighttpd \
>   op start interval=0 timeout=20 on-fail=restart \
>   op stop interval=0 timeout=20 on-fail=ignore \
>   op monitor interval=10 timeout=20 on-fail=restart
> colocation vip-colocation inf: ipaddr lighttpd
> order web-order inf: lighttpd ipaddr
> property cib-bootstrap-options: \
>   have-watchdog=false \
>   dc-version=1.1.15-1.el7-e174ec8 \
>   cluster-infrastructure=corosync \
>   no-quorum-policy=ignore \
>   startup-fencing=no \
>   stonith-enabled=no \
>   cluster-recheck-interval=1m
> rsc_defaults rsc-options: \
>   resource-stickiness=infinity \
>   migration-threshold=1
> ---- %< ----
>
> I made sure the resources did not restart and did not move by changing
> resource-stickiness to various values such as 10, 100 and 0.
> It also works when the colocation and order constraints are replaced by a
> "group" constraint.
>
> If you are watching the cluster's status with crm_mon, please run it with
> the "-t" option and watch "last-run" on the line of the "start" operation
> for each resource.
> If that time does not change when you reboot the passive node, the
> resource was not actually restarted.
>
> Thanks,
>
> Takehiro Matsushima
>
> 2017-04-30 19:32 GMT+09:00 石井 俊直 <[email protected]>:
>> Hi.
>>
>> We have a 2-node active/passive cluster on CentOS 7 with two cluster
>> resources: one is ocf:heartbeat:IPaddr2 and the other is a systemd-based
>> service. They have a colocation constraint.
>> The configuration seems mostly correct, and the resources normally run
>> without problems.
>>
>> When one of the nodes reboots, something happens that we do not want,
>> namely 5) below.
>> Suppose the nodes are node-1 and node-2, the cluster resources are running
>> on node-1, and we reboot node-2.
>> The following is the sequence of events:
>>
>> 1) node-2 shuts down
>> 2) node-1 detects node-2 is OFFLINE
>> 3) node-2 boots up
>> 4) node-1 detects node-2 is online; node-2 detects both are online
>> 5) the cluster resources running on node-1 stop
>> 6) the cluster resources start on node-1
>>
>> 6) follows from our setting resource-stickiness to something like 100.
>> Given that the service does not move to node-2 anyway, we do not want it
>> stopped even for a short while.
>>
>> If someone knows how to configure Pacemaker not to behave like 5), please
>> let us know.
>>
>> Thank you.
_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
