[ClusterLabs] Pacemaker Master restarts when Slave is added to the cluster
Hello,

In my test environment I have hit an issue with Pacemaker: when a new node is added to the cluster, the master node restarts. This leaves the system without a master, and therefore out of service, for a while whenever a node joins. Could you please advise how to debug such an issue?

I have a Pacemaker master/slave cluster as shown below. pgsql-ha is a resource; I copied the script from /usr/lib/ocf/resource.d/heartbeat/Dummy and added some simple code to support promote/demote.

When I run "pcs cluster stop" on db1, db1 goes to stopped state and db2 is still master. The problem: when I then run "pcs cluster start" on db1, the db2 status changes as follows: master -> slave -> stopped -> slave -> master. Why does db2 restart?

CentOS 7:
==
2 nodes and 7 resources configured

Online: [ db1 db2 ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ db1 db2 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ db1 db2 ]
 scsi-stonith-device    (stonith:fence_scsi):   Started db2
 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ db2 ]
     Slaves: [ db1 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@db1 heartbeat]#
==
/var/log/messages:

Dec 27 00:52:50 db2 cib[3290]: notice: Purged 1 peers with id=1 and/or uname=db1 from the membership cache
Dec 27 00:52:51 db2 kernel: dlm: closing connection to node 1
Dec 27 00:52:51 db2 corosync[3268]: [TOTEM ] A new membership (192.168.199.199:372) was formed. Members left: 1
Dec 27 00:52:51 db2 corosync[3268]: [QUORUM] Members[1]: 2
Dec 27 00:52:51 db2 corosync[3268]: [MAIN  ] Completed service synchronization, ready to provide service.
Dec 27 00:52:51 db2 crmd[3295]: notice: Node db1 state is now lost
Dec 27 00:52:51 db2 crmd[3295]: notice: do_shutdown of peer db1 is complete
Dec 27 00:52:51 db2 pacemakerd[3289]: notice: Node db1 state is now lost
Dec 27 00:52:57 db2 Doctor(pgsqld)[6671]: INFO: pgsqld monitor : 8
Dec 27 00:53:12 db2 Doctor(pgsqld)[6681]: INFO: pgsqld monitor : 8
Dec 27 00:53:27 db2 Doctor(pgsqld)[6746]: INFO: pgsqld monitor : 8
Dec 27 00:53:33 db2 corosync[3268]: [TOTEM ] A new membership (192.168.199.197:376) was formed. Members joined: 1
Dec 27 00:53:33 db2 corosync[3268]: [QUORUM] Members[2]: 1 2
Dec 27 00:53:33 db2 corosync[3268]: [MAIN  ] Completed service synchronization, ready to provide service.
Dec 27 00:53:33 db2 crmd[3295]: notice: Node db1 state is now member
Dec 27 00:53:33 db2 pacemakerd[3289]: notice: Node db1 state is now member
Dec 27 00:53:33 db2 crmd[3295]: notice: do_shutdown of peer db1 is complete
Dec 27 00:53:33 db2 crmd[3295]: notice: State transition S_IDLE -> S_INTEGRATION
Dec 27 00:53:33 db2 pengine[3294]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-116.bz2
Dec 27 00:53:33 db2 crmd[3295]: notice: Transition 17 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-116.bz2): Complete
Dec 27 00:53:33 db2 crmd[3295]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Dec 27 00:53:33 db2 stonith-ng[3291]: notice: Node db1 state is now member
Dec 27 00:53:33 db2 attrd[3293]: notice: Node db1 state is now member
Dec 27 00:53:33 db2 cib[3290]: notice: Node db1 state is now member
Dec 27 00:53:34 db2 crmd[3295]: notice: State transition S_IDLE -> S_INTEGRATION
Dec 27 00:53:37 db2 crmd[3295]: warning: No reason to expect node 2 to be down
Dec 27 00:53:38 db2 pengine[3294]: notice: Unfencing db1: node discovery
Dec 27 00:53:38 db2 pengine[3294]: notice: Start dlm:1#011(db1)
Dec 27 00:53:38 db2 pengine[3294]: notice: Start clvmd:1#011(db1)
Dec 27 00:53:38 db2 pengine[3294]: notice: Restart pgsqld:0#011(Master db2)

/var/log/cluster/corosync.log:

Dec 27 00:53:37 [3290] db2 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=db2/crmd/99, version=0.60.29)
Dec 27 00:53:37 [3290] db2 cib: info: cib_process_request: Forwarding cib_delete operation for section //node_state[@uname='db2']/lrm to all (origin=local/crmd/100)
Dec 27 00:53:37 [3295] db2 crmd: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE | input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state
Dec 27 00:53:37 [3295] db2 crmd: info: abort_transition_graph: Transition aborted: Peer Cancelled | source=do_te_invoke:161 complete=true
Dec 27 00:53:37 [3293] db2 attrd: info: attrd_client_refresh: Updating all attributes
Dec 27 00:53:37 [3293] db2 attrd: info: write_attribute: Sent update 12 with 2 changes for shutdown, id=, set=(null)
Dec 27 00:53:37 [3293] db2 attrd: info: write_attribute: Sent update 13 with 1 changes for last-failure-pgsqld, id=, set=(null)
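A generic way to debug "why did the cluster decide to restart this resource" is to replay the saved policy-engine input with crm_simulate (a standard Pacemaker tool). A sketch using the pe-input file named in the log above; the transition that scheduled the restart will be in one of the pe-input files saved around that time:

```shell
# -S runs the simulation against the saved transition input and shows
# the planned actions; -s additionally prints the allocation scores,
# which usually reveal which constraint forced the restart.
crm_simulate -s -S -x /var/lib/pacemaker/pengine/pe-input-116.bz2
```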
[ClusterLabs] Re: Pacemaker Master restarts when Slave is added to the cluster
Andrei,

I set interleave=true and it does not restart any more. Thank you very much. One word from you resolved a problem that had confused me for several days. 😊

-----Original Message-----
From: Andrei Borzenkov [mailto:arvidj...@gmail.com]
Sent: 2017-12-27 19:06
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Pacemaker Master restarts when Slave is added to the cluster

Usual suspect - interleave=false on clone resource.

On Wed, Dec 27, 2017 at 10:49 AM, 范国腾 wrote:
> Hello,
>
> In my test environment I have hit an issue with Pacemaker: when a
> new node is added to the cluster, the master node restarts. This
> leaves the system out of service for a while when adding a new
> node because there is no master node. Could you please help tell how
> to debug such an issue?
>
> I have a Pacemaker master/slave cluster as below. pgsql-ha is a
> resource. I copied the script from
> /usr/lib/ocf/resource.d/heartbeat/Dummy and added some simple code
> to support promote/demote.
>
> Now when I run "pcs cluster stop" on db1, db1 goes to stopped state
> and db2 is still master.
>
> The problem is: when I run "pcs cluster start" on db1, the db2 status
> changes as follows: master -> slave -> stopped -> slave -> master.
> Why does db2 restart?
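For readers hitting the same symptom: the change Andrei suggests can be applied with `pcs resource meta` (the resource names here are taken from the status output in the original post; check your own with `pcs resource`). With interleave=false, an ordered dependency between clones makes every instance wait on all instances of the other clone, so a joining node forces a restart of the whole ordered chain:

```shell
# interleave=true lets each clone instance depend only on the peer
# instance on its own node, so a new node joining no longer restarts
# the ordered chain (dlm -> clvmd -> pgsql-ha) on existing nodes.
pcs resource meta dlm-clone interleave=true
pcs resource meta clvmd-clone interleave=true
pcs resource meta pgsql-ha interleave=true

# Verify the meta attribute took effect:
pcs resource show pgsql-ha
```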
[ClusterLabs] pacemaker reports monitor timeout while CPU is high
Hello,

This issue only appears when we run a performance test and the CPU load is high. The cluster configuration and log are below. Pacemaker restarts the slave-side pgsql-ha resource about every two minutes.

Take the following scenario as an example. (When the pgsqlms RA is called, we log "execute the command start (command)"; when the command returns, we log "execute the command stop (command) (result)".)

1. We can see that Pacemaker calls "pgsqlms monitor" about every 15 seconds, and it returns $OCF_SUCCESS.
2. It calls the monitor command again at 13:56:16, and then reports a timeout error at 13:56:18. That is only 2 seconds, yet it reports "timeout=1ms".
3. In other logs, sometimes after 15 minutes there is no "execute the command start monitor" printed at all, and a timeout error is reported directly.

Could you please tell us how to debug or resolve this issue?

The log:

Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.77
Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 1ms
Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=1ms
Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)
Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2

The cluster configuration:

2 nodes and 13 resources configured

Online: [ db1 db2 ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ db1 db2 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ db1 db2 ]
 ipmi_node1     (stonith:fence_ipmilan):        Started db2
 ipmi_node2     (stonith:fence_ipmilan):        Started db1
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ db1 db2 ]
 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ db1 ]
     Slaves: [ db2 ]
 Resource Group: mastergroup
     db1-vip    (ocf::heartbeat:IPaddr2):       Started
     rep-vip    (ocf::heartbeat:IPaddr2):       Started
 Resource Group: slavegroup
     db2-vip    (ocf::heartbeat:IPaddr2):       Started

pcs resource show pgsql-ha:

 Master: pgsql-ha
  Meta Attrs: interleave=true notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
   Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
               promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
               demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Re: pacemaker reports monitor timeout while CPU is high
Thank you, Ken.

We have set the timeout to 10 seconds, but it reports a timeout after only 2 seconds, so setting a higher timeout does not seem to help. The application managed by Pacemaker starts more than 500 processes when running the performance test. Could that affect the result? Which log would help us analyze this?

> monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)

-----Original Message-----
From: Ken Gaillot [mailto:kgail...@redhat.com]
Sent: 2018-01-11 00:54
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] pacemaker reports monitor timeout while CPU is high

On Wed, 2018-01-10 at 09:40 +0000, 范国腾 wrote:
> Hello,
>
> This issue only appears when we run performance test and the CPU is
> high. The cluster and log is as below. The Pacemaker will restart the
> Slave Side pgsql-ha resource about every two minutes.
> [...]
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command
> start monitor
> [...]
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command
> start monitor
> [...]
> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command
> start monitor
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000
> process (PID 5606) timed out

There's something more going on than in this log snippet. Notice the process that timed out (5606) is not one of the processes that logged above (5240 and 5477).

Generally, once load gets that high, it's very difficult to maintain responsiveness, and the expectation is that another node will fence it. But it can often be worked around with high timeouts, and/or you can use rules to set higher timeouts or maintenance mode during times when high load is expected.
[ClusterLabs] Re: Antw: pacemaker reports monitor timeout while CPU is high
Ulrich,

Thank you very much for the help. When we run the performance test, our application (pgsql-ha) starts more than 500 processes to handle the client requests. Could that cause this issue? Is there any workaround, or a way to keep Pacemaker from restarting the resource in this situation? At the moment the system cannot work when clients send a high call load, and we cannot control the clients' behavior.

Thanks

-----Original Message-----
From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
Sent: 2018-01-10 18:20
To: users@clusterlabs.org
Subject: [ClusterLabs] Antw: pacemaker reports monitor timeout while CPU is high

Hi!

I can only speak for myself: in former times with HP-UX we had severe performance problems when the load was in the range of 8 to 14 (I/O waits not included, average over all logical CPUs), while on Linux we get problems with a load above 40 or so (I/O included, sum over all logical CPUs, of which there are 24). Also, I/O waits cause cluster timeouts before CPU load actually matters (for us). So with a load above 400 (not knowing your number of CPUs) it should not be that unusual. What is the number of threads on your system at that time?

It might be worth the effort to bind the cluster processes to specific CPUs and keep other tasks away from those, but I don't have experience with that.

I guess the "High CPU load detected" message triggers some internal suspend in the cluster engine (assuming the cluster engine caused the high load). Of course for "external" load that measure won't help...

Regards,
Ulrich

>>> ??? wrote on 10.01.2018 at 10:40 in message <4dc98a5d9be144a78fb9a18721743...@ex01.highgo.com>:
> Hello,
>
> This issue only appears when we run performance test and the CPU is high.
> The cluster and log is as below. The Pacemaker will restart the Slave
> Side pgsql-ha resource about every two minutes.
> [...]
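Ulrich's suggestion of binding the cluster processes to specific CPUs could be prototyped with a systemd drop-in. This is my own sketch, not something from the thread: it assumes Pacemaker runs under systemd as `pacemaker.service` (as on CentOS 7), and the CPU numbers are arbitrary:

```shell
# Pin the pacemaker daemons to CPUs 0-1 using the standard systemd
# CPUAffinity unit directive, via a drop-in so the packaged unit file
# stays untouched.
mkdir -p /etc/systemd/system/pacemaker.service.d
cat > /etc/systemd/system/pacemaker.service.d/cpuaffinity.conf <<'EOF'
[Service]
CPUAffinity=0 1
EOF
systemctl daemon-reload
systemctl restart pacemaker
```

The workload would then need to be kept off those CPUs (e.g. with cgroup cpusets) for the pinning to buy anything.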
[ClusterLabs] Re: Re: pacemaker reports monitor timeout while CPU is high
Thank you very much, Ken. I will set a higher timeout and try again.

-----Original Message-----
From: Ken Gaillot [mailto:kgail...@redhat.com]
Sent: 2018-01-11 23:48
To: Cluster Labs - All topics related to open-source clustering welcomed
Cc: 王亮
Subject: Re: [ClusterLabs] Re: pacemaker reports monitor timeout while CPU is high

On Thu, 2018-01-11 at 03:50 +0000, 范国腾 wrote:
> Thank you, Ken.
>
> We have set the timeout to be 10 seconds, but it reports timeout only
> after 2 seconds. So it seems not work if I set higher timeouts.
> Our application which is managed by pacemaker will start more than
> 500 process to run when running performance test. Does it affect the
> result? Which log could help us to analyze?
>
> > monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-
> > interval-16s)

It's not timing out after 2 seconds. The message:

  sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor

indicates that the monitor's process ID is 5240, but the message:

  sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out

indicates that the monitor that timed out had process ID 5606. That means that there were two separate monitors in progress. I'm not sure why; I wouldn't expect the second one to be started until after the first one had timed out. But it's possible with the high load that the log messages were simply written to the log out of order, since they were written by different processes.

I would just raise the timeout higher than 10s during the test.
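Ken's two workarounds (higher timeouts, or maintenance mode during the test window) could look roughly like this with pcs. The operation names and intervals are taken from the configuration posted earlier; the 60s value is purely illustrative, and note that depending on the pcs version, `resource update ... op` may replace the whole operation list, so it can be safer to restate every operation:

```shell
# Raise the monitor timeouts well above the worst-case monitor run time
# under load (the role-specific intervals must remain distinct):
pcs resource update pgsqld \
    op monitor interval=15s role=Master timeout=60s \
    op monitor interval=16s role=Slave timeout=60s

# Or suspend resource management entirely while the performance test runs:
pcs property set maintenance-mode=true
# ... run the test ...
pcs property set maintenance-mode=false
```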
[ClusterLabs] “pcs --debug” does not work
Hello,

The help for "pcs --debug" says "Print all network traffic and external commands run," but when I run "pcs --debug" it still prints the help information. How do I trigger it to print the network traffic?

Thanks
Steven

[root@db3 ~]# pcs --debug

Usage: pcs [-f file] [-h] [commands]...
Control and configure pacemaker and corosync.

Options:
    -h, --help         Display usage and exit.
    -f file            Perform actions on file instead of active CIB.
    --debug            Print all network traffic and external commands run.
    --version          Print pcs version information.
    --request-timeout  Timeout for each outgoing request to another node in
                       seconds. Default is 60s.

Commands:
    cluster     Configure cluster options and nodes.
    resource    Manage cluster resources.
    stonith     Manage fence devices.
    constraint  Manage resource constraints.
    property    Manage pacemaker properties.
    acl         Manage pacemaker access control lists.
    qdevice     Manage quorum device provider on the local host.
    quorum      Manage cluster quorum settings.
    booth       Manage booth (cluster ticket manager).
    status      View cluster status.
    config      View and manage cluster configuration.
    pcsd        Manage pcs daemon.
    node        Manage cluster nodes.
    alert       Manage pacemaker alerts.
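An observation about pcs behavior that may answer this (from experience, not from the help text above): `--debug` is a global option, so it only does something when combined with an actual command; invoked on its own, pcs has nothing to run and just prints the usage. For example:

```shell
# Combine --debug with a real command to see the external commands and
# network requests pcs issues while executing that command:
pcs --debug status
pcs --debug config
```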
[ClusterLabs] How to create the stonith resource in virtualbox
Hello,

I set up a Pacemaker cluster using VirtualBox. There are three nodes, the OS is CentOS 7, and /dev/sdb is the shared storage (all three nodes use the same disk file).

(1) At first I created the stonith device with this command:

pcs stonith create scsi-stonith-device fence_scsi devices=/dev/mapper/fence pcmk_monitor_action=metadata pcmk_reboot_action=off pcmk_host_list="db7-1 db7-2 db7-3" meta provides=unfencing

I know the VMs do not have /dev/mapper/fence. But sometimes the stonith resource is able to start and sometimes it is not; I don't know why. It is not stable.

(2) Then I used the following command to set up stonith using the shared disk /dev/sdb:

pcs stonith create scsi-shooter fence_scsi devices=/dev/disk/by-id/ata-VBOX_HARDDISK_VBc833e6c6-af12c936 meta provides=unfencing

But the stonith resource is always stopped, and the log shows:

Feb  7 15:45:53 db7-1 stonith-ng[8166]: warning: fence_scsi[8197] stderr: [ Failed: nodename or key is required ]

Could anyone tell me the correct command to set up stonith in a VM on CentOS? Is there any document introducing this that I could study?

Thanks

Here is the cluster status:

[root@db7-1 ~]# pcs status
Cluster name: cluster_pgsql
Stack: corosync
Current DC: db7-2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Wed Feb  7 16:27:13 2018
Last change: Wed Feb  7 15:42:38 2018 by root via cibadmin on db7-1

3 nodes configured
1 resource configured

Online: [ db7-1 db7-2 db7-3 ]

Full list of resources:

 scsi-shooter   (stonith:fence_scsi):   Stopped

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[ClusterLabs] Re: How to create the stonith resource in virtualbox
Thank Klaus, The information is very helpful. I try to study the fence_vbox and the fence_sdb. In our test lab, we use ipmi as the stonith. But I want to setup a simulator environment in my laptop. So I just need the stonith resource in start state so that I could create dlm and clvm resource.And I don't need it relally work. Do anybody have other suggestion? -邮件原件- 发件人: Users [mailto:users-boun...@clusterlabs.org] 代表 Klaus Wenninger 发送时间: 2018年2月9日 1:11 收件人: users@clusterlabs.org 主题: Re: [ClusterLabs] How to create the stonith resource in virtualbox On 02/08/2018 02:05 PM, Andrei Borzenkov wrote: > On Thu, Feb 8, 2018 at 5:51 AM, 范国腾 wrote: >> Hello, >> >> I setup the pacemaker cluster using virtualbox. There are three nodes. The >> OS is centos7, the /dev/sdb is the shared storage(three nodes use the same >> disk file). >> >> (1) At first, I create the stonith using this command: >> pcs stonith create scsi-stonith-device fence_scsi >> devices=/dev/mapper/fence pcmk_monitor_action=metadata >> pcmk_reboot_action=off pcmk_host_list="db7-1 db7-2 db7-3" meta >> provides=unfencing; >> >> I know the VM not have the /dev/mapper/fence. But sometimes the stonith >> resource able to start, sometimes not. Don't know why. It is not stable. >> > It probably tries to check resource and fails. State of stonith > resource is irrelevant for actual fencing operation (this resource is > only used for periodical check, not for fencing itself). > >> (2) Then I use the following command to setup stonith using the shared disk >> /dev/sdb: >> pcs stonith create scsi-shooter fence_scsi >> devices=/dev/disk/by-id/ata-VBOX_HARDDISK_VBc833e6c6-af12c936 meta >> provides=unfencing >> >> But the stonith always be stopped and the log show: >> Feb 7 15:45:53 db7-1 stonith-ng[8166]: warning: fence_scsi[8197] >> stderr: [ Failed: nodename or key is required ] >> > Well, you need to provide what is missing - your command did not > specify any host. 
> >> Could anyone help tell what is the correct command to setup the stonith in >> VM and centos? Is there any document to introduce this so that I could study >> it? I personally don't have any experience setting up a pacemaker-cluster in vbox. Thus I'm limited to giving rather general advice. What you might have to assure together with fence_scsi is if the scsi-emulation vbox offers lives up to the requirements of fence_scsi. I've read about troubles in a posting back from 2015. The guy then went for using scsi via iSCSI. Otherwise you could look for alternatives to fence_scsi. One might be fence_vbox. It doesn't come with centos so far iirc but the upstream repo on github has it. Fencing via the hypervisor is in general not a bad idea when it comes to clusters running in VMs (If you can live with the boundary conditions like giving certain credentials to the VMs that allow communication with the hypervisor.). There was some discussion about fence_vbox on the clusterlabs-list a couple of months ago. iirc there had been issues with using windows as a host for vbox - but I guess they were fixed in the course of this discussion. Another way of doing fencing via a shared disk is fence_sbd (available in centos) - although quite different from how fence_scsi is using the disk. One difference that might be helpful here is that it has less requirements on which disk-infrastructure is emulated. On the other hand it is strongly advised for sbd in general to use a good watchdog device (one that brings down your machine - virtual or physical - in a very reliable manner). And afaik the only watchdog-device available inside a vbox VM is softdog that doesn't meet this requirement too well as it relies on the kernel running in the VM to be at least partially functional. 
Sorry for not being able to help in a more specific way, but I would myself be interested in which ways of fencing people are using when it comes to clusters based on vbox VMs ;-)

Regards,
Klaus

>>
>> Thanks
>>
>> Here is the cluster status:
>> [root@db7-1 ~]# pcs status
>> Cluster name: cluster_pgsql
>> Stack: corosync
>> Current DC: db7-2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
>> Last updated: Wed Feb 7 16:27:13 2018
>> Last change: Wed Feb 7 15:42:38 2018 by root via cibadmin on db7-1
>>
>> 3 nodes configured
>> 1 resource configured
>>
>> Online: [ db7-1 db7-2 db7-3 ]
>>
>> Full list of resources:
>>
>> scsi-shooter (stonith:fence_scsi): Stopped
>>
>> Daemon Status:
>> corosync: active/disabled
>> pacemaker: active/disabled
>> pcsd: active/enabl
[ClusterLabs] Re: Re: How to create the stonith resource in virtualbox
Marek, thank you very much for your help. I added pcmk_monitor_action=metadata and the stonith resource works now. Thanks.

From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Marek Grac
Sent: 2018-02-09 16:38
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Re: How to create the stonith resource in virtualbox

Hi,

for fence_vbox take a look at my older blog post: https://ox.sk/howto-fence-vbox-cdd3da374ecd

If all you need is to have fencing in a state where dlm works, and you promise that you will never have real data on it, there is an easy hack. It really does not matter which fence agent you use; all we care about is whether the 'monitor' action works, so add the option pcmk_monitor_action=metadata. It means that instead of the monitor action, the agent will run the 'metadata' action, which just prints XML metadata and succeeds.

m,

On Fri, Feb 9, 2018 at 6:33 AM, 范国腾 wrote:
> [...]
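Marek's hack, spelled out as a single command (this is the same command used elsewhere in this thread; the device path and host names are examples, and the device does not need to exist):

```shell
# A stonith resource that starts even though its backing device is absent:
# pcmk_monitor_action=metadata makes the recurring monitor run the
# "metadata" action, which only prints XML and always succeeds.
pcs stonith create scsi-stonith-device fence_scsi \
    devices=/dev/mapper/fence \
    pcmk_monitor_action=metadata \
    pcmk_reboot_action=off \
    pcmk_host_list="node1 node2" \
    meta provides=unfencing
```

Note this only makes the monitor succeed; an actual fencing operation would still fail, so it is suitable only for lab setups with no real data.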
[ClusterLabs] How to configure to make each slave resource has one VIP
Hi,

Our system manages a database (one master and multiple slaves). Initially we used one VIP for all the slave resources. Now I want to change the configuration so that each slave resource has a separate VIP. For example, I have 3 slave nodes and my VIP group has 2 VIPs; the 2 VIPs are bound to node1 and node2 now, and when node2 fails its VIP should move to node3.

I use the following commands to add the VIPs:

pcs resource group add pgsql-slave-group pgsql-slave-ip1 pgsql-slave-ip2
pcs constraint colocation add pgsql-slave-group with slave pgsql-ha INFINITY

But now the two VIPs are on the same node:

Master/Slave Set: pgsql-ha [pgsqld]
 Masters: [ node1 ]
 Slaves: [ node2 node3 ]
pgsql-master-ip (ocf::heartbeat:IPaddr2): Started node1
Resource Group: pgsql-slave-group
 pgsql-slave-ip1 (ocf::heartbeat:IPaddr2): Started node2
 pgsql-slave-ip2 (ocf::heartbeat:IPaddr2): Started node2

Could anyone tell me how to configure this so that each slave node has its own VIP?

Thanks

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Re: How to configure to make each slave resource has one VIP
Tomas, thank you very much. I made the change according to your suggestion and it works.

One question: if there are many nodes (e.g. 10 slave nodes in total), I need to run "pcs constraint colocation add pgsql-slave-ipx with pgsql-slave-ipy -INFINITY" many times. Is there a simpler command to do this?

Master/Slave Set: pgsql-ha [pgsqld]
 Masters: [ node1 ]
 Slaves: [ node2 node3 ]
pgsql-master-ip (ocf::heartbeat:IPaddr2): Started node1
pgsql-slave-ip1 (ocf::heartbeat:IPaddr2): Started node3
pgsql-slave-ip2 (ocf::heartbeat:IPaddr2): Started node2

Thanks
Steven

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Tomas Jelinek
Sent: 2018-02-23 17:02
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] How to configure to make each slave resource has one VIP

Dne 23.2.2018 v 08:17 范国腾 napsal(a):
> [...]
> Could anyone tell how to configure to make each slave node has a VIP?

Resources in a group always run on the same node. You want the ip resources to run on different nodes, so you cannot put them into a group.

This will take the resources out of the group:
pcs resource ungroup pgsql-slave-group

Then you can set colocation constraints for them:
pcs constraint colocation add pgsql-slave-ip1 with slave pgsql-ha
pcs constraint colocation add pgsql-slave-ip2 with slave pgsql-ha

You may also need to tell pacemaker not to put both ips on the same node:
pcs constraint colocation add pgsql-slave-ip1 with pgsql-slave-ip2 -INFINITY

Regards,
Tomas
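When there are many slave VIPs, the repetitive per-IP commands can be generated with a small shell loop instead of being typed by hand. A sketch that only echoes the commands (the resource names pgsql-slave-ip1..N follow this thread; N=10 is a hypothetical count — review the output, then pipe it to sh to apply):

```shell
#!/bin/sh
# Print one "colocate with slave pgsql-ha" constraint command per VIP.
# Nothing is applied here; the commands are only echoed for review.
N=10   # number of slave VIP resources (example value)
i=1
while [ "$i" -le "$N" ]; do
    echo pcs constraint colocation add "pgsql-slave-ip$i" with slave pgsql-ha
    i=$((i + 1))
done
```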
[ClusterLabs] Re: Re: How to configure to make each slave resource has one VIP
Thank you very much, Tomas. This resolves my problem.

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Tomas Jelinek
Sent: 2018-02-23 17:37
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Re: How to configure to make each slave resource has one VIP

Dne 23.2.2018 v 10:16 范国腾 napsal(a):
> There is a question: If there are too many nodes (e.g. 10 slave nodes in total), I need to run "pcs constraint colocation add pgsql-slave-ipx with pgsql-slave-ipy -INFINITY" many times. Is there a simple command to do this?

I think a colocation set does the trick:
pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-INFINITY

You may specify as many resources as you need in this command.

Tomas
> [...]
[ClusterLabs] Re: Re: Re: How to configure to make each slave resource has one VIP
Thank you, Ken. So I could use the following command:

pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ken Gaillot
Sent: 2018-02-23 23:14
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Re: Re: How to configure to make each slave resource has one VIP

On Fri, 2018-02-23 at 12:45 +0000, 范国腾 wrote:
> Thank you very much, Tomas.
> This resolves my problem.

One thing to keep in mind: a score of -INFINITY means the IPs will *never* run on the same node, even if one or more nodes go down. If that's what you want, of course, that's good. If you want the IPs to stay on different nodes normally, but be able to run on the same node in case of node outage, use a finite negative score.
> [...]
[ClusterLabs] Re: Re: How to configure to make each slave resource has one VIP
Hello,

If all of the slave nodes crash, none of the slave VIPs can work. Is there any way to make all of the slave VIPs bind to the master node when there are no slave nodes left in the system? That way the user client will not notice that the system has a problem.

Thanks

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Tomas Jelinek
Sent: 2018-02-23 17:37
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Re: How to configure to make each slave resource has one VIP
> [...]
[ClusterLabs] Re: Re: How to create the stonith resource in virtualbox
Hi Marek and all,

I use the following command to create a stonith resource in virtualbox (centos7), which has no /dev/mapper/fence:

pcs stonith create scsi-stonith-device fence_scsi devices=/dev/mapper/fence pcmk_monitor_action=metadata pcmk_reboot_action=off pcmk_host_list="node1 node2" meta provides=unfencing

The stonith resource status could be "started" in 2017, and I think that is because I use the metadata action. But when I installed a new VM this year and created the stonith again with the same command, it always stays in stopped status. I have tried many times; here is the current situation:

(1) I created a cluster in a VM in 2017 using the command below on CentOS 7. The stonith status has remained started until now.
(2) I created a cluster in a VM today using the same command on CentOS 7. The stonith status is always stopped.
(3) I created a cluster in a VM today using the same command on RHEL 7. The stonith status could be started.

I compared the /usr/sbin/fence_scsi file on the different nodes and it has no logic change. Why does the stonith resource start in some environments but not in others? How should I debug it?
Here is my command:

systemctl stop firewalld; chkconfig firewalld off
yum install -y corosync pacemaker pcs gfs2-utils lvm2-cluster *scsi* python-clufter
pcs cluster auth node1 node2 node3 -u hacluster
pcs cluster setup --name cluster_pgsql node1 node2 node3
pcs cluster start --all
pcs property set no-quorum-policy=freeze
pcs property set stonith-enabled=true
pcs stonith create scsi-stonith-device fence_scsi devices=/dev/mapper/fence pcmk_monitor_action=metadata pcmk_reboot_action=off pcmk_host_list="node1 node2" meta provides=unfencing

Here is the log:

Feb 26 03:55:08 db1 crmd[2215]: notice: Requesting fencing (on) of node db1
Feb 26 03:55:08 db1 stonith-ng[2211]: notice: Client crmd.2215.c5d11cbe wants to fence (on) 'db2' with device '(any)'
Feb 26 03:55:08 db1 stonith-ng[2211]: notice: Requesting peer fencing (on) of db2
Feb 26 03:55:08 db1 stonith-ng[2211]: notice: Client crmd.2215.c5d11cbe wants to fence (on) 'db1' with device '(any)'
Feb 26 03:55:08 db1 stonith-ng[2211]: notice: Requesting peer fencing (on) of db1
Feb 26 03:55:08 db1 stonith-ng[2211]: notice: scsi-stonith-device can fence (on) db1: static-list
Feb 26 03:55:08 db1 stonith-ng[2211]: notice: scsi-stonith-device can fence (on) db1: static-list
Feb 26 03:55:08 db1 fence_scsi: Failed: device "/dev/mapper/fence" does not exist
Feb 26 03:55:08 db1 fence_scsi: Please use '-h' for usage
Feb 26 03:55:08 db1 stonith-ng[2211]: warning: fence_scsi[9072] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=db1' ]
Feb 26 03:55:08 db1 stonith-ng[2211]: warning: fence_scsi[9072] stderr: [ ERROR:root:Failed: device "/dev/mapper/fence" does not exist ]
Feb 26 03:55:08 db1 stonith-ng[2211]: warning: fence_scsi[9072] stderr: [ Failed: device "/dev/mapper/fence" does not exist ]
Feb 26 03:55:08 db1 stonith-ng[2211]: warning: fence_scsi[9072] stderr: [ ERROR:root:Please use '-h' for usage ]
Feb 26 03:55:08 db1 stonith-ng[2211]: warning: fence_scsi[9072] stderr: [ Please use '-h' for usage ]
Feb 26 03:59:11 db1 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="627" x-info="http://www.rsyslog.com"] start
Feb 26 03:59:12 db1 rsyslogd-2027: imjournal: fscanf on state file `/var/lib/rsyslog/imjournal.state' failed [try http://www.rsyslog.com/e/2027 ]
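One way to narrow a problem like this down is to run the fence agent by hand on the affected VM, outside pacemaker. A sketch assuming the standard fence-agents command-line options (`-o` selects the action, `--devices` names the SCSI device); the device path is the one from this thread and is not expected to exist on the VM:

```shell
# "metadata" is what pcmk_monitor_action=metadata makes the monitor run;
# it should print XML and exit 0 on any host where the agent is installed.
fence_scsi -o metadata > /dev/null; echo "metadata rc=$?"

# A real "monitor" needs the device; on a VM without /dev/mapper/fence
# this is expected to fail, matching the errors in the log above.
fence_scsi -o monitor --devices=/dev/mapper/fence; echo "monitor rc=$?"
```

If the metadata action fails on the new VM but succeeds on the old one, comparing the installed fence-agents package versions would be the next step.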
[ClusterLabs] Re: Re: Re: How to configure to make each slave resource has one VIP
Thank you, Ken. Got it :) -邮件原件- 发件人: Users [mailto:users-boun...@clusterlabs.org] 代表 Ken Gaillot 发送时间: 2018年3月6日 7:18 收件人: Cluster Labs - All topics related to open-source clustering welcomed 主题: Re: [ClusterLabs] 答复: 答复: How to configure to make each slave resource has one VIP On Sun, 2018-02-25 at 02:24 +, 范国腾 wrote: > Hello, > > If all of the slave nodes crash, all of the slave vips could not work. > > Do we have any way to make all of the slave VIPs binds to the master > node if there is no slave nodes in the system? > > the user client will not know the system has problem in this way. > > Thanks Hi, If you colocate all the slave IPs "with pgsql-ha" instead of "with slave pgsql-ha", then they can run on either master or slave nodes. Including the master IP in the anti-colocation set will keep them apart normally. > > -邮件原件- > 发件人: Users [mailto:users-boun...@clusterlabs.org] 代表 Tomas Jelinek > 发送时间: 2018年2月23日 17:37 > 收件人: users@clusterlabs.org > 主题: Re: [ClusterLabs] 答复: How to configure to make each slave resource > has one VIP > > Dne 23.2.2018 v 10:16 范国腾 napsal(a): > > Tomas, > > > > Thank you very much. I do the change according to your suggestion > > and it works. > > > > There is a question: If there are too much nodes (e.g. total 10 > > slave nodes ), I need run "pcs constraint colocation add pgsql- > > slave-ipx with pgsql-slave-ipy -INFINITY" many times. Is there a > > simple command to do this? > > I think colocation set does the trick: > pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 > pgsql-slave-ip3 setoptions score=-INFINITY You may specify as many > resources as you need in this command. 
> > Tomas > > > > > Master/Slave Set: pgsql-ha [pgsqld] > > Masters: [ node1 ] > > Slaves: [ node2 node3 ] > > pgsql-master-ip(ocf::heartbeat:IPaddr2): Started > > node1 > > pgsql-slave-ip1(ocf::heartbeat:IPaddr2): Started > > node3 > > pgsql-slave-ip2(ocf::heartbeat:IPaddr2): Started > > node2 > > > > Thanks > > Steven > > > > -邮件原件----- > > 发件人: Users [mailto:users-boun...@clusterlabs.org] 代表 Tomas Jelinek > > 发送时间: 2018年2月23日 17:02 > > 收件人: users@clusterlabs.org > > 主题: Re: [ClusterLabs] How to configure to make each slave resource > > has one VIP > > > > Dne 23.2.2018 v 08:17 范国腾 napsal(a): > > > Hi, > > > > > > Our system manages the database (one master and multiple slave). > > > We > > > use one VIP for multiple Slave resources firstly. > > > > > > Now I want to change the configuration that each slave resource > > > has a separate VIP. For example, I have 3 slave nodes and my VIP > > > group has > > > 2 vip; The 2 vips binds to node1 and node2 now; When the node2 > > > fails, the vip could move to the node3. > > > > > > > > > I use the following command to add the VIP > > > > > > / pcs resource group add pgsql-slave-group pgsql-slave-ip1 > > > pgsql-slave-ip2/ > > > > > > / pcs constraint colocation add pgsql-slave-group with slave > > > pgsql-ha INFINITY/ > > > > > > But now the two VIPs are the same nodes: > > > > > > /Master/Slave Set: pgsql-ha [pgsqld]/ > > > > > > / Masters: [ node1 ]/ > > > > > > / Slaves: [ node2 node3 ]/ > > > > > > /pgsql-master-ip (ocf::heartbeat:IPaddr2): Started > > > node1/ > > > > > > /Resource Group: pgsql-slave-group/ > > > > > > */ pgsql-slave-ip1 (ocf::heartbeat:IPaddr2): Started > > > node2/* > > > > > > */ pgsql-slave-ip2 (ocf::heartbeat:IPaddr2): Started > > > node2/* > > > > > > Could anyone tell how to configure to make each slave node has a > > > VIP? > > > > Resources in a group always run on the same node. You want the ip > > resources to run on different nodes so you cannot put them into a > > group. 
> >
> > This will take the resources out of the group:
> > pcs resource ungroup pgsql-slave-group
> >
> > Then you can set colocation constraints for them:
> > pcs constraint colocation add pgsql-slave-ip1 with slave pgsql-ha
> > pcs constraint colocation add pgsql-slave-ip2 with slave pgsql-ha
> >
> > You may also need to tell pacemaker not to put both IPs on the same
> > node:
> > p
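Assembled from Tomas's steps above, the full sequence discussed in this thread might look like the sketch below. Resource names (pgsql-slave-ip1/2, pgsql-slave-group, pgsql-ha) are the ones from the thread; the final set constraint is the shorthand Tomas suggests later for the pairwise anti-colocation.

```shell
# Resources in a group always run together, so dissolve the group first
pcs resource ungroup pgsql-slave-group

# Tie each VIP to a node running a pgsql-ha slave
pcs constraint colocation add pgsql-slave-ip1 with slave pgsql-ha
pcs constraint colocation add pgsql-slave-ip2 with slave pgsql-ha

# Keep the slave VIPs off the same node (one colocation set instead of
# one constraint per pair)
pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 \
    setoptions score=-INFINITY
```

These are cluster configuration commands; they only make sense against a running pacemaker cluster with those resources defined.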
[ClusterLabs] Re: Re: Re: Re: How to configure to make each slave resource has one VIP
Thank you, Rorthais, I read the link and it is very helpful. Here are some issues I have met while installing the cluster:

1. "pcs cluster stop" sometimes could not stop the cluster.

2. When I upgrade PAF, I can just replace the pgsqlms file. When I upgrade Postgres, I just replace /usr/local/pgsql/.

3. If the cluster does not stop normally, the pg_controldata status is not "SHUT DOWN", and then PAF will not start PostgreSQL any more, so I normally change pgsqlms as below after installing PAF:

    elsif ( $pgisready_rc == 2 ) {
        # The instance is not listening.
        # We check the process status using pg_ctl status and check
        # if it was properly shut down using pg_controldata.
        ocf_log( 'debug', 'pgsql_monitor: instance "%s" is not listening',
            $OCF_RESOURCE_INSTANCE );
        return _confirm_stopped();   ### remove this line
        return $OCF_NOT_RUNNING;     ### add this line
    }

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: March 6, 2018 17:08
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Re: Re: Re: How to configure to make each slave resource has one VIP

Hi guys,

A few months ago, I started a new chapter about this exact subject for "PAF - Cluster administration under CentOS" (see: https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html).

Please find attached my draft. All feedback, fixes, comments and intensive tests are welcome!

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Re: Re: Re: Re: How to configure to make each slave resource has one VIP
Sorry, Rorthais, yesterday I thought the link and the attachment were the same document. I have just read the attachment, and it is exactly what I asked about originally. I have two questions on the following two commands:

# pcs constraint colocation add pgsql-ip-stby1 with slave pgsql-ha 10
Q: Does the score 10 mean "move to the master if there is no standby alive"?

# pcs constraint order start pgsql-ha then start pgsql-ip-stby1 kind=Mandatory
Q: I did not set the order and I have not found any issue so far. Why should I add this constraint? What will happen if I miss it?

Here is what I did now:

pcs resource create pgsql-slave-ip1 ocf:heartbeat:IPaddr2 ip=192.168.199.186 nic=enp3s0f0 cidr_netmask=24 op monitor interval=10s
pcs resource create pgsql-slave-ip2 ocf:heartbeat:IPaddr2 ip=192.168.199.187 nic=enp3s0f0 cidr_netmask=24 op monitor interval=10s
pcs constraint colocation add pgsql-slave-ip1 with pgsql-ha
pcs constraint colocation add pgsql-slave-ip2 with pgsql-ha
pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-master-ip setoptions score=-1000

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: March 7, 2018 16:29
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Re: Re: Re: How to configure to make each slave resource has one VIP

On Wed, 7 Mar 2018 01:27:16 +, 范国腾 wrote:

> Thank you, Rorthais,
>
> I read the link and it is very helpful.

Did you read the draft I attached to the email? It was the main purpose of my answer: helping you with IPs on slaves. It seems to me your mail is reporting different issues than the original subject.

> There are some issues that I have met when I installed the cluster.

I suppose this is another subject and we should open a new thread with the appropriate subject.

> 1. "pcs cluster stop" sometimes could not stop the cluster.

You would have to give some more details about the context where "pcs cluster stop" timed out.

> 2.
> when I upgrade PAF, I can just replace the pgsqlms file. When
> I upgrade Postgres, I just replace /usr/local/pgsql/.

I believe both actions are documented with best practices in the links I gave you.

> 3. If the cluster does not stop normally, the pg_controldata status is
> not "SHUT DOWN", and then PAF will not start PostgreSQL any more,
> so I normally change pgsqlms as below after installing PAF.
> [...]

This should be discussed to understand the exact context before considering your patch. At first glance, your patch seems quite dangerous, as it bypasses the sanity checks.

Please, could you start a new thread with a proper subject and add extensive information about this issue? You could open a new issue on the PAF repository as well: https://github.com/ClusterLabs/PAF/issues

Regards,
[ClusterLabs] Re: Re: Re: Re: How to configure to make each slave resource has one VIP
Thanks Rorthais, got it. The following commands should make sure that the IP moves to the master if there is no standby alive:

pcs constraint colocation add pgsql-ip-stby1 with slave pgsql-ha 100
pcs constraint colocation add pgsql-ip-stby1 with pgsql-ha 50

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: March 8, 2018 17:41
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Re: Re: Re: How to configure to make each slave resource has one VIP

On Thu, 8 Mar 2018 01:45:43 +, 范国腾 wrote:

> Sorry, Rorthais, I thought the link and the attachment were the same
> document yesterday.

No problem. For your information, I merged the draft into the official documentation yesterday.

> I just read the attachment and that is exactly what I asked about originally.

Excellent! Glad it could help.

> I have two questions on the following two commands:
> # pcs constraint colocation add pgsql-ip-stby1 with slave pgsql-ha 10
> Q: Does the score 10 mean "move to the master if there is no
> standby alive"?

Kind of. It actually says nothing about moving to the master. It just says that the slave IP prefers to be located with a slave. If the slave nodes are down or in standby, the IP "can" move to the master, as nothing forbids it.

In fact, while writing this sentence, I realized there is nothing to push the slave IPs onto the master if the other nodes are up but the pgsql-ha slaves are stopped or banned. The configuration I provided was incomplete.

1. I added the missing constraints in the doc online.
2. Notice I raised all the scores so they are higher than the stickiness.

See: https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#adding-ips-on-slaves-nodes

Sorry for this :/

> # pcs constraint order start pgsql-ha then start pgsql-ip-stby1
> kind=Mandatory
> Q: I did not set the order and I have not found any issue so far. Why
> should I add this constraint? What will happen if I miss it?
The IP address can start before PostgreSQL is up on the node. You will have client connections being rejected with the error "PostgreSQL is not listening on host [...]".

> Here is what I did now:
> pcs resource create pgsql-slave-ip1 ocf:heartbeat:IPaddr2 ip=192.168.199.186
> nic=enp3s0f0 cidr_netmask=24 op monitor interval=10s
> pcs resource create pgsql-slave-ip2 ocf:heartbeat:IPaddr2 ip=192.168.199.187
> nic=enp3s0f0 cidr_netmask=24 op monitor interval=10s
> pcs constraint colocation add pgsql-slave-ip1 with pgsql-ha

This is missing the score and the role. Without a role specification, the IP can colocate with a Master or a Slave with no preference.

> pcs constraint colocation add pgsql-slave-ip2 with pgsql-ha

Same, it is missing the score and the role.

> pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2
> pgsql-master-ip setoptions score=-1000

The score seems too high in my opinion, compared to the other ones. You should probably remove all the colocation constraints and try with the ones I pushed online.

Regards,

> -----Original Message-----
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> Sent: March 7, 2018 16:29
> To: 范国腾
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Re: Re: Re: How to configure to make each slave resource has one VIP
>
> On Wed, 7 Mar 2018 01:27:16 +, 范国腾 wrote:
>
> > Thank you, Rorthais,
> >
> > I read the link and it is very helpful.
>
> Did you read the draft I attached to the email? It was the main
> purpose of my answer: helping you with IPs on slaves. It seems to me
> your mail is reporting different issues than the original subject.
>
> > There are some issues that I have met when I installed the cluster.
>
> I suppose this is another subject and we should open a new thread with
> the appropriate subject.
>
> > 1. "pcs cluster stop" sometimes could not stop the cluster.
>
> You would have to give some more details about the context where "pcs
> cluster stop" timed out.
>
> > 2.
> > when I upgrade PAF, I can just replace the pgsqlms file. When
> > I upgrade Postgres, I just replace /usr/local/pgsql/.
>
> I believe both actions are documented with best practices in the
> links I gave you.
>
> > 3. If the cluster does not stop normally, the pg_controldata status
> > is not "SHUT DOWN", and then PAF will not start PostgreSQL any
> > more, so I normally change pgsqlms as below after installing PAF.
> > [...]
>
> This should be discussed to understand the exact context before
> considering your patch.
>
> At first glance, your patch seems quite dangerous, as it bypasses the &
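Putting together the advice in this thread (roles, scores, and the mandatory ordering), a corrected constraint set might look like the sketch below. Resource names and scores are the ones quoted in this exchange, not the authoritative values; the definitive configuration is in the PAF CentOS 7 admin cookbook mentioned earlier in the thread.

```shell
# Prefer placing the standby IP with a slave instance...
pcs constraint colocation add pgsql-slave-ip1 with slave pgsql-ha 100
# ...but allow falling back to the master when no slave is available
pcs constraint colocation add pgsql-slave-ip1 with pgsql-ha 50

# Never start the IP before PostgreSQL is up on that node, otherwise
# clients get "PostgreSQL is not listening on host [...]" errors
pcs constraint order start pgsql-ha then start pgsql-slave-ip1 kind=Mandatory

# Keep the standby IPs and the master IP on different nodes; the score
# should stay small relative to the colocation scores above
pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-master-ip \
    setoptions score=-10
```

As noted in the thread, all of these scores need to be higher than the resource stickiness to take effect during failover.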
[ClusterLabs] The node and resource status is different when the node powers off
Hello,

There are three nodes in our cluster (RHEL 7). When we run "reboot" on one node, "pcs status" shows the node status as offline and the resource status as Stopped. That is fine. But when we power off the node directly, the node status is "UNCLEAN (offline)" and the resource status is "Started (UNCLEAN)".

Why is the status different when a node shuts down in different ways? Is there any way to make the resource status change from "Started node1 (UNCLEAN)" to "Stopped" when we power off the node?

1. The normal status:

 scsi-shooter (stonith:fence_scsi): Started node1
 Clone Set: dlm-clone [dlm]
     Started: [ node1 node2 node3 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ node1 node2 node3 ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ node1 node2 node3 ]
 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ node3 ]
     Slaves: [ node1 node2 ]
 pgsql-master-ip (ocf::heartbeat:IPaddr2): Started node3

2. When executing "reboot" on one node:

Online: [ node2 node3 ]
OFFLINE: [ node1 ]

Full list of resources:

 scsi-shooter (stonith:fence_scsi): Started node2
 Clone Set: dlm-clone [dlm]
     Started: [ node2 node3 ]
     Stopped: [ node1 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ node2 node3 ]
     Stopped: [ node1 ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ node2 node3 ]
     Stopped: [ node1 ]
 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ node3 ]
     Slaves: [ node2 ]
     Stopped: [ node1 ]
 pgsql-master-ip (ocf::heartbeat:IPaddr2): Started node3

3.
When powering off the node:

Node node1: UNCLEAN (offline)
Online: [ node2 node3 ]

Full list of resources:

 scsi-shooter (stonith:fence_scsi): Started[ node1 node2 ]
 Clone Set: dlm-clone [dlm]
     dlm (ocf::pacemaker:controld): Started node1 (UNCLEAN)
     Started: [ node2 node3 ]
 Clone Set: clvmd-clone [clvmd]
     clvmd (ocf::heartbeat:clvm): Started node1 (UNCLEAN)
     Started: [ node2 node3 ]
 Clone Set: clusterfs-clone [clusterfs]
     clusterfs (ocf::heartbeat:Filesystem): Started node1 (UNCLEAN)
     Started: [ node2 node3 ]
 Master/Slave Set: pgsql-ha [pgsqld]
     pgsqld (ocf::heartbeat:pgsqlms): Slave node1 (UNCLEAN)
     Masters: [ node3 ]
     Slaves: [ node2 ]
 pgsql-master-ip (ocf::heartbeat:IPaddr2): Started node3
[ClusterLabs] Re: The node and resource status is different when the node powers off
Thank you, Andrei and Ulrich. Yes, I use a fake stonith device now. I will test with a working stonith device and see if it happens again.

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Andrei Borzenkov
Sent: March 15, 2018 16:06
To: Cluster Labs - All topics related to open-source clustering welcomed
Cc: 李晓飞 ; 祁华鹏
Subject: Re: [ClusterLabs] The node and resource status is different when the node powers off

On Thu, Mar 15, 2018 at 10:42 AM, 范国腾 wrote:
> Hello,
>
> There are three nodes in our cluster (RHEL 7). When we run "reboot" on one
> node, "pcs status" shows the node status as offline and the resource status
> as Stopped. That is fine. But when we power off the node directly, the node
> status is "UNCLEAN (offline)" and the resource status is "Started (UNCLEAN)".
>
> Why is the status different when a node shuts down in different ways? Is
> there any way to make the resource status change from "Started node1
> (UNCLEAN)" to "Stopped" when we power off the node?

You must have a working stonith agent. When a node unexpectedly goes away, the other nodes will invoke stonith, which confirms that the UNCLEAN node is down. After that, pacemaker will change the status to offline and will proceed with restarting the resources that were running on the powered-off node.
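Since this cluster already uses fence_scsi (the scsi-shooter resource above), a working stonith setup might look like the sketch below. The device path and host list are placeholders, not values from this thread; fence_scsi also needs the unfencing meta attribute so nodes are re-registered on the shared LUN after a fence.

```shell
# Hypothetical example: replace the device path and host names with the
# cluster's actual shared LUN and nodes
pcs stonith create scsi-shooter fence_scsi \
    pcmk_host_list="node1 node2 node3" \
    devices="/dev/disk/by-id/wwn-0x5000c50012345678" \
    meta provides=unfencing

# Re-enable stonith if it was disabled while testing with a fake device
pcs property set stonith-enabled=true
```

With this in place, a powered-off node should be fenced, confirmed down, and its resources shown as Stopped instead of lingering as UNCLEAN.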
[ClusterLabs] How to set up a simple master/slave cluster on two nodes without a stonith resource
Hello,

I want to set up a cluster on two nodes: one master and one slave. I don't need a fencing device because my internal network is stable. I used the following commands to create the resource, but both nodes stay in slave status and the cluster does not promote either of them to master. Could you please help check whether there is anything wrong with my configuration?

pcs property set stonith-enabled=false
pcs resource create pgsqld ocf:heartbeat:pgsqlms bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data op start timeout=600s op stop timeout=60s op promote timeout=300s op demote timeout=120s op monitor interval=15s timeout=100s role="Master" op monitor interval=16s timeout=100s role="Slave" op notify timeout=60s
pcs resource master pgsql-ha pgsqld notify=true interleave=true

The status is as below:

[root@node1 ~]# pcs status
Cluster name: cluster_pgsql
Stack: corosync
Current DC: node2-1 (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Mon Apr  2 21:51:57 2018
Last change: Mon Apr  2 21:32:22 2018 by hacluster via crmd on node2-1

2 nodes and 3 resources configured

Online: [ node1-1 node2-1 ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     Slaves: [ node1-1 node2-1 ]
 pgsql-master-ip (ocf::heartbeat:IPaddr2): Stopped

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

When I execute "pcs resource cleanup" on one node, there is always one node that prints the following warning messages in /var/log/messages, while the other node's log shows no error. The resource agent log (pgsqlms) shows the monitor action returned 0, so why does the crmd log show a failure?
Apr  2 21:53:09 node2 crmd[2425]: warning: No reason to expect node 1 to be down
Apr  2 21:53:09 node2 crmd[2425]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Apr  2 21:53:09 node2 crmd[2425]: warning: No reason to expect node 2 to be down
Apr  2 21:53:09 node2 pengine[2424]: notice: Start pgsqld:0#011(node1-1)
Apr  2 21:53:09 node2 pengine[2424]: notice: Start pgsqld:1#011(node2-1)
Apr  2 21:53:09 node2 pengine[2424]: notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-6.bz2
Apr  2 21:53:09 node2 crmd[2425]: notice: Initiating monitor operation pgsqld:0_monitor_0 on node1-1 | action 2
Apr  2 21:53:09 node2 crmd[2425]: notice: Initiating monitor operation pgsqld:1_monitor_0 locally on node2-1 | action 3
Apr  2 21:53:09 node2 pgsqlms(pgsqld)[3644]: INFO: Action is monitor
Apr  2 21:53:09 node2 pgsqlms(pgsqld)[3644]: INFO: pgsql_monitor: monitor is a probe
Apr  2 21:53:09 node2 pgsqlms(pgsqld)[3644]: INFO: pgsql_monitor: instance "pgsqld" is listening
Apr  2 21:53:09 node2 pgsqlms(pgsqld)[3644]: INFO: Action result is 0
Apr  2 21:53:09 node2 crmd[2425]: notice: Result of probe operation for pgsqld on node2-1: 0 (ok) | call=33 key=pgsqld_monitor_0 confirmed=true cib-update=62
Apr  2 21:53:09 node2 crmd[2425]: warning: Action 3 (pgsqld:1_monitor_0) on node2-1 failed (target: 7 vs. rc: 0): Error
Apr  2 21:53:09 node2 crmd[2425]: notice: Transition aborted by operation pgsqld_monitor_0 'create' on node2-1: Event failed | magic=0:0;3:4:7:3a132f28-d8b9-4948-bb6b-736edc221664 cib=0.28.2 source=match_graph_event:310 complete=false
Apr  2 21:53:09 node2 crmd[2425]: warning: Action 3 (pgsqld:1_monitor_0) on node2-1 failed (target: 7 vs. rc: 0): Error
Apr  2 21:53:09 node2 crmd[2425]: warning: Action 2 (pgsqld:0_monitor_0) on node1-1 failed (target: 7 vs. rc: 0): Error
Apr  2 21:53:09 node2 crmd[2425]: warning: Action 2 (pgsqld:0_monitor_0) on node1-1 failed (target: 7 vs.
rc: 0): Error
Apr  2 21:53:09 node2 crmd[2425]: notice: Transition 4 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
Apr  2 21:53:09 node2 pengine[2424]: notice: Calculated transition 5, saving inputs in /var/lib/pacemaker/pengine/pe-input-7.bz2
Apr  2 21:53:09 node2 crmd[2425]: notice: Initiating monitor operation pgsqld_monitor_16000 locally on node2-1 | action 4
Apr  2 21:53:09 node2 crmd[2425]: notice: Initiating monitor operation pgsqld_monitor_16000 on node1-1 | action 7
Apr  2 21:53:09 node2 pgsqlms(pgsqld)[3663]: INFO: Action is monitor
[ClusterLabs] Re: How to set up a simple master/slave cluster on two nodes without a stonith resource
Yes, my resources are started and they are in slave status. So I ran the "pcs resource cleanup pgsql-ha" command; the log shows the errors when I run this command.

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Andrei Borzenkov
Sent: April 3, 2018 12:00
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] How to set up a simple master/slave cluster on two nodes without a stonith resource

On 03.04.2018 05:07, 范国腾 wrote:
> Hello,
>
> I want to set up a cluster on two nodes: one master and one slave. I don't
> need a fencing device because my internal network is stable. I used the
> following commands to create the resource, but both nodes stay in slave
> status and the cluster does not promote either of them to master. Could you
> please help check whether there is anything wrong with my configuration?
>
> pcs property set stonith-enabled=false
> pcs resource create pgsqld ocf:heartbeat:pgsqlms bindir=/usr/local/pgsql/bin
> pgdata=/home/postgres/data op start timeout=600s op stop timeout=60s
> op promote timeout=300s op demote timeout=120s op monitor interval=15s
> timeout=100s role="Master" op monitor interval=16s timeout=100s
> role="Slave" op notify timeout=60s
> pcs resource master pgsql-ha pgsqld notify=true interleave=true
>
> The status is as below:
>
> [root@node1 ~]# pcs status
> Cluster name: cluster_pgsql
> Stack: corosync
> Current DC: node2-1 (version 1.1.15-11.el7-e174ec8) - partition with quorum
> Last updated: Mon Apr  2 21:51:57 2018
> Last change: Mon Apr  2 21:32:22 2018 by hacluster via crmd on node2-1
>
> 2 nodes and 3 resources configured
>
> Online: [ node1-1 node2-1 ]
>
> Full list of resources:
>
> Master/Slave Set: pgsql-ha [pgsqld]
>     Slaves: [ node1-1 node2-1 ]
> pgsql-master-ip (ocf::heartbeat:IPaddr2): Stopped
>
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
>
> When I execute "pcs resource cleanup" on one node, there is always one node
> that prints the following warning messages in /var/log/messages.
> But the other
> node's log shows no error. The resource agent log (pgsqlms) shows the
> monitor action returned 0, so why does the crmd log show a failure?
>
> Apr  2 21:53:09 node2 crmd[2425]: warning: No reason to expect node 1 to be down
> Apr  2 21:53:09 node2 crmd[2425]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
> Apr  2 21:53:09 node2 crmd[2425]: warning: No reason to expect node 2 to be down
> Apr  2 21:53:09 node2 pengine[2424]: notice: Start pgsqld:0#011(node1-1)
> Apr  2 21:53:09 node2 pengine[2424]: notice: Start pgsqld:1#011(node2-1)
> Apr  2 21:53:09 node2 pengine[2424]: notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-6.bz2
> Apr  2 21:53:09 node2 crmd[2425]: notice: Initiating monitor operation pgsqld:0_monitor_0 on node1-1 | action 2
> Apr  2 21:53:09 node2 crmd[2425]: notice: Initiating monitor operation pgsqld:1_monitor_0 locally on node2-1 | action 3
> Apr  2 21:53:09 node2 pgsqlms(pgsqld)[3644]: INFO: Action is monitor
> Apr  2 21:53:09 node2 pgsqlms(pgsqld)[3644]: INFO: pgsql_monitor: monitor is a probe
> Apr  2 21:53:09 node2 pgsqlms(pgsqld)[3644]: INFO: pgsql_monitor: instance "pgsqld" is listening
> Apr  2 21:53:09 node2 pgsqlms(pgsqld)[3644]: INFO: Action result is 0
> Apr  2 21:53:09 node2 crmd[2425]: notice: Result of probe operation for pgsqld on node2-1: 0 (ok) | call=33 key=pgsqld_monitor_0 confirmed=true cib-update=62
> Apr  2 21:53:09 node2 crmd[2425]: warning: Action 3 (pgsqld:1_monitor_0) on node2-1 failed (target: 7 vs. rc: 0): Error
> Apr  2 21:53:09 node2 crmd[2425]: notice: Transition aborted by operation pgsqld_monitor_0 'create' on node2-1: Event failed | magic=0:0;3:4:7:3a132f28-d8b9-4948-bb6b-736edc221664 cib=0.28.2 source=match_graph_event:310 complete=false
> Apr  2 21:53:09 node2 crmd[2425]: warning: Action 3 (pgsqld:1_monitor_0) on node2-1 failed (target: 7 vs.
> rc: 0): Error
> Apr  2 21:53:09 node2 crmd[2425]: warning: Action 2 (pgsqld:0_monitor_0) on node1-1 failed (target: 7 vs. rc: 0): Error
> Apr  2 21:53:09 node2 crmd[2425]: warning: Action 2 (pgsqld:0_monitor_0) on node1-1 failed (target: 7 vs. rc: 0): Error

Apparently your applications are already started on both nodes at the time you start pacemaker. Pacemaker expects resources to be in an inactive state initially.

> Apr  2 21:53:09 node2 crmd[2425]: notice: Transition 4 (Complete=4,
> Pending=0, Fired=0, Skipped=0, Incomplete=10,
> Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
> Apr  2 21:53:09 node2 pengine[24
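Following Andrei's point, the "errors" are just initial probes finding an already-running PostgreSQL instance (expected rc 7 "not running", got rc 0 "running"). One way to avoid them is to make sure PostgreSQL is stopped on every node before pacemaker starts. This is only a sketch using the paths from this thread, not PAF's official procedure:

```shell
# Stop any manually started instance first; pacemaker should be the
# only thing starting and stopping the managed PostgreSQL instance
sudo -u postgres /usr/local/pgsql/bin/pg_ctl stop -D /home/postgres/data -m fast

# Then (re)start the cluster; the initial probes will now report the
# resource as not running (rc 7), which is what the crmd expects
pcs cluster start --all
```
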
[ClusterLabs] Re: How to set up a simple master/slave cluster on two nodes without a stonith resource
Rorthais,

Thank you very much for your help. I followed your comments and the cluster status is OK now. I want to ask two more questions:

1. This line of code in PAF prevents the score from being set. Why does PAF require that prev_state must be "shut down"? Could I just set the score if it is not set?

   if ( $prev_state eq "shut down" and not _master_score_exists() )

2. The log shows "Transition aborted by operation pgsqld_monitor_0 'create' on node2-1: Event failed". How can we tell from this log that the score is not set?

Thanks

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: April 3, 2018 21:02
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] How to set up a simple master/slave cluster on two nodes without a stonith resource

On Tue, 3 Apr 2018 14:41:56 +0200, Jehan-Guillaume de Rorthais wrote:

> On Tue, 3 Apr 2018 02:07:50 +, 范国腾 wrote:
>
> > Hello,
> >
> > I want to set up a cluster on two nodes. One is master and the other
> > is slave. I don't need a fencing device because my internal network is
> > stable.
>
> How stable is it? This assumption is frequently wrong.
>
> See: https://aphyr.com/posts/288-the-network-is-reliable

Plus, if you really don't want to set up node fencing, at least set up a watchdog:
https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#setting-up-a-watchdog
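For question 1, the master score that `_master_score_exists()` checks is an ordinary pacemaker transient node attribute named master-<resource>, so it can be inspected from the shell without patching the agent. This is generic pacemaker tooling, not a PAF-specific interface; the resource and node names below are the ones from this thread:

```shell
# Query the current master (promotion) score for pgsqld on node1-1;
# "--type status" selects the transient attributes the agents use
crm_attribute --type status --node node1-1 --name master-pgsqld --query

# Setting it by hand is possible but normally left to the resource
# agent, which owns this value
crm_attribute --type status --node node1-1 --name master-pgsqld --update 1001
```

Setting the score manually bypasses PAF's safety check that the instance was cleanly shut down, which is exactly the sanity check the earlier thread warned against removing.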
[ClusterLabs] Re: Trouble starting up PAF cluster for first time
Hi,

I am using PAF too. You can read the /usr/lib/ocf/resource.d/heartbeat/pgsqlms file to find out which PostgreSQL command is called for each action. For example, the pacemaker start action runs pg_ctl start, and the monitor action runs pg_isready.

Thanks
Steven

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Casey & Gina
Sent: April 7, 2018 6:46
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Trouble starting up PAF cluster for first time

It looks like the main problem was that I needed to add pghost="/var/run/postgresql" to the postgresql-10-main resource. I'm not sure why I have to do that, but it makes things work.

For both this and my last e-mail to the list, which was also a problem with the command being run to start the instance, I'd like to understand how to diagnose what's happening myself instead of resorting to guesswork. How can I tell exactly what command Pacemaker ends up calling to start PostgreSQL? I don't see it in corosync.log. If I could see exactly what was being tried, I could run it by hand and determine the problem a lot more effectively.

Best wishes,
--
Casey
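One general way to answer Casey's question is to invoke the OCF resource agent directly with the same environment pacemaker would pass it. Resource parameters become OCF_RESKEY_<name> environment variables by OCF convention; the parameter values below are the ones from this thread, so adjust them to the actual resource definition:

```shell
# Run the PAF agent's start action by hand, as root, with the same
# parameters pacemaker would pass (a sketch; pgsqlms has more
# parameters such as bindir and pgdata that may also need setting)
sudo OCF_ROOT=/usr/lib/ocf \
     OCF_RESOURCE_INSTANCE=postgresql-10-main \
     OCF_RESKEY_pghost=/var/run/postgresql \
     /usr/lib/ocf/resource.d/heartbeat/pgsqlms start

echo $?   # 0 means OCF_SUCCESS
```

The resource-agents package also ships ocf-tester, which exercises an agent's full action set against a given parameter list and is handy for the same kind of debugging.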
[ClusterLabs] Could we use pacemaker to configure two master/slave resources?
Hello,

We use the following cluster now. The postgres resource agent PAF has its own mechanism to decide which node should be promoted to master. But performance testing shows that disk I/O is the bottleneck, so we want to change the GFS2 resource (clusterfs) to master/slave mode too, which could improve performance. There would then be two master/slave resources (pgsql and gfs), and we want their masters to always be on the same node; if a switchover happens, pgsql and gfs should switch over at the same time.

I am studying how to implement this configuration. Could you please give any suggestions?

[inline image: cluster diagram omitted]

By the way, we currently use the following commands to create the clone GFS resources. Is there any way to configure them to improve shared-disk access performance?

pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs resource create clusterfs Filesystem device="/dev/sdb1" directory="/home/postgres/sharedisk" fstype="gfs2" "options=noatime" op monitor interval=10s on-fail=fence clone interleave=true
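For completeness, the dlm / clvmd / GFS2 clone stack normally also needs ordering and colocation constraints so the layers start in sequence on each node. This is the standard wiring for that stack (using the resource names above), not an answer to the master/slave question:

```shell
# clvmd needs dlm, and the GFS2 mount needs clvmd, on every node
pcs constraint order start dlm-clone then clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone
pcs constraint order start clvmd-clone then clusterfs-clone
pcs constraint colocation add clusterfs-clone with clvmd-clone
```
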
[ClusterLabs] Re: No slave is promoted to be master
Hello,

We use the following commands to create the cluster. node2 is always the master when the cluster starts. Why does pacemaker not select node1 as the default master? How should we configure the cluster if we want node1 to be the default master?

pcs cluster setup --name cluster_pgsql node1 node2
pcs resource create pgsqld ocf:heartbeat:pgsqlms bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data op start timeout=600s op stop timeout=60s op promote timeout=300s op demote timeout=120s op monitor interval=15s timeout=100s role="Master" op monitor interval=16s timeout=100s role="Slave" op notify timeout=60s
pcs resource master pgsql-ha pgsqld notify=true interleave=true

Sometimes it reports the following error; how should we configure the cluster to avoid it?

[inline image: error screenshot omitted]
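As a general pacemaker note (this is not from the thread, so treat it as a sketch): a preferred master is usually expressed with a location rule that gives the master role extra score on the chosen node. The score value is illustrative and must be large enough to outweigh the difference in the agent's own promotion scores:

```shell
# Prefer promoting pgsql-ha's master instance on node1
pcs constraint location pgsql-ha rule role=master score=100 \#uname eq node1
```

The backslash keeps the shell from treating #uname as a comment; pcs passes it through as the node-name attribute.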
[ClusterLabs] Re: No slave is promoted to be master
Thank you, Rorthais. I see now.

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: April 13, 2018 17:17
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] No slave is promoted to be master

OK, I know what happened. It seems your standbies were not replicating when the master "crashed"; you can find tons of messages like this in the log files:

WARNING: No secondary connected to the master
WARNING: "db2" is not connected to the primary
WARNING: "db3" is not connected to the primary

When a standby is not replicating, the master sets a negative master score on it to forbid promotion there, as it is probably lagging by some undefined amount of time.

The following command shows the scores just before the simulated master crash:

$ crm_simulate -x pe-input-2039.bz2 -s | grep -E 'date|promotion'
Using the original execution date of: 2018-04-11 16:23:07Z
pgsqld:0 promotion score on db1: 1001
pgsqld:1 promotion score on db2: -1000
pgsqld:2 promotion score on db3: -1000

The "1001" score designates the master. Streaming standbies always have a positive master score between 1000 and 1000-N*10, where N is the number of connected standbies.

On Fri, 13 Apr 2018 01:37:54 +, 范国腾 wrote:

> The log is in the attachment.
>
> We planted a bug in the PG code on the master node so that it cannot be
> restarted any more, in order to test the following scenario: one slave
> should be promoted when the master crashes.
>
> -----Original Message-----
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> Sent: April 12, 2018 17:39
> To: 范国腾
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] No slave is promoted to be master
>
> Hi,
>
> On Thu, 12 Apr 2018 08:31:39 +, 范国腾 wrote:
>
> > Thank you very much for helping check this issue. The information is in
> > the attachment.
> >
> > I have restarted the cluster after I sent my first email.
> > Not sure
> > if it affects the checking of the result of "crm_simulate -sL".
>
> It does...
>
> Could you please provide the files
> from /var/lib/pacemaker/pengine/pe-input-2039.bz2 to pe-input-2065.bz2?
>
> > [...]
> > Then the master is restarted and it could not start (that is OK and
> > we know the reason).
>
> Why couldn't it start?

--
Jehan-Guillaume de Rorthais
Dalibo
[ClusterLabs] No slave is promoted to be master
Hi,

We installed a new lab which has only the postgres resource and the vip resource. After the cluster is installed, the status is ok: one node is master and the other is slave.

Then I run "pcs cluster stop --all" to close the cluster, and then I run "pcs cluster start --all" to start the cluster. All of the pgsql instances are in slave status and none can be promoted to master any more, like this:

Master/Slave Set: pgsql-ha [pgsqld]
    Slaves: [ sds1 sds2 ]

There is no error in the log, and "crm_simulate -sL" shows the following; it seems the scores are ok too. The detailed log and config is in the attachment.

[root@node1 ~]# crm_simulate -sL

Current cluster status:
Online: [ sds1 sds2 ]

Master/Slave Set: pgsql-ha [pgsqld]
    Slaves: [ sds1 sds2 ]
Resource Group: mastergroup
    master-vip (ocf::heartbeat:IPaddr2): Stopped
    pgsql-master-ip (ocf::heartbeat:IPaddr2): Stopped

Allocation scores:
clone_color: pgsql-ha allocation score on sds1: 1
clone_color: pgsql-ha allocation score on sds2: 1
clone_color: pgsqld:0 allocation score on sds1: 1003
clone_color: pgsqld:0 allocation score on sds2: 1
clone_color: pgsqld:1 allocation score on sds1: 1
clone_color: pgsqld:1 allocation score on sds2: 1002
native_color: pgsqld:0 allocation score on sds1: 1003
native_color: pgsqld:0 allocation score on sds2: 1
native_color: pgsqld:1 allocation score on sds1: -INFINITY
native_color: pgsqld:1 allocation score on sds2: 1002
pgsqld:0 promotion score on sds1: 1002
pgsqld:1 promotion score on sds2: 1001
group_color: mastergroup allocation score on sds1: 0
group_color: mastergroup allocation score on sds2: 0
group_color: master-vip allocation score on sds1: 0
group_color: master-vip allocation score on sds2: 0
native_color: master-vip allocation score on sds1: 1003
native_color: master-vip allocation score on sds2: -INFINITY
native_color: pgsql-master-ip allocation score on sds1: 1003
native_color: pgsql-master-ip allocation score on sds2: -INFINITY

Transition Summary:
 * Promote pgsqld:0 (Slave -> Master sds1)
 * Start master-vip (sds1)
 * Start pgsql-master-ip (sds1)
[ClusterLabs] Re: No slave is promoted to be master
I checked the status again. It is not that it is never promoted; it is promoted about 15 minutes after the cluster starts.

I tried in three labs and the results are the same: the promotion happens 15 minutes after the cluster starts. Why is there an approximately 15-minute delay every time?

Apr 16 22:08:32 node1 attrd[16618]: notice: Node sds1 state is now member
Apr 16 22:08:32 node1 attrd[16618]: notice: Node sds2 state is now member
..
Apr 16 22:21:36 node1 pgsqlms(pgsqld)[18230]: INFO: Execute action monitor and the result 0
Apr 16 22:21:52 node1 pgsqlms(pgsqld)[18257]: INFO: Execute action monitor and the result 0
Apr 16 22:22:09 node1 pgsqlms(pgsqld)[18296]: INFO: Execute action monitor and the result 0
Apr 16 22:22:25 node1 pgsqlms(pgsqld)[18315]: INFO: Execute action monitor and the result 0
Apr 16 22:22:41 node1 pgsqlms(pgsqld)[18343]: INFO: Execute action monitor and the result 0
Apr 16 22:22:57 node1 pgsqlms(pgsqld)[18362]: INFO: Execute action monitor and the result 0
Apr 16 22:23:13 node1 pgsqlms(pgsqld)[18402]: INFO: Execute action monitor and the result 0
Apr 16 22:23:29 node1 pgsqlms(pgsqld)[18421]: INFO: Execute action monitor and the result 0
Apr 16 22:23:45 node1 pgsqlms(pgsqld)[18449]: INFO: Execute action monitor and the result 0
Apr 16 22:23:57 node1 crmd[16620]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 16 22:23:57 node1 pengine[16619]: notice: Promote pgsqld:0#011(Slave -> Master sds1)
Apr 16 22:23:57 node1 pengine[16619]: notice: Start master-vip#011(sds1)
Apr 16 22:23:57 node1 pengine[16619]: notice: Start pgsql-master-ip#011(sds1)
Apr 16 22:23:57 node1 pengine[16619]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-18.bz2
Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating cancel operation pgsqld_monitor_16000 locally on sds1
Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 locally on sds1
Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 on sds2
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Promoting instance on node "sds1"
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Current node TL#LSN: 4#117440512
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Execute action notify and the result 0
Apr 16 22:23:58 node1 crmd[16620]: notice: Result of notify operation for pgsqld on sds1: 0 (ok)
Apr 16 22:23:58 node1 crmd[16620]: notice: Initiating promote operation pgsqld_promote_0 locally on sds1
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18499]: INFO: Waiting for the promote to complete
Apr 16 22:23:59 node1 pgsqlms(pgsqld)[18499]: INFO: Promote complete

[root@node1 ~]# crm_simulate -sL

Current cluster status:
Online: [ sds1 sds2 ]

Master/Slave Set: pgsql-ha [pgsqld]
    Masters: [ sds1 ]
    Slaves: [ sds2 ]
Resource Group: mastergroup
    master-vip (ocf::heartbeat:IPaddr2): Started sds1
    pgsql-master-ip (ocf::heartbeat:IPaddr2): Started sds1

Allocation scores:
clone_color: pgsql-ha allocation score on sds1: 1
clone_color: pgsql-ha allocation score on sds2: 1
clone_color: pgsqld:0 allocation score on sds1: 1003
clone_color: pgsqld:0 allocation score on sds2: 1
clone_color: pgsqld:1 allocation score on sds1: 1
clone_color: pgsqld:1 allocation score on sds2: 1002
native_color: pgsqld:0 allocation score on sds1: 1003
native_color: pgsqld:0 allocation score on sds2: 1
native_color: pgsqld:1 allocation score on sds1: -INFINITY
native_color: pgsqld:1 allocation score on sds2: 1002
pgsqld:0 promotion score on sds1: 1002
pgsqld:1 promotion score on sds2: 1001
group_color: mastergroup allocation score on sds1: 0
group_color: mastergroup allocation score on sds2: 0
group_color: master-vip allocation score on sds1: 0
group_color: master-vip allocation score on sds2: 0
native_color: master-vip allocation score on sds1: 1003
native_color: master-vip allocation score on sds2: -INFINITY
native_color: pgsql-master-ip allocation score on sds1: 1003
native_color: pgsql-master-ip allocation score on sds2: -INFINITY

Transition Summary:

[root@node1 ~]#

You can reproduce the issue with two nodes: execute the following commands, then run "pcs cluster stop --all" followed by "pcs cluster start --all".

pcs resource create pgsqld ocf:heartbeat:pgsqlms \
    bindir=/home/highgo/highgo/database/4.3.1/bin \
    pgdata=/home/highgo/highgo/database/4.3.1/data \
    op start timeout=600s op stop timeout=60s \
    op promote timeout=300s op demote timeout=120s \
    op monitor interval=10s timeout=100s role="Master" \
    op monitor interval=16s timeout=100s role="Slave" \
    op notify timeout=60s

pcs resource master pgsql-ha pgsqld notify=true interleave=true

-----Original Message-----
From: 范国腾
Sent: April 17, 2018 10:25
To: 'Jehan-Guillaume de Rorthais'
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: [ClusterLabs] No slave is promoted to be master

Hi,
[ClusterLabs] Re: No slave is promoted to be master
Thank you very much, Rorthais, I see now. I have two more questions.

1. If I change the "cluster-recheck-interval" parameter from the default 15 minutes to 10 seconds, is there any bad impact? Could this be a workaround?

2. This issue happens only in the following configuration:

[inline image]

But it does not happen in the following configuration. Why is the behavior different?

[inline image]

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: April 17, 2018 17:47
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] No slave is promoted to be master

On Tue, 17 Apr 2018 04:16:38 + 范国腾 wrote:

> I checked the status again. It is not that it is never promoted; it is
> promoted about 15 minutes after the cluster starts.
>
> I tried in three labs and the results are the same: the promotion happens
> 15 minutes after the cluster starts.
>
> Why is there about a 15-minute delay every time?

This was a bug in Pacemaker up to 1.1.17. I reported it last August and Ken Gaillot fixed it a few days later in 1.1.18. See:

https://lists.clusterlabs.org/pipermail/developers/2017-August/001110.html
https://lists.clusterlabs.org/pipermail/developers/2017-September/001113.html

I wonder if disabling the pgsql resource before shutting down the cluster might be a simpler and safer workaround. Eg.:

pcs resource disable pgsql-ha --wait
pcs cluster stop --all

and

pcs cluster start --all
pcs resource enable pgsql-ha

Another fix would be to force a master score on one node **if needed** using:

crm_master -N <node> -r <resource> -l forever -v 1
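For reference, the two workarounds discussed in this thread translate into commands like the following. This is a configuration sketch against a live cluster, not something runnable here; resource and property names follow the thread. One caution on question 1: cluster-recheck-interval controls how often the policy engine re-evaluates the whole cluster, so a 10-second value makes it run constantly on the DC and floods the logs; something around a minute is a more common compromise.

```shell
# Workaround A: shorten the recheck interval so a stalled transition is
# re-evaluated sooner (trade-off: the policy engine runs this often).
pcs property set cluster-recheck-interval=60s

# Workaround B (suggested above): take the resource down cleanly before
# stopping the cluster, so no stale state survives the restart.
pcs resource disable pgsql-ha --wait
pcs cluster stop --all
# ... maintenance ...
pcs cluster start --all
pcs resource enable pgsql-ha
```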
[ClusterLabs] Re: Postgres PAF setup
I have met a similar issue when postgres was not stopped normally. You could run pg_controldata to check whether your postgres state is "shut down" / "shut down in recovery".

I changed /usr/lib/ocf/resource.d/heartbeat/pgsqlms to avoid this problem:

    elsif ( $pgisready_rc == 2 ) {
        # The instance is not listening.
        # We check the process status using pg_ctl status and check
        # if it was properly shut down using pg_controldata.
        ocf_log( 'debug', 'pgsql_monitor: instance "%s" is not listening',
            $OCF_RESOURCE_INSTANCE );
        # return _confirm_stopped(); # remove this line
        return $OCF_NOT_RUNNING;
    }

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Adrien Nayrat
Sent: April 24, 2018 16:16
To: Andrew Edenburn ; pgsql-gene...@postgresql.org; users@clusterlabs.org
Subject: Re: [ClusterLabs] Postgres PAF setup

On 04/23/2018 08:09 PM, Andrew Edenburn wrote:
> I am having issues with my PAF setup. I am new to Postgres and have
> setup the cluster as seen below.
>
> I am getting this error when trying to start my cluster resources.
>
> Master/Slave Set: pgsql-ha [pgsqld]
>     pgsqld (ocf::heartbeat:pgsqlms): FAILED dcmilphlum224 (unmanaged)
>     pgsqld (ocf::heartbeat:pgsqlms): FAILED dcmilphlum223 (unmanaged)
> pgsql-master-ip (ocf::heartbeat:IPaddr2): Started dcmilphlum223
>
> Failed Actions:
> * pgsqld_stop_0 on dcmilphlum224 'unknown error' (1): call=239,
>   status=complete, exitreason='Unexpected state for instance "pgsqld"
>   (returned 1)',
>   last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=95ms
> * pgsqld_stop_0 on dcmilphlum223 'unknown error' (1): call=248,
>   status=complete, exitreason='Unexpected state for instance "pgsqld"
>   (returned 1)',
>   last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=89ms
>
> cleanup and clear are not fixing any issues and I am not seeing
> anything in the logs. Any help would be greatly appreciated.

Hello Andrew,

Could you enable debug logs in Pacemaker?
With CentOS you have to edit the PCMK_debug variable in /etc/sysconfig/pacemaker:

PCMK_debug=crmd,pengine,lrmd

This should give you more information in the logs. The monitor action in PAF should report why the cluster doesn't start:
https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L1525

Regards,

--
Adrien NAYRAT
[ClusterLabs] Re: Re: Postgres PAF setup
Adrien,

Is there any way to make the cluster recover when postgres was not properly stopped, such as after a lab power-off or an OS reboot?

Thanks

-----Original Message-----
From: Adrien Nayrat [mailto:adrien.nay...@anayrat.info]
Sent: April 25, 2018 15:29
To: Cluster Labs - All topics related to open-source clustering welcomed ; 范国腾 ; Andrew Edenburn ; pgsql-gene...@postgresql.org
Subject: Re: [ClusterLabs] Re: Postgres PAF setup

On 04/25/2018 02:31 AM, 范国腾 wrote:
> I have met a similar issue when postgres was not stopped normally.
>
> You could run pg_controldata to check whether your postgres state is
> "shut down" / "shut down in recovery".
>
> I changed /usr/lib/ocf/resource.d/heartbeat/pgsqlms to avoid this problem:
>
>     elsif ( $pgisready_rc == 2 ) {
>         # The instance is not listening.
>         # We check the process status using pg_ctl status and check
>         # if it was properly shut down using pg_controldata.
>         ocf_log( 'debug', 'pgsql_monitor: instance "%s" is not listening',
>             $OCF_RESOURCE_INSTANCE );
>         # return _confirm_stopped(); # remove this line
>         return $OCF_NOT_RUNNING;
>     }

Hello,

It is a bad idea. The goal of _confirm_stopped is to check whether the instance was properly stopped. If it wasn't, you could corrupt your instance. _confirm_stopped returns $OCF_NOT_RUNNING only if the instance was properly shut down:

    elsif ( $controldata_rc == $OCF_NOT_RUNNING ) {
        # The controldata state is consistent, the instance was probably
        # properly shut down.
        ocf_log( 'debug',
            '_confirm_stopped: instance "%s" controldata indicates that the instance was properly shut down',
            $OCF_RESOURCE_INSTANCE );
        return $OCF_NOT_RUNNING;
    }

Regards,

--
Adrien NAYRAT
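To see which side of that branch an instance is on before touching pgsqlms, you can read the state that pg_controldata records. A minimal sketch follows; the here-string stands in for a real `pg_controldata "$PGDATA"` run, and only the two "shut down" states indicate a clean stop (anything else, such as "in production" after a power-off, means crash recovery is needed).

```shell
# Classify the last shutdown from pg_controldata output. Against a live
# data directory you would use:  pg_controldata "$PGDATA"
controldata='Database cluster state:               shut down
Latest checkpoint location:           0/7000060'

# Extract the value of the "Database cluster state" line.
state=$(printf '%s\n' "$controldata" | sed -n 's/^Database cluster state: *//p')

case "$state" in
    "shut down"|"shut down in recovery") verdict="clean shutdown" ;;
    *) verdict="unclean - instance was not properly stopped" ;;
esac
echo "$verdict"   # clean shutdown
```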
[ClusterLabs] the PAF switchover does not happen if the VIP resource is stopped
Hi,

Our lab has two resources: (1) PAF (master/slave) and (2) a VIP (bound to the master PAF node). The configuration is in the attachment.

Each node has two network cards: one (enp0s8) is for the pacemaker heartbeat on the internal network; the other (enp0s3) is for the master VIP on the external network.

We are testing the following case: if the master VIP network card goes down, the master postgres and the VIP should switch to another node.

1. At first, node2 is master. I run "ifdown enp0s3" on node2, then node1 becomes the master; that is ok.

[inline image]
[inline image]

2. Then I run "ifup enp0s3" on node2, wait for 60 seconds, then run "ifdown enp0s3" on node1, but node1 is still master. Why didn't the switchover happen? How do we recover to make the system work?

[inline image]

The log is in the attachment. Node1 reports the following warning:

Apr 25 04:49:27 node1 crmd[24678]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 25 04:49:27 node1 pengine[24677]: warning: Processing failed op start for master-vip on sds2: unknown error (1)
Apr 25 04:49:27 node1 pengine[24677]: warning: Processing failed op start for master-vip on sds1: unknown error (1)
Apr 25 04:49:27 node1 pengine[24677]: warning: Forcing master-vip away from sds1 after 100 failures (max=100)
Apr 25 04:49:27 node1 pengine[24677]: warning: Forcing master-vip away from sds2 after 100 failures (max=100)
Apr 25 04:49:27 node1 pengine[24677]: notice: Calculated transition 14, saving inputs in /var/lib/pacemaker/pengine/pe-input-59.bz2
Apr 25 04:49:27 node1 crmd[24678]: notice: Transition 14 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Complete
[ClusterLabs] Re: the PAF switchover does not happen if the VIP resource is stopped
Hi Rorthais,

Thank you for your help. Replication was working at that time.

I tried again today:
(1) If I run "ifup enp0s3" on node2 and then run "ifdown enp0s3" on node1, the switchover issue is reproduced.
(2) But if I run "ifup enp0s3" on node2, run "pcs resource cleanup mastergroup" to clean the VIP resource so that there are no Failed Actions in "pcs status", and then run "ifdown enp0s3" on node1, it works: the switchover happens again.

Is there any parameter to control this behavior so that I don't need to execute the "pcs resource cleanup" command every time?

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: April 25, 2018 18:39
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed ; 李梦怡
Subject: Re: [ClusterLabs] the PAF switchover does not happen if the VIP resource is stopped

On Wed, 25 Apr 2018 08:58:34 + 范国腾 wrote:

> Our lab has two resources: (1) PAF (master/slave) and (2) a VIP (bound to
> the master PAF node). The configuration is in the attachment.
>
> [...]
>
> 1. At first, node2 is master. I run "ifdown enp0s3" on node2, then
> node1 becomes the master; that is ok.
>
> 2. Then I run "ifup enp0s3" on node2, wait for 60 seconds,

Did you check PostgreSQL instances were replicating again?

> then run "ifdown enp0s3" on node1, but node1 is still master. Why
> didn't the switchover happen? How do we recover to make the system work?
[ClusterLabs] Re: the PAF switchover does not happen if the VIP resource is stopped
1. There is no failure in the initial status; sds1 is master.

[inline image]

2. ifdown the sds1 VIP network card.

[inline image]

3. ifup the sds1 VIP network card and then ifdown the sds2 VIP network card.

[inline image]

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: April 26, 2018 15:07
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed ; 李梦怡
Subject: Re: [ClusterLabs] the PAF switchover does not happen if the VIP resource is stopped

On Thu, 26 Apr 2018 02:53:33 + 范国腾 wrote:

> Hi Rorthais,
>
> Thank you for your help. Replication was working at that time.
>
> I tried again today:
> (1) If I run "ifup enp0s3" on node2 and then run "ifdown enp0s3" on
> node1, the switchover issue is reproduced.
> (2) But if I run "ifup enp0s3" on node2, run "pcs resource cleanup
> mastergroup" to clean the VIP resource so that there are no Failed
> Actions in "pcs status", and then run "ifdown enp0s3" on node1, it
> works: the switchover happens again.
>
> Is there any parameter to control this behavior so that I don't need
> to execute the "pcs resource cleanup" command every time?

Check the failcounts for each resource on each node (pcs resource failcount [...]).

Check the scores as well (crm_simulate -sL).

> [...]
[ClusterLabs] Re: the PAF switchover does not happen if the VIP resource is stopped
Does it mean that if a node has ever had a resource failure, it can no longer be promoted to master unless I run pcs cleanup to clear the failcount?

I am testing the case where the VIP resource goes down for some reason and the cluster should still work, so I only ifdown the VIP network card (enp0s3), not the heartbeat network card (enp0s8).

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: April 26, 2018 16:02
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed ; 李梦怡
Subject: Re: [ClusterLabs] the PAF switchover does not happen if the VIP resource is stopped

On Thu, 26 Apr 2018 07:53:07 + 范国腾 wrote:

> 1. There is no failure in the initial status; sds1 is master.

yes.

> 2. ifdown the sds1 VIP network card.

ok, failcount and -inf score appears.

> 3. ifup the sds1 VIP network card and then ifdown the sds2 VIP network
> card.

Now failcount and -inf score everywhere.

I'm not sure I understand your mail, do you have a question ?

> [...]

--
Jehan-Guillaume de Rorthais
Dalibo
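For anyone hitting the same loop, the checks referred to above, plus the meta attributes that answer the earlier question (avoiding a manual cleanup every time), look roughly like this. failure-timeout and migration-threshold are standard Pacemaker resource meta attributes; treat the exact pcs invocations as a sketch to adapt to your version and resource names.

```shell
# Inspect why a node is being avoided after a resource failure.
pcs resource failcount show master-vip              # per-node fail counts
crm_simulate -sL | grep -E 'promotion|master-vip'   # resulting scores

# Clear the failure history by hand once the underlying problem is fixed.
pcs resource cleanup mastergroup

# Or let fail counts expire automatically instead of manual cleanups:
# after failure-timeout passes with no new failures, the ban is lifted.
pcs resource update master-vip meta migration-threshold=3 failure-timeout=120s
```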
[ClusterLabs] Could not start only one node in pacemaker
Hi,

The cluster has three nodes: one is master and two are slaves. We run "pcs cluster stop --all" to stop all of the nodes, then run "pcs cluster start" on the master node. We find it is not able to start. The cause is that the stonith resource could not be started, so none of the other resources could be started.

We tested this case on two cluster systems and the result is the same:

- If we start all three nodes, the stonith resource can be started. If we then stop one node, the stonith resource migrates to another node and the cluster still works.

- If we start only one or only two nodes, the stonith resource cannot be started.

(1) We created the stonith resource using this method on one system:

pcs stonith create ipmi_node1 fence_ipmilan ipaddr="192.168.100.202" login="ADMIN" passwd="ADMIN" pcmk_host_list="node1"
pcs stonith create ipmi_node2 fence_ipmilan ipaddr="192.168.100.203" login="ADMIN" passwd="ADMIN" pcmk_host_list="node2"
pcs stonith create ipmi_node3 fence_ipmilan ipaddr="192.168.100.204" login="ADMIN" passwd="ADMIN" pcmk_host_list="node3"

(2) We created the stonith resource using this method on the other system:

pcs stonith create scsi-stonith-device fence_scsi devices=/dev/mapper/fence pcmk_monitor_action=metadata pcmk_reboot_action=off pcmk_host_list="node1 node2 node3 node4" meta provides=unfencing

The log is in the attachment. What prevents the stonith resource from being started if we only start part of the nodes?

Thanks
[ClusterLabs] Re: Could not start only one node in pacemaker
Andrei,

We set "pcs property set no-quorum-policy=freeze". If we want to keep this "freeze" value, could you please tell us what quorum parameter we should set?

Thanks

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Andrei Borzenkov
Sent: May 2, 2018 12:20
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Could not start only one node in pacemaker

02.05.2018 05:52, 范国腾 wrote:
> Hi,
> The cluster has three nodes: one is master and two are slaves. We run
> "pcs cluster stop --all" to stop all of the nodes, then run "pcs cluster
> start" on the master node. We find it is not able to start. The cause is
> that the stonith resource could not be started, so none of the other
> resources could be started.
>
> [...]
>
> The log is in the attachment.
> What prevents the stonith resource from being started if we only start
> part of the nodes?

It says quite clearly

May 1 22:02:09 node3 pengine[17997]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
[ClusterLabs] Re: Re: Could not start only one node in pacemaker
Andrei, We use the following command to create the cluster: pcs cluster auth node1 node2 node3 node4 -u hacluster; pcs cluster setup --name cluster_pgsql node1 node2 node3 node4; pcs cluster start --all; pcs property set no-quorum-policy=freeze; pcs property set stonith-enabled=true; pcs stonith create scsi-stonith-device fence_scsi devices=/dev/mapper/fence pcmk_monitor_action=metadata pcmk_reboot_action=off pcmk_host_list="node1 node2 node3 node4" meta provides=unfencing; Could you please tell how to configure if I want to use only fencing or only quorum? Maybe no-quorum-policy=freeze and stonith-enabled=true could not be set at the same time? Thanks -邮件原件- 发件人: Users [mailto:users-boun...@clusterlabs.org] 代表 Andrei Borzenkov 发送时间: 2018年5月2日 13:06 收件人: users@clusterlabs.org 主题: Re: [ClusterLabs] 答复: Could not start only one node in pacemaker 02.05.2018 07:28, 范国腾 пишет: > Andrei, > > We set "pcs property set no-quorum-policy=freeze;" If we want to keep > this "freeze" value, could you please tell what quorum parameter we > should set? > There is no other parameter. Either you base your cluster on quorum or you base your cluster on fencing. Attempt to mix them will give you the result you have observed. You cannot start resource management until you are aware of state of other nodes so either starting node puts other nodes in defined state (fencing) or you *MUST* stop and wait for (sufficient number of) other nodes to appear, because doing anything else will clearly violate quorum requirement. Quorum relies on the fact that out-of-quorum nodes will not do anything. > Thanks > > > -邮件原件- 发件人: Users [mailto:users-boun...@clusterlabs.org] 代表 > Andrei Borzenkov 发送时间: 2018年5月2日 12:20 收件人: users@clusterlabs.org 主题: > Re: [ClusterLabs] Could not start only one node in pacemaker > > 02.05.2018 05:52, 范国腾 пишет: >> Hi, The cluster has three nodes: one is master and two are slave. >> Now we run “pcs cluster stop --all” to stop all of the nodes. 
Then we >> run “pcs cluster start” in the master node. We find it not able to >> started. The cause is that the stonith resource could not be started >> so all of the other resource could not be started. >> >> We test this case in two cluster system and the result is same: >> >> l If we start all of the three nodes, the stonith resource could be >> started. If we stop one node after it starts, the stonith resource >> could be migrated to another node and the cluster still work. >> >> l If we start only one or only two nodes, the stonith resource could >> not be started. >> >> >> (1) We create the stonith resource using this method in one >> system: pcs stonith create ipmi_node1 fence_ipmilan >> ipaddr="192.168.100.202" login="ADMIN" passwd="ADMIN" >> pcmk_host_list="node1" pcs stonith create ipmi_node2 fence_ipmilan >> ipaddr="192.168.100.203" login="ADMIN" passwd="ADMIN" >> pcmk_host_list="node2" pcs stonith create ipmi_node3 fence_ipmilan >> ipaddr="192.168.100.204" login="ADMIN" passwd="ADMIN" >> pcmk_host_list="node3" >> >> >> (2) We create the stonith resource using this method in another >> system: >> >> pcs stonith create scsi-stonith-device fence_scsi >> devices=/dev/mapper/fence pcmk_monitor_action=metadata >> pcmk_reboot_action=off pcmk_host_list="node1 node2 node3 node4" >> meta provides=unfencing; >> >> >> The log is in the attachment. What prevents the stonith resource to >> be started if we only started part of the nodes? 
> It says quite clearly:
>
> May 1 22:02:09 node3 pengine[17997]: notice: Cannot fence unclean
> nodes until quorum is attained (or no-quorum-policy is set to ignore)

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
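Andrei's either/or above can be summarized as two mutually exclusive property sets. This is an illustrative sketch, not from the thread; it uses the standard Pacemaker cluster properties (stonith-enabled, no-quorum-policy) and stock values:

```shell
# Sketch of the two alternatives (assumed standard Pacemaker properties).

# (a) Fencing-based cluster: a starting node may fence peers it cannot
#     see instead of waiting for them, so a partial cluster can come up.
pcs property set stonith-enabled=true
pcs property set no-quorum-policy=ignore

# (b) Quorum-based cluster: no fencing; a partition without quorum must
#     simply wait until enough nodes appear.
pcs property set stonith-enabled=false
pcs property set no-quorum-policy=stop
```

Combining freeze with fencing but without quorum, as in the original configuration, is exactly the mix Andrei warns against: a partial cluster can neither fence the missing nodes nor proceed.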
[ClusterLabs] Re: The slave does not promote to master
Thank you, Klaus. Per the requirements, there is no fencing device in our network. Is there any other way to configure the cluster to make it work?

From: Klaus Wenninger [mailto:kwenn...@redhat.com]
Sent: May 7, 2018 14:40
To: Cluster Labs - All topics related to open-source clustering welcomed; 范国腾
Subject: Re: [ClusterLabs] The slave does not promote to master

On 05/07/2018 07:39 AM, 范国腾 wrote:

Hi,

We have a two-node cluster using PAF to manage postgres. Node sds2 is the master.

Master/Slave Set: pgsql-ha [pgsqld]
Master: [ sds2 ]
Slaves: [ sds1 ]

On the master node (sds2), I removed the data directory of postgres. I expected the master node (sds2) to stop and the slave node (sds1) to be promoted to master. The sds2 log shows that it executes monitor->notify->demote->notify->stop. The sds1 log also shows "Promote pgsqld:0#011(Slave -> Master sds1)". But "pcs status" shows the status below. Could you please help check what prevents the promotion on sds1? What should I do if I want to recover the system?

I didn't check all the details, but it looks as if stopping the resource fails, so the cluster doesn't know the state on sds2 and thus can't promote on sds1. If you had enabled fencing, this would lead to sds2 being fenced so that sds1 could take over. As digimer would say: "use fencing!"
Regards,
Klaus

2 nodes configured
3 resources configured

Online: [ sds1 sds2 ]

Full list of resources:

Master/Slave Set: pgsql-ha [pgsqld]
pgsqld (ocf::heartbeat:pgsqlms): FAILED Master sds2 (blocked)
Slaves: [ sds1 ]
Resource Group: mastergroup
master-vip (ocf::heartbeat:IPaddr2): Started sds2

Failed Actions:
* pgsqld_stop_0 on sds2 'invalid parameter' (2): call=42, status=complete, exitreason='PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists', last-rc-change='Mon May 7 00:39:06 2018', queued=1ms, exec=72ms

Here is the sds2 log:

May 7 00:38:46 node2 pgsqlms(pgsqld)[14000]: INFO: Execute action monitor and the result 8
May 7 00:38:56 node2 pgsqlms(pgsqld)[14077]: INFO: Execute action monitor and the result 8
May 7 00:39:06 node2 pgsqlms(pgsqld)[14152]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_monitor_1:14152:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
May 7 00:39:06 node2 crmd[1129]: notice: sds2-pgsqld_monitor_1:36 [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists\n ]
May 7 00:39:06 node2 pgsqlms(pgsqld)[14162]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_notify_0:14162:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
May 7 00:39:06 node2 crmd[1129]: notice: Result of notify operation for pgsqld on sds2: 0 (ok)
May 7 00:39:06 node2 crmd[1129]: notice: sds2-pgsqld_monitor_1:36 [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists\n ]
May 7 00:39:06 node2 pgsqlms(pgsqld)[14172]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_demote_0:14172:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
May 7 00:39:06 node2 crmd[1129]: notice: Result of demote operation for pgsqld on sds2: 2 (invalid parameter)
May 7 00:39:06 node2 crmd[1129]: notice: sds2-pgsqld_demote_0:39 [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists\n ]
May 7 00:39:06 node2 pgsqlms(pgsqld)[14182]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_notify_0:14182:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
May 7 00:39:06 node2 crmd[1129]: notice: Result of notify operation for pgsqld on sds2: 0 (ok)
May 7 00:39:06 node2 pgsqlms(pgsqld)[14192]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_notify_0:14192:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
May 7 00:39:06 node2 crmd[1129]: notice: Result of notify operation for pgsqld on sds2: 0 (ok)
May 7 00:39:06 node2 pgsqlms(pgsqld)[14202]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_stop_0:14202:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
May 7 00:39:06 node2 crmd[11
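Klaus's answer is "use fencing", but with fencing ruled out, recovery has to be manual. A hedged sketch, not from the thread: rebuild the missing PGDATA from the surviving instance, then clear the failed (blocked) stop so Pacemaker re-probes the resource. `pg_basebackup` and `pcs resource cleanup` are standard tools; the replication role name is a placeholder:

```shell
# Re-clone the data directory from the node that still has it (sds1).
# "repuser" is a hypothetical replication role; adjust to the real setup.
pg_basebackup -h sds1 -U repuser -X stream \
    -D /home/highgo/highgo/database/4.3.1/data

# Forget the failed pgsqld_stop_0 so the cluster re-evaluates the resource.
pcs resource cleanup pgsqld
```

Only run the cleanup after the underlying problem is fixed; clearing the failure with PGDATA still missing would simply reproduce the blocked stop.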
[ClusterLabs] "pcs cluster stop --all" hangs
Hi,

When I run "pcs cluster stop --all", it sometimes hangs and there is no response at all. The log is below. Can we find from the log the reason why it hangs, and how can we make the cluster stop immediately?

[root@node2 pg_log]# pcs status
Cluster name: hgpurog
Stack: corosync
Current DC: sds1 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Fri May 11 01:11:26 2018
Last change: Fri May 11 01:09:24 2018 by hacluster via crmd on sds1

2 nodes configured
3 resources configured

Online: [ sds1 sds2 ]

Full list of resources:

Master/Slave Set: pgsql-ha [pgsqld]
Stopped: [ sds1 sds2 ]
Resource Group: mastergroup
master-vip (ocf::heartbeat:IPaddr2): Started sds1

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node2 pg_log]# pcs cluster stop --all

The /var/log/messages is as below:

May 11 01:07:50 node2 crmd[5365]: notice: State transition S_PENDING -> S_NOT_DC
May 11 01:07:50 node2 crmd[5365]: notice: State transition S_NOT_DC -> S_PENDING
May 11 01:07:50 node2 crmd[5365]: notice: State transition S_PENDING -> S_NOT_DC
May 11 01:07:51 node2 pgsqlms(pgsqld)[5371]: INFO: Execute action monitor and the result 7
May 11 01:07:51 node2 pgsqlms(undef)[5408]: INFO: Execute action meta-data and the result 0
May 11 01:07:51 node2 crmd[5365]: notice: Result of probe operation for pgsqld on sds2: 7 (not running)
May 11 01:07:51 node2 crmd[5365]: notice: sds2-pgsqld_monitor_0:6 [ /tmp:5866 - no response\n ]
May 11 01:07:51 node2 crmd[5365]: notice: Result of probe operation for master-vip on sds2: 7 (not running)
May 11 01:10:02 node2 systemd: Started Session 16 of user root.
May 11 01:10:02 node2 systemd: Starting Session 16 of user root.
May 11 01:11:33 node2 pacemakerd[5357]: notice: Caught 'Terminated' signal
May 11 01:11:33 node2 systemd: Stopping Pacemaker High Availability Cluster Manager...
May 11 01:11:33 node2 pacemakerd[5357]: notice: Shutting down Pacemaker
May 11 01:11:33 node2 pacemakerd[5357]: notice: Stopping crmd
May 11 01:11:33 node2 crmd[5365]: notice: Caught 'Terminated' signal
May 11 01:11:33 node2 crmd[5365]: notice: Shutting down cluster resource manager
May 11 01:12:49 node2 systemd: Started Session 17 of user root.
May 11 01:12:49 node2 systemd-logind: New session 17 of user root.
May 11 01:12:49 node2 gdm-launch-environment]: AccountsService: ActUserManager: user (null) has no username (object path: /org/freedesktop/Accounts/User0, uid: 0)
May 11 01:12:49 node2 journal: ActUserManager: user (null) has no username (object path: /org/freedesktop/Accounts/User0, uid: 0)
May 11 01:12:49 node2 systemd: Starting Session 17 of user root.
May 11 01:12:49 node2 dbus[648]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
May 11 01:12:49 node2 dbus-daemon: dbus[648]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
May 11 01:12:49 node2 dbus[648]: [system] Successfully activated service 'org.freedesktop.problems'
May 11 01:12:49 node2 dbus-daemon: dbus[648]: [system] Successfully activated service 'org.freedesktop.problems'
May 11 01:12:49 node2 journal: g_dbus_interface_skeleton_unexport: assertion 'interface_->priv->connections != NULL' failed

Here is the log on the peer node:

May 11 01:09:08 node1 pgsqlms(pgsqld)[28599]: WARNING: No secondary connected to the master
May 11 01:09:08 node1 pgsqlms(pgsqld)[28599]: WARNING: "sds2" is not connected to the primary
May 11 01:09:08 node1 pgsqlms(pgsqld)[28599]: INFO: Execute action monitor and the result 8
May 11 01:09:18 node1 pgsqlms(pgsqld)[28679]: WARNING: No secondary connected to the master
May 11 01:09:18 node1 pgsqlms(pgsqld)[28679]: WARNING: "sds2" is not connected to the primary
May 11 01:09:18 node1 pgsqlms(pgsqld)[28679]: INFO: Execute action monitor and the result 8
May 11 01:09:24 node1 crmd[]: notice: sds1-pgsqld_monitor_1:19 [ /tmp:5866 - accepting connections\n ]
May 11 01:09:24 node1 crmd[]: notice: Transition aborted by deletion of lrm_resource[@id='pgsqld']: Resource state removal
May 11 01:10:02 node1 systemd: Started Session 17 of user root.
May 11 01:10:02 node1 systemd: Starting Session 17 of user root.
May 11 01:11:33 node1 pacemakerd[1042]: notice: Caught 'Terminated' signal
May 11 01:11:33 node1 systemd: Stopping Pacemaker High Availability Cluster Manager...
May 11 01:11:33 node1 pacemakerd[1042]: notice: Shutting down Pacemaker
May 11 01:11:33 node1 pacemakerd[1042]: notice: Stopping crmd
May 11 01:11:33 node1 crmd[]: notice: Caught 'Terminated' signal
May 11 01:11:33 node1 crmd[]: notice: Shutting down cluster resource manager
May 11 01:11:33 node1 crmd[]: warning: Input I_SHUTDOWN received in state S_TRANSITION_ENGINE from crm_shutdown
[ClusterLabs] How to change the "pcs constraint colocation set"
Hi,

We have two VIP resources and we use the following command to keep them on different nodes:

pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 setoptions score=-1000

Now we have added a new node to the cluster, and we add a new VIP too. We want the colocation set to become:

pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000

How should we change the constraint set?

Thanks
[ClusterLabs] Re: How to change the "pcs constraint colocation set"
Thank you, Tomas. I know how to remove a colocation constraint with "pcs constraint colocation remove". Is there a command to delete a constraint colocation set?

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Tomas Jelinek
Sent: May 15, 2018 15:42
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] How to change the "pcs constraint colocation set"

On 15.5.2018 05:25, 范国腾 wrote:
> Hi,
>
> We have two VIP resources and we use the following command to keep them on different nodes:
>
> pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 setoptions score=-1000
>
> Now we add a new node into the cluster and we add a new VIP too. We want the constraint colocation set to become:
>
> pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000
>
> How should we change the constraint set?
>
> Thanks

Hi,

pcs provides no commands for editing existing constraints. You can create a new constraint and remove the old one. If you want to do it as a single change from pacemaker's point of view, follow this procedure:

[root@node1:~]# pcs cluster cib cib1.xml
[root@node1:~]# cp cib1.xml cib2.xml
[root@node1:~]# pcs -f cib2.xml constraint list --full
Location Constraints:
Ordering Constraints:
Colocation Constraints:
  Resource Sets:
    set pgsql-slave-ip1 pgsql-slave-ip2 (id:pcs_rsc_set_pgsql-slave-ip1_pgsql-slave-ip2) setoptions score=-1000 (id:pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2)
Ticket Constraints:
[root@node1:~]# pcs -f cib2.xml constraint remove pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2
[root@node1:~]# pcs -f cib2.xml constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000
[root@node1:~]# pcs cluster cib-push cib2.xml diff-against=cib1.xml
CIB updated

Pcs older than 0.9.156 does not support the diff-against option; you can do it like this:

[root@node1:~]# pcs cluster cib cib.xml
[root@node1:~]# pcs -f cib.xml constraint list --full
Location Constraints:
Ordering Constraints:
Colocation Constraints:
  Resource Sets:
    set pgsql-slave-ip1 pgsql-slave-ip2 (id:pcs_rsc_set_pgsql-slave-ip1_pgsql-slave-ip2) setoptions score=-1000 (id:pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2)
Ticket Constraints:
[root@node1:~]# pcs -f cib.xml constraint remove pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2
[root@node1:~]# pcs -f cib.xml constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000
[root@node1:~]# pcs cluster cib-push cib.xml
CIB updated

Regards,
Tomas
[ClusterLabs] Re: Re: How to change the "pcs constraint colocation set"
It could not find the id of the constraint set.

[root@node1 ~]# pcs constraint colocation --full
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY) (id:colocation-clvmd-clone-dlm-clone-INFINITY)
  pgsql-master-ip with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-pgsql-master-ip-pgsql-ha-INFINITY)
  pgsql-slave-ip2 with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Slave) (id:colocation-pgsql-slave-ip2-pgsql-ha-INFINITY)
  pgsql-slave-ip3 with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Slave) (id:colocation-pgsql-slave-ip3-pgsql-ha-INFINITY)
  Resource Sets:
    set pgsql-slave-ip2 (id:pcs_rsc_set_pgsql-slave-ip2) setoptions score=-1000 (id:pcs_rsc_colocation_set_pgsql-slave-ip2)
    set pgsql-slave-ip2 pgsql-slave-ip3 (id:pcs_rsc_set_pgsql-slave-ip2_pgsql-slave-ip3) setoptions score=-1000 (id:pcs_rsc_colocation_set_pgsql-slave-ip2_pgsql-slave-ip3)
    set pgsql-slave-ip2 pgsql-slave-ip3 (id:pcs_rsc_set_pgsql-slave-ip2_pgsql-slave-ip3-1) setoptions score=-INFINITY (id:pcs_rsc_colocation_set_pgsql-slave-ip2_pgsql-slave-ip3-1)
[root@node1 ~]# pcs constraint remove pcs_rsc_set_pgsql-slave-ip2
Error: Unable to find constraint - 'pcs_rsc_set_pgsql-slave-ip2'
[root@node1 ~]# pcs constraint remove pcs_rsc_set_pgsql-slave-ip2_pgsql-slave-ip3
Error: Unable to find constraint - 'pcs_rsc_set_pgsql-slave-ip2_pgsql-slave-ip3'
[root@node1 ~]#

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Tomas Jelinek
Sent: May 15, 2018 16:12
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Re: How to change the "pcs constraint colocation set"

On 15.5.2018 10:02, 范国腾 wrote:
> Thank you, Tomas. I know how to remove a colocation constraint with "pcs constraint colocation remove". Is there a command to delete a constraint colocation set?

There is "pcs constraint remove <constraint id>". To get a constraint id, run "pcs constraint colocation --full" and find the constraint you want to remove.
>
> -----Original Message-----
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Tomas Jelinek
> Sent: May 15, 2018 15:42
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] How to change the "pcs constraint colocation set"
>
> On 15.5.2018 05:25, 范国腾 wrote:
>> Hi,
>>
>> We have two VIP resources and we use the following command to make them in different nodes.
>>
>> pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 setoptions score=-1000
>>
>> Now we add a new node into the cluster and we add a new VIP too. We want the constraint colocation set to change to be:
>>
>> pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000
>>
>> How should we change the constraint set?
>>
>> Thanks
>
> Hi,
>
> pcs provides no commands for editing existing constraints. You can create a new constraint and remove the old one. If you want to do it as a single change from pacemaker's point of view, follow this procedure:
>
> [root@node1:~]# pcs cluster cib cib1.xml
> [root@node1:~]# cp cib1.xml cib2.xml
> [root@node1:~]# pcs -f cib2.xml constraint list --full
> Location Constraints:
> Ordering Constraints:
> Colocation Constraints:
>   Resource Sets:
>     set pgsql-slave-ip1 pgsql-slave-ip2 (id:pcs_rsc_set_pgsql-slave-ip1_pgsql-slave-ip2) setoptions score=-1000 (id:pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2)
> Ticket Constraints:
> [root@node1:~]# pcs -f cib2.xml constraint remove pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2
> [root@node1:~]# pcs -f cib2.xml constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000
> [root@node1:~]# pcs cluster cib-push cib2.xml diff-against=cib1.xml
> CIB updated
>
> Pcs older than 0.9.156 does not support the diff-against option, you can do it like this:
>
> [root@node1:~]# pcs cluster cib cib.xml
> [root@node1:~]# pcs -f cib.xml constraint list --full
> Location Constraints:
> Ordering Constraints:
> Colocation Constraints:
>   Resource Sets:
>     set pgsql-slave-ip1 pgsql-slave-ip2 (id:pcs_rsc_set_pgsql-slave-ip1_pgsql-slave-ip2) setoptions score=-1000 (id:pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2)
> Ticket Constraints:
> [root@node1:~]# pcs -f cib.xml constraint remove pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2
> [root@node1:~]# pcs -f cib.xml constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000
> [root@node1:~]# pcs cluster cib-push cib.xml
> CIB updated
>
> Regards,
> Tomas
[ClusterLabs] Re: Re: How to change the "pcs constraint colocation set"
Sorry, my mistake. I should have used the second id. It is OK now. Thanks, Tomas.

-----Original Message-----
From: 范国腾
Sent: May 15, 2018 16:19
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Re: How to change the "pcs constraint colocation set"

It could not find the id of the constraint set.

[root@node1 ~]# pcs constraint colocation --full
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY) (id:colocation-clvmd-clone-dlm-clone-INFINITY)
  pgsql-master-ip with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-pgsql-master-ip-pgsql-ha-INFINITY)
  pgsql-slave-ip2 with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Slave) (id:colocation-pgsql-slave-ip2-pgsql-ha-INFINITY)
  pgsql-slave-ip3 with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Slave) (id:colocation-pgsql-slave-ip3-pgsql-ha-INFINITY)
  Resource Sets:
    set pgsql-slave-ip2 (id:pcs_rsc_set_pgsql-slave-ip2) setoptions score=-1000 (id:pcs_rsc_colocation_set_pgsql-slave-ip2)
    set pgsql-slave-ip2 pgsql-slave-ip3 (id:pcs_rsc_set_pgsql-slave-ip2_pgsql-slave-ip3) setoptions score=-1000 (id:pcs_rsc_colocation_set_pgsql-slave-ip2_pgsql-slave-ip3)
    set pgsql-slave-ip2 pgsql-slave-ip3 (id:pcs_rsc_set_pgsql-slave-ip2_pgsql-slave-ip3-1) setoptions score=-INFINITY (id:pcs_rsc_colocation_set_pgsql-slave-ip2_pgsql-slave-ip3-1)
[root@node1 ~]# pcs constraint remove pcs_rsc_set_pgsql-slave-ip2
Error: Unable to find constraint - 'pcs_rsc_set_pgsql-slave-ip2'
[root@node1 ~]# pcs constraint remove pcs_rsc_set_pgsql-slave-ip2_pgsql-slave-ip3
Error: Unable to find constraint - 'pcs_rsc_set_pgsql-slave-ip2_pgsql-slave-ip3'
[root@node1 ~]#

-----Original Message-----
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Tomas Jelinek
Sent: May 15, 2018 16:12
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Re: How to change the "pcs constraint colocation set"

On 15.5.2018 10:02, 范国腾 wrote:
> Thank you, Tomas. I know how to remove a colocation constraint with "pcs constraint colocation remove". Is there a command to delete a constraint colocation set?

There is "pcs constraint remove <constraint id>". To get a constraint id, run "pcs constraint colocation --full" and find the constraint you want to remove.

> -----Original Message-----
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Tomas Jelinek
> Sent: May 15, 2018 15:42
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] How to change the "pcs constraint colocation set"
>
> On 15.5.2018 05:25, 范国腾 wrote:
>> Hi,
>>
>> We have two VIP resources and we use the following command to make them in different nodes.
>>
>> pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 setoptions score=-1000
>>
>> Now we add a new node into the cluster and we add a new VIP too. We want the constraint colocation set to change to be:
>>
>> pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000
>>
>> How should we change the constraint set?
>>
>> Thanks
>
> Hi,
>
> pcs provides no commands for editing existing constraints. You can create a new constraint and remove the old one. If you want to do it as a single change from pacemaker's point of view, follow this procedure:
>
> [root@node1:~]# pcs cluster cib cib1.xml
> [root@node1:~]# cp cib1.xml cib2.xml
> [root@node1:~]# pcs -f cib2.xml constraint list --full
> Location Constraints:
> Ordering Constraints:
> Colocation Constraints:
>   Resource Sets:
>     set pgsql-slave-ip1 pgsql-slave-ip2 (id:pcs_rsc_set_pgsql-slave-ip1_pgsql-slave-ip2) setoptions score=-1000 (id:pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2)
> Ticket Constraints:
> [root@node1:~]# pcs -f cib2.xml constraint remove pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2
> [root@node1:~]# pcs -f cib2.xml constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000
> [root@node1:~]# pcs cluster cib-push cib2.xml diff-against=cib1.xml
> CIB updated
>
> Pcs older than 0.9.156 does not support the diff-against option, you can do it like this:
>
> [root@node1:~]# pcs cluster cib cib.xml
> [root@node1:~]# pcs -f cib.xml constraint list --full
> Location Constraints:
> Ordering Constraints:
> Colocation Constraints:
>   Resource Sets:
>     set pgsql-slave-ip1 pgsql-slave-ip2 (id:pcs_rsc_set_pgsql-slave-ip1_pgsql-slave-ip2) setoptions score=-1000 (id:pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2)
> Ticket Constraints:
> [root@node1:~]# pcs -f cib.xml constraint remove pcs_rsc_colocation_set_pgsql-slave-ip1_pgsql-slave-ip2
> [root@node1:~]# pcs -f cib.xml constraint colocation
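The resolution of this thread comes down to one detail: `pcs constraint remove` takes the colocation constraint's own id (the second id printed on each set line of the listing), not the inner resource-set id. A minimal recap using the ids from the thread:

```shell
# Works: the colocation constraint id (the second id in the listing).
pcs constraint remove pcs_rsc_colocation_set_pgsql-slave-ip2_pgsql-slave-ip3

# Fails with "Error: Unable to find constraint": the resource-set id
# (the first id in the listing) names the set, not the constraint.
# pcs constraint remove pcs_rsc_set_pgsql-slave-ip2_pgsql-slave-ip3
```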
[ClusterLabs] How to make PAF use psql to login with password
Hi,

We use PAF (https://dalibo.github.io/PAF/) to manage postgresql. According to the user's requirement, we cannot use trust mode in the pg_hba.conf file. So when running psql, it asks for the password and we have to enter it manually, and "pcs status" shows the following error:

* pgsqld_stop_0 on node1-pri 'unknown error' (1): call=34, status=complete, exitreason='Unexpected state for instance "pgsqld" (returned 1)', last-rc-change='Wed Mar 6 09:09:46 2019', queued=0ms, exec=504ms

The cause of the error is that PAF (/usr/lib/ocf/resource.d/heartbeat/pgsqlms) asks for the password, and we cannot pass the password to the psql command in the PAF script:

exec $PGPSQL, '--set', 'ON_ERROR_STOP=1', '-qXAtf', $tmpfile, '-R', $RS, '-F', $FS, '--port', $pgport, '--host', $pghost, '--username', 'sysdba',

Is there any way for us to pass the password to the psql command in the PAF script? We have tried adding "export PGPASSWORD=123456" in /etc/profile and it does not work.

Thanks
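One standard PostgreSQL mechanism (not from the thread, and untested against PAF) is a ~/.pgpass password file in the home directory of the system user that the resource agent runs psql as; libpq reads it automatically, so no script change or environment export is needed. The port 5866, user sysdba, and password 123456 below are taken from the thread and are assumptions about the actual setup:

```shell
# Sketch: create a libpq password file for the user that runs pgsqlms.
# Line format is host:port:database:user:password; '*' is a wildcard.
PGPASSFILE="$HOME/.pgpass"
echo '*:5866:*:sysdba:123456' > "$PGPASSFILE"
chmod 600 "$PGPASSFILE"   # libpq silently ignores the file unless it is 0600
```

This would also explain why `export PGPASSWORD` in /etc/profile did not help: the agent is started by Pacemaker, not by a login shell, so /etc/profile is never sourced in its environment.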