Re: [ClusterLabs] Re: No slave is promoted to be master
On Wed, 2018-04-18 at 02:33, 范国腾 wrote:
> Thank you very much, Rorthais,
>
> I see now. I have two more questions.
>
> 1. If I change the "cluster-recheck-interval" parameter from the
> default 15 minutes to 10 seconds, is there any bad impact? Could this
> be a workaround?

Personally I wouldn't lower it below 5 minutes. The policy engine will
run at least this often, so setting it very low means the PE will run
continuously, putting unnecessary CPU and I/O load on the cluster.

> 2. This issue happens only in the following configuration.
>
> But it does not happen in the following configuration. Why is the
> behavior different?

I'm not sure. I don't know of any differences in those two versions (or
two-node vs. three-node) that would affect this issue.

> -----Original Message-----
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> Sent: 17 Apr 2018 17:47
> To: 范国腾
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] No slave is promoted to be master
>
> On Tue, 17 Apr 2018 04:16:38, 范国腾 wrote:
>
> > I check the status again. It is not that it is never promoted; it is
> > promoted about 15 minutes after the cluster starts.
> >
> > I tried in three labs and the results are the same: the promotion
> > happens 15 minutes after the cluster starts.
> >
> > Why is there an approximately 15-minute delay every time?
>
> This was a bug in Pacemaker up to 1.1.17. I reported it last August
> and Ken Gaillot fixed it a few days later in 1.1.18. See:
>
> https://lists.clusterlabs.org/pipermail/developers/2017-August/001110.html
> https://lists.clusterlabs.org/pipermail/developers/2017-September/001113.html
>
> I wonder if disabling the pgsql resource before shutting down the
> cluster might be a simpler and safer workaround.
> Eg.:
>
>   pcs resource disable pgsql-ha --wait
>   pcs cluster stop --all
>
> and
>
>   pcs cluster start --all
>   pcs resource enable pgsql-ha
>
> Another fix would be to force a master score on one node **if
> needed** using:
>
>   crm_master -N <nodename> -r <resource> -l forever -v 1

-- 
Ken Gaillot

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
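[Editor's sketch, not part of the original thread: the property name below is real Pacemaker/pcs syntax, but treat the exact commands as an illustration of Ken's "not below 5 minutes" advice rather than a tested recipe.]

```shell
# Keep the policy-engine recheck interval at the suggested 5-minute
# floor instead of dropping it to 10 seconds:
pcs property set cluster-recheck-interval=5min

# Read the effective value back from the CIB; -G/--query reads
# instead of writes:
crm_attribute --type crm_config --name cluster-recheck-interval --query
```

These commands must run on a node of a live cluster; they have no effect elsewhere.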
[ClusterLabs] Re: No slave is promoted to be master
Thank you very much, Rorthais,

I see now. I have two more questions.

1. If I change the "cluster-recheck-interval" parameter from the default
15 minutes to 10 seconds, is there any bad impact? Could this be a
workaround?

2. This issue happens only in the following configuration.

[image attachment]

But it does not happen in the following configuration. Why is the
behavior different?

[image attachment]

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: 17 Apr 2018 17:47
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] No slave is promoted to be master

On Tue, 17 Apr 2018 04:16:38, 范国腾 wrote:

> I check the status again. It is not that it is never promoted; it is
> promoted about 15 minutes after the cluster starts.
>
> I tried in three labs and the results are the same: the promotion
> happens 15 minutes after the cluster starts.
>
> Why is there an approximately 15-minute delay every time?

This was a bug in Pacemaker up to 1.1.17. I reported it last August and
Ken Gaillot fixed it a few days later in 1.1.18. See:

https://lists.clusterlabs.org/pipermail/developers/2017-August/001110.html
https://lists.clusterlabs.org/pipermail/developers/2017-September/001113.html

I wonder if disabling the pgsql resource before shutting down the
cluster might be a simpler and safer workaround. Eg.:

  pcs resource disable pgsql-ha --wait
  pcs cluster stop --all

and

  pcs cluster start --all
  pcs resource enable pgsql-ha

Another fix would be to force a master score on one node **if needed**
using:

  crm_master -N <nodename> -r <resource> -l forever -v 1
Re: [ClusterLabs] Re: No slave is promoted to be master
Sent from my iPhone

> On 17 Apr 2018, at 7:16, 范国腾 wrote:
>
> I check the status again. It is not that it is never promoted; it is
> promoted about 15 minutes after the cluster starts.
>
> I tried in three labs and the results are the same: the promotion
> happens 15 minutes after the cluster starts.
>
> Why is there an approximately 15-minute delay every time?

That rings a bell. 15 minutes is the default interval for re-evaluation
of time-based rules; my understanding so far is that this timer also
picks up other configuration changes (basically, it runs the policy
engine to make decisions). I had a similar effect when I attempted to
change quorum state directly, without going via external node events.
So it looks like whatever sets the master scores does not trigger the
policy engine.

> Apr 16 22:08:32 node1 attrd[16618]: notice: Node sds1 state is now member
> Apr 16 22:08:32 node1 attrd[16618]: notice: Node sds2 state is now member
>
> ..
>
> Apr 16 22:21:36 node1 pgsqlms(pgsqld)[18230]: INFO: Execute action monitor and the result 0
> Apr 16 22:21:52 node1 pgsqlms(pgsqld)[18257]: INFO: Execute action monitor and the result 0
> Apr 16 22:22:09 node1 pgsqlms(pgsqld)[18296]: INFO: Execute action monitor and the result 0
> Apr 16 22:22:25 node1 pgsqlms(pgsqld)[18315]: INFO: Execute action monitor and the result 0
> Apr 16 22:22:41 node1 pgsqlms(pgsqld)[18343]: INFO: Execute action monitor and the result 0
> Apr 16 22:22:57 node1 pgsqlms(pgsqld)[18362]: INFO: Execute action monitor and the result 0
> Apr 16 22:23:13 node1 pgsqlms(pgsqld)[18402]: INFO: Execute action monitor and the result 0
> Apr 16 22:23:29 node1 pgsqlms(pgsqld)[18421]: INFO: Execute action monitor and the result 0
> Apr 16 22:23:45 node1 pgsqlms(pgsqld)[18449]: INFO: Execute action monitor and the result 0
> Apr 16 22:23:57 node1 crmd[16620]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> Apr 16 22:23:57 node1 pengine[16619]: notice: Promote pgsqld:0#011(Slave -> Master sds1)
> Apr 16 22:23:57 node1 pengine[16619]: notice: Start master-vip#011(sds1)
> Apr 16 22:23:57 node1 pengine[16619]: notice: Start pgsql-master-ip#011(sds1)
> Apr 16 22:23:57 node1 pengine[16619]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-18.bz2
> Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating cancel operation pgsqld_monitor_16000 locally on sds1
> Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 locally on sds1
> Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 on sds2
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Promoting instance on node "sds1"
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Current node TL#LSN: 4#117440512
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Execute action notify and the result 0
> Apr 16 22:23:58 node1 crmd[16620]: notice: Result of notify operation for pgsqld on sds1: 0 (ok)
> Apr 16 22:23:58 node1 crmd[16620]: notice: Initiating promote operation pgsqld_promote_0 locally on sds1
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18499]: INFO: Waiting for the promote to complete
> Apr 16 22:23:59 node1 pgsqlms(pgsqld)[18499]: INFO: Promote complete
>
> [root@node1 ~]# crm_simulate -sL
>
> Current cluster status:
> Online: [ sds1 sds2 ]
>
> Master/Slave Set: pgsql-ha [pgsqld]
>     Masters: [ sds1 ]
>     Slaves: [ sds2 ]
> Resource Group: mastergroup
>     master-vip (ocf::heartbeat:IPaddr2): Started sds1
>     pgsql-master-ip (ocf::heartbeat:IPaddr2): Started sds1
>
> Allocation scores:
> clone_color: pgsql-ha allocation score on sds1: 1
> clone_color: pgsql-ha allocation score on sds2: 1
> clone_color: pgsqld:0 allocation score on sds1: 1003
> clone_color: pgsqld:0 allocation score on sds2: 1
> clone_color: pgsqld:1 allocation score on sds1: 1
> clone_color: pgsqld:1 allocation score on sds2: 1002
> native_color: pgsqld:0 allocation score on sds1: 1003
> native_color: pgsqld:0 allocation score on sds2: 1
> native_color: pgsqld:1 allocation score on sds1: -INFINITY
> native_color: pgsqld:1 allocation score on sds2: 1002
> pgsqld:0 promotion score on sds1: 1002
> pgsqld:1 promotion score on sds2: 1001
> group_color: mastergroup allocation score on sds1: 0
> group_color: mastergroup allocation score on sds2: 0
> group_color: master-vip allocation score on sds1: 0
> group_color: master-vip allocation score on sds2: 0
> native_color: master-vip allocation score on sds1: 1003
> native_color: master-vip allocation score on sds2: -INFINITY
> native_color: pgsql-master-ip allocation score on sds1: 1003
> native_color: pgsql-master-ip allocation score on sds2: -INFINITY
>
> Transition Summary:
> [root@node1 ~]#
>
> You could reproduce the issue on two nodes by executing the following
> command, then running "pcs cluster stop --all" and "pcs cluster start
> --all".
>
> pcs
[ClusterLabs] Re: No slave is promoted to be master
I check the status again. It is not that it is never promoted; it is
promoted about 15 minutes after the cluster starts.

I tried in three labs and the results are the same: the promotion
happens 15 minutes after the cluster starts.

Why is there an approximately 15-minute delay every time?

Apr 16 22:08:32 node1 attrd[16618]: notice: Node sds1 state is now member
Apr 16 22:08:32 node1 attrd[16618]: notice: Node sds2 state is now member

..

Apr 16 22:21:36 node1 pgsqlms(pgsqld)[18230]: INFO: Execute action monitor and the result 0
Apr 16 22:21:52 node1 pgsqlms(pgsqld)[18257]: INFO: Execute action monitor and the result 0
Apr 16 22:22:09 node1 pgsqlms(pgsqld)[18296]: INFO: Execute action monitor and the result 0
Apr 16 22:22:25 node1 pgsqlms(pgsqld)[18315]: INFO: Execute action monitor and the result 0
Apr 16 22:22:41 node1 pgsqlms(pgsqld)[18343]: INFO: Execute action monitor and the result 0
Apr 16 22:22:57 node1 pgsqlms(pgsqld)[18362]: INFO: Execute action monitor and the result 0
Apr 16 22:23:13 node1 pgsqlms(pgsqld)[18402]: INFO: Execute action monitor and the result 0
Apr 16 22:23:29 node1 pgsqlms(pgsqld)[18421]: INFO: Execute action monitor and the result 0
Apr 16 22:23:45 node1 pgsqlms(pgsqld)[18449]: INFO: Execute action monitor and the result 0
Apr 16 22:23:57 node1 crmd[16620]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 16 22:23:57 node1 pengine[16619]: notice: Promote pgsqld:0#011(Slave -> Master sds1)
Apr 16 22:23:57 node1 pengine[16619]: notice: Start master-vip#011(sds1)
Apr 16 22:23:57 node1 pengine[16619]: notice: Start pgsql-master-ip#011(sds1)
Apr 16 22:23:57 node1 pengine[16619]: notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-18.bz2
Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating cancel operation pgsqld_monitor_16000 locally on sds1
Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 locally on sds1
Apr 16 22:23:57 node1 crmd[16620]: notice: Initiating notify operation pgsqld_pre_notify_promote_0 on sds2
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Promoting instance on node "sds1"
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Current node TL#LSN: 4#117440512
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Execute action notify and the result 0
Apr 16 22:23:58 node1 crmd[16620]: notice: Result of notify operation for pgsqld on sds1: 0 (ok)
Apr 16 22:23:58 node1 crmd[16620]: notice: Initiating promote operation pgsqld_promote_0 locally on sds1
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18499]: INFO: Waiting for the promote to complete
Apr 16 22:23:59 node1 pgsqlms(pgsqld)[18499]: INFO: Promote complete

[root@node1 ~]# crm_simulate -sL

Current cluster status:
Online: [ sds1 sds2 ]

Master/Slave Set: pgsql-ha [pgsqld]
    Masters: [ sds1 ]
    Slaves: [ sds2 ]
Resource Group: mastergroup
    master-vip (ocf::heartbeat:IPaddr2): Started sds1
    pgsql-master-ip (ocf::heartbeat:IPaddr2): Started sds1

Allocation scores:
clone_color: pgsql-ha allocation score on sds1: 1
clone_color: pgsql-ha allocation score on sds2: 1
clone_color: pgsqld:0 allocation score on sds1: 1003
clone_color: pgsqld:0 allocation score on sds2: 1
clone_color: pgsqld:1 allocation score on sds1: 1
clone_color: pgsqld:1 allocation score on sds2: 1002
native_color: pgsqld:0 allocation score on sds1: 1003
native_color: pgsqld:0 allocation score on sds2: 1
native_color: pgsqld:1 allocation score on sds1: -INFINITY
native_color: pgsqld:1 allocation score on sds2: 1002
pgsqld:0 promotion score on sds1: 1002
pgsqld:1 promotion score on sds2: 1001
group_color: mastergroup allocation score on sds1: 0
group_color: mastergroup allocation score on sds2: 0
group_color: master-vip allocation score on sds1: 0
group_color: master-vip allocation score on sds2: 0
native_color: master-vip allocation score on sds1: 1003
native_color: master-vip allocation score on sds2: -INFINITY
native_color: pgsql-master-ip allocation score on sds1: 1003
native_color: pgsql-master-ip allocation score on sds2: -INFINITY

Transition Summary:
[root@node1 ~]#

You could reproduce the issue on two nodes by executing the following
commands, then running "pcs cluster stop --all" and "pcs cluster start
--all".

pcs resource create pgsqld ocf:heartbeat:pgsqlms \
    bindir=/home/highgo/highgo/database/4.3.1/bin \
    pgdata=/home/highgo/highgo/database/4.3.1/data \
    op start timeout=600s op stop timeout=60s \
    op promote timeout=300s op demote timeout=120s \
    op monitor interval=10s timeout=100s role="Master" \
    op monitor interval=16s timeout=100s role="Slave" \
    op notify timeout=60s
pcs resource master pgsql-ha pgsqld notify=true interleave=true

-----Original Message-----
From: 范国腾
Sent: 17 Apr 2018 10:25
To: 'Jehan-Guillaume de Rorthais'
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: [ClusterLabs] No slave is promoted to be master

Hi,

We install a new lab which
[ClusterLabs] Re: No slave is promoted to be master
Thank you, Rorthais. I see now.

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: 13 Apr 2018 17:17
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] No slave is promoted to be master

OK, I know what happened. It seems your standbys were not replicating
when the master "crashed"; you can find tons of messages like this in
the log files:

WARNING: No secondary connected to the master
WARNING: "db2" is not connected to the primary
WARNING: "db3" is not connected to the primary

When a standby is not replicating, the master sets a negative master
score on it to forbid promotion there, as it is probably lagging by some
undefined amount. The following command shows the scores just before the
simulated master crash:

$ crm_simulate -x pe-input-2039.bz2 -s | grep -E 'date|promotion'
Using the original execution date of: 2018-04-11 16:23:07Z
pgsqld:0 promotion score on db1: 1001
pgsqld:1 promotion score on db2: -1000
pgsqld:2 promotion score on db3: -1000

The "1001" score designates the master. Streaming standbys always have a
positive master score between 1000 and 1000-N*10, where N is the number
of connected standbys.

On Fri, 13 Apr 2018 01:37:54, 范国腾 wrote:

> The log is in the attachment.
>
> We introduced a bug in the PG code on the master node so that it could
> not be restarted any more, in order to test the following scenario:
> one slave should be promoted when the master crashes.
>
> -----Original Message-----
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> Sent: 12 Apr 2018 17:39
> To: 范国腾
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] No slave is promoted to be master
>
> Hi,
>
> On Thu, 12 Apr 2018 08:31:39, 范国腾 wrote:
>
> > Thank you very much for helping check this issue. The information is
> > in the attachment.
> >
> > I restarted the cluster after I sent my first email. Not sure if it
> > affects the checking of the result of "crm_simulate -sL".
>
> It does...
>
> Could you please provide the files
> from /var/lib/pacemaker/pengine/pe-input-2039.bz2 to pe-input-2065.bz2?
>
> [...]
>
> > Then the master is restarted and it could not start (that is OK and
> > we know the reason).
>
> Why couldn't it start?

-- 
Jehan-Guillaume de Rorthais
Dalibo
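[Editor's sketch, not part of the original thread: the per-node master scores Jehan-Guillaume describes can be inspected with crm_master. Node and resource names (db2, pgsqld) are taken from the thread; treat the exact invocations as an illustration and run them only on a test cluster.]

```shell
# Read the master score the pgsqlms agent set for standby db2;
# a negative value blocks promotion on that node:
crm_master --resource pgsqld --node db2 --query

# Once db2 is replicating again, a stale score could be cleared
# so the agent's next monitor sets a fresh one:
crm_master --resource pgsqld --node db2 --delete
```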
Re: [ClusterLabs] Re: No slave is promoted to be master
On Thu, 2018-04-12 at 07:29, 范国腾 wrote:
> Hello,
>
> We use the following command to create the cluster. Node2 is always
> the master when the cluster starts. Why does pacemaker not select
> node1 as the default master?
> How to configure it if we want node1 to be the default master?

You can specify a location constraint giving pgsqld's master role a
positive score (not INFINITY) on node1, or a negative score (not
-INFINITY) on node2. Using a non-infinite score in a constraint tells
pacemaker that it's a preference, but not a requirement.

However, there's rarely a good reason to do that. In HA, the best
practice is that all nodes should be completely interchangeable, so that
the service can run equally on any node (since it might have to, in a
failure scenario). Such constraints can be useful temporarily, e.g. if
you need to upgrade the software on one node, or if one node is
underperforming (perhaps it's waiting on a hardware upgrade to come in,
or running some one-time job consuming a lot of CPU).

> pcs cluster setup --name cluster_pgsql node1 node2
> pcs resource create pgsqld ocf:heartbeat:pgsqlms \
>     bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data \
>     op start timeout=600s op stop timeout=60s \
>     op promote timeout=300s op demote timeout=120s \
>     op monitor interval=15s timeout=100s role="Master" \
>     op monitor interval=16s timeout=100s role="Slave" \
>     op notify timeout=60s
> pcs resource master pgsql-ha pgsqld notify=true interleave=true
>
> Sometimes it reports the following error, how to configure to avoid
> it?

-- 
Ken Gaillot
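[Editor's sketch, not part of the original thread: one way to express the non-infinite master-role preference Ken describes, assuming pcs 0.9-era rule syntax; the score of 50 is an arbitrary illustrative value, and the backslash stops the shell from treating `#uname` as a comment.]

```shell
# Prefer node1 for the master role of pgsql-ha, as a preference only
# (a finite score, so pacemaker may still promote elsewhere on failure):
pcs constraint location pgsql-ha rule role=master score=50 \#uname eq node1
```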
[ClusterLabs] Re: No slave is promoted to be master
Hello,

We use the following command to create the cluster. Node2 is always the
master when the cluster starts. Why does pacemaker not select node1 as
the default master? How to configure it if we want node1 to be the
default master?

pcs cluster setup --name cluster_pgsql node1 node2
pcs resource create pgsqld ocf:heartbeat:pgsqlms \
    bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data \
    op start timeout=600s op stop timeout=60s \
    op promote timeout=300s op demote timeout=120s \
    op monitor interval=15s timeout=100s role="Master" \
    op monitor interval=16s timeout=100s role="Slave" \
    op notify timeout=60s
pcs resource master pgsql-ha pgsqld notify=true interleave=true

Sometimes it reports the following error, how to configure to avoid it?

[image attachment]