Re: [ClusterLabs] Re: No slave is promoted to be master

2018-04-18 Thread Ken Gaillot
On Wed, 2018-04-18 at 02:33 +, 范国腾 wrote:
> Thank you very much, Rorthais,
>  
> I see now. I have two more questions.
>  
> 1. If I change the "cluster-recheck-interval" parameter from the
> default 15 minutes to 10 seconds, is there any bad impact? Could this
> be a workaround?

Personally I wouldn't lower it below 5 minutes. The policy engine will
run at least this often, so setting it very low means the PE will run
continuously, putting unnecessary CPU and I/O load on the cluster.
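
If you do want a shorter interval, the property can be set with pcs; a sketch
(the 5-minute value below is illustrative, not a recommendation):

  pcs property set cluster-recheck-interval=5min

Keep in mind the interval is only an upper bound on how long the cluster
waits before re-running the policy engine; events such as node or resource
failures still trigger a run immediately.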

>  
> 2. This issue happens only in the following configuration.
> 
> But it does not happen in the following configuration. Why is the
> behavior different?

I'm not sure. I don't know of any differences in those two versions (or
two-node vs. three-node) that would affect this issue.

> -----Original Message-----
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> Sent: April 17, 2018 17:47
> To: 范国腾
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] No slave is promoted to be master
>  
> On Tue, 17 Apr 2018 04:16:38 +
> 范国腾  wrote:
>  
> > I checked the status again. It is not that it is never promoted; it is
> > promoted about 15 minutes after the cluster starts.
> >
> > I tried in three labs and the results are the same: the promotion
> > happens 15 minutes after the cluster starts.
> >
> > Why is there an ~15 minute delay every time?
>  
> This was a bug in Pacemaker up to 1.1.17. I did a report about this
> last August and Ken Gaillot fixed it a few days later in 1.1.18. See:
>  
> https://lists.clusterlabs.org/pipermail/developers/2017-August/001110.html
> https://lists.clusterlabs.org/pipermail/developers/2017-September/001113.html
>  
> I wonder if disabling the pgsql resource before shutting down the
> cluster might be a simpler and safer workaround. E.g.:
>  
> pcs resource disable pgsql-ha --wait
> pcs cluster stop --all
>  
> and
>  
> pcs cluster start --all
> pcs resource enable pgsql-ha
>  
> Another fix would be to force a master score on one node **if
> needed** using:
>  
>   crm_master -N <node> -r <resource> -l forever -v 1
>  
-- 
Ken Gaillot 


[ClusterLabs] Re: No slave is promoted to be master

2018-04-17 Thread 范国腾
Thank you very much, Rorthais,

I see now. I have two more questions.

1. If I change the "cluster-recheck-interval" parameter from the default 15
minutes to 10 seconds, is there any bad impact? Could this be a workaround?

2. This issue happens only in the following configuration.

[inline screenshot of the first configuration, not included in the archive]

But it does not happen in the following configuration. Why is the behavior
different?

[inline screenshot of the second configuration, not included in the archive]

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: April 17, 2018 17:47
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] No slave is promoted to be master

On Tue, 17 Apr 2018 04:16:38 +
范国腾 wrote:

> I checked the status again. It is not that it is never promoted; it is
> promoted about 15 minutes after the cluster starts.
>
> I tried in three labs and the results are the same: the promotion happens
> 15 minutes after the cluster starts.
>
> Why is there an ~15 minute delay every time?

This was a bug in Pacemaker up to 1.1.17. I did a report about this last
August and Ken Gaillot fixed it a few days later in 1.1.18. See:

https://lists.clusterlabs.org/pipermail/developers/2017-August/001110.html
https://lists.clusterlabs.org/pipermail/developers/2017-September/001113.html

I wonder if disabling the pgsql resource before shutting down the cluster
might be a simpler and safer workaround. E.g.:

  pcs resource disable pgsql-ha --wait
  pcs cluster stop --all

and

  pcs cluster start --all
  pcs resource enable pgsql-ha

Another fix would be to force a master score on one node **if needed** using:

  crm_master -N <node> -r <resource> -l forever -v 1




Re: [ClusterLabs] Re: No slave is promoted to be master

2018-04-16 Thread Andrei Borzenkov


Sent from my iPhone

> On 17 Apr 2018, at 7:16, 范国腾 wrote:
> 
> I checked the status again. It is not that it is never promoted; it is
> promoted about 15 minutes after the cluster starts.
> 
> I tried in three labs and the results are the same: the promotion happens
> 15 minutes after the cluster starts.
> 
> Why is there an ~15 minute delay every time?
> 

That rings a bell. 15 minutes is the default interval for re-evaluating
time-based rules; my understanding so far is that this timer also picks up
other configuration changes (basically, it runs the policy engine to make
decisions).

I had a similar effect when I attempted to change the quorum state directly,
without going via external node events.

So it looks like whatever sets the master scores does not trigger the policy
engine.
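
One way to see this on a live cluster (a sketch; the tool and property names
are standard Pacemaker ones, but the log path may differ per distribution):

  # print the recheck interval if it has been set explicitly (default 15min)
  crm_attribute --type crm_config --name cluster-recheck-interval --query

  # watch when the policy engine actually runs
  grep -E 'State transition|pengine' /var/log/messages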


> 
> Apr 16 22:08:32 node1 attrd[16618]:  notice: Node sds1 state is now member
> Apr 16 22:08:32 node1 attrd[16618]:  notice: Node sds2 state is now member
> 
> ..
> 
> Apr 16 22:21:36 node1 pgsqlms(pgsqld)[18230]: INFO: Execute action monitor 
> and the result 0
> Apr 16 22:21:52 node1 pgsqlms(pgsqld)[18257]: INFO: Execute action monitor 
> and the result 0
> Apr 16 22:22:09 node1 pgsqlms(pgsqld)[18296]: INFO: Execute action monitor 
> and the result 0
> Apr 16 22:22:25 node1 pgsqlms(pgsqld)[18315]: INFO: Execute action monitor 
> and the result 0
> Apr 16 22:22:41 node1 pgsqlms(pgsqld)[18343]: INFO: Execute action monitor 
> and the result 0
> Apr 16 22:22:57 node1 pgsqlms(pgsqld)[18362]: INFO: Execute action monitor 
> and the result 0
> Apr 16 22:23:13 node1 pgsqlms(pgsqld)[18402]: INFO: Execute action monitor 
> and the result 0
> Apr 16 22:23:29 node1 pgsqlms(pgsqld)[18421]: INFO: Execute action monitor 
> and the result 0
> Apr 16 22:23:45 node1 pgsqlms(pgsqld)[18449]: INFO: Execute action monitor 
> and the result 0
> Apr 16 22:23:57 node1 crmd[16620]:  notice: State transition S_IDLE -> 
> S_POLICY_ENGINE
> Apr 16 22:23:57 node1 pengine[16619]:  notice: Promote pgsqld:0#011(Slave -> 
> Master sds1)
> Apr 16 22:23:57 node1 pengine[16619]:  notice: Start   master-vip#011(sds1)
> Apr 16 22:23:57 node1 pengine[16619]:  notice: Start   
> pgsql-master-ip#011(sds1)
> Apr 16 22:23:57 node1 pengine[16619]:  notice: Calculated transition 1, 
> saving inputs in /var/lib/pacemaker/pengine/pe-input-18.bz2
> Apr 16 22:23:57 node1 crmd[16620]:  notice: Initiating cancel operation 
> pgsqld_monitor_16000 locally on sds1
> Apr 16 22:23:57 node1 crmd[16620]:  notice: Initiating notify operation 
> pgsqld_pre_notify_promote_0 locally on sds1
> Apr 16 22:23:57 node1 crmd[16620]:  notice: Initiating notify operation 
> pgsqld_pre_notify_promote_0 on sds2
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Promoting instance on 
> node "sds1"
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Current node TL#LSN: 
> 4#117440512
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Execute action notify and 
> the result 0
> Apr 16 22:23:58 node1 crmd[16620]:  notice: Result of notify operation for 
> pgsqld on sds1: 0 (ok)
> Apr 16 22:23:58 node1 crmd[16620]:  notice: Initiating promote operation 
> pgsqld_promote_0 locally on sds1
> Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18499]: INFO: Waiting for the promote 
> to complete
> Apr 16 22:23:59 node1 pgsqlms(pgsqld)[18499]: INFO: Promote complete
> 
> 
> 
> [root@node1 ~]# crm_simulate -sL
> 
> Current cluster status:
> Online: [ sds1 sds2 ]
> 
> Master/Slave Set: pgsql-ha [pgsqld]
> Masters: [ sds1 ]
> Slaves: [ sds2 ]
> Resource Group: mastergroup
> master-vip (ocf::heartbeat:IPaddr2):   Started sds1
> pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started sds1
> 
> Allocation scores:
> clone_color: pgsql-ha allocation score on sds1: 1
> clone_color: pgsql-ha allocation score on sds2: 1
> clone_color: pgsqld:0 allocation score on sds1: 1003
> clone_color: pgsqld:0 allocation score on sds2: 1
> clone_color: pgsqld:1 allocation score on sds1: 1
> clone_color: pgsqld:1 allocation score on sds2: 1002
> native_color: pgsqld:0 allocation score on sds1: 1003
> native_color: pgsqld:0 allocation score on sds2: 1
> native_color: pgsqld:1 allocation score on sds1: -INFINITY
> native_color: pgsqld:1 allocation score on sds2: 1002
> pgsqld:0 promotion score on sds1: 1002
> pgsqld:1 promotion score on sds2: 1001
> group_color: mastergroup allocation score on sds1: 0
> group_color: mastergroup allocation score on sds2: 0
> group_color: master-vip allocation score on sds1: 0
> group_color: master-vip allocation score on sds2: 0
> native_color: master-vip allocation score on sds1: 1003
> native_color: master-vip allocation score on sds2: -INFINITY
> native_color: pgsql-master-ip allocation score on sds1: 1003
> native_color: pgsql-master-ip allocation score on sds2: -INFINITY
> 
> Transition Summary:
> [root@node1 ~]#
> 
> You can reproduce the issue on two nodes by executing the following
> commands, then running "pcs cluster stop --all" and "pcs cluster start
> --all".
> 
> pcs 

[ClusterLabs] Re: No slave is promoted to be master

2018-04-16 Thread 范国腾
I checked the status again. It is not that it is never promoted; it is
promoted about 15 minutes after the cluster starts.

I tried in three labs and the results are the same: the promotion happens 15
minutes after the cluster starts.

Why is there an ~15 minute delay every time?


Apr 16 22:08:32 node1 attrd[16618]:  notice: Node sds1 state is now member
Apr 16 22:08:32 node1 attrd[16618]:  notice: Node sds2 state is now member

..

Apr 16 22:21:36 node1 pgsqlms(pgsqld)[18230]: INFO: Execute action monitor and 
the result 0
Apr 16 22:21:52 node1 pgsqlms(pgsqld)[18257]: INFO: Execute action monitor and 
the result 0
Apr 16 22:22:09 node1 pgsqlms(pgsqld)[18296]: INFO: Execute action monitor and 
the result 0
Apr 16 22:22:25 node1 pgsqlms(pgsqld)[18315]: INFO: Execute action monitor and 
the result 0
Apr 16 22:22:41 node1 pgsqlms(pgsqld)[18343]: INFO: Execute action monitor and 
the result 0
Apr 16 22:22:57 node1 pgsqlms(pgsqld)[18362]: INFO: Execute action monitor and 
the result 0
Apr 16 22:23:13 node1 pgsqlms(pgsqld)[18402]: INFO: Execute action monitor and 
the result 0
Apr 16 22:23:29 node1 pgsqlms(pgsqld)[18421]: INFO: Execute action monitor and 
the result 0
Apr 16 22:23:45 node1 pgsqlms(pgsqld)[18449]: INFO: Execute action monitor and 
the result 0
Apr 16 22:23:57 node1 crmd[16620]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
Apr 16 22:23:57 node1 pengine[16619]:  notice: Promote pgsqld:0#011(Slave -> 
Master sds1)
Apr 16 22:23:57 node1 pengine[16619]:  notice: Start   master-vip#011(sds1)
Apr 16 22:23:57 node1 pengine[16619]:  notice: Start   pgsql-master-ip#011(sds1)
Apr 16 22:23:57 node1 pengine[16619]:  notice: Calculated transition 1, saving 
inputs in /var/lib/pacemaker/pengine/pe-input-18.bz2
Apr 16 22:23:57 node1 crmd[16620]:  notice: Initiating cancel operation 
pgsqld_monitor_16000 locally on sds1
Apr 16 22:23:57 node1 crmd[16620]:  notice: Initiating notify operation 
pgsqld_pre_notify_promote_0 locally on sds1
Apr 16 22:23:57 node1 crmd[16620]:  notice: Initiating notify operation 
pgsqld_pre_notify_promote_0 on sds2
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Promoting instance on node 
"sds1"
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Current node TL#LSN: 
4#117440512
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18467]: INFO: Execute action notify and 
the result 0
Apr 16 22:23:58 node1 crmd[16620]:  notice: Result of notify operation for 
pgsqld on sds1: 0 (ok)
Apr 16 22:23:58 node1 crmd[16620]:  notice: Initiating promote operation 
pgsqld_promote_0 locally on sds1
Apr 16 22:23:58 node1 pgsqlms(pgsqld)[18499]: INFO: Waiting for the promote to 
complete
Apr 16 22:23:59 node1 pgsqlms(pgsqld)[18499]: INFO: Promote complete



[root@node1 ~]# crm_simulate -sL

Current cluster status:
Online: [ sds1 sds2 ]

 Master/Slave Set: pgsql-ha [pgsqld]
 Masters: [ sds1 ]
 Slaves: [ sds2 ]
 Resource Group: mastergroup
 master-vip (ocf::heartbeat:IPaddr2):   Started sds1
 pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started sds1

Allocation scores:
clone_color: pgsql-ha allocation score on sds1: 1
clone_color: pgsql-ha allocation score on sds2: 1
clone_color: pgsqld:0 allocation score on sds1: 1003
clone_color: pgsqld:0 allocation score on sds2: 1
clone_color: pgsqld:1 allocation score on sds1: 1
clone_color: pgsqld:1 allocation score on sds2: 1002
native_color: pgsqld:0 allocation score on sds1: 1003
native_color: pgsqld:0 allocation score on sds2: 1
native_color: pgsqld:1 allocation score on sds1: -INFINITY
native_color: pgsqld:1 allocation score on sds2: 1002
pgsqld:0 promotion score on sds1: 1002
pgsqld:1 promotion score on sds2: 1001
group_color: mastergroup allocation score on sds1: 0
group_color: mastergroup allocation score on sds2: 0
group_color: master-vip allocation score on sds1: 0
group_color: master-vip allocation score on sds2: 0
native_color: master-vip allocation score on sds1: 1003
native_color: master-vip allocation score on sds2: -INFINITY
native_color: pgsql-master-ip allocation score on sds1: 1003
native_color: pgsql-master-ip allocation score on sds2: -INFINITY

Transition Summary:
[root@node1 ~]#

You can reproduce the issue on two nodes by executing the following commands,
then running "pcs cluster stop --all" and "pcs cluster start --all".

pcs resource create pgsqld ocf:heartbeat:pgsqlms \
  bindir=/home/highgo/highgo/database/4.3.1/bin \
  pgdata=/home/highgo/highgo/database/4.3.1/data \
  op start timeout=600s op stop timeout=60s \
  op promote timeout=300s op demote timeout=120s \
  op monitor interval=10s timeout=100s role="Master" \
  op monitor interval=16s timeout=100s role="Slave" \
  op notify timeout=60s
pcs resource master pgsql-ha pgsqld notify=true interleave=true





-----Original Message-----
From: 范国腾
Sent: April 17, 2018 10:25
To: 'Jehan-Guillaume de Rorthais'
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: [ClusterLabs] No slave is promoted to be master

Hi,

We installed a new lab which 

[ClusterLabs] Re: No slave is promoted to be master

2018-04-15 Thread 范国腾
Thank you, Rorthais. I see now.

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
Sent: April 13, 2018 17:17
To: 范国腾
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] No slave is promoted to be master

OK, I know what happened.

It seems like your standbys were not replicating when the master "crashed";
you can find tons of messages like this in the log files:

  WARNING: No secondary connected to the master
  WARNING: "db2" is not connected to the primary
  WARNING: "db3" is not connected to the primary

When a standby is not replicating, the master sets a negative master score on
it to forbid promotion there, as it is probably lagging by some undefined
amount of time.

The following command shows the scores just before the simulated master crash:

  $ crm_simulate -x pe-input-2039.bz2 -s|grep -E 'date|promotion'
  Using the original execution date of: 2018-04-11 16:23:07Z
  pgsqld:0 promotion score on db1: 1001
  pgsqld:1 promotion score on db2: -1000
  pgsqld:2 promotion score on db3: -1000

"1001" score design the master. Streaming standbies always have a positive 
master score between 1000 and 1000-N*10 where N is the number of connected 
standbies.
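
On a live cluster the same scores can be read back without a pe-input file,
e.g. (a sketch reusing the node and resource names from this thread):

  crm_simulate -sL | grep promotion
  crm_master -G -N db2 -r pgsqld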



On Fri, 13 Apr 2018 01:37:54 +
范国腾  wrote:

> The log is in the attachment.
> 
> We introduced a bug in the PG code on the master node so that it can no
> longer be restarted, in order to test the following scenario: one slave
> should be promoted when the master crashes.
> 
> -----Original Message-----
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> Sent: April 12, 2018 17:39
> To: 范国腾
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] No slave is promoted to be master
> 
> Hi,
> On Thu, 12 Apr 2018 08:31:39 +
> 范国腾  wrote:
> 
> > Thank you very much for help check this issue. The information is in 
> > the attachment.
> > 
> > I restarted the cluster after I sent my first email. Not sure if it
> > affects the result of "crm_simulate -sL".
> 
> It does...
> 
> Could you please provide files
> from /var/lib/pacemaker/pengine/pe-input-2039.bz2 to  pe-input-2065.bz2 ?
> 
> [...]
> > Then the master is restarted and it could not start (that is OK and
> > we know the reason).
> 
> Why couldn't it start ?



--
Jehan-Guillaume de Rorthais
Dalibo


Re: [ClusterLabs] Re: No slave is promoted to be master

2018-04-12 Thread Ken Gaillot
On Thu, 2018-04-12 at 07:29 +, 范国腾 wrote:
> Hello,
>  
> We use the following command to create the cluster. Node2 is always
> the master when the cluster starts. Why does pacemaker not select
> node1 as the default master?
> How do we configure it if we want node1 to be the default master?

You can specify a location constraint giving pgsqld's master role a
positive score (not INFINITY) on node1, or a negative score (not
-INFINITY) on node2. Using a non-infinite score in a constraint tells
pacemaker that it's a preference, but not a requirement.

However, there's rarely a good reason to do that. In HA, the best
practice is that all nodes should be completely interchangeable, so
that the service can run equally on any node (since it might have to,
in a failure scenario). Such constraints can be useful temporarily,
e.g. if you need to upgrade the software on one node, or if one node is
underperforming (perhaps it's waiting on a hardware upgrade to come in,
or running some one-time job consuming a lot of CPU).
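
For completeness, a sketch of such a preference with pcs (the score of 50 and
the node name are illustrative; the rule form ties the score to the master
role rather than to the clone as a whole):

  pcs constraint location pgsql-ha rule role=master score=50 '#uname' eq node1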

>  
> pcs cluster setup --name cluster_pgsql node1 node2
> pcs resource create pgsqld ocf:heartbeat:pgsqlms
> bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data op start
> timeout=600s op stop timeout=60s op promote timeout=300s op demote
> timeout=120s op monitor interval=15s timeout=100s role="Master" op
> monitor interval=16s timeout=100s role="Slave" op notify
> timeout=60s;pcs resource master pgsql-ha pgsqld notify=true
> interleave=true;
>  
>  
> Sometimes it reports the following error; how do we configure the cluster
> to avoid it?
-- 
Ken Gaillot 


[ClusterLabs] Re: No slave is promoted to be master

2018-04-12 Thread 范国腾
Hello,

We use the following command to create the cluster. Node2 is always the master 
when the cluster starts. Why does pacemaker not select node1 as the default 
master?
How do we configure it if we want node1 to be the default master?

pcs cluster setup --name cluster_pgsql node1 node2
pcs resource create pgsqld ocf:heartbeat:pgsqlms \
  bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data \
  op start timeout=600s op stop timeout=60s \
  op promote timeout=300s op demote timeout=120s \
  op monitor interval=15s timeout=100s role="Master" \
  op monitor interval=16s timeout=100s role="Slave" \
  op notify timeout=60s
pcs resource master pgsql-ha pgsqld notify=true interleave=true


Sometimes it reports the following error; how do we configure the cluster to
avoid it?

[inline screenshot of the error message, not included in the archive]



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org