On 05/07/2018 08:52 AM, Ulrich Windl wrote:
> What about this: Configure fencing, then if everything works OK, try
> without fencing.
>
>>>> 范国腾 <fanguot...@highgo.com> wrote on 07.05.2018 at 08:54 in message
>>>> <177fb170fe264dbca52df5e25d27c...@ex01.highgo.com>:
>> Thank you, Klaus. There is no fencing device in our network, as the
>> requirements stand. Is there any other way to configure the cluster to
>> make it work?
You can consider SBD if there is no physical fencing device available.

This is a 2-node cluster, right? Thus watchdog-fencing alone wouldn't work
with SBD, and you would have to use either a shared disk, qdevice or a 3rd
node. (If you intend to use a single shared disk, sbd has to be reasonably
current, like 1.3.1.)

If you are not familiar with SBD you might have a look at:
http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit

The introductory part of my talk at the cluster summit 2017 might be
interesting as well to get an idea:
https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SBD_1.1.pdf

Actually, there is no way around some kind of fencing if you don't want to
intervene manually in case of problems. The cluster tries to get the
resources under control using the resource agents. If that fails - as in
your case - it has to have a way to tear down the misbehaving node. You
just wouldn't want the cluster to proceed on the 2nd node if it doesn't
know the state of the resource on the failing node.

Regards,
Klaus

>> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
>> Sent: May 7, 2018 14:40
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>>     <users@clusterlabs.org>; 范国腾 <fanguot...@highgo.com>
>> Subject: Re: [ClusterLabs] The slave does not promote to master
>>
>> On 05/07/2018 07:39 AM, 范国腾 wrote:
>>
>> Hi,
>>
>> We have a two-node cluster using PAF to manage postgres. Node2 is the
>> master:
>>
>>     Master/Slave Set: pgsql-ha [pgsqld]
>>         Master: [ sds2 ]
>>         Slaves: [ sds1 ]
>>
>> On the master node (sds2), I removed the data directory of postgres. I
>> expected the master node (sds2) to stop and the slave node (sds1) to be
>> promoted to master.
>>
>> The sds2 log shows that it executes monitor -> notify -> demote ->
>> notify -> stop. The sds1 log also shows "Promote pgsqld:0#011(Slave ->
>> Master sds1)". But "pcs status" shows the status like the following.
>> Could you please help check what prevents the promotion from happening
>> on sds1? What should I do if I want to recover the system?
>>
>> Didn't check all the details, but it looks as if stopping the resource
>> fails, so that the cluster doesn't know the state on sds2 and thus
>> can't promote on sds1.
>> If you had enabled fencing, this would lead to sds2 being fenced so
>> that sds1 can take over.
>>
>> As digimer would say: "use fencing!"
>>
>> Regards,
>> Klaus
>>
>> 2 nodes configured
>> 3 resources configured
>>
>> Online: [ sds1 sds2 ]
>>
>> Full list of resources:
>>
>>  Master/Slave Set: pgsql-ha [pgsqld]
>>      pgsqld (ocf::heartbeat:pgsqlms): FAILED Master sds2 (blocked)
>>      Slaves: [ sds1 ]
>>  Resource Group: mastergroup
>>      master-vip (ocf::heartbeat:IPaddr2): Started sds2
>>
>> Failed Actions:
>> * pgsqld_stop_0 on sds2 'invalid parameter' (2): call=42, status=complete, exitreason='PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists', last-rc-change='Mon May 7 00:39:06 2018', queued=1ms, exec=72ms
>>
>> Here is the sds2 log:
>>
>> May 7 00:38:46 node2 pgsqlms(pgsqld)[14000]: INFO: Execute action monitor and the result 8
>> May 7 00:38:56 node2 pgsqlms(pgsqld)[14077]: INFO: Execute action monitor and the result 8
>> May 7 00:39:06 node2 pgsqlms(pgsqld)[14152]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
>> May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_monitor_10000:14152:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
>> May 7 00:39:06 node2 crmd[1129]: notice: sds2-pgsqld_monitor_10000:36 [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists\n ]
>> May 7 00:39:06 node2 pgsqlms(pgsqld)[14162]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
>> May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_notify_0:14162:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
>> May 7 00:39:06 node2 crmd[1129]: notice: Result of notify operation for pgsqld on sds2: 0 (ok)
>> May 7 00:39:06 node2 crmd[1129]: notice: sds2-pgsqld_monitor_10000:36 [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists\n ]
>> May 7 00:39:06 node2 pgsqlms(pgsqld)[14172]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
>> May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_demote_0:14172:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
>> May 7 00:39:06 node2 crmd[1129]: notice: Result of demote operation for pgsqld on sds2: 2 (invalid parameter)
>> May 7 00:39:06 node2 crmd[1129]: notice: sds2-pgsqld_demote_0:39 [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists\n ]
>> May 7 00:39:06 node2 pgsqlms(pgsqld)[14182]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
>> May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_notify_0:14182:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
>> May 7 00:39:06 node2 crmd[1129]: notice: Result of notify operation for pgsqld on sds2: 0 (ok)
>> May 7 00:39:06 node2 pgsqlms(pgsqld)[14192]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
>> May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_notify_0:14192:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
>> May 7 00:39:06 node2 crmd[1129]: notice: Result of notify operation for pgsqld on sds2: 0 (ok)
>> May 7 00:39:06 node2 pgsqlms(pgsqld)[14202]: ERROR: PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists
>> May 7 00:39:06 node2 lrmd[1126]: notice: pgsqld_stop_0:14202:stderr [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists ]
>> May 7 00:39:06 node2 crmd[1129]: notice: Result of stop operation for pgsqld on sds2: 2 (invalid parameter)
>> May 7 00:39:06 node2 crmd[1129]: notice: sds2-pgsqld_stop_0:42 [ ocf-exit-reason:PGDATA "/home/highgo/highgo/database/4.3.1/data" does not exists\n ]
>> May 7 00:40:01 node2 systemd: Started Session 4 of user root.
>> May 7 00:40:01 node2 systemd: Starting Session 4 of user root.
>> May 7 00:47:21 node2 pacemakerd[1063]: notice: Caught 'Terminated' signal
>> May 7 00:47:21 node2 systemd: Stopping Pacemaker High Availability Cluster Manager...
>> May 7 00:47:21 node2 pacemakerd[1063]: notice: Shutting down Pacemaker
>> May 7 00:47:21 node2 pacemakerd[1063]: notice: Stopping crmd
>> May 7 00:47:21 node2 crmd[1129]: notice: Caught 'Terminated' signal
>> May 7 00:47:21 node2 crmd[1129]: notice: Shutting down cluster resource manager
>>
>> Here is the sds1 log (in the attachment):
>>
>> May 7 00:38:47 node1 pgsqlms(pgsqld)[4426]: INFO: Execute action monitor and the result 0
>> May 7 00:39:03 node1 pgsqlms(pgsqld)[4442]: INFO: Execute action monitor and the result 0
>> May 7 00:39:06 node1 crmd[1133]: notice: State transition S_IDLE -> S_POLICY_ENGINE
>> May 7 00:39:06 node1 pengine[1132]: warning: Processing failed op monitor for pgsqld:1 on sds2: invalid parameter (2)
>> May 7 00:39:06 node1 pengine[1132]: error: Preventing pgsql-ha from re-starting on sds2: operation monitor failed 'invalid parameter' (2)
>> May 7 00:39:06 node1 pengine[1132]: notice: Promote pgsqld:0#011(Slave -> Master sds1)
>> May 7 00:39:06 node1 pengine[1132]: notice: Demote pgsqld:1#011(Master -> Stopped sds2)
>> May 7 00:39:06 node1 pengine[1132]: notice: Move master-vip#011(Started sds2 -> sds1)
>> May 7 00:39:06 node1 pengine[1132]: notice: Calculated transition 31, saving inputs in /var/lib/pacemaker/pengine/pe-input-97.bz2
>> May 7 00:39:06 node1 pengine[1132]: warning: Processing failed op monitor for pgsqld:1 on sds2: invalid parameter (2)
>> May 7 00:39:06 node1 pengine[1132]: error: Preventing pgsql-ha from re-starting on sds2: operation monitor failed 'invalid parameter' (2)
>> May 7 00:39:06 node1 pengine[1132]: notice: Promote pgsqld:0#011(Slave -> Master sds1)
>> May 7 00:39:06 node1 pengine[1132]: notice: Demote pgsqld:1#011(Master -> Stopped sds2)
>> May 7 00:39:06 node1 pengine[1132]: notice: Move master-vip#011(Started sds2 -> sds1)
>> May 7 00:39:06 node1 pengine[1132]: notice: Calculated transition 32, saving inputs in /var/lib/pacemaker/pengine/pe-input-98.bz2
>> May 7 00:39:06 node1 crmd[1133]: notice: Initiating cancel operation pgsqld_monitor_16000 locally on sds1
>> May 7 00:39:06 node1 crmd[1133]: notice: Initiating notify operation pgsqld_pre_notify_demote_0 locally on sds1
>> May 7 00:39:06 node1 crmd[1133]: notice: Initiating notify operation pgsqld_pre_notify_demote_0 on sds2

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
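As a concrete follow-up to the shared-disk SBD suggestion in this thread, a
minimal sketch of the setup steps. Everything here is an assumption to be
checked against your environment: the device path is a placeholder, pcs
option spellings differ between pcs releases, and a single shared disk
needs a reasonably current sbd (around 1.3.1). It runs in dry-run mode by
default and only prints the commands.

```shell
#!/bin/sh
# Sketch only: shared-disk SBD for a two-node cluster.
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "${DRY_RUN}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

SBD_DEV=/dev/disk/by-id/example-shared-disk   # assumed device path

# Write the SBD metadata (message slots) onto the shared disk once:
run sbd -d "${SBD_DEV}" create

# Enable SBD cluster-wide; option syntax varies across pcs releases:
run pcs stonith sbd enable "--device=${SBD_DEV}"

# Poison-pill fencing through the disk needs a fence_sbd stonith resource:
run pcs stonith create fence-sbd fence_sbd "devices=${SBD_DEV}"
run pcs property set stonith-enabled=true
```

Set DRY_RUN=0 only after verifying the commands against `man sbd` and your
distribution's pcs documentation.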
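For the "how do I recover the system" question in the thread: once fencing
is in place, recovery from the blocked state generally means restoring the
missing data directory and then clearing the failed stop so Pacemaker can
re-probe the resource. A rough sketch follows; the master host and
replication role names are assumptions (only the PGDATA path comes from
the logs above), and it prints the commands rather than running them by
default.

```shell
#!/bin/sh
# Sketch only: recovering sds2 from the blocked "FAILED Master" state.
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "${DRY_RUN}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

PGDATA=/home/highgo/highgo/database/4.3.1/data   # path from the error logs
MASTER=sds1                                      # assumed current master
REPL_USER=replication                            # assumed replication role

# 1. Rebuild the missing data directory from the current master.
#    PAF expects the node to come back as a standby; check the PAF docs
#    for any extra steps (e.g. recovery configuration).
run pg_basebackup -h "${MASTER}" -U "${REPL_USER}" -D "${PGDATA}" -X stream

# 2. Clear the failed stop so Pacemaker re-probes and unblocks the clone:
run pcs resource cleanup pgsqld
```

Without fencing, step 2 is the manual intervention Klaus describes: the
cluster cannot safely promote sds1 until the state on sds2 is known again.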