Re: [ClusterLabs] Replicated PGSQL woes

Israel Brewster Thu, 13 Oct 2016 11:16:29 -0700

On Oct 13, 2016, at 9:41 AM, Ken Gaillot <[email protected]> wrote:
> 
> On 10/13/2016 12:04 PM, Israel Brewster wrote:
>> Summary: Two-node cluster setup with latest pgsql resource agent.
>> Postgresql starts initially, but failover never happens.
>> 
>> Details:
>> 
>> I'm trying to get a cluster set up with Postgresql 9.6 in a streaming
>> replication using named slots scenario. I'm using the latest pgsql
>> Resource Agent, which does appear to support the named replication slot
>> feature, and I've pulled in the various utility functions the RA uses
>> that weren't available in my base install, so the RA itself no longer
>> gives me errors.
>> 
>> Setup: Two machines, centtest1 and centtest2. Both are running CentOS
>> 6.8. Centtest1 has an IP of 10.211.55.100, and centtest2 has an IP of
>> 10.211.55.101. The cluster is set up and functioning, with a shared
>> virtual IP resource at 10.211.55.200. Postgresql has been set up and
>> tested functioning properly on both nodes with centtest1 as the master
>> and centtest2 as the streaming replica slave. 
>> 
>> I then set up the postgresql master/slave resource using the following
>> commands:
>> 
>> pcs resource create pgsql_96 pgsql \
>> pgctl="/usr/pgsql-9.6/bin/pg_ctl" \
>> logfile="/var/log/pgsql/test2.log" \
>> psql="/usr/pgsql-9.6/bin/psql" \
>> pgdata="/pgsql96/data" \
>> rep_mode="async" \
>> repuser="postgres" \
>> node_list="tcentest1.ravnalaska.net <http://tcentest1.ravnalaska.net/> 
>> <http://tcentest1.ravnalaska.net <http://tcentest1.ravnalaska.net/>>
>> centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>" \
>> master_ip="10.211.55.200" \
>> archive_cleanup_command="" \
>> restart_on_promote="true" \
>> replication_slot_name="centtest_2_slot" \
>> monitor_user="postgres" \
>> monitor_password="SuperSecret" \
>> op start timeout="60s" interval="0s" on-fail="restart" \
>> op monitor timeout="60s" interval="4s" on-fail="restart" \
>> op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" \
>> op promote timeout="60s" interval="0s" on-fail="restart" \
>> op demote timeout="60s" interval="0s" on-fail=stop \
>> op stop timeout="60s" interval="0s" on-fail="block" \
>> op notify timeout="60s" interval="0s";
>> 
>> pcs resource master msPostgresql pgsql_96 master-max=1 master-node-max=1
>> clone-max=2 clone-node-max=1 notify=true
>> 
>> pcs constraint colocation add virtual_ip with Master msPostgresql INFINITY
>> pcs constraint order promote msPostgresql then start virtual_ip
>> symmetrical=false score=INFINITY
>> pcs constraint order demote  msPostgresql then stop  virtual_ip
>> symmetrical=false score=0
>> 
>> My preference would be that the master runs on centtest1, so I add the
>> following constraint as well:
>> 
>> pcs constraint location --master msPostgresql prefers
>> centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>=50
>> 
>> When I then start the cluster, I first see *both* machines come up as
>> "slave", which I feel is somewhat odd, however the cluster software
>> quickly figures things out and promotes centtest2 to master. I've tried
> 
> This is inherent to pacemaker's model of multistate resources. Instances
> are always started in slave mode, and then promotion to master is a
> separate step.


Ok, that makes sense, and appears to be working properly then.

> 
>> this a dozen different times, and it *always* promotes centtest2 to
>> master - even if I put INFINITY in for the location constraint.
> 
> Surprisingly, location constraints do not directly support limiting to
> one role (your "--master" option is ignored, and I'm surprised it
> doesn't give an error). To do what you want, you need a rule, like:
> 
> pcs constraint location msPostgresql rule \
>   role=master score=50 \
>   #uname eq centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>
> 

Interesting. I may give that a try, just to see what happens. Of course, 
ideally, the master could be on either node equally, so I shouldn't need to do 
something like this.

> 
>> But whatever- this is a cluster, it doesn't really matter which node
>> things are running on, as long as they are running. So the cluster is
>> working - postgresql starts, the master process is on the same node as
>> the IP, you can connect, etc, everything looks good. Obviously the next
>> thing to try is failover - should the master node fail, the slave node
>> should be promoted to master. So I try testing this by shutting down the
>> cluster on the primary server: "pcs cluster stop"
>> ...and nothing happens. The master shuts down (uncleanly, I might add -
>> it leaves behind a lock file that prevents it from starting again until
>> I manually remove said lock file), but the slave is never promoted to
> 
> This definitely needs to be corrected. What creates the lock file, and
> how is that entity managed?

The lock file entity is created/managed by the postgresql process itself. On 
launch, postgres creates the lock file to say it is running, and deletes said 
lock file when it shuts down. To my understanding, its role in life is to 
prevent a restart after an unclean shutdown so the admin is reminded to make 
sure that the data is in a consistent state before starting the server again.

> 
>> master. Neither pcs status or crm_mon show any errors, but centtest1
>> never becomes master.
> 
> I remember a situation where a resource agent improperly set master
> scores, that led to no master being promoted. I don't remember the
> details, though.

The interesting thing is that it's not a case of no master being promoted *in 
general*, but rather a case of no master being promoted on a specific node, 
ever. The other node works fine for promoting masters. And I set up both 
machines at the same time, in the same way, so they should be identical, unless 
I am missing something.

The "Current DC" is centtest1, the node on which postgresql will not promote to 
master, if that makes any difference.

> 
>> 
>> If instead of stoping the cluster on centtest2, I try to simply move the
>> master using the command "pcs resource move --master msPostgresql", I
>> first run into the aforementioned unclean shutdown issue (lock file left
>> behind that has to be manually removed), and after removing the lock
>> file, I wind up with *both* nodes being slaves, and no master node. "pcs
>> resource clear --master msPostgresql" re-promotes centtest2 to master.
>> 
>> What it looks like is that for some reason pacemaker/corosync is
>> absolutely refusing to ever make centtest1 a master - even when I
>> explicitly tell it to, or when it is the only node left.
>> 
>> Looking at the messages log when I do the node shutdown test I see this:
>> 
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: State transition S_IDLE
>> -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
>> origin=abort_transition_graph ]
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: On loss of CCM
>> Quorum: Ignore
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Stop 
>>  virtual_ip#011(centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>)
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice:
>> Demote  pgsql_96:0#011(Master -> Stopped centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>)
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Calculated
>> Transition 193: /var/lib/pacemaker/pengine/pe-input-500.bz2
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 43:
>> notify pgsql_96_pre_notify_demote_0 on centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 45:
>> notify pgsql_96_pre_notify_demote_0 on centtest1.ravnalaska.net 
>> <http://centtest1.ravnalaska.net/>
>> <http://centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>> (local)
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Operation
>> pgsql_96_notify_0: ok (node=centtest1.ravnalaska.net 
>> <http://centtest1.ravnalaska.net/>
>> <http://centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>>, 
>> call=230, rc=0, cib-update=0,
>> confirmed=true)
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 6:
>> demote pgsql_96_demote_0 on centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Transition aborted by
>> status-centtest2.ravnalaska.net <http://status-centtest2.ravnalaska.net/>
>> <http://status-centtest2.ravnalaska.net 
>> <http://status-centtest2.ravnalaska.net/>>-master-pgsql_96,
>> master-pgsql_96=-INFINITY: Transient attribute change (modify
>> cib=0.170.19,
>> source=abort_unless_down:329, 
>> path=/cib/status/node_state[@id='centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>>']/transient_attributes[@id='centtest2.ravnalaska.net
>>  <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>>']/instance_attributes[@id='status-centtest2.ravnalaska.net
>>  <http://status-centtest2.ravnalaska.net/>
>> <http://status-centtest2.ravnalaska.net 
>> <http://status-centtest2.ravnalaska.net/>>']/nvpair[@id='status-centtest2.ravnalaska.net
>>  <http://status-centtest2.ravnalaska.net/>
>> <http://status-centtest2.ravnalaska.net 
>> <http://status-centtest2.ravnalaska.net/>>-master-pgsql_96'], 0)
>> Oct 13 08:29:39 CentTest1 crmd[30096]:  warning: No match for shutdown
>> action on centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>
>> Oct 13 08:29:40 CentTest1 crmd[30096]:   notice: Initiating action 44:
>> notify pgsql_96_post_notify_demote_0 on centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>
>> Oct 13 08:29:40 CentTest1 crmd[30096]:   notice: Initiating action 46:
>> notify pgsql_96_post_notify_demote_0 on centtest1.ravnalaska.net 
>> <http://centtest1.ravnalaska.net/>
>> <http://centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>> (local)
>> Oct 13 08:29:41 CentTest1 pgsql(pgsql_96)[19150]: INFO: My master
>> baseline : 000000000700B5C8.
>> Oct 13 08:29:41 CentTest1 attrd[30094]:   notice: Sending flush op to
>> all hosts for: pgsql_96-master-baseline (000000000700B5C8)
>> Oct 13 08:29:41 CentTest1 attrd[30094]:   notice: Sent update 227:
>> pgsql_96-master-baseline=000000000700B5C8
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Operation
>> pgsql_96_notify_0: ok (node=centtest1.ravnalaska.net 
>> <http://centtest1.ravnalaska.net/>
>> <http://centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>>, 
>> call=231, rc=0, cib-update=0,
>> confirmed=true)
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Transition 193
>> (Complete=12, Pending=0, Fired=0, Skipped=3, Incomplete=11,
>> Source=/var/lib/pacemaker/pengine/pe-input-500.bz2): Stopped
>> Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: On loss of CCM
>> Quorum: Ignore
>> Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Stop 
>>  virtual_ip#011(centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>)
>> Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Stop 
>>  pgsql_96:0#011(centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>)
>> Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Calculated
>> Transition 194: /var/lib/pacemaker/pengine/pe-input-501.bz2
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Initiating action 4:
>> stop virtual_ip_stop_0 on centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Initiating action 38:
>> notify pgsql_96_pre_notify_stop_0 on centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Initiating action 39:
>> notify pgsql_96_pre_notify_stop_0 on centtest1.ravnalaska.net 
>> <http://centtest1.ravnalaska.net/>
>> <http://centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>> (local)
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Operation
>> pgsql_96_notify_0: ok (node=centtest1.ravnalaska.net 
>> <http://centtest1.ravnalaska.net/>
>> <http://centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>>, 
>> call=232, rc=0, cib-update=0,
>> confirmed=true)
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Initiating action 5:
>> stop pgsql_96_stop_0 on centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Initiating action 40:
>> notify pgsql_96_post_notify_stop_0 on centtest1.ravnalaska.net 
>> <http://centtest1.ravnalaska.net/>
>> <http://centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>> (local)
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Operation
>> pgsql_96_notify_0: ok (node=centtest1.ravnalaska.net 
>> <http://centtest1.ravnalaska.net/>
>> <http://centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>>, 
>> call=233, rc=0, cib-update=0,
>> confirmed=true)
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Transition 194
>> (Complete=12, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: State transition
>> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
>> origin=notify_crmd ]
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: State transition S_IDLE
>> -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
>> origin=abort_transition_graph ]
>> Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: On loss of CCM
>> Quorum: Ignore
>> Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
>> centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>> for 
>> shutdown
>> Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Calculated
>> Transition 195: /var/lib/pacemaker/pengine/pe-input-502.bz2
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: Transition 195
>> (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-input-502.bz2): Complete
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: State transition
>> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
>> origin=notify_crmd ]
>> Oct 13 08:29:41 CentTest1 crmd[30096]:   notice: do_shutdown of
>> centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>> (op 30) 
>> is
>> complete
>> Oct 13 08:29:41 CentTest1 attrd[30094]:   notice: crm_update_peer_proc:
>> Node centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>[2] -
>> state is now lost (was member)
>> Oct 13 08:29:41 CentTest1 attrd[30094]:   notice: Removing
>> centtest2.ravnalaska.net/2 <http://centtest2.ravnalaska.net/2> 
>> <http://centtest2.ravnalaska.net/2 <http://centtest2.ravnalaska.net/2>> from 
>> the
>> membership list
>> Oct 13 08:29:41 CentTest1 attrd[30094]:   notice: Purged 1 peers with
>> id=2 and/or uname=centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>> from 
>> the membership cache
>> Oct 13 08:29:41 CentTest1 stonith-ng[30092]:   notice:
>> crm_update_peer_proc: Node centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>[2] - 
>> state is now lost (was member)
>> Oct 13 08:29:41 CentTest1 stonith-ng[30092]:   notice: Removing
>> centtest2.ravnalaska.net/2 <http://centtest2.ravnalaska.net/2> 
>> <http://centtest2.ravnalaska.net/2 <http://centtest2.ravnalaska.net/2>> from 
>> the
>> membership list
>> Oct 13 08:29:41 CentTest1 stonith-ng[30092]:   notice: Purged 1 peers
>> with id=2 and/or uname=centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>> from 
>> the membership cache
>> Oct 13 08:29:41 CentTest1 cib[30091]:   notice: crm_update_peer_proc:
>> Node centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>[2] -
>> state is now lost (was member)
>> Oct 13 08:29:41 CentTest1 cib[30091]:   notice: Removing
>> centtest2.ravnalaska.net/2 <http://centtest2.ravnalaska.net/2> 
>> <http://centtest2.ravnalaska.net/2 <http://centtest2.ravnalaska.net/2>> from 
>> the
>> membership list
>> Oct 13 08:29:41 CentTest1 cib[30091]:   notice: Purged 1 peers with id=2
>> and/or uname=centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>
>> from the membership cache
>> Oct 13 08:29:43 CentTest1 pgsql(pgsql_96)[19472]: INFO: Master does not
>> exist.
>> Oct 13 08:29:43 CentTest1 pgsql(pgsql_96)[19472]: WARNING: My data is
>> out-of-date. status=DISCONNECT
> 
> I'm not familiar with pgsql, but this doesn't look right. The instance
> on centtest2 was cleanly demoted and stopped, so I'd expect pgsql to be
> consistent at this point.

Agreed. The complaint about master not existing makes sense, but data being 
out-of-date? It might simply be complaining because it hasn't heard from the 
master in a while though.


-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

> 
>> Oct 13 08:29:43 CentTest1 corosync[29835]:   [QUORUM] Members[1]: 1
>> Oct 13 08:29:43 CentTest1 corosync[29835]:   [TOTEM ] A processor joined
>> or left the membership and a new membership was formed.
>> Oct 13 08:29:43 CentTest1 corosync[29835]:   [CPG   ] chosen downlist:
>> sender r(0) ip(10.211.55.100) ; members(old:2 left:1)
>> Oct 13 08:29:43 CentTest1 corosync[29835]:   [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Oct 13 08:29:43 CentTest1 kernel: dlm: closing connection to node 2
>> Oct 13 08:29:43 CentTest1 crmd[30096]:   notice: cman_event_callback:
>> Node centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>[2] -
>> state is now lost (was member)
>> Oct 13 08:29:43 CentTest1 crmd[30096]:   notice: do_shutdown of
>> centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/> 
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>> (op 30) 
>> is
>> complete
>> Oct 13 08:29:43 CentTest1 pacemakerd[30085]:   notice:
>> cman_event_callback: Node centtest2.ravnalaska.net 
>> <http://centtest2.ravnalaska.net/>
>> <http://centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>>[2] - 
>> state is now lost (was member)
>> Oct 13 08:29:43 CentTest1 crmd[30096]:   notice: State transition S_IDLE
>> -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
>> origin=check_join_state ]
>> Oct 13 08:29:43 CentTest1 attrd[30094]:   notice: Sending full refresh
>> (origin=crmd)
>> Oct 13 08:29:43 CentTest1 attrd[30094]:   notice: Sending flush op to
>> all hosts for: pgsql_96-status (HS:alone)
>> Oct 13 08:29:43 CentTest1 attrd[30094]:   notice: Sending flush op to
>> all hosts for: shutdown (0)
>> Oct 13 08:29:43 CentTest1 attrd[30094]:   notice: Sending flush op to
>> all hosts for: master-pgsql_96 (-INFINITY)
>> Oct 13 08:29:43 CentTest1 attrd[30094]:   notice: Sending flush op to
>> all hosts for: pgsql_96-master-baseline (000000000700B5C8)
>> Oct 13 08:29:43 CentTest1 pengine[30095]:   notice: On loss of CCM
>> Quorum: Ignore
>> Oct 13 08:29:43 CentTest1 pengine[30095]:   notice: Calculated
>> Transition 196: /var/lib/pacemaker/pengine/pe-input-503.bz2
>> Oct 13 08:29:43 CentTest1 crmd[30096]:   notice: Transition 196
>> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-input-503.bz2): Complete
>> Oct 13 08:29:43 CentTest1 crmd[30096]:   notice: State transition
>> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
>> origin=notify_crmd ]
>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: INFO: Master does not
>> exist.
>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: WARNING: My data is
>> out-of-date. status=DISCONNECT
>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: INFO: Master does not
>> exist.
>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: WARNING: My data is
>> out-of-date. status=DISCONNECT
>> 
>> Those last two lines repeat indefinitely, but there is no indication
>> that the cluster ever tries to promote centtest1 to master. Even if I
>> completely shut down the cluster, and bring it back up only on
>> centtest1, pacemaker refuses to start postgresql on centtest1 as a master.
>> 
>> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
>> 
>> 
>> -----------------------------------------------
>> Israel Brewster
>> Systems Analyst II
>> Ravn Alaska
>> 5245 Airport Industrial Rd
>> Fairbanks, AK 99709
>> (907) 450-7293
>> -----------------------------------------------
> 
> _______________________________________________
> Users mailing list: [email protected] <mailto:[email protected]>
> http://clusterlabs.org/mailman/listinfo/users 
> <http://clusterlabs.org/mailman/listinfo/users>
> 
> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [ClusterLabs] Replicated PGSQL woes

Reply via email to