On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote:
>
> On Thu, 13 Oct 2016 10:05:33 -0800
> Israel Brewster <isr...@ravnalaska.net> wrote:
>
>> On Oct 13, 2016, at 9:41 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>>>
>>> On 10/13/2016 12:04 PM, Israel Brewster wrote:
> [...]
>
>>>> But whatever- this is a cluster, it doesn't really matter which node
>>>> things are running on, as long as they are running. So the cluster is
>>>> working - postgresql starts, the master process is on the same node as
>>>> the IP, you can connect, etc, everything looks good. Obviously the next
>>>> thing to try is failover - should the master node fail, the slave node
>>>> should be promoted to master. So I try testing this by shutting down the
>>>> cluster on the primary server: "pcs cluster stop"
>>>> ...and nothing happens. The master shuts down (uncleanly, I might add -
>>>> it leaves behind a lock file that prevents it from starting again until
>>>> I manually remove said lock file), but the slave is never promoted to
>>>
>>> This definitely needs to be corrected. What creates the lock file, and
>>> how is that entity managed?
>>
>> The lock file entity is created/managed by the postgresql process itself. On
>> launch, postgres creates the lock file to say it is running, and deletes said
>> lock file when it shuts down. To my understanding, its role in life is to
>> prevent a restart after an unclean shutdown so the admin is reminded to make
>> sure that the data is in a consistent state before starting the server again.
>
> What is the name of this lock file? Where is it?
>
> PostgreSQL does not create a lock file. It creates a "postmaster.pid" file, but
> it does not forbid a startup if the new process doesn't find another process
> with the pid and shm shown in the postmaster.pid.
>
> As far as I know, the pgsql resource agent creates such a lock file on promote
> and deletes it on graceful stop. If the PostgreSQL instance couldn't be stopped
> correctly, the lock file stays and the RA refuses to start it the next time.
Ah, you're right. Looking at the RA I see where it creates the file in question. The delete appears to be in the pgsql_real_stop() function (which makes sense), wrapped in an if block that checks for $1 being "master" and $OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little debugging code in there, I see that when it hits that block on a cluster stop, $OCF_RESKEY_CRM_meta_notify_slave_uname is "centtest1.ravnalaska.net", not a space, so the lock file is not removed:

    if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; then
        ocf_log info "Removing $PGSQL_LOCK."
        rm -f $PGSQL_LOCK
    fi

It doesn't look like there is anywhere else where the file would be removed.

>
> [...]
>>>> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
>
> I cannot find the result of the stop operation in your log files; maybe the
> log from CentTest2 would be more useful.

Sure. I was looking at centtest1 because I was trying to figure out why it wouldn't promote, but if centtest2 never really stopped (properly), that could explain things.
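To make the failing check concrete, here's a minimal standalone sketch of that condition (the variable names mirror the RA, but this is my own illustration, not the agent's actual code or the real lock path):

```shell
#!/bin/sh
# Sketch: why the pgsql RA's lock file survives a full "pcs cluster stop".
PGSQL_LOCK=/tmp/PGSQL.lock.demo          # hypothetical path for the demo
touch "$PGSQL_LOCK"                      # pretend we were promoted earlier

role="master"                            # the $1 passed to pgsql_real_stop()
# During a whole-cluster stop, the slave's uname is still reported in the
# notify environment, so this holds a hostname rather than the expected " ":
OCF_RESKEY_CRM_meta_notify_slave_uname="centtest1.ravnalaska.net"

if [ "$role" = "master" ] && [ "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; then
    echo "Removing $PGSQL_LOCK."
    rm -f "$PGSQL_LOCK"
fi

# The condition is false, so the lock file is left behind:
[ -f "$PGSQL_LOCK" ] && echo "lock file still present"
```

Run as-is, this prints "lock file still present", which matches the behavior seen on the cluster: the removal only ever fires when the master stops while no slave is listed.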
Here's the log from 2 when calling pcs cluster stop:

Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sending flush op to all hosts for: standby (true)
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sent update 26: standby=true
Oct 13 14:05:14 CentTest2 pacemaker: Waiting for shutdown of managed resources
Oct 13 14:05:14 CentTest2 crmd[9426]: notice: Operation pgsql_96_notify_0: ok (node=centtest2.ravnalaska.net, call=21, rc=0, cib-update=0, confirmed=true)
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sending flush op to all hosts for: master-pgsql_96 (-INFINITY)
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sent update 28: master-pgsql_96=-INFINITY
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sending flush op to all hosts for: pgsql_96-master-baseline (<null>)
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sent delete 30: node=centtest2.ravnalaska.net, attr=pgsql_96-master-baseline, id=<n/a>, set=(null), section=status
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sent delete 32: node=centtest2.ravnalaska.net, attr=pgsql_96-master-baseline, id=<n/a>, set=(null), section=status
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: Stopping PostgreSQL on demote.
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: stop_escalate(or stop_escalate_in_slave) time is adjusted to 50 based on the configured timeout.
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: server shutting down
Oct 13 14:05:15 CentTest2 pgsql(pgsql_96)[5107]: INFO: PostgreSQL is down
Oct 13 14:05:15 CentTest2 pgsql(pgsql_96)[5107]: INFO: Changing pgsql_96-status on centtest2.ravnalaska.net : PRI->STOP.
Oct 13 14:05:15 CentTest2 attrd[9424]: notice: Sending flush op to all hosts for: pgsql_96-status (STOP)
Oct 13 14:05:15 CentTest2 attrd[9424]: notice: Sent update 34: pgsql_96-status=STOP
Oct 13 14:05:15 CentTest2 crmd[9426]: notice: Operation pgsql_96_demote_0: ok (node=centtest2.ravnalaska.net, call=23, rc=0, cib-update=20, confirmed=true)
Oct 13 14:05:15 CentTest2 crmd[9426]: notice: Operation pgsql_96_notify_0: ok (node=centtest2.ravnalaska.net, call=24, rc=0, cib-update=0, confirmed=true)
Oct 13 14:05:15 CentTest2 IPaddr2(virtual_ip)[5360]: INFO: IP status = ok, IP_CIP=
Oct 13 14:05:15 CentTest2 crmd[9426]: notice: Operation virtual_ip_stop_0: ok (node=centtest2.ravnalaska.net, call=26, rc=0, cib-update=21, confirmed=true)
Oct 13 14:05:15 CentTest2 crmd[9426]: notice: Operation pgsql_96_notify_0: ok (node=centtest2.ravnalaska.net, call=27, rc=0, cib-update=0, confirmed=true)
Oct 13 14:05:15 CentTest2 pgsql(pgsql_96)[5497]: INFO: PostgreSQL is already stopped.
Oct 13 14:05:15 CentTest2 crmd[9426]: notice: Operation pgsql_96_stop_0: ok (node=centtest2.ravnalaska.net, call=28, rc=0, cib-update=22, confirmed=true)
Oct 13 14:05:16 CentTest2 pacemaker: Leaving fence domain
Oct 13 14:05:16 CentTest2 pacemaker: Stopping fenced 9227
Oct 13 14:05:16 CentTest2 pacemaker: Signaling Pacemaker Cluster Manager to terminate
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Shutting down Pacemaker
Oct 13 14:05:16 CentTest2 pacemaker: Waiting for cluster services to unload
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping crmd: Sent -15 to process 9426
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: Requesting shutdown, upper limit is 1200000ms
Oct 13 14:05:16 CentTest2 attrd[9424]: notice: Update relayed from centtest1.ravnalaska.net
Oct 13 14:05:16 CentTest2 attrd[9424]: notice: Sending flush op to all hosts for: shutdown (1476396316)
Oct 13 14:05:16 CentTest2 attrd[9424]: notice: Sent update 37: shutdown=1476396316
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: State transition S_NOT_DC -> S_STOPPING [ input=I_STOP cause=C_HA_MESSAGE origin=route_message ]
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: Stopped 0 recurring operations at shutdown... waiting (0 ops remaining)
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: Disconnected from the LRM
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: Disconnecting from Corosync
Oct 13 14:05:16 CentTest2 pengine[9425]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping pengine: Sent -15 to process 9425
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping attrd: Sent -15 to process 9424
Oct 13 14:05:16 CentTest2 attrd[9424]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 attrd[9424]: notice: Exiting...
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping lrmd: Sent -15 to process 9423
Oct 13 14:05:16 CentTest2 lrmd[9423]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping stonith-ng: Sent -15 to process 9422
Oct 13 14:05:16 CentTest2 stonith-ng[9422]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping cib: Sent -15 to process 9421
Oct 13 14:05:16 CentTest2 cib[9421]: warning: new_event_notification (9421-9422-12): Broken pipe (32)
Oct 13 14:05:16 CentTest2 cib[9421]: warning: Notification of client crmd/efe7e6d2-3a51-4e0d-a4c5-cb986a8711d8 failed
Oct 13 14:05:16 CentTest2 cib[9421]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 cib[9421]: notice: Disconnecting from Corosync
Oct 13 14:05:16 CentTest2 cib[9421]: notice: Disconnecting from Corosync
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Shutdown complete
Oct 13 14:05:17 CentTest2 kernel: dlm: closing connection to node 2
Oct 13 14:05:17 CentTest2 kernel: dlm: closing connection to node 1
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Unloading all Corosync service engines.
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync configuration service
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync profile loading service
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Oct 13 14:05:18 CentTest2 corosync[9163]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:2055.

> but I can find this:
>
> Oct 13 08:29:41 CentTest1 pengine[30095]: notice: Scheduling Node
> centtest2.ravnalaska.net for shutdown
> ...
> Oct 13 08:29:41 CentTest1 pengine[30095]: notice: Scheduling Node
> centtest2.ravnalaska.net for shutdown
>
> Which means the stop operation probably raised an error, leading to a fencing
> of the node. In this circumstance, I bet PostgreSQL wasn't able to stop
> correctly and the lock file stayed in place.
>
> Could you please show us your full cluster setup?

Sure: how?
pcs status shows this, but I suspect that's not what you are asking about:

Cluster name: cluster_test
Last updated: Thu Oct 13 14:09:29 2016
Last change: Thu Oct 13 12:26:55 2016 by root via crmd on centtest1.ravnalaska.net
Stack: cman
Current DC: centtest1.ravnalaska.net (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum
2 nodes and 3 resources configured

Online: [ centtest1.ravnalaska.net centtest2.ravnalaska.net ]

Full list of resources:

 virtual_ip     (ocf::heartbeat:IPaddr2):       Started centtest2.ravnalaska.net
 Master/Slave Set: msPostgresql [pgsql_96]
     Masters: [ centtest2.ravnalaska.net ]
     Slaves: [ centtest1.ravnalaska.net ]

PCSD Status:
  centtest1.ravnalaska.net: Online
  centtest2.ravnalaska.net: Online

>
> By the way, did you have a look at the PAF project?
>
> http://dalibo.github.io/PAF/
> http://dalibo.github.io/PAF/documentation.html
>
> The v1.1 version for EL6 is not ready yet, but you might want to give it a
> try: https://github.com/dalibo/PAF/tree/v1.1
>
> I would recommend EL7 and PAF 2.0, published, packaged, ready to use.

First I've heard of it. I'll give it a look. Thanks!

> Regards,
>
> --
> Jehan-Guillaume (ioguix) de Rorthais
> Dalibo
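P.S. In case "full cluster setup" means the configuration rather than the status, something along these lines should dump it (this is my guess at what's wanted; commands assume stock pcs/pacemaker tooling on EL6, so double-check against your versions):

```shell
# Human-readable dump of resources, constraints, and cluster properties:
pcs config

# Raw CIB XML, suitable for attaching to a mail:
pcs cluster cib > cib.xml

# One-shot status including failcounts (-f) and node attributes (-A):
crm_mon -1fA
```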
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org