> > On Oct 14, 2016, at 12:30 AM, Keisuke MORI <keisuke.mori...@gmail.com > <mailto:keisuke.mori...@gmail.com>> wrote: >> >> 2016-10-14 2:04 GMT+09:00 Israel Brewster <isr...@ravnalaska.net >> <mailto:isr...@ravnalaska.net>>: >>> Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql >>> starts initially, but failover never happens. >> >>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: INFO: Master does not >>> exist. >>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: WARNING: My data is >>> out-of-date. status=DISCONNECT >>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: INFO: Master does not >>> exist. >>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: WARNING: My data is >>> out-of-date. status=DISCONNECT >>> >>> Those last two lines repeat indefinitely, but there is no indication that >>> the cluster ever tries to promote centtest1 to master. Even if I completely >>> shut down the cluster, and bring it back up only on centtest1, pacemaker >>> refuses to start postgresql on centtest1 as a master. >> >> This is because the data on centtest1 is considered "out-of-date"-ed >> (as it says :) and and promoting the node to master might corrupt your >> database. > > Ok, that makes sense. So the problem is why the cluster thinks the data is > out-of-date
Turns out the problem was a simple typo in my resource creation command: I had typed centest1.ravnalaska.net <http://centest1.ravnalaska.net/> in the node_list rather than centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/> (note the missing t in the middle). So when trying to get the status, it never got a status for centtest1, which meant it defaulted to DISCONNECT and HS:alone. Once I fixed that typo, failover worked, at least so far, and I can even bring the old master back up as a slave after deleting the lock file that the RA leaves behind. Wow, that was annoying to track down! Maybe I need to be more careful about picking machine names - choose something that's harder to mess up :-) Thanks for all the suggestions and help! ----------------------------------------------- Israel Brewster Systems Analyst II Ravn Alaska 5245 Airport Industrial Rd Fairbanks, AK 99709 (907) 450-7293 ----------------------------------------------- > >> >>> >>> What can I do to fix this? What troubleshooting steps can I follow? Thanks. >>> >> >> It seems that the latest data should be only on centtest2 so the >> recovering steps should be something like: >> - start centtest2 as master >> - take the basebackup from centtest2 to centtest1 >> - start centtest1 as slave >> - make sure the replications is working properly > > I've done that. Several times. The replication works properly with either > node as the master. Initially I had started centtest1 as master, because > that's where I was planning to *have* the master, however when pacemaker keep > insisting on starting centtest2 as the master, I also tried setting things up > that way. No luck: everything works fine, but no failover. > >> >> see below for details. >> http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster >> <http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster> > Yep, that's where I started from on this little adventure :-) > >> >> >> Also, it would be helpful to check 'pgsql-data-status' and >> 'pgsql-status' attributes displayed by 'crm_mon -A' to diagnose >> whether the replications is going well or not. >> >> The slave node should have the attributes like below, otherwise the >> replications is going something wrong and the node will never be >> promoted because it does not have the proper data. >> >> ``` >> * Node node2: >> + master-pgsql : 100 >> + pgsql-data-status : STREAMING|SYNC >> + pgsql-status : HS:sync >> ``` > > Now THAT is interesting. I get this: > > Node Attributes: > * Node centtest1.ravnalaska.net <http://centtest1.ravnalaska.net/>: > + master-pgsql_96 : -INFINITY > + pgsql_96-data-status : DISCONNECT > + pgsql_96-status : HS:alone > * Node centtest2.ravnalaska.net <http://centtest2.ravnalaska.net/>: > + master-pgsql_96 : 1000 > + pgsql_96-data-status : LATEST > + pgsql_96-master-baseline : 00000000070171D0 > + pgsql_96-status : PRI > > ...Which seems to indicate that pacemaker doesn't think centtest1 is > connected to or replicating centtest2 (if I am interpreting that correctly). > And yet, it is: From postgres itself: > > [root@CentTest2 ~]# /usr/pgsql-9.6/bin/psql -h centtest2 -U postgres > psql (9.6.0) > Type "help" for help. > > postgres=# SELECT * FROM pg_replication_slots; > slot_name | plugin | slot_type | datoid | database | active | > active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn > -----------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+--------------------- > centtest_2_slot | | physical | | | t | > 27230 | 1685 | | 0/7017438 | > (1 row) > > postgres=# > > Notice that "active" is true, indicating that the slot is connected and, > well, active. Plus, from the postgresql log on centtest1: > > < 2016-10-14 08:19:38.278 AKDT > LOG: entering standby mode > < 2016-10-14 08:19:38.285 AKDT > LOG: consistent recovery state reached at > 0/7017358 > < 2016-10-14 08:19:38.285 AKDT > LOG: redo starts at 0/7017358 > < 2016-10-14 08:19:38.285 AKDT > LOG: invalid record length at 0/7017438: > wanted 24, got 0 > < 2016-10-14 08:19:38.286 AKDT > LOG: database system is ready to accept > read only connections > < 2016-10-14 08:19:38.292 AKDT > LOG: started streaming WAL from primary at > 0/7000000 on timeline 1 > > And furthermore, if I insert/change records on centtest2, those changes *do* > show up on centtest1. So everything I can see says postgresql on centtest1 > *is* connected and replicating properly, but the data status shows DISCONNECT > and the service status shows HS:alone. So obviously something is wrong here. > > > ----------------------------------------------- > Israel Brewster > Systems Analyst II > Ravn Alaska > 5245 Airport Industrial Rd > Fairbanks, AK 99709 > (907) 450-7293 > ----------------------------------------------- > >> >> >> >> -- >> Keisuke MORI >> >> _______________________________________________ >> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org> >> http://clusterlabs.org/mailman/listinfo/users >> <http://clusterlabs.org/mailman/listinfo/users> >> >> Project Home: http://www.clusterlabs.org >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >> Bugs: http://bugs.clusterlabs.org > > _______________________________________________ > Users mailing list: Users@clusterlabs.org > http://clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org
_______________________________________________ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org