Re: [ClusterLabs] Replicated PGSQL woes

2016-10-19 Thread Jehan-Guillaume de Rorthais
On Wed, 19 Oct 2016 19:44:14 +0900
Keisuke MORI  wrote:

> 2016-10-14 18:39 GMT+09:00 Jehan-Guillaume de Rorthais :
> > On Thu, 13 Oct 2016 14:11:06 -0800
> > Israel Brewster  wrote:
> >  
> >> On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais 
> >> wrote:  
> >> >
> >> > On Thu, 13 Oct 2016 10:05:33 -0800
> >> > Israel Brewster  wrote:
> >> >  
> >> >> On Oct 13, 2016, at 9:41 AM, Ken Gaillot  wrote:  
> >> >>>
> >> >>> On 10/13/2016 12:04 PM, Israel Brewster wrote:  
> >> > [...]
> >> >  
> >>  But whatever- this is a cluster, it doesn't really matter which node
> >>  things are running on, as long as they are running. So the cluster is
> >>  working - postgresql starts, the master process is on the same node as
> >>  the IP, you can connect, etc, everything looks good. Obviously the
> >>  next thing to try is failover - should the master node fail, the
> >>  slave node should be promoted to master. So I try testing this by
> >>  shutting down the cluster on the primary server: "pcs cluster stop"
> >>  ...and nothing happens. The master shuts down (uncleanly, I might add
> >>  - it leaves behind a lock file that prevents it from starting again
> >>  until I manually remove said lock file), but the slave is never
> >>  promoted to  
> >> >>>
> >> >>> This definitely needs to be corrected. What creates the lock file, and
> >> >>> how is that entity managed?  
> >> >>
> >> >> The lock file entity is created/managed by the postgresql process
> >> >> itself. On launch, postgres creates the lock file to say it is running,
> >> >> and deletes said lock file when it shuts down. To my understanding, its
> >> >> role in life is to prevent a restart after an unclean shutdown so the
> >> >> admin is reminded to make sure that the data is in a consistent state
> >> >> before starting the server again.  
> >> >
> >> > What is the name of this lock file? Where is it?
> >> >
> >> > PostgreSQL does not create lock file. It creates a "postmaster.pid" file,
> >> > but it does not forbid a startup if the new process doesn't find another
> >> > process with the pid and shm shown in the postmaster.pid.
> >> >
> >> > As far as I know, the pgsql resource agent create such a lock file on
> >> > promote and delete it on graceful stop. If the PostgreSQL instance
> >> > couldn't be stopped correctly, the lock files stays and the RA refuse to
> >> > start it the next time.  
> >>
> >> Ah, you're right. Looking at the RA I see where it creates the file in
> >> question. The delete appears to be in the pgsql_real_stop() function (which
> >> makes sense), wrapped in an if block that checks for $1 being master and
> >> $OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little
> >> debugging code in there I see that when it hits that block on a cluster
> >> stop, $OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net,
> >> not a space, so the lock file is not removed:
> >>
> >> if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; then
> >>     ocf_log info "Removing $PGSQL_LOCK."
> >>     rm -f $PGSQL_LOCK
> >> fi
> >>
> >> It doesn't look like there is anywhere else where the file would be
> >> removed.  
> >
> > This looks quite wrong to me, for two reasons (I'll try to be clear):
> >
> > 1) the resource agent (RA) makes sure the timeline (TL) will not be
> > incremented during promotion.
> >
> > As there is no documentation about that, I'm pretty sure this contortion
> > comes from limitations in very old versions of PostgreSQL (<= 9.1):
> >
> >   * a slave wasn't able to cross a timeline (TL) via streaming replication,
> > only from WAL archives. That meant crossing a TL required restarting the
> > slave or temporarily cutting the streaming replication to force it back to
> > the archives
> >   * moreover, a standby could miss some transactions after a clean master
> > shutdown. That meant the old master couldn't rejoin the cluster safely as a
> > slave, as the TL was still the same...
> >
> > See slides 35-37:
> > http://www.slideshare.net/takmatsuo/2012929-pg-study-16012253
> >
> > In my understanding, that's why we make sure there's no slave around before
> > shutting down the master: should the master come back later cleanly, we
> > make sure no one could have been promoted in the meantime.
> 
> Yes, that is correct but  the issue described in the slide is not
> relevant to the Timeline ID issue,

Really? That was pretty much the point of this slide as I understood it. But
as I didn't attend the conference, I don't have the spoken explanation and I
might be wrong. Anyway, consider this:

  * the slaves are not connected
  * the master receives some transactions
  * a clean shutdown occurs on the master
  * a slave is promoted **without TL 

Re: [ClusterLabs] Replicated PGSQL woes

2016-10-19 Thread Keisuke MORI
2016-10-14 18:39 GMT+09:00 Jehan-Guillaume de Rorthais :
> On Thu, 13 Oct 2016 14:11:06 -0800
> Israel Brewster  wrote:
>
>> On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais 
>> wrote:
>> >
>> > On Thu, 13 Oct 2016 10:05:33 -0800
>> > Israel Brewster  wrote:
>> >
>> >> On Oct 13, 2016, at 9:41 AM, Ken Gaillot  wrote:
>> >>>
>> >>> On 10/13/2016 12:04 PM, Israel Brewster wrote:
>> > [...]
>> >
>>  But whatever- this is a cluster, it doesn't really matter which node
>>  things are running on, as long as they are running. So the cluster is
>>  working - postgresql starts, the master process is on the same node as
>>  the IP, you can connect, etc, everything looks good. Obviously the next
>>  thing to try is failover - should the master node fail, the slave node
>>  should be promoted to master. So I try testing this by shutting down the
>>  cluster on the primary server: "pcs cluster stop"
>>  ...and nothing happens. The master shuts down (uncleanly, I might add -
>>  it leaves behind a lock file that prevents it from starting again until
>>  I manually remove said lock file), but the slave is never promoted to
>> >>>
>> >>> This definitely needs to be corrected. What creates the lock file, and
>> >>> how is that entity managed?
>> >>
>> >> The lock file entity is created/managed by the postgresql process itself.
>> >> On launch, postgres creates the lock file to say it is running, and
>> >> deletes said lock file when it shuts down. To my understanding, its role
>> >> in life is to prevent a restart after an unclean shutdown so the admin is
>> >> reminded to make sure that the data is in a consistent state before
>> >> starting the server again.
>> >
>> > What is the name of this lock file? Where is it?
>> >
>> > PostgreSQL does not create lock file. It creates a "postmaster.pid" file,
>> > but it does not forbid a startup if the new process doesn't find another
>> > process with the pid and shm shown in the postmaster.pid.
>> >
>> > As far as I know, the pgsql resource agent create such a lock file on
>> > promote and delete it on graceful stop. If the PostgreSQL instance couldn't
>> > be stopped correctly, the lock files stays and the RA refuse to start it
>> > the next time.
>>
>> Ah, you're right. Looking at the RA I see where it creates the file in
>> question. The delete appears to be in the pgsql_real_stop() function (which
>> makes sense), wrapped in an if block that checks for $1 being master and
>> $OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little
>> debugging code in there I see that when it hits that block on a cluster stop,
>> $OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net, not a
>> space, so the lock file is not removed:
>>
>> if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; then
>>     ocf_log info "Removing $PGSQL_LOCK."
>>     rm -f $PGSQL_LOCK
>> fi
>>
>> It doesn't look like there is anywhere else where the file would be removed.
>
> This looks quite wrong to me, for two reasons (I'll try to be clear):
>
> 1) the resource agent (RA) makes sure the timeline (TL) will not be
> incremented during promotion.
>
> As there is no documentation about that, I'm pretty sure this contortion
> comes from limitations in very old versions of PostgreSQL (<= 9.1):
>
>   * a slave wasn't able to cross a timeline (TL) via streaming replication,
> only from WAL archives. That meant crossing a TL required restarting the
> slave or temporarily cutting the streaming replication to force it back to
> the archives
>   * moreover, a standby could miss some transactions after a clean master
> shutdown. That meant the old master couldn't rejoin the cluster safely as a
> slave, as the TL was still the same...
>
> See slides 35-37:
> http://www.slideshare.net/takmatsuo/2012929-pg-study-16012253
>
> In my understanding, that's why we make sure there's no slave around before
> shutting down the master: should the master come back later cleanly, we make
> sure no one could have been promoted in the meantime.

Yes, that is correct, but the issue described in the slide is not relevant to
the Timeline ID issue, and the issue in the slide could still possibly happen
in recent PostgreSQL releases too, as far as I understand.
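
For a planned switchover, one way to reduce that risk is to confirm that the
standby has received everything before it is promoted. A minimal check,
assuming the 9.6 function names and the paths used elsewhere in this thread
(adapt them to your own setup):

  # on the standby: compare received and replayed WAL positions
  # (9.6 names; these became pg_last_wal_* in PostgreSQL 10+)
  psql -U postgres -c "SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"

  # on the cleanly stopped old master: the final checkpoint location, for comparison
  /usr/pgsql-9.6/bin/pg_controldata /pgsql96/data | grep -i 'checkpoint location'

If the standby's receive location is behind the old master's final checkpoint,
promoting it would lose those transactions, which is exactly the scenario
discussed above.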

>
> Note that considering this issue and how the RA tries to avoid it, this check
> that the slaves are shut down before the master is quite weak anyway...
>
> Last but not least, the two PostgreSQL limitations the RA is messing with
> were fixed a long time ago, in 9.3:
>   * https://www.postgresql.org/docs/current/static/release-9-3.html#AEN138909
>   *
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=985bd7d49726c9f178558491d31a570d47340459
>
> ...but it requires PostgreSQL 9.3+ for the 

Re: [ClusterLabs] Replicated PGSQL woes

2016-10-14 Thread Jehan-Guillaume de Rorthais
On Fri, 14 Oct 2016 08:10:08 -0800
Israel Brewster  wrote:

> On Oct 14, 2016, at 1:39 AM, Jehan-Guillaume de Rorthais 
> wrote:
> > 
> > On Thu, 13 Oct 2016 14:11:06 -0800
> > Israel Brewster  wrote:

[...]

> > I **guess** if you really want a shutdown to occur,

I meant «failover» here, not shutdown, sorry.

> > you need to simulate a real failure, not shutting down the first node
> > cleanly. Try to kill corosync.  
> 
> From an academic standpoint the results of that test (which, incidentally,
> were the same as the results of every other test I've done) are interesting,
> however from a practical standpoint I'm not sure it helps much - most of the
> "failures" that I experience are intentional: I want to fail over to the
> other machine so I can run some software updates, reboot for whatever reason,
> shutdown temporarily to upgrade the hardware, or whatever. 

As far as I know, this is a switchover. A planned switchover.

And you should definitely check out PAF then :)
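
To make the distinction concrete, here is a rough sketch of both tests; the pcs
syntax is from the 0.9 series shipped on EL6 and the node name is the one used
in this thread, so adapt as needed:

  # simulate a real failure: kill the cluster stack uncleanly so Pacemaker
  # sees a node failure rather than a graceful stop
  killall -9 corosync

  # planned switchover instead: put the current master node in standby so its
  # resources are demoted/stopped and the other node is promoted, then bring it back
  pcs cluster standby centtest2.ravnalaska.net
  pcs cluster unstandby centtest2.ravnalaska.net

A standby/unstandby cycle exercises the same demote/promote path you need for
maintenance, without relying on an unclean shutdown.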

[...]

-- 
Jehan-Guillaume de Rorthais
Dalibo



Re: [ClusterLabs] Replicated PGSQL woes [solved]

2016-10-14 Thread Israel Brewster
> 
> On Oct 14, 2016, at 12:30 AM, Keisuke MORI wrote:
>> 
>> 2016-10-14 2:04 GMT+09:00 Israel Brewster:
>>> Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql
>>> starts initially, but failover never happens.
>> 
>>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: INFO: Master does not
>>> exist.
>>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: WARNING: My data is
>>> out-of-date. status=DISCONNECT
>>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: INFO: Master does not
>>> exist.
>>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: WARNING: My data is
>>> out-of-date. status=DISCONNECT
>>> 
>>> Those last two lines repeat indefinitely, but there is no indication that
>>> the cluster ever tries to promote centtest1 to master. Even if I completely
>>> shut down the cluster, and bring it back up only on centtest1, pacemaker
>>> refuses to start postgresql on centtest1 as a master.
>> 
>> This is because the data on centtest1 is considered "out-of-date" (as it
>> says :) and promoting the node to master might corrupt your database.
> 
> Ok, that makes sense. So the problem is why the cluster thinks the data is 
> out-of-date

Turns out the problem was a simple typo in my resource creation command: I had
typed centest1.ravnalaska.net in the node_list rather than
centtest1.ravnalaska.net (note the missing t in the middle). So when trying to
get the status, it never got a status for centtest1, which meant it defaulted
to DISCONNECT and HS:alone. Once I fixed that typo, failover worked, at least
so far, and I can even bring the old master back up as a slave after deleting
the lock file that the RA leaves behind. Wow, that was annoying to track down!
Maybe I need to be more careful about picking machine names - choose something
that's harder to mess up :-)
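
For anyone hitting the same symptom, correcting node_list on the existing
resource is enough; a sketch using pcs 0.9 syntax and the resource/node names
from this thread (verify against your own configuration):

  pcs resource update pgsql_96 \
      node_list="centtest1.ravnalaska.net centtest2.ravnalaska.net"

  # confirm the parameter now matches the real node names
  pcs resource show pgsql_96

After that, crm_mon -A should report a real pgsql_96-data-status for the slave
(e.g. STREAMING|ASYNC with rep_mode="async") instead of DISCONNECT.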

Thanks for all the suggestions and help!
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---

> 
>> 
>>> 
>>> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
>>> 
>> 
>> It seems that the latest data should be only on centtest2 so the
>> recovering steps should be something like:
>> - start centtest2 as master
>> - take the basebackup from centtest2 to centtest1
>> - start centtest1 as slave
>> - make sure the replications is working properly
> 
> I've done that. Several times. The replication works properly with either
> node as the master. Initially I had started centtest1 as master, because
> that's where I was planning to *have* the master, however when pacemaker kept
> insisting on starting centtest2 as the master, I also tried setting things up
> that way. No luck: everything works fine, but no failover.
> 
>> 
>> see below for details.
>> http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster 
>> 
> Yep, that's where I started from on this little adventure :-)
> 
>> 
>> 
>> Also, it would be helpful to check 'pgsql-data-status' and
>> 'pgsql-status' attributes displayed by 'crm_mon -A' to diagnose
>> whether the replications is going well or not.
>> 
>> The slave node should have the attributes like below, otherwise the
>> replications is going something wrong and the node will never be
>> promoted because it does not have the proper data.
>> 
>> ```
>> * Node node2:
>>+ master-pgsql  : 100
>>+ pgsql-data-status : STREAMING|SYNC
>>+ pgsql-status  : HS:sync
>> ```
> 
> Now THAT is interesting. I get this:
> 
> Node Attributes:
> * Node centtest1.ravnalaska.net :
> + master-pgsql_96   : -INFINITY 
> + pgsql_96-data-status  : DISCONNECT
> + pgsql_96-status   : HS:alone  
> * Node centtest2.ravnalaska.net :
> + master-pgsql_96   : 1000
> + pgsql_96-data-status  : LATEST
> + pgsql_96-master-baseline  : 070171D0
> + pgsql_96-status   : PRI
> 
> ...Which seems to indicate that pacemaker doesn't think centtest1 is 
> connected to or replicating centtest2 (if I am interpreting that correctly). 
> And yet, it is: From postgres itself:
> 
> [root@CentTest2 ~]# /usr/pgsql-9.6/bin/psql -h centtest2 -U postgres
> psql (9.6.0)
> Type "help" for help.
> 
> postgres=# SELECT * FROM pg_replication_slots;
> slot_name| plugin | slot_type | datoid | database | active | 
> active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
> 

Re: [ClusterLabs] Replicated PGSQL woes

2016-10-14 Thread Israel Brewster
On Oct 14, 2016, at 12:30 AM, Keisuke MORI  wrote:
> 
> 2016-10-14 2:04 GMT+09:00 Israel Brewster :
>> Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql
>> starts initially, but failover never happens.
> 
>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: INFO: Master does not
>> exist.
>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: WARNING: My data is
>> out-of-date. status=DISCONNECT
>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: INFO: Master does not
>> exist.
>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: WARNING: My data is
>> out-of-date. status=DISCONNECT
>> 
>> Those last two lines repeat indefinitely, but there is no indication that
>> the cluster ever tries to promote centtest1 to master. Even if I completely
>> shut down the cluster, and bring it back up only on centtest1, pacemaker
>> refuses to start postgresql on centtest1 as a master.
> 
> This is because the data on centtest1 is considered "out-of-date" (as it
> says :) and promoting the node to master might corrupt your database.

Ok, that makes sense. So the problem is why the cluster thinks the data is 
out-of-date

> 
>> 
>> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
>> 
> 
> It seems that the latest data should be only on centtest2 so the
> recovering steps should be something like:
> - start centtest2 as master
> - take the basebackup from centtest2 to centtest1
> - start centtest1 as slave
> - make sure the replications is working properly

I've done that. Several times. The replication works properly with either node
as the master. Initially I had started centtest1 as master, because that's
where I was planning to *have* the master, however when pacemaker kept
insisting on starting centtest2 as the master, I also tried setting things up
that way. No luck: everything works fine, but no failover.

> 
> see below for details.
> http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster 
> 
Yep, that's where I started from on this little adventure :-)

> 
> 
> Also, it would be helpful to check 'pgsql-data-status' and
> 'pgsql-status' attributes displayed by 'crm_mon -A' to diagnose
> whether the replications is going well or not.
> 
> The slave node should have the attributes like below, otherwise the
> replications is going something wrong and the node will never be
> promoted because it does not have the proper data.
> 
> ```
> * Node node2:
>+ master-pgsql  : 100
>+ pgsql-data-status : STREAMING|SYNC
>+ pgsql-status  : HS:sync
> ```

Now THAT is interesting. I get this:

Node Attributes:
* Node centtest1.ravnalaska.net:
+ master-pgsql_96   : -INFINITY 
+ pgsql_96-data-status  : DISCONNECT
+ pgsql_96-status   : HS:alone  
* Node centtest2.ravnalaska.net:
+ master-pgsql_96   : 1000
+ pgsql_96-data-status  : LATEST
+ pgsql_96-master-baseline  : 070171D0
+ pgsql_96-status   : PRI

...Which seems to indicate that pacemaker doesn't think centtest1 is connected 
to or replicating centtest2 (if I am interpreting that correctly). And yet, it 
is: From postgres itself:

[root@CentTest2 ~]# /usr/pgsql-9.6/bin/psql -h centtest2 -U postgres
psql (9.6.0)
Type "help" for help.

postgres=# SELECT * FROM pg_replication_slots;
    slot_name    | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
 centtest_2_slot |        | physical  |        |          | t      |      27230 | 1685 |              | 0/7017438   |
(1 row)

postgres=# 

Notice that "active" is true, indicating that the slot is connected and, well, 
active. Plus, from the postgresql log on centtest1:

< 2016-10-14 08:19:38.278 AKDT > LOG:  entering standby mode
< 2016-10-14 08:19:38.285 AKDT > LOG:  consistent recovery state reached at 0/7017358
< 2016-10-14 08:19:38.285 AKDT > LOG:  redo starts at 0/7017358
< 2016-10-14 08:19:38.285 AKDT > LOG:  invalid record length at 0/7017438: wanted 24, got 0
< 2016-10-14 08:19:38.286 AKDT > LOG:  database system is ready to accept read only connections
< 2016-10-14 08:19:38.292 AKDT > LOG:  started streaming WAL from primary at 0/700 on timeline 1

And furthermore, if I insert/change records on centtest2, those changes *do* 
show up on centtest1. So everything I can see says postgresql on centtest1 *is* 
connected and replicating properly, but the data status shows DISCONNECT and 
the service status shows HS:alone. So obviously something is wrong here.
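
For cross-checking that kind of mismatch, it can help to look at what
PostgreSQL itself reports on each side, independently of the resource agent; a
minimal check assuming the 9.6 view/function names and the host names used in
this thread:

  # on the master: is a standby attached via streaming, and in which mode?
  psql -h centtest2 -U postgres -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"

  # on the standby: is it in recovery and receiving WAL?
  psql -h centtest1 -U postgres -c "SELECT pg_is_in_recovery(), pg_last_xlog_receive_location();"

If these look healthy while crm_mon still shows DISCONNECT, the agent is
evaluating replication against something other than the live state - which is
what the node_list typo reported in the [solved] message of this thread turned
out to be.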


---
Israel Brewster
Systems Analyst II
Ravn 

Re: [ClusterLabs] Replicated PGSQL woes

2016-10-14 Thread Israel Brewster
On Oct 14, 2016, at 1:39 AM, Jehan-Guillaume de Rorthais  
wrote:
> 
> On Thu, 13 Oct 2016 14:11:06 -0800
> Israel Brewster  wrote:
> 
>> On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais 
>> wrote:
>>> 
>>> On Thu, 13 Oct 2016 10:05:33 -0800
>>> Israel Brewster  wrote:
>>> 
 On Oct 13, 2016, at 9:41 AM, Ken Gaillot  wrote:  
> 
> On 10/13/2016 12:04 PM, Israel Brewster wrote:
>>> [...]
>>> 
>> But whatever- this is a cluster, it doesn't really matter which node
>> things are running on, as long as they are running. So the cluster is
>> working - postgresql starts, the master process is on the same node as
>> the IP, you can connect, etc, everything looks good. Obviously the next
>> thing to try is failover - should the master node fail, the slave node
>> should be promoted to master. So I try testing this by shutting down the
>> cluster on the primary server: "pcs cluster stop"
>> ...and nothing happens. The master shuts down (uncleanly, I might add -
>> it leaves behind a lock file that prevents it from starting again until
>> I manually remove said lock file), but the slave is never promoted to
> 
> This definitely needs to be corrected. What creates the lock file, and
> how is that entity managed?
 
 The lock file entity is created/managed by the postgresql process itself.
 On launch, postgres creates the lock file to say it is running, and
 deletes said lock file when it shuts down. To my understanding, its role
 in life is to prevent a restart after an unclean shutdown so the admin is
 reminded to make sure that the data is in a consistent state before
 starting the server again.  
>>> 
>>> What is the name of this lock file? Where is it?
>>> 
>>> PostgreSQL does not create lock file. It creates a "postmaster.pid" file,
>>> but it does not forbid a startup if the new process doesn't find another
>>> process with the pid and shm shown in the postmaster.pid.
>>> 
>>> As far as I know, the pgsql resource agent create such a lock file on
>>> promote and delete it on graceful stop. If the PostgreSQL instance couldn't
>>> be stopped correctly, the lock files stays and the RA refuse to start it
>>> the next time.  
>> 
>> Ah, you're right. Looking at the RA I see where it creates the file in
>> question. The delete appears to be in the pgsql_real_stop() function (which
>> makes sense), wrapped in an if block that checks for $1 being master and
>> $OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little
>> debugging code in there I see that when it hits that block on a cluster stop,
>> $OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net, not a
>> space, so the lock file is not removed:
>>
>> if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; then
>>     ocf_log info "Removing $PGSQL_LOCK."
>>     rm -f $PGSQL_LOCK
>> fi
>> 
>> It doesn't look like there is anywhere else where the file would be removed.
> 
> This looks quite wrong to me, for two reasons (I'll try to be clear):
> 
> 1) the resource agent (RA) makes sure the timeline (TL) will not be
> incremented during promotion.
> 
> As there is no documentation about that, I'm pretty sure this contortion
> comes from limitations in very old versions of PostgreSQL (<= 9.1):
> 
>   * a slave wasn't able to cross a timeline (TL) via streaming replication,
> only from WAL archives. That meant crossing a TL required restarting the
> slave or temporarily cutting the streaming replication to force it back to
> the archives
>   * moreover, a standby could miss some transactions after a clean master
> shutdown. That meant the old master couldn't rejoin the cluster safely as a
> slave, as the TL was still the same...
> 
> See slides 35-37:
> http://www.slideshare.net/takmatsuo/2012929-pg-study-16012253
> 
> In my understanding, that's why we make sure there's no slave around before
> shutting down the master: should the master come back later cleanly, we make
> sure no one could have been promoted in the meantime.
> 
> Note that considering this issue and how the RA tries to avoid it, this check
> that the slaves are shut down before the master is quite weak anyway...
> 
> Last but not least, the two PostgreSQL limitations the RA is messing with
> were fixed a long time ago, in 9.3:
>   * https://www.postgresql.org/docs/current/static/release-9-3.html#AEN138909
>   *
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=985bd7d49726c9f178558491d31a570d47340459
> 
> ...but it requires PostgreSQL 9.3+ for the timeline issue. By the way, I
> suspect this is related to the "restart_on_promote" parameter of the RA.
> 
> 2) from a recent discussion on this list (or maybe on -dev), RA devs should not
> rely on 

Re: [ClusterLabs] Replicated PGSQL woes

2016-10-14 Thread Jehan-Guillaume de Rorthais
On Thu, 13 Oct 2016 14:11:06 -0800
Israel Brewster  wrote:

> On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais 
> wrote:
> > 
> > On Thu, 13 Oct 2016 10:05:33 -0800
> > Israel Brewster  wrote:
> >   
> >> On Oct 13, 2016, at 9:41 AM, Ken Gaillot  wrote:  
> >>> 
> >>> On 10/13/2016 12:04 PM, Israel Brewster wrote:
> > [...]
> >   
>  But whatever- this is a cluster, it doesn't really matter which node
>  things are running on, as long as they are running. So the cluster is
>  working - postgresql starts, the master process is on the same node as
>  the IP, you can connect, etc, everything looks good. Obviously the next
>  thing to try is failover - should the master node fail, the slave node
>  should be promoted to master. So I try testing this by shutting down the
>  cluster on the primary server: "pcs cluster stop"
>  ...and nothing happens. The master shuts down (uncleanly, I might add -
>  it leaves behind a lock file that prevents it from starting again until
>  I manually remove said lock file), but the slave is never promoted to
> >>> 
> >>> This definitely needs to be corrected. What creates the lock file, and
> >>> how is that entity managed?
> >> 
> >> The lock file entity is created/managed by the postgresql process itself.
> >> On launch, postgres creates the lock file to say it is running, and
> >> deletes said lock file when it shuts down. To my understanding, its role
> >> in life is to prevent a restart after an unclean shutdown so the admin is
> >> reminded to make sure that the data is in a consistent state before
> >> starting the server again.  
> > 
> > What is the name of this lock file? Where is it?
> > 
> > PostgreSQL does not create lock file. It creates a "postmaster.pid" file,
> > but it does not forbid a startup if the new process doesn't find another
> > process with the pid and shm shown in the postmaster.pid.
> > 
> > As far as I know, the pgsql resource agent create such a lock file on
> > promote and delete it on graceful stop. If the PostgreSQL instance couldn't
> > be stopped correctly, the lock files stays and the RA refuse to start it
> > the next time.  
> 
> Ah, you're right. Looking at the RA I see where it creates the file in
> question. The delete appears to be in the pgsql_real_stop() function (which
> makes sense), wrapped in an if block that checks for $1 being master and
> $OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little
> debugging code in there I see that when it hits that block on a cluster stop,
> $OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net, not a
> space, so the lock file is not removed:
> 
> if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; then
>     ocf_log info "Removing $PGSQL_LOCK."
>     rm -f $PGSQL_LOCK
> fi
> 
> It doesn't look like there is anywhere else where the file would be removed.

This looks quite wrong to me, for two reasons (I'll try to be clear):

1) the resource agent (RA) makes sure the timeline (TL) will not be incremented
during promotion.

As there is no documentation about that, I'm pretty sure this contortion comes
from limitations in very old versions of PostgreSQL (<= 9.1):

  * a slave wasn't able to cross a timeline (TL) via streaming replication,
only from WAL archives. That meant crossing a TL required restarting the slave
or temporarily cutting the streaming replication to force it back to the
archives
  * moreover, a standby could miss some transactions after a clean master
shutdown. That meant the old master couldn't rejoin the cluster safely as a
slave, as the TL was still the same...

See slides 35-37: http://www.slideshare.net/takmatsuo/2012929-pg-study-16012253

In my understanding, that's why we make sure there's no slave around before
shutting down the master: should the master come back later cleanly, we make
sure no one could have been promoted in the meantime.

Note that considering this issue and how the RA tries to avoid it, this check
that the slaves are shut down before the master is quite weak anyway...

Last but not least, the two PostgreSQL limitations the RA is messing with were
fixed a long time ago, in 9.3:
  * https://www.postgresql.org/docs/current/static/release-9-3.html#AEN138909
  *
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=985bd7d49726c9f178558491d31a570d47340459

...but it requires PostgreSQL 9.3+ for the timeline issue. By the way, I
suspect this is related to the "restart_on_promote" parameter of the RA.

2) from a recent discussion on this list (or maybe on -dev), RA devs should not
rely on OCF_RESKEY_CRM_meta_notify_* vars outside of "notify" actions.

> > [...]  
>  What can I do to fix this? What troubleshooting steps can I follow?
>  Thanks.  
> > 
> > I can not find the result 

Re: [ClusterLabs] Replicated PGSQL woes

2016-10-14 Thread Keisuke MORI
2016-10-14 2:04 GMT+09:00 Israel Brewster :
> Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql
> starts initially, but failover never happens.

> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: INFO: Master does not
> exist.
> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: WARNING: My data is
> out-of-date. status=DISCONNECT
> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: INFO: Master does not
> exist.
> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: WARNING: My data is
> out-of-date. status=DISCONNECT
>
> Those last two lines repeat indefinitely, but there is no indication that
> the cluster ever tries to promote centtest1 to master. Even if I completely
> shut down the cluster, and bring it back up only on centtest1, pacemaker
> refuses to start postgresql on centtest1 as a master.

This is because the data on centtest1 is considered "out-of-date" (as it
says :) and promoting the node to master might corrupt your database.

>
> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
>

It seems that the latest data should be only on centtest2 so the
recovering steps should be something like:
 - start centtest2 as master
 - take the basebackup from centtest2 to centtest1
 - start centtest1 as slave
 - make sure the replications is working properly

see below for details.
http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster
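
As a concrete sketch of those steps for this particular setup (pgdata path,
master IP and resource names are taken from earlier in the thread; the lock
file path assumes the pgsql RA's default tmpdir of /var/lib/pgsql/tmp, so
verify it locally):

  # on centtest1, the node being rebuilt as a slave
  pcs cluster stop centtest1.ravnalaska.net
  sudo -u postgres rm -rf /pgsql96/data/*
  sudo -u postgres /usr/pgsql-9.6/bin/pg_basebackup -h 10.211.55.200 -U postgres \
      -D /pgsql96/data -X stream -P
  rm -f /var/lib/pgsql/tmp/PGSQL.lock   # lock file left behind by the RA
  pcs cluster start centtest1.ravnalaska.net
  pcs resource cleanup msPostgresql

The recovery.conf normally does not need to be written by hand here, since the
pgsql RA generates it itself when it starts the instance as a slave.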


Also, it would be helpful to check the 'pgsql-data-status' and 'pgsql-status'
attributes displayed by 'crm_mon -A' to diagnose whether the replication is
going well or not.

The slave node should have attributes like the ones below; otherwise something
is going wrong with the replication and the node will never be promoted,
because it does not have the proper data.

```
* Node node2:
+ master-pgsql  : 100
+ pgsql-data-status : STREAMING|SYNC
+ pgsql-status  : HS:sync
```
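
A quick way to check those attributes without the full interactive display is a
one-shot crm_mon run; the attribute and node names below use the pgsql_96
resource id and hosts from this thread, so adjust them for other setups:

  # one-shot status including node attributes
  crm_mon -1 -A

  # read back a single attribute as the RA last recorded it
  crm_attribute -N centtest1.ravnalaska.net -n pgsql_96-data-status -G -l forever

crm_attribute -G only queries the value; it is handy to confirm what the agent
has recorded for a node before deciding whether the slave needs to be rebuilt.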



-- 
Keisuke MORI



Re: [ClusterLabs] Replicated PGSQL woes

2016-10-13 Thread Israel Brewster
On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais  
wrote:
> 
> On Thu, 13 Oct 2016 10:05:33 -0800
> Israel Brewster  wrote:
> 
>> On Oct 13, 2016, at 9:41 AM, Ken Gaillot  wrote:
>>> 
>>> On 10/13/2016 12:04 PM, Israel Brewster wrote:  
> [...]
> 
 But whatever- this is a cluster, it doesn't really matter which node
 things are running on, as long as they are running. So the cluster is
 working - postgresql starts, the master process is on the same node as
 the IP, you can connect, etc, everything looks good. Obviously the next
 thing to try is failover - should the master node fail, the slave node
 should be promoted to master. So I try testing this by shutting down the
 cluster on the primary server: "pcs cluster stop"
 ...and nothing happens. The master shuts down (uncleanly, I might add -
 it leaves behind a lock file that prevents it from starting again until
 I manually remove said lock file), but the slave is never promoted to  
>>> 
>>> This definitely needs to be corrected. What creates the lock file, and
>>> how is that entity managed?  
>> 
>> The lock file entity is created/managed by the postgresql process itself. On
>> launch, postgres creates the lock file to say it is running, and deletes said
>> lock file when it shuts down. To my understanding, its role in life is to
>> prevent a restart after an unclean shutdown so the admin is reminded to make
>> sure that the data is in a consistent state before starting the server again.
> 
> What is the name of this lock file? Where is it?
> 
> PostgreSQL does not create lock file. It creates a "postmaster.pid" file, but
> it does not forbid a startup if the new process doesn't find another process
> with the pid and shm shown in the postmaster.pid.
> 
> As far as I know, the pgsql resource agent create such a lock file on promote
> and delete it on graceful stop. If the PostgreSQL instance couldn't be stopped
> correctly, the lock files stays and the RA refuse to start it the next time.

Ah, you're right. Looking at the RA I see where it creates the file in
question. The delete appears to be in the pgsql_real_stop() function (which
makes sense), wrapped in an if block that checks for $1 being master and
$OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little
debugging code in there I see that when it hits that block on a cluster stop,
$OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net, not a
space, so the lock file is not removed:

if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; then
    ocf_log info "Removing $PGSQL_LOCK."
    rm -f $PGSQL_LOCK
fi

It doesn't look like there is anywhere else where the file would be removed.
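
The "little debugging code" can be as simple as logging the variable right
before that test; a sketch, not necessarily the exact lines used here, and the
lock file path below assumes the RA's default tmpdir of /var/lib/pgsql/tmp, so
check your own agent:

  # inside pgsql_real_stop(), just before the removal check
  ocf_log info "DEBUG: notify_slave_uname='$OCF_RESKEY_CRM_meta_notify_slave_uname'"

  # manual cleanup after an unclean stop, before the RA will start PostgreSQL again
  rm -f /var/lib/pgsql/tmp/PGSQL.lock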

> 
> [...]
 What can I do to fix this? What troubleshooting steps can I follow? Thanks.
> 
> I can not find the result of the stop operation in your log files, maybe the
> log from CentTest2 would be more useful.

Sure. I was looking at centtest1 because I was trying to figure out why it 
wouldn't promote, but if centtest2 never really stopped (properly) that could 
explain things. Here's the log from 2 when calling pcs cluster stop:

Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sending flush op to all hosts for: standby (true)
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sent update 26: standby=true
Oct 13 14:05:14 CentTest2 pacemaker: Waiting for shutdown of managed resources
Oct 13 14:05:14 CentTest2 crmd[9426]:   notice: Operation pgsql_96_notify_0: ok (node=centtest2.ravnalaska.net, call=21, rc=0, cib-update=0, confirmed=true)
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sending flush op to all hosts for: master-pgsql_96 (-INFINITY)
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sent update 28: master-pgsql_96=-INFINITY
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sending flush op to all hosts for: pgsql_96-master-baseline ()
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sent delete 30: node=centtest2.ravnalaska.net, attr=pgsql_96-master-baseline, id=, set=(null), section=status
Oct 13 14:05:14 CentTest2 attrd[9424]:   notice: Sent delete 32: node=centtest2.ravnalaska.net, attr=pgsql_96-master-baseline, id=, set=(null), section=status
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: Stopping PostgreSQL on demote.
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: stop_escalate(or stop_escalate_in_slave) time is adjusted to 50 based on the configured timeout.
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: server shutting down
Oct 13 14:05:15 CentTest2 pgsql(pgsql_96)[5107]: INFO: PostgreSQL is down
Oct 13 14:05:15 CentTest2 pgsql(pgsql_96)[5107]: INFO: Changing pgsql_96-status on centtest2.ravnalaska.net: PRI->STOP.
Oct 13 14:05:15 CentTest2 attrd[9424]:   notice: Sending flush op to all hosts for: pgsql_96-status (STOP)
Oct 13 14:05:15 

Re: [ClusterLabs] Replicated PGSQL woes

2016-10-13 Thread Jehan-Guillaume de Rorthais
On Thu, 13 Oct 2016 10:05:33 -0800
Israel Brewster  wrote:

> On Oct 13, 2016, at 9:41 AM, Ken Gaillot  wrote:
> > 
> > On 10/13/2016 12:04 PM, Israel Brewster wrote:  
[...]
 
> >> But whatever- this is a cluster, it doesn't really matter which node
> >> things are running on, as long as they are running. So the cluster is
> >> working - postgresql starts, the master process is on the same node as
> >> the IP, you can connect, etc, everything looks good. Obviously the next
> >> thing to try is failover - should the master node fail, the slave node
> >> should be promoted to master. So I try testing this by shutting down the
> >> cluster on the primary server: "pcs cluster stop"
> >> ...and nothing happens. The master shuts down (uncleanly, I might add -
> >> it leaves behind a lock file that prevents it from starting again until
> >> I manually remove said lock file), but the slave is never promoted to  
> > 
> > This definitely needs to be corrected. What creates the lock file, and
> > how is that entity managed?  
> 
> The lock file entity is created/managed by the postgresql process itself. On
> launch, postgres creates the lock file to say it is running, and deletes said
> lock file when it shuts down. To my understanding, its role in life is to
> prevent a restart after an unclean shutdown so the admin is reminded to make
> sure that the data is in a consistent state before starting the server again.

What is the name of this lock file? Where is it?

PostgreSQL does not create a lock file. It creates a "postmaster.pid" file, but
it does not forbid a startup if the new process doesn't find another process
with the pid and shm shown in the postmaster.pid.

As far as I know, the pgsql resource agent creates such a lock file on promote
and deletes it on graceful stop. If the PostgreSQL instance couldn't be stopped
correctly, the lock file stays and the RA refuses to start it the next time.
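
If in doubt, the name and location of that lock file can be read straight out
of the agent itself; a quick check, assuming the usual resource-agents install
path:

  grep -n 'PGSQL_LOCK=' /usr/lib/ocf/resource.d/heartbeat/pgsql

On a default install this resolves to PGSQL.lock under the RA's tmpdir
(typically /var/lib/pgsql/tmp), but verify it against your own copy of the
agent.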

[...]
> >> What can I do to fix this? What troubleshooting steps can I follow? Thanks.

I cannot find the result of the stop operation in your log files; maybe the
log from CentTest2 would be more useful. But I can find this:

  Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
  centtest2.ravnalaska.net for shutdown
  ...
  Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
  centtest2.ravnalaska.net for shutdown

Which means the stop operation probably raised an error, leading to a fencing
of the node. In this circumstance, I bet PostgreSQL wasn't able to stop
correctly and the lock file stayed in place.

Could you please show us your full cluster setup?
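
For reference, something like the following usually captures the whole setup
(pcs 0.9 syntax; adjust as needed):

  pcs config
  pcs status --full
  pcs cluster cib > cib.xml

pcs config dumps resources, constraints and properties in one go, and the raw
CIB is useful when the pcs summary hides a detail.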

By the way, did you have a look at the PAF project? 

  http://dalibo.github.io/PAF/
  http://dalibo.github.io/PAF/documentation.html

The v1.1 version for EL6 is not ready yet, but you might want to give it a
try: https://github.com/dalibo/PAF/tree/v1.1

I would recommend EL7 and PAF 2.0, published, packaged, ready to use.

Regards,

-- 
Jehan-Guillaume (ioguix) de Rorthais
Dalibo



Re: [ClusterLabs] Replicated PGSQL woes

2016-10-13 Thread Ken Gaillot
On 10/13/2016 12:04 PM, Israel Brewster wrote:
> Summary: Two-node cluster setup with latest pgsql resource agent.
> Postgresql starts initially, but failover never happens.
> 
> Details:
> 
> I'm trying to get a cluster set up with Postgresql 9.6 in a streaming
> replication using named slots scenario. I'm using the latest pgsql
> Resource Agent, which does appear to support the named replication slot
> feature, and I've pulled in the various utility functions the RA uses
> that weren't available in my base install, so the RA itself no longer
> gives me errors.
> 
> Setup: Two machines, centtest1 and centtest2. Both are running CentOS
> 6.8. Centtest1 has an IP of 10.211.55.100, and centtest2 has an IP of
> 10.211.55.101. The cluster is set up and functioning, with a shared
> virtual IP resource at 10.211.55.200. Postgresql has been set up and
> tested functioning properly on both nodes with centtest1 as the master
> and centtest2 as the streaming replica slave. 
> 
> I then set up the postgresql master/slave resource using the following
> commands:
> 
> pcs resource create pgsql_96 pgsql \
> pgctl="/usr/pgsql-9.6/bin/pg_ctl" \
> logfile="/var/log/pgsql/test2.log" \
> psql="/usr/pgsql-9.6/bin/psql" \
> pgdata="/pgsql96/data" \
> rep_mode="async" \
> repuser="postgres" \
> node_list="tcentest1.ravnalaska.net 
> centtest2.ravnalaska.net " \
> master_ip="10.211.55.200" \
> archive_cleanup_command="" \
> restart_on_promote="true" \
> replication_slot_name="centtest_2_slot" \
> monitor_user="postgres" \
> monitor_password="SuperSecret" \
> op start timeout="60s" interval="0s" on-fail="restart" \
> op monitor timeout="60s" interval="4s" on-fail="restart" \
> op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" \
> op promote timeout="60s" interval="0s" on-fail="restart" \
> op demote timeout="60s" interval="0s" on-fail=stop \
> op stop timeout="60s" interval="0s" on-fail="block" \
> op notify timeout="60s" interval="0s";
> 
> pcs resource master msPostgresql pgsql_96 master-max=1 master-node-max=1
> clone-max=2 clone-node-max=1 notify=true
> 
> pcs constraint colocation add virtual_ip with Master msPostgresql INFINITY
> pcs constraint order promote msPostgresql then start virtual_ip
> symmetrical=false score=INFINITY
> pcs constraint order demote  msPostgresql then stop  virtual_ip
> symmetrical=false score=0
> 
> My preference would be that the master runs on centtest1, so I add the
> following constraint as well:
> 
> pcs constraint location --master msPostgresql prefers
> centtest1.ravnalaska.net=50
> 
> When I then start the cluster, I first see *both* machines come up as
> "slave", which I feel is somewhat odd, however the cluster software
> quickly figures things out and promotes centtest2 to master. I've tried

This is inherent to pacemaker's model of multistate resources. Instances
are always started in slave mode, and then promotion to master is a
separate step.

> this a dozen different times, and it *always* promotes centtest2 to
> master - even if I put INFINITY in for the location constraint.

Surprisingly, location constraints do not directly support limiting to
one role (your "--master" option is ignored, and I'm surprised it
doesn't give an error). To do what you want, you need a rule, like:

pcs constraint location msPostgresql rule \
    role=master score=50 \
    #uname eq centtest1.ravnalaska.net


> But whatever- this is a cluster, it doesn't really matter which node
> things are running on, as long as they are running. So the cluster is
> working - postgresql starts, the master process is on the same node as
> the IP, you can connect, etc, everything looks good. Obviously the next
> thing to try is failover - should the master node fail, the slave node
> should be promoted to master. So I try testing this by shutting down the
> cluster on the primary server: "pcs cluster stop"
> ...and nothing happens. The master shuts down (uncleanly, I might add -
> it leaves behind a lock file that prevents it from starting again until
> I manually remove said lock file), but the slave is never promoted to

This definitely needs to be corrected. What creates the lock file, and
how is that entity managed?

> master. Neither pcs status or crm_mon show any errors, but centtest1
> never becomes master.

I remember a situation where a resource agent improperly set master scores,
which led to no master being promoted. I don't remember the details, though.
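
One way to see the scores the policy engine is actually working with is to ask
it directly; a sketch against the live CIB (crm_simulate ships with pacemaker,
and the exact output layout differs between versions):

  # show allocation and promotion scores for the current cluster state
  crm_simulate -sL

If the would-be master never gets a positive master score for the pgsql
resource, promotion will never be attempted, which matches the behaviour
described in this thread.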

> 
> If instead of stoping the cluster on centtest2, I try to simply move the
> master using the command "pcs resource move --master msPostgresql", I
> first run into the aforementioned unclean shutdown issue (lock file left
> behind that has to be manually removed), and after removing the lock
> file, I wind up with *both* nodes being slaves, and no master node. "pcs
> resource clear --master msPostgresql" re-promotes centtest2 to master.
> 
> What it looks like is that for 

[ClusterLabs] Replicated PGSQL woes

2016-10-13 Thread Israel Brewster
Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql
starts initially, but failover never happens.

Details:

I'm trying to get a cluster set up with Postgresql 9.6 in a streaming
replication using named slots scenario. I'm using the latest pgsql Resource
Agent, which does appear to support the named replication slot feature, and
I've pulled in the various utility functions the RA uses that weren't available
in my base install, so the RA itself no longer gives me errors.

Setup: Two machines, centtest1 and centtest2. Both are running CentOS 6.8.
Centtest1 has an IP of 10.211.55.100, and centtest2 has an IP of 10.211.55.101.
The cluster is set up and functioning, with a shared virtual IP resource at
10.211.55.200. Postgresql has been set up and tested functioning properly on
both nodes with centtest1 as the master and centtest2 as the streaming replica
slave.

I then set up the postgresql master/slave resource using the following
commands:

pcs resource create pgsql_96 pgsql \
pgctl="/usr/pgsql-9.6/bin/pg_ctl" \
logfile="/var/log/pgsql/test2.log" \
psql="/usr/pgsql-9.6/bin/psql" \
pgdata="/pgsql96/data" \
rep_mode="async" \
repuser="postgres" \
node_list="tcentest1.ravnalaska.net centtest2.ravnalaska.net" \
master_ip="10.211.55.200" \
archive_cleanup_command="" \
restart_on_promote="true" \
replication_slot_name="centtest_2_slot" \
monitor_user="postgres" \
monitor_password="SuperSecret" \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="4s" on-fail="restart" \
op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" \
op promote timeout="60s" interval="0s" on-fail="restart" \
op demote timeout="60s" interval="0s" on-fail=stop \
op stop timeout="60s" interval="0s" on-fail="block" \
op notify timeout="60s" interval="0s";

pcs resource master msPostgresql pgsql_96 master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

pcs constraint colocation add virtual_ip with Master msPostgresql INFINITY
pcs constraint order promote msPostgresql then start virtual_ip symmetrical=false score=INFINITY
pcs constraint order demote  msPostgresql then stop  virtual_ip symmetrical=false score=0

My preference would be that the master runs on centtest1, so I add the
following constraint as well:

pcs constraint location --master msPostgresql prefers centtest1.ravnalaska.net=50

When I then start the cluster, I first see *both* machines come up as "slave",
which I feel is somewhat odd, however the cluster software quickly figures
things out and promotes centtest2 to master. I've tried this a dozen different
times, and it *always* promotes centtest2 to master - even if I put INFINITY in
for the location constraint.

But whatever- this is a cluster, it doesn't really matter which node things are
running on, as long as they are running. So the cluster is working - postgresql
starts, the master process is on the same node as the IP, you can connect, etc,
everything looks good. Obviously the next thing to try is failover - should the
master node fail, the slave node should be promoted to master. So I try testing
this by shutting down the cluster on the primary server: "pcs cluster stop"
...and nothing happens. The master shuts down (uncleanly, I might add - it
leaves behind a lock file that prevents it from starting again until I manually
remove said lock file), but the slave is never promoted to master.

Neither pcs status nor crm_mon shows any errors, but centtest1 never becomes
master.

If instead of stopping the cluster on centtest2, I try to simply move the
master using the command "pcs resource move --master msPostgresql", I first run
into the aforementioned unclean shutdown issue (lock file left behind that has
to be manually removed), and after removing the lock file, I wind up with
*both* nodes being slaves, and no master node. "pcs resource clear --master
msPostgresql" re-promotes centtest2 to master.

What it looks like is that for some reason pacemaker/corosync is absolutely
refusing to ever make centtest1 a master - even when I explicitly tell it to,
or when it is the only node left.

Looking at the messages log when I do the node shutdown test I see this:

Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: On loss of CCM Quorum: Ignore
Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Stop    virtual_ip#011(centtest2.ravnalaska.net)
Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Demote  pgsql_96:0#011(Master -> Stopped centtest2.ravnalaska.net)
Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Calculated Transition 193: /var/lib/pacemaker/pengine/pe-input-500.bz2
Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 43: notify pgsql_96_pre_notify_demote_0 on centtest2.ravnalaska.net
Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 45: notify pgsql_96_pre_notify_demote_0 on