Re: [ClusterLabs] Frequent PAF log messages - Forbidding promotion on in state "startup"

2018-05-15 Thread Shobe, Casey
Thanks, I should have seen that.  I just assumed that everything was working 
fine because `pcs status` shows no errors.

This leads me to another question - is there a way to trigger a rebuild of a 
slave with pcs?  Or do I need to use `pcs cluster stop`, then manually do a new 
pg_basebackup, copy in the recovery.conf, and `pcs cluster start` for each 
standby node that needs rebuilding?

> On May 13, 2018, at 5:58 AM, Jehan-Guillaume de Rorthais  
> wrote:
> 
> On Fri, 11 May 2018 16:25:18 +
> "Shobe, Casey"  wrote:
> 
>> I'm using PAF and my corosync log ends up filled with messages like this
>> (about 3 times per minute for each standby node):
>> 
>> pgsqlms(postgresql-10-main)[26822]: 2018/05/11_06:47:08  INFO: Forbidding
>> promotion on "d-gp2-dbp63-1" in state "startup"
>> pgsqlms(postgresql-10-main)[26822]: 2018/05/11_06:47:08  INFO: Forbidding
>> promotion on "d-gp2-dbp63-2" in state "startup"
>> 
>> What is the cause of this logging and does it indicate something is wrong
>> with my setup?
> 
> Yes, something is wrong with your setup. When a PostgreSQL standby is starting
> up, it tries to hook replication with the primary instance: this is the
> "startup" state. As soon as it is connected, it starts replicating and tries to
> catch up with the master's location: this is the "catchup" state. As soon as
> the standby is in sync with the master, it enters the "streaming" state.
> See column "state" in the doc:
> https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW
> 
> If you have one standby stuck in the "startup" state, that means it was able to
> connect to the master but is not replicating from it for some reason
> (a different, incompatible, or non-catchable timeline?).
> 
> Look for errors in your PostgreSQL logs on the primary and the standby.
> 
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Frequent PAF log messages - Forbidding promotion on in state "startup"

2018-05-15 Thread Jehan-Guillaume de Rorthais
On Mon, 14 May 2018 19:08:47 +
"Shobe, Casey"  wrote:

> > We do not trigger an error for such a scenario because it would require the
> > cluster to react... and there's really no way the cluster can solve such an
> > issue. So we just set a negative score, which is already unusual enough to
> > be noticed in most situations.  
> 
> Where is this negative score to be noticed?

I usually use "crm_mon -frnAo"

* f: show failcounts
* r: show all resources, even inactive ones
* n: group by node instead of resource
* A: show node attributes <- this one should show you the scores
* o: show operation history

Note that you can toggle these options interactively while crm_mon is already
running. Hit 'h' for help.
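If you just want a snapshot for a script or a quick check rather than the interactive view, crm_mon's one-shot flag prints once and exits:

```shell
# Same options as above, but print the status once instead of
# running the interactive, auto-refreshing display.
crm_mon -frnAo -1
```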

[...]
> > I advise you to put the recovery.conf.pcmk outside of the PGDATA and use the
> > resource parameter "recovery_template". It would save you the step of
> > dealing with the recovery.conf. But this is the simplest procedure, yes.  
> 
> I do this (minus the .pcmk suffix) already, but was just being overly
> paranoid about avoiding a multi-master situation.  I guess there is no need
> for me to manually copy in the recovery.conf.

When cloning the primary, there shouldn't be an existing "recovery.conf". It may
have a "recovery.done", but this is not a problem.

When cloning from a standby, I can understand you might want to be overly
paranoid and delete the recovery.conf file.

But in either case, on resource start, PAF will create the
"PGDATA/recovery.conf" file based on your template anyway. There is no need to
create it yourself.

> > Should you keep the cluster up on this node for some other resources, you
> > could temporary exclude your pgsql-ha from this node so the cluster stop
> > considering it for this particular node while you rebuild your standby.
> > Here is some inspiration:
> > https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#forbidding-a-paf-resource-on-a-node
> >   
> 
> I was just reading that page before I saw this E-mail.  Another question I
> had, though, is how I could deploy a change to the PostgreSQL configuration
> that requires a restart of the service, with minimal service interruption.
> For the moment, I'm assuming I need to do a `pcs cluster stop; pcs cluster
> start` on each standby node, then the same on the master, which should cause
> a failover to one of the standby nodes.

According to the pcs manpage, you can restart a resource on one node using:

  pcs resource restart <resource id> [node]

> If I need to change max_connections, though, I'm really not sure what to do,
> since the standby nodes will refuse to replicate from a master with a
> different max_connections setting.

You are missing a subtle detail here: a standby will refuse to start if its
max_connections is lower than the primary's.

So you can change your max_connections:

* to a higher value: change the standbys first, then the primary
* to a lower value: change the primary first, then the standbys
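As a sketch of the "higher value" case, under assumed names that are not from this thread (resource "pgsql-ha", standbys "srv2"/"srv3", primary "srv1", Debian-style config path):

```shell
# Raise max_connections on the standbys first, restarting each one...
for node in srv2 srv3; do
  ssh "$node" "sed -i 's/^#\?max_connections.*/max_connections = 300/' \
    /etc/postgresql/10/main/postgresql.conf"
  pcs resource restart pgsql-ha "$node"
done

# ...then change and restart the primary last.
ssh srv1 "sed -i 's/^#\?max_connections.*/max_connections = 300/' \
  /etc/postgresql/10/main/postgresql.conf"
pcs resource restart pgsql-ha srv1
```

For a lower value, run the primary step first and the standby loop second.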

> On a related note, is there perhaps a pcs command that would issue a sighup
> to the master postgres process across all nodes, for when I change a
> configuration option that only requires a reload?

No. There are old discussions and patches about such a feature in Pacemaker,
but nothing ended up in core. See:
https://lists.clusterlabs.org/pipermail/pacemaker/2014-February/044686.html

Note that PAF uses a dummy function for the reload action anyway. But we could
easily add a "pg_ctl reload" to it if pcs (or crmsh) allowed triggering it
manually.

Here again, you can rely on ansible, salt, a plain ssh loop, etc. Either use
"pg_ctl -D <datadir> reload" or a simple query like "SELECT pg_reload_conf()".
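For example, a minimal ssh loop (hostnames are placeholders; any of the automation tools above would do the same job):

```shell
# Ask every node's postgres to re-read its configuration files.
# pg_reload_conf() returns true when the reload signal was sent.
for host in srv1 srv2 srv3; do
  ssh "$host" 'sudo -u postgres psql -Atc "SELECT pg_reload_conf()"'
done
```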

> I was hoping optimistically that pcs+paf included more administrative
> functionality, since the systemctl commands such as reload can no longer be
> used.

It would be nice, I agree.

> Thank you for your assistance!

You are very welcome.



Re: [ClusterLabs] Frequent PAF log messages - Forbidding promotion on in state "startup"

2018-05-14 Thread Jehan-Guillaume de Rorthais
On Mon, 14 May 2018 16:43:52 +
"Shobe, Casey"  wrote:

> Thanks, I should have seen that.  I just assumed that everything was working
> fine because `pcs status` shows no errors.

We do not trigger an error for such a scenario because it would require the
cluster to react... and there's really no way the cluster can solve such an
issue. So we just set a negative score, which is already unusual enough to be
noticed in most situations.

> This leads me to another question - is there a way to trigger a rebuild of a
> slave with pcs?

Nope, pcs/pacemaker has no such thing. You can either write a solid, detailed
manual procedure or try some automation tool, e.g. ansible, salt, etc.

>  Or do I need to use `pcs cluster stop`, then manually do a
> new pg_basebackup, copy in the recovery.conf, and `pcs cluster start` for
> each standby node that needs rebuilding?

I advise you to put the recovery.conf.pcmk outside of the PGDATA and use the
resource parameter "recovery_template". It would save you the step of dealing
with the recovery.conf. But this is the simplest procedure, yes.
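For reference, a minimal template in the spirit of the PAF documentation (the host and application_name values are placeholders; PAF expects application_name to match the node name known to Pacemaker):

```
standby_mode = on
primary_conninfo = 'host=192.168.1.50 user=replication application_name=srv2'
recovery_target_timeline = 'latest'
```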

Should you keep the cluster up on this node for some other resources, you could
temporarily exclude your pgsql-ha from this node, so the cluster stops
considering it for this particular node while you rebuild your standby. Here is
some inspiration:
https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#forbidding-a-paf-resource-on-a-node
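A sketch of that approach, assuming a resource named "pgsql-ha", a standby node "srv2", a primary "srv1", and a Debian-style PGDATA (none of these names come from this thread):

```shell
# Forbid the resource on srv2 (adds a -INFINITY location constraint);
# the rest of the cluster keeps running.
pcs resource ban pgsql-ha srv2

# Rebuild the standby from the primary. With recovery_template outside
# PGDATA, PAF recreates recovery.conf on the next resource start.
ssh srv2 'rm -rf /var/lib/postgresql/10/main && \
  pg_basebackup -h srv1 -U replication -D /var/lib/postgresql/10/main -X stream -P'

# Lift the constraint; the standby is started again on srv2.
pcs resource clear pgsql-ha srv2
```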


> > On May 13, 2018, at 5:58 AM, Jehan-Guillaume de Rorthais 
> > wrote:
> > 
> > On Fri, 11 May 2018 16:25:18 +
> > "Shobe, Casey"  wrote:
> >   
> >> I'm using PAF and my corosync log ends up filled with messages like this
> >> (about 3 times per minute for each standby node):
> >> 
> >> pgsqlms(postgresql-10-main)[26822]: 2018/05/11_06:47:08  INFO:
> >> Forbidding promotion on "d-gp2-dbp63-1" in state "startup"
> >> pgsqlms(postgresql-10-main)[26822]: 2018/05/11_06:47:08  INFO:
> >> Forbidding promotion on "d-gp2-dbp63-2" in state "startup"
> >> 
> >> What is the cause of this logging and does it indicate something is wrong
> >> with my setup?  
> > 
> > Yes, something is wrong with your setup. When a PostgreSQL standby is
> > starting up, it tries to hook replication with the primary instance: this
> > is the "startup" state. As soon as it is connected, it starts replicating
> > and tries to catch up with the master's location: this is the "catchup"
> > state. As soon as the standby is in sync with the master, it enters the
> > "streaming" state. See column "state" in the doc:
> > https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW
> > 
> > If you have one standby stuck in the "startup" state, that means it was
> > able to connect to the master but is not replicating from it for some
> > reason (a different, incompatible, or non-catchable timeline?).
> > 
> > Look for errors in your PostgreSQL logs on the primary and the standby.
> > 
> >   
> 



-- 
Jehan-Guillaume de Rorthais
Dalibo


Re: [ClusterLabs] Frequent PAF log messages - Forbidding promotion on in state "startup"

2018-05-13 Thread Jehan-Guillaume de Rorthais
On Fri, 11 May 2018 16:25:18 +
"Shobe, Casey"  wrote:

> I'm using PAF and my corosync log ends up filled with messages like this
> (about 3 times per minute for each standby node):
> 
> pgsqlms(postgresql-10-main)[26822]: 2018/05/11_06:47:08  INFO: Forbidding
> promotion on "d-gp2-dbp63-1" in state "startup"
> pgsqlms(postgresql-10-main)[26822]: 2018/05/11_06:47:08  INFO: Forbidding
> promotion on "d-gp2-dbp63-2" in state "startup"
> 
> What is the cause of this logging and does it indicate something is wrong
> with my setup?

Yes, something is wrong with your setup. When a PostgreSQL standby is starting
up, it tries to hook replication with the primary instance: this is the
"startup" state. As soon as it is connected, it starts replicating and tries to
catch up with the master's location: this is the "catchup" state. As soon as
the standby is in sync with the master, it enters the "streaming" state.
See column "state" in the doc:
https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW

If you have one standby stuck in the "startup" state, that means it was able to
connect to the master but is not replicating from it for some reason
(a different, incompatible, or non-catchable timeline?).

Look for errors in your PostgreSQL logs on the primary and the standby.
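You can watch a standby walk through these states from the primary side, using the standard pg_stat_replication view:

```shell
# Run on the primary; prints one row per connected standby with its
# replication state (startup, catchup, or streaming).
sudo -u postgres psql -Atc \
  "SELECT application_name, state, sync_state FROM pg_stat_replication;"
```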



[ClusterLabs] Frequent PAF log messages - Forbidding promotion on in state "startup"

2018-05-11 Thread Shobe, Casey
I'm using PAF and my corosync log ends up filled with messages like this (about 
3 times per minute for each standby node):

pgsqlms(postgresql-10-main)[26822]: 2018/05/11_06:47:08  INFO: Forbidding 
promotion on "d-gp2-dbp63-1" in state "startup"
pgsqlms(postgresql-10-main)[26822]: 2018/05/11_06:47:08  INFO: Forbidding 
promotion on "d-gp2-dbp63-2" in state "startup"

What is the cause of this logging and does it indicate something is wrong with 
my setup?

Thank you,
-- 
Casey