[ClusterLabs] How to change the "pcs constraint colocation set"

2018-05-14 Thread 范国腾
Hi,

We have two VIP resources, and we use the following command to keep them on
different nodes:

pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 setoptions 
score=-1000

Now we have added a new node to the cluster, along with a new VIP. We want the
colocation constraint set to become:
pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 
setoptions score=-1000
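
We guess one way might be to remove the existing set constraint by its ID and
re-create it with all three VIPs, for example (the constraint ID below is only
made up; the real ID should show up in "pcs constraint --full"):

pcs constraint --full
pcs constraint remove colocation_set_slave_ips
pcs constraint colocation set pgsql-slave-ip1 pgsql-slave-ip2 pgsql-slave-ip3 setoptions score=-1000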
 
Is that the right way to do it, or how should we change the constraint set?

Thanks
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Frequent PAF log messages - Forbidding promotion on in state "startup"

2018-05-14 Thread Jehan-Guillaume de Rorthais
On Mon, 14 May 2018 16:43:52 +
"Shobe, Casey"  wrote:

> Thanks, I should have seen that.  I just assumed that everything was working
> fine because `pcs status` shows no errors.

We do not raise an error for such a scenario, because an error would require the
cluster to react...and there's really no way the cluster can solve this kind of
issue by itself. So we just set a negative score, which is already unusual
enough to get noticed in most situations.

> This leads me to another question - is there a way to trigger a rebuild of a
> slave with pcs?

Nope, pcs/pacemaker has no such thing. You can either write a solid, detailed
manual procedure or use some automation tool, e.g. Ansible, Salt, etc.

>  Or do I need to use `pcs cluster stop`, then manually do a
> new pg_basebackup, copy in the recovery.conf, and `pcs cluster start` for
> each standby node needing rebuilt?

I advise you to keep the recovery.conf.pcmk template outside of the PGDATA and
point the resource parameter "recovery_template" at it. That saves you the step
of dealing with the recovery.conf yourself. But yes, this is the simplest
procedure.
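
Roughly, the per-standby rebuild would then look like this (just a sketch; the
node name, primary host, replication user and PGDATA path are placeholders to
adapt to your setup):

pcs cluster stop d-gp2-dbp63-2
rm -rf /var/lib/postgresql/10/main
pg_basebackup -h primary-host -U replication -D /var/lib/postgresql/10/main -X stream -P
pcs cluster start d-gp2-dbp63-2

With "recovery_template" pointing outside the PGDATA, PAF copies the template
to PGDATA/recovery.conf by itself at start, so there is no manual copy step.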

Should you need to keep the cluster up on this node for some other resources,
you could temporarily exclude your pgsql-ha resource from this node, so the
cluster stops considering it for this particular node while you rebuild your
standby. Here is some inspiration:
https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#forbidding-a-paf-resource-on-a-node
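
In short, something like this (again just a sketch, the node name is an
example):

pcs resource ban pgsql-ha d-gp2-dbp63-2
# ...rebuild the standby as above...
pcs resource clear pgsql-ha d-gp2-dbp63-2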


> > On May 13, 2018, at 5:58 AM, Jehan-Guillaume de Rorthais 
> > wrote:
> > 
> > On Fri, 11 May 2018 16:25:18 +
> > "Shobe, Casey"  wrote:
> >   
> >> I'm using PAF and my corosync log ends up filled with messages like this
> >> (about 3 times per minute for each standby node):
> >> 
> >> pgsqlms(postgresql-10-main)[26822]: 2018/05/11_06:47:08  INFO:
> >> Forbidding promotion on "d-gp2-dbp63-1" in state "startup"
> >> pgsqlms(postgresql-10-main)[26822]: 2018/05/11_06:47:08  INFO:
> >> Forbidding promotion on "d-gp2-dbp63-2" in state "startup"
> >> 
> >> What is the cause of this logging and does it indicate something is wrong
> >> with my setup?  
> > 
> > Yes, something is wrong with your setup. When a PostgreSQL standby is
> > starting up, it tries to establish replication with the primary instance:
> > this is the "startup" state. As soon as it is connected, it starts
> > replicating and tries to catch up with the master location: this is the
> > "catchup" state. As soon as the standby is in sync with the master, it
> > enters the "streaming" state. See the "state" column in the doc:
> > https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW
> > 
> > If you have one standby stuck in the "startup" state, that means it was able
> > to connect to the master but is not replicating from it for some reason
> > (a different or incompatible timeline it cannot catch up with?).
> > 
> > Look for errors in your PostgreSQL logs on the primary and the standby.
> > 
> >   
> 
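By the way, a quick way to watch these states is to query pg_stat_replication
on the primary, e.g.:

psql -x -c "SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;"

A standby that never leaves "startup" there is the one to investigate.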



-- 
Jehan-Guillaume de Rorthais
Dalibo


Re: [ClusterLabs] Two-node cluster fencing

2018-05-14 Thread Ken Gaillot
On Sat, 2018-05-12 at 12:51 -0600, Casey & Gina Shobe wrote:
> Without fencing, if the primary is powered off abruptly (e.g. if one
> of your ESX servers crashes), the standby will not become primary,
> and you will need to promote it manually.  We had exactly this
> scenario happen last week with a 2-node cluster.  Without fencing,
> you don't have high availability.  If you don't need high
> availability, you probably don't need pacemaker.

To go into this a bit more, the reason it's not safe to operate without
fencing is the situation where the active node still has the IP, but
has stopped responding for whatever reason (crippling load, failed
disk, flaky network card, etc.). If the passive node brought up the IP
in this case without first fencing the active node, both VMs would
advertise the IP, and packets would (practically speaking) randomly go
to one or the other, making any sort of communication impossible.
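
A side note on the VMware fencing discussed below: if the nodes are VMs managed
through vCenter/ESX, the fence_vmware_soap agent talks to the API over HTTPS
and avoids building the perl CLI altogether. A rough sketch (the address,
credentials and node-to-VM mapping below are placeholders):

pcs stonith create vmfence fence_vmware_soap \
    ipaddr=vcenter.example.com login=fenceuser passwd=secret \
    ssl=1 ssl_insecure=1 \
    pcmk_host_map="node1:vm-node1;node2:vm-node2"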

> There are instructions for setting up fencing with vmware here:
> https://www.hastexo.com/resources/hints-and-kinks/fencing-vmware-virtualized-pacemaker-nodes/
> 
> One note - rather than the SDK, I believe you actually need the CLI
> package, which can be found here:
> https://my.vmware.com/web/vmware/details?downloadGroup=VCLI600=491
> 
> Good luck - I haven't managed to get it to build yet - vmware gives
> you a black box installer script that compiles a bunch of dependent
> perl modules, and it ends up getting hung with 100% CPU usage for
> days - digging into this further with lsof and friends, it seems to
> be prompting for where your apache source code is to compile
> mod_perl.  Why does it need mod_perl for the CLI??  Anyways, I
> haven't managed to get past that roadblock yet.  I'm using Ubuntu 16
> so it may happen to just work better on your RHEL instances.  If you
> have a different ESX version than 6.0, you may have better luck as
> well.
> 
> Best wishes,
-- 
Ken Gaillot 


Re: [ClusterLabs] Ansible role to install corosync and pacemaker

2018-05-14 Thread Ken Gaillot
On Fri, 2018-05-11 at 10:51 +, Singh, Niraj wrote:
> Hi Team,
>  
> I am working on masakari/masakari-monitors ansible role.
> Ref: https://github.com/openstack/masakari  //project
> Ref: https://github.com/openstack/openstack-ansible-os_masakari
> //role(developing)
>  
> “Masakari provides Virtual Machine High Availability (VMHA) service
> for OpenStack clouds by automatically recovering
> the KVM-based Virtual Machine(VM)s from failure events such as VM
> process down, provisioning process down, and
> nova-compute host failure. It also provides an API service to manage
> and control the automated rescue mechanism.”
>  
> To test the above role I also need to install corosync and pacemaker
> on compute nodes.
>  
> So my question is: which ansible role should I use to install corosync
> and pacemaker?
>  
>  
> Thanks
> Niraj Singh

Hi Niraj,

There are no "official" ansible playbooks for pacemaker and corosync
that I'm aware of, but various users have made some available online.
It's an area I'd like to see more attention given to, but unfortunately
I personally don't have the time at the moment.
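
If it helps as a starting point, on CentOS/RHEL 7 the steps such a role needs
to wrap are roughly the following (a sketch; the cluster name, node names and
hacluster password are placeholders):

yum install -y pacemaker corosync pcs
systemctl enable --now pcsd
echo "hacluster:ChangeMe" | chpasswd
pcs cluster auth node1 node2 -u hacluster -p ChangeMe
pcs cluster setup --name demo node1 node2
pcs cluster start --all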
-- 
Ken Gaillot 


[ClusterLabs] Antw: Re: Antw: stonith continues to reboot server once fencing occurs

2018-05-14 Thread Ulrich Windl
Hi!

I'm wondering about this:
vmhost1-fsl.bcn is shutting down

That doesn't read like a STONITH, but like a regular shutdown (which may
hang).

The other thing that looks strange for a two-node cluster is this:
[11130] vmhost0-fsl.jsc.nasa.gov corosync notice  [TOTEM ] A new membership
(192.168.1.140:184) was formed. Members left: 2

This sounds odd, too:
[11130] vmhost0-fsl.jsc.nasa.gov corosync warning [MAIN  ] Totem is unable to
form a cluster because of an operating system or network fault. The most common
cause of this message is that the local firewall is configured improperly.
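
A quick way to cross-check what corosync itself sees on each node (just a
sketch):

corosync-cfgtool -s                 # ring/interface status on the local node
corosync-cmapctl | grep members     # current membership as corosync sees it
firewall-cmd --state                # double-check the firewall really is off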

Regards,
Ulrich



>>> "Dickerson, Charles Chuck (JSC-EG)[Jacobs Technology, Inc.]"
 schrieb am 11.05.2018 um 19:02 in Nachricht
<0c5150d42e2b3f43b83ec3f62b3b8ee421d5f...@ndjsmbx201.ndc.nasa.gov>:
> I have attached the /var/log/cluster/corosync.log here.
> 
> The fenced node continues to be rebooted even after the stonith timeout.  
> The only way I have of stopping the reboot cycle is to completely stop the 
> cluster on the remaining node.
> 
> Stonith should be able to detect that the fenced node was successfully 
> rebooted and stop trying to fence it.  I have done this using both the cycle
> method and the onoff method, both methods have the same result.
> 
> Chuck Dickerson
> Jacobs
> JSC - EG3
> (281) 244-5895
> 
> -----Original Message-----
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ulrich Windl
> Sent: Friday, May 11, 2018 8:47 AM
> To: users@clusterlabs.org 
> Subject: [ClusterLabs] Antw: stonith continues to reboot server once fencing occurs
> 
> Hi!
> 
> Could it be that the node reboots faster than the stonith timeout? So the 
> node will unexpectedly come up...
> 
> Without logs it's hard to say.
> 
> Regards,
> Ulrich
> 
 "Dickerson, Charles Chuck (JSC-EG)[Jacobs Technology, Inc.]"
>  schrieb am 11.05.2018 um 15:32 in Nachricht
> <0c5150d42e2b3f43b83ec3f62b3b8ee421d5f...@ndjsmbx201.ndc.nasa.gov>:
>> I have a 2-node cluster: once fencing occurs, the fenced node is continually
>> rebooted every time it comes up.
>> 
>> Configuration:  2 identical nodes - CentOS 7.4, pacemaker 1.1.18, pcs
>> 0.9.162, fencing configured using fence_ipmilan.  The cluster is set to
>> ignore quorum and stonith is enabled.  Firewalld has been disabled.
>> 
>> I can manually issue the fence_ipmilan command and the specified node 
>> is rebooted, comes back up and fence_ipmilan sees this and reports success.
>> 
>> If fencing is initiated via the "pcs stonith fence" command, stonith_admin
>> command, or by disrupting the communication between the nodes, the proper
>> node is rebooted, but the stonith_admin command times out and never sees
>> the node as rebooted.  The node is then rebooted every time it comes back
>> up on the network.  The status remains UNCLEAN in pcs status.
>> 
>> Chuck Dickerson
>> Jacobs
>> JSC - EG3
>> (281) 244-5895
> 


