[ClusterLabs] pcs authentication fails

2017-11-09 Thread Aviran Jerbby
Hi Clusterlabs mailing list,

I'm having issues running pcs authentication on RHEL/CentOS 7.0/7.1 (please see
the log below).

It's important to mention that pcs authentication on RHEL/CentOS 7.2/7.4, with
the same setup and packages, works fine.

[root@ufm-host42-014 tmp]# cat /etc/redhat-release
CentOS Linux release 7.0.1406 (Core)
[root@ufm-host42-014 tmp]# rpm -qa | grep openssl
openssl-libs-1.0.2k-8.el7.x86_64
openssl-devel-1.0.2k-8.el7.x86_64
openssl-1.0.2k-8.el7.x86_64
[root@ufm-host42-014 tmp]# rpm -qa | grep pcs
pcs-0.9.158-6.el7.centos.x86_64
pcsc-lite-libs-1.8.8-4.el7.x86_64
[root@ufm-host42-014 tmp]# pcs cluster auth ufm-host42-012.rdmz.labs.mlnx 
ufm-host42-013.rdmz.labs.mlnx ufm-host42-014.rdmz.labs.mlnx -u hacluster -p "" 
--debug
Running: /usr/bin/ruby -I/usr/lib/pcsd/ /usr/lib/pcsd/pcsd-cli.rb auth
Environment:
  DISPLAY=localhost:10.0
  GEM_HOME=/usr/lib/pcsd/vendor/bundle/ruby
  HISTCONTROL=ignoredups
  HISTSIZE=1000
  HOME=/root
  HOSTNAME=ufm-host42-014.rdmz.labs.mlnx
  KDEDIRS=/usr
  LANG=en_US.UTF-8
  LC_ALL=C
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  LOGNAME=root
  
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
 MAIL=/var/spool/mail/root
  OLDPWD=/root
  
PATH=/usr/lib64/qt-3.3/bin:/root/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin
  PCSD_DEBUG=true
  PCSD_NETWORK_TIMEOUT=60
  PERL5LIB=/root/perl5/lib/perl5:
  PERL_LOCAL_LIB_ROOT=:/root/perl5
  PERL_MB_OPT=--install_base /root/perl5
  PERL_MM_OPT=INSTALL_BASE=/root/perl5
  PWD=/tmp
  QTDIR=/usr/lib64/qt-3.3
  QTINC=/usr/lib64/qt-3.3/include
  QTLIB=/usr/lib64/qt-3.3/lib
  QT_GRAPHICSSYSTEM=native
  QT_GRAPHICSSYSTEM_CHECKED=1
  QT_PLUGIN_PATH=/usr/lib64/kde4/plugins:/usr/lib/kde4/plugins
  SHELL=/bin/bash
  SHLVL=1
  SSH_CLIENT=10.208.0.12 47232 22
  SSH_CONNECTION=10.208.0.12 47232 10.224.40.143 22
  SSH_TTY=/dev/pts/0
  TERM=xterm
  USER=root
  XDG_RUNTIME_DIR=/run/user/0
  XDG_SESSION_ID=6
  _=/usr/sbin/pcs
--Debug Input Start--
{"username": "hacluster", "local": false, "nodes": 
["ufm-host42-014.rdmz.labs.mlnx", "ufm-host42-013.rdmz.labs.mlnx", 
"ufm-host42-012.rdmz.labs.mlnx"], "password": "", "force": false}
--Debug Input End--

Finished running: /usr/bin/ruby -I/usr/lib/pcsd/ /usr/lib/pcsd/pcsd-cli.rb auth
Return value: 0
--Debug Stdout Start--
{
  "status": "ok",
  "data": {
"auth_responses": {
  "ufm-host42-014.rdmz.labs.mlnx": {
"status": "noresponse"
 },
  "ufm-host42-012.rdmz.labs.mlnx": {
"status": "noresponse"
  },
  "ufm-host42-013.rdmz.labs.mlnx": {
"status": "noresponse"
  }
},
"sync_successful": true,
"sync_nodes_err": [

],
"sync_responses": {
}
  },
  "log": [
"I, [2017-11-07T19:52:27.434067 #25065]  INFO -- : PCSD Debugging 
enabled\n",
"D, [2017-11-07T19:52:27.454014 #25065] DEBUG -- : Did not detect RHEL 6\n",
"I, [2017-11-07T19:52:27.454076 #25065]  INFO -- : Running: 
/usr/sbin/corosync-cmapctl totem.cluster_name\n",
"I, [2017-11-07T19:52:27.454127 #25065]  INFO -- : CIB USER: hacluster, 
groups: \n",
"D, [2017-11-07T19:52:27.458142 #25065] DEBUG -- : []\n",
"D, [2017-11-07T19:52:27.458216 #25065] DEBUG -- : [\"Failed to initialize 
the cmap API. Error CS_ERR_LIBRARY\\n\"]\n",
"D, [2017-11-07T19:52:27.458284 #25065] DEBUG -- : Duration: 
0.003997742s\n",
"I, [2017-11-07T19:52:27.458382 #25065]  INFO -- : Return Value: 1\n",
"W, [2017-11-07T19:52:27.458477 #25065]  WARN -- : Cannot read config 
'corosync.conf' from '/etc/corosync/corosync.conf': No such file\n",
"W, [2017-11-07T19:52:27.458546 #25065]  WARN -- 

[ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-09 Thread Derek Wuelfrath
Hello there,

First post here, but I've been following for a while!

Here’s my issue:
we have been deploying and running this type of cluster for a while and have
never really encountered this kind of problem before.

I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD along
with various other resources. Some of these resources are systemd
resources… and this is the part where things are “breaking”.

A two-server cluster running only DRBD, or DRBD with an OCF IPaddr2 resource
(the cluster IP in this instance), works just fine. I can easily move from one
node to the other without any issue.
As soon as I add a systemd resource to the resource group, things break.
Moving from one node to the other using standby mode works just fine, but as
soon as a Corosync / Pacemaker restart involves polling of a systemd resource,
it seems to try to start the whole resource group and therefore creates a
split-brain of the DRBD resource.

That is the best explanation / description of the situation I can give. If
it needs any clarification, examples, … I am more than open to sharing them.

Any guidance would be appreciated :)

Here’s the output of a ‘pcs config’

https://pastebin.com/1TUvZ4X9 

Cheers!
-dw

--
Derek Wuelfrath
dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu),
PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] One cluster with two groups of nodes

2017-11-09 Thread Alberto Mijares
>
> The first thing I'd mention is that a 6-node cluster can only survive
> the loss of two nodes, as 3 nodes don't have quorum. You can tweak that
> behavior with corosync quorum options, or you could add a quorum-only
> node, or use corosync's new qdevice capability to have an arbiter node.
>
> Coincidentally, I recently stumbled across a long-time Pacemaker
> feature that I wasn't aware of, that can handle this type of situation.
> It's not documented yet but will be when 1.1.18 is released soon.
>
> Colocation constraints may take a "node-attribute" parameter, that
> basically means, "Put this resource on a node of the same class as the
> one running resource X".
>
> In this case, you might set a "group" node attribute on all nodes, to
> "1" on the three primary nodes and "2" on the three failover nodes.
> Pick one resource as your base resource that everything else should go
> along with. Configure colocation constraints for all the other
> resources with that one, using "node-attribute=group". That means that
> all the other resources must be on a node with the same "group"
> attribute value as the node that the base resource is running on.
>
> "node-attribute" defaults to "#uname" (node name), this giving the
> usual behavior of colocation constraints: put the resource only on a
> node with the same name, i.e. the same node.
>
> The remaining question is, how do you want the base resource to fail
> over? If the base resource can fail over to any other node, whether in
> the same group or not, then you're done. If the base resource can only
> run on one node in each group, ban it from the other nodes using
> -INFINITY location constraints. If the base resource should only fail
> over to the opposite group, that's trickier, but something roughly
> similar would be to prefer one node in each group with an equal
> positive score location constraint, and migration-threshold=1.
> --
> Ken Gaillot 


Thank you very very much for this. I'm starting some tests in my lab tonight.

I'll let you know my results, and I hope I can count on you if I get
lost along the way.

BTW, every resource is supposed to run only on its designated node
within a group. For example: if nginx normally runs on A1, it MUST
fail over to B1. The same goes for every resource.

Best regards,


Alberto Mijares

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Issue in starting Pacemaker Virtual IP in RHEL 7

2017-11-09 Thread Jan Pokorný
On 06/11/17 10:43 +, Somanath Jeeva wrote:
> I am using a two node pacemaker cluster with teaming enabled. The cluster has
> 
> 1.   Two team interfaces with different subnets.
> 
> 2.   The team1 has a NFS VIP plumbed to it.
> 
> 3.   The VirtualIP from pacemaker is configured to plumb to 
> team0(Corosync ring number is 0)
> 
> In this case corosync takes the NFS IP as its ring address and
> checks for it in corosync.conf. Since the conf file has the team0
> hostname, corosync fails to start.
> 
> Outputs:
> 
> 
> $ip a output:
> 
> [...]
> 10: team1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
> qlen 1000
> link/ether 38:63:bb:3f:a4:ad brd ff:ff:ff:ff:ff:ff
> inet 10.64.23.117/28 brd 10.64.23.127 scope global team1
>valid_lft forever preferred_lft forever
> inet 10.64.23.121/24 scope global secondary team1:~m0
>valid_lft forever preferred_lft forever
> inet6 fe80::3a63:bbff:fe3f:a4ad/64 scope link
>valid_lft forever preferred_lft forever
> 11: team0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
> qlen 1000
> link/ether 38:63:bb:3f:a4:ac brd ff:ff:ff:ff:ff:ff
> inet 10.64.23.103/28 brd 10.64.23.111 scope global team0
>valid_lft forever preferred_lft forever
> inet6 fe80::3a63:bbff:fe3f:a4ac/64 scope link
>valid_lft forever preferred_lft forever
> 
> Corosync Conf File:
> 
> cat /etc/corosync/corosync.conf
> totem {
> version: 2
> secauth: off
> cluster_name: DES
> transport: udp
> rrp_mode: passive
> 
> interface {
> ringnumber: 0
> bindnetaddr: 10.64.23.96
> mcastaddr: 224.1.1.1
> mcastport: 6860
> }
> }
> 
> nodelist {
> node {
> ring0_addr: dl380x4415
> nodeid: 1
> }
> 
> node {
> ring0_addr: dl360x4405
> nodeid: 2
> }
> }
> 
> quorum {
> provider: corosync_votequorum
> two_node: 1
> }
> 
> logging {
> to_logfile: yes
> logfile: /var/log/cluster/corosync.log
> to_syslog: yes
> }
> 
> /etc/hosts:
> 
> $ cat /etc/hosts
> [...]
> 10.64.23.103   dl380x4415
> 10.64.23.105   dl360x4405
> [...]
> 
> Logs:
> 
> [3029] dl380x4415 corosyncerror   [MAIN  ] Corosync Cluster Engine exiting 
> with status 20 at service.c:356.
> [19040] dl380x4415 corosyncnotice  [MAIN  ] Corosync Cluster Engine 
> ('2.4.0'): started and ready to provide service.
> [19040] dl380x4415 corosyncinfo[MAIN  ] Corosync built-in features: dbus 
> systemd xmlconf qdevices qnetd snmp pie relro bindnow
> [19040] dl380x4415 corosyncnotice  [TOTEM ] Initializing transport (UDP/IP 
> Multicast).
> [19040] dl380x4415 corosyncnotice  [TOTEM ] Initializing transmit/receive 
> security (NSS) crypto: none hash: none
> [19040] dl380x4415 corosyncnotice  [TOTEM ] The network interface 
> [10.64.23.121] is now up.
> [19040] dl380x4415 corosyncnotice  [SERV  ] Service engine loaded: corosync 
> configuration map access [0]
> [19040] dl380x4415 corosyncinfo[QB] server name: cmap
> [19040] dl380x4415 corosyncnotice  [SERV  ] Service engine loaded: corosync 
> configuration service [1]
> [19040] dl380x4415 corosyncinfo[QB] server name: cfg
> [19040] dl380x4415 corosyncnotice  [SERV  ] Service engine loaded: corosync 
> cluster closed process group service v1.01 [2]
> [19040] dl380x4415 corosyncinfo[QB] server name: cpg
> [19040] dl380x4415 corosyncnotice  [SERV  ] Service engine loaded: corosync 
> profile loading service [4]
> [19040] dl380x4415 corosyncnotice  [QUORUM] Using quorum provider 
> corosync_votequorum
> [19040] dl380x4415 corosynccrit[QUORUM] Quorum provider: 
> corosync_votequorum failed to initialize.
> [19040] dl380x4415 corosyncerror   [SERV  ] Service engine 'corosync_quorum' 
> failed to load for reason 'configuration error: nodelist or 
> quorum.expected_votes must be configured!'

I suspect whether teaming is involved or not is irrelevant here.

You are not using the latest and greatest 2.4.3, so I'd suggest either
upgrading or applying this patch (present in that version) to see if that helps:

https://github.com/corosync/corosync/commit/95f9583a25007398e3792bdca2da262db18f658a
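
A general workaround, not something suggested in this thread: since the failure
comes from corosync trying to match its locally bound address against the team0
hostnames in the nodelist, some setups sidestep the hostname lookup entirely by
listing the ring addresses as plain IPs. A sketch using the team0 addresses from
the original post (10.64.23.103 for dl380x4415, 10.64.23.105 for dl360x4405 per
its /etc/hosts); double-check these really are the team0 addresses first.

nodelist {
    node {
        ring0_addr: 10.64.23.103
        nodeid: 1
    }

    node {
        ring0_addr: 10.64.23.105
        nodeid: 2
    }
}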

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker 1.1.18 Release Candidate 4

2017-11-09 Thread Ken Gaillot
On Fri, 2017-11-03 at 08:24 +0100, Kristoffer Grönlund wrote:
> Ken Gaillot  writes:
> 
> > I decided to do another release candidate, because we had a large
> > number of changes since rc3. The fourth release candidate for
> > Pacemaker
> > version 1.1.18 is now available at:
> > 
> > https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1
> > .18-
> > rc4
> > 
> > The big changes are numerous scalability improvements and bundle
> > fixes.
> > We're starting to test Pacemaker with as many as 1,500 bundles
> > (Docker
> > containers) running on 20 guest nodes running on three 56-core
> > physical
> > cluster nodes.
> 
> Hi Ken,
> 
> That's really cool. What's the size of the CIB with that kind of
> configuration? I guess it would compress pretty well, but still.

The test cluster is gone now, so not sure ... Beekhof might know.

I know it's big enough that the transition graph could get too big to
send via IPC, and we had to re-enable pengine's ability to write it to
disk instead, and have the crmd read it from disk.

> 
> Cheers,
> Kristoffer
> 
> > 
> > For details on the changes in this release, see the ChangeLog.
> > 
> > This is likely to be the last release candidate before the final
> > release next week. Any testing you can do is very welcome.
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] issues with pacemaker daemonization

2017-11-09 Thread Ken Gaillot
On Thu, 2017-11-09 at 15:59 +0530, ashutosh tiwari wrote:
> Hi,
> 
> We are observing that sometimes the pacemaker daemon gets the same
> process group id as the process/script calling "service pacemaker
> start".
> The child processes of pacemaker (cib/crmd/pengine) have their
> process group id equal to their pid, which is how things should be for
> a daemon afaik.
> 
> Do we expect this to be handled by the init.d script (CentOS 6) or by
> the pacemaker binary?
> 
> pacemaker version: pacemaker-1.1.14-8.el6_8.1.x86_64
> 
> 
> Thanks and Regards,
> Ashutosh Tiwari

When pacemakerd spawns a child (cib etc.), it calls setsid() in the
child to start a new session, which will set the process group ID and
session ID to the child's PID.

However it doesn't do anything similar for itself. Possibly it should.
It's a longstanding to-do item to make pacemaker daemonize itself more
"properly", but no one's had the time to address it.
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] One cluster with two groups of nodes

2017-11-09 Thread Ken Gaillot
On Wed, 2017-11-08 at 23:04 -0400, Alberto Mijares wrote:
> Hi guys, nice to say hello here.
> 
> I've been assigned with a very particular task: There's a
> pacemaker-based cluster with 6 nodes. A system runs on three nodes
> (group A), while the other three are hot-standby spares (group B).
> 
> Resources from group A are never supposed to me relocated
> individually
> into nodes from group B. However, if any of the resources from group
> A
> fails, all resources must be relocated into group B. It's an "all or
> nothing" failover.
> 
> Ideally, you would split the cluster into two clusters and implement
> Cluster-Sites and Tickets Management; however, it's not possible.
> 
> Taking all this into account, can you kindly suggest an strategy for
> achieving the goal? I have some ideas but I'd like to hear from those
> who have a lot more experience than me.
> 
> Thanks in advance,
> 
> 
> Alberto Mijares

The first thing I'd mention is that a 6-node cluster can only survive
the loss of two nodes, as 3 nodes don't have quorum. You can tweak that
behavior with corosync quorum options, or you could add a quorum-only
node, or use corosync's new qdevice capability to have an arbiter node.

Coincidentally, I recently stumbled across a long-time Pacemaker
feature that I wasn't aware of, that can handle this type of situation.
It's not documented yet but will be when 1.1.18 is released soon.

Colocation constraints may take a "node-attribute" parameter, that
basically means, "Put this resource on a node of the same class as the
one running resource X".

In this case, you might set a "group" node attribute on all nodes, to
"1" on the three primary nodes and "2" on the three failover nodes.
Pick one resource as your base resource that everything else should go
along with. Configure colocation constraints for all the other
resources with that one, using "node-attribute=group". That means that
all the other resources must be on a node with the same "group"
attribute value as the node that the base resource is running on.

"node-attribute" defaults to "#uname" (node name), this giving the
usual behavior of colocation constraints: put the resource only on a
node with the same name, i.e. the same node.
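
A rough sketch of that setup, with made-up node names (a1-a3 primary, b1-b3
failover) and made-up resource ids; the constraint is shown as raw CIB XML
loaded via cibadmin so as not to depend on any particular CLI syntax for
node-attribute.

# tag each node with its failover class
for n in a1 a2 a3; do
    crm_attribute --type nodes --node "$n" --name group --update 1
done
for n in b1 b2 b3; do
    crm_attribute --type nodes --node "$n" --name group --update 2
done

# colocate another resource with the base resource by matching the "group"
# node attribute instead of the node name
cibadmin --create -o constraints -X \
  '<rsc_colocation id="web-with-base" rsc="webserver" with-rsc="base-rsc"
                   score="INFINITY" node-attribute="group"/>'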

The remaining question is, how do you want the base resource to fail
over? If the base resource can fail over to any other node, whether in
the same group or not, then you're done. If the base resource can only
run on one node in each group, ban it from the other nodes using
-INFINITY location constraints. If the base resource should only fail
over to the opposite group, that's trickier, but something roughly
similar would be to prefer one node in each group with an equal
positive score location constraint, and migration-threshold=1.
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker resource start delay when there are another resource is starting

2017-11-09 Thread lkxjtu


>Also, I forgot about the undocumented/unsupported start-delay operation
>attribute, that you can put on the status operation to delay the first
>monitor. That may give you the behavior you want.
I have tried adding "start-delay=60s" to the monitor operation. The first monitor
was indeed delayed by 60s, but during those 60s it blocked the other resources
too! The result is the same as sleeping in the monitor.
So I think the best method for me is to decide in the monitor function, based on
a timestamp, whether it needs to return success.
Thank you very much!
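
For reference, a minimal sketch (untested, fragment only) of the attribute-based
variant Ken describes in the quoted mail below, written as OCF shell. The
attribute name, the 600-second grace period and real_healthcheck are
placeholders, and the parsing of the attrd_updater query output may need
adjusting for your Pacemaker version; -p marks the attribute as private (kept
only in attrd, not written to the CIB).

# assumes ocf-shellfuncs has been sourced, providing OCF_SUCCESS / OCF_NOT_RUNNING
GRACE=600                     # mask failures for 600s after a start
ATTR="fm_mgt_last_start"      # placeholder attribute name

fm_mgt_start() {
    # ... real start work goes here ...
    # record the start time as a private node attribute
    attrd_updater -n "$ATTR" -U "$(date +%s)" -p
    return $OCF_SUCCESS
}

fm_mgt_monitor() {
    if real_healthcheck; then
        attrd_updater -n "$ATTR" -D      # healthy: drop the grace marker
        return $OCF_SUCCESS
    fi
    started=$(attrd_updater -n "$ATTR" -Q 2>/dev/null \
              | sed -n 's/.*value="\([^"]*\)".*/\1/p')
    if [ -n "$started" ] && [ $(( $(date +%s) - started )) -lt $GRACE ]; then
        return $OCF_SUCCESS              # still within the post-start grace period
    fi
    return $OCF_NOT_RUNNING
}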








At 2017-11-06 21:53:53, "Ken Gaillot"  wrote:
>On Sat, 2017-11-04 at 22:46 +0800, lkxjtu wrote:
>> 
>> 
>> >Another possibility would be to have the start return immediately,
>> and
>> >make the monitor artificially return success for the first 10
>> minutes
>> >after starting. It's hacky, and it depends on your situation whether
>> >the behavior is acceptable.
>> I tried to put the sleep into the monitor function,( I add a “sleep
>> 60” at the monitor entry for debug),  the start function returns
>> immediately. I found an interesting thing that is, at the first time
>> of monitor after start, it will block other resource too, but from
>> the second time, it won't block other resources! Is this normal?
>
>Yes, the first result is for an unknown status, but after that, the
>cluster assumes the resource is OK unless/until the monitor says
>otherwise.
>
>However, I wasn't suggesting putting a sleep inside the monitor -- I
>was just thinking of having the monitor check the time, and if it's
>within 10 minutes of start, return success.
>
>> >My first thought on how to implement this
>> >would be to have the start action set a private node attribute
>> >(attrd_updater -p) with a timestamp. When the monitor runs, it could
>> do
>> >its usual check, and if it succeeds, remove that node attribute, but
>> if
>> >it fails, check the node attribute to see whether it's within the
>> >desired delay.
>> This means that if it is in the desired delay, monitor should return
>> success even if healthcheck failed?
>> I think this can solve my problem except "crm status" show
>
>Yes, that's what I had in mind. The status would show "running", which
>may or may not be what you want in this case.
>
>Also, I forgot about the undocumented/unsupported start-delay operation
>attribute, that you can put on the status operation to delay the first
>monitor. That may give you the behavior you want.
>
>> At 2017-11-01 21:20:50, "Ken Gaillot"  wrote:
>> >On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
>> >> 
>> >> Thank you for your response! This means that there shoudn't be
>> long
>> >> "sleep" in ocf script.
>> >> If my service takes 10 minite from service starting to healthcheck
>> >> normally, then what shoud I do?
>> >
>> >That is a tough situation with no great answer.
>> >
>> >You can leave it as it is, and live with the delay. Note that it
>> only
>> >happens if a resource fails after the slow resource has already
>> begun
>> >starting ... if they fail at the same time (as with a node failure),
>> >the cluster will schedule recovery for both at the same time.
>> >
>> >Another possibility would be to have the start return immediately,
>> and
>> >make the monitor artificially return success for the first 10
>> minutes
>> >after starting. It's hacky, and it depends on your situation whether
>> >the behavior is acceptable. My first thought on how to implement
>> this
>> >would be to have the start action set a private node attribute
>> >(attrd_updater -p) with a timestamp. When the monitor runs, it could
>> do
>> >its usual check, and if it succeeds, remove that node attribute, but
>> if
>> >it fails, check the node attribute to see whether it's within the
>> >desired delay.
>> >
>> >> Thank you very much!
>> >>  
>> >> > Hi,
>> >> > If I remember correctly, any pending actions from a previous
>> >> transition
>> >> > must be completed before a new transition can be calculated.
>> >> Otherwise,
>> >> > there's the possibility that the pending action could change the
>> >> state
>> >> > in a way that makes the second transition's decisions harmful.
>> >> > Theoretically (and ideally), pacemaker could figure out whether
>> >> some of
>> >> > the actions in the second transition would be needed regardless
>> of
>> >> > whether the pending actions succeeded or failed, but in
>> practice,
>> >> that
>> >> > would be difficult to implement (and possibly take more time to
>> >> > calculate than is desirable in a recovery situation).
>> >>  
>> >> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
>> >> 
>> >> > I have two clone resources in my corosync/pacemaker cluster.
>> They
>> >> are
>> >> > fm_mgt and logserver. Both of their RA is ocf. fm_mgt takes 1
>> >> minute
>> >> > to start the
>> >> > service(calling ocf start function for 1 minite). Configured as
>> >> > below:
>> >> > # crm configure show
>> >> > node 168002177: 192.168.2.177
>> >> > node 

[ClusterLabs] issues with pacemaker daemonization

2017-11-09 Thread ashutosh tiwari
Hi,

We are observing that sometimes the pacemaker daemon gets the same process
group id as the process/script calling "service pacemaker start".
The child processes of pacemaker (cib/crmd/pengine) have their process
group id equal to their pid, which is how things should be for a
daemon afaik.

Do we expect this to be handled by the init.d script (CentOS 6) or by the
pacemaker binary?

pacemaker version: pacemaker-1.1.14-8.el6_8.1.x86_64


Thanks and Regards,
Ashutosh Tiwari
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org