Re: [ClusterLabs] Antwort: Re: reboot node / cluster standby
On 07/03/2017 08:30 AM, philipp.achmuel...@arz.at wrote:
> Ken Gaillot wrote on 29.06.2017 21:15:59:
>
>> From: Ken Gaillot
>> To: Ludovic Vaugeois-Pepin, Cluster Labs - All topics related to
>> open-source clustering welcomed
>> Date: 29.06.2017 21:19
>> Subject: Re: [ClusterLabs] reboot node / cluster standby
>>
>> On 06/29/2017 01:38 PM, Ludovic Vaugeois-Pepin wrote:
>> > On Thu, Jun 29, 2017 at 7:27 PM, Ken Gaillot wrote:
>> >> On 06/29/2017 04:42 AM, philipp.achmuel...@arz.at wrote:
>> >>> Hi,
>> >>>
>> >>> In order to reboot a cluster node I would like to set the node to
>> >>> standby first, so a clean takeover of running resources can take
>> >>> place. Is there a default way I can set this in pacemaker, or do
>> >>> I have to set up my own systemd implementation?
>> >>>
>> >>> thank you!
>> >>> regards
>> >>>
>> >>> env:
>> >>> Pacemaker 1.1.15
>> >>> SLES 12.2
>> >>
>> >> If a node cleanly shuts down or reboots, pacemaker will move all
>> >> resources off it before it exits, so that should happen as you're
>> >> describing, without needing an explicit standby.
>
> how does this work when evacuating e.g. 5 nodes out of a 10-node
> cluster at the same time?

A clean shutdown works the same regardless of the situation:

- the OS (systemd or whatever) sends a signal to pacemakerd to exit
- a pacemaker daemon on the local node sends a shutdown request to the
  DC node
- the DC node moves all resources off the node
- the DC sends an "ok to shutdown" message to the node
- the node's pacemaker daemons exit
- the OS proceeds with system shutdown

The only wrinkle with 5 out of 10 nodes is that most likely (depending
on your configuration) you are losing quorum, and the cluster will stop
all resources on all nodes.

>> > This makes me wonder about timeouts. Specifically OS/systemd
>> > timeouts. Say the node being shut down or rebooted holds a resource
>> > as a master, and it takes a while for the demote to complete, say
>> > 100 seconds (less than the demote timeout of 120s in this
>> > hypothetical scenario). Will the OS/systemd wait until pacemaker
>> > exits cleanly on a regular CentOS or Debian?
>>
>> Yes. The pacemaker systemd unit file uses TimeoutStopSec=30min.
>>
>> >> Explicitly doing standby first would be useful mainly if you want
>> >> to manually check the results of the takeover before proceeding
>> >> with the reboot, and/or if you want the node to come back in
>> >> standby mode next time it joins.
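To make the quorum caveat above concrete: before evacuating several
nodes at once, you can check how many votes the cluster can afford to
lose and decide what should happen if quorum is lost. A minimal sketch
using standard corosync 2.x and crmsh tools (as shipped with SLES 12);
the policy value shown is only illustrative, not a recommendation:

    # Show current votes, expected votes, and the quorum threshold:
    corosync-quorumtool -s

    # What the cluster does without quorum is controlled by the
    # no-quorum-policy cluster property (default "stop" = stop all
    # resources everywhere):
    crm configure property no-quorum-policy=stop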
Re: [ClusterLabs] Problem with stonith and starting services
On 07/03/2017 02:34 AM, Cesar Hernandez wrote:
> Hi
>
> I have installed a pacemaker cluster with two nodes. The same type of
> installation has been done many times before, and the following error
> never appeared. The situation is the following:
>
> both nodes running cluster services
> stop pacemaker & corosync on node 1
> stop pacemaker & corosync on node 2
> start corosync & pacemaker on node 1
>
> Then node 1 starts, sees node2 down, and fences it, as expected. But
> the problem comes when node 2 is rebooted and starts cluster services:
> sometimes corosync starts, but the pacemaker service starts and then
> stops. The syslog shows the following error in these cases:
>
> Jul 3 09:07:04 node2 pacemakerd[597]: warning: The crmd process (608)
> can no longer be respawned, shutting the cluster down.
> Jul 3 09:07:04 node2 pacemakerd[597]: notice: Shutting down Pacemaker
>
> Earlier messages show some warnings that I'm not sure are related to
> the shutdown:
>
> Jul 3 09:07:04 node2 stonith-ng[604]: notice: Operation reboot of
> node2 by node1 for crmd.2413@node1.608d8118: OK
> Jul 3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced
> by node1 for node1!
> Jul 3 09:07:04 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit:
> Client crmd (conn=0x1471800, async-conn=0x1471800) left
>
> On node1, all resources become unrunnable and it stays there forever
> until I manually start the pacemaker service on node2. As I said, the
> same type of installation has been done before on other servers and
> this never happened. The only difference is that in previous
> installations I configured corosync with multicast and now I have
> configured it with unicast (my current network environment doesn't
> allow multicast), but I think that's not related to this behaviour.

Agreed, I don't think it's multicast vs. unicast. I can't see from this
what's going wrong. Possibly node1 is trying to re-fence node2 when it
comes back. Check that the fencing resources are configured correctly,
and check whether node1 sees the first fencing succeed.

> Cluster software versions:
> corosync-1.4.8
> crmsh-2.1.5
> libqb-0.17.2
> Pacemaker-1.1.14
> resource-agents-3.9.6
>
> Can you help me?
>
> Thanks
>
> Cesar
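A sketch of the checks Ken suggests, using the standard pacemaker 1.1
CLI tools; node names match the log excerpts above, and the log path
assumes a Debian-style syslog location:

    # Overall cluster and resource state, including failed actions:
    crm_mon -1

    # Fence devices currently registered with stonith-ng:
    stonith_admin --list-registered

    # Devices that stonith-ng believes are able to fence node2:
    stonith_admin --list node2

    # On node1, look for the recorded outcome of the first fencing:
    grep -i stonith /var/log/syslog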
Re: [ClusterLabs] Coming in Pacemaker 1.1.17: container bundles
On 07/01/2017 06:47 AM, Valentin Vidic wrote:
> On Fri, Jun 30, 2017 at 12:46:29PM -0500, Ken Gaillot wrote:
>> The challenge is that some properties are docker-specific and other
>> container engines will have their own specific properties.
>>
>> We decided to go with a tag for each supported engine -- so if we
>> add support for rkt, we'll add a <rkt> tag with whatever properties
>> it needs. Then a <bundle> would need to contain either a <docker>
>> tag or a <rkt> tag.
>>
>> We did consider a generic alternative like:
>>
>>   ...
>>
>> But it was decided that using engine-specific tags would allow for
>> schema enforcement, and would be more readable.
>>
>> The <network> and <storage> tags were kept under <bundle> because we
>> figured those are essential to the concept of a bundle, and any
>> engine should support some way of mapping those.
>
> Thanks for the explanation, it makes sense :)
>
> Now I have a working rkt resource agent and would like to test it.
> Can you share the pcmk:httpd image mentioned in the docker example?

Sure, we have a walk-through on the wiki that I was going to announce
after 1.1.17 final is released (hopefully later this week), but now is
good, too :-)

https://wiki.clusterlabs.org/wiki/Bundle_Walk-Through
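For reference, a trimmed sketch of the bundle syntax this thread is
discussing, roughly as documented for pacemaker 1.1.17; the IP range,
directories, and ids are placeholders. It shows the engine-specific
<docker> tag alongside the engine-neutral <network> and <storage> tags
kept under <bundle>:

    <bundle id="httpd-bundle">
      <!-- engine-specific: a <rkt> tag would take this place for rkt -->
      <docker image="pcmk:httpd" replicas="3"/>
      <!-- engine-neutral: each replica gets an IP from this range -->
      <network ip-range-start="192.168.122.131" host-netmask="24">
        <port-mapping id="httpd-port" port="80"/>
      </network>
      <!-- engine-neutral: host directories mapped into each container -->
      <storage>
        <storage-mapping id="httpd-root" source-dir="/srv/html"
                         target-dir="/var/www/html" options="rw"/>
      </storage>
      <!-- the resource pacemaker manages inside each container -->
      <primitive id="httpd" class="ocf" provider="heartbeat"
                 type="apache"/>
    </bundle>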
[ClusterLabs] Antwort: Re: reboot node / cluster standby
Ken Gaillot wrote on 29.06.2017 21:15:59:

> From: Ken Gaillot
> To: Ludovic Vaugeois-Pepin, Cluster Labs - All topics related to
> open-source clustering welcomed
> Date: 29.06.2017 21:19
> Subject: Re: [ClusterLabs] reboot node / cluster standby
>
> On 06/29/2017 01:38 PM, Ludovic Vaugeois-Pepin wrote:
> > On Thu, Jun 29, 2017 at 7:27 PM, Ken Gaillot wrote:
> >> On 06/29/2017 04:42 AM, philipp.achmuel...@arz.at wrote:
> >>> Hi,
> >>>
> >>> In order to reboot a cluster node I would like to set the node to
> >>> standby first, so a clean takeover of running resources can take
> >>> place. Is there a default way I can set this in pacemaker, or do
> >>> I have to set up my own systemd implementation?
> >>>
> >>> thank you!
> >>> regards
> >>>
> >>> env:
> >>> Pacemaker 1.1.15
> >>> SLES 12.2
> >>
> >> If a node cleanly shuts down or reboots, pacemaker will move all
> >> resources off it before it exits, so that should happen as you're
> >> describing, without needing an explicit standby.

how does this work when evacuating e.g. 5 nodes out of a 10-node
cluster at the same time?

> > This makes me wonder about timeouts. Specifically OS/systemd
> > timeouts. Say the node being shut down or rebooted holds a resource
> > as a master, and it takes a while for the demote to complete, say
> > 100 seconds (less than the demote timeout of 120s in this
> > hypothetical scenario). Will the OS/systemd wait until pacemaker
> > exits cleanly on a regular CentOS or Debian?
>
> Yes. The pacemaker systemd unit file uses TimeoutStopSec=30min.
>
> >> Explicitly doing standby first would be useful mainly if you want
> >> to manually check the results of the takeover before proceeding
> >> with the reboot, and/or if you want the node to come back in
> >> standby mode next time it joins.
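For anyone who wants the explicit standby-first workflow anyway (to
verify the takeover before rebooting, as Ken notes), a minimal sketch
using crmsh as shipped with SLES 12; the node name is a placeholder:

    # Move resources off the node while the cluster is still healthy:
    crm node standby node1

    # Verify the takeover succeeded before proceeding:
    crm_mon -1

    # Reboot; pacemaker's unit file (TimeoutStopSec=30min) gives any
    # in-flight stop/demote operations time to finish:
    reboot

    # After the node rejoins, clear standby so it can host resources:
    crm node online node1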
[ClusterLabs] Problem with stonith and starting services
Hi

I have installed a pacemaker cluster with two nodes. The same type of
installation has been done many times before, and the following error
never appeared. The situation is the following:

both nodes running cluster services
stop pacemaker & corosync on node 1
stop pacemaker & corosync on node 2
start corosync & pacemaker on node 1

Then node 1 starts, sees node2 down, and fences it, as expected. But
the problem comes when node 2 is rebooted and starts cluster services:
sometimes corosync starts, but the pacemaker service starts and then
stops. The syslog shows the following error in these cases:

Jul 3 09:07:04 node2 pacemakerd[597]: warning: The crmd process (608)
can no longer be respawned, shutting the cluster down.
Jul 3 09:07:04 node2 pacemakerd[597]: notice: Shutting down Pacemaker

Earlier messages show some warnings that I'm not sure are related to
the shutdown:

Jul 3 09:07:04 node2 stonith-ng[604]: notice: Operation reboot of
node2 by node1 for crmd.2413@node1.608d8118: OK
Jul 3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by
node1 for node1!
Jul 3 09:07:04 node2 corosync[585]: [pcmk ] info: pcmk_ipc_exit:
Client crmd (conn=0x1471800, async-conn=0x1471800) left

On node1, all resources become unrunnable and it stays there forever
until I manually start the pacemaker service on node2. As I said, the
same type of installation has been done before on other servers and
this never happened. The only difference is that in previous
installations I configured corosync with multicast and now I have
configured it with unicast (my current network environment doesn't
allow multicast), but I think that's not related to this behaviour.

Cluster software versions:
corosync-1.4.8
crmsh-2.1.5
libqb-0.17.2
Pacemaker-1.1.14
resource-agents-3.9.6

Can you help me?

Thanks

Cesar
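Since the only configuration difference mentioned is multicast vs.
unicast, here for comparison is roughly what the unicast (udpu) totem
setup looks like in corosync 1.4.x, the version in use above; the
addresses are placeholders:

    totem {
        version: 2
        # udpu replaces multicast with explicit per-node addresses
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            mcastport: 5405
            member {
                memberaddr: 192.168.1.1
            }
            member {
                memberaddr: 192.168.1.2
            }
        }
    }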