Re: [ClusterLabs] Reply: Re: reboot node / cluster standby

2017-07-03 Thread Ken Gaillot
On 07/03/2017 08:30 AM, philipp.achmuel...@arz.at wrote:
> Ken Gaillot wrote on 29.06.2017 21:15:59:
> 
>> From: Ken Gaillot 
>> To: Ludovic Vaugeois-Pepin , Cluster Labs - All
>> topics related to open-source clustering welcomed 
>> Date: 29.06.2017 21:19
>> Subject: Re: [ClusterLabs] reboot node / cluster standby
>>
>> On 06/29/2017 01:38 PM, Ludovic Vaugeois-Pepin wrote:
>> > On Thu, Jun 29, 2017 at 7:27 PM, Ken Gaillot 
> wrote:
>> >> On 06/29/2017 04:42 AM, philipp.achmuel...@arz.at wrote:
>> >>> Hi,
>> >>>
>> >>> In order to reboot a cluster node, I would like to set the node to
>> >>> standby first, so a clean takeover of running resources can take place.
>> >>> Is there a default way I can set this in Pacemaker, or do I have to
>> >>> set up my own systemd implementation?
>> >>>
>> >>> thank you!
>> >>> regards
>> >>> 
>> >>> env:
>> >>> Pacemaker 1.1.15
>> >>> SLES 12.2
>> >>
>> >> If a node cleanly shuts down or reboots, pacemaker will move all
>> >> resources off it before it exits, so that should happen as you're
>> >> describing, without needing an explicit standby.
>> >
> 
> How does this work when evacuating e.g. 5 nodes out of a 10-node cluster
> at the same time?

A clean shutdown works the same regardless of the situation:

- the OS (systemd or whatever) sends a signal to pacemakerd to exit
- a pacemaker daemon on the local node sends a shutdown request to the
DC node
- the DC node moves all resources off the node
- the DC sends an "ok to shutdown" message to the node
- the node's pacemaker daemons exit
- the OS proceeds with system shutdown

The only wrinkle with taking 5 out of 10 nodes down at once is that most
likely (depending on your configuration) you will lose quorum, and the
cluster will stop all resources on all nodes.
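
If you do want the explicit standby (for example to verify the takeover
before rebooting, or to drain several nodes in a controlled way), a minimal
sketch with crmsh might look like the following. The node name and the
choice of crmsh are assumptions on my part; pcs users would use the
equivalent pcs commands.

    # Sketch only: drain one node explicitly, verify, then reboot it
    crm node standby node1        # move all resources off node1
    crm_mon -1                    # check that the takeover completed as expected
    corosync-quorumtool -s        # confirm the remaining nodes still have quorum
    reboot                        # pacemaker still shuts down cleanly underneath

    # once the node is back up and has rejoined the cluster
    crm node online node1         # clear standby so it can host resources again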

> 
>> > This makes me wonder about timeouts. Specifically OS/systemd timeouts.
>> > Say the node being shut down or rebooted holds a resource as a master,
>> > and it takes a while for the demote to complete, say 100 seconds (less
>> > than the demote timeout of 120s in this hypothetical scenario).  Will
>> > the OS/systemd wait until pacemaker exits cleanly on a regular CentOS
>> > or Debian?
>>
>> Yes. The pacemaker systemd unit file uses TimeoutStopSec=30min.
>>
>> >
>> >
>> >> Explicitly doing standby first would be useful mainly if you want to
>> >> manually check the results of the takeover before proceeding with the
>> >> reboot, and/or if you want the node to come back in standby mode next
>> >> time it joins.
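
As an aside on the TimeoutStopSec point quoted above: a quick way to
confirm what your distribution ships, and to raise it locally if demotes
can legitimately take longer, assuming pacemaker is managed by systemd
(the 60min value below is just an example):

    systemctl show pacemaker -p TimeoutStopUSec    # effective stop timeout
    systemctl cat pacemaker | grep TimeoutStopSec  # what the unit file sets

    # override via a drop-in instead of editing the packaged unit file
    systemctl edit pacemaker
    #   [Service]
    #   TimeoutStopSec=60min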



Re: [ClusterLabs] Problem with stonith and starting services

2017-07-03 Thread Ken Gaillot
On 07/03/2017 02:34 AM, Cesar Hernandez wrote:
> Hi
> 
> I have installed a pacemaker cluster with two nodes. The same type of
> installation has been done many times before, and the following error never
> appeared. The situation is the following:
> 
> both nodes running cluster services
> stop pacemaker&corosync on node 1
> stop pacemaker&corosync on node 2
> start corosync&pacemaker on node 1
> 
> Then node 1 starts, sees node2 down, and fences it, as expected.
> But the problem comes when node 2 is rebooted and starts cluster services:
> sometimes corosync starts fine, but pacemaker starts and then stops again.
> The syslog shows the following error in these cases:
> 
> Jul  3 09:07:04 node2 pacemakerd[597]:  warning: The crmd process (608) can 
> no longer be respawned, shutting the cluster down.
> Jul  3 09:07:04 node2 pacemakerd[597]:   notice: Shutting down Pacemaker
> 
> Earlier messages show some warnings that I'm not sure are related to the
> shutdown:
> 
> 
> Jul  3 09:07:04 node2 stonith-ng[604]:   notice: Operation reboot of node2 by 
> node1 for crmd.2413@node1.608d8118: OK
> Jul  3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by 
> node1 for node1!
> Jul  3 09:07:04 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Client 
> crmd (conn=0x1471800, async-conn=0x1471800) left
> 
> 
> On node1, all resources become unrunnable, and it stays that way forever
> until I manually start the pacemaker service on node2.
> As I said, the same type of installation has been done before on other
> servers and this never happened. The only difference is that in previous
> installations I configured corosync with multicast, and now I have
> configured it with unicast (my current network environment doesn't allow
> multicast), but I don't think that behaviour is related.

Agreed, I don't think it's multicast vs unicast.

I can't see from this what's going wrong. Possibly node1 is trying to
re-fence node2 when it comes back. Check that the fencing resources are
configured correctly, and check whether node1 sees the first fencing
succeed.
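
A rough way to check both of those on node1, assuming the standard
Pacemaker command-line tools are available; the node name, log path and
line count are placeholders:

    stonith_admin --list-registered          # fencing devices the fencer knows about
    crm_mon -1                               # failed fencing/start actions show up here
    stonith_admin --history node2 --verbose  # fencing operations recorded against node2
    grep -i stonith /var/log/messages | tail -n 50   # adjust the log path for your distro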

> Cluster software versions:
> corosync-1.4.8
> crmsh-2.1.5
> libqb-0.17.2
> Pacemaker-1.1.14
> resource-agents-3.9.6
> 
> 
> 
> Can you help me?
> 
> Thanks
> 
> Cesar



Re: [ClusterLabs] Coming in Pacemaker 1.1.17: container bundles

2017-07-03 Thread Ken Gaillot
On 07/01/2017 06:47 AM, Valentin Vidic wrote:
> On Fri, Jun 30, 2017 at 12:46:29PM -0500, Ken Gaillot wrote:
>> The challenge is that some properties are docker-specific and other
>> container engines will have their own specific properties.
>>
>> We decided to go with a tag for each supported engine -- so if we add
>> support for rkt, we'll add a <rkt> tag with whatever properties it
>> needs. Then a <bundle> would need to contain either a <docker> tag or a
>> <rkt> tag.
>>
>> We did consider a generic alternative like:
>>
>>   <bundle>
>>      <engine type="docker">
>>      <option name="..." value="..."/>
>>      ...
>>      </engine>
>>      ...
>>   </bundle>
>>
>> But it was decided that using engine-specific tags would allow for
>> schema enforcement, and would be more readable.
>>
>> The <network> and <storage> tags were kept under <bundle> because we
>> figured those are essential to the concept of a bundle, and any engine
>> should support some way of mapping those.
> 
> Thanks for the explanation, it makes sense :)
> 
> Now I have a working rkt resource agent and would like to test it.
> Can you share the pcmk:httpd image mentioned in the docker example?

Sure, we have a walk-through on the wiki that I was going to announce
after 1.1.17 final is released (hopefully later this week), but now is
good, too :-)

   https://wiki.clusterlabs.org/wiki/Bundle_Walk-Through
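
For the impatient, a rough sketch of how a test image tagged pcmk:httpd can
be built locally; the base image and package list here are assumptions on
my part, not necessarily what the walk-through uses:

    mkdir -p /tmp/pcmk-httpd && cd /tmp/pcmk-httpd
    # bundles run pacemaker_remoted inside the container when a primitive is included
    printf 'FROM centos:7\nRUN yum install -y httpd pacemaker-remote resource-agents && yum clean all\n' > Dockerfile
    docker build -t pcmk:httpd .
    docker images pcmk            # confirm the pcmk:httpd tag is present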



[ClusterLabs] Reply: Re: reboot node / cluster standby

2017-07-03 Thread philipp . achmueller
Ken Gaillot wrote on 29.06.2017 21:15:59:

> From: Ken Gaillot 
> To: Ludovic Vaugeois-Pepin , Cluster Labs - All
> topics related to open-source clustering welcomed 

> Date: 29.06.2017 21:19
> Subject: Re: [ClusterLabs] reboot node / cluster standby
> 
> On 06/29/2017 01:38 PM, Ludovic Vaugeois-Pepin wrote:
> > On Thu, Jun 29, 2017 at 7:27 PM, Ken Gaillot  
wrote:
> >> On 06/29/2017 04:42 AM, philipp.achmuel...@arz.at wrote:
> >>> Hi,
> >>>
> >>> In order to reboot a cluster node, I would like to set the node to
> >>> standby first, so a clean takeover of running resources can take place.
> >>> Is there a default way I can set this in Pacemaker, or do I have to
> >>> set up my own systemd implementation?
> >>>
> >>> thank you!
> >>> regards
> >>> 
> >>> env:
> >>> Pacemaker 1.1.15
> >>> SLES 12.2
> >>
> >> If a node cleanly shuts down or reboots, pacemaker will move all
> >> resources off it before it exits, so that should happen as you're
> >> describing, without needing an explicit standby.
> > 

How does this work when evacuating e.g. 5 nodes out of a 10-node cluster
at the same time?

> > This makes me wonder about timeouts. Specifically OS/systemd timeouts.
> > Say the node being shut down or rebooted holds a resource as a master,
> > and it takes a while for the demote to complete, say 100 seconds (less
> > than the demote timeout of 120s in this hypothetical scenario).  Will
> > the OS/systemd wait until pacemaker exits cleanly on a regular CentOS
> > or Debian?
> 
> Yes. The pacemaker systemd unit file uses TimeoutStopSec=30min.
> 
> > 
> > 
> >> Explicitly doing standby first would be useful mainly if you want to
> >> manually check the results of the takeover before proceeding with the
> >> reboot, and/or if you want the node to come back in standby mode next
> >> time it joins.
> 


[ClusterLabs] Problem with stonith and starting services

2017-07-03 Thread Cesar Hernandez
Hi

I have installed a pacemaker cluster with two nodes. The same type of
installation has been done many times before, and the following error never
appeared. The situation is the following:

both nodes running cluster services
stop pacemaker&corosync on node 1
stop pacemaker&corosync on node 2
start corosync&pacemaker on node 1

Then node 1 starts, sees node2 down, and fences it, as expected.
But the problem comes when node 2 is rebooted and starts cluster services:
sometimes corosync starts fine, but pacemaker starts and then stops again.
The syslog shows the following error in these cases:

Jul  3 09:07:04 node2 pacemakerd[597]:  warning: The crmd process (608) can no 
longer be respawned, shutting the cluster down.
Jul  3 09:07:04 node2 pacemakerd[597]:   notice: Shutting down Pacemaker

Earlier messages show some warnings that I'm not sure are related to the
shutdown:


Jul  3 09:07:04 node2 stonith-ng[604]:   notice: Operation reboot of node2 by 
node1 for crmd.2413@node1.608d8118: OK
Jul  3 09:07:04 node2 crmd[608]: crit: We were allegedly just fenced by 
node1 for node1!
Jul  3 09:07:04 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Client 
crmd (conn=0x1471800, async-conn=0x1471800) left


On node1, all resources become unrunnable, and it stays that way forever
until I manually start the pacemaker service on node2.
As I said, the same type of installation has been done before on other
servers and this never happened. The only difference is that in previous
installations I configured corosync with multicast, and now I have
configured it with unicast (my current network environment doesn't allow
multicast), but I don't think that behaviour is related.
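
(For what it's worth, a rough way to double-check the unicast side of the
configuration; the grep and the expected shape below are only a sketch,
with placeholder addresses:)

    grep -E -A1 'transport|member' /etc/corosync/corosync.conf
    # with corosync 1.4 UDP unicast this should look roughly like:
    #   totem {
    #     transport: udpu
    #     interface {
    #       member { memberaddr: 10.0.0.1 }
    #       member { memberaddr: 10.0.0.2 }
    #     }
    #   }
    corosync-objctl | grep member     # confirm both members are known at runtime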

Cluster software versions:
corosync-1.4.8
crmsh-2.1.5
libqb-0.17.2
Pacemaker-1.1.14
resource-agents-3.9.6



Can you help me?

Thanks

Cesar



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org