[ClusterLabs] CIB: op-status=4 ?

2017-05-17 Thread Radoslaw Garbacz
Hi,

I have a question regarding the 'op-status'
attribute getting the value 4.

In my case I see strange behavior: resources get "monitor" operation
entries in the CIB with op-status=4, and the operations do not seem to
have been called (exec-time=0).

What does 'op-status' = 4 mean?

I would appreciate some elaboration regarding this, since this is
interpreted by pacemaker as an error, which causes logs:
crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from
re-starting anywhere: operation monitor failed 'not configured' (6)

and I am pretty sure the resource agent was not called (no logs,
exec-time=0)

There are two aspects of this:

1) harmless (pacemaker seems to not bother about it), which I guess
indicates cancelled monitoring operations:
op-status=4, rc-code=189

* Example:



2) error level one (op-status=4, rc-code=6), which generates logs:
crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from
re-starting anywhere: operation monitor failed 'not configured' (6)

* Example:



Could it be some hardware (VM hypervisor) issue?
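
For anyone who wants to check a CIB for the same pattern, here is a minimal
sketch in Python. It assumes the operation history is stored as lrm_rsc_op
elements carrying op-status, rc-code and exec-time attributes. The XML
fragment and the ids op-a/op-b are made up for illustration and are not the
entries from the cluster described above. It only lists entries that have a
non-zero op-status together with exec-time=0.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, hand-written for this example only.
SAMPLE = """
<lrm_resource id="dbx_head_head">
  <lrm_rsc_op id="op-a" operation="monitor" op-status="4" rc-code="6" exec-time="0"/>
  <lrm_rsc_op id="op-b" operation="monitor" op-status="0" rc-code="0" exec-time="12"/>
</lrm_resource>
"""


def suspicious_ops(xml_text):
    """Return (id, operation, op-status, rc-code) for entries that report
    a non-zero op-status although exec-time is 0."""
    root = ET.fromstring(xml_text)
    found = []
    for op in root.iter("lrm_rsc_op"):
        if op.get("op-status") != "0" and op.get("exec-time") == "0":
            found.append((op.get("id"), op.get("operation"),
                          op.get("op-status"), op.get("rc-code")))
    return found


if __name__ == "__main__":
    for entry in suspicious_ops(SAMPLE):
        print(entry)
```

The same function could be fed the XML of a real status section; the filter
condition is just one guess at what "not called" looks like in the history.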


Thanks in advance,

-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Dmitri Maziuk

On 2017-05-17 06:24, Lentes, Bernd wrote:

> ...
> I'd like to know what the software I use is doing. Am I the only one having
> that opinion?

No.

> How do you solve the problem of a deathmatch or killing the wrong node?


*I* live dangerously with fencing disabled. But then my clusters only 
really go down for maintenance reboots, and I usually do those when I'm 
at work and can walk into the server room and push the power button when 
it comes to that.


(More accurately the one cluster that goes down. The others fail over 
without any problems.)


Dima





Re: [ClusterLabs] newbie question

2017-05-17 Thread Sergei Gerasenko
Thank you, Ken!

On Thu, May 11, 2017 at 11:19 AM, Ken Gaillot wrote:

> On 05/05/2017 03:09 PM, Sergei Gerasenko wrote:
> > Hi,
> >
> > I have a very simple question.
> >
> > Pacemaker uses a dedicated "multicast" interface for the totem protocol.
> > I'm using pacemaker with LVS to provide HA load balancing. LVS uses
> > multicast interfaces to sync the status of TCP connections if a failover
> > occurs.
> >
> > I can understand services using the same interface if ports are used.
> > That way you can get a socket (ip + port). But there's no ports in this
> > case. So how can two services exchange messages without specifying
> > ports? I guess that's somehow related to multicast, but how exactly I
> > don't get.
> >
> > Can somebody point me to a primer on this topic?
> >
> > Thanks,
> >   S.
>
> Corosync is actually the cluster component that can use multicast, and
> it does use a specific port on a specific address. By default, it uses
> ports 5404 and 5405 when using multicast. See the corosync.conf(5) man
> page for mcastaddr and mcastport. Also see the transport option;
> corosync can be configured to use UDP unicast rather than multicast.
>
> I don't remember much about LVS, but I would guess it's the same -- it's
> probably just using a default port if not specified in the config.
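
To illustrate that multicast traffic is still addressed to a port, here is a
minimal Python sketch of a UDP listener that binds a port and joins a
multicast group. The group address 239.192.0.1 is only a placeholder, 5405 is
the corosync default mentioned above, and the function names are made up for
the example. This is not how corosync itself is implemented.

```python
import socket
import struct


def build_membership_request(group, interface="0.0.0.0"):
    """Pack the ip_mreq structure expected by IP_ADD_MEMBERSHIP:
    4 bytes of group address followed by 4 bytes of local interface."""
    return struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton(interface))


def open_multicast_listener(group, port):
    """Return a UDP socket bound to `port` that has joined `group`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM,
                         socket.IPPROTO_UDP)
    # Allow several listeners on the same host to share the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # The port is what separates one multicast service from another.
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_membership_request(group))
    return sock
```

A call such as open_multicast_listener("239.192.0.1", 5405) would then wait
for datagrams sent to that group on that port; a service using a different
port on the same interface does not see them.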


Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

2017-05-17 Thread Ken Gaillot
On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
> On 05/17/2017 11:28 AM, 井上 和徳 wrote:
>> Hi,
>> I'm testing Pacemaker-1.1.17-rc1.
>> The number of failures in "Too many failures (10) to fence" log does not 
>> match the number of actual failures.
> 
> Well, it kind of does: after 10 failures it doesn't try fencing again,
> so that is what the failure count stays at ;-)
> Of course it still sees the need to fence, but doesn't actually try.
> 
> Regards,
> Klaus

This feature can be a little confusing: it doesn't prevent all further
fence attempts of the target, just *immediate* fence attempts. Whenever
the next transition is started for some other reason (a configuration or
state change, cluster-recheck-interval, node failure, etc.), it will try
to fence again.

Also, it only checks this threshold if it's aborting a transition
*because* of this fence failure. If it's aborting the transition for
some other reason, the number can go higher than the threshold. That's
what I'm guessing happened here.
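
To compare the number of attempts with the number in the warning, something
like the following Python sketch could be run over the log. It assumes each
log entry is on a single line (the mail client wrapped the lines quoted
below); the function name is made up, and the three sample lines are the
11th attempt from the original report.

```python
import re

SAMPLE = [
    "May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2",
    "May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available",
    "May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up",
]


def summarize_fencing(log_lines, target):
    """Count fencing requests and failed reboot operations for one target.

    Returns (requests, failures, gave_up_after), where gave_up_after is the
    number of failures counted when the first "Too many failures" warning
    for the target shows up, or None if it never does."""
    requests = failures = 0
    gave_up_after = None
    failed = re.compile(rf"error: Operation reboot of {re.escape(target)} ")
    for line in log_lines:
        if f"Requesting fencing (reboot) of node {target}" in line:
            requests += 1
        elif failed.search(line):
            failures += 1
        elif ("Too many failures" in line and target in line
              and gave_up_after is None):
            gave_up_after = failures
    return requests, failures, gave_up_after


if __name__ == "__main__":
    print(summarize_fencing(SAMPLE, "rhel73-2"))
```

Run over the full log, the third value shows how many failures had been
logged by the time the warning appeared, which can then be compared with the
threshold printed in the warning itself.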

>> After the 11th fence failure, "Too many failures (10) to fence" is
>> output.
>> Incidentally, stonith-max-attempts has not been set, so it is 10 by default.
>>
>> [root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith 
>> failed|Too many failures"
>> ##Requesting fencing : 1st time
>> May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
>> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
>> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
>> failed
>> ## 2nd time
>> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
>> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
>> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
>> failed
>> ## 3rd time
>> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
>> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
>> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
>> failed
>> ## 4th time
>> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 05:56:05 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
>> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
>> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
>> failed
>> ## 5th time
>> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 05:57:10 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
>> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
>> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
>> failed
>> ## 6th time
>> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 05:58:14 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
>> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
>> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
>> failed
>> ## 7th time
>> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 05:59:19 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
>> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
>> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
>> failed
>> ## 8th time
>> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 06:00:24 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
>> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
>> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
>> failed
>> ## 9th time
>> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 06:01:28 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
>> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
>> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
>> failed
>> ## 10th time
>> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 06:02:33 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
>> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
>> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
>> failed
>> ## 11th time
>> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
>> node rhel73-2
>> May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Opera

Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Klaus Wenninger
On 05/17/2017 03:33 PM, Lentes, Bernd wrote:
>
> - On May 17, 2017, at 2:58 PM, Klaus Wenninger kwenn...@redhat.com wrote:
>
>
>>> I don't see that.
>> fence_* are the RHCS-style fence-agents coming mainly from
>> https://github.com/ClusterLabs/fence-agents.
>>
> Ah. Ok, i see that.
>
> Do you know if they cooperate with a SuSE HAE ? I found rpm's for SLES for 
> the fence agents.

There is no conditional compilation around support for RHCS fence agents,
so I guess there won't be a technical issue.
The question is just the degree of support you will get / want ...
But there are probably others besides me who can give you a more
satisfactory answer.

Regards,
Klaus

>
> Bernd
>  
>
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons 
> Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
>




Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Lentes, Bernd


On May 17, 2017, at 2:58 PM, Klaus Wenninger kwenn...@redhat.com wrote:


>> I don't see that.
> 
> fence_* are the RHCS-style fence-agents coming mainly from
> https://github.com/ClusterLabs/fence-agents.
> 

Ah, OK, I see that.

Do you know if they work with SUSE HAE? I found RPMs of the fence agents
for SLES.

Bernd
 



Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Klaus Wenninger
On 05/17/2017 02:52 PM, Lentes, Bernd wrote:
>
> - On May 17, 2017, at 2:11 PM, Vladislav Bogdanov bub...@hoster-ok.com 
> wrote:
>
>> 08.05.2017 22:20, Lentes, Bernd wrote:
>>> Hi,
>>>
>>> i remember that digimer often campaigns for a fence delay in a 2-node  
>>> cluster.
>>> E.g. here: 
>>> http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
>>> In my eyes it makes sense, so i try to establish that. I have two HP 
>>> servers,
>>> each with an ILO card.
>>> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
>>> refused to work.
>>>
>>> But i don't have a delay parameter there.
>>> crm ra info stonith:external/ipmi:
>> Hi,
>>
>> There is another ipmi fence agent - fence_ipmilan (part of fence-agents
>> package). It has 'delay' parameter.
>>
> I don't see that.

fence_* are the RHCS-style fence-agents coming mainly from
https://github.com/ClusterLabs/fence-agents.

>
>
> crm(live)# ra info stonith:ipmilan
> IPMI Over LAN (stonith:ipmilan)
>
> IPMI LAN STONITH device
>
> Parameters (*: required, []: default):
>
> hostname* (string):
> The hostname of the STONITH device
>
> ipaddr* (string): IP Address
> The IP address of the STONITH device
>
> port* (string):
> The port number to where the IPMI message is sent
>
> auth* (string):
> The authorization type of the IPMI session ("none", "straight", "md2", or 
> "md5")
>
> priv* (string):
> The privilege level of the user ("operator" or "admin")
>
> login* (string): Login
> The username used for logging in to the STONITH device
>
> password* (string): Password
> The password used for logging in to the STONITH device
>
> priority (integer, [0]): The priority of the stonith resource. Devices are 
> tried in order of highest priority to lowest.
> pcmk_host_argument (string, [port]): Advanced use only: An alternate 
> parameter to supply instead of 'port'
> Some devices do not support the standard 'port' parameter or may provide 
> additional ones.
> Use this to specify an alternate, device-specific, parameter that should 
> indicate the machine to be fenced.
> A value of 'none' can be used to tell the cluster not to supply any 
> additional parameters.
>
> pcmk_host_map (string): A mapping of host names to ports numbers for devices 
> that do not support host names.
> Eg. node1:1;node2:2,3 would tell the cluster to use port 1 for node1 and 
> ports 2 and 3 for node2
>
> pcmk_host_list (string): A list of machines controlled by this device 
> (Optional unless pcmk_host_check=static-list).
> pcmk_host_check (string, [dynamic-list]): How to determine which machines are 
> controlled by the device.
> Allowed values: dynamic-list (query the device), static-list (check the 
> pcmk_host_list attribute), none (assume every device can fence every machine)
> ...
>
>
> There is no delay parameter, and all the pcmk_*** parameters are the ones 
> from stonithd, and that one does not have a dedicated delay parameter,
> just the pcmk_delay_max parameter which is not fixed but random. Do you have 
> another ipmilan RA ?
>
> I have SLES 11 SP4 boxes, maybe my RA is not recent enough ?
>
> Bernd


Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Lentes, Bernd


On May 17, 2017, at 2:11 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:

> 08.05.2017 22:20, Lentes, Bernd wrote:
>> Hi,
>>
>> i remember that digimer often campaigns for a fence delay in a 2-node  
>> cluster.
>> E.g. here: 
>> http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
>> In my eyes it makes sense, so i try to establish that. I have two HP servers,
>> each with an ILO card.
>> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
>> refused to work.
>>
>> But i don't have a delay parameter there.
>> crm ra info stonith:external/ipmi:
> 
> Hi,
> 
> There is another ipmi fence agent - fence_ipmilan (part of fence-agents
> package). It has 'delay' parameter.
> 
>>

I don't see that.


crm(live)# ra info stonith:ipmilan
IPMI Over LAN (stonith:ipmilan)

IPMI LAN STONITH device

Parameters (*: required, []: default):

hostname* (string):
The hostname of the STONITH device

ipaddr* (string): IP Address
The IP address of the STONITH device

port* (string):
The port number to where the IPMI message is sent

auth* (string):
The authorization type of the IPMI session ("none", "straight", "md2", or 
"md5")

priv* (string):
The privilege level of the user ("operator" or "admin")

login* (string): Login
The username used for logging in to the STONITH device

password* (string): Password
The password used for logging in to the STONITH device

priority (integer, [0]): The priority of the stonith resource. Devices are 
tried in order of highest priority to lowest.
pcmk_host_argument (string, [port]): Advanced use only: An alternate parameter 
to supply instead of 'port'
Some devices do not support the standard 'port' parameter or may provide 
additional ones.
Use this to specify an alternate, device-specific, parameter that should 
indicate the machine to be fenced.
A value of 'none' can be used to tell the cluster not to supply any 
additional parameters.

pcmk_host_map (string): A mapping of host names to ports numbers for devices 
that do not support host names.
Eg. node1:1;node2:2,3 would tell the cluster to use port 1 for node1 and 
ports 2 and 3 for node2

pcmk_host_list (string): A list of machines controlled by this device (Optional 
unless pcmk_host_check=static-list).
pcmk_host_check (string, [dynamic-list]): How to determine which machines are 
controlled by the device.
Allowed values: dynamic-list (query the device), static-list (check the 
pcmk_host_list attribute), none (assume every device can fence every machine)
...


There is no delay parameter. All the pcmk_* parameters come from stonithd,
which has no dedicated fixed-delay parameter, only pcmk_delay_max, and that
one is random rather than fixed. Do you have a different ipmilan RA?

I have SLES 11 SP4 boxes; maybe my RA is not recent enough?

Bernd
 



Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Klaus Wenninger
On 05/17/2017 01:24 PM, Lentes, Bernd wrote:
>
> - On May 10, 2017, at 9:15 PM, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:
>
>> On 05/10/2017 01:54 PM, Ken Gaillot wrote:
>>> On 05/10/2017 12:26 PM, Dimitri Maziuk wrote:
>>>> - fencing in 2-node clusters does not work reliably without fixed delay
>>> Not quite. Fixed delay allows a particular method for avoiding a death
>>> match in a two-node cluster. Pacemaker's built-in random delay
>>> capability is another method.
>> Deathmatch is one problem, killing the wrong node (2 nodes, no quorum)
>> is another. Fixed delay is digimer's attempt to alleviate the latter,
>> so... apples and fruits not entirely unlike apples.
>>
>> --
> Hi,
>
> so what should i do ? Using pcmk_delay_max does not seem to be really 
> reliable.
> I don't like the idea of being dependent from a software thinking "which 
> delay i should choose, depending on the ... weather conditions, any mood ..."
> I'd like to know what the software is use is doing. Am i the only one having 
> that opinion ?
>
> How do you solve the problem of a deathmatch or killing the wrong node ?

If the only fence agent available to you doesn't support a delay
parameter, there is another solution, although not a very elegant
one in my opinion:

You can use fencing levels and a dummy fence agent like the
one that comes with CTS (fence_dummy).
If you put it on the same level as your real fence agent, list it
before that one and configure it to wait for a certain time and
then succeed.
Or, if you are using multiple levels, put it into a level that has
higher priority than the one your real fence agent is in. Again make
it wait, but have it fail afterwards so that the next level is attempted.
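
The two variants can be sketched as a small Python simulation. This is not
Pacemaker code and not a cluster configuration. It assumes that a level only
succeeds if every device in it succeeds, and that levels are tried in
ascending order until one succeeds. make_delay_device and real_agent are
hypothetical stand-ins for fence_dummy and the real fence agent.

```python
import time


def run_level(devices, target):
    """A level succeeds only if every device in it succeeds, in list order."""
    for device in devices:
        if not device(target):
            return False
    return True


def fence(levels, target):
    """Try levels in ascending index order; return the index that succeeded."""
    for index in sorted(levels):
        if run_level(levels[index], target):
            return index
    return None


def make_delay_device(seconds, succeed, sleep=time.sleep):
    """Hypothetical dummy agent: wait, then report success or failure."""
    def device(target):
        sleep(seconds)
        return succeed
    return device


def real_agent(target):
    """Hypothetical stand-in for the real fence agent; always succeeds here."""
    return True


# Variant 1: dummy and real agent in the same level, dummy first, succeeding.
same_level = {1: [make_delay_device(5, True), real_agent]}

# Variant 2: dummy alone in the first level, failing after the wait,
# so that the level containing the real agent is attempted next.
two_levels = {1: [make_delay_device(5, False)], 2: [real_agent]}
```

In both variants the real agent only runs after the dummy has finished
waiting, which is what produces the fixed delay.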

But I think we should aim for a solution inside Pacemaker here.
By the way, I'm the one who brought up the delay derived from node
health that Ken mentioned.
I could also imagine other enhancements to what we have with
pcmk_delay_max, such as a constant base delay with the random part
added on top.
It might be useful to be able to derive both, the constant base
and the random part, from attributes.
For the health-based delay I had in mind three parameters, such as
pcmk_delay_green, pcmk_delay_orange, pcmk_delay_red.
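
As a rough sketch of the arithmetic only (nothing like this exists in
Pacemaker as of this mail; the function names and the numbers are made up),
in Python:

```python
import random


def fence_delay(base, max_random, rng=random):
    """Constant base delay with a random part of up to max_random on top."""
    return base + rng.uniform(0.0, max_random)


def health_delay(health, delays):
    """Pick the delay configured for the node's current health colour."""
    return delays[health]


# Hypothetical values standing in for the three proposed parameters
# pcmk_delay_green, pcmk_delay_orange and pcmk_delay_red.
HEALTH_DELAYS = {"green": 0.0, "orange": 5.0, "red": 15.0}

if __name__ == "__main__":
    print(fence_delay(base=5.0, max_random=10.0))
    print(health_delay("orange", HEALTH_DELAYS))
```

With max_random set to 0 the first function degenerates to a plain fixed
delay, and with base set to 0 it behaves like today's pcmk_delay_max.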

Maybe a good opportunity here to ask for some feedback
regarding enhancements of generic fencing delay mechanisms...

Regards,
Klaus

>
> Bernd


Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Vladislav Bogdanov

08.05.2017 22:20, Lentes, Bernd wrote:
> Hi,
>
> i remember that digimer often campaigns for a fence delay in a 2-node cluster.
> E.g. here: http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
> In my eyes it makes sense, so i try to establish that. I have two HP servers,
> each with an ILO card.
> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
> refused to work.
>
> But i don't have a delay parameter there.
> crm ra info stonith:external/ipmi:

Hi,

There is another ipmi fence agent - fence_ipmilan (part of fence-agents
package). It has 'delay' parameter.

> ...
> pcmk_delay_max (time, [0s]): Enable random delay for stonith actions and
> specify the maximum of random delay
> This prevents double fencing when using slow devices such as sbd.
> Use this to enable random delay for stonith actions and specify the maximum
> of random delay.
> ...
>
> This is the only delay parameter i can use. But a random delay does not seem to
> be a reliable solution.
>
> The stonith:ipmilan agent also provides just a random delay. Same with the
> riloe agent.
>
> How did anyone solve this problem ?
>
> Or do i have to edit the RA (I will get practice in that :-))?
>
> Bernd



Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?

2017-05-17 Thread Lentes, Bernd


On May 10, 2017, at 9:15 PM, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:

> On 05/10/2017 01:54 PM, Ken Gaillot wrote:
>> On 05/10/2017 12:26 PM, Dimitri Maziuk wrote:
> 
>>> - fencing in 2-node clusters does not work reliably without fixed delay
>> 
>> Not quite. Fixed delay allows a particular method for avoiding a death
>> match in a two-node cluster. Pacemaker's built-in random delay
>> capability is another method.
> 
> Deathmatch is one problem, killing the wrong node (2 nodes, no quorum)
> is another. Fixed delay is digimer's attempt to alleviate the latter,
> so... apples and fruits not entirely unlike apples.
> 
> --

Hi,

So what should I do? Using pcmk_delay_max does not seem to be really reliable.
I don't like the idea of depending on software that thinks "which delay
should I choose, depending on the ... weather conditions, any mood ...".
I'd like to know what the software I use is doing. Am I the only one with
that opinion?

How do you solve the problem of a deathmatch or of killing the wrong node?

Bernd
 



Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

2017-05-17 Thread Klaus Wenninger
On 05/17/2017 11:28 AM, 井上 和徳 wrote:
> Hi,
> I'm testing Pacemaker-1.1.17-rc1.
> The number of failures in "Too many failures (10) to fence" log does not 
> match the number of actual failures.

Well, it kind of does: after 10 failures it doesn't try fencing again,
so that is what the failure count stays at ;-)
Of course it still sees the need to fence, but doesn't actually try.

Regards,
Klaus

>
> After the 11th fence failure, "Too many failures (10) to fence" is
> output.
> Incidentally, stonith-max-attempts has not been set, so it is 10 by default.
>
> [root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith 
> failed|Too many failures"
> ##Requesting fencing : 1st time
> May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 2nd time
> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 3rd time
> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 4th time
> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:56:05 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 5th time
> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:57:10 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 6th time
> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:58:14 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 7th time
> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 05:59:19 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 8th time
> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 06:00:24 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 9th time
> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 06:01:28 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 10th time
> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 06:02:33 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
> ## 11th time
> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
> node rhel73-2
> May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
> rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
> May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence 
> rhel73-2, giving up
> May 12 06:03:37 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith 
> failed
>
> Regards,
> Kazunori INOUE
>

[ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

2017-05-17 Thread 井上 和徳
Hi,
I'm testing Pacemaker-1.1.17-rc1.
The number of failures in "Too many failures (10) to fence" log does not match 
the number of actual failures.

After the 11th fence failure, "Too many failures (10) to fence" is output.
Incidentally, stonith-max-attempts has not been set, so it is 10 by default.

[root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith 
failed|Too many failures"
##Requesting fencing : 1st time
May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 2nd time
May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 3rd time
May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 4th time
May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:56:05 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 5th time
May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:57:10 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 6th time
May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:58:14 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 7th time
May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 05:59:19 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 8th time
May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 06:00:24 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 9th time
May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 06:01:28 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 10th time
May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 06:02:33 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
## 11th time
May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of 
node rhel73-2
May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Operation reboot of 
rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence 
rhel73-2, giving up
May 12 06:03:37 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed

Regards,
Kazunori INOUE
