[ClusterLabs] CIB: op-status=4 ?
Hi,

I have a question regarding the 'op-status' attribute getting the value 4.
In my case I see strange behavior: resources get "monitor" operation entries
in the CIB with op-status=4, yet the operations do not seem to have been
called (exec-time=0). What does op-status=4 mean?

I would appreciate some elaboration on this, since Pacemaker interprets it
as an error, which causes logs like:

crm_mon: error: unpack_rsc_op: Preventing dbx_head_head from re-starting anywhere: operation monitor failed 'not configured' (6)

and I am pretty sure the resource agent was not called (no logs, exec-time=0).

There are two aspects of this:

1) A harmless one (Pacemaker does not seem to bother about it), which I
guess indicates cancelled monitoring operations: op-status=4, rc-code=189
* Example:

2) An error-level one (op-status=4, rc-code=6), which generates logs:
crm_mon: error: unpack_rsc_op: Preventing dbx_head_head from re-starting anywhere: operation monitor failed 'not configured' (6)
* Example:

Could it be some hardware (VM hypervisor) issue?

Thanks in advance,

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
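[For readers hitting the same question: operation results are recorded in the
status section of the CIB as lrm_rsc_op entries. Below is a minimal sketch of
such an entry with invented attribute values (not taken from the poster's
cluster). In the Pacemaker sources, op-status=4 corresponds to
PCMK_LRM_OP_ERROR, and rc-code carries the OCF exit status recorded for the
agent (6 = OCF_ERR_CONFIGURED, i.e. "not configured").]

```
<!-- illustrative sketch only; ids and values are invented -->
<lrm_rsc_op id="dbx_head_head_last_failure_0" operation="monitor"
            interval="30000" op-status="4" rc-code="6"
            exec-time="0" queue-time="0"/>
```

[An exec-time of 0 together with an error rc-code suggests the result was
recorded without the agent actually running, which is consistent with the
poster's observation.]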
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
On 2017-05-17 06:24, Lentes, Bernd wrote:
...
> I'd like to know what the software I use is doing. Am I the only one
> having that opinion ?

No.

> How do you solve the problem of a deathmatch or killing the wrong node ?

*I* live dangerously with fencing disabled. But then my clusters only really
go down for maintenance reboots, and I usually do those when I'm at work and
can walk into the server room and push the power button when it comes to
that. (More accurately, the one cluster that goes down. The others fail over
without any problems.)

Dima
Re: [ClusterLabs] newbie question
Thank you, Ken!

On Thu, May 11, 2017 at 11:19 AM, Ken Gaillot wrote:
> On 05/05/2017 03:09 PM, Sergei Gerasenko wrote:
> > Hi,
> >
> > I have a very simple question.
> >
> > Pacemaker uses a dedicated "multicast" interface for the totem protocol.
> > I'm using pacemaker with LVS to provide HA load balancing. LVS uses
> > multicast interfaces to sync the status of TCP connections if a failover
> > occurs.
> >
> > I can understand services using the same interface if ports are used.
> > That way you can get a socket (ip + port). But there's no ports in this
> > case. So how can two services exchange messages without specifying
> > ports? I guess that's somehow related to multicast, but how exactly I
> > don't get.
> >
> > Can somebody point me to a primer on this topic?
> >
> > Thanks,
> > S.
>
> Corosync is actually the cluster component that can use multicast, and
> it does use a specific port on a specific address. By default, it uses
> ports 5404 and 5405 when using multicast. See the corosync.conf(5) man
> page for mcastaddr and mcastport. Also see the transport option;
> corosync can be configured to use UDP unicast rather than multicast.
>
> I don't remember much about LVS, but I would guess it's the same -- it's
> probably just using a default port if not specified in the config.
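[To make Ken's pointers concrete, here is a hedged sketch of the relevant
corosync.conf fragment; the addresses are placeholders, and the
corosync.conf(5) man page remains the authoritative description.]

```
totem {
    version: 2
    # transport defaults to multicast ("udp"); set "udpu" for UDP unicast
    transport: udp
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0   # network address of the cluster interface
        mcastaddr: 239.255.1.1     # multicast group address
        mcastport: 5405            # corosync also uses mcastport - 1, i.e. 5404
    }
}
```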
Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate
On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
> On 05/17/2017 11:28 AM, 井上 和徳 wrote:
>> Hi,
>> I'm testing Pacemaker-1.1.17-rc1.
>> The number of failures in the "Too many failures (10) to fence" log does
>> not match the number of actual failures.
>
> Well, it kind of does, as after 10 failures it doesn't try fencing again,
> so that is what failures stay at ;-)
> Of course it still sees the need to fence but doesn't actually try.
>
> Regards,
> Klaus

This feature can be a little confusing: it doesn't prevent all further fence
attempts of the target, just *immediate* fence attempts. Whenever the next
transition is started for some other reason (a configuration or state change,
cluster-recheck-interval, node failure, etc.), it will try to fence again.

Also, it only checks this threshold if it's aborting a transition *because*
of this fence failure. If it's aborting the transition for some other reason,
the number can go higher than the threshold. That's what I'm guessing
happened here.

>> After the 11th fence failure, "Too many failures (10) to fence" is
>> output.
>> Incidentally, stonith-max-attempts has not been set, so it is 10 by default.
>>
>> [root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
>> ## Requesting fencing : 1st time
>> May 12 05:51:47 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:52:52 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
>> May 12 05:52:52 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>> ## 2nd time
>> May 12 05:52:52 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:53:56 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
>> May 12 05:53:56 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>> ## 3rd time
>> May 12 05:53:56 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:55:01 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
>> May 12 05:55:01 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>> ## 4th time
>> May 12 05:55:01 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:56:05 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
>> May 12 05:56:05 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>> ## 5th time
>> May 12 05:56:05 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:57:10 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
>> May 12 05:57:10 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>> ## 6th time
>> May 12 05:57:10 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:58:14 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
>> May 12 05:58:14 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>> ## 7th time
>> May 12 05:58:14 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:59:19 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
>> May 12 05:59:19 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>> ## 8th time
>> May 12 05:59:19 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 06:00:24 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
>> May 12 06:00:24 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>> ## 9th time
>> May 12 06:00:24 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 06:01:28 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
>> May 12 06:01:28 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>> ## 10th time
>> May 12 06:01:28 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 06:02:33 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
>> May 12 06:02:33 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>> ## 11th time
>> May 12 06:02:33 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 06:03:37 rhel73-1 stonith-ng[5265]: error: Opera
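[Since stonith-max-attempts comes up here: it is a cluster property,
introduced in Pacemaker 1.1.17. A hedged example of changing it from its
default of 10, in crm shell syntax (pcs users would use "pcs property set"
instead; the value 20 is just an illustration):]

```
crm configure property stonith-max-attempts=20
```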
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
On 05/17/2017 03:33 PM, Lentes, Bernd wrote:
>
> - On May 17, 2017, at 2:58 PM, Klaus Wenninger kwenn...@redhat.com wrote:
>
>>> I don't see that.
>> fence_* are the RHCS-style fence-agents coming mainly from
>> https://github.com/ClusterLabs/fence-agents.
>>
> Ah. OK, I see that.
>
> Do you know if they cooperate with SUSE HAE ? I found RPMs for SLES for
> the fence agents.

There is no conditional compilation around support for RHCS fence agents.
Thus I guess there won't be a technical issue. The question is just the
degree of support you will get / want ... But there are probably others than
me who can give you a more satisfactory answer.

Regards,
Klaus

> Bernd
>
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
- On May 17, 2017, at 2:58 PM, Klaus Wenninger kwenn...@redhat.com wrote:

>> I don't see that.
>
> fence_* are the RHCS-style fence-agents coming mainly from
> https://github.com/ClusterLabs/fence-agents.
>

Ah. OK, I see that.

Do you know if they cooperate with SUSE HAE ? I found RPMs for SLES for the
fence agents.

Bernd
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
On 05/17/2017 02:52 PM, Lentes, Bernd wrote:
>
> - On May 17, 2017, at 2:11 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>
>> 08.05.2017 22:20, Lentes, Bernd wrote:
>>> Hi,
>>>
>>> I remember that digimer often campaigns for a fence delay in a 2-node
>>> cluster. E.g. here:
>>> http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
>>> In my eyes it makes sense, so I try to establish that. I have two HP
>>> servers, each with an iLO card.
>>> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
>>> refused to work.
>>>
>>> But I don't have a delay parameter there.
>>> crm ra info stonith:external/ipmi:
>> Hi,
>>
>> There is another ipmi fence agent - fence_ipmilan (part of the
>> fence-agents package). It has a 'delay' parameter.
>>
> I don't see that.

fence_* are the RHCS-style fence-agents coming mainly from
https://github.com/ClusterLabs/fence-agents.

> crm(live)# ra info stonith:ipmilan
> IPMI Over LAN (stonith:ipmilan)
>
> IPMI LAN STONITH device
>
> Parameters (*: required, []: default):
>
> hostname* (string): The hostname of the STONITH device
> ipaddr* (string): IP Address - The IP address of the STONITH device
> port* (string): The port number to where the IPMI message is sent
> auth* (string): The authorization type of the IPMI session ("none", "straight", "md2", or "md5")
> priv* (string): The privilege level of the user ("operator" or "admin")
> login* (string): Login - The username used for logging in to the STONITH device
> password* (string): Password - The password used for logging in to the STONITH device
> priority (integer, [0]): The priority of the stonith resource. Devices are tried in order of highest priority to lowest.
> pcmk_host_argument (string, [port]): Advanced use only: an alternate parameter to supply instead of 'port'. Some devices do not support the standard 'port' parameter or may provide additional ones. Use this to specify an alternate, device-specific parameter that should indicate the machine to be fenced. A value of 'none' can be used to tell the cluster not to supply any additional parameters.
> pcmk_host_map (string): A mapping of host names to port numbers for devices that do not support host names. E.g. node1:1;node2:2,3 would tell the cluster to use port 1 for node1 and ports 2 and 3 for node2.
> pcmk_host_list (string): A list of machines controlled by this device (optional unless pcmk_host_check=static-list).
> pcmk_host_check (string, [dynamic-list]): How to determine which machines are controlled by the device. Allowed values: dynamic-list (query the device), static-list (check the pcmk_host_list attribute), none (assume every device can fence every machine)
> ...
>
> There is no delay parameter, and all the pcmk_* parameters are the ones
> from stonithd, and that one does not have a dedicated delay parameter,
> just the pcmk_delay_max parameter, which is not fixed but random. Do you
> have another ipmilan RA ?
>
> I have SLES 11 SP4 boxes, maybe my RA is not recent enough ?
>
> Bernd
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
- On May 17, 2017, at 2:11 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:

> 08.05.2017 22:20, Lentes, Bernd wrote:
>> Hi,
>>
>> I remember that digimer often campaigns for a fence delay in a 2-node
>> cluster. E.g. here:
>> http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
>> In my eyes it makes sense, so I try to establish that. I have two HP
>> servers, each with an iLO card.
>> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
>> refused to work.
>>
>> But I don't have a delay parameter there.
>> crm ra info stonith:external/ipmi:
>
> Hi,
>
> There is another ipmi fence agent - fence_ipmilan (part of the
> fence-agents package). It has a 'delay' parameter.
>

I don't see that.

crm(live)# ra info stonith:ipmilan
IPMI Over LAN (stonith:ipmilan)

IPMI LAN STONITH device

Parameters (*: required, []: default):

hostname* (string): The hostname of the STONITH device
ipaddr* (string): IP Address - The IP address of the STONITH device
port* (string): The port number to where the IPMI message is sent
auth* (string): The authorization type of the IPMI session ("none", "straight", "md2", or "md5")
priv* (string): The privilege level of the user ("operator" or "admin")
login* (string): Login - The username used for logging in to the STONITH device
password* (string): Password - The password used for logging in to the STONITH device
priority (integer, [0]): The priority of the stonith resource. Devices are tried in order of highest priority to lowest.
pcmk_host_argument (string, [port]): Advanced use only: an alternate parameter to supply instead of 'port'. Some devices do not support the standard 'port' parameter or may provide additional ones. Use this to specify an alternate, device-specific parameter that should indicate the machine to be fenced. A value of 'none' can be used to tell the cluster not to supply any additional parameters.
pcmk_host_map (string): A mapping of host names to port numbers for devices that do not support host names. E.g. node1:1;node2:2,3 would tell the cluster to use port 1 for node1 and ports 2 and 3 for node2.
pcmk_host_list (string): A list of machines controlled by this device (optional unless pcmk_host_check=static-list).
pcmk_host_check (string, [dynamic-list]): How to determine which machines are controlled by the device. Allowed values: dynamic-list (query the device), static-list (check the pcmk_host_list attribute), none (assume every device can fence every machine)
...

There is no delay parameter, and all the pcmk_* parameters are the ones from
stonithd, and that one does not have a dedicated delay parameter, just the
pcmk_delay_max parameter, which is not fixed but random. Do you have another
ipmilan RA ?

I have SLES 11 SP4 boxes, maybe my RA is not recent enough ?

Bernd
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
On 05/17/2017 01:24 PM, Lentes, Bernd wrote:
>
> - On May 10, 2017, at 9:15 PM, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:
>
>> On 05/10/2017 01:54 PM, Ken Gaillot wrote:
>>> On 05/10/2017 12:26 PM, Dimitri Maziuk wrote:
>>>> - fencing in 2-node clusters does not work reliably without fixed delay
>>> Not quite. Fixed delay allows a particular method for avoiding a death
>>> match in a two-node cluster. Pacemaker's built-in random delay
>>> capability is another method.
>> Deathmatch is one problem, killing the wrong node (2 nodes, no quorum)
>> is another. Fixed delay is digimer's attempt to alleviate the latter,
>> so... apples and fruits not entirely unlike apples.
>>
> Hi,
>
> so what should I do ? Using pcmk_delay_max does not seem to be really
> reliable. I don't like the idea of depending on software thinking "which
> delay should I choose, depending on the ... weather conditions, any mood
> ..." I'd like to know what the software I use is doing. Am I the only one
> having that opinion ?
>
> How do you solve the problem of a deathmatch or killing the wrong node ?

When you just have a fence agent available that doesn't support an extra
delay parameter, there is another solution - although not very beautiful in
my mind: you can use fencing levels and a dummy fence agent like the one
coming with cts (fence_dummy).

If you put it on the same level as your real fence agent, it is in the list
before that one; configure it to wait for a certain time and succeed
afterwards. Or, if you are using multiple levels, put it into a level that
has higher priority than the one your real fence agent is in. Again make it
wait, but afterwards fail, so that the next level is attempted.

But I think we should aim for a solution inside pacemaker here. I'm, btw.,
the one that brought up a delay derived from the node health Ken had
mentioned. I could as well imagine other enhancements to what we have with
pcmk_delay_max, like a constant base delay where the random part comes on
top. It might be useful to be able to derive both - the constant base and
the random part - from attributes. For the health-based delay I had in mind
three parameters like pcmk_delay_green, pcmk_delay_orange, pcmk_delay_red.

Maybe a good opportunity here to ask for some feedback regarding
enhancements of generic fencing delay mechanisms...

Regards,
Klaus

> Bernd
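[To make the fence_dummy workaround concrete, a rough sketch in crm shell
syntax. This is an assumption-laden illustration, not a tested configuration:
the fence_dummy parameter names (mode, delay), the node name, and the IPMI
details are all placeholders - check the agent's metadata before relying on
them.]

```
# dummy agent configured to wait and then report success
# (mode/delay are assumed parameter names for the cts fence_dummy agent)
primitive fence-wait stonith:fence_dummy \
        params mode=pass delay=15 pcmk_host_list=node1
# the real fence agent
primitive fence-ipmi stonith:external/ipmi \
        params hostname=node1 ipaddr=192.168.0.10 userid=admin passwd=secret
# same fencing level: the dummy is tried first and introduces the delay
fencing_topology node1: fence-wait,fence-ipmi
```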
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
08.05.2017 22:20, Lentes, Bernd wrote:
> Hi,
>
> I remember that digimer often campaigns for a fence delay in a 2-node
> cluster. E.g. here:
> http://oss.clusterlabs.org/pipermail/pacemaker/2013-July/019228.html
> In my eyes it makes sense, so I try to establish that. I have two HP
> servers, each with an iLO card.
> I have to use the stonith:external/ipmi agent, the stonith:external/riloe
> refused to work.
>
> But I don't have a delay parameter there.
> crm ra info stonith:external/ipmi:

Hi,

There is another ipmi fence agent - fence_ipmilan (part of the fence-agents
package). It has a 'delay' parameter.

> ...
> pcmk_delay_max (time, [0s]): Enable random delay for stonith actions and
> specify the maximum of random delay.
>     This prevents double fencing when using slow devices such as sbd.
>     Use this to enable random delay for stonith actions and specify the
>     maximum of random delay.
> ...
>
> This is the only delay parameter I can use. But a random delay does not
> seem to be a reliable solution. The stonith:ipmilan agent also provides
> just a random delay. Same with the riloe agent.
>
> How did anyone solve this problem ? Or do I have to edit the RA (I will
> get practice in that :-)) ?
>
> Bernd
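[For comparison, the two delay styles discussed in this thread, sketched in
crm shell syntax. Host names, addresses, credentials, and the delay values
are placeholders.]

```
# fixed delay: the fence_ipmilan agent from the fence-agents package
# has its own 'delay' parameter
primitive fence-node1 stonith:fence_ipmilan \
        params pcmk_host_list=node1 ipaddr=192.168.0.10 login=admin \
        passwd=secret delay=15
# random delay: pcmk_delay_max is handled by stonithd and works with
# any fence agent
primitive fence-node2 stonith:external/ipmi \
        params hostname=node2 ipaddr=192.168.0.11 userid=admin \
        passwd=secret pcmk_delay_max=15s
```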
Re: [ClusterLabs] how to set a dedicated fence delay for a stonith agent ?
- On May 10, 2017, at 9:15 PM, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:

> On 05/10/2017 01:54 PM, Ken Gaillot wrote:
>> On 05/10/2017 12:26 PM, Dimitri Maziuk wrote:
>
>>> - fencing in 2-node clusters does not work reliably without fixed delay
>>
>> Not quite. Fixed delay allows a particular method for avoiding a death
>> match in a two-node cluster. Pacemaker's built-in random delay
>> capability is another method.
>
> Deathmatch is one problem, killing the wrong node (2 nodes, no quorum)
> is another. Fixed delay is digimer's attempt to alleviate the latter,
> so... apples and fruits not entirely unlike apples.
>
> --

Hi,

so what should I do ? Using pcmk_delay_max does not seem to be really
reliable. I don't like the idea of depending on software thinking "which
delay should I choose, depending on the ... weather conditions, any mood
..." I'd like to know what the software I use is doing. Am I the only one
having that opinion ?

How do you solve the problem of a deathmatch or killing the wrong node ?

Bernd
Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate
On 05/17/2017 11:28 AM, 井上 和徳 wrote:
> Hi,
> I'm testing Pacemaker-1.1.17-rc1.
> The number of failures in "Too many failures (10) to fence" log does not
> match the number of actual failures.

Well it kind of does as after 10 failures it doesn't try fencing again so
that is what failures stay at ;-)
Of course it still sees the need to fence but doesn't actually try.

Regards,
Klaus

>
> After the 11th time fence failure, "Too many failures (10) to fence" is
> output.
> Incidentally, stonith-max-attempts has not been set, so it is 10 by default.
>
> [root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
> ## Requesting fencing : 1st time
> May 12 05:51:47 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:52:52 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
> May 12 05:52:52 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 2nd time
> May 12 05:52:52 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:53:56 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
> May 12 05:53:56 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 3rd time
> May 12 05:53:56 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:55:01 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
> May 12 05:55:01 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 4th time
> May 12 05:55:01 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:56:05 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
> May 12 05:56:05 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 5th time
> May 12 05:56:05 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:57:10 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
> May 12 05:57:10 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 6th time
> May 12 05:57:10 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:58:14 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
> May 12 05:58:14 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 7th time
> May 12 05:58:14 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 05:59:19 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
> May 12 05:59:19 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 8th time
> May 12 05:59:19 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 06:00:24 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
> May 12 06:00:24 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 9th time
> May 12 06:00:24 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 06:01:28 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
> May 12 06:01:28 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 10th time
> May 12 06:01:28 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 06:02:33 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
> May 12 06:02:33 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> ## 11th time
> May 12 06:02:33 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> May 12 06:03:37 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
> May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up
> May 12 06:03:37 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>
> Regards,
> Kazunori INOUE
[ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate
Hi,
I'm testing Pacemaker-1.1.17-rc1.
The number of failures in "Too many failures (10) to fence" log does not
match the number of actual failures.

After the 11th time fence failure, "Too many failures (10) to fence" is
output.
Incidentally, stonith-max-attempts has not been set, so it is 10 by default.

[root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
## Requesting fencing : 1st time
May 12 05:51:47 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 05:52:52 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
May 12 05:52:52 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
## 2nd time
May 12 05:52:52 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 05:53:56 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
May 12 05:53:56 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
## 3rd time
May 12 05:53:56 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 05:55:01 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
May 12 05:55:01 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
## 4th time
May 12 05:55:01 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 05:56:05 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
May 12 05:56:05 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
## 5th time
May 12 05:56:05 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 05:57:10 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
May 12 05:57:10 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
## 6th time
May 12 05:57:10 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 05:58:14 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
May 12 05:58:14 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
## 7th time
May 12 05:58:14 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 05:59:19 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
May 12 05:59:19 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
## 8th time
May 12 05:59:19 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 06:00:24 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
May 12 06:00:24 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
## 9th time
May 12 06:00:24 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 06:01:28 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
May 12 06:01:28 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
## 10th time
May 12 06:01:28 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 06:02:33 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
May 12 06:02:33 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
## 11th time
May 12 06:02:33 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
May 12 06:03:37 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up
May 12 06:03:37 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed

Regards,
Kazunori INOUE