Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-22 Thread Lars Ellenberg
On Thu, Sep 22, 2016 at 08:01:44AM +0200, Klaus Wenninger wrote:
> On 09/22/2016 06:34 AM, renayama19661...@ybb.ne.jp wrote:
> > Hi Klaus,
> >
> > Thank you for your comment.
> >
> > Okay!
> >
> > Does this mean that the community will consider an improvement in the future?
> 
> Speaking for myself, I'd like some feedback on whether we might
> have overlooked something, so that this is really a config issue.
> 
> One of my current projects is to improve how sbd observes
> pacemaker_remoted. (I say "improve" because something already
> exists: you can enable pacemaker-watcher on remote nodes, but it
> causes unnecessary watchdog reboots in a couple of cases ...)
> It looks as if some additional, direct communication between
> pacemaker_remoted (which is very similar to lrmd) and sbd would
> come in handy for that: a heartbeat in the general sense, not the
> communication and membership layer for clusters.
> 
> So in this light it might make sense
> to consider extending that to crmd as well ...
> 
> If it turns out that we are facing a real issue, I'd hereby like
> to ask for input.

In a somewhat broader context, there used to be "apphbd",
which would register itself with some watchdog to "monitor" itself,
and with which "applications" would register to negotiate their own
"application heartbeat".
Not necessarily only components of the cluster manager,
but cluster-aware "resources" as well.

If they failed to feed their app hb, apphbd would then "trigger a
notification", and some other entity would react to that based on
yet another configuration.  And plugins.
Didn't old heartbeat like the concept of plugins...

Anyways, you get the idea.
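
To make the application-heartbeat idea a bit more concrete, here is a
minimal Python sketch. It is not the apphbd API: the class and method names
(AppHeartbeatMonitor, register, feed, overdue) are invented for illustration,
and time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class AppHeartbeatMonitor:
    """Toy stand-in for an apphbd-like daemon (hypothetical API)."""

    # name -> (negotiated interval in seconds, time of last heartbeat)
    clients: dict = field(default_factory=dict)

    def register(self, name: str, interval: float, now: float) -> None:
        # A client negotiates how often it promises to send a heartbeat.
        self.clients[name] = (interval, now)

    def feed(self, name: str, now: float) -> None:
        # A client reports that it is still making progress.
        interval, _ = self.clients[name]
        self.clients[name] = (interval, now)

    def overdue(self, now: float) -> list:
        # Clients whose last heartbeat is older than their negotiated interval.
        return sorted(
            name
            for name, (interval, last) in self.clients.items()
            if now - last > interval
        )


monitor = AppHeartbeatMonitor()
monitor.register("crmd", interval=5.0, now=0.0)
monitor.register("lrmd", interval=5.0, now=0.0)
monitor.feed("lrmd", now=4.0)  # lrmd keeps feeding its heartbeat
monitor.feed("lrmd", now=8.0)  # crmd is frozen and never feeds
print(monitor.overdue(now=9.0))
```

What happens once a client is overdue (notification, watchdog reset, plugin)
is deliberately left out; that is the separately configured part.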

Currently, we have chosen SBD as such a "watchdog proxy";
maybe we can generalize it?

All of that would require cooperation within the node itself, though.

In this scenario, the cluster is not trusting the "sanity"
of the "commander in chief".

So maybe, in addition to this "in-node application heartbeat",
all non-DCs should periodically and actively challenge the sanity
of the DC from the outside, and trigger a re-election if they have
"reasonable doubt"?
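
As a rough illustration of what "reasonable doubt" could mean, the Python
sketch below counts consecutive unanswered probes and asks for a re-election
once a threshold is reached. Nothing like this exists in Pacemaker as far as
this thread shows; DcSanityChecker, record_probe and max_missed are
hypothetical names, and how a probe is actually sent is left open.

```python
class DcSanityChecker:
    """Hypothetical per-node tracker of how often the DC failed to answer."""

    def __init__(self, max_missed: int = 3):
        self.max_missed = max_missed  # assumed threshold for "reasonable doubt"
        self.missed = 0

    def record_probe(self, answered: bool) -> bool:
        """Record one probe result; return True if a re-election should start."""
        if answered:
            # Any answer clears the doubt that has accumulated so far.
            self.missed = 0
            return False
        self.missed += 1
        return self.missed >= self.max_missed


checker = DcSanityChecker(max_missed=3)
probes = [True, False, False, True, False, False, False]
decisions = [checker.record_probe(p) for p in probes]
print(decisions)
```

Requiring several consecutive misses is a design choice to avoid re-electing
on a single slow answer; the right threshold would depend on probe interval
and load.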


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-22 Thread Klaus Wenninger
On 09/22/2016 06:34 AM, renayama19661...@ybb.ne.jp wrote:
> Hi Klaus,
>
> Thank you for your comment.
>
> Okay!
>
> Does this mean that the community will consider an improvement in the future?

Speaking for myself, I'd like some feedback on whether we might
have overlooked something, so that this is really a config issue.

One of my current projects is to improve how sbd observes
pacemaker_remoted. (I say "improve" because something already
exists: you can enable pacemaker-watcher on remote nodes, but it
causes unnecessary watchdog reboots in a couple of cases ...)
It looks as if some additional, direct communication between
pacemaker_remoted (which is very similar to lrmd) and sbd would
come in handy for that: a heartbeat in the general sense, not the
communication and membership layer for clusters.

So in this light it might make sense
to consider extending that to crmd as well ...

If it turns out that we are facing a real issue, I'd hereby like
to ask for input.

>
> Best Regards,
> Hideo Yamauchi.
>
>
>
> - Original Message -
>> From: Klaus Wenninger <kwenn...@redhat.com>
>> To: users@clusterlabs.org
>> Cc: 
>> Date: 2016/9/21, Wed 19:09
>> Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster 
>> decisions are delayed infinitely
>>
>> My observation results still stand:
>>
>> - the cluster doesn't find its way out of this situation on its own
>> - sbd doesn't help because the cib is still readable
>>
>> Regards,
>> Klaus
>>
>> On 09/21/2016 11:52 AM, renayama19661...@ybb.ne.jp wrote:
>>>  Hi All,
>>>
>>>  Was a final conclusion reached on this problem?
>>>
>>>  If a user runs sbd, can the cluster avoid the problem of a SIGSTOPped crmd?
>>>
>>>  We are interested in this problem, too.
>>>
>>>  Best Regards,
>>>
>>>  Hideo Yamauchi.
>>>
>>>


Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Digimer
On 08/09/16 06:51 PM, Shermal Fernando wrote:
> Hi Jehan-Guillaume,
> 
> Sorry for disturbing you. This is really important for us to pass this test 
> on the pacemaker resiliency and robustness. 
> To my understanding, it's the pacemakerd who feeds the watchdog. If only the 
> crmd is hung, fencing will not work. Am I correct here?
> 
> Regards,
> Shermal Fernando

Watchdog fencing is not ideal. If you're running a critical enough
environment, consider using IPMI or other "real" fencing methods.
Personally, we use (and recommend) IPMI as a primary fence method with a
pair of switched PDUs as a backup fence method. This provides full
coverage and is generally a lot faster.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Jehan-Guillaume de Rorthais
On Thu, 8 Sep 2016 09:51:27 +
Shermal Fernando <sherma...@millenniumit.com> wrote:

> Hi Jehan-Guillaume,
> 
> Sorry for disturbing you. This is really important for us to pass this test
> on the pacemaker resiliency and robustness. To my understanding, it's the
> pacemakerd who feeds the watchdog. If only the crmd is hung, fencing will not
> work. Am I correct here?

I think so, yes.

I was talking about a scenario where the whole server is under high load
(fork bomb, swap storm, ...), not one where only crmd is hung for some reason.
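
The difference between the two scenarios can be shown with a small
simulation. This Python sketch makes no claim about which Pacemaker or sbd
process really feeds the watchdog; SimulatedWatchdog, node_gets_reset and
feeder_checks_child are made-up names, and the 5 second timeout is arbitrary.
It only shows that a feeder which does not look at a hung child keeps the
watchdog happy.

```python
class SimulatedWatchdog:
    """Toy watchdog: reports a reset if it is not fed within `timeout` seconds."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_fed = 0.0

    def feed(self, now: float) -> None:
        self.last_fed = now

    def would_reset(self, now: float) -> bool:
        return now - self.last_fed > self.timeout


def node_gets_reset(feeder_checks_child: bool, child_hung_at: int, ticks: int) -> bool:
    """Simulate one-second ticks; return True if the watchdog resets the node."""
    wd = SimulatedWatchdog(timeout=5.0)
    for now in range(1, ticks + 1):
        child_ok = now < child_hung_at
        # The feeder itself is healthy. It only withholds the feed if it
        # actually checks the child and sees that the child is hung.
        if child_ok or not feeder_checks_child:
            wd.feed(float(now))
        if wd.would_reset(float(now)):
            return True
    return False


# Feeder ignores the hung child: the watchdog is fed forever, no reset.
print(node_gets_reset(feeder_checks_child=False, child_hung_at=3, ticks=20))
# Feeder watches the child: feeding stops and the watchdog fires.
print(node_gets_reset(feeder_checks_child=True, child_hung_at=3, ticks=20))
```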


> -Original Message-
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com] 
> Sent: Thursday, September 08, 2016 3:12 PM
> To: Shermal Fernando
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster
> decisions are delayed infinitely
> 
> On Thu, 8 Sep 2016 08:58:15 +
> Shermal Fernando <sherma...@millenniumit.com> wrote:
> 
> > Hi Jehan-Guillaume,
> > 
> > Does this mean the watchdog will self-terminate the machine when the crm 
> > daemon is frozen?
> 
> This means that if the machine is under such a load that Pacemaker is not
> able to feed the watchdog, the watchdog will fence the machine itself.
> 
> > -Original Message-
> > From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> > Sent: Thursday, September 08, 2016 12:52 PM
> > To: Digimer
> > Cc: Cluster Labs - All topics related to open-source clustering 
> > welcomed
> > Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, 
> > cluster decisions are delayed infinitely
> > 
> > On Thu, 8 Sep 2016 15:55:50 +0900
> > Digimer <li...@alteeve.ca> wrote:
> > 
> > > On 08/09/16 03:47 PM, Ulrich Windl wrote:
> > > >>>> Shermal Fernando <sherma...@millenniumit.com> schrieb am
> > > >>>> 08.09.2016 um
> > > >>>> 06:41 in
> > > > Nachricht
> > > > <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
> > > >> The whole cluster will fail if the DC (crm daemon) is frozen due 
> > > >> to CPU starvation or hanging while trying to perform a IO operation.
> > > >> Please share some thoughts on this issue.
> > > > 
> > > > What is "the whole cluster will fail"? If the DC times out, some 
> > > > recovery will take place.
> > > 
> > > Yup. The starved node should be declared lost by corosync, the 
> > > remaining nodes reform and if they're still quorate, the hung node 
> > > should be fenced. Recovery occur and life goes on.
> > 
> > +1
> > 
> > And fencing might either come from outside, or just from the server 
> > itself using watchdog.



Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Shermal Fernando
Hi Jehan-Guillaume,

Sorry to disturb you. It is really important for us to pass this test of
Pacemaker's resiliency and robustness.
As I understand it, pacemakerd is what feeds the watchdog. If only
crmd is hung, fencing will not work. Am I correct?

Regards,
Shermal Fernando







-Original Message-
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com] 
Sent: Thursday, September 08, 2016 3:12 PM
To: Shermal Fernando
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster 
decisions are delayed infinitely

On Thu, 8 Sep 2016 08:58:15 +
Shermal Fernando <sherma...@millenniumit.com> wrote:

> Hi Jehan-Guillaume,
> 
> Does this mean the watchdog will self-terminate the machine when the crm 
> daemon is frozen?

This means that if the machine is under such a load that Pacemaker is not able 
to feed the watchdog, the watchdog will fence the machine itself.

> -Original Message-
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com]
> Sent: Thursday, September 08, 2016 12:52 PM
> To: Digimer
> Cc: Cluster Labs - All topics related to open-source clustering 
> welcomed
> Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, 
> cluster decisions are delayed infinitely
> 
> On Thu, 8 Sep 2016 15:55:50 +0900
> Digimer <li...@alteeve.ca> wrote:
> 
> > On 08/09/16 03:47 PM, Ulrich Windl wrote:
> > >>>> Shermal Fernando <sherma...@millenniumit.com> schrieb am
> > >>>> 08.09.2016 um
> > >>>> 06:41 in
> > > Nachricht
> > > <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
> > >> The whole cluster will fail if the DC (crm daemon) is frozen due 
> > >> to CPU starvation or hanging while trying to perform a IO operation.
> > >> Please share some thoughts on this issue.
> > > 
> > > What is "the whole cluster will fail"? If the DC times out, some 
> > > recovery will take place.
> > 
> > Yup. The starved node should be declared lost by corosync, the 
> > remaining nodes reform and if they're still quorate, the hung node 
> > should be fenced. Recovery occur and life goes on.
> 
> +1
> 
> And fencing might either come from outside, or just from the server 
> itself using watchdog.


This e-mail transmission (inclusive of any attachments) is strictly 
confidential and intended solely for the ordinary user of the e-mail address to 
which it was addressed. It may contain legally privileged and/or CONFIDENTIAL 
information. The unauthorized use, disclosure, distribution printing and/or 
copying of this e-mail or any information it contains is prohibited and could, 
in certain circumstances, constitute an offence. If you have received this 
e-mail in error or are not an intended recipient please inform the sender of 
the email and MillenniumIT immediately by return e-mail or telephone (+94-11) 
2416000. We advise that in keeping with good computing practice, the recipient 
of this e-mail should ensure that it is virus free. We do not accept 
responsibility for any virus that may be transferred by way of this e-mail. 
E-mail may be susceptible to data corruption, interception and unauthorized 
amendment, and we do not accept liability for any such corruption, interception 
or amendment or any consequences thereof.  www.millenniumit.com 




Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Klaus Wenninger
On 09/08/2016 10:58 AM, Shermal Fernando wrote:
> Hi Jehan-Guillaume,
>
> Does this mean the watchdog will self-terminate the machine when the crm daemon 
> is frozen?

That would be desirable, but it doesn't seem to happen, at least so far. I'll
see what I can do on that front.
 
>
> Regards,
> Shermal Fernando
>
>
>
>
>
>
>
>
> -Original Message-
> From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com] 
> Sent: Thursday, September 08, 2016 12:52 PM
> To: Digimer
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster 
> decisions are delayed infinitely
>
> On Thu, 8 Sep 2016 15:55:50 +0900
> Digimer <li...@alteeve.ca> wrote:
>
>> On 08/09/16 03:47 PM, Ulrich Windl wrote:
>>>>>> Shermal Fernando <sherma...@millenniumit.com> schrieb am 
>>>>>> 08.09.2016 um
>>>>>> 06:41 in
>>> Nachricht
>>> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
>>>> The whole cluster will fail if the DC (crm daemon) is frozen due to 
>>>> CPU starvation or hanging while trying to perform a IO operation.
>>>> Please share some thoughts on this issue.
>>> What is "the whole cluster will fail"? If the DC times out, some 
>>> recovery will take place.
>> Yup. The starved node should be declared lost by corosync, the 
>> remaining nodes reform and if they're still quorate, the hung node 
>> should be fenced. Recovery occur and life goes on.
> +1
>
> And fencing might either come from outside, or just from the server itself 
> using watchdog.
>
> --
> Jehan-Guillaume (ioguix) de Rorthais
> Dalibo
>


Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Shermal Fernando
Hi Jehan-Guillaume,

Does this mean the watchdog will self-terminate the machine when the crm daemon is 
frozen?

Regards,
Shermal Fernando








-Original Message-
From: Jehan-Guillaume de Rorthais [mailto:j...@dalibo.com] 
Sent: Thursday, September 08, 2016 12:52 PM
To: Digimer
Cc: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster 
decisions are delayed infinitely

On Thu, 8 Sep 2016 15:55:50 +0900
Digimer <li...@alteeve.ca> wrote:

> On 08/09/16 03:47 PM, Ulrich Windl wrote:
> >>>> Shermal Fernando <sherma...@millenniumit.com> schrieb am 
> >>>> 08.09.2016 um
> >>>> 06:41 in
> > Nachricht
> > <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
> >> The whole cluster will fail if the DC (crm daemon) is frozen due to 
> >> CPU starvation or hanging while trying to perform a IO operation.
> >> Please share some thoughts on this issue.
> > 
> > What is "the whole cluster will fail"? If the DC times out, some 
> > recovery will take place.
> 
> Yup. The starved node should be declared lost by corosync, the 
> remaining nodes reform and if they're still quorate, the hung node 
> should be fenced. Recovery occur and life goes on.

+1

And fencing might either come from outside, or just from the server itself 
using watchdog.

--
Jehan-Guillaume (ioguix) de Rorthais
Dalibo



Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Klaus Wenninger
On 09/08/2016 08:55 AM, Digimer wrote:
> On 08/09/16 03:47 PM, Ulrich Windl wrote:
>>>>> Shermal Fernando <sherma...@millenniumit.com> wrote on 08.09.2016 at 06:41
>> in message
>> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
>>> The whole cluster will fail if the DC (crm daemon) is frozen due to CPU 
>>> starvation or hanging while trying to perform a IO operation.  
>>> Please share some thoughts on this issue.
>> What is "the whole cluster will fail"? If the DC times out, some recovery 
>> will take place.
> Yup. The starved node should be declared lost by corosync, the remaining
> nodes reform and if they're still quorate, the hung node should be
> fenced. Recovery occur and life goes on.
That didn't happen in my test (SIGSTOP to crmd).
It might be a configuration mistake, though...
I even had sbd with a watchdog active (among
other, real, fencing devices).
I'm wondering whether it would make sense to poke the
crmd API from sbd-pacemaker-watcher ...
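
For anyone who wants to see why a SIGSTOPped daemon is easy to miss, here is
a small Python sketch (Unix only). It does not touch crmd at all: a throwaway
child process that prints "alive" every 0.1 s stands in for the daemon, and
heard_from is a helper invented for this example. The point is that the
stopped process still exists, so an "is the process there?" check passes,
while an "does it still answer?" check fails.

```python
import os
import select
import signal
import subprocess
import sys
import time

# Stand-in for a daemon: prints a line every 0.1 s to show it makes progress.
CHILD = (
    "import time\n"
    "while True:\n"
    "    print('alive', flush=True)\n"
    "    time.sleep(0.1)\n"
)


def heard_from(proc, timeout):
    """Return True if the child wrote to stdout within `timeout` seconds."""
    ready, _, _ = select.select([proc.stdout], [], [], timeout)
    if ready:
        os.read(proc.stdout.fileno(), 4096)
        return True
    return False


proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdout=subprocess.PIPE, bufsize=0)
try:
    responsive_before = heard_from(proc, 5.0)

    os.kill(proc.pid, signal.SIGSTOP)      # freeze the child
    time.sleep(0.3)
    while heard_from(proc, 0.0):           # drain output written before the stop
        pass
    responsive_while_stopped = heard_from(proc, 1.0)
    still_exists = proc.poll() is None     # the process has not exited

    os.kill(proc.pid, signal.SIGCONT)      # unfreeze the child
    responsive_after = heard_from(proc, 5.0)
finally:
    proc.kill()
    proc.wait()

print("before stop, responsive:", responsive_before)
print("while stopped, still exists:", still_exists)
print("while stopped, responsive:", responsive_while_stopped)
print("after SIGCONT, responsive:", responsive_after)
```
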
>
> Unless you don't have fencing, then may $deity of mercy. ;)
>




Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Shermal Fernando
If the DC (crm daemon) is frozen while corosync keeps running without problems,
the DC will not time out. The frozen DC will stay in place forever.

Regards,
Shermal Fernando








-Original Message-
From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de] 
Sent: Thursday, September 08, 2016 12:18 PM
To: users@clusterlabs.org
Subject: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions 
are delayed infinitely

>>> Shermal Fernando <sherma...@millenniumit.com> wrote on 08.09.2016
>>> at 06:41 in message
<8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
> The whole cluster will fail if the DC (crm daemon) is frozen due to 
> CPU starvation or hanging while trying to perform a IO operation.
> Please share some thoughts on this issue.

What is "the whole cluster will fail"? If the DC times out, some recovery will 
take place.

> 
> Regards,
> Shermal Fernando
> 
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
> Sent: Monday, September 05, 2016 6:42 PM
> To: users@clusterlabs.org; develop...@clusterlabs.org
> Subject: Re: [ClusterLabs] When the DC crmd is frozen, cluster 
> decisions are delayed infinitely
> 
> On 09/03/2016 08:42 PM, Shermal Fernando wrote:
>>
>> Hi,
>>
>>  
>>
>> Currently our system has 99.96% uptime, but our goal is to increase 
>> it beyond 99.999%. We are now studying the 
>> reliability/performance/features of pacemaker to replace the existing 
>> clustering solution.
>>
>> While testing pacemaker, I encountered a problem. If the DC (crm
>> daemon) is frozen by sending it the SIGSTOP signal, the crmds on the other 
>> machines never start an election to choose a new DC. Therefore 
>> fail-overs, resource restarts and other cluster decisions are 
>> delayed until the DC is unfrozen.
>>
>> Is this the default behavior of pacemaker, or is it due to a 
>> misconfiguration? Is there any way to avoid this single point of failure?
>>
>> For testing, we use Pacemaker 1.1.12 with Corosync 2.3.3 on the SLES
>> 12 SP1 operating system.
>>
> 
> Guess I can reproduce that with pacemaker 1.1.15 & corosync 2.3.6.
> I'm having sbd with pacemaker-watcher running as well on the nodes.
> As the node-health is not updated and the cib can be read sbd is happy 
> - as to be expected.
> Maybe we could at least add something into sbd-pacemaker-watcher to 
> detect the issue ... thinking ...
> 
> Regards,
> Klaus
> 
>>  
>>
>>  
>>
>> Regards,
>>
>> Shermal Fernando
>>

Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-08 Thread Digimer
On 08/09/16 03:47 PM, Ulrich Windl wrote:
>>>> Shermal Fernando <sherma...@millenniumit.com> wrote on 08.09.2016 at 06:41
> in message
> <8ce6e8d87f896546b9c65ed80d30a4336578c...@lg-spmb-mbx02.lseg.stockex.local>:
>> The whole cluster will fail if the DC (crm daemon) is frozen due to CPU 
>> starvation or hanging while trying to perform a IO operation.  
>> Please share some thoughts on this issue.
> 
> What is "the whole cluster will fail"? If the DC times out, some recovery 
> will take place.

Yup. The starved node should be declared lost by corosync, the remaining
nodes re-form and, if they're still quorate, the hung node should be
fenced. Recovery occurs and life goes on.

Unless you don't have fencing; then may $deity have mercy. ;)
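
The sequence above (membership loss, quorum check, fencing) can be sketched
in a few lines of Python. This is a simplification, not corosync or
Pacemaker code: is_quorate assumes a plain majority with one vote per node,
and should_fence is a hypothetical helper. It also shows why the case from
this thread slips through: with only crmd frozen, corosync still sees the
node, so the first condition is never met.

```python
def is_quorate(votes_present: int, expected_votes: int) -> bool:
    # Simple majority: strictly more than half of the expected votes.
    return 2 * votes_present > expected_votes


def should_fence(node_seen_by_membership: bool, survivors_quorate: bool) -> bool:
    # A node is only fenced once membership has declared it lost
    # and the surviving partition holds quorum.
    return (not node_seen_by_membership) and survivors_quorate


# Case 1: a whole node is starved in a 3-node cluster; membership drops it
# and the two survivors are quorate.
case_node_starved = should_fence(False, is_quorate(2, 3))

# Case 2: only crmd is frozen; the node is still a member,
# so all 3 votes are present and nothing is fenced.
case_crmd_frozen = should_fence(True, is_quorate(3, 3))

print(case_node_starved, case_crmd_frozen)
```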

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
