Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

Klaus Wenninger Wed, 19 Oct 2016 11:10:34 -0700

On 10/14/2016 11:21 AM, [email protected] wrote:
> Hi Klaus,
> Hi All,
>
> I tried a prototype of the watchdog using the WD service:
>  - https://github.com/HideoYamauchi/pacemaker/commit/3ee97b76e0212b1790226864dfcacd1a327dbcc9
>
> Please comment.
Thank you Hideo for providing the prototype.
Added the patch to my build and it seems to
be working as expected.


A few thoughts triggered by this approach:

- we have to alert the corosync people: in a chat,
  Jan Friesse pointed out to me that the wd-service
  is planned to be removed in corosync 3.x

  This is especially delicate because the binding is
  very loose: as is, it builds without any complaints
  against a corosync that has the wd-service
  disabled...

- as of now, if you enable the wd-service in the
  corosync build, it is on by default and would
  presumably be hogging the watchdog
  (there is apparently a pull request that makes
  it default to off)

- with my earlier thoughts in this thread about
  adding an API to sbd, I was also aiming at
  closer observation of pacemaker_remoted
  (remote nodes don't have corosync running)

  I guess it would be possible to run corosync
  with a static config as a single-node cluster
  bound to localhost for that purpose.

  I read the thread about corosync-remote; if that
  happens, it might make the special handling
  for pacemaker-remote obsolete anyway ...

- to enable the approach to live alongside
  sbd, it would be possible to make sbd use
  the corosync API for watchdog purposes as well,
  instead of opening the watchdog directly

  This shouldn't be a big deal for sbd used to
  observe a pacemaker node, as the cluster-watcher
  (the part of sbd that sends cpg-pings to corosync)
  already builds against corosync.
  For the blockdevice part of sbd, which is basically
  generic, it might be an issue though.
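
To make the "in-between layer" idea from this thread a bit more concrete,
here is a minimal Python sketch of a relay that keeps named software
watchdogs, each with a timeout, and allows the hardware watchdog to be fed
only while none of them has expired. All names (WatchdogRelay, register,
feed, expired, may_feed_hardware) are hypothetical illustrations; this is
not the API of sbd, of the corosync wd-service, or of the prototype patch.
A real implementation would be written in C and would talk to the watchdog
device or to corosync.

```python
import time


class WatchdogRelay:
    """Hypothetical relay holding named software watchdogs (illustration only)."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the behaviour can be checked without sleeping.
        self._clock = clock
        self._timeouts = {}  # name -> timeout in seconds
        self._last_fed = {}  # name -> clock value at the last feed

    def register(self, name, timeout):
        """Create a software watchdog; the parameters are a name and a timeout."""
        if timeout <= 0:
            raise ValueError("timeout must be positive")
        self._timeouts[name] = timeout
        self._last_fed[name] = self._clock()

    def feed(self, name):
        """Record that the client (e.g. crmd) is still alive."""
        if name not in self._timeouts:
            raise KeyError(name)
        self._last_fed[name] = self._clock()

    def expired(self):
        """Return the sorted names of watchdogs not fed within their timeout."""
        now = self._clock()
        return sorted(
            name
            for name, timeout in self._timeouts.items()
            if now - self._last_fed[name] > timeout
        )

    def may_feed_hardware(self):
        """The relay would keep feeding the real watchdog only if this is True."""
        return not self.expired()
```

In this sketch a frozen client (for example a crmd that received SIGSTOP)
simply stops calling feed(); once its timeout passes, may_feed_hardware()
turns False, the relay stops feeding the hardware watchdog, and the node
would be reset by the hardware.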

Regards,
Klaus

>
>
> Best Regards,
> Hideo Yamauchi.
>
>
> ----- Original Message -----
>> From: "[email protected]" <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Cc: 
>> Date: 2016/10/11, Tue 17:58
>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is 
>> frozen, cluster decisions are delayed infinitely
>>
>> Hi Klaus,
>>
>> Thank you for the comment.
>>
>> I am preparing a patch, a prototype that uses the WD service.
>>
>> Please wait a little.
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>>
>>
>> ----- Original Message -----
>>>  From: Klaus Wenninger <[email protected]>
>>>  To: [email protected]
>>>  Cc: 
>>>  Date: 2016/10/10, Mon 21:03
>>>  Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd
>>>  is frozen, cluster decisions are delayed infinitely
>>>  On 10/07/2016 11:10 PM, [email protected] wrote:
>>>>   Hi All,
>>>>
>>>>   Our users may not necessarily use sbd.
>>>>
>>>>   I confirmed that using the WD service of corosync is one way
>>>>   to do this without sbd.
>>>>   Pacemaker can have its own processes watched by the WD service
>>>>   via CMAP and so make use of the watchdog.
>>>
>>>  Have to have a look at that...
>>>  But if we establish some in-between-layer in pacemaker we could have this
>>>  as one of the possibilities besides e.g. sbd (with enhanced API), going for
>>>  a watchdog-device directly, ...
>>>
>>>>
>>>>   We can prepare a patch for pacemaker.
>>>  Always helpful to discuss/clarify an idea once some code is available ...
>>>
>>>>   Was the discussion of using WD service over so far?
>>>  Not from my pov. Just a day off ;-)
>>>
>>>>
>>>>   Best Regards,
>>>>   Hideo Yamauchi.
>>>>
>>>>
>>>>   ----- Original Message -----
>>>>>   From: Klaus Wenninger <[email protected]>
>>>>>   To: Ulrich Windl <[email protected]>; [email protected]
>>>>>   Cc: 
>>>>>   Date: 2016/10/7, Fri 17:47
>>>>>   Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC
>>>>>   crmd is frozen, cluster decisions are delayed infinitely
>>>>>   On 10/07/2016 08:14 AM, Ulrich Windl wrote:
>>>>>>>>>    Klaus Wenninger <[email protected]> wrote on 06.10.2016 at 18:03
>>>>>>    in message <[email protected]>:
>>>>>>>    On 10/05/2016 04:22 PM, [email protected] wrote:
>>>>>>>>    Hi All,
>>>>>>>>
>>>>>>>>>>    If a user uses sbd, can the cluster evade a problem of
>>>>>>>>>>    SIGSTOP of crmd?
>>>>>>>>>    
>>>>>>>>>    As pointed out earlier, maybe crmd should feed a watchdog.
>>>>>>>>>    Then stopping crmd will reboot the node (unless the watchdog fails).
>>>>>>>>    Thank you for the comment.
>>>>>>>>
>>>>>>>>    We are examining a watchdog for crmd, too.
>>>>>>>>    I will comment further once the examination has advanced.
>>>>>>>    Was thinking of doing a small test implementation going
>>>>>>>    a little in the direction Lars Ellenberg had been pointing out.
>>>>>>>    a couple of thoughts I had so far:
>>>>>>>
>>>>>>>    - add an API (via DBus or libqb - favoring libqb atm) to sbd
>>>>>>>      an application can use to create a watchdog within sbd
>>>>>>    Why has it to be done within sbd?
>>>>>   Not necessarily; it could be spun off into a project of its own, or
>>>>>   something already existing could be used.
>>>>>   I remember having added a dbus-interface to
>>>>>   https://sourceforge.net/projects/watchdog/ for a project once.
>>>>>   If you have a suggestion I'm open.
>>>>>   Starting from sbd would have the advantage of a smooth start:
>>>>>
>>>>>   - cluster/pacemaker-watcher are there already and can
>>>>>     be replaced/moved over time
>>>>>   - the lifecycle of the daemon (when started/stopped) is
>>>>>     already something that is in the code and in people's minds
>>>>>>>    - parameters for the first are a name and a timeout
>>>>>>>
>>>>>>>    - first use-case would be crmd observation
>>>>>>>
>>>>>>>    - later on we could think of removing pacemaker dependencies
>>>>>>>      from sbd by moving the actual implementation of
>>>>>>>      pacemaker-watcher and probably cluster-watcher as well
>>>>>>>      into pacemaker - using the new API
>>>>>>>
>>>>>>>    - this of course creates sbd dependency within pacemaker so
>>>>>>>      that it would make sense to offer a simpler and self-contained
>>>>>>>      implementation within pacemaker as an alternative
>>>>>>    I think the watchdog interface is so simple that you don't need a
>>>>>>    relay for it. The only limit I can imagine is the number of
>>>>>>    watchdogs available on some specific hardware.
>>>>>   That is the point ;-)
>>>>>>>      thus it would be favorable to have the dependency
>>>>>>>      within a non-compulsory pacemaker-rpm so that
>>>>>>>      we can offer an alternative that doesn't use sbd
>>>>>>>      at maybe the cost of being less reliable or one
>>>>>>>      that owns a hardware-watchdog by itself for systems
>>>>>>>      where this is still unused.
>>>>>>>
>>>>>>>      - e.g. via some kind of plugin (Andrew forgive me - no pils ;-) )
>>>>>>>      - or via an additional daemon
>>>>>>>
>>>>>>>    What did you have in mind?
>>>>>>>    Maybe it makes sense to synchronize...
>>>>>>>
>>>>>>>    Regards,
>>>>>>>    Klaus
>>>>>>>    
>>>>>>>>    Best Regards,
>>>>>>>>    Hideo Yamauchi.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    ----- Original Message -----
>>>>>>>>>    From: Ulrich Windl <[email protected]>
>>>>>>>>>    To: [email protected]; [email protected]
>>>>>>>>>    Cc: 
>>>>>>>>>    Date: 2016/10/5, Wed 23:08
>>>>>>>>>    Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen,
>>>>>>>>>    cluster decisions are delayed infinitely
>>>>>>>>>>>>     <[email protected]> wrote on 21.09.2016 at 11:52
>>>>>>>>>    in message <[email protected]>:
>>>>>>>>>>     Hi All,
>>>>>>>>>>
>>>>>>>>>>     Was the final conclusion given about this problem?
>>>>>>>>>>     If a user uses sbd, can the cluster evade a problem of
>>>>>>>>>>     SIGSTOP of crmd?
>>>>>>>>>    As pointed out earlier, maybe crmd should feed a watchdog.
>>>>>>>>>    Then stopping crmd will reboot the node (unless the watchdog fails).
>>>>>>>>>
>>>>>>>>>>     We are interested in this problem, too.
>>>>>>>>>>
>>>>>>>>>>     Best Regards,
>>>>>>>>>>
>>>>>>>>>>     Hideo Yamauchi.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     _______________________________________________
>>>>>>>>>>     Users mailing list: [email protected]
>>>>>>>>>>     http://clusterlabs.org/mailman/listinfo/users
>>>>>>>>>>     Project Home: http://www.clusterlabs.org
>>>>>>>>>>     Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>     Bugs: http://bugs.clusterlabs.org 



