Re: [ClusterLabs] Resource not starting correctly IV

2019-04-16 Thread JCA
Thanks. In the end, I found out that my target application has a setting
whereby it becomes instantly detectable to the monitoring side of my
script. With that setting enabled, the associated resource is created
flawlessly every time.

On Tue, Apr 16, 2019 at 1:46 PM Jan Pokorný  wrote:

> [letter-casing wise:
>  it's either "Pacemaker" or down-to-the-terminal "pacemaker"]
>
> On 16/04/19 10:21 -0600, JCA wrote:
> > 2. It would seem that what Pacemaker is doing is the following:
> >   a. Check out whether the app is running.
> >   b. If it is not, launch it.
> >   c. Check out again.
> >   d. If running, exit.
> >   e. Otherwise, stop it.
> >   f. Launch it.
> >   g. Go to a.
> >
> > [...]
> >
> > 4. If the above is correct, and if I am getting the picture correctly, it
> > would seem that the problem is that my monitoring function does not
> > detect immediately that my app is up and running. That's clearly my
> > problem.
> > However, is there any way to get Pacemaker to introduce a delay between
> > steps b and c in section 2 above?
>
> Ah, it should have occurred to me!
>
> The typical solution, I think, is to have a sleep loop following the
> daemon launch within the "start" action, running (a subset of) what
> "monitor" normally does, so as to synchronize on the "service ready"
> moment.  The default timeout for "start" within the agent's metadata
> should then reflect the common time needed to reach the point where
> "monitor" is happy, plus some reserve.
>
> Some agents may do more elaborate things, like precisely limiting such
> waiting with respect to the time they were actually given by the
> resource manager/pacemaker (if I remember correctly, that value is
> provided through environment variables, as a sort of introspection).
>
> Resource agent experts could advise here.
>
> (Truth be told, "daemon readiness" used to be a rather marginalized
> problem, putting barriers in the way of practical [= race-free]
> dependency ordering etc.; luckily, clever people realized that the
> most precise tracking can only be in the hands of the actual daemon
> implementors if an event-driven paradigm is to be applied.  For
> instance, if you can influence my_app, and it's a standard forking
> daemon, it would be best if the parent exited only when the daemon is
> truly ready to provide service -- this usually requires some
> (typically signal-based) synchronization amongst the daemon
> processes.  With systemd, the situation is much simpler since no
> forking is necessary, just a call to sd_notify(3) -- in that case,
> though, your agent would need to mimic the server side of the
> sd_notify protocol since nothing would do it for you.)
>
> > 5. Following up on 4: if my script sleeps for a few seconds immediately
> > after launching my app (it's a daemon) in myapp_start then everything
> > works fine. Indeed, the call sequence in node one now becomes:
> >
> >  monitor:
> >
> > Status: NOT_RUNNING
> > Exit: NOT_RUNNING
> >
> >   start:
> >
> > Validate: SUCCESS
> > Status: NOT_RUNNING
> > Start: SUCCESS
> > Exit: SUCCESS
> >
> >   monitor:
> >
> > Status: SUCCESS
> > Exit: SUCCESS
>
> That's easier, but less effective and less reliable (more
> opportunistic than fact-based) than polling the "monitor" outcome
> privately within "start" as sketched above.
>
> --
> Jan (Poki)
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Resource not starting correctly IV

2019-04-16 Thread Jan Pokorný
[letter-casing wise:
 it's either "Pacemaker" or down-to-the-terminal "pacemaker"]

On 16/04/19 10:21 -0600, JCA wrote:
> 2. It would seem that what Pacemaker is doing is the following:
>   a. Check out whether the app is running.
>   b. If it is not, launch it.
>   c. Check out again.
>   d. If running, exit.
>   e. Otherwise, stop it.
>   f. Launch it.
>   g. Go to a.
> 
> [...]
> 
> 4. If the above is correct, and if I am getting the picture correctly, it
> would seem that the problem is that my monitoring function does not detect
> immediately that my app is up and running. That's clearly my problem.
> However, is there any way to get Pacemaker to introduce a delay between
> steps b and c in section 2 above?

Ah, it should have occurred to me!

The typical solution, I think, is to have a sleep loop following the
daemon launch within the "start" action, running (a subset of) what
"monitor" normally does, so as to synchronize on the "service ready"
moment.  The default timeout for "start" within the agent's metadata
should then reflect the common time needed to reach the point where
"monitor" is happy, plus some reserve.
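
Roughly, as a sketch (assuming a shell agent where myapp_monitor backs
the "monitor" action; the retry budget and names are illustrative, and
OCF_SUCCESS/OCF_ERR_GENERIC come from the usual ocf-shellfuncs):

myapp_start() {
    my_app &    # placeholder for the real launch logic

    # Poll the agent's own monitor logic until the service is ready;
    # the "start" timeout advertised in the metadata should comfortably
    # exceed this loop's worst case.
    local tries=30
    while [ "$tries" -gt 0 ]; do
        myapp_monitor && return $OCF_SUCCESS
        sleep 1
        tries=$((tries - 1))
    done
    return $OCF_ERR_GENERIC
}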

Some agents may do more elaborate things, like precisely limiting such
waiting with respect to the time they were actually given by the
resource manager/pacemaker (if I remember correctly, that value is
provided through environment variables, as a sort of introspection).

Resource agent experts could advise here.

(Truth be told, "daemon readiness" used to be a rather marginalized
problem, putting barriers in the way of practical [= race-free]
dependency ordering etc.; luckily, clever people realized that the most
precise tracking can only be in the hands of the actual daemon
implementors if an event-driven paradigm is to be applied.  For
instance, if you can influence my_app, and it's a standard forking
daemon, it would be best if the parent exited only when the daemon is
truly ready to provide service -- this usually requires some (typically
signal-based) synchronization amongst the daemon processes.  With
systemd, the situation is much simpler since no forking is necessary,
just a call to sd_notify(3) -- in that case, though, your agent would
need to mimic the server side of the sd_notify protocol since nothing
would do it for you.)
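
To illustrate that last bit: a rough sketch of mimicking the server
side from a shell agent, assuming socat is available (the socket path
and daemon name are made up; the daemon is expected to send the usual
"READY=1" datagram via sd_notify(3)):

NOTIFY_SOCKET=/run/myapp/notify.sock
export NOTIFY_SOCKET
rm -f "$NOTIFY_SOCKET"

# Bind the socket before launching the daemon so the readiness
# datagram cannot be missed (a robust agent would also wait for
# the socket file to appear before proceeding).
socat -u "UNIX-RECVFROM:$NOTIFY_SOCKET" - > /tmp/myapp.notify &
listener=$!

my_app &    # inherits NOTIFY_SOCKET, calls sd_notify(3) when ready

wait "$listener"
grep -q '^READY=1' /tmp/myapp.notify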

> 5. Following up on 4: if my script sleeps for a few seconds immediately
> after launching my app (it's a daemon) in myapp_start then everything works
> fine. Indeed, the call sequence in node one now becomes:
> 
>  monitor:
> 
> Status: NOT_RUNNING
> Exit: NOT_RUNNING
> 
>   start:
> 
> Validate: SUCCESS
> Status: NOT_RUNNING
> Start: SUCCESS
> Exit: SUCCESS
> 
>   monitor:
> 
> Status: SUCCESS
> Exit: SUCCESS

That's easier, but less effective and less reliable (more opportunistic
than fact-based) than polling the "monitor" outcome privately within
"start" as sketched above.

-- 
Jan (Poki)



[ClusterLabs] Coming in 2.0.2: check whether a date-based rule is expired

2019-04-16 Thread Ken Gaillot
Hi all,

I wanted to point out an experimental feature that will be part of the
next release.

We are adding a "crm_rule" command that has the ability to check
whether a particular date-based rule is currently in effect.

The motivation is a perennial user complaint: expired constraints
remain in the configuration, which can be confusing.

We don't automatically remove such constraints, for several reasons: we
try to avoid modifying any user-specified configuration; expired
constraints are useful context when investigating an issue after it
happened; and crm_simulate can be run for any configuration for an
arbitrary past date to see what would have happened at that time.

The new command gives users (and high-level tools) a way to determine
whether a rule is in effect, so they can remove it themselves, whether
manually or in an automated way such as a cron job.

You can use it like:

crm_rule -r <rule-id> [-d <date>] [-X <xml>]

With just -r, it will tell you whether the specified rule from the
configuration is currently in effect. If you give -d, it will check as
of that date and time (ISO 8601 format). If you give it -X, it will
look for the rule in the given XML rather than the CIB (you can also
use "-X -" to read the XML from standard input).

Example output:

% crm_rule -r my-current-rule
Rule my-current-rule is still in effect

% crm_rule -r some-long-ago-rule
Rule some-long-ago-rule is expired

% crm_rule -r some-future-rule
Rule some-future-rule has not yet taken effect

% crm_rule -r some-recurring-rule
Could not determine whether rule some-recurring-rule is expired

Scripts can use the exit status to distinguish the various cases.
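
For example, a cleanup sketch along those lines; since the individual
exit codes aren't listed here, it keys off the human-readable output
shown above, and it assumes (hypothetically) that the rule's id matches
the id of the constraint to be removed:

rule_id="some-long-ago-rule"
if crm_rule -r "$rule_id" | grep -q 'is expired$'; then
    # The rule has expired; drop the enclosing constraint ourselves
    pcs constraint remove "$rule_id"
fi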

The command will be considered experimental for the 2.0.2 release; its
interface and behavior may change in future versions. The current
implementation has a limitation: the rule may contain only a single
date_expression, and the expression's operation must not be date_spec.

Other capabilities may eventually be added to crm_rule, for example the
ability to evaluate the current value of any cluster or resource
property.
-- 
Ken Gaillot 



Re: [ClusterLabs] SBD as watchdog daemon

2019-04-16 Thread Олег Самойлов
Well, I checked this PR
https://github.com/ClusterLabs/sbd/pull/27
from the author's repository
https://github.com/jjd27/sbd/tree/cluster-quorum

The problem still exists. When corosync is frozen on one node, both
nodes are rebooted. Don't apply this PR.

> On 16 Apr 2019, at 19:13, Klaus Wenninger wrote:


[ClusterLabs] Resource not starting correctly IV

2019-04-16 Thread JCA
Thanks to everybody who has contributed to this. Let me summarize things,
if only for my own benefit - I learn more quickly when I try to explain
to others what I am trying to learn.

I instrumented my script in order to find out exactly how many times it
is invoked when creating my resource, and exactly which functions in the
script are invoked. Just as a reminder, the logs I am about to describe
are created directly as a result of executing the following command:

# pcs resource create ClusterMyApp ocf:myapp:myapp-script \
      op monitor interval=30s

myapp-script is always the same, and the starting conditions for the app
that it is meant to launch are always exactly the same. In all cases,
before issuing the command above, I made sure to delete the resource if
it was already there.
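
For reference, the instrumentation amounts to something like the
following sketch; the log path and helper name are arbitrary choices of
mine:

LOG=/tmp/myapp-script.log

# Called as, e.g.: log_step Status NOT_RUNNING
log_step() { printf '    %s: %s\n' "$1" "$2" >> "$LOG"; }

# At the top of the script, record which action was requested:
printf '\n%s:\n\n' "$1" >> "$LOG"

This is what produces the blocks described below.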

What follows is a log of the way in which myapp-script was invoked as a
result of executing the command above. It consists of a series of blocks,
like the following:

  monitor:

Status: NOT_RUNNING
Exit: NOT_RUNNING

This block is an invocation of myapp-script with argument 'monitor'. The
'Status' line means myapp_monitor was invoked, and it returned
OCF_NOT_RUNNING.  The 'Exit' line means that myapp-script exited with
OCF_NOT_RUNNING.  In a block with more than two lines, the line immediately
preceding the 'Exit' line represents the function in the script that was
invoked as a consequence of the argument passed down to the script. The
other lines are nested function invocations made as a consequence of that.

A typical log obtained in node one would be the following:

monitor:

Status: NOT_RUNNING
Exit: NOT_RUNNING

start:

Validate: SUCCESS
Status: NOT_RUNNING
Start: SUCCESS
Exit: SUCCESS

monitor:

Status: NOT_RUNNING
Exit: NOT_RUNNING

stop:

Validate: SUCCESS
Status: SUCCESS
Stop: SUCCESS
Exit: SUCCESS

start:

Validate: SUCCESS
Status: NOT_RUNNING
Start: SUCCESS
Exit: SUCCESS

monitor:

Status: SUCCESS
Exit: SUCCESS

A few observations:

1. The monitor/start/stop sequence above can be repeated many times, and
the number of times it is repeated varies from one run to the next.
Occasionally, just three calls are made: monitor, start and monitor,
exiting with SUCCESS.

2. It would seem that what Pacemaker is doing is the following:
   a. Check out whether the app is running.
   b. If it is not, launch it.
   c. Check out again.
   d. If running, exit.
   e. Otherwise, stop it.
   f. Launch it.
   g. Go to a.

3. In node two, the log obtained as a consequence of creating the resource
always seems to be

monitor:

Status: NOT_RUNNING
Exit: NOT_RUNNING

which makes sense to me.

4. If the above is correct, and if I am getting the picture correctly, it
would seem that the problem is that my monitoring function does not detect
immediately that my app is up and running. That's clearly my problem.
However, is there any way to get Pacemaker to introduce a delay between
steps b and c in section 2 above?

5. Following up on 4: if my script sleeps for a few seconds immediately
after launching my app (it's a daemon) in myapp_start then everything works
fine. Indeed, the call sequence in node one now becomes:

 monitor:

Status: NOT_RUNNING
Exit: NOT_RUNNING

  start:

Validate: SUCCESS
Status: NOT_RUNNING
Start: SUCCESS
Exit: SUCCESS

  monitor:

Status: SUCCESS
Exit: SUCCESS

Re: [ClusterLabs] SBD as watchdog daemon

2019-04-16 Thread Klaus Wenninger
On 4/16/19 5:27 PM, Олег Самойлов wrote:
>
>> On 16 Apr 2019, at 16:21, Klaus Wenninger wrote:
>>
>> On 4/16/19 3:12 PM, Олег Самойлов wrote:
>>> Okay, it looks like I found where it must be fixed.
>>>
>>> sbd-cluster.c
>>>
>>>    /* TODO - Make a CPG call and only call notify_parent()
>>>       when we get a reply */
>>>    notify_parent();
>>>
>>> Can anyone explain to me how to make the mentioned CPG call?
>> There should be a PR already that does exactly that.
> Not only.
>
>> It just has to be rebased.
> Not true. This PR is in conflict with the master branch.

Which is what I wanted to express with 'has to be rebased' ;-)

>
>> But be aware that this isn't gonna solve your halted-pacemaker-daemons
>> issue.
> Also not true. I tried to merge this PR and solved several conflicts
> intuitively. Now the watchdog fires when corosync is frozen (half of my
> problems is solved).
Exactly - which is why I was directing your attention to the
pacemaker-daemons.
>  But… It fires on both nodes. :) Maybe this is due to my lack of
> knowledge of the corosync infrastructure.
>
> This PR is from 2017; why hasn't such a very important PR been fixed
> and applied yet?
Because there were other things to do that were even more important ;-)
And as you've just discovered yourself, things are not always that easy ...
Even if the issue with the non-blocked node restarting is solved, there
are still delicate issues to be considered with startup/shutdown,
installation/deinstallation, gradually configuring a cluster up from a
single node over two nodes to several nodes, ...

Klaus
>


Re: [ClusterLabs] SBD as watchdog daemon

2019-04-16 Thread Олег Самойлов


> On 16 Apr 2019, at 16:21, Klaus Wenninger wrote:
> 
> On 4/16/19 3:12 PM, Олег Самойлов wrote:
>> Okay, it looks like I found where it must be fixed.
>> 
>> sbd-cluster.c
>> 
>>    /* TODO - Make a CPG call and only call notify_parent()
>>       when we get a reply */
>>    notify_parent();
>> 
>> Can anyone explain to me how to make the mentioned CPG call?
> There should be a PR already that does exactly that.

Not only.

> It just has to be rebased.

Not true. This PR is in conflict with the master branch.

> But be aware that this isn't gonna solve your halted-pacemaker-daemons
> issue.

Also not true. I tried to merge this PR and solved several conflicts
intuitively. Now the watchdog fires when corosync is frozen (half of my
problems is solved). But… It fires on both nodes. :) Maybe this is due
to my lack of knowledge of the corosync infrastructure.

This PR is from 2017; why hasn't such a very important PR been fixed and
applied yet?


Re: [ClusterLabs] SBD as watchdog daemon

2019-04-16 Thread Klaus Wenninger
On 4/16/19 3:12 PM, Олег Самойлов wrote:
> Okay, it looks like I found where it must be fixed.
>
> sbd-cluster.c
>
>     /* TODO - Make a CPG call and only call notify_parent()
>        when we get a reply */
>     notify_parent();
>
> Can anyone explain to me how to make the mentioned CPG call?
There should be a PR already that does exactly that.
It just has to be rebased.
But be aware that this isn't gonna solve your halted-pacemaker-daemons
issue.

Klaus


-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Base Operating Systems

Red Hat

kwenn...@redhat.com   

Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn, 
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Michael O'Neill, Tom Savage, Eric Shander


Re: [ClusterLabs] Resource not starting correctly III

2019-04-16 Thread Jan Pokorný
On 15/04/19 16:01 -0600, JCA wrote:
> This is weird. Further experiments, consisting of creating and deleting the
> resource, reveal that, on creating the resource, myapp-script may be
> invoked multiple times - sometimes four, sometimes twenty or so, sometimes
> returning OCF_SUCCESS, other times returning OCF_NOT_RUNNING. And
> whether or not it succeeds, as per pcs status, seems to be completely
> random.

Please don't forget that the agent also gets invoked to extract its
metadata (the action is "meta-data" in that case).  You would have
figured this out if you had followed Ulrich's advice.

Apologies if this possibility was expressly excluded in your experiments.
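
For completeness, the usual dispatch at the bottom of an agent looks
something like the following sketch (function names illustrative), with
meta-data handled before anything that needs a configured resource:

case "$1" in
meta-data)
    # Must work even when the resource isn't configured at all
    myapp_metadata
    exit $OCF_SUCCESS
    ;;
start)        myapp_start ;;
stop)         myapp_stop ;;
monitor)      myapp_monitor ;;
validate-all) myapp_validate ;;
*)            exit $OCF_ERR_UNIMPLEMENTED ;;
esac
exit $?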

-- 
Jan (Poki)



[ClusterLabs] Antw: Re: Resource not starting correctly

2019-04-16 Thread Ulrich Windl
>>> Ken Gaillot wrote on 16.04.2019 at 00:30 in message
<144df656215fc1ed6b3a35cffd1cbd2436f2a785.ca...@redhat.com>:
[...]
> The cluster successfully probed the service on both nodes, and started
> it on node one. It then tried to start a 30‑second recurring monitor
> for the service, but the monitor immediately failed (the expected
> result was running, but the monitor said it was not running).

Using ocf-tester on your own agents is highly recommended IMHO. Here is
an old script of mine that I used to check my own multipath agent:
#RA=/usr/lib/ocf/resource.d/xola/multipath
RA=./multipath
if [ "$1" = "manual" ]; then
    shift
    OCF_ROOT=/usr/lib/ocf OCF_RESOURCE_INSTANCE=multipath \
    OCF_RESKEY_mapname="3600508b4001085e3f34d" \
    OCF_RESKEY_check_depth="10" \
    OCF_RESKEY_iosched="noop" \
    OCF_RESKEY_map_delay="5" \
    OCF_RESKEY_udev_timeout="33" \
    $RA "$@"
    echo "Exit status is $?"
else
    /usr/sbin/ocf-tester -n multipath \
        -o mapname="3600508b4001085e3f34d" \
        -o check_depth="10" \
        -o iosched="noop" \
        -o map_delay="5" \
        -o udev_timeout="33" \
        $RA
fi
---
So you can do a "./script manual start|stop|monitor", or let ocf-tester check
the whole orchestra ;-)

[...]

Regards,
Ulrich



[ClusterLabs] Antw: Resource not starting correctly II

2019-04-16 Thread Ulrich Windl
>>> JCA <1.41...@gmail.com> wrote on 15.04.2019 at 23:30 in message:
> Well, I remain puzzled. I added a statement to the end of my script in
> order to capture its return value. Much to my surprise, when I create the
> associated resource (as described in my previous post) myapp-script gets
> invoked four times in node one (where the resource is created) and twice
> in node two. Just as intriguing, the return value in node one is, for
> each instance,
> 
> OCF_NOT_RUNNING
> OCF_SUCCESS
> OCF_SUCCESS
> OCF_NOT_RUNNING

It may be helpful to log "$1" (the action) together with the exit status...
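
For example (assuming a bash agent; the log path is an arbitrary choice):

action="$1"
# Record the requested action together with the exit status, whichever
# way the script exits:
trap 'echo "action=$action rc=$?" >> /tmp/myapp-script.log' EXIT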

> 
> while in node two I get
> 
> OCF_NOT_RUNNING
> OCF_SUCCESS
> 
> And now things work, as can be seen in the output from pcs status:
> 
> Cluster name: MyCluster
> Stack: corosync
> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with
> quorum
> Last updated: Mon Apr 15 15:21:08 2019
> Last change: Mon Apr 15 15:20:47 2019 by root via cibadmin on one
> 
> 2 nodes configured
> 1 resource configured
> 
> Online: [ one two ]
> 
> Full list of resources:
> 
>  ClusterMyApp (ocf::myapp:myapp-script): Started one
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> 
> 
> Can anybody clarify what is going on? And how can I guarantee that,
> assuming that my app starts correctly, the resource creation will
> predictably succeed?



