Re: [ClusterLabs] Fuzzy/misleading references to "restart" of a resource

2019-12-05 Thread Jan Pokorný
On 05/12/19 10:41 +0300, Andrei Borzenkov wrote:
> On Thu, Dec 5, 2019 at 1:04 AM Jan Pokorný  wrote:
>> 
>> On 04/12/19 21:19 +0100, Jan Pokorný wrote:
>>> OTOH, this enforced split of state transitions is perhaps what makes
>>> the transaction (comprising perhaps countless other interdependent
>>> resources) serializable and thus feasible at all (think: you cannot
>>> nest any further handling -- so as to satisfy given constraints -- in
>>> between stop and start when that's an atom, otherwise), and that's
>>> exactly how, say, systemd approaches that, likely for that very reason:
>>> https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5
>> 
>> Yet, systemd started to allow for certain stop-start ("restart")
>> optimizations at "stop" phase, I've just learnt:
>> https://github.com/systemd/systemd/pull/13696#discussion_r330186864
>> But it doesn't merge/atomicize the two discrete steps, still.
>> 
> 
> systemd development consists of a series of ad hoc single use case
> extensions, each done in complete isolation, without considering the
> impact on other parts, which is usually "fixed" by adding yet another
> ad hoc extension. I do not think that is the best example to follow.

I didn't mean to run into this debate; I noticed this was perhaps
mainly to satisfy their in-project services, but nonetheless, the
pragmatic value for a wide audience here is that discriminating
"why is this being stopped?" is now possible, which lends itself
to a "restart optimization enabler" label, should that be handy.

As for the style of evolutionary additions that are perhaps too
tunnel-visioned, you'll find examples everywhere, incl. ClusterLabs/cluster
projects :-)
A common problem appears to be a lack of sufficiently formalized and
documented intermediate representations (treated not as proprietary
knowledge but rather as a fully baked programming interface), next to
some further constraints on transitioning from one set of states to
another.  Such representations are easy to externalize for immediate
feedback ("state dump") and to check for input-to-output transformation
correctness (ad-hoc or unit testing), which assists thinking both in
low-level isolated realms and in the higher-level architectural
perspective (how the primitive "components" fit together).  Another way
of thinking about this is a directly observable "full state buffer",
which would naturally tend to prevent code-degrading, on-the-fly and
ad-hoc merging of what are individual phases.  Admittedly without deeper
knowledge, I consider this something that, for instance, the LLVM
project got intriguingly and intrinsically right, and that's perhaps
where to take a better example from.
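To make the "explicit intermediate representation" idea a bit more tangible, here is a minimal Python sketch under made-up assumptions. It shows a hypothetical two-phase planner. Phase one diffs current vs. desired resource states into explicit transitions. Phase two maps each transition onto exactly one discrete action. The intermediate list can be dumped as JSON and unit-tested on its own. None of the names (`Transition`, `diff_phase`, `action_phase`, `dump`) come from Pacemaker, systemd or LLVM.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Transition:
    """One entry of the hypothetical intermediate representation (IR)."""
    resource: str
    before: str  # "Started" or "Stopped"
    after: str   # "Started" or "Stopped"


def diff_phase(current, desired):
    """Phase 1: emit one Transition per resource whose state should change."""
    return [Transition(r, current[r], desired[r])
            for r in sorted(desired) if current[r] != desired[r]]


def action_phase(transitions):
    """Phase 2: map each Transition onto exactly one discrete action."""
    verbs = {("Started", "Stopped"): "stop", ("Stopped", "Started"): "start"}
    return [(verbs[(t.before, t.after)], t.resource) for t in transitions]


def dump(ir):
    """'State dump' of the IR between phases, for immediate feedback."""
    return json.dumps([asdict(t) for t in ir], indent=2)


ir = diff_phase({"ip": "Started", "web": "Stopped"},
                {"ip": "Stopped", "web": "Started"})
print(dump(ir))
print(action_phase(ir))  # [('stop', 'ip'), ('start', 'web')]
```

Each phase consumes only the previous phase's explicit output. As a result, there is no place where a "stop" and a later "start" could quietly get merged into one opaque step.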

>> OCF could possibly be amended to allow for a similar semantic
>> indication of "stop to be reversed shortly on this very node if
>> things go well" if there were a tangible use case, say using a
>> "stop-with-start-pending" action instead of "stop"
>> (and the amendment possibly building on an idea of addon profiles
>> https://github.com/ClusterLabs/OCF-spec/issues/17 if there were
>> an actual infrastructure for that and not just daydreaming).
>> 
> 
> I do not see how it is possible to shorthand a resource restart. A cluster
> resource manager does not manage isolated resources, but groups of
> interdependent resources. In general it is impossible to restart a
> single resource without a coordinated restart of multiple resources. And
> this should happen in a defined order (you cannot "restart" a mount point
> without stopping any user of it first).
> 
> Moreover, a restart is expected to clean up resources and actually
> result in a pristine state. This is an implicit assumption.

I tend to agree, but I am far from being a creative author of
resource agents or a person focused on service life cycles.  That was
more to cater for hypothetical optimizations that were once considered;
see the scenario I linked up-thread:
https://github.com/ClusterLabs/OCF-spec/blob/start/resource_agent/API/02#L225
(I dare not evaluate whether it would bring any value or not).

-- 
Jan (Poki)


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fuzzy/misleading references to "restart" of a resource

2019-12-04 Thread Andrei Borzenkov
On Thu, Dec 5, 2019 at 1:04 AM Jan Pokorný  wrote:
>
> On 04/12/19 21:19 +0100, Jan Pokorný wrote:
> > OTOH, this enforced split of state transitions is perhaps what makes
> > the transaction (comprising perhaps countless other interdependent
> > resources) serializable and thus feasible at all (think: you cannot
> > nest any further handling -- so as to satisfy given constraints -- in
> > between stop and start when that's an atom, otherwise), and that's
> > exactly how, say, systemd approaches that, likely for that very reason:
> > https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5
>
> Yet, systemd started to allow for certain stop-start ("restart")
> optimizations at "stop" phase, I've just learnt:
> https://github.com/systemd/systemd/pull/13696#discussion_r330186864
> But it doesn't merge/atomicize the two discrete steps, still.
>

systemd development consists of a series of ad hoc single use case
extensions, each done in complete isolation, without considering the
impact on other parts, which is usually "fixed" by adding yet another
ad hoc extension. I do not think that is the best example to follow.

> OCF could possibly be amended to allow for a similar semantic
> indication of "stop to be reversed shortly on this very node if
> things go well" if there were a tangible use case, say using a
> "stop-with-start-pending" action instead of "stop"
> (and the amendment possibly building on an idea of addon profiles
> https://github.com/ClusterLabs/OCF-spec/issues/17 if there were
> an actual infrastructure for that and not just daydreaming).
>

I do not see how it is possible to shorthand a resource restart. A cluster
resource manager does not manage isolated resources, but groups of
interdependent resources. In general it is impossible to restart a
single resource without a coordinated restart of multiple resources. And
this should happen in a defined order (you cannot "restart" a mount point
without stopping any user of it first).

Moreover, a restart is expected to clean up resources and actually
result in a pristine state. This is an implicit assumption.
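The mount point argument can be illustrated with a minimal Python sketch (Python 3.9+ for `graphlib`). It assumes a made-up dependency map where `app` requires `db` and `db` requires `fs` (the mount point). `restart_plan` is a hypothetical helper, not how Pacemaker's scheduler actually computes transitions. Restarting `fs` forces its users to be stopped first and started afterwards. As a result, each dependent's stop and start get other steps nested in between, which a single atomic "restart" action could not express.

```python
from graphlib import TopologicalSorter


def restart_plan(requires, target):
    """Expand a 'restart' of target into discrete, ordered stop/start steps.

    requires maps each resource to the set of resources it depends on.
    """
    # Collect target plus everything that (transitively) depends on it.
    affected = {target}
    changed = True
    while changed:
        changed = False
        for rsc, deps in requires.items():
            if rsc not in affected and deps & affected:
                affected.add(rsc)
                changed = True
    # Start order: dependencies before their users, limited to affected ones.
    graph = {r: requires.get(r, set()) & affected for r in affected}
    start_order = list(TopologicalSorter(graph).static_order())
    # Stop order is the reverse: users go down before what they use.
    return ([("stop", r) for r in reversed(start_order)]
            + [("start", r) for r in start_order])


requires = {"fs": set(), "db": {"fs"}, "app": {"db"}}
for step in restart_plan(requires, "fs"):
    print(*step)
```

This prints `stop app`, `stop db`, `stop fs`, `start fs`, `start db` and `start app`, one per line. The stop and start of `db` are separated by the steps for `fs`.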

Re: [ClusterLabs] Fuzzy/misleading references to "restart" of a resource

2019-12-04 Thread Jan Pokorný
On 04/12/19 21:19 +0100, Jan Pokorný wrote:
> OTOH, this enforced split of state transitions is perhaps what makes
> the transaction (comprising perhaps countless other interdependent
> resources) serializable and thus feasible at all (think: you cannot
> nest any further handling -- so as to satisfy given constraints -- in
> between stop and start when that's an atom, otherwise), and that's
> exactly how, say, systemd approaches that, likely for that very reason:
> https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5

Yet, systemd started to allow for certain stop-start ("restart")
optimizations at "stop" phase, I've just learnt:
https://github.com/systemd/systemd/pull/13696#discussion_r330186864
But it doesn't merge/atomicize the two discrete steps, still.

OCF could possibly be amended to allow for a similar semantic
indication of "stop to be reversed shortly on this very node if
things go well" if there were a tangible use case, say using a
"stop-with-start-pending" action instead of "stop"
(and the amendment possibly building on an idea of addon profiles
https://github.com/ClusterLabs/OCF-spec/issues/17 if there were
an actual infrastructure for that and not just daydreaming).

-- 
Jan (Poki)



[ClusterLabs] Fuzzy/misleading references to "restart" of a resource (Was: When does pacemaker call 'restart'/'force-reload' operations on LSB resource?)

2019-12-04 Thread Jan Pokorný
On 04/12/19 14:53 +0900, Ondrej wrote:
> When adding an 'LSB' script to a pacemaker cluster I can see that
> pacemaker advertises 'restart' and 'force-reload' operations to be
> present - regardless of whether the LSB script supports them or not.
> This seems to be coming from the following piece of code.
> 
> https://github.com/ClusterLabs/pacemaker/blob/92b0c1d69ab1feb0b89e141b5007f8792e69655e/lib/services/services_lsb.c#L39-L40
> 
> Questions:
> 1.  When are the 'restart' and 'force-reload' operations called on
> the LSB script cluster resource?

[reordered]

> I would have expected that 'restart' operation would be called when
> using 'crm_resource --restart --resource myResource', but I can see
> that 'stop' and 'start' operations are used in that case instead.

This is due to how "crm_resource --restart" is arranged,
directly in the implementation of this CLI tool itself
(see tools/crm_resource_runtime.c:cli_resource_restart):

- first, the target-role meta-attribute of the resource is set to Stopped

- then, once the activity has settled, it is set back to the target-role
  it originally had
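The two-step arrangement can be modeled with a toy Python sketch. `ToyCluster` and `restart` are hypothetical stand-ins, not Pacemaker's API or the actual C code of cli_resource_restart. The sketch assumes that the cluster converges each resource towards its target-role using only start and stop, and it treats an unset target-role as "Started". Under those assumptions, the agent's own "restart" action never shows up in the log of executed actions.

```python
class ToyCluster:
    """Hypothetical model: target-role meta-attributes plus an action log."""

    def __init__(self):
        self.target_role = {}  # resource -> "Started" / "Stopped"
        self.running = {}      # resource -> bool
        self.executed = []     # agent actions actually invoked, in order

    def set_target_role(self, rsc, role):
        self.target_role[rsc] = role
        self.settle()          # stands in for "wait until activity settles"

    def settle(self):
        """Converge every resource towards its target-role via start/stop."""
        for rsc, role in self.target_role.items():
            up = self.running.get(rsc, False)
            if role != "Stopped" and not up:
                self.executed.append(("start", rsc))
                self.running[rsc] = True
            elif role == "Stopped" and up:
                self.executed.append(("stop", rsc))
                self.running[rsc] = False


def restart(cluster, rsc):
    """Stop via target-role=Stopped, then restore the original target-role."""
    original = cluster.target_role.get(rsc, "Started")
    cluster.set_target_role(rsc, "Stopped")
    cluster.set_target_role(rsc, original)


cluster = ToyCluster()
cluster.set_target_role("myResource", "Started")
cluster.executed.clear()
restart(cluster, "myResource")
print(cluster.executed)  # [('stop', 'myResource'), ('start', 'myResource')]
```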

When performed stepwise like this, there's no reasonably implementable
way of mapping the actual composition (stop, start) back to a single
step (restart) when the plan is not shared in full and in advance (it
is not) with the respective moving parts.  And plain common sense would
preclude it anyway (see below).

Hence, it is actually a great discovery that the "restart" triggering
verb/action is completely neglected and bogus as far as its handling by
pacemaker goes.  Whatever optimizations an agent implements there
(thanks to having intimate knowledge of the resource at hand, plus
knowing the before-after state combination and possibly how to
transition in one go), cluster resource management won't benefit from
them in any way.

Interestingly, such optimizations are exactly what the original
OCF draft had in mind :-)
https://github.com/ClusterLabs/OCF-spec/blob/start/resource_agent/API/02#L225
(even more interestingly, only to be reconsidered again some decades
later: https://github.com/ClusterLabs/OCF-spec/issues/10;
yeah, aren't we masters of following moving targets, to the extent they
are sometimes contradictory?  I'd blame a desperate lack of written
[and easily obtainable] records of design decisions made in the past for that)

They are mandated by LSB as well, but hey, in the systemd era, we are
now _free_ to call LSB severely broken as it (shamefully, I'd say)
never even tried to accommodate proper handling of dependency
chains (and actual serializability thereof!), as explained
below.  Or, put in other words, LSB was never meant
to stand for holistic resource management, something both systemd
and pacemaker attempt to cover (single- and multi-machine-wide,
respectively).

OTOH, this enforced split of state transitions is perhaps what makes
the transaction (comprising perhaps countless other interdependent
resources) serializable and thus feasible at all (think: you cannot
nest any further handling -- so as to satisfy given constraints -- in
between stop and start when that's an atom, otherwise), and that's
exactly how, say, systemd approaches that, likely for that very reason:
https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5

So I see room for improvement here as our takeaway:

* resource agents:

  - some agents declare/implement a "restart" action when there is
no practical reason to (AudibleAlarm, Xinetd, dhcpd, etc.)
[as a side note, there are nonsensical considerations: for instance,
when the default "start" and "stop" timeouts for dhcpd are 20 seconds
each, how come "restart", defined as "stop; start", would make do
with 20 seconds altogether, unless there is some amortized work
I fail to see :-)]

* pacemaker:

  - artificially generated meta-data mention the "restart" action when
there is no good reason to (lib/services/services_lsb.c)

  - there are some correct clues in Pacemaker Explained, but perhaps
it would be worth taking the time to emphasize that whenever "restart"
is referred to, it is never an atomic step, but always a sequence
of two steps that may be considered atomic on their own,
but possibly interleaved with other steps so as to retain
soundness wrt. the imposed constraints and/or changes made
in parallel

  - the same gist of "restart" should be sketched in the help screen
of crm_resource
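The dhcpd timeout remark boils down to simple arithmetic. If "restart" is literally "stop; start", its timeout should cover at least 20 s + 20 s = 40 s. What follows is a minimal Python sketch of such a sanity check over OCF-style meta-data. The XML fragment is a hypothetical reconstruction that only mirrors the 20 s values mentioned above; it is not copied from the dhcpd agent. `restart_timeout_problems` is a made-up helper, and only plain or "s"-suffixed timeouts are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the dhcpd defaults discussed above.
META = """<resource-agent name="dhcpd">
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="restart" timeout="20s"/>
  </actions>
</resource-agent>"""


def seconds(value):
    """Parse '20s' or '20'; other units (e.g. 'm') are deliberately unsupported."""
    return int(value[:-1]) if value.endswith("s") else int(value)


def restart_timeout_problems(xml_text):
    """Flag a 'restart' whose timeout cannot cover 'stop' followed by 'start'."""
    actions = {a.get("name"): a.get("timeout")
               for a in ET.fromstring(xml_text).iter("action")}
    if "restart" not in actions:
        return []
    needed = seconds(actions["stop"]) + seconds(actions["start"])
    have = seconds(actions["restart"])
    if have < needed:
        return [f"restart timeout {have}s < stop+start {needed}s"]
    return []


print(restart_timeout_problems(META))
# ['restart timeout 20s < stop+start 40s']
```

Dropping the "restart" action from such meta-data altogether makes the check vacuous, which is in line with the first bullet above.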

> For 'force-reload' I have no idea on how to try trigger it looking
> at 'crm_resource --help' output.

Sorry, that's even more bogus, as there's no relevance whatsoever.
It needs to either be dropped from the artificially generated meta-data
as well, or be investigated further as to whether there's any reason to
make such an operation triggerable by users and, if so, how far its
impact would be expected to spread once implemented (do the
dependent services need to be reloaded or "restarted" as well,
since the change might be non-local? any precedent there?
again, hard to