Re: [ClusterLabs] Fuzzy/misleading references to "restart" of a resource
On 05/12/19 10:41 +0300, Andrei Borzenkov wrote:
> On Thu, Dec 5, 2019 at 1:04 AM Jan Pokorný wrote:
>>
>> On 04/12/19 21:19 +0100, Jan Pokorný wrote:
>>> OTOH, this enforced split of state transitions is perhaps what makes
>>> the transaction (comprising perhaps countless other interdependent
>>> resources) serializable and thus feasible at all (think: you cannot
>>> nest any further handling -- so as to satisfy given constraints -- in
>>> between stop and start when that's an atom, otherwise), and that's
>>> exactly how, say, systemd approaches that, likely for that very reason:
>>> https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5
>>
>> Yet, systemd started to allow for certain stop-start ("restart")
>> optimizations at "stop" phase, I've just learnt:
>> https://github.com/systemd/systemd/pull/13696#discussion_r330186864
>> But it doesn't merge/atomicize the two discrete steps, still.
>>
>
> systemd development consists of series of ad hoc single use case
> extensions, each done completely isolated, without considering impact
> on other parts which is usually "fixed" by adding yet another ad hoc
> extension. I do not think that is the best example to follow.

I didn't mean to start this debate.  I noticed the change was perhaps
mainly meant to satisfy their in-project services; nonetheless, the
pragmatic value for a wide audience is that a service can now
discriminate on "why am I being stopped?", which lends itself to a
"restart optimization enabler" label, should that be handy.

Re the style of evolutionary additions that are perhaps too
tunnel-visioned: you'll find examples everywhere, incl.
ClusterLabs/cluster projects :-)

A common problem appears to be the lack of intermediate
representations that are formalized and documented well enough (as if
they were a fully baked programming interface rather than proprietary
knowledge), along with some further constraints on transitioning from
one set of states to another.  Such representations should be easy to
externalize for immediate feedback ("state dump") and for assessing
input-to-output transformation correctness (ad hoc or unit testing),
to assist thinking both in low-level isolated realms and in the
higher-level architectural perspective (how the primitive
"components" fit together).

Another way of thinking about this is a directly observable "full
state buffer", which would naturally tend to prevent the
code-degrading, on-the-fly, ad hoc merging of what are individual
phases.  Admittedly without deeper knowledge, I consider this
something that, for instance, the LLVM project got intriguingly and
intrinsically right, and that's perhaps where to take a better
example from.

>> OCF could possibly be amended to allow for a similar semantic
>> indication of "stop to be reversed shortly on this very node if
>> things go well" if there was a tangible use case, say using
>> "stop-with-start-pending" action instead of "stop"
>> (and the amendment possibly building on an idea of addon profiles
>> https://github.com/ClusterLabs/OCF-spec/issues/17 if there was
>> an actual infrastructure for that and not just a daydreaming).
>>
>
> I do not see how it is possible to shorthand resource restart. Cluster
> resource manager manages not isolated resources, but groups of
> interdependent resources. In general it is impossible to restart
> single resource without coordinate restart of multiple resources. And
> this should happen in defined order (you cannot "restart" mount point
> without stopping any user of it first).
>
> Moreover, restart is expected to clean up resources and actually
> result in pristine state. This is implicit assumption.
I tend to agree, but I am far from being a creative author of
resource agents or a person focused on service life cycles.  The
suggestion was meant more to cater to hypothetical optimizations that
were once considered; see the scenario I linked up-thread:
https://github.com/ClusterLabs/OCF-spec/blob/start/resource_agent/API/02#L225
(I dare not evaluate how much value it would bring, if any).

-- 
Jan (Poki)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Fuzzy/misleading references to "restart" of a resource
On Thu, Dec 5, 2019 at 1:04 AM Jan Pokorný wrote:
>
> On 04/12/19 21:19 +0100, Jan Pokorný wrote:
> > OTOH, this enforced split of state transitions is perhaps what makes
> > the transaction (comprising perhaps countless other interdependent
> > resources) serializable and thus feasible at all (think: you cannot
> > nest any further handling -- so as to satisfy given constraints -- in
> > between stop and start when that's an atom, otherwise), and that's
> > exactly how, say, systemd approaches that, likely for that very reason:
> > https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5
>
> Yet, systemd started to allow for certain stop-start ("restart")
> optimizations at "stop" phase, I've just learnt:
> https://github.com/systemd/systemd/pull/13696#discussion_r330186864
> But it doesn't merge/atomicize the two discrete steps, still.
>

systemd development consists of a series of ad hoc, single-use-case
extensions, each done in complete isolation, without considering the
impact on other parts, which is usually "fixed" by adding yet another
ad hoc extension. I do not think that is the best example to follow.

> OCF could possibly be amended to allow for a similar semantic
> indication of "stop to be reversed shortly on this very node if
> things go well" if there was a tangible use case, say using
> "stop-with-start-pending" action instead of "stop"
> (and the amendment possibly building on an idea of addon profiles
> https://github.com/ClusterLabs/OCF-spec/issues/17 if there was
> an actual infrastructure for that and not just a daydreaming).
>

I do not see how it is possible to shorthand a resource restart. A
cluster resource manager manages not isolated resources, but groups of
interdependent resources. In general it is impossible to restart a
single resource without a coordinated restart of multiple resources.
And this should happen in a defined order (you cannot "restart" a
mount point without first stopping every user of it).
Moreover, a restart is expected to clean up resources and actually
result in a pristine state. This is an implicit assumption.
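To make the ordering argument above concrete, here is a minimal sketch
of how restarting one resource expands into an ordered sequence of
stops and starts across everything that depends on it. It is a toy
model only, not how Pacemaker computes transitions; the function
`restart_plan` and the resource names `mount`, `database` and `webapp`
are made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def restart_plan(depends_on, target):
    """Return ordered (action, resource) steps needed to restart `target`.

    `depends_on` maps each resource to the resources it requires.
    This is a hypothetical toy model, not a Pacemaker API.
    """
    # Collect the target plus everything that transitively depends on it.
    affected = {target}
    changed = True
    while changed:
        changed = False
        for rsc, deps in depends_on.items():
            if rsc not in affected and affected & set(deps):
                affected.add(rsc)
                changed = True

    # Order the affected resources so that each follows its dependencies.
    graph = {
        rsc: [dep for dep in depends_on.get(rsc, ()) if dep in affected]
        for rsc in affected
    }
    start_order = list(TopologicalSorter(graph).static_order())

    # Users are stopped first (reverse order), then all are started again.
    stops = [("stop", rsc) for rsc in reversed(start_order)]
    starts = [("start", rsc) for rsc in start_order]
    return stops + starts


# Hypothetical chain: webapp needs database, database needs the mount.
deps = {"mount": [], "database": ["mount"], "webapp": ["database"]}
plan = restart_plan(deps, "mount")
for action, rsc in plan:
    print(action, rsc)
```

In this model, "restarting" the mount point cannot be a single action:
the stop of `mount` and its start are separated by nothing here, but
they are surrounded by the stops and starts of its users, and a real
manager may need to place further steps in between.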
Re: [ClusterLabs] Fuzzy/misleading references to "restart" of a resource
On 04/12/19 21:19 +0100, Jan Pokorný wrote:
> OTOH, this enforced split of state transitions is perhaps what makes
> the transaction (comprising perhaps countless other interdependent
> resources) serializable and thus feasible at all (think: you cannot
> nest any further handling -- so as to satisfy given constraints -- in
> between stop and start when that's an atom, otherwise), and that's
> exactly how, say, systemd approaches that, likely for that very reason:
> https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5

Yet systemd has started to allow certain stop-start ("restart")
optimizations in the "stop" phase, as I've just learnt:
https://github.com/systemd/systemd/pull/13696#discussion_r330186864
It still doesn't merge/atomicize the two discrete steps, though.

OCF could possibly be amended to allow for a similar semantic
indication of "stop to be reversed shortly on this very node if
things go well" if there were a tangible use case, say by using a
"stop-with-start-pending" action instead of "stop"
(with the amendment possibly building on the idea of addon profiles,
https://github.com/ClusterLabs/OCF-spec/issues/17, if there were
actual infrastructure for that and not just daydreaming).

-- 
Jan (Poki)
[ClusterLabs] Fuzzy/misleading references to "restart" of a resource (Was: When does pacemaker call 'restart'/'force-reload' operations on LSB resource?)
On 04/12/19 14:53 +0900, Ondrej wrote:
> When adding 'LSB' script to pacemaker cluster I can see that
> pacemaker advertises 'restart' and 'force-reload' operations to be
> present - regardless if the LSB script supports it or not. This
> seems to be coming from following piece of code.
>
> https://github.com/ClusterLabs/pacemaker/blob/92b0c1d69ab1feb0b89e141b5007f8792e69655e/lib/services/services_lsb.c#L39-L40
>
> Questions:
> 1. When the 'restart' and 'force-reload' operations are called on
>    the LSB script cluster resource?

[reordered]

> I would have expected that 'restart' operation would be called when
> using 'crm_resource --restart --resource myResource', but I can see
> that 'stop' and 'start' operations are used in that case instead.

This is due to how "crm_resource --restart" is arranged, directly in
the implementation of this CLI tool itself
(see tools/crm_resource_runtime.c:cli_resource_restart):

- first, the target-role meta-attribute of the resource is set to
  Stopped
- then, once the activity has settled, it is set back to the
  target-role it was originally at

Because the restart is performed stepwise like this, and the plan is
not shared in full in advance with the respective moving parts, there
is no reasonably implementable mapping back to a single step being
the actual composition (stop, start -> restart).  And plain common
sense would still preclude it (see below).

Hence, it is actually a great discovery that the "restart" triggering
verb/action is in fact completely neglected, and thus bogus, when it
comes to handling by pacemaker.  If an agent's "restart" implements
any optimizations (thanks to having intimate knowledge of the
resource at hand, plus knowing the before-after state combo and
possibly how to transition in one go), cluster resource management
won't benefit from them in any way.
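The two-step arrangement described above can be sketched as a toy
model.  This is not Pacemaker code: the `ToyCluster` class, its
methods and the `cli_restart` helper are hypothetical names used only
to show why a manager that sees nothing but the desired role can only
ever ask the agent for "stop" or "start".

```python
class ToyCluster:
    """Toy stand-in for a resource manager; not Pacemaker's actual API."""

    def __init__(self):
        self.target_role = {}   # resource name -> desired role
        self.agent_calls = []   # (resource, action) pairs sent to the agent

    def set_target_role(self, rsc, role):
        """Record the desired role, then let the 'cluster' settle."""
        self.target_role[rsc] = role
        self._settle(rsc)

    def _settle(self, rsc):
        # The manager only knows the desired end state of this one step,
        # so the only actions it can choose from are "stop" and "start".
        action = "stop" if self.target_role[rsc] == "Stopped" else "start"
        self.agent_calls.append((rsc, action))


def cli_restart(cluster, rsc):
    """Mimic the stepwise restart: role to Stopped, then back again."""
    original = cluster.target_role.get(rsc, "Started")
    cluster.set_target_role(rsc, "Stopped")   # step 1: stop
    cluster.set_target_role(rsc, original)    # step 2: back to original role


cluster = ToyCluster()
cli_restart(cluster, "myResource")
print(cluster.agent_calls)
```

In this model the agent never receives a "restart" request, no matter
what its meta-data advertises, so any optimization living in such an
action stays unused.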
Interestingly, such optimizations are exactly what the original OCF
draft had in mind :-)
https://github.com/ClusterLabs/OCF-spec/blob/start/resource_agent/API/02#L225
(even more interestingly, only to be reconsidered again some decades
later: https://github.com/ClusterLabs/OCF-spec/issues/10; yeah, aren't
we masters of following targets that move to the extent they are
sometimes contradictory?  I'd blame a desperate lack of written [and
easily obtainable] records of past design decisions for that)

They are mandated by LSB as well, but hey, in the systemd era, we are
now _free_ to call LSB severely broken, as it (shamefully, I'd say)
never even tried to accommodate proper handling of dependency chains
(and actual serializability thereof!), as explained in an example
below.  Or, put in other words, LSB was never meant to stand for
holistic resource management, something both systemd and pacemaker
attempt to cover (single-machine and multi-machine wide,
respectively).

OTOH, this enforced split of state transitions is perhaps what makes
the transaction (comprising perhaps countless other interdependent
resources) serializable and thus feasible at all (think: you cannot
nest any further handling -- so as to satisfy given constraints -- in
between stop and start when that's an atom, otherwise), and that's
exactly how, say, systemd approaches that, likely for that very reason:
https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5

So I see room for improvement here as our takeaway:

* resource agents:

  - some agents declare/implement a "restart" action when there is no
    practical reason to (AudibleAlarm, Xinetd, dhcpd, etc.)
    [as a side note, there are nonsensical considerations: when the
    default "start" and "stop" timeouts for dhcpd are 20 seconds
    each, how come "restart", defined as "stop; start", would also
    make do with 20 seconds altogether, unless there is some
    amortized work I fail to see :-)]

* pacemaker:

  - artificially generated meta-data mention a "restart" action when
    there is no good reason to (lib/services/services_lsb.c)

  - there are some correct clues in Pacemaker Explained, but perhaps
    it should take the time to emphasize that whenever "restart" is
    referred to, it is never an atomic step, but always a sequence of
    two steps that may be considered atomic on their own, yet are
    possibly interleaved with other steps so as to retain soundness
    wrt. the imposed constraints and/or changes made in parallel

  - the same gist of "restart" should be sketched in the help screen
    of crm_resource

> For 'force-reload' I have no idea on how to try trigger it looking
> at 'crm_resource --help' output.

Sorry, that's even more bogus, as there's no relevance whatsoever.
It needs to either be dropped from the artificially generated
meta-data as well, or be investigated further as to whether there's
any reason to make such an operation triggerable by users, and if so,
how far the impact is expected to spread once implemented (do the
dependent services need to be reloaded or "restarted" as well, since
the change might be non-local?  any precedent there?  again, hard to