Ken Gaillot <kgail...@redhat.com> wrote:
> On 05/12/2016 06:21 AM, Adam Spiers wrote:
> > Ken Gaillot <kgail...@redhat.com> wrote:
> >> On 05/10/2016 02:29 AM, Ulrich Windl wrote:
> >>>> Here is what I'm testing currently:
> >>>>
> >>>> - When the cluster recovers a resource, the resource agent's stop action
> >>>> will get a new variable, OCF_RESKEY_CRM_meta_recovery_left =
> >>>> migration-threshold - fail-count on the local node.
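(If I'm following the arithmetic, the stop action would then see a simple
per-node countdown across successive recovery attempts, something like
this, with purely illustrative values:

    1st recovery:  OCF_RESKEY_CRM_meta_recovery_left=3
    2nd recovery:  OCF_RESKEY_CRM_meta_recovery_left=2
    3rd recovery:  OCF_RESKEY_CRM_meta_recovery_left=1
    final stop:    variable at 0, or unset

More on the 0-vs-unset distinction below.)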
[snipped]

> > I'd prefer plural (OCF_RESKEY_CRM_meta_recoveries_left) but other than
> > that I think it's good. OCF_RESKEY_CRM_meta_retries_left is shorter;
> > not sure whether it's marginally worse or better though.
>
> I'm now leaning to restart_remaining (restarts_remaining would be just
> as good).

restarts_remaining would be better IMHO, given that it's expected that
there will often be multiple restarts remaining.

[snipped]

> > OK, so the RA code would typically be something like this?
> >
> > if [ ${OCF_RESKEY_CRM_meta_retries_left:-0} = 0 ]; then
> >     # This is the final stop, so tell the external service
> >     # not to send any more work our way.
> >     disable_service
> > fi
>
> I'd use -eq :) but yes

Right, -eq is better style for numeric comparison :-)

[snipped]

> >>>> -- If a resource is being recovered, but the fail-count is being cleared
> >>>> in the same transition, the cluster will ignore migration-threshold (and
> >>>> the variable will not be set). The RA might see recovery_left=5, 4, 3,
> >>>> then someone clears the fail-count, and it won't see recovery_left even
> >>>> though there is a stop and start being attempted.
> >
> > Hmm. So how would the RA distinguish that case from the one where
> > the stop is final?

> That's the main question in all this. There are quite a few scenarios
> where there's no meaningful distinction between 0 and unset. With the
> current implementation at least, the ideal approach is for the RA to
> treat the last stop before a restart the same as a final stop.

OK ...

[snipped]

> > So IIUC, you are talking about a scenario like this:
> >
> > 1. The whole group starts fine.
> > 2. Some time later, the neutron openvswitch agent crashes.
> > 3. Pacemaker shuts down nova-compute since it depends upon
> >    the neutron agent due to being later in the same group.
> > 4. Pacemaker repeatedly tries to start the neutron agent,
> >    but reaches migration-threshold.
> >
> > At this point, nova-compute is permanently down, but its RA never got
> > passed OCF_RESKEY_CRM_meta_retries_left with a value of 0 or unset,
> > so it never knew to do a nova service-disable.

> Basically right, but it would be unset (not empty -- it's never empty).
>
> However, this is a solvable issue. If it's important, I can add the
> variable to all siblings of the failed resource if the entire group
> would be forced away.

Good to hear.

> > (BTW, in this scenario, the group is actually cloned, so no migration
> > to another compute node happens.)

> Clones are the perfect example of the lack of distinction between 0 and
> unset. For an anonymous clone running on all nodes, the countdown will
> be 3,2,1,unset because the specific clone instance doesn't need to be
> started anywhere else (it looks more like a final stop of that
> instance). But for unique clones, or anonymous clones where another node
> is available to run the instance, it might be 0.

I see, thanks.

> > Did I get that right? If so, yes it does sound like an issue. Maybe
> > it is possible to avoid this problem by avoiding the use of groups,
> > and instead just use interleaved clones with ordering constraints
> > between them?

> That's not any better, and in fact it would be more difficult to add the
> variable to the dependent resource in such a situation, compared to a group.
>
> Generally, only the failed resource will get the variable, not resources
> that may be stopped and started because they depend on the failed
> resource in some way.

OK.
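Anyway, just to check I've understood the stop side of this, I guess the
RA would end up doing roughly the following (sketch only, using the
restarts_remaining name discussed above; stop_nova_compute and
disable_service are placeholders for the real logic):

    nova_compute_stop() {
        stop_nova_compute || return $OCF_ERR_GENERIC

        # Treat 0 and unset the same, as you suggest: either way, no
        # further restart is budgeted on this node, so do the heavier
        # teardown and tell nova-api to stop scheduling work to us.
        if [ "${OCF_RESKEY_CRM_meta_restarts_remaining:-0}" -eq 0 ]; then
            disable_service
        fi

        return $OCF_SUCCESS
    }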
So the fact that only the failed resource gets the variable might be more
of a problem for you guys than for us, since we use cloned groups and you
don't:

https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/high-availability-for-compute-instances/chapter-1-use-high-availability-to-protect-instances

> >> More generally, I suppose the point is to better support services that
> >> can do a lesser tear-down for a stop-start cycle than a full stop. The
> >> distinction between the two cases may not be 100% clear (as with your
> >> fencing example), but the idea is that it would be used for
> >> optimization, not some required behavior.
> >
> > This discussion is prompting me to get this clearer in my head, which
> > is good :-)
> >
> > I suppose we *could* simply modify the existing NovaCompute OCF RA so
> > that every time it executes the 'stop' action, it immediately sends
> > the service-disable message to nova-api, and similarly send
> > service-enable during the 'start' action. However this probably has a
> > few downsides:
> >
> > 1. It could cause rapid flapping of the service state server-side (at
> >    least disable followed quickly by enable, or more if it took
> >    multiple retries to successfully restart nova-compute), and extra
> >    associated noise/load on nova-api and the MQ and DB.
> > 2. It would slow down recovery.
>
> If the start can always send service-enable regardless of whether
> service-disable was previously sent, without much performance penalty,
> then that's a good use case for this. The stop could send
> service-disable when the variable is 0 or unset; the gain would be in
> not having to send service-disable when the variable is >=1.

Right. I'm not sure I like the idea of always sending service-enable
regardless, even though I was the one to air that possibility. It would
risk overriding a nova service-disable invoked manually by a cloud
operator for other reasons.

One way around this might be to locally cache the expected disable/enable
state to a file, and to only invoke service-enable when service-disable
was previously invoked by the same RA.
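Something like this, maybe (very rough sketch; the state file path and the
nova_service_disable/enable helpers are invented names standing in for the
real service-disable/enable calls):

    STATE_FILE=/var/run/nova-compute-ra.disabled

    service_disable() {
        # Remember that *we* disabled the service, so a later start
        # knows it is safe to re-enable it.
        nova_service_disable && touch "$STATE_FILE"
    }

    service_enable() {
        # Only re-enable if we were the ones who disabled it; never
        # override a service-disable done manually by the operator.
        if [ -f "$STATE_FILE" ]; then
            nova_service_enable && rm -f "$STATE_FILE"
        fi
    }

That way a manual service-disable by the operator would be left alone,
since the RA would never have created the state file for it.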
> > 3. What happens if whatever is causing nova-compute to fail is also
> >    causing nova-api to be unreachable from this compute node?

> This is not really addressable by the local node. I think in such a
> situation, fencing will likely be invoked, and it can be addressed then.

Good point.

> > So as you say, the intended optimization here is to make the
> > stop-start cycle faster and more lightweight than the final stop.
> >
> >> I am not sure the current implementation described above is sufficient,
> >> but it should be a good starting point to work from.
> >
> > Hopefully, but you've raised more questions in my head :-)
> >
> > For example, I think there are probably other use cases, e.g.
> >
> > - Take configurable action after failure to restart libvirtd
> >   (one possible action is fencing the node; another is to
> >   notify the cloud operator)

> Just musing a bit ... on-fail + migration-threshold could have been
> designed to be more flexible:
>
> hard-fail-threshold: When an operation fails this many times, the
> cluster will consider the failure to be a "hard" failure. Until this
> many failures, the cluster will try to recover the resource on the same
> node.

How is this different to migration-threshold, other than in name?

> hard-fail-action: What to do when the operation reaches
> hard-fail-threshold ("ban" would work like current "restart" i.e. move
> to another node, and ignore/block/stop/standby/fence would work the same
> as now)

And I'm not sure I understand how this is different to / more flexible
than what we can do with on-fail now?

> That would allow fence etc. to be done only after a specified number of
> retries. Ah, hindsight ...

Isn't that possible now, e.g. with migration-threshold=3 and
on-fail=fence?  I feel like I'm missing something.

[snipped]

> > - neutron-l3-agent RA detects that the agent is unhealthy, and iff it
> >   fails to restart it, we want to trigger migration of any routers on
> >   that l3-agent to a healthy l3-agent. Currently we wait for the
> >   connection between the agent and the neutron server to time out,
> >   which is unpleasantly slow. This case is more of a requirement than
> >   an optimization, because we really don't want to migrate routers to
> >   another node unless we have to, because a) it takes time, and b) is
> >   disruptive enough that we don't want to have to migrate them back
> >   soon after if we discover we can successfully recover the unhealthy
> >   l3-agent.
> >
> > - Remove a failed backend from an haproxy-fronted service if
> >   it can't be restarted.
> >
> > - Notify any other service (OpenStack or otherwise) where the failing
> >   local resource is a backend worker for some central service. I
> >   guess ceilometer, cinder, mistral etc. are all potential
> >   examples of this.

Any thoughts on the sanity of these?

> > Finally, there's the fundamental question of when responsibility for
> > monitoring and cleaning up after failures should be handled by
> > Pacemaker and OCF RAs, or whether sometimes a central service should
> > handle that itself. For example we could tune the nova / neutron
> > agent timeouts to be much more aggressive, and then those servers
> > would notice agent failures themselves quickly enough that we wouldn't
> > have to configure Pacemaker to detect them and then notify the
> > servers.
> >
> > I'm not sure if there is any good reason why Pacemaker can more
> > reliably detect failures than those native keepalive mechanisms. The
> > main difference appears to be that Pacemaker executes monitoring
> > directly on the monitored node via lrmd, and then relays the results
> > back via corosync, whereas server/agent heartbeating typically relies
> > on the state of a simple TCP connection. In that sense, Pacemaker is
> > more flexible in what it can monitor, and the monitoring may also take
> > place over different networks depending on the configuration. And of
> > course it can do fencing when this is required. But in the cases
> > where more sophisticated monitoring and fencing are not required,
> > I wonder if this is worth the added complexity. Thoughts?

> Pacemaker also adds rich dependencies that can take into account far
> more information than the central service will know -- constraints,
> utilization attributes, health attributes, rules.

True. But this is mainly of benefit when the clean-up involves doing
things to other services, and in cases such as neutron-l3-agent, I
suspect it tends not to.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org