Ken Gaillot <[email protected]> wrote:
> On 05/20/2016 10:40 AM, Adam Spiers wrote:
> > Ken Gaillot <[email protected]> wrote:
> >> Just musing a bit ... on-fail + migration-threshold could have been
> >> designed to be more flexible:
> >>
> >> hard-fail-threshold: When an operation fails this many times, the
> >> cluster will consider the failure to be a "hard" failure. Until this
> >> many failures, the cluster will try to recover the resource on the same
> >> node.
> >
> > How is this different to migration-threshold, other than in name?
> >
> >> hard-fail-action: What to do when the operation reaches
> >> hard-fail-threshold ("ban" would work like current "restart" i.e. move
> >> to another node, and ignore/block/stop/standby/fence would work the same
> >> as now)
> >
> > And I'm not sure I understand how this is different to / more flexible
> > than what we can do with on-fail now?
> >
> >> That would allow fence etc. to be done only after a specified number of
> >> retries. Ah, hindsight ...
> >
> > Isn't that possible now, e.g. with migration-threshold=3 and
> > on-fail=fence? I feel like I'm missing something.
>
> migration-threshold only applies when on-fail=restart. If on-fail=fence
> or something else, that action always applies after the first failure.

*sound of penny dropping* Ahah! Thanks, yes that's what I was missing :-)

> So hard-fail-threshold would indeed be the same as migration-threshold,
> but applied to all actions (and would be renamed, since the resource
> won't migrate in the other cases).

Gotcha.
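
To spell out my new understanding, the difference boils down to something
like the following two alternative configurations, shown as a rough and
untested crm shell sketch for a made-up primitive "p_dummy":

    # Monitor failures are retried on the same node; only once the fail
    # count reaches 3 is the resource banned from that node and
    # recovered elsewhere:
    primitive p_dummy ocf:heartbeat:Dummy \
        op monitor interval=10s on-fail=restart \
        meta migration-threshold=3

    # Here migration-threshold is effectively ignored: the very first
    # monitor failure fences the node, with no retries beforehand:
    primitive p_dummy ocf:heartbeat:Dummy \
        op monitor interval=10s on-fail=fence \
        meta migration-threshold=3
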
> >>> - neutron-l3-agent RA detects that the agent is unhealthy, and iff it
> >>>   fails to restart it, we want to trigger migration of any routers on
> >>>   that l3-agent to a healthy l3-agent. Currently we wait for the
> >>>   connection between the agent and the neutron server to time out,
> >>>   which is unpleasantly slow. This case is more of a requirement than
> >>>   an optimization, because we really don't want to migrate routers to
> >>>   another node unless we have to, because a) it takes time, and b) is
> >>>   disruptive enough that we don't want to have to migrate them back
> >>>   soon after if we discover we can successfully recover the unhealthy
> >>>   l3-agent.
> >>>
> >>> - Remove a failed backend from an haproxy-fronted service if
> >>>   it can't be restarted.
> >>>
> >>> - Notify any other service (OpenStack or otherwise) where the failing
> >>>   local resource is a backend worker for some central service. I
> >>>   guess ceilometer, cinder, mistral etc. are all potential
> >>>   examples of this.
> >
> > Any thoughts on the sanity of these?
>
> Beyond my expertise. But sounds reasonable.

We should probably migrate this part of the discussion to openstack-dev ...
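
(For concreteness, the router-evacuation step in the first case above
would presumably boil down to something like the following entirely
untested sketch, using the legacy neutron CLI with made-up agent UUIDs
and no error handling:)

    # Move every router off the unhealthy L3 agent onto a known-healthy one.
    UNHEALTHY_AGENT=11111111-2222-3333-4444-555555555555
    HEALTHY_AGENT=66666666-7777-8888-9999-000000000000

    for router in $(neutron router-list-on-l3-agent -f value -c id "$UNHEALTHY_AGENT"); do
        neutron l3-agent-router-remove "$UNHEALTHY_AGENT" "$router"
        neutron l3-agent-router-add "$HEALTHY_AGENT" "$router"
    done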
