Paul Rubio created HADOOP-10684:
-----------------------------------

             Summary: Extend HA support for more use cases
                 Key: HADOOP-10684
                 URL: https://issues.apache.org/jira/browse/HADOOP-10684
             Project: Hadoop Common
          Issue Type: Improvement
          Components: ha
            Reporter: Paul Rubio
            Priority: Minor

We'd like the current HA framework to be more configurable from a behavior standpoint. In particular:

- Add the ability for an HAServiceTarget to survive a configurable number of health check failures (default of 0) before HealthMonitor (HM) reports service not responding or service unhealthy. For instance, predicate the HM on a state machine whose default implementation can be overridden by a method or constructor argument. The default would behave the same as today.
-- If a target fails a health check but does not exceed the maximum number of consecutive check failures, it would be desirable for the target and/or controller to be alerted.
--- i.e. introduce a SERVICE_DYING state.
-- Additionally, it would be desirable to have a mechanism, similar to fencing semantics, for “reviving” a service that has transitioned to SERVICE_DYING.
--- i.e. attemptRevive(…)
- Add the ability to allow a service to fail completely (no failover or failback possible). There are scenarios where allowing a failover or failback could cause more damage.
-- E.g. a recovered master with stale data. The master may have been manually recovered (human error).
- Add affinity to a particular HAServiceTarget.
-- In other words, allow the controller to prefer one target over another when deciding leadership.
-- If a higher-affinity but previously unhealthy target becomes healthy, it should be allowed to become the leader.
-- Likewise, if two targets are racing for a ZooKeeper lock, the controller should "prefer" the higher-affinity target.
-- It might make more sense to add a different implementation/subclass of the ZKFailoverController (e.g. ZKAffinityFailoverController) than to modify the current behavior.

Please comment with thoughts/ideas/etc...

Thanks.
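
To make the first request concrete, here is a minimal sketch of the failure-tolerance state machine described above. It is illustrative only and is not existing Hadoop code: the class name FailureThresholdTracker, its State enum, and the recordCheck method are hypothetical names chosen for this example, and SERVICE_DYING is the state proposed in this issue rather than one that exists today. The sketch assumes "consecutive failures" is the right counter and that any successful check resets it.

```java
/**
 * Hypothetical sketch (not part of Hadoop): counts consecutive health check
 * failures and reports SERVICE_DYING until a configurable tolerance is exceeded.
 */
final class FailureThresholdTracker {

    /** Illustrative states; SERVICE_DYING is the state proposed in this issue. */
    enum State { SERVICE_HEALTHY, SERVICE_DYING, SERVICE_UNHEALTHY }

    private final int maxToleratedFailures;
    private int consecutiveFailures = 0;
    private State state = State.SERVICE_HEALTHY;

    /** Default tolerance of 0: the first failure is reported as unhealthy. */
    FailureThresholdTracker() {
        this(0);
    }

    FailureThresholdTracker(int maxToleratedFailures) {
        if (maxToleratedFailures < 0) {
            throw new IllegalArgumentException("maxToleratedFailures must be >= 0");
        }
        this.maxToleratedFailures = maxToleratedFailures;
    }

    /**
     * Records the outcome of one health check and returns the resulting state.
     * A successful check resets the consecutive failure counter.
     */
    synchronized State recordCheck(boolean healthy) {
        if (healthy) {
            consecutiveFailures = 0;
            state = State.SERVICE_HEALTHY;
        } else {
            consecutiveFailures++;
            // Still within tolerance: warn via SERVICE_DYING instead of failing.
            state = (consecutiveFailures > maxToleratedFailures)
                    ? State.SERVICE_UNHEALTHY
                    : State.SERVICE_DYING;
        }
        return state;
    }

    synchronized State getState() {
        return state;
    }
}
```

With a tolerance of 2, the first two consecutive failures would yield SERVICE_DYING (the point at which something like attemptRevive(…) could be triggered) and the third would yield SERVICE_UNHEALTHY. With the default of 0, the first failure goes straight to SERVICE_UNHEALTHY, which matches the "same as today" requirement.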