Vinod Kumar Vavilapalli created YARN-4725:
---------------------------------------------
Summary: [Umbrella] Auto-restart of containers
Key: YARN-4725
URL: https://issues.apache.org/jira/browse/YARN-4725
Project: Hadoop YARN
Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
See the overview doc at YARN-4692; the relevant sub-section is copied here to
track all related efforts.
Today, when a container (process-tree) dies, the NodeManager assumes that the
container’s allocation has also expired and reports this to the
ResourceManager, which then releases the allocation. For service containers,
this is undesirable in many cases. Long-running containers may exit or crash
for various reasons and need to be restarted, but forcing them to go through
the complete scheduling cycle, resource localization, etc. is both unnecessary
and expensive.
(Task) For services, it would be good to have NodeManagers automatically
restart containers. This looks a lot like inittab / daemontools at the system
level.
We will need to enable app-specific policies (very similar to the handling of
AM restarts at the YARN level) for restarting containers automatically, while
limiting such restarts if a container dies too often in a short interval of
time.
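As a rough illustration of the "limit restarts within a short interval" idea,
here is a minimal sketch of a sliding-window restart limiter. It is not
existing YARN code: the class name ContainerRestartLimiter, the method
shouldRestart, and the parameters maxRestarts and windowMillis are all
hypothetical, and an actual NodeManager policy would likely need more
(per-app configuration, exit-code handling, persistence across NM restarts).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch, not part of YARN: allows at most maxRestarts
 * restarts of one container within any window of windowMillis.
 */
class ContainerRestartLimiter {
    private final int maxRestarts;     // restarts allowed inside one window
    private final long windowMillis;   // length of the sliding window
    private final Deque<Long> restartTimes = new ArrayDeque<>();

    ContainerRestartLimiter(int maxRestarts, long windowMillis) {
        this.maxRestarts = maxRestarts;
        this.windowMillis = windowMillis;
    }

    /**
     * Called when the container exits at time nowMillis. Returns true and
     * records the restart if the limit has not been reached; returns false
     * if the container has died too often recently.
     */
    synchronized boolean shouldRestart(long nowMillis) {
        // Drop restarts that have aged out of the window.
        while (!restartTimes.isEmpty()
                && nowMillis - restartTimes.peekFirst() >= windowMillis) {
            restartTimes.pollFirst();
        }
        if (restartTimes.size() >= maxRestarts) {
            return false; // give up; the allocation would be released as today
        }
        restartTimes.addLast(nowMillis);
        return true;
    }
}
```

Under these assumptions, a NodeManager would consult the limiter each time a
service container exits and fall back to today's behavior (reporting the
container as completed to the ResourceManager) once shouldRestart returns
false.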
YARN-3998 is an existing ticket that looks at some, if not all, of this
functionality.