Peter Simon created YARN-7686:
---------------------------------

             Summary: Yarn containers failover if datanode/nodemanager fails
                 Key: YARN-7686
                 URL: https://issues.apache.org/jira/browse/YARN-7686
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: resourcemanager
    Affects Versions: 2.6.0
            Reporter: Peter Simon


While running an application on YARN, one of the datanodes/nodemanagers went 
offline due to power issues. The first application attempt failed because of 
lost containers. When the second attempt started, no heartbeat interval had 
yet elapsed at the Namenode, so the second attempt was still given the failed 
datanode/nodemanager as a possible worker node for its containers. Because the 
host was unreachable, those container attempts failed, which caused the second 
application attempt to fail as well and, in turn, the whole application to 
fail.
There could be a failover process for container attempts: if a new container 
can't be brought up on one node, the ResourceManager should try to allocate 
the new container on a different node.
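
To illustrate the requested behaviour, here is a minimal sketch in Python. It 
is a toy model, not YARN code: place_container, try_launch and max_attempts 
are hypothetical names invented for this example, and the real scheduler is 
far more involved. The idea is only that a node which failed to start a 
container is remembered and skipped, so the next attempt goes to a different 
node.

```python
from typing import Callable, Iterable, Optional, Set


def place_container(
    candidate_nodes: Iterable[str],
    try_launch: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Try to launch one container, moving to another node after a failure.

    Hypothetical helper for illustration only; not part of any YARN API.
    Returns the node that accepted the container, or None if all
    attempts failed.
    """
    failed_nodes: Set[str] = set()
    attempts = 0
    for node in candidate_nodes:
        if attempts >= max_attempts:
            break
        if node in failed_nodes:
            continue  # never retry on a node that already failed
        attempts += 1
        if try_launch(node):
            return node
        failed_nodes.add(node)
    return None


# Example: node1 is powered off but is still listed as a candidate.
reachable = {"node2", "node3"}
chosen = place_container(["node1", "node2", "node3"], lambda n: n in reachable)
print(chosen)
```

In this model the failure on node1 costs one container attempt instead of 
failing the whole application attempt, which is the outcome asked for above.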



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

