Junping Du commented on YARN-2561:

This is caused by YARN-1337, where we removed the NodeRemoveEvent on node
reconnect. That is correct for nodes with work-preserving recovery enabled, but
it causes containers to get frozen on clusters where NM work-preserving restart
is disabled (the default so far). I already have a quick fix for it and will
post it soon.
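
The behaviour described above can be illustrated with a minimal sketch. This
is not the actual patch and does not use the real ResourceManager classes:
ReconnectSketch, onReconnect and the workPreservingEnabled flag are
hypothetical names chosen for illustration. It assumes the scheduler keeps a
per-node list of containers, and that removing the node is what releases
containers the restarted NM no longer runs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of scheduler bookkeeping; not real YARN code.
class ReconnectSketch {
    // node id -> containers the scheduler believes are running there
    private final Map<String, List<String>> containersByNode = new HashMap<>();
    // record of the (hypothetical) scheduler events that were raised
    private final List<String> events = new ArrayList<>();

    void addNode(String nodeId) {
        containersByNode.putIfAbsent(nodeId, new ArrayList<>());
        events.add("ADD " + nodeId);
    }

    void removeNode(String nodeId) {
        // Dropping the node also drops the containers tracked on it.
        containersByNode.remove(nodeId);
        events.add("REMOVE " + nodeId);
    }

    void allocate(String nodeId, String containerId) {
        containersByNode.get(nodeId).add(containerId);
    }

    // Assumed handling of an NM that re-registers after a restart.
    void onReconnect(String nodeId, boolean workPreservingEnabled) {
        if (!workPreservingEnabled) {
            // The restarted NM lost its containers, so clear stale state
            // by removing the node and adding it back.
            removeNode(nodeId);
            addNode(nodeId);
        }
        // With work-preserving restart the containers survive, so the
        // existing bookkeeping is kept as-is.
    }

    List<String> containers(String nodeId) {
        return new ArrayList<>(
            containersByNode.getOrDefault(nodeId, Collections.emptyList()));
    }

    List<String> events() {
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        ReconnectSketch s = new ReconnectSketch();
        s.addNode("nm1");
        s.allocate("nm1", "container_1");
        s.onReconnect("nm1", false); // work-preserving restart disabled
        System.out.println(s.containers("nm1"));
        System.out.println(s.events());
    }
}
```

Skipping the remove step in the non-work-preserving case leaves container_1
tracked on a node that no longer runs it, which is the frozen state described
in the comment.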

> MR job client cannot reconnect to AM after NM restart.
> ------------------------------------------------------
>                 Key: YARN-2561
>                 URL: https://issues.apache.org/jira/browse/YARN-2561
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Tassapol Athiapinya
>            Assignee: Junping Du
>            Priority: Critical
> Work-preserving NM restart is disabled.
> Submit a job. Restart the NM while the AM is running. The job client cannot
> connect to the new AM attempt and hangs in connect retries.

This message was sent by Atlassian JIRA
