Wangda Tan commented on YARN-2249:

Hi [~jianhe],
Thanks for working on the patch.
I've read it and have several comments/questions:

1) I haven't followed the work-preserving restart discussions for a long time. How 
does the current RM handle this problem: after the RM restarts, it starts allocating 
resources, and then an NM reports containers to recover, but there's no resource 
available on the node/queue?
I remember we discussed this topic while you were working on YARN-1368: 
the RM would not allocate new resources for x secs after restart, so that NMs can 
reconnect and recover containers. If you chose that approach, we can cache 
outstanding container release requests until x secs after restart have elapsed.
And could you elaborate why you use NM liveness expire time? Can we improve 

2) It seems to me using
+    this.pendingRelease =
+        CacheBuilder.newBuilder().expireAfterWrite
is not good enough, because it will cache every release request from the AM. 
Actually, we only need to cache release requests for a period of time after the AM 
reconnects to the RM. After that time elapses, the release logic should behave as 
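
To make the suggestion in 2) concrete, here is a minimal sketch of buffering release requests only during a window after AM resync. It is not taken from the patch: the class {{PendingReleaseBuffer}}, its method names, and the use of plain strings as container IDs are all hypothetical, and it tracks a single AM for simplicity (a real version would presumably key the window per application attempt).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: buffers release requests only during a post-resync window. */
final class PendingReleaseBuffer {
  private final long windowMs;
  // containerId -> time (ms) at which the buffered request expires
  private final Map<String, Long> pending = new HashMap<>();
  // No window is open until the AM resyncs.
  private long windowEndMs = Long.MIN_VALUE;

  PendingReleaseBuffer(long windowMs) {
    this.windowMs = windowMs;
  }

  /** Called when the AM re-registers after an RM restart; opens the window. */
  synchronized void onAmResync(long nowMs) {
    windowEndMs = nowMs + windowMs;
  }

  /**
   * Returns true if the request was buffered. Returns false when no window is
   * open, in which case the caller should fall back to the normal release path.
   */
  synchronized boolean offerRelease(String containerId, long nowMs) {
    if (nowMs >= windowEndMs) {
      return false;
    }
    pending.put(containerId, windowEndMs);
    return true;
  }

  /** Called when a container is recovered; true means it should be released now. */
  synchronized boolean onContainerRecovered(String containerId) {
    return pending.remove(containerId) != null;
  }

  /** Removes and returns requests whose window has elapsed, so the caller can log them. */
  synchronized List<String> expire(long nowMs) {
    List<String> expired = new ArrayList<>();
    Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> entry = it.next();
      if (nowMs >= entry.getValue()) {
        expired.add(entry.getKey());
        it.remove();
      }
    }
    return expired;
  }
}
```

With this shape, requests arriving outside the window never enter the cache, and the point where {{expire}} returns an entry is the one place where a release request is known to be lost.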

3) I think we shouldn't {{logFailure}} when the rmContainer is not found in this case. 
IMHO, we should {{logFailure}} when the release request is removed from the cache instead.

4) We should notify the AM with a container-completed message when we decide not to 
recover a container.
And we should add this to the test as well.

5) Test:
Can we wait for some state instead of {{Thread.sleep(3000);}}?
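
For example, a small polling helper along these lines could replace the fixed sleep. This is only a sketch in plain Java: {{WaitUtil}} and {{waitFor}} are hypothetical names, and the condition passed in would be whatever state check the test needs.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical test helper: polls a condition instead of sleeping a fixed time. */
final class WaitUtil {
  private WaitUtil() {}

  /**
   * Polls the condition every pollMs milliseconds until it holds.
   * Throws IllegalStateException if it does not hold within timeoutMs.
   */
  static void waitFor(BooleanSupplier condition, long pollMs, long timeoutMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new IllegalStateException(
            "Condition not met within " + timeoutMs + " ms");
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        // Restore the interrupt flag and fail fast rather than keep polling.
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting", e);
      }
    }
  }
}
```

The test then returns as soon as the expected state is reached, and fails with a clear message if it never is, rather than always paying the full 3 seconds.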


> RM may receive container release request on AM resync before container is 
> actually recovered
> --------------------------------------------------------------------------------------------
>                 Key: YARN-2249
>                 URL: https://issues.apache.org/jira/browse/YARN-2249
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-2249.1.patch, YARN-2249.1.patch
> AM resync on RM restart will send outstanding container release requests back 
> to the new RM. In the meantime, NMs report the container statuses back to RM 
> to recover the containers. If RM receives the container release request  
> before the container is actually recovered in scheduler, the container won't 
> be released and the release request will be lost.
