[
https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088965#comment-14088965
]
Wangda Tan commented on YARN-2249:
----------------------------------
Hi [~jianhe],
Thanks for working on the patch.
I've read it and have several comments/questions:
1) I haven't followed the work-preserving restart discussions for a long time.
How does the current RM handle this problem: after the RM restarts, it starts
allocating resources, and then an NM reports recovered containers, but there is
no resource available in that node/queue?
I remember we discussed this topic while you were working on YARN-1368: the RM
would not allocate new resources for x secs after restart, so that NMs can
reconnect and recover containers. If you chose that approach, we can cache
outstanding container release requests until x secs after the restart.
Could you also elaborate on why you use the NM liveness expiry time? Can we
improve this?
2) It seems to me that using
{code}
+ this.pendingRelease =
+ CacheBuilder.newBuilder().expireAfterWrite
{code}
is not good enough, because it will cache every release request from the AM.
Actually, we only need to cache release requests for a period of time after the
AM reconnects to the RM. Once that period has elapsed, the release logic should
behave as before.
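A rough sketch of the windowed caching described in 2), for illustration only.
The class {{PendingReleaseWindow}}, its method names, the string IDs, and the
injected clock are all hypothetical and not taken from the patch. It assumes
the scheduler can tell whether a container is already known, and it leaves out
expiring cached entries once the window closes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.LongSupplier;

/** Hypothetical helper, not part of the patch: caches release requests
 *  only during a fixed window after an AM attempt resyncs. */
class PendingReleaseWindow {
  enum Outcome { RELEASED, CACHED, NOT_FOUND }

  private final long windowMs;
  private final LongSupplier clock; // injected so tests can control time
  private final Map<String, Long> resyncTime = new HashMap<>();
  private final Map<String, Set<String>> pending = new HashMap<>();

  PendingReleaseWindow(long windowMs, LongSupplier clock) {
    this.windowMs = windowMs;
    this.clock = clock;
  }

  /** Record the moment an AM attempt reconnects to the restarted RM. */
  synchronized void onAmResync(String attemptId) {
    resyncTime.put(attemptId, clock.getAsLong());
  }

  /** Decide what to do with a release request from the AM. */
  synchronized Outcome onRelease(String attemptId, String containerId,
                                 boolean containerKnown) {
    if (containerKnown) {
      return Outcome.RELEASED; // normal path, same as before
    }
    Long start = resyncTime.get(attemptId);
    if (start != null && clock.getAsLong() - start < windowMs) {
      // Container may still be recovered by an NM: remember the request.
      pending.computeIfAbsent(attemptId, k -> new HashSet<>()).add(containerId);
      return Outcome.CACHED;
    }
    return Outcome.NOT_FOUND; // outside the window: behave as before
  }

  /** Returns true if a cached release exists, so the caller should release
   *  the container right after recovering it. */
  synchronized boolean onContainerRecovered(String attemptId,
                                            String containerId) {
    Set<String> ids = pending.get(attemptId);
    return ids != null && ids.remove(containerId);
  }
}
```

With this shape, requests from AMs that never resynced, or that resynced long
ago, are never cached, which is the difference from a cache that stores every
release request.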
3) I think we shouldn't {{logFailure}} when the rmContainer is not found in this
case. IMHO, we should {{logFailure}} when the release request is removed from
the cache instead.
4) We should notify the AM with a container-completed message when we decide
not to recover a container.
We should cover this in the test as well.
5) Test:
Can we wait for some state instead of {{Thread.sleep(3000);}}?
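As an illustration of 5), a test can poll for the expected state with a
timeout rather than sleeping for a fixed 3 seconds. The helper below is a
minimal sketch; {{WaitUtil}} and {{waitFor}} are hypothetical names, and the
condition to poll would be whatever state the test actually needs.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper: poll a condition until it holds or time runs out. */
class WaitUtil {
  static void waitFor(BooleanSupplier condition, long pollMs, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(
            "condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(pollMs); // short pause between checks
    }
  }
}
```

The test then returns as soon as the state is reached and fails with a clear
timeout otherwise, instead of always paying the full sleep.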
Thanks,
Wangda
> RM may receive container release request on AM resync before container is
> actually recovered
> --------------------------------------------------------------------------------------------
>
> Key: YARN-2249
> URL: https://issues.apache.org/jira/browse/YARN-2249
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Jian He
> Assignee: Jian He
> Attachments: YARN-2249.1.patch, YARN-2249.1.patch
>
>
> AM resync on RM restart will send outstanding container release requests back
> to the new RM. In the meantime, NMs report the container statuses back to RM
> to recover the containers. If RM receives the container release request
> before the container is actually recovered in scheduler, the container won't
> be released and the release request will be lost.