[ https://issues.apache.org/jira/browse/YARN-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15813092#comment-15813092 ]
Jason Lowe commented on YARN-4148:
----------------------------------
The unit test failures appear to be unrelated. They pass for me locally with
the patch applied, and there are JIRAs that are tracking those failures. The
TestDelegationTokenRenewer failure is being tracked by YARN-5816 and the
TestRMRestart failure is tracked by YARN-5548.
Thanks for the review, [~djp]! If you agree the failures are unrelated then
feel free to commit, or I'll do so in a few days unless I hear otherwise.
> When killing app, RM releases app's resource before they are released by NM
> ---------------------------------------------------------------------------
>
> Key: YARN-4148
> URL: https://issues.apache.org/jira/browse/YARN-4148
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Jun Gong
> Assignee: Jason Lowe
> Attachments: YARN-4148.001.patch, YARN-4148.002.patch,
> YARN-4148.003.patch, YARN-4148.wip.patch,
> free_in_scheduler_but_not_node_prototype-branch-2.7.patch
>
>
> When killing an app, the RM scheduler releases the app's resources as soon as
> possible, and it might then allocate those resources to new requests. But the
> NM has not yet released them at that point.
> The problem was found when we supported GPU as a resource (YARN-4122). Test
> environment: an NM had 6 GPUs, app A used all 6 GPUs, and app B was requesting
> 3 GPUs. We killed app A; the RM then released A's 6 GPUs and allocated 3 GPUs
> to B. But when B tried to start a container on the NM, the NM found it didn't
> have 3 GPUs to allocate because it had not yet released A's GPUs.
> I think the problem also exists for CPU/Memory. It might cause OOM when
> memory is overused.
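
The race in the description above can be sketched with a toy model. This is
only an illustration in Python, not YARN code: the class names Node and
Scheduler and the function simulate_kill_race are hypothetical, and the model
assumes the RM and the NM each keep their own independent count of the node's
GPUs, using the 6-GPU / 3-GPU numbers from the report.

```python
class Node:
    """Toy stand-in for an NM: tracks GPUs actually held by containers."""

    def __init__(self, total_gpus):
        self.total = total_gpus
        self.in_use = 0

    def start_container(self, gpus):
        # The node refuses to start a container it cannot physically back.
        if self.in_use + gpus > self.total:
            return False
        self.in_use += gpus
        return True

    def finish_cleanup(self, gpus):
        # Called once the killed containers have really exited.
        self.in_use -= gpus


class Scheduler:
    """Toy stand-in for the RM scheduler: its own view of free GPUs."""

    def __init__(self, total_gpus):
        self.free = total_gpus

    def allocate(self, gpus):
        if gpus > self.free:
            return False
        self.free -= gpus
        return True

    def release(self, gpus):
        self.free += gpus


def simulate_kill_race():
    node = Node(6)
    rm = Scheduler(6)

    # App A is allocated, and starts using, all 6 GPUs.
    rm.allocate(6)
    node.start_container(6)

    # App A is killed: the scheduler frees its GPUs immediately,
    # but the node has not finished cleaning up A's containers.
    rm.release(6)

    # App B asks for 3 GPUs; the scheduler believes they are free.
    granted = rm.allocate(3)
    started_before_cleanup = node.start_container(3)

    # Only after the node finishes cleanup can B's container start.
    node.finish_cleanup(6)
    started_after_cleanup = node.start_container(3)

    return granted, started_before_cleanup, started_after_cleanup


if __name__ == "__main__":
    print(simulate_kill_race())
```

In this model the scheduler grants B's request while the node still rejects
the container launch, which is the mismatch the report describes; the launch
only succeeds once the node-side count has caught up.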
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)