Jun Gong created YARN-4148:
------------------------------
Summary: When killing app, RM releases app's resource before they
are released by NM
Key: YARN-4148
URL: https://issues.apache.org/jira/browse/YARN-4148
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
When an app is killed, the RM scheduler releases the app's resources
immediately and may then allocate them to new requests. But the NM may not
have released them yet at that point.
The problem was found when we added support for GPU as a resource (YARN-4122).
Test environment: an NM had 6 GPUs, app A used all 6 GPUs, and app B was
requesting 3 GPUs. We killed app A; the RM then released A's 6 GPUs and
allocated 3 GPUs to B. But when B tried to start a container on the NM, the NM
found it did not have 3 GPUs to allocate because it had not yet released A's
GPUs.
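
To make the ordering concrete, here is a minimal toy model of the scenario above. It is only a sketch and not YARN code: the class and method names (SchedulerView, NodeManagerView, release_on_kill, and so on) are hypothetical and exist just to show that the RM's bookkeeping and the NM's bookkeeping can disagree for a short window after a kill.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceView:
    """Tracks how many GPUs one component believes are in use (toy model)."""
    total_gpus: int
    held: dict = field(default_factory=dict)  # app id -> GPUs held

    def free(self) -> int:
        return self.total_gpus - sum(self.held.values())

    def take(self, app: str, gpus: int) -> bool:
        # Refuse the request if this view does not have enough free GPUs.
        if gpus > self.free():
            return False
        self.held[app] = self.held.get(app, 0) + gpus
        return True

    def release(self, app: str) -> None:
        self.held.pop(app, None)


class SchedulerView(ResourceView):
    """Hypothetical stand-in for the RM scheduler's bookkeeping."""

    def release_on_kill(self, app: str) -> None:
        # The RM frees the app's resources as soon as the kill is requested.
        self.release(app)


class NodeManagerView(ResourceView):
    """Hypothetical stand-in for what the NM has actually released."""


def simulate():
    rm = SchedulerView(total_gpus=6)
    nm = NodeManagerView(total_gpus=6)

    # App A is running and holds all 6 GPUs in both views.
    rm.take("A", 6)
    nm.take("A", 6)

    # App A is killed: the RM releases right away, the NM has not yet.
    rm.release_on_kill("A")
    granted_by_rm = rm.take("B", 3)        # RM sees 6 free GPUs
    started_before = nm.take("B", 3)       # NM still sees 0 free GPUs

    # Later the NM finishes cleaning up A's containers.
    nm.release("A")
    started_after = nm.take("B", 3)

    return granted_by_rm, started_before, started_after


if __name__ == "__main__":
    print(simulate())
```

In this model the RM grants B's request while the NM still counts A's GPUs as in use, so B's first container start fails and only succeeds once the NM has released A's GPUs.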
I think the problem also exists for CPU and memory. It might cause OOM if new
containers are started while the killed app's containers still hold memory.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)