[
https://issues.apache.org/jira/browse/YARN-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated YARN-4148:
-----------------------------
Attachment: free_in_scheduler_but_not_node_prototype-branch-2.7.patch
Sorry for joining the discussion late, as I missed this originally. As I
mentioned in YARN-5290, having the RM wait until the NM confirms container
release can unnecessarily slow down subsequent allocations on other nodes due
to scheduler limits (user limit, queue limit, etc.). We could leverage some
form of NM queuing, but I agree it could be confusing when the AM launches
a container and it doesn't appear to be active afterwards when querying the
node.
We could have the RM wait until it receives hard confirmation from the NM
before it releases the resources associated with a container, but that would
needlessly slow down scheduling in some cases. For example, if a user is at the
scheduler user limit but releases a container on node A, I don't see why we
have to wait until that container is confirmed dead over two subsequent NM
heartbeats (one to tell the NM to shoot it and another to confirm it's dead)
before allowing the user to allocate another container of the same size on node
B. However, I do think it's bad for us to allocate the new container on the same
node as the released one since we can accidentally overwhelm the node if the
old container isn't cleaned up fast enough.
Therefore I propose that we go ahead and let the scheduler queues and user
limit computations update immediately so other nodes can be scheduled, but we
don't release the resources in the SchedulerNode itself until the node confirms
a previously running container is dead. IMHO if the RM ever sees a container in
the RUNNING state on a node, it should never think that node has freed the
resources for that container until the node itself says that container has
completed.
Here's a prototype patch against branch-2.7 that is similar to what we're using
internally to work around this issue. It goes ahead and releases the resources
for running containers in the scheduler bookkeeping (i.e.: cluster resource,
queues, user limits, etc.) but _not_ in the SchedulerNode. So the RM could
allocate those resources elsewhere but not on the current node until the node
reports the container as completed.
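To make the bookkeeping split concrete, here is a small self-contained sketch of
the idea (this is not the patch itself, and all class/method names below are made
up for illustration): the queue/user-limit side is credited as soon as the RM
releases a running container, while the per-node side stays charged until that
node reports the container as complete.
{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DeferredNodeRelease {

  // Resources still charged against each node (SchedulerNode-style accounting).
  private final Map<String, Integer> nodeUsed = new HashMap<>();

  // Containers the RM has released but the node has not yet confirmed are dead.
  private final Map<String, Set<String>> pendingOnNode = new HashMap<>();

  // Cluster/queue/user-limit accounting, collapsed to a single counter here.
  private int queueUsed = 0;

  // Container allocated and later seen RUNNING on the node.
  public void allocate(String node, int size) {
    queueUsed += size;
    nodeUsed.merge(node, size, Integer::sum);
  }

  // RM decides to release a RUNNING container (e.g. the app was killed).
  public void releaseRunningContainer(String node, String containerId, int size) {
    // Queue/user-limit accounting is credited immediately, so the user can get
    // a replacement container on some other node right away.
    queueUsed -= size;
    // The node is NOT credited yet: the container may still be running there,
    // so new allocations on this node must not reuse its resources.
    pendingOnNode.computeIfAbsent(node, n -> new HashSet<>()).add(containerId);
  }

  // NM heartbeat reports that the container has actually completed.
  public void containerCompleted(String node, String containerId, int size) {
    Set<String> pending = pendingOnNode.get(node);
    if (pending != null && pending.remove(containerId)) {
      // Only now is the node's accounting credited.
      nodeUsed.merge(node, -size, Integer::sum);
    }
  }

  // Scheduler check before placing a new container on a node.
  public boolean hasRoom(String node, int nodeCapacity, int requested) {
    return nodeUsed.getOrDefault(node, 0) + requested <= nodeCapacity;
  }
}
{code}
With this split, the user in the example above can immediately get a new
container on node B, but node A cannot be over-allocated while the old container
is still being cleaned up.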
NOTE: with any of these "wait until the node says the container is done"
approaches, it's important to pick up the fix for YARN-5197; otherwise, if the NM
ever skips sending a container completion event, the RM will leak those resources
on the node.
There is an interesting corner case where the RM has handed out a container to
an AM (i.e.: container is in the ACQUIRED state) but it hasn't seen it running
on a node yet. If the container is killed by the RM or AM, there's still a
chance that the container could appear on the node after the RM has considered
those resources freed. We'll have to decide how to handle that race. One way to
solve it is to assume the container's resources could still be "used" until the
RM has had a chance to tell the NM that the container token for that container is
no longer valid and has confirmed in a subsequent NM heartbeat that the container
has not appeared since. Maybe there's a simpler/faster way to safely free the
container's resources for that race condition?
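To make that option concrete, here is a rough sketch of the bookkeeping (again
purely illustrative, with made-up names; it does not propose any particular
RM/NM interface): a container killed while ACQUIRED keeps its resources charged
against the node until the token invalidation has been delivered and a later
heartbeat shows the container still never appeared.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class AcquiredKillTracker {

  enum State { KILLED_WHILE_ACQUIRED, TOKEN_INVALIDATION_SENT }

  // Containers the RM handed out (ACQUIRED) and then killed before ever seeing
  // them RUNNING on a node. Their resources stay charged for now.
  private final Map<String, State> pending = new HashMap<>();

  // RM or AM kills a container that was acquired but never reported running.
  public void killedWhileAcquired(String containerId) {
    pending.put(containerId, State.KILLED_WHILE_ACQUIRED);
  }

  // The "container token no longer valid" order has been delivered to the NM.
  public void tokenInvalidationDelivered(String containerId) {
    pending.replace(containerId, State.TOKEN_INVALIDATION_SENT);
  }

  // A subsequent NM heartbeat arrived with no sign of the container.
  // Returns true only when it is finally safe to free the node's resources.
  public boolean heartbeatWithoutContainer(String containerId) {
    if (pending.get(containerId) == State.TOKEN_INVALIDATION_SENT) {
      pending.remove(containerId);
      return true;
    }
    return false;
  }

  // The container did show up running after all; from here on it is treated
  // like any other running container and freed only when the node reports it
  // complete.
  public void containerAppeared(String containerId) {
    pending.remove(containerId);
  }
}
{code}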
> When killing app, RM releases app's resource before they are released by NM
> ---------------------------------------------------------------------------
>
> Key: YARN-4148
> URL: https://issues.apache.org/jira/browse/YARN-4148
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-4148.001.patch, YARN-4148.wip.patch,
> free_in_scheduler_but_not_node_prototype-branch-2.7.patch
>
>
> When killing an app, the RM scheduler releases the app's resources as soon as
> possible, and then it might allocate these resources to new requests. But the NM
> has not released them at that time.
> The problem was found when we supported GPU as a resource (YARN-4122). Test
> environment: an NM had 6 GPUs, app A used all 6 GPUs, and app B was requesting 3
> GPUs. App A was killed, then the RM released A's 6 GPUs and allocated 3 GPUs to B.
> But when B tried to start its container on the NM, the NM found it didn't have 3
> GPUs to allocate because it had not yet released A's GPUs.
> I think the problem also exists for CPU/memory. It might cause an OOM when
> memory is over-allocated in the same way.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)