Jun Gong created YARN-4148:

             Summary: When killing app, RM releases app's resource before they 
are released by NM
                 Key: YARN-4148
                 URL: https://issues.apache.org/jira/browse/YARN-4148
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: Jun Gong
            Assignee: Jun Gong

When killing an app, the RM scheduler releases the app's resources as soon as 
possible, and may then allocate those resources to new requests. But the NM 
may not have released them yet at that time.

The problem was found when we added support for GPU as a resource (YARN-4122). 
Test environment: an NM had 6 GPUs, app A used all 6 GPUs, and app B was 
requesting 3 GPUs. We killed app A; the RM then released A's 6 GPUs and 
allocated 3 GPUs to B. But when B tried to start a container on the NM, the NM 
found it did not have 3 GPUs to allocate because it had not yet released A's 
GPUs.
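
The sequence above can be sketched as a small simulation. This is only an 
illustrative model, not YARN code: the Scheduler and NodeManager classes, their 
methods, and reproduce() are hypothetical names made up for this example. It 
assumes the RM and the NM each keep their own count of GPUs in use, and that 
the RM drops its count on kill without waiting for the NM.

```python
class NodeManager:
    """Hypothetical stand-in for an NM: tracks GPUs actually held by containers."""

    def __init__(self, total_gpus):
        self.total_gpus = total_gpus
        self.in_use = {}  # app id -> GPUs held by that app's containers

    def free_gpus(self):
        return self.total_gpus - sum(self.in_use.values())

    def start_container(self, app, gpus):
        # Fails if the node has not physically freed enough GPUs yet.
        if gpus > self.free_gpus():
            return False
        self.in_use[app] = self.in_use.get(app, 0) + gpus
        return True

    def cleanup(self, app):
        # Containers of the killed app have finished; GPUs are really free now.
        self.in_use.pop(app, None)


class Scheduler:
    """Hypothetical stand-in for the RM scheduler's view of the same node."""

    def __init__(self, total_gpus):
        self.total_gpus = total_gpus
        self.allocated = {}  # app id -> GPUs the RM believes are allocated

    def free_gpus(self):
        return self.total_gpus - sum(self.allocated.values())

    def allocate(self, app, gpus):
        if gpus > self.free_gpus():
            return False
        self.allocated[app] = self.allocated.get(app, 0) + gpus
        return True

    def kill_app(self, app):
        # Assumption being illustrated: the RM releases immediately,
        # without waiting for the NM to confirm the cleanup.
        self.allocated.pop(app, None)


def reproduce():
    nm = NodeManager(total_gpus=6)
    rm = Scheduler(total_gpus=6)

    # App A takes all 6 GPUs.
    rm.allocate("A", 6)
    nm.start_container("A", 6)

    # App A is killed: the RM frees its bookkeeping right away.
    rm.kill_app("A")
    granted = rm.allocate("B", 3)                      # RM thinks 6 are free

    started_before_cleanup = nm.start_container("B", 3)  # NM still holds A's GPUs
    nm.cleanup("A")
    started_after_cleanup = nm.start_container("B", 3)

    return granted, started_before_cleanup, started_after_cleanup


if __name__ == "__main__":
    print(reproduce())
```

In this model the RM grants B's request, the first container start on the NM 
fails, and the same start succeeds once the NM has cleaned up A's containers.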

I think the same problem exists for CPU and memory. It could cause an OOM if 
memory is overused in this way.

