john lilley created YARN-4449:
---------------------------------

             Summary: ResourceManager can return task container with less than requested memory
                 Key: YARN-4449
                 URL: https://issues.apache.org/jira/browse/YARN-4449
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.3.0
         Environment: Cloudera CDH 5.4.5
            Reporter: john lilley
            Priority: Minor
         Attachments: app312_rm.log

Occasionally, and apparently only when more than one YARN task is running at 
once, a ResourceManager may return a container that was reserved for the AM 
launch, which is smaller than the requested container size for a task.

We observed this as a task failure: the task was killed for exceeding its 
container's memory limit.  When investigating, we found the following had 
happened:
•       Client requests AM launch with 1024MB memory 
•       RM reserves a container _000001 with 1024MB memory
•       RM allocates container _000002 with 1024MB memory and launches the AM 
in it
•       When the AM starts requesting task containers with 2048MB memory, the 
reserved _000001 is still there, and the scheduler returns it, because that’s 
what reserved containers are for.  However, it doesn’t check that the reserved 
container has at least as much memory as is currently being requested.
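
The check that appears to be missing can be sketched as follows.  This is a 
hypothetical model, not YARN scheduler code: the class ReservedContainerCheck, 
the Capability type and the fits method are made up for illustration, and the 
numbers are the ones from the AM log in this report.

```java
// Hypothetical model of the missing guard; none of these names come from YARN.
public class ReservedContainerCheck {

    /** Minimal stand-in for a resource capability: memory in MB and virtual cores. */
    static final class Capability {
        final int memoryMb;
        final int vCores;

        Capability(int memoryMb, int vCores) {
            this.memoryMb = memoryMb;
            this.vCores = vCores;
        }
    }

    /** True only if the offered capability covers the request in every dimension. */
    static boolean fits(Capability offered, Capability requested) {
        return offered.memoryMb >= requested.memoryMb
                && offered.vCores >= requested.vCores;
    }

    public static void main(String[] args) {
        Capability reserved = new Capability(1024, 1);    // reserved container _000001
        Capability taskRequest = new Capability(2000, 0); // task request seen in the AM log
        Capability amRequest = new Capability(1024, 1);   // assumed shape of the AM launch request

        // The reserved container is large enough for the AM but not for the task.
        System.out.println("fits AM request:   " + fits(reserved, amRequest));
        System.out.println("fits task request: " + fits(reserved, taskRequest));
    }
}
```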

This seems to be a timing problem and occurs erratically.  Sorry I could not 
try this on a newer cluster because it is so hard to reproduce.  However, you 
can see in our AM's log where it asks for 2000MB and gets 1024MB:

2015-12-09 02:41:10 INFO net.redpoint.yarnapp.ApplicationMaster: 
TaskLauncher.run: ** STARTING CONTAINER **
  task = Task['([...] containerRequest=Capability[<memory:2000, 
vCores:0>]Priority[0], container=container_1446677679275_0312_01_000001, 
state=new, result=null, diagnostics='null', retries=0]
  container = Container: [ContainerId: container_1446677679275_0312_01_000001, 
NodeId: rpb-cdh-kerb-2.office.datalever.com:8041, NodeHttpAddress: 
rpb-cdh-kerb-2.office.datalever.com:8042, Resource: <memory:1024, vCores:1>, 
Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.58.41:8041 
}, ]

This is probably clearer in the attached snippet of the RM log, where you can 
see this happening with appid 312 (ignore 311, which is also in there).  You 
can see that the RM reserves one container, launches the AM in a second 
container, then later returns the reserved container in response to a task 
container request of 2000MB, so it comes up short.

This is relatively easy to work around (just reject that container and wait for 
another), which is why this is filed as minor priority.  But it seems that YARN 
should give you the memory you requested, and it doesn't in this case.  Perhaps 
this is "as designed", but it is certainly unexpected.

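A minimal sketch of the AM-side workaround, assuming the AM knows how much 
memory it asked for when a container arrives.  The class 
UndersizedContainerFilter, the Offer type and the second container id are 
hypothetical; Offer stands in for a YARN Container, and only the first id and 
the sizes come from the log above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical AM-side helper; Offer is a stand-in for a YARN Container.
public class UndersizedContainerFilter {

    /** Minimal stand-in for an allocated container: its id and memory in MB. */
    static final class Offer {
        final String id;
        final int memoryMb;

        Offer(String id, int memoryMb) {
            this.id = id;
            this.memoryMb = memoryMb;
        }
    }

    /** Returns the offers whose memory is below what was requested. */
    static List<Offer> undersized(List<Offer> offers, int requestedMemoryMb) {
        List<Offer> rejected = new ArrayList<>();
        for (Offer offer : offers) {
            if (offer.memoryMb < requestedMemoryMb) {
                rejected.add(offer);
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        List<Offer> offers = new ArrayList<>();
        // Container id and size taken from the AM log in this report.
        offers.add(new Offer("container_1446677679275_0312_01_000001", 1024));
        // Made-up second offer that is large enough for the request.
        offers.add(new Offer("container_example_000003", 2048));

        for (Offer offer : undersized(offers, 2000)) {
            System.out.println("reject " + offer.id + " (" + offer.memoryMb + "MB < 2000MB)");
        }
    }
}
```

In a real AM the rejected containers would presumably be handed back to the RM 
(for example with AMRMClient.releaseAssignedContainer) and the original request 
left outstanding or re-issued; that part is not shown here.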


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
