[ 
https://issues.apache.org/jira/browse/YARN-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

john lilley updated YARN-4449:
------------------------------
    Attachment: app312_rm.log

> ResourceManager can return task container with less than requested memory
> -------------------------------------------------------------------------
>
>                 Key: YARN-4449
>                 URL: https://issues.apache.org/jira/browse/YARN-4449
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>         Environment: Cloudera CDH 5.4.5
>            Reporter: john lilley
>            Priority: Minor
>         Attachments: app312_rm.log
>
>
> Occasionally, and apparently only when more than one YARN task is running at 
> once, the ResourceManager may return a container that was reserved for the AM 
> launch and is smaller than the requested container size for a task.
> We observed this as a failure: a task was killed for exceeding its memory 
> limit.  When investigating, we found that the following had happened:
> • Client requests AM launch with 1024MB memory
> • RM reserves container _000001 with 1024MB memory
> • RM allocates container _000002 with 1024MB memory and launches the AM in it
> • When the AM starts requesting task containers with 2048MB memory, the 
> reserved _000001 is still there, and the scheduler returns it, because that's 
> what reserved containers are for.  However, it does not check that the 
> reserved container has as much memory as is currently being requested.
> This seems to be a timing problem and occurs erratically.  Sorry, I could not 
> try this on a newer cluster because it is so hard to reproduce.  However, you 
> can see in our AM's log where it asks for 2000MB and gets 1024MB:
> 2015-12-09 02:41:10 INFO net.redpoint.yarnapp.ApplicationMaster: 
> TaskLauncher.run: ** STARTING CONTAINER **
>   task = Task['([...] containerRequest=Capability[<memory:2000, 
> vCores:0>]Priority[0], container=container_1446677679275_0312_01_000001, 
> state=new, result=null, diagnostics='null', retries=0]
>   container = Container: [ContainerId: 
> container_1446677679275_0312_01_000001, NodeId: 
> rpb-cdh-kerb-2.office.datalever.com:8041, NodeHttpAddress: 
> rpb-cdh-kerb-2.office.datalever.com:8042, Resource: <memory:1024, vCores:1>, 
> Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.58.41:8041 
> }, ]
> This is probably clearer in the attached snippet of the RM log, where you can 
> see this happening with appid 312 (ignore appid 311, which is also in there).  
> You can see that the RM reserves one container, launches the AM in a second, 
> and then later returns the reserved container in response to a task container 
> request of 2000MB, so the task comes up short on memory.
> This is relatively easy to work around (just reject that container and wait 
> for another), which is why this is of minor importance.  But it seems that 
> YARN should give you the memory you requested, and it doesn't in this case.  
> Perhaps this is "as designed", but it is certainly unexpected.
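The workaround described above (reject the undersized container and wait for another) can be sketched as follows. This is a standalone illustration, not code from the reporter's AM: in a real AM the check would sit in the AMRMClient allocation callback, comparing Container.getResource().getMemory() against the requested capability and releasing undersized containers via AMRMClient.releaseAssignedContainer(). The helper below uses plain values so it runs without a cluster or Hadoop on the classpath.

```java
// Minimal sketch of the client-side workaround: verify an allocated
// container's memory before launching, and reject it if undersized.
// In a real AM this comparison would use Container.getResource().getMemory()
// inside the onContainersAllocated callback; plain longs stand in here.
public class ContainerMemoryCheck {

    // Returns true when the allocated container satisfies the request.
    static boolean isUsable(long allocatedMb, long requestedMb) {
        return allocatedMb >= requestedMb;
    }

    public static void main(String[] args) {
        // The failure from this report: requested 2000MB, but the RM handed
        // back the 1024MB container originally reserved for the AM.
        System.out.println(isUsable(1024, 2000)); // prints false -> reject, re-request
        System.out.println(isUsable(2048, 2000)); // prints true  -> safe to launch
    }
}
```

After rejecting, the AM would re-add the original ContainerRequest so the RM eventually allocates a container of the proper size.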



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
