[ https://issues.apache.org/jira/browse/YARN-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
john lilley updated YARN-4449:
------------------------------
    Attachment: app312_rm.log

> ResourceManager can return task container with less than requested memory
> -------------------------------------------------------------------------
>
>                 Key: YARN-4449
>                 URL: https://issues.apache.org/jira/browse/YARN-4449
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>         Environment: Cloudera CDH 5.4.5
>            Reporter: john lilley
>            Priority: Minor
>         Attachments: app312_rm.log
>
>
> Occasionally, and apparently only when more than one YARN task is running at
> once, a ResourceManager may return a container that was reserved for the AM
> launch, which is smaller than the requested container size for a task.
> We observed this as a failure: a task was killed for exceeding its memory
> limit. When investigating, we found the following had happened:
> • Client requests AM launch with 1024MB memory
> • RM reserves a container _000001 with 1024MB memory
> • RM allocates container _000002 with 1024MB memory and launches the AM
> in that
> • When the AM starts requesting task containers with 2000MB memory, the
> reserved _000001 is still there, and the scheduler returns it, because that’s
> what reserved containers are for. However, it doesn’t check that the reserved
> container has as much memory as is currently being requested.
> This seems to be a timing problem and occurs erratically. Sorry I could not
> try this on a newer cluster because it is so hard to reproduce. However, you
> can see in our AM's log where it asks for 2000MB and gets 1024MB:
> 2015-12-09 02:41:10 INFO net.redpoint.yarnapp.ApplicationMaster:
> TaskLauncher.run: ** STARTING CONTAINER **
> task = Task['([...]
> containerRequest=Capability[<memory:2000, vCores:0>]Priority[0],
> container=container_1446677679275_0312_01_000001,
> state=new, result=null, diagnostics='null', retries=0]
> container = Container: [ContainerId:
> container_1446677679275_0312_01_000001, NodeId:
> rpb-cdh-kerb-2.office.datalever.com:8041, NodeHttpAddress:
> rpb-cdh-kerb-2.office.datalever.com:8042, Resource: <memory:1024, vCores:1>,
> Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.58.41:8041
> }, ]
> This is probably clearer in the attached snippet of RM log, where you can
> see this happening with appid 312 (ignore 311, which is also in there). You
> can see that the RM reserves one container, launches the AM in a second, then
> later returns the reserved container in response to a task container request
> of 2000MB, so it comes up short.
> This is relatively easy to work around (just reject that container and wait
> for another), which is why this is of minor importance. But it seems that YARN
> should give you the memory you requested, and it doesn't in this case.
> Perhaps this is "as designed", but it is certainly unexpected.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
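
The workaround described in the report (reject an undersized container and wait for another) might look roughly like the sketch below. It is a minimal illustration, not the reporter's code: `AllocatedContainer` and `UndersizedContainerFilter` are hypothetical stand-ins so the example compiles without Hadoop on the classpath. In a real ApplicationMaster the memory figure would presumably come from the allocated container's `Resource`, and each rejected container would be handed back to the RM (for example through `AMRMClient.releaseAssignedContainer`) while the original request stays outstanding.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an allocated container: just its id and memory in MB. */
final class AllocatedContainer {
    final String id;
    final int memoryMb;

    AllocatedContainer(String id, int memoryMb) {
        this.id = id;
        this.memoryMb = memoryMb;
    }
}

/** Hypothetical helper that flags containers smaller than what was requested. */
final class UndersizedContainerFilter {
    private UndersizedContainerFilter() {}

    /** True when the container has at least the requested amount of memory. */
    static boolean satisfies(int requestedMb, AllocatedContainer container) {
        return container.memoryMb >= requestedMb;
    }

    /**
     * Returns the ids of allocated containers that fall short of the request.
     * The caller is expected to release these and keep waiting for a
     * replacement, rather than launching a task in them.
     */
    static List<String> idsToReject(int requestedMb, List<AllocatedContainer> allocated) {
        List<String> rejected = new ArrayList<>();
        for (AllocatedContainer container : allocated) {
            if (!satisfies(requestedMb, container)) {
                rejected.add(container.id);
            }
        }
        return rejected;
    }
}
```

With the values from the log above, a 2000MB request checked against the 1024MB container `container_1446677679275_0312_01_000001` would be flagged for rejection, so the task would never start in it and be killed for memory overuse.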