[
https://issues.apache.org/jira/browse/YARN-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe resolved YARN-2283.
------------------------------
Resolution: Duplicate
Yes, it is very likely a duplicate of MAPREDUCE-5888, especially since it no
longer reproduces on later releases. Resolving as a duplicate.
The RM is not failing to release the container; rather, the RM is intentionally
giving the AM some time to clean things up after unregistering (i.e., the
FINISHING state). Unfortunately, before MAPREDUCE-5888 was fixed, the AM could
hang during a failed job because a lingering non-daemon thread prevented the
JVM from shutting down. The RM eventually decides that the AM has taken too
long to clean up and kills it.
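For illustration only (this is not code from MRAppMaster or the MAPREDUCE-5888
patch), here is a minimal, self-contained sketch of that failure mode: a
non-daemon thread that nothing ever stops keeps the JVM alive even after the
main/shutdown path has finished its work.
{code}
// Hypothetical demo class, not part of Hadoop. It shows why a forgotten
// non-daemon thread prevents the JVM from exiting on its own.
public class LingeringThreadDemo {
  public static void main(String[] args) {
    Thread lingering = new Thread(() -> {
      while (true) {
        try {
          Thread.sleep(60_000);   // e.g. a poller or event loop nobody shut down
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }, "lingering-non-daemon-thread");
    // lingering.setDaemon(true);  // with this, the JVM would exit normally
    lingering.start();

    System.out.println("main() is done, but the process keeps running");
    // Without marking the thread as a daemon, interrupting it, or calling
    // System.exit(), the JVM hangs here until something external kills it.
  }
}
{code}
In that situation the process only goes away once the RM's grace period for the
FINISHING attempt expires and the container is killed; as far as I recall that
grace period is tied to the AM liveness monitor expiry
(yarn.am.liveness-monitor.expiry-interval-ms), but treat that mapping as an
assumption rather than something stated in this issue.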
> RM failed to release the AM container
> -------------------------------------
>
> Key: YARN-2283
> URL: https://issues.apache.org/jira/browse/YARN-2283
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Environment: NM1: AM running
> NM2: Map task running
> mapreduce.map.maxattempts=1
> Reporter: Nishan Shetty
> Priority: Critical
>
> During a container stability test I faced this problem:
> while the job was running, a map task got killed.
> Observed that even though the application is FAILED, the MRAppMaster process keeps
> running till timeout because the RM did not release the AM container
> {code}
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_1405318134611_0002_01_000005 Container Transitioned from RUNNING to
> COMPLETED
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
> Completed container: container_1405318134611_0002_01_000005 in state:
> COMPLETED event:FINISHED
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=testos
> OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
> APPID=application_1405318134611_0002
> CONTAINERID=container_1405318134611_0002_01_000005
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
> Finish information of container container_1405318134611_0002_01_000005 is
> written
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Stored the finish data of container container_1405318134611_0002_01_000005
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode:
> Released container container_1405318134611_0002_01_000005 of capacity
> <memory:1024, vCores:1> on host HOST-10-18-40-153:45026, which currently has
> 1 containers, <memory:2048, vCores:1> used and <memory:6144, vCores:7>
> available, release resources=true
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> default used=<memory:2048, vCores:1> numContainers=1 user=testos
> user-resources=<memory:2048, vCores:1>
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> completedContainer container=Container: [ContainerId:
> container_1405318134611_0002_01_000005, NodeId: HOST-10-18-40-153:45026,
> NodeHttpAddress: HOST-10-18-40-153:45025, Resource: <memory:1024, vCores:1>,
> Priority: 5, Token: Token { kind: ContainerToken, service: 10.18.40.153:45026
> }, ] queue=default: capacity=1.0, absoluteCapacity=1.0,
> usedResources=<memory:2048, vCores:1>, usedCapacity=0.25,
> absoluteUsedCapacity=0.25, numApps=1, numContainers=1 cluster=<memory:8192,
> vCores:8>
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> completedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25
> used=<memory:2048, vCores:1> cluster=<memory:8192, vCores:8>
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> Re-sorting completed queue: root.default stats: default: capacity=1.0,
> absoluteCapacity=1.0, usedResources=<memory:2048, vCores:1>,
> usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=1
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Application attempt appattempt_1405318134611_0002_000001 released container
> container_1405318134611_0002_01_000005 on node: host: HOST-10-18-40-153:45026
> #containers=1 available=6144 used=2048 with event: FINISHED
> 2014-07-14 14:43:34,924 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> Updating application attempt appattempt_1405318134611_0002_000001 with final
> state: FINISHING
> 2014-07-14 14:43:34,924 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1405318134611_0002_000001 State change from RUNNING to FINAL_SAVING
> 2014-07-14 14:43:34,924 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating
> application application_1405318134611_0002 with final state: FINISHING
> 2014-07-14 14:43:34,947 INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
> Watcher event type: NodeDataChanged with state:SyncConnected for
> path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1405318134611_0002/appattempt_1405318134611_0002_000001
> for Service
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2014-07-14 14:43:34,947 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1405318134611_0002 State change from RUNNING to FINAL_SAVING
> 2014-07-14 14:43:34,947 INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing
> info for app: application_1405318134611_0002
> 2014-07-14 14:43:34,947 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1405318134611_0002_000001 State change from FINAL_SAVING to
> FINISHING
> 2014-07-14 14:43:35,012 INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
> Watcher event type: NodeDataChanged with state:SyncConnected for
> path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1405318134611_0002 for
> Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore
> in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
> STARTED
> 2014-07-14 14:43:35,013 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1405318134611_0002 State change from FINAL_SAVING to FINISHING
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)