[ https://issues.apache.org/jira/browse/YARN-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe resolved YARN-2283.
------------------------------
    Resolution: Duplicate

Yes, it is very likely a duplicate of MAPREDUCE-5888, especially since it no longer reproduces on later releases. Resolving as a duplicate. The RM is not failing to release the container; rather, the RM is intentionally giving the AM some time to clean things up after unregistering (i.e.: the FINISHING state). Unfortunately, before MAPREDUCE-5888 was fixed the AM could hang during a failed job because a lingering non-daemon thread prevented the JVM from shutting down. The RM eventually decides that the AM has taken too long to clean up and kills it.

> RM failed to release the AM container
> -------------------------------------
>
>                 Key: YARN-2283
>                 URL: https://issues.apache.org/jira/browse/YARN-2283
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>         Environment: NM1: AM running
> NM2: Map task running
> mapreduce.map.maxattempts=1
>            Reporter: Nishan Shetty
>            Priority: Critical
>
> During a container stability test I hit this problem: while the job was running, a map task got killed.
> Observe that even though the application is FAILED, the MRAppMaster process keeps running until the timeout because the RM did not release the AM container.
> {code}
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1405318134611_0002_01_000005 Container Transitioned from RUNNING to COMPLETED
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1405318134611_0002_01_000005 in state: COMPLETED event:FINISHED
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=testos OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1405318134611_0002 CONTAINERID=container_1405318134611_0002_01_000005
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Finish information of container container_1405318134611_0002_01_000005 is written
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: Stored the finish data of container container_1405318134611_0002_01_000005
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Released container container_1405318134611_0002_01_000005 of capacity <memory:1024, vCores:1> on host HOST-10-18-40-153:45026, which currently has 1 containers, <memory:2048, vCores:1> used and <memory:6144, vCores:7> available, release resources=true
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:2048, vCores:1> numContainers=1 user=testos user-resources=<memory:2048, vCores:1>
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1405318134611_0002_01_000005, NodeId: HOST-10-18-40-153:45026, NodeHttpAddress: HOST-10-18-40-153:45025, Resource: <memory:1024, vCores:1>, Priority: 5, Token: Token { kind: ContainerToken, service: 10.18.40.153:45026 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2048, vCores:1>, usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=1 cluster=<memory:8192, vCores:8>
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used=<memory:2048, vCores:1> cluster=<memory:8192, vCores:8>
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2048, vCores:1>, usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=1
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1405318134611_0002_000001 released container container_1405318134611_0002_01_000005 on node: host: HOST-10-18-40-153:45026 #containers=1 available=6144 used=2048 with event: FINISHED
> 2014-07-14 14:43:34,924 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1405318134611_0002_000001 with final state: FINISHING
> 2014-07-14 14:43:34,924 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1405318134611_0002_000001 State change from RUNNING to FINAL_SAVING
> 2014-07-14 14:43:34,924 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1405318134611_0002 with final state: FINISHING
> 2014-07-14 14:43:34,947 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: NodeDataChanged with state:SyncConnected for path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1405318134611_0002/appattempt_1405318134611_0002_000001 for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2014-07-14 14:43:34,947 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1405318134611_0002 State change from RUNNING to FINAL_SAVING
> 2014-07-14 14:43:34,947 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1405318134611_0002
> 2014-07-14 14:43:34,947 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1405318134611_0002_000001 State change from FINAL_SAVING to FINISHING
> 2014-07-14 14:43:35,012 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: NodeDataChanged with state:SyncConnected for path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1405318134611_0002 for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2014-07-14 14:43:35,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1405318134611_0002 State change from FINAL_SAVING to FINISHING
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
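The hang mode the resolution describes can be illustrated with a minimal standalone sketch (this is not the AM's actual code, and the class name is hypothetical): in Java, the JVM stays alive after main() returns as long as any non-daemon thread is still running, which is exactly how a leftover thread in the MR AM could keep the process up until the RM's FINISHING-state grace period expired and it killed the container.

```java
// Minimal sketch of the non-daemon-thread hang (hypothetical example,
// not the MRAppMaster's real code).
public class NonDaemonHang {
    public static void main(String[] args) {
        // Threads are non-daemon (user threads) by default; the JVM will
        // not exit while one is still running, even after main() returns.
        Thread lingering = new Thread(() -> {
            try {
                Thread.sleep(2_000); // stand-in for cleanup work that, in the
                                     // buggy AM, effectively never completed
            } catch (InterruptedException ignored) {
            }
        });
        // Marking the thread as a daemon (the essence of the kind of fix
        // made in MAPREDUCE-5888) would let the JVM exit immediately:
        // lingering.setDaemon(true);
        lingering.start();
        System.out.println("main() returned; process lives until the thread dies");
    }
}
```

With the `setDaemon(true)` line commented out, the process outlives main() by the sleep duration; uncommenting it lets the JVM exit right away, since daemon threads do not block shutdown.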