[
https://issues.apache.org/jira/browse/YARN-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe resolved YARN-2283.
------------------------------
Resolution: Duplicate
Yes, it is very likely a duplicate of MAPREDUCE-5888, especially since it no
longer reproduces on later releases. Resolving as a duplicate.
The RM is not failing to release the container; rather, the RM is intentionally
giving the AM some time to clean things up after unregistering (i.e., the
FINISHING state). Unfortunately, before MAPREDUCE-5888 was fixed, the AM could
hang during a failed job because a lingering non-daemon thread prevented the
JVM from shutting down. The RM eventually decides that the AM has taken too
long to clean up and kills it.
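For illustration only (this is not code from MRAppMaster or the MAPREDUCE-5888
patch), here is a minimal, self-contained sketch of that failure mode: a
non-daemon thread that nothing ever stops keeps the JVM alive even after the
main/shutdown path has finished its work.
{code}
// Hypothetical demo class, not part of Hadoop. It shows why a forgotten
// non-daemon thread prevents the JVM from exiting on its own.
public class LingeringThreadDemo {
  public static void main(String[] args) {
    Thread lingering = new Thread(() -> {
      while (true) {
        try {
          Thread.sleep(60_000);   // e.g. a poller or event loop nobody shut down
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }, "lingering-non-daemon-thread");
    // lingering.setDaemon(true);  // with this, the JVM would exit normally
    lingering.start();

    System.out.println("main() is done, but the process keeps running");
    // Without marking the thread as a daemon, interrupting it, or calling
    // System.exit(), the JVM hangs here until something external kills it.
  }
}
{code}
In that situation the process only goes away once the RM's grace period for the
FINISHING attempt expires and the container is killed; as far as I recall that
grace period is tied to the AM liveness monitor expiry
(yarn.am.liveness-monitor.expiry-interval-ms), but treat that mapping as an
assumption rather than something stated in this issue.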
> RM failed to release the AM container
> -------------------------------------
>
> Key: YARN-2283
> URL: https://issues.apache.org/jira/browse/YARN-2283
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Environment: NM1: AM running
> NM2: Map task running
> mapreduce.map.maxattempts=1
> Reporter: Nishan Shetty
> Priority: Critical
>
> During a container stability test I faced this problem:
> while the job was running, a map task got killed.
> Observed that even though the application is FAILED, the MRAppMaster process keeps
> running till timeout because the RM did not release the AM container
> {code}
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_1405318134611_0002_01_000005 Container Transitioned from RUNNING to
> COMPLETED
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
> Completed container: container_1405318134611_0002_01_000005 in state:
> COMPLETED event:FINISHED
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=testos
> OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
> APPID=application_1405318134611_0002
> CONTAINERID=container_1405318134611_0002_01_000005
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
> Finish information of container container_1405318134611_0002_01_000005 is
> written
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
> Stored the finish data of container container_1405318134611_0002_01_000005
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode:
> Released container container_1405318134611_0002_01_000005 of capacity
> <memory:1024, vCores:1> on host HOST-10-18-40-153:45026, which currently has
> 1 containers, <memory:2048, vCores:1> used and <memory:6144, vCores:7>
> available, release resources=true
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> default used=<memory:2048, vCores:1> numContainers=1 user=testos
> user-resources=<memory:2048, vCores:1>
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> completedContainer container=Container: [ContainerId:
> container_1405318134611_0002_01_000005, NodeId: HOST-10-18-40-153:45026,
> NodeHttpAddress: HOST-10-18-40-153:45025, Resource: <memory:1024, vCores:1>,
> Priority: 5, Token: Token { kind: ContainerToken, service: 10.18.40.153:45026
> }, ] queue=default: capacity=1.0, absoluteCapacity=1.0,
> usedResources=<memory:2048, vCores:1>, usedCapacity=0.25,
> absoluteUsedCapacity=0.25, numApps=1, numContainers=1 cluster=<memory:8192,
> vCores:8>
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> completedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25
> used=<memory:2048, vCores:1> cluster=<memory:8192, vCores:8>
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> Re-sorting completed queue: root.default stats: default: capacity=1.0,
> absoluteCapacity=1.0, usedResources=<memory:2048, vCores:1>,
> usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=1
> 2014-07-14 14:43:33,899 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Application attempt appattempt_1405318134611_0002_000001 released container
> container_1405318134611_0002_01_000005 on node: host: HOST-10-18-40-153:45026
> #containers=1 available=6144 used=2048 with event: FINISHED
> 2014-07-14 14:43:34,924 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> Updating application attempt appattempt_1405318134611_0002_000001 with final
> state: FINISHING
> 2014-07-14 14:43:34,924 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1405318134611_0002_000001 State change from RUNNING to FINAL_SAVING
> 2014-07-14 14:43:34,924 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating
> application application_1405318134611_0002 with final state: FINISHING
> 2014-07-14 14:43:34,947 INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
> Watcher event type: NodeDataChanged with state:SyncConnected for
> path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1405318134611_0002/appattempt_1405318134611_0002_000001
> for Service
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2014-07-14 14:43:34,947 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1405318134611_0002 State change from RUNNING to FINAL_SAVING
> 2014-07-14 14:43:34,947 INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing
> info for app: application_1405318134611_0002
> 2014-07-14 14:43:34,947 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1405318134611_0002_000001 State change from FINAL_SAVING to
> FINISHING
> 2014-07-14 14:43:35,012 INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
> Watcher event type: NodeDataChanged with state:SyncConnected for
> path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1405318134611_0002 for
> Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore
> in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
> STARTED
> 2014-07-14 14:43:35,013 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1405318134611_0002 State change from FINAL_SAVING to FINISHING
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)