[jira] [Commented] (YARN-2283) RM failed to release the AM container
[ https://issues.apache.org/jira/browse/YARN-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14081170#comment-14081170 ]

Sunil G commented on YARN-2283:
-------------------------------

Thank you [~jlowe]. Yes, I took a thread dump and could see that the ThreadPoolExecutor is still there. I applied the patch and verified that it no longer causes the problem. Thank you.

> RM failed to release the AM container
> -------------------------------------
>
>                 Key: YARN-2283
>                 URL: https://issues.apache.org/jira/browse/YARN-2283
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>         Environment: NM1: AM running
> NM2: Map task running
> mapreduce.map.maxattempts=1
>            Reporter: Nishan Shetty
>            Priority: Critical
>
> During a container stability test I faced this problem.
> While the job was running, a map task got killed.
> Observed that even though the application is FAILED, the MRAppMaster process keeps running
> till timeout because the RM did not release the AM container.
> {code}
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1405318134611_0002_01_05 Container Transitioned from RUNNING to COMPLETED
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1405318134611_0002_01_05 in state: COMPLETED event:FINISHED
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=testos OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1405318134611_0002 CONTAINERID=container_1405318134611_0002_01_05
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Finish information of container container_1405318134611_0002_01_05 is written
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: Stored the finish data of container container_1405318134611_0002_01_05
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Released container container_1405318134611_0002_01_05 of capacity on host HOST-10-18-40-153:45026, which currently has 1 containers, used and available, release resources=true
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used= numContainers=1 user=testos user-resources=
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1405318134611_0002_01_05, NodeId: HOST-10-18-40-153:45026, NodeHttpAddress: HOST-10-18-40-153:45025, Resource: , Priority: 5, Token: Token { kind: ContainerToken, service: 10.18.40.153:45026 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=, usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=1 cluster= vCores:8>
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25 used= cluster=
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=, usedCapacity=0.25, absoluteUsedCapacity=0.25, numApps=1, numContainers=1
> 2014-07-14 14:43:33,899 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1405318134611_0002_01 released container container_1405318134611_0002_01_05 on node: host: HOST-10-18-40-153:45026 #containers=1 available=6144 used=2048 with event: FINISHED
> 2014-07-14 14:43:34,924 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1405318134611_0002_01 with final state: FINISHING
> 2014-07-14 14:43:34,924 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1405318134611_0002_01 State change from RUNNING to FINAL_SAVING
> 2014-07-14 14:43:34,924 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1405318134611_0002 with final state: FINISHING
> 2014-07-14 14:43:34,947 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: NodeDataChanged with state:SyncConnected for path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1405318134611_0002/app
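
The thread-dump observation in the comment above (a ThreadPoolExecutor still present in the process) fits a general JVM behavior that can be sketched in isolation. This is only an illustration, not Hadoop code and not the patch discussed here; the class name `LingeringExecutorSketch` and its helper methods are hypothetical. It shows that worker threads created by the default thread factory are non-daemon, so a pool that is never shut down can keep a process alive, and that an explicit shutdown lets the workers exit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class; it does not exist in Hadoop.
public class LingeringExecutorSketch {

    /** Runs one task on the pool and reports whether its worker thread is a daemon. */
    static boolean workerIsDaemon(ExecutorService pool) throws Exception {
        Callable<Boolean> probe = () -> Thread.currentThread().isDaemon();
        Future<Boolean> result = pool.submit(probe);
        return result.get(5, TimeUnit.SECONDS);
    }

    /** Requests an orderly shutdown and waits for the workers to exit. */
    static boolean shutdownAndWait(ExecutorService pool) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        // Executors.newFixedThreadPool uses the default thread factory,
        // which creates non-daemon threads.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        System.out.println("worker is daemon: " + workerIsDaemon(pool));

        // Without this call the idle worker stays parked in the pool and
        // the JVM does not exit when main returns.
        System.out.println("terminated after shutdown: " + shutdownAndWait(pool));
    }
}
```

In a thread dump, such an idle worker typically shows up as a parked pool thread, which is one way a lingering executor can be spotted.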
[jira] [Commented] (YARN-2283) RM failed to release the AM container
[ https://issues.apache.org/jira/browse/YARN-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080804#comment-14080804 ]

Sunil G commented on YARN-2283:
-------------------------------

This seems to be a duplicate of MAPREDUCE-5888. [~jlowe], could you please confirm whether it is the same issue?
[jira] [Commented] (YARN-2283) RM failed to release the AM container
[ https://issues.apache.org/jira/browse/YARN-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080480#comment-14080480 ]

Nishan Shetty commented on YARN-2283:
-------------------------------------

I checked this issue; it does not occur on trunk. It is reproducible in 2.4.*.
[jira] [Commented] (YARN-2283) RM failed to release the AM container
[ https://issues.apache.org/jira/browse/YARN-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079234#comment-14079234 ]

Sunil G commented on YARN-2283:
-------------------------------

I tried to reproduce this and found that the AM memory is released immediately. Could you please try to reproduce it again and share the exact steps?