[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305098#comment-16305098 ]

stefanlee commented on YARN-3979:
---------------------------------

[~piaoyu zhang] Thanks for this JIRA. Could you please tell me why you raised the following settings?
*yarn.resourcemanager.client.thread-count 50 -> 100*
*yarn.resourcemanager.scheduler.client.thread-count 50 -> 100*
*yarn.resourcemanager.resource-tracker.client.thread-count 50 -> 80*

> Am in ResourceLocalizationService hang 10 min cause RM kill AM
> ---------------------------------------------------------------
>
> Key: YARN-3979
> URL: https://issues.apache.org/jira/browse/YARN-3979
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.2.0
> Environment: CentOS 6.5 Hadoop-2.2.0
> Reporter: zhangyubiao
> Attachments: ERROR103.log
>
>
> 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558_104282_01_01
> 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE)
> 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_01 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700886#comment-14700886 ]

zhangyubiao commented on YARN-3979:
-----------------------------------

Thanks for Rohith Sharma K S's patch. We stopped copying the logs, and the problem went away. We will test the patch in our test environment and, if it is OK, apply it to our production environments. Thank you for your help.
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14682240#comment-14682240 ]

Rohith Sharma K S commented on YARN-3979:
-----------------------------------------

I had a look at the RM logs you shared, and I strongly suspect the cause is the same as in YARN-3990. In the shared log I see the entries below, which indicate that the AsyncDispatcher is overloaded with unnecessary events. Maybe you can apply the patch from YARN-3990 and test it.
{noformat}
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: BJHC-HERA-18352.hadoop.jd.local:50086 Node Transitioned from RUNNING to LOST
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved BJHC-HADOOP-HERA-17280.jd.local to /rack/rack4065
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2515000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2515000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node BJHC-HADOOP-HERA-17280.jd.local(cmPort: 50086 httpPort: 8042) registered with capability: memory:57344, vCores:28, assigned nodeId BJHC-HADOOP-HERA-17280.jd.local:50086
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved BJHC-HERA-164102.hadoop.jd.local to /rack/rack41007
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node BJHC-HERA-164102.hadoop.jd.local(cmPort: 50086 httpPort: 8042) registered with capability: memory:57344, vCores:28, assigned nodeId BJHC-HERA-164102.hadoop.jd.local:50086
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2516000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2516000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node not found resyncing BJHC-HERA-18043.hadoop.jd.local:50086
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2517000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2517000
2015-07-29 01:58:27,113 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2518000
2015-07-29 01:58:27,113 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2518000
2015-07-29 01:58:27,113 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2519000
{noformat}
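The queue sizes in the excerpt above can be pulled out of an RM log mechanically. The following is a minimal sketch, assuming the log4j line layout shown in this thread; the function names `queue_samples` and `growth_per_second` are hypothetical helpers for illustration, not part of Hadoop.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2515000
# (assumption: the timestamp and message layout are as quoted in this thread)
QUEUE_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.yarn\.event\.AsyncDispatcher: "
    r"Size of event-queue is (\d+)"
)


def queue_samples(log_text):
    """Return sorted, de-duplicated (timestamp, queue_size) pairs."""
    samples = set()
    for stamp, size in QUEUE_LINE.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        samples.add((when, int(size)))
    return sorted(samples)


def growth_per_second(samples):
    """Average queue growth between the first and last sample, or None."""
    if len(samples) < 2:
        return None
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    seconds = (t1 - t0).total_seconds()
    if seconds <= 0:
        return None
    return (s1 - s0) / seconds
```

Fed the first entry quoted here (2515000 at 01:58:27,112) and the first entry from the reporter's later excerpt (4924000 at 01:59:21,196), this gives a growth of roughly 44,500 events per second, which is consistent with a dispatcher that cannot drain its queue.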
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647275#comment-14647275 ]

zhangyubiao commented on YARN-3979:
-----------------------------------

I found that the CPU usage and load were high because we used crontab to copy the RM logs. Today we stopped the copy, and the CPU usage and load returned to normal.
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647303#comment-14647303 ]

zhangyubiao commented on YARN-3979:
-----------------------------------

I have just sent you the RM logs.
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647358#comment-14647358 ]

zhangyubiao commented on YARN-3979:
-----------------------------------

Also, today we found that the YARN reserved memory is very large and AMs are stuck at launch.
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646510#comment-14646510 ]

Rohith Sharma K S commented on YARN-3979:
-----------------------------------------

Oops, 50 lakh (5 million) events!

I checked the attached logs, but since you attached only ERROR logs, I was not able to trace the problem. One observation is that there are many InvalidStateTransitions events for CLEAN_UP in RMNodeImpl.
# Could you provide the RM logs? If you are not able to attach them to the JIRA, could you send them to me by mail?
# Could you give more information: What is the cluster size? How many apps are running? How many have completed? What is the state of the NodeManagers, i.e. are they running or in some other state? Which version of Hadoop are you using?
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647095#comment-14647095 ]

zhangyubiao commented on YARN-3979:
-----------------------------------

The cluster has about 1600 nodes and about 550 apps running; 2 lakh apps have completed. At one point all the NodeManagers were lost and then recovered after a moment. I use Hadoop 2.2.0 on CentOS 6.5.
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647130#comment-14647130 ]

zhangyubiao commented on YARN-3979:
-----------------------------------

I have sent you an email with the RM jstack log, and I will send you the app log soon.
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647177#comment-14647177 ]

Rohith Sharma K S commented on YARN-3979:
-----------------------------------------

Thanks for the information!!
bq. NodeManager in one times all lost and recovery for a moment

This scenario looks very close to YARN-3990. Since you have 2 lakh completed apps and 1600 NodeManagers, when all the nodes were lost and reconnected, the number of events generated is {{(2 lakh completed + 550 running = 200550) * 1600 (number of NodeManagers) = 320880000}} events. Oops!!!
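The estimate in the comment above is a simple product, and working the figures through gives 320,880,000 events. The sketch below only restates that arithmetic; the multiplication model (every NodeManager reporting every application once on reconnect) is the commenter's reasoning, and the function name `resync_event_estimate` is a hypothetical label for illustration.

```python
LAKH = 100_000  # one lakh is 100,000


def resync_event_estimate(completed_apps, running_apps, node_managers):
    """Events queued if each NodeManager triggers one event per known application."""
    return (completed_apps + running_apps) * node_managers


# Figures reported in this thread: 2 lakh completed apps, 550 running, 1600 NodeManagers.
estimate = resync_event_estimate(2 * LAKH, 550, 1600)
print(estimate)  # 320880000
```

Even a fraction of that total would account for an event queue in the millions, as reported in the RM logs.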
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645601#comment-14645601 ]

zhangyubiao commented on YARN-3979:
-----------------------------------

Thank you for the reply, @Rohith Sharma K S.
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645535#comment-14645535 ]

Rohith Sharma K S commented on YARN-3979:
-----------------------------------------

How many applications have completed? How many applications are running? How many NMs are running? When does the event queue become full? Have you made any other observations?
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645381#comment-14645381 ]

zhangyubiao commented on YARN-3979:
-----------------------------------

I found the RM hung and captured a pstack at that time.

Thread 370 (Thread 0x7f263e4f1700 (LWP 35718)):
#0 0x003abf40b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f263ec01f8e in os::PlatformEvent::park() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#2 0x7f263ebd3985 in Monitor::IWait(Thread*, long) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#3 0x7f263ebd3fed in Monitor::wait(bool, long, bool) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#4 0x7f263ed101f5 in Threads::destroy_vm() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#5 0x7f263ea0c97b in jni_DestroyJavaVM () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#6 0x4000223f in JavaMain ()
#7 0x003abf407851 in start_thread () from /lib64/libpthread.so.0
#8 0x003abece811d in clone () from /lib64/libc.so.6

Thread 369 (Thread 0x7f263dfa1700 (LWP 35719)):
#0 0x003abf40b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f263ec01f8e in os::PlatformEvent::park() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#2 0x7f263ebd3985 in Monitor::IWait(Thread*, long) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#3 0x7f263ebd414e in Monitor::wait(bool, long, bool) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#4 0x7f263ed67668 in GangWorker::loop() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#5 0x7f263ed675b4 in GangWorker::run() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#6 0x7f263ec0296f in java_start(Thread*) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#7 0x003abf407851 in start_thread () from /lib64/libpthread.so.0
#8 0x003abece811d in clone () from /lib64/libc.so.6

Thread 368 (Thread 0x7f263dea0700 (LWP 35720)):
#0 0x003abf40b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f263ec01f8e in os::PlatformEvent::park() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#2 0x7f263ebd3985 in Monitor::IWait(Thread*, long) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#3 0x7f263ebd414e in Monitor::wait(bool, long, bool) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#4 0x7f263ed67668 in GangWorker::loop() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#5 0x7f263ed675b4 in GangWorker::run() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#6 0x7f263ec0296f in java_start(Thread*) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#7 0x003abf407851 in start_thread () from /lib64/libpthread.so.0
#8 0x003abece811d in clone () from /lib64/libc.so.6

Thread 367 (Thread 0x7f263dd9f700 (LWP 35721)):
#0 0x003abf40b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f263ec01f8e in os::PlatformEvent::park() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#2 0x7f263ebd3985 in Monitor::IWait(Thread*, long) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#3 0x7f263ebd414e in Monitor::wait(bool, long, bool) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#4 0x7f263ed67668 in GangWorker::loop() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#5 0x7f263ed675b4 in GangWorker::run() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#6 0x7f263ec0296f in java_start(Thread*) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#7 0x003abf407851 in start_thread () from /lib64/libpthread.so.0
#8 0x003abece811d in clone () from /lib64/libc.so.6

Thread 366 (Thread 0x7f263dc9e700 (LWP 35722)):
#0 0x003abf40b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x7f263ec01f8e in os::PlatformEvent::park() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#2 0x7f263ebd3985 in Monitor::IWait(Thread*, long) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#3 0x7f263ebd414e in Monitor::wait(bool, long, bool) () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#4 0x7f263ed67668 in GangWorker::loop() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#5 0x7f263ed675b4 in GangWorker::run() () from /software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#6 0x7f263ec0296f in java_start(Thread*) () from
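A pstack dump of a busy JVM can run to hundreds of threads, many of them with identical native stacks, so grouping threads by stack helps to see what is distinctive. The following is a rough sketch, assuming the header and frame layout shown in the dump above; `group_threads_by_stack` is a hypothetical helper, not an existing tool.

```python
import re
from collections import Counter

# Header such as: Thread 370 (Thread 0x7f263e4f1700 (LWP 35718)):
THREAD_HEADER = re.compile(r"Thread \d+ \(Thread 0x[0-9a-fA-F]+ \(LWP \d+\)\):")
# Frame such as: #4 0x7f263ed67668 in GangWorker::loop() () from .../libjvm.so
# Only the function name (up to the first space or parenthesis) is captured.
FRAME = re.compile(r"#\d+\s+0x[0-9a-fA-F]+ in ([^\s(]+)")


def group_threads_by_stack(pstack_text):
    """Count threads per distinct call stack (function names, innermost first)."""
    counts = Counter()
    # Splitting on the header leaves one chunk of frames per thread;
    # the first piece is whatever preceded the first header, so it is skipped.
    for chunk in THREAD_HEADER.split(pstack_text)[1:]:
        counts[tuple(FRAME.findall(chunk))] += 1
    return counts
```

Calling `group_threads_by_stack(text).most_common()` lists the most frequent stacks first. In the portion quoted here, the frames appear to be JVM-internal waits (`Threads::destroy_vm` and `GangWorker::loop`), so a Java-level dump from jstack is likely to be more informative for the dispatcher thread.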
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645487#comment-14645487 ]

zhangyubiao commented on YARN-3979:
-----------------------------------

At first I found that some jobs hung for 10 minutes, so we changed
yarn.resourcemanager.client.thread-count 50 -> 100
yarn.resourcemanager.scheduler.client.thread-count 50 -> 100
yarn.resourcemanager.resource-tracker.client.thread-count 50 -> 80
A few days later we found that the load and CPU usage on the RM machine were sometimes high, and I found in the RM logs that the event queue had become very large:

2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4924000
2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4925000
2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4926000
2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4927000
2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4928000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4929000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4930000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4930000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4931000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4932000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4933000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4934000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4935000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4936000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4937000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4938000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4939000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4940000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4941000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4942000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4943000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4944000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4945000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4946000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4947000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4948000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4949000
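For reference, the three thread-count changes described in the comment above would normally live in `yarn-site.xml`. The sketch below renders them as `<property>` elements; it assumes the standard Hadoop `<configuration>`/`<property>` layout, and `render_yarn_site` is a hypothetical helper used only to show the shape of the file.

```python
import xml.etree.ElementTree as ET

# New values reported in this thread; each was raised from 50.
OVERRIDES = {
    "yarn.resourcemanager.client.thread-count": 100,
    "yarn.resourcemanager.scheduler.client.thread-count": 100,
    "yarn.resourcemanager.resource-tracker.client.thread-count": 80,
}


def render_yarn_site(overrides):
    """Build a <configuration> document with one <property> per override."""
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


print(render_yarn_site(OVERRIDES))
```

These settings size the RM's RPC handler pools. They do not change how fast the AsyncDispatcher drains its event queue, which may be why raising them did not relieve the backlog shown in the logs.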
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645404#comment-14645404 ]

Rohith Sharma K S commented on YARN-3979:
-----------------------------------------

[~piaoyu zhang] In the description you have given NM logs, but in the previous comment you gave a stack trace of the RM. It would be easier to analyze if you could provide more information, such as the RM logs, the NM logs, and the AM logs (if the AM started). An NM stack trace would also help a lot, since it is the NM side that stalls for 10 minutes.