[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2017-12-27 Thread stefanlee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305098#comment-16305098
 ] 

stefanlee commented on YARN-3979:
-

[~piaoyu zhang] Thanks for this JIRA. Could you please tell me why you changed the following settings?
*yarn.resourcemanager.client.thread-count 50 -> 100*
*yarn.resourcemanager.scheduler.client.thread-count 50 -> 100*
*yarn.resourcemanager.resource-tracker.client.thread-count 50 -> 80*

> Am in ResourceLocalizationService hang 10 min cause RM kill  AM
> ---
>
> Key: YARN-3979
> URL: https://issues.apache.org/jira/browse/YARN-3979
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.2.0
> Environment: CentOS 6.5  Hadoop-2.2.0
>Reporter: zhangyubiao
> Attachments: ERROR103.log
>
>
> 2015-07-27 02:46:17,348 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1437735375558
> _104282_01_01
> 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE)
> 2015-07-27 02:56:18,510 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for appattempt_1437735375558_104282_0
> 1 (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.api.ContainerManagementProtocolPB



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-08-18 Thread zhangyubiao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700886#comment-14700886
 ] 

zhangyubiao commented on YARN-3979:
---

Thanks for Rohith Sharma K S's patch. We stopped copying the logs and the 
problem went away. We will test the patch in our test environment and, if it 
works, apply it to our production environment. Thank you for your help.



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-08-11 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14682240#comment-14682240
 ] 

Rohith Sharma K S commented on YARN-3979:
-

I had a look at the shared RM logs, and I strongly suspect the cause is the 
same as in YARN-3990.
The log lines below indicate that the AsyncDispatcher is overloaded with 
unnecessary events. Maybe you can apply the YARN-3990 patch and test it.
{noformat}
2015-07-29 01:58:27,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
BJHC-HERA-18352.hadoop.jd.local:50086 Node Transitioned from RUNNING to LOST
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 
BJHC-HADOOP-HERA-17280.jd.local to /rack/rack4065
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2515000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2515000
2015-07-29 01:58:27,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
NodeManager from node BJHC-HADOOP-HERA-17280.jd.local(cmPort: 50086 httpPort: 
8042) registered with capability: memory:57344, vCores:28, assigned nodeId 
BJHC-HADOOP-HERA-17280.jd.local:50086
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 
BJHC-HERA-164102.hadoop.jd.local to /rack/rack41007
2015-07-29 01:58:27,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
NodeManager from node BJHC-HERA-164102.hadoop.jd.local(cmPort: 50086 httpPort: 
8042) registered with capability: memory:57344, vCores:28, assigned nodeId 
BJHC-HERA-164102.hadoop.jd.local:50086
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2516000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2516000
2015-07-29 01:58:27,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node not 
found resyncing BJHC-HERA-18043.hadoop.jd.local:50086
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2517000
2015-07-29 01:58:27,112 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2517000
2015-07-29 01:58:27,113 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2518000
2015-07-29 01:58:27,113 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2518000
2015-07-29 01:58:27,113 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 2519000
{noformat}



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-30 Thread zhangyubiao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647275#comment-14647275
 ] 

zhangyubiao commented on YARN-3979:
---

I found that the CPU usage and load were high because we use crontab to copy 
the RM logs.
Today we stopped the copy, and the CPU usage and load returned to normal.



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-30 Thread zhangyubiao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647303#comment-14647303
 ] 

zhangyubiao commented on YARN-3979:
---

I have just sent you the RM logs.



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-30 Thread zhangyubiao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647358#comment-14647358
 ] 

zhangyubiao commented on YARN-3979:
---

Today we also found that the YARN reserved memory is very large and AMs are stuck waiting to launch.



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-29 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646510#comment-14646510
 ] 

Rohith Sharma K S commented on YARN-3979:
-

Oops, 50 lakh events.
I checked the attached logs, but since you attached only ERROR logs, I was not 
able to trace it. One observation is that there are many 
InvalidStateTransitions for CLEAN_UP events in RMNodeImpl. 
# Could you provide the RM logs? If you are not able to attach them to the 
JIRA, could you send them to me by mail?
# Could you give more info: What is the cluster size? How many apps are 
running? How many were completed? What is the state of the NodeManagers, i.e. 
are they running or in some other state? Which version of Hadoop are you 
using?



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-29 Thread zhangyubiao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647095#comment-14647095
 ] 

zhangyubiao commented on YARN-3979:
---

The cluster has about 1600 nodes, with about 550 apps running and 2 lakh apps 
completed. At one point all the NodeManagers were lost and then recovered 
after a moment. I use Hadoop-2.2.0 on CentOS 6.5.



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-29 Thread zhangyubiao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647130#comment-14647130
 ] 

zhangyubiao commented on YARN-3979:
---

I have sent you an email with the RM jstack log, 
and I will send you the app log soon.




[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-29 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647177#comment-14647177
 ] 

Rohith Sharma K S commented on YARN-3979:
-

Thanks for the information!!
bq. NodeManager in one times all lost and recovery for a monment
I can think of a scenario very close to YARN-3990. Since you have 2 lakh apps 
completed and 1600 NodeManagers, when all the nodes are lost and reconnect, 
the number of events generated is {{(2 lakh completed + 550 running = 
200550) * 1600 (number of NodeManagers) = 320880000}} events. Oops!
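
The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes, as the comment does, that each reconnecting NodeManager produces one event per application known to the RM. The function name estimate_resync_events is hypothetical and not part of YARN.

```python
# Back-of-the-envelope estimate of the event burst described in the comment.
# Assumption: every reconnecting NodeManager triggers one event per application.
LAKH = 100_000  # 1 lakh = 100,000


def estimate_resync_events(completed_apps, running_apps, node_managers):
    """Return the estimated number of events queued when all nodes resync."""
    return (completed_apps + running_apps) * node_managers


events = estimate_resync_events(2 * LAKH, 550, 1600)
print(f"{events:,} events")  # 320,880,000 events
```

That is roughly 32 crore events, well above the event-queue sizes of about 50 lakh reported in the RM log.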



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-29 Thread zhangyubiao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645601#comment-14645601
 ] 

zhangyubiao commented on YARN-3979:
---

Thank you for the reply, @Rohith Sharma K S.



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-29 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645535#comment-14645535
 ] 

Rohith Sharma K S commented on YARN-3979:
-

How many applications completed? How many applications are running? How many 
NMs are running? When does the event queue become full? Have you made any 
other observations?



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-28 Thread zhangyubiao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645381#comment-14645381
 ] 

zhangyubiao commented on YARN-3979:
---

I found that the RM hung and captured a pstack at that time.


Thread 370 (Thread 0x7f263e4f1700 (LWP 35718)):
#0  0x003abf40b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x7f263ec01f8e in os::PlatformEvent::park() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#2  0x7f263ebd3985 in Monitor::IWait(Thread*, long) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#3  0x7f263ebd3fed in Monitor::wait(bool, long, bool) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#4  0x7f263ed101f5 in Threads::destroy_vm() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#5  0x7f263ea0c97b in jni_DestroyJavaVM () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#6  0x4000223f in JavaMain ()
#7  0x003abf407851 in start_thread () from /lib64/libpthread.so.0
#8  0x003abece811d in clone () from /lib64/libc.so.6
Thread 369 (Thread 0x7f263dfa1700 (LWP 35719)):
#0  0x003abf40b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x7f263ec01f8e in os::PlatformEvent::park() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#2  0x7f263ebd3985 in Monitor::IWait(Thread*, long) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#3  0x7f263ebd414e in Monitor::wait(bool, long, bool) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#4  0x7f263ed67668 in GangWorker::loop() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#5  0x7f263ed675b4 in GangWorker::run() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#6  0x7f263ec0296f in java_start(Thread*) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#7  0x003abf407851 in start_thread () from /lib64/libpthread.so.0
#8  0x003abece811d in clone () from /lib64/libc.so.6
Thread 368 (Thread 0x7f263dea0700 (LWP 35720)):
#0  0x003abf40b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x7f263ec01f8e in os::PlatformEvent::park() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#2  0x7f263ebd3985 in Monitor::IWait(Thread*, long) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#3  0x7f263ebd414e in Monitor::wait(bool, long, bool) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#4  0x7f263ed67668 in GangWorker::loop() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#5  0x7f263ed675b4 in GangWorker::run() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#6  0x7f263ec0296f in java_start(Thread*) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#7  0x003abf407851 in start_thread () from /lib64/libpthread.so.0
#8  0x003abece811d in clone () from /lib64/libc.so.6
Thread 367 (Thread 0x7f263dd9f700 (LWP 35721)):
#0  0x003abf40b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x7f263ec01f8e in os::PlatformEvent::park() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#2  0x7f263ebd3985 in Monitor::IWait(Thread*, long) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#3  0x7f263ebd414e in Monitor::wait(bool, long, bool) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#4  0x7f263ed67668 in GangWorker::loop() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#5  0x7f263ed675b4 in GangWorker::run() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#6  0x7f263ec0296f in java_start(Thread*) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#7  0x003abf407851 in start_thread () from /lib64/libpthread.so.0
#8  0x003abece811d in clone () from /lib64/libc.so.6
Thread 366 (Thread 0x7f263dc9e700 (LWP 35722)):
#0  0x003abf40b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x7f263ec01f8e in os::PlatformEvent::park() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#2  0x7f263ebd3985 in Monitor::IWait(Thread*, long) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#3  0x7f263ebd414e in Monitor::wait(bool, long, bool) () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#4  0x7f263ed67668 in GangWorker::loop() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#5  0x7f263ed675b4 in GangWorker::run() () from 
/software/servers/jdk1.6.0_25/jre/lib/amd64/server/libjvm.so
#6  0x7f263ec0296f in java_start(Thread*) () from 

[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-28 Thread zhangyubiao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645487#comment-14645487
 ] 

zhangyubiao commented on YARN-3979:
---

When we first found that some jobs hung for 10 minutes, we changed 
yarn.resourcemanager.client.thread-count 50 -> 100 
yarn.resourcemanager.scheduler.client.thread-count 50 -> 100 
yarn.resourcemanager.resource-tracker.client.thread-count 50 -> 80 
A few days later we found that the load and CPU usage on the RM machine 
sometimes became high.
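
For reference, the three settings named above are ResourceManager properties that normally live in yarn-site.xml. The following is a minimal sketch that renders the reported values as property elements. The helper name render_properties is hypothetical, and the values are only those reported in this comment, not a recommendation.

```python
from xml.sax.saxutils import escape

# Values as reported in the comment (each was previously 50).
THREAD_COUNTS = {
    "yarn.resourcemanager.client.thread-count": 100,
    "yarn.resourcemanager.scheduler.client.thread-count": 100,
    "yarn.resourcemanager.resource-tracker.client.thread-count": 80,
}


def render_properties(settings):
    """Render a name -> value mapping as a Hadoop-style <configuration> block."""
    lines = ["<configuration>"]
    for name, value in settings.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(str(value))}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


print(render_properties(THREAD_COUNTS))
```

Note that later comments in this thread attribute the hang to an overloaded AsyncDispatcher event queue rather than to these thread counts.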

I also found in the RM logs that the event queue had grown very large:

2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4924000
2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4925000
2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4926000
2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4927000
2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4928000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4929000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4930000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4930000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4931000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4932000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4933000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4934000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4935000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4936000
2015-07-29 01:59:21,197 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4937000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4938000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4939000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4940000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4941000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4942000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4943000
2015-07-29 01:59:21,198 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4944000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4945000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4946000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4947000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4948000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4949000
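
Queue-size lines like the ones above can be extracted from an RM log to see how quickly the backlog grows. The following is a minimal sketch, assuming the log format shown in this comment, including entries wrapped across two lines as they are here. The names queue_sizes and growth_per_second are hypothetical helpers, not part of Hadoop.

```python
import re
from datetime import datetime

# Matches AsyncDispatcher queue-size entries; \s+ tolerates wrapped lines.
PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.yarn\.event\.AsyncDispatcher:"
    r"\s+Size\s+of\s+event-queue\s+is\s+(\d+)"
)


def queue_sizes(log_text):
    """Return (timestamp, size) pairs for every queue-size entry found."""
    samples = []
    for stamp, size in PATTERN.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        samples.append((when, int(size)))
    return samples


def growth_per_second(samples):
    """Average growth between first and last sample; None if not computable."""
    if len(samples) < 2:
        return None
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    seconds = (t1 - t0).total_seconds()
    return (s1 - s0) / seconds if seconds > 0 else None


sample = """\
2015-07-29 01:59:21,196 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4924000
2015-07-29 01:59:21,199 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size 
of event-queue is 4949000
"""
samples = queue_sizes(sample)
print(len(samples), samples[-1][1] - samples[0][1])  # 2 25000
```

Log timestamps mark when a line was written, so a rate computed over a few milliseconds is only a rough indication. A longer window gives a more meaningful trend.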



[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM

2015-07-28 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645404#comment-14645404
 ] 

Rohith Sharma K S commented on YARN-3979:
-

[~piaoyu zhang] In the description you gave NM logs, but in the previous 
comment you gave a stack trace of the RM. It would be easier to analyze if you 
could provide more info, such as the RM logs, NM logs, and AM logs (if the AM 
started). An NM stack trace would help a lot, since the NM side is holding for 
10 minutes. 
