[jira] [Commented] (YARN-4127) RM fail with noAuth error if switched from failover mode to non-failover mode

2015-11-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986931#comment-14986931
 ] 

Hadoop QA commented on YARN-4127:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
43s {color} | {color:green} branch-2.7 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s 
{color} | {color:green} branch-2.7 passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} branch-2.7 passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
18s {color} | {color:green} branch-2.7 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
18s {color} | {color:green} branch-2.7 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 15s 
{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 in branch-2.7 cannot run convertXmlToText from findbugs {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s 
{color} | {color:green} branch-2.7 passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s 
{color} | {color:green} branch-2.7 passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 13s 
{color} | {color:red} Patch generated 4 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 (total was 226, now 229). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
14s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 1549 line(s) that end in whitespace. Use 
git apply --whitespace=fix. {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 39s 
{color} | {color:red} The patch has 137 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
18s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 51m 18s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 52m 21s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 44m 58s 
{color} | {color:red} Patch generated 65 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 165m 36s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.TestRMRestart |
|   | hadoop.yarn.server.resourcemanager.TestResourceTrackerService |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| JDK v1.7.0_79 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | 

[jira] [Updated] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in securt clusters

2015-11-03 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-4327:
---
Description: 
in hadoop 2.7.1
ATS conf like this: 

<property>
  <name>yarn.timeline-service.http-authentication.type</name>
  <value>simple</value>
</property>
<property>
  <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
  <value>HTTP/_h...@xxx.com</value>
</property>
<property>
  <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
  <value>/etc/hadoop/keytabs/xxx.keytab</value>
</property>
<property>
  <name>yarn.timeline-service.principal</name>
  <value>xxx/_h...@xxx.com</value>
</property>
<property>
  <name>yarn.timeline-service.keytab</name>
  <value>/etc/hadoop/keytabs/xxx.keytab</value>
</property>
<property>
  <name>yarn.timeline-service.best-effort</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>

I'd like to allow everyone to access ATS over HTTP, as with the RM and HDFS.
The client can submit a job to the RM and add a TIMELINE_DELEGATION_TOKEN to the AM 
context, but the RM cannot renew the TIMELINE_DELEGATION_TOKEN, which makes the application fail.
RM logs:
2015-11-03 11:58:38,191 WARN 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: 
Unable to add the application to the delegation token renewer.
java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, 
realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, 
masterKeyId=2)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: HTTP status [500], message [Null user]
at 
org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400)
at 
org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:543)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:540)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.renewToken(DelegationTokenRenewer.java:538)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:437)
... 6 more

ATS logs:
2015-11-03 14:47:45,407 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Creating password for identifier: owner=yarn-test, renewer=yarn-test, 
realUser=, issueDate=1446533265407, 

[jira] [Created] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in securt clusters

2015-11-03 Thread zhangshilong (JIRA)
zhangshilong created YARN-4327:
--

 Summary: RM can not renew  TIMELINE_DELEGATION_TOKEN in securt 
clusters
 Key: YARN-4327
 URL: https://issues.apache.org/jira/browse/YARN-4327
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, timelineserver
Affects Versions: 2.7.1
 Environment: hadoop 2.7.1. hdfs, yarn, mrhistoryserver, and ATS all use 
kerberos security.
conf like this:

<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
  <description>Is service-level authorization enabled?</description>
</property>
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
  <description>Possible values are simple (no authentication), and kerberos</description>
</property>

Reporter: zhangshilong


in hadoop 2.7.1
ATS conf like this: 

<property>
  <name>yarn.timeline-service.http-authentication.type</name>
  <value>simple</value>
</property>
<property>
  <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
  <value>HTTP/_h...@xxx.com</value>
</property>
<property>
  <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
  <value>/etc/hadoop/keytabs/xxx.keytab</value>
</property>
<property>
  <name>yarn.timeline-service.principal</name>
  <value>xxx/_h...@xxx.com</value>
</property>
<property>
  <name>yarn.timeline-service.keytab</name>
  <value>/etc/hadoop/keytabs/xxx.keytab</value>
</property>
<property>
  <name>yarn.timeline-service.best-effort</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>

I'd like to allow everyone to access ATS over HTTP, as with the RM and HDFS.
The client can submit a job to the RM and add a TIMELINE_DELEGATION_TOKEN to the AM 
context, but the RM cannot renew the TIMELINE_DELEGATION_TOKEN, which makes the application fail.
RM logs:
2015-11-03 11:58:38,191 WARN 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer: 
Unable to add the application to the delegation token renewer.
java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, 
realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, 
masterKeyId=2)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: HTTP status [500], message [Null user]
at 
org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400)
at 
org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81)
at org.apache.hadoop.security.token.Token.renew(Token.java:377)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:543)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:540)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
  

[jira] [Updated] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in secure clusters

2015-11-03 Thread zhangshilong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangshilong updated YARN-4327:
---
Summary: RM can not renew  TIMELINE_DELEGATION_TOKEN in secure clusters  
(was: RM can not renew  TIMELINE_DELEGATION_TOKEN in securt clusters)

> RM can not renew  TIMELINE_DELEGATION_TOKEN in secure clusters
> --
>
> Key: YARN-4327
> URL: https://issues.apache.org/jira/browse/YARN-4327
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.1
> Environment: hadoop 2.7.1. hdfs, yarn, mrhistoryserver, and ATS all use 
> kerberos security.
> conf like this:
> 
> <property>
>   <name>hadoop.security.authorization</name>
>   <value>true</value>
>   <description>Is service-level authorization enabled?</description>
> </property>
> <property>
>   <name>hadoop.security.authentication</name>
>   <value>kerberos</value>
>   <description>Possible values are simple (no authentication), and kerberos</description>
> </property>
>Reporter: zhangshilong
>
> in hadoop 2.7.1
> ATS conf like this: 
> 
> <property>
>   <name>yarn.timeline-service.http-authentication.type</name>
>   <value>simple</value>
> </property>
> <property>
>   <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
>   <value>HTTP/_h...@xxx.com</value>
> </property>
> <property>
>   <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
>   <value>/etc/hadoop/keytabs/xxx.keytab</value>
> </property>
> <property>
>   <name>yarn.timeline-service.principal</name>
>   <value>xxx/_h...@xxx.com</value>
> </property>
> <property>
>   <name>yarn.timeline-service.keytab</name>
>   <value>/etc/hadoop/keytabs/xxx.keytab</value>
> </property>
> <property>
>   <name>yarn.timeline-service.best-effort</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.timeline-service.enabled</name>
>   <value>true</value>
> </property>
> I'd like to allow everyone to access ATS over HTTP, as with the RM and HDFS.
> The client can submit a job to the RM and add a TIMELINE_DELEGATION_TOKEN to the AM 
> context, but the RM cannot renew the TIMELINE_DELEGATION_TOKEN, which makes the 
> application fail.
> RM logs:
> 2015-11-03 11:58:38,191 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer.
> java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
> Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, 
> realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, 
> masterKeyId=2)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: HTTP status [500], message [Null user]
> at 
> org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400)
> at 
> org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier$Renewer.renew(TimelineDelegationTokenIdentifier.java:81)
> at 

[jira] [Commented] (YARN-4327) RM can not renew TIMELINE_DELEGATION_TOKEN in securt clusters

2015-11-03 Thread zhangshilong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987054#comment-14987054
 ] 

zhangshilong commented on YARN-4327:


When yarn.timeline-service.http-authentication.type is simple, ATS will use 
PseudoAuthenticationHandler and generate a token like u=null=null. This makes 
ATS throw a java.lang.IllegalArgumentException.
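
To illustrate the failure mode, a hedged sketch only, with illustrative names rather than the real hadoop-auth PseudoAuthenticationHandler source: with "simple" authentication the caller is taken from the user.name query parameter, the RM's renew request carries none, so the resolved user is null and the delegation-token operation is rejected, matching the HTTP 500 [Null user] in the RM log above.

{code}
import javax.servlet.http.HttpServletRequest;

/**
 * Hedged sketch of the failure mode described in this comment; class and
 * method names are illustrative, not the actual PseudoAuthenticationHandler.
 */
public final class SimpleAuthNullUserSketch {
  static String resolveUser(HttpServletRequest request) {
    // "simple"-style resolution: take the caller from the "user.name"
    // query parameter of the HTTP request.
    String userName = request.getParameter("user.name");
    if (userName == null) {
      // The RM's renewDelegationToken call sends no user.name, so the user
      // is null here and the server-side operation fails ("Null user").
      throw new IllegalArgumentException("Null user");
    }
    return userName;
  }
}
{code}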

> RM can not renew  TIMELINE_DELEGATION_TOKEN in securt clusters
> --
>
> Key: YARN-4327
> URL: https://issues.apache.org/jira/browse/YARN-4327
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.7.1
> Environment: hadoop 2.7.1. hdfs, yarn, mrhistoryserver, and ATS all use 
> kerberos security.
> conf like this:
> 
> <property>
>   <name>hadoop.security.authorization</name>
>   <value>true</value>
>   <description>Is service-level authorization enabled?</description>
> </property>
> <property>
>   <name>hadoop.security.authentication</name>
>   <value>kerberos</value>
>   <description>Possible values are simple (no authentication), and kerberos</description>
> </property>
>Reporter: zhangshilong
>
> in hadoop 2.7.1
> ATS conf like this: 
> 
> <property>
>   <name>yarn.timeline-service.http-authentication.type</name>
>   <value>simple</value>
> </property>
> <property>
>   <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
>   <value>HTTP/_h...@xxx.com</value>
> </property>
> <property>
>   <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
>   <value>/etc/hadoop/keytabs/xxx.keytab</value>
> </property>
> <property>
>   <name>yarn.timeline-service.principal</name>
>   <value>xxx/_h...@xxx.com</value>
> </property>
> <property>
>   <name>yarn.timeline-service.keytab</name>
>   <value>/etc/hadoop/keytabs/xxx.keytab</value>
> </property>
> <property>
>   <name>yarn.timeline-service.best-effort</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.timeline-service.enabled</name>
>   <value>true</value>
> </property>
> I'd like to allow everyone to access ATS over HTTP, as with the RM and HDFS.
> The client can submit a job to the RM and add a TIMELINE_DELEGATION_TOKEN to the AM 
> context, but the RM cannot renew the TIMELINE_DELEGATION_TOKEN, which makes the 
> application fail.
> RM logs:
> 2015-11-03 11:58:38,191 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer:
>  Unable to add the application to the delegation token renewer.
> java.io.IOException: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, 
> Service: 10.12.38.4:8188, Ident: (owner=yarn-test, renewer=yarn-test, 
> realUser=, issueDate=1446523118046, maxDate=1447127918046, sequenceNumber=9, 
> masterKeyId=2)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:439)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$700(DelegationTokenRenewer.java:78)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:847)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:828)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: HTTP status [500], message [Null user]
> at 
> org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:169)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.doDelegationTokenOperation(DelegationTokenAuthenticator.java:287)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.renewDelegationToken(DelegationTokenAuthenticator.java:212)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.renewDelegationToken(DelegationTokenAuthenticatedURL.java:414)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:396)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$3.run(TimelineClientImpl.java:378)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$5.run(TimelineClientImpl.java:451)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:183)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:466)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.renewDelegationToken(TimelineClientImpl.java:400)
> at 
> 

[jira] [Updated] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-11-03 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-4311:
--
Attachment: YARN-4311-v1.patch

This is a preliminary proposal patch. With this change, the isInvalidAndAbsent() 
method decides whether a node should be removed from all lists. The outcome, 
however, depends on the initial state of the node (a minimal sketch of the check 
follows below).

If the node is running (i.e. it is part of the include list) and is then taken 
out of that list, followed by -refreshNodes (and it is of course not in the 
exclude list), the node is shut down and the shutdown node count is incremented. 
The node is not counted as a decommissioned node.

If the node is in both the include and exclude lists and -refreshNodes has been 
run (meaning it is a decommissioned node), then removing it from both lists takes 
it out completely: it no longer shows up in the shutdown, decommissioned, 
unhealthy, or active lists. There is one case where the shutdown counters are 
misleading and which this patch hasn't addressed: if the node was running, was 
taken out of the include list, and later comes back up after being re-added to 
the include list, the shutdown counters keep their old value. This needs to 
change, since the current state transitions don't account for it.

I need some input from the community on the semantics of such a fix. It may be a 
good idea to count (but not list) such nodes as shutdown nodes in the first case, 
since any mistake in configuring the include list would otherwise lose all 
information about the nodes (counters and node lists), which may be undesirable.

I appreciate any comments/suggestions/caveats. I have not fixed the associated 
test failures in this patch.
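
For reference, a minimal sketch of the semantics described above, under the assumption that include and exclude host sets are available; the real isInvalidAndAbsent() in the attached patch may differ:

{code}
import java.util.Set;

public final class NodeListCheckSketch {
  /**
   * A node is "invalid" when an include list is configured and the node is
   * not on it, and "absent" when it is also not on the exclude list. Only a
   * node that is both should be dropped from every node list.
   */
  static boolean isInvalidAndAbsent(String host,
                                    Set<String> includes,
                                    Set<String> excludes) {
    boolean invalid = !includes.isEmpty() && !includes.contains(host);
    boolean absent = !excludes.contains(host);
    return invalid && absent;
  }
}
{code}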

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used, in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4325) purge app state from NM state-store should be independent of log aggregation

2015-11-03 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987722#comment-14987722
 ] 

Junping Du commented on YARN-4325:
--

Hi [~vinodkv], we found that in a long-running cluster, NM recovery will try to 
recover tens of thousands of apps, most of which are old and stale. For now, the 
removal of app state from the NM state store is triggered only by 
ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED, which is created by 
aggregation or non-aggregation log handling.

So I suspect the purge of app state could be blocked by log aggregation 
exceptions, like the permission issue below:
{noformat}
2015-10-13 01:58:40,277 WARN  logaggregation.LogAggregationService 
(LogAggregationService.java:verifyAndCreateRemoteLogDir(195)) - Remote Root Log 
Dir [/app-logs] already exist, but with incorrect permissions. Expected: 
[rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple 
users.
336 2015-10-13 01:58:40,277 WARN  logaggregation.AppLogAggregatorImpl 
(AppLogAggregatorImpl.java:(182)) - rollingMonitorInterval is set as -1. 
The log rolling mornitoring interval is disabled. The logs will be aggregated 
after this application is finished.
{noformat}

I am still debugging it; please feel free to move this to a release after 2.7.2.
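
For illustration, a hedged sketch of the decoupling this JIRA argues for (the class and method names are made up, not the actual NM code): purge app state when the application finishes, independent of whether a log-handling-finished event ever arrives.

{code}
import java.io.IOException;

/**
 * Hedged sketch only: remove application state from the NM state store when
 * the application finishes, instead of waiting for an
 * APPLICATION_LOG_HANDLING_FINISHED event that may never arrive if log
 * aggregation is misconfigured.
 */
public final class AppStatePurgeSketch {
  interface StateStore {
    void removeApplication(String appId) throws IOException;
  }

  static void onApplicationFinished(String appId, StateStore store) {
    try {
      // Purge unconditionally on application finish; log handling cleanup
      // can still proceed independently.
      store.removeApplication(appId);
    } catch (IOException e) {
      // Keep the NM running; a failed purge only delays cleanup.
      System.err.println("Failed to remove state for " + appId + ": " + e);
    }
  }
}
{code}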

> purge app state from NM state-store should be independent of log aggregation
> 
>
> Key: YARN-4325
> URL: https://issues.apache.org/jira/browse/YARN-4325
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>
> On a long-running cluster, we found tens of thousands of stale apps still 
> being recovered during NM restart recovery. The reason is a wrong log 
> aggregation configuration: the end-of-log-aggregation events are never 
> received, so stale apps are not purged properly. We should make the removal 
> of app state independent of the log aggregation life cycle. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3862) Decide which contents to retrieve and send back in response in TimelineReader

2015-11-03 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987829#comment-14987829
 ] 

Allen Wittenauer commented on YARN-3862:


Yetus is happily pointing out how many build fixes your branch is missing and/or 
has yet to commit (like HADOOP-12492), so I'd recommend a rebase or some 
large-scale cherry-picking.

> Decide which contents to retrieve and send back in response in TimelineReader
> -
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma-separated list of configs/metrics to be returned will be quite 
> cumbersome to specify, we have to support either of the following options:
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> that window. This may be useful in plotting graphs.
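
As a rough illustration of options 1 and 2 above (a hedged sketch, not the TimelineReader API), filtering a config map by prefix or regex could look like this:

{code}
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hedged sketch of prefix and regex filtering; not the TimelineReader API. */
public final class FieldFilterSketch {
  static Map<String, String> filterByPrefix(Map<String, String> configs,
                                            String prefix) {
    return configs.entrySet().stream()
        .filter(e -> e.getKey().startsWith(prefix))
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  static Map<String, String> filterByRegex(Map<String, String> configs,
                                           String regex) {
    Pattern p = Pattern.compile(regex);
    return configs.entrySet().stream()
        .filter(e -> p.matcher(e.getKey()).matches())
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  public static void main(String[] args) {
    Map<String, String> configs = new TreeMap<>();
    configs.put("yarn.scheduler.minimum-allocation-mb", "1024");
    configs.put("yarn.scheduler.maximum-allocation-mb", "8192");
    configs.put("mapreduce.map.memory.mb", "2048");
    // Prints only the yarn.scheduler.* entries.
    System.out.println(filterByPrefix(configs, "yarn.scheduler."));
  }
}
{code}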



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-11-03 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3849:
-
Target Version/s: 2.7.3

Discussed with [~vinodkv]: since 2.7.2 is almost done, setting the target version 
of this ticket to 2.7.3.

> Too much of preemption activity causing continuos killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849.patch
>
>
> Two queues are used. Each queue is given a capacity of 0.5. The Dominant 
> Resource policy is used.
> 1. An app is submitted in QueueA, which consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is some demand, and preemption 
> is invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
> observed that all containers other than the AM get killed in QueueA.
> 4. Now the app in QueueB tries to take over the cluster with the current free 
> space. But there is updated demand from the app in QueueA, which lost its 
> containers earlier, and preemption now kicks in on QueueB.
> The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the 
> apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-11-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987819#comment-14987819
 ] 

Hadoop QA commented on YARN-1509:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 7s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
53s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
32s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
32s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 13s 
{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch 
failed. {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 56s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 51s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 28s 
{color} | {color:red} Patch generated 3 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 124, now 120). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
21s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 16m 55s {color} 
| {color:red} hadoop-yarn-applications-distributedshell in the patch failed 
with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 49m 33s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_60. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 16m 57s {color} 
| {color:red} hadoop-yarn-applications-distributedshell in the patch failed 
with JDK v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 49m 36s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.7.0_79. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
29s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 149m 45s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | 
hadoop.yarn.applications.distributedshell.TestDistributedShellWithNodeLabels |
|   | hadoop.yarn.client.TestGetGroups |
| JDK v1.8.0_60 Timed out junit tests | 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell |
|   | org.apache.hadoop.yarn.client.api.impl.TestAMRMClient |
|   | org.apache.hadoop.yarn.client.api.impl.TestYarnClient |
|   | 

[jira] [Commented] (YARN-4326) TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987817#comment-14987817
 ] 

Wangda Tan commented on YARN-4326:
--

Thanks for working on this, [~mding]. The fix looks good to me.

> TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4326.patch
>
>
> The timeout originates in ApplicationMaster, where it fails to connect to the 
> timeline server and the connection retries exceed the limit:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}
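
For context, a hedged sketch of the kind of fix the summary implies (assumed behavior, not necessarily the attached patch): read the timeline address that MiniYARNCluster actually bound instead of assuming the default port 8188.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

/** Hedged sketch only; the attached YARN-4326.patch may do this differently. */
public final class TimelineAddressSketch {
  static String boundTimelineWebAppAddress(MiniYARNCluster cluster) {
    // Assumption: the mini cluster publishes the address it actually bound
    // back into its configuration, so the test can hand this value to the AM
    // instead of relying on the default 8188 port.
    Configuration conf = cluster.getConfig();
    return conf.get(YarnConfiguration.TIMELINE_SERVICE_WEBAPP_ADDRESS);
  }
}
{code}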



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-11-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987878#comment-14987878
 ] 

Hadoop QA commented on YARN-1509:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 7s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 48s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
30s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 15s 
{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch 
failed. {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 55s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 50s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 28s 
{color} | {color:red} Patch generated 3 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 125, now 121). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 16m 54s {color} 
| {color:red} hadoop-yarn-applications-distributedshell in the patch failed 
with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 49m 25s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_60. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 16m 59s {color} 
| {color:red} hadoop-yarn-applications-distributedshell in the patch failed 
with JDK v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 49m 34s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.7.0_79. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
30s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 148m 27s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | 
hadoop.yarn.applications.distributedshell.TestDistributedShellWithNodeLabels |
|   | hadoop.yarn.client.TestGetGroups |
| JDK v1.8.0_60 Timed out junit tests | 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell |
|   | org.apache.hadoop.yarn.client.api.impl.TestYarnClient |
|   | org.apache.hadoop.yarn.client.api.impl.TestAMRMClient |
|   | 

[jira] [Updated] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

2015-11-03 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3849:
-
Labels:   (was: 2.7.2-candidate)

> Too much of preemption activity causing continuos killing of containers 
> across queues
> -
>
> Key: YARN-3849
> URL: https://issues.apache.org/jira/browse/YARN-3849
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-3849.patch, 0002-YARN-3849.patch, 
> 0003-YARN-3849.patch, 0004-YARN-3849.patch
>
>
> Two queues are used. Each queue is given a capacity of 0.5. The Dominant 
> Resource policy is used.
> 1. An app is submitted in QueueA, which consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is some demand, and preemption 
> is invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
> observed that all containers other than the AM get killed in QueueA.
> 4. Now the app in QueueB tries to take over the cluster with the current free 
> space. But there is updated demand from the app in QueueA, which lost its 
> containers earlier, and preemption now kicks in on QueueB.
> The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the 
> apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4321) Incessant retries if NoAuthException is thrown by Zookeeper in non HA mode

2015-11-03 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-4321:
---
Target Version/s: 2.7.2  (was: 2.7.2, 2.6.3)

> Incessant retries if NoAuthException is thrown by Zookeeper in non HA mode
> --
>
> Key: YARN-4321
> URL: https://issues.apache.org/jira/browse/YARN-4321
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Fix For: 2.7.2
>
> Attachments: YARN-4321-branch-2.7.01.patch
>
>
> This applies only to branch-2.7 or earlier code.
> When a {{NoAuthException}} is thrown in non-HA mode (like in the scenario of 
> YARN-4127), the RM incessantly keeps retrying the ZK operation.
> {noformat}
> 2015-10-23 09:22:10,209 DEBUG [SyncThread:0] server.DataTree 
> (DataTree.java:processTxn(949)) - Ignoring processTxn failure hdr: -1 : 
> error: -102
> 2015-10-23 09:22:10,210 DEBUG [main-SendThread(127.0.0.1:11221)] 
> zookeeper.ClientCnxn (ClientCnxn.java:readResponse(818)) - Reading reply 
> sessionid:0x15092d1ebe10001, packet:: clientPath:null serverPath:null 
> finished:false header:: 7591,1  replyHeader:: 7591,7610,-102  request:: 
> '/rmstore/ZKRMStateRoot/RMAppRoot,,v{s{31,s{'world,'anyone}}},0  response::
> 2015-10-23 09:22:10,210 INFO  [ProcessThread(sid:0 cport:-1):] 
> server.PrepRequestProcessor (PrepRequestProcessor.java:pRequest(645)) - Got 
> user-level KeeperException when processing sessionid:0x15092d1ebe10001 
> type:create cxid:0x1da8 zxid:0x1dbb txntype:-1 reqpath:n/a Error Path:null 
> Error:KeeperErrorCode = NoAuth
> {noformat}
> This is because we do not handle NoAuthException properly in branch-2.7 code 
> when HA is not enabled.
> In {{ZKRMStateStore#runWithRetries}}, we have the code below. As can be seen, 
> if HA is not enabled, we neither rethrow NoAuthException nor have any logic to 
> increment retries and back out once retries are maxed out.
> {code}
>  T runWithRetries() throws Exception {
>   int retry = 0;
>   while (true) {
> try {
>   return runWithCheck();
> } catch (KeeperException.NoAuthException nae) {
>   if (HAUtil.isHAEnabled(getConfig())) {
> // NoAuthException possibly means that this store is fenced due to
> // another RM becoming active. Even if not,
> // it is safer to assume we have been fenced
> throw new StoreFencedException();
>   }
> } catch (KeeperException ke) {
>   .
>}
>  }
>   }
> {code}
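
A hedged sketch of the missing handling described above (illustrative names, not the committed patch): in non-HA mode, NoAuthException should consume the retry budget and eventually be rethrown rather than being swallowed and retried forever.

{code}
import org.apache.zookeeper.KeeperException;

/**
 * Hedged sketch only; the real ZKRMStateStore fix may differ. When HA is not
 * enabled, a NoAuthException should count against the retry budget and be
 * rethrown once retries are exhausted.
 */
public final class NoAuthRetrySketch {
  interface ZkOp<T> { T run() throws Exception; }

  static <T> T runWithRetries(ZkOp<T> op, boolean haEnabled, int numRetries)
      throws Exception {
    int retry = 0;
    while (true) {
      try {
        return op.run();
      } catch (KeeperException.NoAuthException nae) {
        if (haEnabled) {
          // In HA mode the store is assumed fenced by another active RM
          // (the real code throws StoreFencedException here).
          throw nae;
        }
        if (++retry > numRetries) {
          // Non-HA mode: surface the error instead of retrying forever.
          throw nae;
        }
      }
    }
  }
}
{code}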



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3862) Decide which contents to retrieve and send back in response in TimelineReader

2015-11-03 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987669#comment-14987669
 ] 

Varun Saxena commented on YARN-3862:


Yes, you are correct. Will fix this in next patch

> Decide which contents to retrieve and send back in response in TimelineReader
> -
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma-separated list of configs/metrics to be returned will be quite 
> cumbersome to specify, we have to support either of the following options:
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> that window. This may be useful in plotting graphs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-2055.
---
Resolution: Not A Problem

Between YARN-2074, MAPREDUCE-5956 and YARN-2022, this is not much of an issue 
anymore.

For the remaining toggle of AM containers between nearly-full queues, not much 
can really be done.

I am closing this as not-a-problem anymore; please reopen as necessary.

> Preemption: Jobs are failing due to AMs are getting launched and killed 
> multiple times
> --
>
> Key: YARN-2055
> URL: https://issues.apache.org/jira/browse/YARN-2055
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
>
> If queue A does not have enough capacity to run the AM, the AM will borrow 
> capacity from queue B. In that case the AM will be killed when queue B 
> reclaims its capacity, and then the AM will be launched and killed again; in 
> that case the job will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4325) purge app state from NM state-store should be independent of log aggregation

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987678#comment-14987678
 ] 

Vinod Kumar Vavilapalli commented on YARN-4325:
---

[~djp], the JIRA is a little light on details; it will help if you can paste the 
exceptions / log messages, etc.

Also, does this only happen with misconfiguration? And are you planning to work 
on this soon? If not, I'd rather not hold 2.7.2 for this.

> purge app state from NM state-store should be independent of log aggregation
> 
>
> Key: YARN-4325
> URL: https://issues.apache.org/jira/browse/YARN-4325
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>
> On a long-running cluster, we found tens of thousands of stale apps still 
> being recovered during NM restart recovery. The reason is a wrong log 
> aggregation configuration: the end-of-log-aggregation events are never 
> received, so stale apps are not purged properly. We should make the removal 
> of app state independent of the log aggregation life cycle. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4321) Incessant retries if NoAuthException is thrown by Zookeeper in non HA mode

2015-11-03 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-4321:
---
Target Version/s: 2.7.2, 2.6.3  (was: 2.7.2)

> Incessant retries if NoAuthException is thrown by Zookeeper in non HA mode
> --
>
> Key: YARN-4321
> URL: https://issues.apache.org/jira/browse/YARN-4321
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Fix For: 2.7.2
>
> Attachments: YARN-4321-branch-2.7.01.patch
>
>
> This applies only to branch-2.7 or earlier code.
> When a {{NoAuthException}} is thrown in non-HA mode (like in the scenario of 
> YARN-4127), the RM incessantly keeps retrying the ZK operation.
> {noformat}
> 2015-10-23 09:22:10,209 DEBUG [SyncThread:0] server.DataTree 
> (DataTree.java:processTxn(949)) - Ignoring processTxn failure hdr: -1 : 
> error: -102
> 2015-10-23 09:22:10,210 DEBUG [main-SendThread(127.0.0.1:11221)] 
> zookeeper.ClientCnxn (ClientCnxn.java:readResponse(818)) - Reading reply 
> sessionid:0x15092d1ebe10001, packet:: clientPath:null serverPath:null 
> finished:false header:: 7591,1  replyHeader:: 7591,7610,-102  request:: 
> '/rmstore/ZKRMStateRoot/RMAppRoot,,v{s{31,s{'world,'anyone}}},0  response::
> 2015-10-23 09:22:10,210 INFO  [ProcessThread(sid:0 cport:-1):] 
> server.PrepRequestProcessor (PrepRequestProcessor.java:pRequest(645)) - Got 
> user-level KeeperException when processing sessionid:0x15092d1ebe10001 
> type:create cxid:0x1da8 zxid:0x1dbb txntype:-1 reqpath:n/a Error Path:null 
> Error:KeeperErrorCode = NoAuth
> {noformat}
> This is because we do not handle NoAuthException properly in branch-2.7 code 
> when HA is not enabled.
> In {{ZKRMStateStore#runWithRetries}}, we have the code below. As can be seen, 
> if HA is not enabled, we neither rethrow NoAuthException nor have any logic to 
> increment retries and back out once retries are maxed out.
> {code}
>  T runWithRetries() throws Exception {
>   int retry = 0;
>   while (true) {
> try {
>   return runWithCheck();
> } catch (KeeperException.NoAuthException nae) {
>   if (HAUtil.isHAEnabled(getConfig())) {
> // NoAuthException possibly means that this store is fenced due to
> // another RM becoming active. Even if not,
> // it is safer to assume we have been fenced
> throw new StoreFencedException();
>   }
> } catch (KeeperException ke) {
>   .
>}
>  }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4236) Metric for aggregated resources allocation per queue

2015-11-03 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4236:
---
Attachment: (was: YARN-4236.patch)

> Metric for aggregated resources allocation per queue
> 
>
> Key: YARN-4236
> URL: https://issues.apache.org/jira/browse/YARN-4236
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4236.patch
>
>
> We currently track allocated memory and allocated vcores per queue, but we 
> don't have a good rate metric for how fast we're allocating these resources. 
> In other words, a flat line in allocatedmb could mean either extreme: no new 
> containers are being allocated, or containers are constantly being allocated 
> while we free exactly what we allocate each time. Adding a per-queue metric 
> for resources allocated per second would give us better insight into the rate 
> of resource churn on a queue. Based on this aggregated resource allocation per 
> queue we can easily build tools to measure the allocation rate for each queue.
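
For illustration, a rough sketch of what such a per-queue counter could look like 
with the Hadoop metrics2 library; the class and metric names are assumptions, not 
the actual patch:
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.yarn.api.records.Resource;

@Metrics(context = "yarn")
class QueueAllocationRateSketch {
  // Monotonically increasing totals; an external tool can derive a per-second
  // rate from successive samples of these counters.
  @Metric("Aggregate MB allocated") MutableCounterLong aggregateMemoryMBAllocated;
  @Metric("Aggregate vcores allocated") MutableCounterLong aggregateVcoresAllocated;

  static QueueAllocationRateSketch forQueue(String queueName) {
    // Registering with the metrics system instantiates the @Metric fields.
    return DefaultMetricsSystem.instance().register(
        "QueueAllocationRateSketch," + queueName,
        "Per-queue allocation rate sketch", new QueueAllocationRateSketch());
  }

  void onContainerAllocated(Resource allocated) {
    aggregateMemoryMBAllocated.incr(allocated.getMemory());
    aggregateVcoresAllocated.incr(allocated.getVirtualCores());
  }
}
{code}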



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4236) Metric for aggregated resources allocation per queue

2015-11-03 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4236:
---
Attachment: YARN-4236.patch

> Metric for aggregated resources allocation per queue
> 
>
> Key: YARN-4236
> URL: https://issues.apache.org/jira/browse/YARN-4236
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4236.patch
>
>
> We currently track allocated memory and allocated vcores per queue, but we 
> don't have a good rate metric for how fast we're allocating these resources. 
> In other words, a flat line in allocatedmb could mean either extreme: no new 
> containers are being allocated, or containers are constantly being allocated 
> while we free exactly what we allocate each time. Adding a per-queue metric 
> for resources allocated per second would give us better insight into the rate 
> of resource churn on a queue. Based on this aggregated resource allocation per 
> queue we can easily build tools to measure the allocation rate for each queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2015-11-03 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: (was: YARN-4218.2.patch)

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.patch, 
> YARN-4218.wip.patch, screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job, and the preemption metrics show how many containers were preempted on a 
> job. However, we don't have a metric showing the resource*time footprint cost 
> of preemption. In other words, we know how many containers were preempted, but 
> we don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that had 100 small containers preempted that only lasted a minute each is 
> going to be less impacted than a job that lost a single container that was 
> huge and had been running for 3 days.
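
For illustration, the bookkeeping for such a metric could be as simple as the sketch 
below; the class, method and field names are assumptions, not the actual patch:
{code}
class PreemptionCostSketch {
  private long preemptedMemoryMBSeconds;
  private long preemptedVcoreSeconds;

  // Called when a container is preempted; the caller supplies the container's
  // resource size and how long it had been running.
  void onContainerPreempted(long memoryMB, long vcores, long lifetimeMillis) {
    long seconds = lifetimeMillis / 1000;
    preemptedMemoryMBSeconds += memoryMB * seconds;
    preemptedVcoreSeconds += vcores * seconds;
  }

  long getPreemptedMemoryMBSeconds() {
    return preemptedMemoryMBSeconds;
  }

  long getPreemptedVcoreSeconds() {
    return preemptedVcoreSeconds;
  }
}
{code}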



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4236) Metric for aggregated resources allocation per queue

2015-11-03 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4236:
---
Attachment: (was: YARN-4236.patch)

> Metric for aggregated resources allocation per queue
> 
>
> Key: YARN-4236
> URL: https://issues.apache.org/jira/browse/YARN-4236
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4236.patch
>
>
> We currently track allocated memory and allocated vcores per queue, but we 
> don't have a good rate metric for how fast we're allocating these resources. 
> In other words, a flat line in allocatedmb could mean either extreme: no new 
> containers are being allocated, or containers are constantly being allocated 
> while we free exactly what we allocate each time. Adding a per-queue metric 
> for resources allocated per second would give us better insight into the rate 
> of resource churn on a queue. Based on this aggregated resource allocation per 
> queue we can easily build tools to measure the allocation rate for each queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit

2015-11-03 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987842#comment-14987842
 ] 

Eric Payne commented on YARN-3769:
--

Tests {{hadoop.yarn.server.resourcemanager.TestClientRMTokens}} and 
{{hadoop.yarn.server.resourcemanager.TestAMAuthorization}} are not failing for 
me in my own build environment.

> Preemption occurring unnecessarily because preemption doesn't consider user 
> limit
> -
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, 
> YARN-3769.003.patch, YARN-3769.004.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4328) Findbugs warning in resourcemanager in branch-2.7 and branch-2.6

2015-11-03 Thread Varun Saxena (JIRA)
Varun Saxena created YARN-4328:
--

 Summary: Findbugs warning in resourcemanager in branch-2.7 and 
branch-2.6
 Key: YARN-4328
 URL: https://issues.apache.org/jira/browse/YARN-4328
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Varun Saxena
Priority: Minor


This issue exists in both branch-2.7 and branch-2.6
{noformat}


{noformat}

The issue below exists only in branch-2.6
{noformat}


{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-03 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987876#comment-14987876
 ] 

MENG DING commented on YARN-1510:
-

The mvn install test is flawed. It goes directly into the 
{{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell}}
 directory and runs {{mvn install}}, which still resolves the yarn client from 
the old local maven repo. This causes the build to fail. The mvn install test 
should be run from the root hadoop directory.

{code}
Mon Nov  2 17:08:05 UTC 2015
cd 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
mvn -Dmaven.repo.local=/home/jenkins/yetus-m2/hadoop-trunk-1 -DskipTests -fae 
clean install -DskipTests=true -Dmaven.javadoc.skip=true
[INFO] Scanning for projects...
...
...
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
hadoop-yarn-applications-distributedshell ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 4 source files to 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/classes
[INFO] -
[ERROR] COMPILATION ERROR : 
[INFO] -
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[861,55]
 cannot find symbol
  symbol:   class AbstractCallbackHandler
  location: class org.apache.hadoop.yarn.client.api.async.NMClientAsync
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[565,21]
 no suitable constructor found for 
NMClientAsyncImpl(org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.NMCallbackHandler)
constructor 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.NMClientAsyncImpl(java.lang.String,org.apache.hadoop.yarn.client.api.NMClient,org.apache.hadoop.yarn.client.api.async.NMClientAsync.CallbackHandler)
 is not applicable
  (actual and formal argument lists differ in length)
constructor 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.NMClientAsyncImpl(java.lang.String,org.apache.hadoop.yarn.client.api.async.NMClientAsync.CallbackHandler)
 is not applicable
  (actual and formal argument lists differ in length)
constructor 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.NMClientAsyncImpl(org.apache.hadoop.yarn.client.api.async.NMClientAsync.CallbackHandler)
 is not applicable
  (actual argument 
org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.NMCallbackHandler
 cannot be converted to 
org.apache.hadoop.yarn.client.api.async.NMClientAsync.CallbackHandler by method 
invocation conversion)
...
...
{code}

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197 and YARN-1449, we need to add APIs in NMClient to support
> 1) sending requests to increase/decrease container resource limits, and
> 2) getting the succeeded/failed container-change responses from the NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4127) RM fail with noAuth error if switched from failover mode to non-failover mode

2015-11-03 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987612#comment-14987612
 ] 

Varun Saxena commented on YARN-4127:


Filed YARN-4328 for the findbugs issue reported above

> RM fail with noAuth error if switched from failover mode to non-failover mode 
> --
>
> Key: YARN-4127
> URL: https://issues.apache.org/jira/browse/YARN-4127
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Jian He
>Assignee: Varun Saxena
> Attachments: YARN-4127-branch-2.7.01.patch, 
> YARN-4127-branch-2.7.02.patch, YARN-4127.01.patch, YARN-4127.02.patch
>
>
> The scenario is that RM failover was initially enabled, so the zkRootNodeAcl 
> is by default set with the *RM ID* in the ACL string. 
> If RM failover is then switched off, the RM cannot load data from ZK 
> and fails with a noAuth error. After I reset the root node ACL, it can 
> access the store again.
> {code}
> 15/09/08 14:28:34 ERROR resourcemanager.ResourceManager: Failed to 
> load/recover state
> org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125)
>   at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$SafeTransaction.commit(ZKRMStateStore.java:1009)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.safeSetData(ZKRMStateStore.java:985)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getAndIncrementEpoch(ZKRMStateStore.java:374)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:579)
>   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:973)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1014)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1010)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1010)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1050)
>   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1194)
> {code}
> The problem may be that in non-failover mode the RM doesn't use the *RM-ID* 
> to connect to ZK and thus fails with a NoAuth error.
> We should be able to switch failover on and off with no interruption to the 
> user.
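
For reference, resetting the root node ACL by hand (the workaround mentioned above) 
can be done with the plain ZooKeeper client. The connect string and the open ACL 
below are assumptions for the example; a secured cluster would set an appropriate 
digest ACL instead:
{code}
import java.util.List;

import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;

public class ResetRmStoreAcl {
  public static void main(String[] args) throws Exception {
    // Example connect string; use the ensemble from yarn.resourcemanager.zk-address.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);
    try {
      List<ACL> openAcl = ZooDefs.Ids.OPEN_ACL_UNSAFE;  // world:anyone
      zk.setACL("/rmstore/ZKRMStateRoot", openAcl, -1);  // -1 matches any ACL version
    } finally {
      zk.close();
    }
  }
}
{code}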



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3862) Decide which contents to retrieve and send back in response in TimelineReader

2015-11-03 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987636#comment-14987636
 ] 

Sangjin Lee commented on YARN-3862:
---

[~aw], sorry to bother you here, but our latest jenkins run is an unmitigated 
failure full of incomprehensible errors (projects banned, unrecognizable 
dependencies showing up out of nowhere, etc.). Would you be able to share what 
is going on here and how we can restore this? Thanks.

> Decide which contents to retrieve and send back in response in TimelineReader
> -
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch
>
>
> Currently, we retrieve all the contents of a field if that field is 
> specified in the query API. In the case of configs and metrics, this can 
> become a lot of data even though the user doesn't need it. So we need to 
> provide a way to query only a subset of configs or metrics.
> As a comma-separated list of configs/metrics to be returned would be quite 
> cumbersome to specify, we have to support one of the following options:
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window and return only the 
> metrics in that window. This may be useful for plotting graphs.
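
To make the first option concrete, a tiny illustrative sketch of prefix matching 
over a metrics map (the real reader works with timeline entities rather than a 
bare map):
{code}
import java.util.Map;
import java.util.TreeMap;

class MetricPrefixFilterSketch {
  // Keep only the metrics whose names start with the requested prefix.
  static Map<String, Long> filterByPrefix(Map<String, Long> metrics, String prefix) {
    Map<String, Long> filtered = new TreeMap<String, Long>();
    for (Map.Entry<String, Long> e : metrics.entrySet()) {
      if (e.getKey().startsWith(prefix)) {
        filtered.put(e.getKey(), e.getValue());
      }
    }
    return filtered;
  }
}
{code}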



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987810#comment-14987810
 ] 

Wangda Tan commented on YARN-1510:
--

Rekicked Jenkins.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197 and YARN-1449, we need to add APIs in NMClient to support
> 1) sending requests to increase/decrease container resource limits, and
> 2) getting the succeeded/failed container-change responses from the NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4325) purge app state from NM state-store should be independent of log aggregation

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-4325:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Okay, moving it out while you continue debugging.

> purge app state from NM state-store should be independent of log aggregation
> 
>
> Key: YARN-4325
> URL: https://issues.apache.org/jira/browse/YARN-4325
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>
> On a long-running cluster, we found tens of thousands of stale apps still 
> being recovered during NM restart recovery. The cause is a wrong log 
> aggregation configuration: the events marking the end of log aggregation are 
> never received, so stale apps are not purged properly. We should make the 
> removal of app state independent of the log aggregation life cycle. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3112) AM restart and keep containers from previous attempts, then new container launch failed

2015-11-03 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3112:
---
Description: 
This error is very similar to YARN-1795 and YARN-1839, but I have checked the 
solutions of those JIRAs and the patches are already included in my version. I 
think this error is caused by the different NMTokens between the old and new app 
attempts. The new AM has inherited the old tokens from the previous AM according 
to my configuration (keepContainers=true), so the tokens for new containers are 
replaced by the old ones in the NMTokenCache.

{noformat}
206 2015-01-29 10:04:49,603 ERROR [ContainerLauncher #0] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container 
launch failed for  container_1422546145900_0001_02_02 : 
org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
for ixk02:47625
 207 ›   at 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProt
 ocolProxy.java:256)
 208 ›   at 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtoc
 olProxy.java:246)
 209 ›   at 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:132)
 210 ›   at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:401)
 211 ›   at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
 212 ›   at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:367)
 213 ›   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 214 ›   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 215 ›   at java.lang.Thread.run(Thread.java:722)
{noformat}

  was:
This error is very similar to YARN-1795 and YARN-1839, but I have checked the 
solutions of those JIRAs and the patches are already included in my version. I 
think this error is caused by the different NMTokens between the old and new app 
attempts. The new AM has inherited the old tokens from the previous AM according 
to my configuration (keepContainers=true), so the tokens for new containers are 
replaced by the old ones in the NMTokenCache.

206 2015-01-29 10:04:49,603 ERROR [ContainerLauncher #0] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container 
launch failed for  container_1422546145900_0001_02_02 : 
org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
for ixk02:47625
 207 ›   at 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProt
 ocolProxy.java:256)
 208 ›   at 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtoc
 olProxy.java:246)
 209 ›   at 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:132)
 210 ›   at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:401)
 211 ›   at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
 212 ›   at 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:367)
 213 ›   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 214 ›   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 215 ›   at java.lang.Thread.run(Thread.java:722)


> AM restart and keep containers from previous attempts, then new container 
> launch failed
> ---
>
> Key: YARN-3112
> URL: https://issues.apache.org/jira/browse/YARN-3112
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications, resourcemanager
>Affects Versions: 2.6.0
> Environment: in real linux cluster
>Reporter: Jack Chen
>
> This error is very similar to YARN-1795 and YARN-1839, but I have checked the 
> solutions of those JIRAs and the patches are already included in my version. I 
> think this error is caused by the different NMTokens between the old and new 
> app attempts. The new AM has inherited the old tokens from the previous AM 
> according to my configuration (keepContainers=true), so the tokens for new 
> containers are replaced by the old ones in the NMTokenCache.
> {noformat}
> 206 2015-01-29 10:04:49,603 ERROR [ContainerLauncher #0] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: 

[jira] [Commented] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-11-03 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987868#comment-14987868
 ] 

MENG DING commented on YARN-1509:
-

The test failure should be related to YARN-4326.

In addition, the mvn install test script is flawed. It goes directly into the 
{{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell}}
 directory and runs {{mvn install}}, which still resolves the yarn client from 
the old local maven repo. This causes the build to fail. The mvn install test 
should be run from the root hadoop directory.

{code}
Tue Nov  3 16:18:38 UTC 2015
cd 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
mvn -Dmaven.repo.local=/home/jenkins/yetus-m2/hadoop-trunk-0 -DskipTests -fae 
clean install -DskipTests=true -Dmaven.javadoc.skip=true
[INFO] Scanning for projects...
...
...
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
hadoop-yarn-applications-distributedshell ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 4 source files to 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/classes
[INFO] -
[ERROR] COMPILATION ERROR : 
[INFO] -
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[735,50]
 cannot find symbol
  symbol:   class AbstractCallbackHandler
  location: class org.apache.hadoop.yarn.client.api.async.AMRMClientAsync
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[559,20]
 cannot find symbol
  symbol:   class AbstractCallbackHandler
  location: class org.apache.hadoop.yarn.client.api.async.AMRMClientAsync
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[737,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[805,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[838,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[841,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[846,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[849,5]
 method does not override or implement a method from a supertype
[ERROR] 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java:[857,5]
 method does not override or implement a method from a supertype
[INFO] 9 errors 
[INFO] -
[INFO] 
[INFO] BUILD FAILURE
{code}


> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.10.patch, 
> YARN-1509.2.patch, 

[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987893#comment-14987893
 ] 

Wangda Tan commented on YARN-1510:
--

I see, will run client tests locally. [~mding] could you file a ticket to track 
the mvninstall issue?

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197 and YARN-1449, we need to add APIs in NMClient to support
> 1) sending requests to increase/decrease container resource limits, and
> 2) getting the succeeded/failed container-change responses from the NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4218) Metric for resource*time that was preempted

2015-11-03 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4218:
---
Attachment: YARN-4218.2.patch

> Metric for resource*time that was preempted
> ---
>
> Key: YARN-4218
> URL: https://issues.apache.org/jira/browse/YARN-4218
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4218.2.patch, YARN-4218.2.patch, YARN-4218.patch, 
> YARN-4218.wip.patch, screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> After YARN-415 we have the ability to track the resource*time footprint of a 
> job, and the preemption metrics show how many containers were preempted on a 
> job. However, we don't have a metric showing the resource*time footprint cost 
> of preemption. In other words, we know how many containers were preempted, but 
> we don't have a good measure of how much work was lost as a result of preemption.
> We should add this metric so we can analyze how much work preemption is 
> costing on a grid and better track which jobs were heavily impacted by it. A 
> job that had 100 small containers preempted that only lasted a minute each is 
> going to be less impacted than a job that lost a single container that was 
> huge and had been running for 3 days.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-03 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4326:
-
Summary: Fix TestDistributedShell timeout as AHS in MiniYarnCluster no 
longer binds to default port 8188  (was: TestDistributedShell timeout as AHS in 
MiniYarnCluster no longer binds to default port 8188)

> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Attachments: YARN-4326.patch
>
>
> The timeout originates in ApplicationMaster, which fails to connect to the 
> timeline server until the connection retries exceed their limit:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3862) Decide which contents to retrieve and send back in response in TimelineReader

2015-11-03 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987902#comment-14987902
 ] 

Sangjin Lee commented on YARN-3862:
---

Thanks. We'll discuss when we should address the next rebase. Thanks for 
checking.

> Decide which contents to retrieve and send back in response in TimelineReader
> -
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch
>
>
> Currently, we retrieve all the contents of a field if that field is 
> specified in the query API. In the case of configs and metrics, this can 
> become a lot of data even though the user doesn't need it. So we need to 
> provide a way to query only a subset of configs or metrics.
> As a comma-separated list of configs/metrics to be returned would be quite 
> cumbersome to specify, we have to support one of the following options:
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window and return only the 
> metrics in that window. This may be useful for plotting graphs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4236) Metric for aggregated resources allocation per queue

2015-11-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987932#comment-14987932
 ] 

Hadoop QA commented on YARN-4236:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 6s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
20s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 11s 
{color} | {color:red} Patch generated 2 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 (total was 52, now 54). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
18s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 60m 35s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 41s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
26s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 132m 6s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_79 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 
Image:test-patch-base-hadoop-date2015-11-03 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12770352/YARN-4236.patch |
| JIRA Issue | YARN-4236 |
| Optional Tests |  asflicense  javac  javadoc  mvninstall  unit  findbugs  
checkstyle  compile  |
| uname | Linux 896914ddbb37 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 

[jira] [Updated] (YARN-3877) YarnClientImpl.submitApplication swallows exceptions

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3877:
--
Target Version/s: 2.8.0  (was: 2.7.2)

Moving this improvement out of 2.7.2 and from future maintenance lines.

> YarnClientImpl.submitApplication swallows exceptions
> 
>
> Key: YARN-3877
> URL: https://issues.apache.org/jira/browse/YARN-3877
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
>Assignee: Varun Saxena
>Priority: Minor
> Attachments: YARN-3877.01.patch, YARN-3877.02.patch, 
> YARN-3877.03.patch
>
>
> When {{YarnClientImpl.submitApplication}} spins waiting for the application 
> to be accepted, any interruption during its Sleep() calls is logged and 
> swallowed.
> This makes it hard to interrupt the thread during shutdown. Really, it should 
> throw some form of exception and let the caller deal with it.
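
A sketch of the behaviour the description asks for; the polling helper below is 
hypothetical rather than the actual YarnClientImpl code, but it shows the idea of 
restoring the interrupt status and surfacing an exception instead of swallowing it:
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.exceptions.YarnException;

class SubmitWaitSketch {
  // Poll until the application is accepted, but propagate interrupts to the caller.
  void waitForAccepted(ApplicationId appId, long pollIntervalMs) throws YarnException {
    while (!isAccepted(appId)) {
      try {
        Thread.sleep(pollIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // keep the interrupt flag set for the caller
        throw new YarnException("Interrupted while waiting for " + appId
            + " to be accepted", e);
      }
    }
  }

  private boolean isAccepted(ApplicationId appId) {
    return false;  // placeholder; a real implementation would query the RM report
  }
}
{code}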



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API

2015-11-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988303#comment-14988303
 ] 

Wangda Tan commented on YARN-4292:
--

Thanks [~sunilg] for working on this; the general approach of this patch looks good.

Do you think it would be better to add a ResourceUtilizationInfo class? After 
YARN-3926, more resource types could be added to YARN, and manually managing 
these items is hard.
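
For example, a JAXB DAO along the lines suggested above might look like the sketch 
below; the field names are illustrative assumptions, not necessarily what the final 
patch will use:
{code}
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "resourceUtilization")
@XmlAccessorType(XmlAccessType.FIELD)
public class ResourceUtilizationInfo {
  private long nodePhysicalMemoryMB;
  private long nodeVirtualMemoryMB;
  private float nodeCPUUsage;

  public ResourceUtilizationInfo() {
    // JAXB requires a no-arg constructor
  }

  public ResourceUtilizationInfo(long physMemMB, long virtMemMB, float cpuUsage) {
    this.nodePhysicalMemoryMB = physMemMB;
    this.nodeVirtualMemoryMB = virtMemMB;
    this.nodeCPUUsage = cpuUsage;
  }

  public long getNodePhysicalMemoryMB() {
    return nodePhysicalMemoryMB;
  }

  public long getNodeVirtualMemoryMB() {
    return nodeVirtualMemoryMB;
  }

  public float getNodeCPUUsage() {
    return nodeCPUUsage;
  }
}
{code}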

> ResourceUtilization should be a part of NodeInfo REST API
> -
>
> Key: YARN-4292
> URL: https://issues.apache.org/jira/browse/YARN-4292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Sunil G
> Attachments: 0001-YARN-4292.patch, 0002-YARN-4292.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state.

2015-11-03 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3946:

Attachment: YARN-3946.v1.001.patch
YARN3946_attemptDiagnistic message.png

Hi [~wangda], [~rohithsharma], [~sunilg], [~sumit.nigam] & [~nijel].

As mentioned by Wangda in his 
[comment|https://issues.apache.org/jira/browse/YARN-4091?focusedCommentId=14735266=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14735266]
 in YARN-4091, it is very difficult to capture the status when the *app's leaf 
queue or parent queue is beyond its limit*, as it would not be good to loop 
through all the apps in the hierarchy and update the status on every node 
update, and doing so would also lose important information from previous updates.

So I think the valid cases where we can update AMLaunchDiagnostics in 
SchedulerApplicationAttempt (for the CapacityScheduler) are the following (a 
rough sketch of these cases appears after this comment):

* App is in the pending state because of the AM limit/user limit of the queue
* App is waiting for resources of the partition for the AM to be launched (once 
moved out of the pending state)
* App is waiting for resources of the partition for the AM to be launched and 
some nodes are blacklisted (if it fails to launch because of blacklisted nodes)
* The AM limit of the queue doesn't allow the launch
* The user limit of the queue doesn't allow the launch

Please check whether the approach is proper; if it is useful and required, we 
can do something similar for the FairScheduler as well. cc/ [~ka...@cloudera.com]

I have also taken the liberty to fix some small issues in 
{{SchedulerApplicationAttempt.isWaitingForAMContainer}} in the same patch; if 
required, I can raise another JIRA and put these small changes there.
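
A small illustrative sketch of how the cases above could be captured; the enum name 
and messages are assumptions, not the patch contents:
{code}
// Illustrative only: one diagnostic message per case listed above.
enum AMLaunchDiagnosticSketch {
  PENDING_ON_QUEUE_LIMITS("Application is pending: AM limit/user limit of the queue reached"),
  WAITING_FOR_PARTITION("Waiting for resources in the requested partition to launch the AM"),
  WAITING_WITH_BLACKLISTED_NODES("Waiting for AM resources; some nodes are blacklisted"),
  QUEUE_AM_LIMIT_EXCEEDED("The AM limit of the queue does not allow the launch"),
  USER_LIMIT_EXCEEDED("The user limit of the queue does not allow the launch");

  private final String message;

  AMLaunchDiagnosticSketch(String message) {
    this.message = message;
  }

  String message() {
    return message;
  }
}
{code}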


> Allow fetching exact reason as to why a submitted app is in ACCEPTED state.
> ---
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Sumit Nigam
>Assignee: Naganarasimha G R
> Attachments: YARN-3946.v1.001.patch, YARN3946_attemptDiagnistic 
> message.png
>
>
> Currently there is no direct way to get the exact reason why a 
> submitted app is still in the ACCEPTED state. It should be possible to learn 
> through the RM REST API which aspect is not being met - say, queue limits 
> being reached, core/memory requirements not being met, the AM limit being 
> reached, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988182#comment-14988182
 ] 

Hadoop QA commented on YARN-1510:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 5s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
2s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
26s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
26s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
59s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 12s 
{color} | {color:red} hadoop-yarn-applications-distributedshell in the patch 
failed. {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 47s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 47s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 44s 
{color} | {color:green} hadoop-yarn-applications-distributedshell in the patch 
passed with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 49m 17s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_60. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 50s 
{color} | {color:green} hadoop-yarn-applications-distributedshell in the patch 
passed with JDK v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 49m 27s {color} 
| {color:red} hadoop-yarn-client in the patch failed with JDK v1.7.0_79. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
21s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 126m 20s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | hadoop.yarn.client.TestGetGroups |
| JDK v1.8.0_60 Timed out junit tests | 
org.apache.hadoop.yarn.client.api.impl.TestYarnClient |
|   | org.apache.hadoop.yarn.client.api.impl.TestAMRMClient |
|   | org.apache.hadoop.yarn.client.api.impl.TestNMClient |
| JDK v1.7.0_79 Failed junit tests | hadoop.yarn.client.TestGetGroups |
| JDK v1.7.0_79 Timed out junit tests | 
org.apache.hadoop.yarn.client.api.impl.TestYarnClient |
|   | 

[jira] [Commented] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit

2015-11-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988238#comment-14988238
 ] 

Wangda Tan commented on YARN-3769:
--

[~eepayne], thanks for the update.

bq. If you want, we can pull this out and put it as part of a different JIRA so 
we can document and discuss that particular flapping situation separately.
I would prefer to make it a separate JIRA, since it is not a directly 
related fix. I will review PCPP after you separate those changes (since you're 
OK with making it separate :))

bq. Yes, you are correct. getHeadroom could be calculating zero headroom when 
we don't want it to. And, I agree that we don't need to limit pending resources 
to max queue capacity when calculating pending resources. The concern for this 
fix is that user limit factor should be considered and limit the pending value. 
The max queue capacity will be considered during the offer stage of the 
preemption calculations.

I agree with your existing approach; user-limit should be capped by max queue 
capacity as well.

One nit for LeafQueue changes:
{code}
minPendingAndPreemptable =
    Resources.componentwiseMax(Resources.none(),
        Resources.subtract(
            userNameToHeadroom.get(userName), minPendingAndPreemptable));
{code}

You don't need to do componentwiseMax here, since minPendingAndPreemptable <= 
headroom, and you can use subtractFrom to make the code simpler.
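
i.e. something along these lines (a fragment of the same snippet, assuming the 
in-place update of the headroom entry is acceptable here):
{code}
// minPendingAndPreemptable is already <= the user's headroom, so the
// componentwiseMax(Resources.none(), ...) guard is redundant.
minPendingAndPreemptable = Resources.subtractFrom(
    userNameToHeadroom.get(userName), minPendingAndPreemptable);
{code}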

> Preemption occurring unnecessarily because preemption doesn't consider user 
> limit
> -
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, 
> YARN-3769.003.patch, YARN-3769.004.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988249#comment-14988249
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1356 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1356/])
YARN-4326. Fix TestDistributedShell timeout as AHS in MiniYarnCluster no 
(wangda: rev 0783184f4b3f669f7211e42b395b62d63144100d)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in ApplicationMaster, which fails to connect to the 
> timeline server until the connection retries exceed their limit:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988436#comment-14988436
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #622 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/622/])
YARN-4326. Fix TestDistributedShell timeout as AHS in MiniYarnCluster no 
(wangda: rev 0783184f4b3f669f7211e42b395b62d63144100d)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java
* hadoop-yarn-project/CHANGES.txt


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in the ApplicationMaster, which fails to connect to the 
> timeline server until the retry limit is exceeded:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9

2015-11-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988106#comment-14988106
 ] 

Jason Lowe commented on YARN-2014:
--

No, AFAIK this was never fixed.  As I mentioned earlier, my best guess was that 
it was related to the significantly increased classloading that 2.x is doing 
relative to 0.23.
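If the classloading hypothesis is to be tested, the cheapest probe is to compare 
the JVM's class-loading counters between the 0.23 and 2.4 AMs running the same 
sleep job. A generic hedged sketch (plain JMX, not code from the AM scalability 
benchmark itself):

{code}
// Generic JMX snapshot of class-loading work; diffing this between the 0.23 and
// 2.x AM runs (or running with -verbose:class) would confirm or rule out the
// classloading hypothesis. Not code from the AM scalability benchmark.
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassLoadSnapshot {
  public static void main(String[] args) {
    ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
    System.out.println("loaded=" + cl.getTotalLoadedClassCount()
        + " unloaded=" + cl.getUnloadedClassCount()
        + " live=" + cl.getLoadedClassCount());
  }
}
{code}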

> Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
> 
>
> Key: YARN-2014
> URL: https://issues.apache.org/jira/browse/YARN-2014
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: patrick white
>Assignee: Jason Lowe
>
> Performance comparison benchmarks of 2.x against 0.23 show that the AM 
> scalability benchmark's runtime is approximately 10% slower in 2.4.0. The trend 
> is consistent across later releases in both lines; the latest release numbers are:
> 2.4.0.0 runtime 255.6 seconds (avg of 5 passes)
> 0.23.9.12 runtime 230.4 seconds (avg of 5 passes)
> Diff: -9.9%
> The AM scalability test is essentially a sleep job that measures the time to 
> launch and complete a large number of mappers.
> The diff is consistent and has been reproduced both in a larger (350 nodes, 
> 100,000 mappers) perf environment and in a small (10 nodes, 2,900 
> mappers) demo cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-11-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988114#comment-14988114
 ] 

Hadoop QA commented on YARN-4311:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 7s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 11s 
{color} | {color:red} Patch generated 1 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 (total was 86, now 87). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 32s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 11s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
22s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 129m 2s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestResourceTrackerService |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_79 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestResourceTrackerService |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 
Image:test-patch-base-hadoop-date2015-11-03 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12770360/YARN-4311-v1.patch |
| JIRA Issue | YARN-4311 |
| Optional Tests |  asflicense  javac  javadoc  

[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-03 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988091#comment-14988091
 ] 

MENG DING commented on YARN-1510:
-

I logged YETUS-159 for this issue.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197 and YARN-1449, we need to add an API in NMClient to support
> 1) sending requests to increase/decrease container resource limits
> 2) getting the succeeded/failed changed-container responses from the NM.
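For context, a purely hypothetical shape for such an addition (the interface and 
method names below are assumptions meant to illustrate points 1 and 2, not the 
API in the attached patches):

{code}
// Hypothetical sketch only -- names and signatures are assumptions, not the
// NMClient API committed under YARN-1510.
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.exceptions.YarnException;

public interface NMClientResourceChange {
  /**
   * 1) Ask the NM to apply an RM-approved change to a running container's
   * resource limits; the container token is expected to carry the new Resource.
   */
  void changeContainerResource(Container container)
      throws YarnException, IOException;

  /**
   * 2) Report back which requested changes the NM accepted or rejected, so the
   * caller (e.g. an async callback handler) can react per container.
   */
  interface ChangeListener {
    void onContainerResourceChanged(Container container);
    void onChangeContainerResourceError(Container container, Throwable error);
  }
}
{code}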



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-11-03 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3136:
-
Attachment: YARN-3136.branch-2.7.patch

> getTransferredContainers can be a bottleneck during AM registration
> ---
>
> Key: YARN-3136
> URL: https://issues.apache.org/jira/browse/YARN-3136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Sunil G
>  Labels: 2.7.2-candidate
> Fix For: 2.8.0, 2.7.2
>
> Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 
> 00011-YARN-3136.patch, 00012-YARN-3136.patch, 00013-YARN-3136.patch, 
> 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 
> 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 
> 0008-YARN-3136.patch, 0009-YARN-3136.patch, YARN-3136.branch-2.7.patch
>
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
> stuck waiting for the scheduler lock trying to call getTransferredContainers. 
>  The scheduler lock is highly contended, especially on a large cluster with 
> many nodes heartbeating, and it would be nice if we could find a way to 
> eliminate the need to grab this lock during this call.  We've already done 
> similar work during AM allocate calls to make sure they don't needlessly grab 
> the scheduler lock, and it would be good to do so here as well, if possible.
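The general shape of the remedy being discussed is to serve this read from a 
thread-safe per-attempt structure instead of taking the scheduler-wide lock. A 
generic hedged sketch of that pattern (names are illustrative, this is not the 
attached patch, and String stands in for ApplicationAttemptId/Container):

{code}
// Hedged sketch of the lock-avoidance pattern only, not the YARN-3136 patch.
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class TransferredContainerStore {
  // Written when containers are transferred during recovery/registration,
  // read during AM registration without holding any scheduler-wide lock.
  private final ConcurrentHashMap<String, List<String>> byAttempt =
      new ConcurrentHashMap<>();

  void add(String attemptId, String containerId) {
    byAttempt.computeIfAbsent(attemptId, k -> new CopyOnWriteArrayList<>())
        .add(containerId);
  }

  List<String> getTransferredContainers(String attemptId) {
    // No synchronized(scheduler) here: the concurrent map makes the read safe,
    // so registering AMs never queue up behind node-heartbeat processing.
    List<String> containers = byAttempt.get(attemptId);
    return containers == null ? Collections.<String>emptyList() : containers;
  }
}
{code}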



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-11-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988110#comment-14988110
 ] 

Wangda Tan commented on YARN-3136:
--

Backported and committed to branch-2.7. Ran all resourcemanager tests before 
pushing. Attaching the patch that was committed to branch-2.7.

> getTransferredContainers can be a bottleneck during AM registration
> ---
>
> Key: YARN-3136
> URL: https://issues.apache.org/jira/browse/YARN-3136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Sunil G
>  Labels: 2.7.2-candidate
> Fix For: 2.8.0, 2.7.2
>
> Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 
> 00011-YARN-3136.patch, 00012-YARN-3136.patch, 00013-YARN-3136.patch, 
> 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 
> 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 
> 0008-YARN-3136.patch, 0009-YARN-3136.patch, YARN-3136.branch-2.7.patch
>
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
> stuck waiting for the scheduler lock trying to call getTransferredContainers. 
>  The scheduler lock is highly contended, especially on a large cluster with 
> many nodes heartbeating, and it would be nice if we could find a way to 
> eliminate the need to grab this lock during this call.  We've already done 
> similar work during AM allocate calls to make sure they don't needlessly grab 
> the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-03 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987966#comment-14987966
 ] 

MENG DING commented on YARN-1510:
-

[~leftnoteasy], should I log this in the Apache Yetus community?

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197 and YARN-1449, we need to add an API in NMClient to support
> 1) sending requests to increase/decrease container resource limits
> 2) getting the succeeded/failed changed-container responses from the NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4236) Metric for aggregated resources allocation per queue

2015-11-03 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated YARN-4236:
---
Attachment: YARN-4236.2.patch

The .2 patch addresses the checkstyle issue.

> Metric for aggregated resources allocation per queue
> 
>
> Key: YARN-4236
> URL: https://issues.apache.org/jira/browse/YARN-4236
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4236.2.patch, YARN-4236.patch
>
>
> We currently track allocated memory and allocated vcores per queue, but we 
> don't have a good rate metric for how fast we're allocating these resources. In 
> other words, a flat line in allocatedMB could equally mean one extreme where 
> no new containers are being allocated, or constant allocation of containers 
> where we free exactly what we allocate each time. Adding a 
> resources-allocated-per-second metric per queue would give us better insight 
> into the rate of resource churn on a queue. Based on this aggregated resource 
> allocation per queue, tools can easily measure the rate of resource 
> allocation per queue.
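The metric described can be built from a cumulative per-queue counter that a 
monitor samples periodically to derive a rate. A plain-Java hedged sketch of the 
idea (not the QueueMetrics change in the attached patches):

{code}
// Hedged sketch of the idea only, not the attached patch: keep cumulative
// "resources allocated" counters per queue; sampling a counter every N seconds
// and dividing the delta by N yields the allocation rate, even when allocatedMB
// itself stays flat because releases balance new allocations.
import java.util.concurrent.atomic.AtomicLong;

class QueueAllocationAggregate {
  private final AtomicLong aggregateMemoryMBAllocated = new AtomicLong();
  private final AtomicLong aggregateVcoresAllocated = new AtomicLong();

  // Called for every container allocated in the queue.
  void onContainerAllocated(long memoryMB, long vcores) {
    aggregateMemoryMBAllocated.addAndGet(memoryMB);
    aggregateVcoresAllocated.addAndGet(vcores);
  }

  long getAggregateMemoryMBAllocated() { return aggregateMemoryMBAllocated.get(); }
  long getAggregateVcoresAllocated()   { return aggregateVcoresAllocated.get(); }
}
{code}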



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988083#comment-14988083
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2563 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2563/])
YARN-4326. Fix TestDistributedShell timeout as AHS in MiniYarnCluster no 
(wangda: rev 0783184f4b3f669f7211e42b395b62d63144100d)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in the ApplicationMaster, which fails to connect to the 
> timeline server until the retry limit is exceeded:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-11-03 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3136:
-
Fix Version/s: 2.7.2

> getTransferredContainers can be a bottleneck during AM registration
> ---
>
> Key: YARN-3136
> URL: https://issues.apache.org/jira/browse/YARN-3136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Sunil G
>  Labels: 2.7.2-candidate
> Fix For: 2.8.0, 2.7.2
>
> Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 
> 00011-YARN-3136.patch, 00012-YARN-3136.patch, 00013-YARN-3136.patch, 
> 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 
> 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 
> 0008-YARN-3136.patch, 0009-YARN-3136.patch
>
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
> stuck waiting for the scheduler lock trying to call getTransferredContainers. 
>  The scheduler lock is highly contended, especially on a large cluster with 
> many nodes heartbeating, and it would be nice if we could find a way to 
> eliminate the need to grab this lock during this call.  We've already done 
> similar work during AM allocate calls to make sure they don't needlessly grab 
> the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988003#comment-14988003
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8748 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8748/])
YARN-4326. Fix TestDistributedShell timeout as AHS in MiniYarnCluster no 
(wangda: rev 0783184f4b3f669f7211e42b395b62d63144100d)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in the ApplicationMaster, which fails to connect to the 
> timeline server until the retry limit is exceeded:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988457#comment-14988457
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2504 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2504/])
YARN-4326. Fix TestDistributedShell timeout as AHS in MiniYarnCluster no 
(wangda: rev 0783184f4b3f669f7211e42b395b62d63144100d)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java
* hadoop-yarn-project/CHANGES.txt


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in the ApplicationMaster, which fails to connect to the 
> timeline server until the retry limit is exceeded:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4184) Remove update reservation state api from state store as its not used by ReservationSystem

2015-11-03 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-4184:
-
Assignee: Sean Po  (was: Subru Krishnan)

> Remove update reservation state api from state store as its not used by 
> ReservationSystem
> -
>
> Key: YARN-4184
> URL: https://issues.apache.org/jira/browse/YARN-4184
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Anubhav Dhoot
>Assignee: Sean Po
>
> ReservationSystem uses remove/add for updates, and thus an update API in the 
> state store is not needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state.

2015-11-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988640#comment-14988640
 ] 

Wangda Tan commented on YARN-3946:
--

Hi [~Naganarasimha], 
Thanks for working on this. The general idea of this approach looks good; a few 
suggestions about what to show: 
- AM launch diagnostics should have an initial value after the attempt is added 
to the scheduler:
For an unmanaged AM, it should be "User launched the Application Master since 
it's unmanaged".
For a managed AM, it should be "Added to scheduler, waiting to be scheduled", 
with some general suggestions about configurations to look at, such as 
user-limit, am-percent, queue-limit, etc.
- Looping over all applications when a queue exceeds its limit is too costly. 
I'd prefer to do nothing when this happens.
- After the application has moved to the activated state, if the application is 
traversed by the scheduler but cannot allocate any resource, you may put 
something like "Trying to allocate to AM on node=x, etc.". After YARN-4091 we 
should be able to get more detailed information about why this happened.
- Not caused by your patch: isWaitingForAMContainer checks whether the master 
container has been created; you may also need to check whether the application 
is in the recovering state, because the AM could contact the RM before the AM 
container is recovered by the RM.
- Similar to the above, you may need to put up a diagnostic message while the AM 
is being recovered by the RM.
- After the AM is launched, the diagnostic could be something like "AM is 
launched", which is better than empty text.

Regarding the implementation:
- Since RMAppAttempt and SchedulerApplicationAttempt have a 1-to-1 relationship, 
we can keep a reference to the RMAppAttempt in SchedulerApplicationAttempt, 
which avoids fetching it from {{RMContext.getRMApps()...}}
- Since String is immutable, amLaunchDiagnostics can be volatile, so we don't 
need to acquire locks (see the sketch after this list).
- I suggest adding this to the REST API / web UI together with this patch if the 
changes are not complex.
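To make the volatile-String point above concrete, a minimal hedged sketch (class 
and field names are assumptions based on this discussion, not the patch):

{code}
// Minimal hedged sketch of the volatile-diagnostics idea; not the YARN-3946 patch.
class AmLaunchDiagnosticsHolder {
  // String is immutable, so a volatile reference gives readers a consistent
  // value without acquiring any scheduler or attempt lock.
  private volatile String amLaunchDiagnostics =
      "Added to scheduler, waiting to be scheduled";

  void update(String message) {   // called from scheduler threads
    amLaunchDiagnostics = message;
  }

  String get() {                  // called from REST API / web UI threads
    return amLaunchDiagnostics;
  }
}
{code}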

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state.
> ---
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Sumit Nigam
>Assignee: Naganarasimha G R
> Attachments: YARN-3946.v1.001.patch, YARN3946_attemptDiagnistic 
> message.png
>
>
> Currently there is no direct way to get the exact reason as to why a 
> submitted app is still in ACCEPTED state. It should be possible to know 
> through RM REST API as to what aspect is not being met - say, queue limits 
> being reached, or core/ memory requirement not being met, or AM limit being 
> reached, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4329) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in Fair Scheduler

2015-11-03 Thread Naganarasimha G R (JIRA)
Naganarasimha G R created YARN-4329:
---

 Summary: Allow fetching exact reason as to why a submitted app is 
in ACCEPTED state in Fair Scheduler
 Key: YARN-4329
 URL: https://issues.apache.org/jira/browse/YARN-4329
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: fairscheduler, resourcemanager
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R


Similar to YARN-3946, it would be useful to capture possible reason why the 
Application is in accepted state in FairScheduler



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-11-03 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988740#comment-14988740
 ] 

Naganarasimha G R commented on YARN-4091:
-

Hi [~sunilg], I think we need to update the document, as there have been 
significant changes in the approach. Also, we can create 2 sub-JIRAs for the 
REST-based tracing of the single-node update call: one for whatever you are 
working on for CS, and I would like to start working on a similar implementation 
on the FairScheduler side. OK?

> Improvement: Introduce more debug/diagnostics information to detail out 
> scheduler activity
> --
>
> Key: YARN-4091
> URL: https://issues.apache.org/jira/browse/YARN-4091
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: Improvement on debugdiagnostic information - YARN.pdf
>
>
> As schedulers are improved with various new capabilities, more of the 
> configurations that tune the schedulers start to take actions such as limiting 
> the assignment of containers to an application, or introducing a delay before 
> allocating a container, etc. No clear information is passed down from the 
> scheduler to the outside world under these various scenarios. This makes 
> debugging much tougher.
> This ticket is an effort to introduce more clearly defined states at the 
> various points where the scheduler skips/rejects a container assignment, 
> activates an application, etc. Such information will help users know what is 
> happening in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve 
> on this as we discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988513#comment-14988513
 ] 

Wangda Tan commented on YARN-1510:
--

I ran the failed tests locally (yarn-client). Some tests failed on my 
local box; I'm not sure if they're related.

{code}
Failed tests:
  TestYarnClient.testReservationAPIs:1207 The size of the largest gang in the 
reservation refinition () exceed the capacity available 
( )
at 
org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45)
at 
org.apache.hadoop.yarn.server.resourcemanager.reservation.ReservationInputValidator.validateReservationDefinition(ReservationInputValidator.java:168)
at 
org.apache.hadoop.yarn.server.resourcemanager.reservation.ReservationInputValidator.validateReservationSubmissionRequest(ReservationInputValidator.java:212)
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitReservation(ClientRMService.java:1206)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitReservation(ApplicationClientProtocolPBServiceImpl.java:452)
at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:499)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)


Tests in error:
  TestAMRMClient.setup:146 » IndexOutOfBounds Index: 0, Size: 0
  TestNMClient.testNMClient:227->allocateContainers:244 » IndexOutOfBounds 
Index...
  TestNMClient.testNMClientNoCleanupOnStop:210->allocateContainers:244 » 
IndexOutOfBounds
{code}

YARN-4326 was included when running tests.

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197 and YARN-1449, we need to add an API in NMClient to support
> 1) sending requests to increase/decrease container resource limits
> 2) getting the succeeded/failed changed-container responses from the NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2681:
--
Target Version/s: 2.8.0  (was: 2.7.2)

Moving new-features out of 2.7 maintenance lines into 2.8.0.

> Support bandwidth enforcement for containers while reading from HDFS
> 
>
> Key: YARN-2681
> URL: https://issues.apache.org/jira/browse/YARN-2681
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.5.1
> Environment: Linux
>Reporter: Nam H. Do
>  Labels: BB2015-05-TBR
> Attachments: Traffic Control Design.png, YARN-2681.001.patch, 
> YARN-2681.002.patch, YARN-2681.003.patch, YARN-2681.004.patch, 
> YARN-2681.005.patch, YARN-2681.patch
>
>
> To read/write data from HDFS on a data node, applications establish TCP/IP 
> connections with the datanode. HDFS reads can be controlled by setting up the 
> Linux Traffic Control (TC) subsystem on the data node to apply filters on the 
> appropriate connections.
> The current cgroups net_cls concept cannot be applied on the node where the 
> container is launched, nor on the data node, since:
> -   TC handles outgoing bandwidth only, so it cannot be applied on the 
> container node (an HDFS read is incoming data for the container)
> -   Since the HDFS data node is handled by a single process, it is not possible 
> to use net_cls to separate connections from different containers to the 
> datanode.
> Tasks:
> 1) Extend the Resource model to define a bandwidth enforcement rate
> 2) Monitor the TCP/IP connections established by the container's handling 
> process and its child processes
> 3) Set Linux Traffic Control rules on the data node based on address:port pairs 
> in order to enforce the bandwidth of outgoing data (see the sketch below)
> Concept: http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf
> Implementation: 
> http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl.pdf
> http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl_UML_diagram.png
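To illustrate task 3, enforcement would amount to the data-node side issuing 
Linux tc commands keyed on the observed address:port pairs. The sketch below is 
a hedged illustration only: the device name, class handles, rates and the 
assumption of a pre-existing HTB root qdisc are made-up examples, and the 
attached patches may drive tc quite differently.

{code}
// Hedged illustration of task 3: shape traffic from the datanode toward one
// container connection with Linux tc, invoked via ProcessBuilder. Device,
// class handles and rates are example values; an HTB root qdisc "1:" is
// assumed to exist already.
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class TcEnforcer {
  private static void run(List<String> cmd)
      throws IOException, InterruptedException {
    new ProcessBuilder(cmd).inheritIO().start().waitFor();
  }

  static void limit(String dev, String containerIp, int containerPort,
      String rate) throws IOException, InterruptedException {
    // One HTB class for the enforced connection...
    run(Arrays.asList("tc", "class", "add", "dev", dev, "parent", "1:",
        "classid", "1:10", "htb", "rate", rate));
    // ...and a u32 filter matching the container's address:port pair.
    run(Arrays.asList("tc", "filter", "add", "dev", dev, "protocol", "ip",
        "parent", "1:", "prio", "1", "u32",
        "match", "ip", "dst", containerIp,
        "match", "ip", "dport", String.valueOf(containerPort), "0xffff",
        "flowid", "1:10"));
  }
}
{code}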



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3575) Job using 2.5 jars fails on a 2.6 cluster whose RM has been restarted

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-3575.
---
Resolution: Won't Fix

Agreed. I requested that this be filed largely for documentation concerns.

There is one simple way sites can avoid this _incompatibility_: Do not use 
RM-recovery on a 2.6+ cluster while you still have applications running 
versions < 2.6. Once all apps are upgraded to 2.6+, you can enable RM-recovery 
and perform work-preserving RM restarts and rolling upgrades.

Closing this for now as won't fix. Please reopen if you disagree. Thanks.

> Job using 2.5 jars fails on a 2.6 cluster whose RM has been restarted
> -
>
> Key: YARN-3575
> URL: https://issues.apache.org/jira/browse/YARN-3575
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>
> Trying to launch a job that uses the 2.5 jars fails on a 2.6 cluster whose RM 
> has been restarted (i.e.: epoch != 0) because the epoch number starts 
> appearing in the container IDs and the 2.5 jars no longer know how to parse 
> the container IDs.
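For reference, the incompatibility is purely in the textual container-ID layout; 
a hedged illustration (the IDs below are made-up examples, see ContainerId's 
toString for the authoritative format):

{code}
// Hedged illustration, not Hadoop source: container IDs gain an epoch token
// once the RM has been restarted, which pre-2.6 parsers do not expect.
String beforeRestart = "container_1449000000000_0001_01_000001";      // epoch 0
String afterRestart  = "container_e17_1449000000000_0001_01_000001";  // epoch 17
for (String id : new String[] {beforeRestart, afterRestart}) {
  // A 2.5-era parser that splits on '_' and expects a fixed number of fields
  // rejects the second form, which is the failure mode described above.
  System.out.println(id + " -> " + (id.split("_").length - 1)
      + " fields after the 'container' prefix");
}
{code}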



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4236) Metric for aggregated resources allocation per queue

2015-11-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988485#comment-14988485
 ] 

Hadoop QA commented on YARN-4236:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 5s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
50s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 13s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 13s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
21s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 128m 27s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_79 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 
Image:test-patch-base-hadoop-date2015-11-03 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12770399/YARN-4236.2.patch |
| JIRA Issue | YARN-4236 |
| Optional Tests |  asflicense  javac  javadoc  mvninstall  unit  findbugs  
checkstyle  compile  |
| uname | Linux 29147a072c07 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 

[jira] [Updated] (YARN-4226) Make capacity scheduler queue's preemption status REST API consistent with GUI

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-4226:
--
Target Version/s: 2.7.3  (was: 3.0.0, 2.8.0, 2.7.2)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Make capacity scheduler queue's preemption status REST API consistent with GUI
> --
>
> Key: YARN-4226
> URL: https://issues.apache.org/jira/browse/YARN-4226
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> In the capacity scheduler GUI, the preemption status has the following form:
> {code}
> Preemption:   disabled
> {code}
> However, the REST API shows the following for the same status:
> {code}
> preemptionDisabled":true
> {code}
> The latter is confusing and should be consistent with the format in the GUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3755) Log the command of launching containers

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3755:
--
Target Version/s: 2.7.3  (was: 2.7.2)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Log the command of launching containers
> ---
>
> Key: YARN-3755
> URL: https://issues.apache.org/jira/browse/YARN-3755
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.7.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: YARN-3755-1.patch, YARN-3755-2.patch
>
>
> In the resource manager log, YARN logs the command for launching the AM, which 
> is very useful. But there is no such log in the NM log for launching 
> containers, which makes it difficult to diagnose containers that fail to launch 
> due to some issue in the commands. Users can look at the commands in the 
> container launch script file, but that is an internal detail of YARN that users 
> usually don't know about. From the user's perspective, they only know what 
> commands they specified when building the YARN application. 
> {code}
> 2015-06-01 16:06:42,245 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command 
> to launch container container_1433145984561_0001_01_01 : 
> $JAVA_HOME/bin/java -server -Djava.net.preferIPv4Stack=true 
> -Dhadoop.metrics.log.level=WARN  -Xmx1024m  
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dtez.root.logger=info,CLA 
> -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 
> 1>/stdout 2>/stderr
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3477) TimelineClientImpl swallows exceptions

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3477:
--
Target Version/s: 2.7.3  (was: 2.7.2)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> TimelineClientImpl swallows exceptions
> --
>
> Key: YARN-3477
> URL: https://issues.apache.org/jira/browse/YARN-3477
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-3477-001.patch, YARN-3477-002.patch
>
>
> If the timeline client fails more than the retry count, the original exception 
> is not thrown. Instead some runtime exception is raised saying "retries run out".
> # the failing exception should be rethrown, ideally via 
> NetUtils.wrapException to include the URL of the failing endpoint
> # Otherwise, the raised RTE should (a) state that URL and (b) set the 
> original fault as the inner cause
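A hedged sketch of suggestion (b), i.e. naming the URL and preserving the 
original fault as the cause (illustrative only, not the attached patches):

{code}
// Hedged sketch of suggestion (b) above, not the attached patch: when retries
// are exhausted, keep the failing URL in the message and the last exception as
// the cause instead of swallowing it.
class TimelineRetryFailure {
  static RuntimeException retriesExhausted(String url, Exception lastFailure) {
    return new RuntimeException(
        "Failed to connect to timeline server at " + url
            + ": connection retries limit exceeded;"
            + " the posted timeline event may be missing",
        lastFailure);   // original fault preserved as the inner cause
  }
}
{code}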



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3898) YARN web console only proxies GET to application master but doesn't provide any feedback for other HTTP methods

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3898:
--
Target Version/s: 2.7.3  (was: 2.7.2)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> YARN web console only proxies GET to application master but doesn't provide 
> any feedback for other HTTP methods
> ---
>
> Key: YARN-3898
> URL: https://issues.apache.org/jira/browse/YARN-3898
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.1
>Reporter: Kam Kasravi
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> YARN web console should provide some feedback when filtering (and preventing) 
> DELETE, POST, PUT, etc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3344) procfs stat file is not in the expected format warning

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3344:
--
Target Version/s: 2.7.3  (was: 2.8.0, 2.7.2)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> procfs stat file is not in the expected format warning
> --
>
> Key: YARN-3344
> URL: https://issues.apache.org/jira/browse/YARN-3344
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Jon Bringhurst
>Assignee: Ravindra Kumar Naik
> Attachments: YARN-3344-trunk.005.patch
>
>
> Although this doesn't appear to be causing any functional issues, it is 
> spamming our log files quite a bit. :)
> It appears that the regex in ProcfsBasedProcessTree doesn't work for all 
> /proc//stat files.
> Here's the error I'm seeing:
> {noformat}
> "source_host": "asdf",
> "method": "constructProcessInfo",
> "level": "WARN",
> "message": "Unexpected: procfs stat file is not in the expected format 
> for process with pid 6953"
> "file": "ProcfsBasedProcessTree.java",
> "line_number": "514",
> "class": "org.apache.hadoop.yarn.util.ProcfsBasedProcessTree",
> {noformat}
> And here's the basic info on process with pid 6953:
> {noformat}
> [asdf ~]$ cat /proc/6953/stat
> 6953 (python2.6 /expo) S 1871 1871 1871 0 -1 4202496 9364 1080 0 0 25 3 0 0 
> 20 0 1 0 144918696 205295616 5856 18446744073709551615 1 1 0 0 0 0 0 16781312 
> 2 18446744073709551615 0 0 17 13 0 0 0 0 0
> [asdf ~]$ ps aux|grep 6953
> root  6953  0.0  0.0 200484 23424 ?S21:44   0:00 python2.6 
> /export/apps/salt/minion-scripts/module-sync.py
> jbringhu 13481  0.0  0.0 105312   872 pts/0S+   22:13   0:00 grep -i 6953
> [asdf ~]$ 
> {noformat}
> This is using 2.6.32-431.11.2.el6.x86_64 in RHEL 6.5.
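The sample stat line above hints at the trip-up: the comm field inside the 
parentheses ("python2.6 /expo") contains a space and a slash, which a simple 
token-oriented regex will not match. A hedged sketch of a tolerant way to split 
such a line (not the actual ProcfsBasedProcessTree regex):

{code}
// Hedged sketch, not the ProcfsBasedProcessTree code: split a /proc/<pid>/stat
// line even when the comm field inside "( )" contains spaces or slashes, by
// locating the last ')' instead of regex-matching a simple token.
String stat =
    "6953 (python2.6 /expo) S 1871 1871 1871 0 -1 4202496 9364 1080 0 0 25 3";
int open = stat.indexOf('(');
int close = stat.lastIndexOf(')');
String pid   = stat.substring(0, open).trim();            // "6953"
String comm  = stat.substring(open + 1, close);           // "python2.6 /expo"
String[] rest = stat.substring(close + 1).trim().split("\\s+");
String state = rest[0];                                   // "S"
System.out.println(pid + " [" + comm + "] state=" + state);
{code}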



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-4225:
--
Target Version/s: 2.7.3  (was: 3.0.0, 2.8.0, 2.7.2)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Add preemption status to yarn queue -status for capacity scheduler
> --
>
> Key: YARN-4225
> URL: https://issues.apache.org/jira/browse/YARN-4225
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3687) We should be able to remove node-label if there's no queue can use it.

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3687:
--
Target Version/s: 2.7.3  (was: 2.7.2)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> We should be able to remove node-label if there's no queue can use it.
> --
>
> Key: YARN-3687
> URL: https://issues.apache.org/jira/browse/YARN-3687
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> Currently, we cannot remove a node label from the cluster if no queue 
> configures it, but actually we should be able to remove it if the capacity for 
> that node label in the root queue is 0. This avoids pain when a user wants to 
> reconfigure node labels.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3769:
--
Target Version/s: 2.7.3  (was: 2.8.0, 2.7.2)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Preemption occurring unnecessarily because preemption doesn't consider user 
> limit
> -
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.7.002.patch, YARN-3769-branch-2.7.003.patch, 
> YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, 
> YARN-3769.003.patch, YARN-3769.004.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS

2015-11-03 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3946:

Component/s: capacity scheduler

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state in 
> CS
> 
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Sumit Nigam
>Assignee: Naganarasimha G R
> Attachments: YARN-3946.v1.001.patch, YARN3946_attemptDiagnistic 
> message.png
>
>
> Currently there is no direct way to get the exact reason why a submitted app 
> is still in the ACCEPTED state. It should be possible to learn through the RM 
> REST API which requirement is not being met - say, queue limits being 
> reached, core/memory requirements not being met, or the AM limit being 
> reached, etc.
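
For context on what is available today, below is a minimal, hypothetical sketch (class name and usage assumed, not from this JIRA) that reads the free-form diagnostics string a client can currently obtain for an application stuck in ACCEPTED via {{YarnClient}}; the point of this JIRA is that this string does not yet carry the exact scheduler-side reason.

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class AcceptedAppDiagnostics {
  public static void main(String[] args) throws Exception {
    // Hypothetical usage: pass an application id such as application_1446500000000_0001
    ApplicationId appId = ConverterUtils.toApplicationId(args[0]);

    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      ApplicationReport report = client.getApplicationReport(appId);
      // Today the report only exposes the state and a free-form diagnostics
      // string; it does not say which limit (queue, user, AM resource) blocks the app.
      System.out.println("State: " + report.getYarnApplicationState());
      System.out.println("Diagnostics: " + report.getDiagnostics());
    } finally {
      client.stop();
    }
  }
}
{code}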



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS

2015-11-03 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3946:

Summary: Allow fetching exact reason as to why a submitted app is in 
ACCEPTED state in CS  (was: Allow fetching exact reason as to why a submitted 
app is in ACCEPTED state.)

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state in 
> CS
> 
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Sumit Nigam
>Assignee: Naganarasimha G R
> Attachments: YARN-3946.v1.001.patch, YARN3946_attemptDiagnistic 
> message.png
>
>
> Currently there is no direct way to get the exact reason why a submitted app 
> is still in the ACCEPTED state. It should be possible to learn through the RM 
> REST API which requirement is not being met - say, queue limits being 
> reached, core/memory requirements not being met, or the AM limit being 
> reached, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2746) YARNDelegationTokenID misses serializing version from the common abstract ID

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2746:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> YARNDelegationTokenID misses serializing version from the common abstract ID
> 
>
> Key: YARN-2746
> URL: https://issues.apache.org/jira/browse/YARN-2746
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Jian He
>
> I found this during review of YARN-2743.
> bq. AbstractDTId had a version, we dropped that in the protobuf 
> serialization. We should just write it during the serialization and read it 
> back?
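
As a rough illustration of the idea in the quote above (write the version during serialization and read it back), here is a hedged, hypothetical Writable-style sketch; the class and field names below are invented for illustration and are not the actual YARN delegation token identifier code.

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch only: shows the "write the version first, read it back"
// pattern suggested above.
class VersionedIdSketch {
  private static final byte CURRENT_VERSION = 1;
  private long sequenceNumber;   // stands in for the real identifier fields

  public void write(DataOutput out) throws IOException {
    out.writeByte(CURRENT_VERSION);    // serialize the version first
    out.writeLong(sequenceNumber);     // then the payload
  }

  public void readFields(DataInput in) throws IOException {
    byte version = in.readByte();      // read the version back
    if (version > CURRENT_VERSION) {
      throw new IOException("Unsupported token id version " + version);
    }
    sequenceNumber = in.readLong();
  }
}
{code}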



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1480) RM web services getApps() accepts many more filters than ApplicationCLI "list" command

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1480:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> RM web services getApps() accepts many more filters than ApplicationCLI 
> "list" command
> --
>
> Key: YARN-1480
> URL: https://issues.apache.org/jira/browse/YARN-1480
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Kenji Kikushima
> Attachments: YARN-1480-2.patch, YARN-1480-3.patch, YARN-1480-4.patch, 
> YARN-1480-5.patch, YARN-1480-6.patch, YARN-1480.patch
>
>
> Nowadays the RM web services getApps() accepts many more filters than the 
> ApplicationCLI "list" command, which only accepts "state" and "type". IMHO, 
> ideally, different interfaces should provide consistent functionality. Would 
> it be better to allow more filters in ApplicationCLI?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3478) FairScheduler page not performed because different enum of YarnApplicationState and RMAppState

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3478:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> FairScheduler page not performed because different enum of 
> YarnApplicationState and RMAppState 
> ---
>
> Key: YARN-3478
> URL: https://issues.apache.org/jira/browse/YARN-3478
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.0
>Reporter: Xu Chen
> Attachments: YARN-3478.1.patch, YARN-3478.2.patch, YARN-3478.3.patch, 
> screenshot-1.png
>
>
> Got exception from log 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
> at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
> at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
> at 
> com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1225)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.lib.DynamicUserWebFilter$DynamicUserFilter.doFilter(DynamicUserWebFilter.java:59)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
> at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> at 
> 

[jira] [Updated] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2014:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
> 
>
> Key: YARN-2014
> URL: https://issues.apache.org/jira/browse/YARN-2014
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: patrick white
>Assignee: Jason Lowe
>
> Performance comparison benchmarks of 2.x against 0.23 show that the AM 
> scalability benchmark's runtime is approximately 10% slower in 2.4.0. The 
> trend is consistent across later releases in both lines; the latest release 
> numbers are:
> 2.4.0.0 runtime 255.6 seconds (avg of 5 passes)
> 0.23.9.12 runtime 230.4 seconds (avg of 5 passes)
> Diff: -9.9%
> The AM scalability test is essentially a sleep job that measures the time to 
> launch and complete a large number of mappers.
> The diff is consistent and has been reproduced both in a larger (350 nodes, 
> 100,000 mappers) perf environment and in a small (10 nodes, 2,900 mappers) 
> demo cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1848) Persist ClusterMetrics across RM HA transitions

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1848:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Persist ClusterMetrics across RM HA transitions
> ---
>
> Key: YARN-1848
> URL: https://issues.apache.org/jira/browse/YARN-1848
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> Post YARN-1705, ClusterMetrics are reset on transition to standby. This is 
> acceptable insofar as the metrics show statistics since an RM became active, 
> but users might want to see metrics since the cluster was first started.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1767) Windows: Allow a way for users to augment classpath of YARN daemons

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1767:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Windows: Allow a way for users to augment classpath of YARN daemons
> ---
>
> Key: YARN-1767
> URL: https://issues.apache.org/jira/browse/YARN-1767
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>
> YARN-1429 adds a way to augment the classpath for *nix-based systems. Need 
> something similar for Windows. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4060) Revisit default retry config for connection with RM

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-4060:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Revisit default retry config for connection with RM 
> 
>
> Key: YARN-4060
> URL: https://issues.apache.org/jira/browse/YARN-4060
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
>
> The 15-minute timeout for AM/NM connections with the RM in the non-HA 
> scenario turns out to be too short in production environments. The suggestion 
> is to increase it to 30 minutes. Also, the retry interval is set to 30 
> seconds, which appears too long; we may reduce it to 10 seconds.
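
For reference, a minimal sketch of how those two knobs could be tuned programmatically, assuming the 2.7-era configuration keys yarn.resourcemanager.connect.max-wait.ms and yarn.resourcemanager.connect.retry-interval.ms; the values used here are only the ones proposed above, not agreed-upon defaults.

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmRetryConfigSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Proposed values from the discussion above; the keys are assumed to be the
    // ones RMProxy consults when building its retry policy.
    conf.setLong("yarn.resourcemanager.connect.max-wait.ms", 30 * 60 * 1000L);   // 30 min
    conf.setLong("yarn.resourcemanager.connect.retry-interval.ms", 10 * 1000L);  // 10 s
    System.out.println("max-wait = "
        + conf.getLong("yarn.resourcemanager.connect.max-wait.ms", -1));
  }
}
{code}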



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2599) Standby RM should also expose some jmx and metrics

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2599:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2657) MiniYARNCluster to (optionally) add MicroZookeeper service

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2657:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> MiniYARNCluster to (optionally) add MicroZookeeper service
> --
>
> Key: YARN-2657
> URL: https://issues.apache.org/jira/browse/YARN-2657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: test
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: YARN-2567-001.patch, YARN-2657-002.patch
>
>
> This is needed for testing things like YARN-2646: add an option for the 
> {{MiniYarnCluster}} to start a {{MicroZookeeperService}}.
> This is just another YARN service whose lifecycle is created and tracked. The 
> {{MicroZookeeperService}} publishes its binding information for direct takeup 
> by the registry services...this can address in-VM race conditions.
> The default setting for this service is "off".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2037) Add restart support for Unmanaged AMs

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2037:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Add restart support for Unmanaged AMs
> -
>
> Key: YARN-2037
> URL: https://issues.apache.org/jira/browse/YARN-2037
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> It would be nice to allow Unmanaged AMs also to restart in a work-preserving 
> way. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1856) cgroups based memory monitoring for containers

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1856:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> cgroups based memory monitoring for containers
> --
>
> Key: YARN-1856
> URL: https://issues.apache.org/jira/browse/YARN-1856
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>Assignee: Varun Vasudev
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2038) Revisit how AMs learn of containers from previous attempts

2015-11-03 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2038:
--
Target Version/s: 2.6.3, 2.7.3  (was: 2.7.2, 2.6.3)

Moving out all non-critical / non-blocker issues that didn't make it out of 
2.7.2 into 2.7.3.

> Revisit how AMs learn of containers from previous attempts
> --
>
> Key: YARN-2038
> URL: https://issues.apache.org/jira/browse/YARN-2038
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> Based on YARN-556, we need to update the way AMs learn about containers 
> allocated in previous attempts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988623#comment-14988623
 ] 

Hudson commented on YARN-4326:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #567 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/567/])
YARN-4326. Fix TestDistributedShell timeout as AHS in MiniYarnCluster no 
(wangda: rev 0783184f4b3f669f7211e42b395b62d63144100d)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java
* hadoop-yarn-project/CHANGES.txt


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in ApplicationMaster, where it fails to connect to the 
> timeline server and the retries exceed the limit:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-11-03 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988765#comment-14988765
 ] 

Sunil G commented on YARN-4091:
---

Hi [~Naganarasimha Garla]
Thank you for the comment. I will update the doc with the changed approach. I 
am also almost ready with the CS patch including a basic REST implementation, 
and will soon put a patch here. I think the same REST sub-JIRA will hold good 
for both schedulers, but we do indeed need an implementation of this approach 
for the Fair Scheduler too.

> Improvement: Introduce more debug/diagnostics information to detail out 
> scheduler activity
> --
>
> Key: YARN-4091
> URL: https://issues.apache.org/jira/browse/YARN-4091
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: Improvement on debugdiagnostic information - YARN.pdf
>
>
> As schedulers gain new capabilities, more of the configurations that tune them 
> start to take actions such as limiting container assignment to an application 
> or introducing a delay before allocating a container. No clear information is 
> passed from the scheduler to the outside world in these scenarios, which makes 
> debugging much harder.
> This ticket is an effort to introduce more clearly defined states at the 
> points in the scheduler where it skips/rejects a container assignment, 
> activates an application, etc. Such information will help users understand 
> what is happening inside the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve 
> on this as we discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS

2015-11-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988810#comment-14988810
 ] 

Hadoop QA commented on YARN-3946:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 8s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
21s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 14s 
{color} | {color:red} Patch generated 10 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 (total was 524, now 531). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 80m 57s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 85m 39s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
38s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 179m 36s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.reservation.TestCapacitySchedulerPlanFollower
 |
|   | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationLimits |
| JDK v1.8.0_66 Timed out junit tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue |
| JDK v1.7.0_79 Failed junit tests | 

[jira] [Commented] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188

2015-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988227#comment-14988227
 ] 

Hudson commented on YARN-4326:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #633 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/633/])
YARN-4326. Fix TestDistributedShell timeout as AHS in MiniYarnCluster no 
(wangda: rev 0783184f4b3f669f7211e42b395b62d63144100d)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java


> Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to 
> default port 8188
> ---
>
> Key: YARN-4326
> URL: https://issues.apache.org/jira/browse/YARN-4326
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: MENG DING
>Assignee: MENG DING
> Fix For: 2.8.0
>
> Attachments: YARN-4326.patch
>
>
> The timeout originates in ApplicationMaster, where it fails to connect to the 
> timeline server and the retries exceed the limit:
> {code}
> 2015-11-02 21:57:38,066 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: 
> http://mdinglin02:0/ws/v1/timeline/
> 2015-11-02 21:57:38,099 INFO  [main] impl.TimelineClientImpl 
> (TimelineClientImpl.java:logException(213)) - Exception caught by 
> TimelineClientConnectionRetry, will try 30 more time(s).
> ...
> ...
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245)
> at com.sun.jersey.api.client.Client.handle(Client.java:648)
> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> at 
> com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> at 
> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571)
> at 
> org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS

2015-11-03 Thread qiang xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989049#comment-14989049
 ] 

qiang xu commented on YARN-3432:


I think we need to think about the vcores too.
In the current code, it is:
this.totalMB = availableMB + allocatedMB;
this.totalVirtualCores = availableVirtualCores + allocatedVirtualCores;


Do we need to make this.totalVirtualCores = availableVirtualCores + 
allocatedVirtualCores + containersReserved;

> Cluster metrics have wrong Total Memory when there is reserved memory on CS
> ---
>
> Key: YARN-3432
> URL: https://issues.apache.org/jira/browse/YARN-3432
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3432-002.patch, YARN-3432.patch
>
>
> I noticed that when reservations happen while using the Capacity Scheduler, 
> the UI and web services report the wrong total memory.
> For example, I have 300GB of total memory in my cluster. I allocate 50GB and 
> reserve 10GB. The cluster metrics then report the total memory as 290GB.
> This was broken by https://issues.apache.org/jira/browse/YARN-656, so perhaps 
> there is a difference between the fair scheduler and the capacity scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS

2015-11-03 Thread qiang xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989055#comment-14989055
 ] 

qiang xu commented on YARN-3432:


Sorry, it should be:
this.totalVirtualCores = availableVirtualCores + allocatedVirtualCores + reservedVirtualCores;
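
A minimal sketch of the computation proposed above, in plain Java; the variable names are assumed to mirror the cluster-metrics fields quoted in the earlier comment and this is not the actual patch.

{code}
// Hypothetical sketch of the proposed totals: include reserved resources so
// that total = available + allocated + reserved for both memory and vcores.
class ClusterTotalsSketch {
  long totalMB(long availableMB, long allocatedMB, long reservedMB) {
    return availableMB + allocatedMB + reservedMB;
  }

  int totalVirtualCores(int availableVirtualCores, int allocatedVirtualCores,
                        int reservedVirtualCores) {
    return availableVirtualCores + allocatedVirtualCores + reservedVirtualCores;
  }
}
{code}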

> Cluster metrics have wrong Total Memory when there is reserved memory on CS
> ---
>
> Key: YARN-3432
> URL: https://issues.apache.org/jira/browse/YARN-3432
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3432-002.patch, YARN-3432.patch
>
>
> I noticed that when reservations happen while using the Capacity Scheduler, 
> the UI and web services report the wrong total memory.
> For example, I have 300GB of total memory in my cluster. I allocate 50GB and 
> reserve 10GB. The cluster metrics then report the total memory as 290GB.
> This was broken by https://issues.apache.org/jira/browse/YARN-656, so perhaps 
> there is a difference between the fair scheduler and the capacity scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4127) RM fail with noAuth error if switched from failover mode to non-failover mode

2015-11-03 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989060#comment-14989060
 ] 

Varun Saxena commented on YARN-4127:


Thanks [~jianhe] for the review and commit

> RM fail with noAuth error if switched from failover mode to non-failover mode 
> --
>
> Key: YARN-4127
> URL: https://issues.apache.org/jira/browse/YARN-4127
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Jian He
>Assignee: Varun Saxena
> Fix For: 2.7.2
>
> Attachments: YARN-4127-branch-2.7.01.patch, 
> YARN-4127-branch-2.7.02.patch, YARN-4127.01.patch, YARN-4127.02.patch
>
>
> The scenario is that RM failover was initially enabled, so the zkRootNodeAcl 
> is by default set with the *RM ID* in the ACL string.
> If RM failover is then switched to disabled, the RM cannot load data from ZK 
> and fails with a noAuth error. After I reset the root node ACL, it can access 
> the data again.
> {code}
> 15/09/08 14:28:34 ERROR resourcemanager.ResourceManager: Failed to 
> load/recover state
> org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125)
>   at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$SafeTransaction.commit(ZKRMStateStore.java:1009)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.safeSetData(ZKRMStateStore.java:985)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getAndIncrementEpoch(ZKRMStateStore.java:374)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:579)
>   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:973)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1014)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1010)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1010)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1050)
>   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1194)
> {code}
> The problem may be that in non-failover mode, the RM doesn't use the *RM-ID* 
> to connect with ZK and thus fails with a noAuth error.
> We should be able to switch failover on and off with no interruption to the 
> user.
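
As a rough illustration of the manual workaround mentioned above ("reset the root node ACL"), here is a hedged Curator sketch. The ZK connect string and the /rmstore/ZKRMStateRoot path are assumed defaults and should be checked against yarn.resourcemanager.zk-address and yarn.resourcemanager.zk-state-store.parent-path on your cluster; this is not part of the committed fix.

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.ZooDefs;

public class ResetRmRootNodeAcl {
  public static void main(String[] args) throws Exception {
    // Assumptions: ZK at localhost:2181 and the default state-store path
    // /rmstore/ZKRMStateRoot; adjust both to match your yarn-site.xml.
    String zkConnect = args.length > 0 ? args[0] : "localhost:2181";
    String rootPath = args.length > 1 ? args[1] : "/rmstore/ZKRMStateRoot";

    CuratorFramework client = CuratorFrameworkFactory.newClient(
        zkConnect, new ExponentialBackoffRetry(1000, 3));
    client.start();
    try {
      // Replace the RM-ID-scoped ACL with a world-readable/writable one so a
      // non-HA RM can read the store again. Secure clusters need a tighter ACL.
      client.setACL()
          .withACL(ZooDefs.Ids.OPEN_ACL_UNSAFE)
          .forPath(rootPath);
    } finally {
      client.close();
    }
  }
}
{code}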



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-11-03 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.9.patch

Thanks [~jianhe] for reviewing the patch and giving feedback offline. To 
summarize, in the following function:

{code: title=AMRMClientImpl.java}
+  protected void removePendingChangeRequests(
+  List<Container> changedContainers, boolean isIncrease) {
+for (Container changedContainer : changedContainers) {
+  ContainerId containerId = changedContainer.getId();
+  if (pendingChange.get(containerId) == null) {
+continue;
+  }
+  Resource target = pendingChange.get(containerId).getValue();
+  if (target == null) {
+continue;
+  }
+  Resource changed = changedContainer.getResource();
+  if (isIncrease) {
+if (Resources.fitsIn(target, changed)) {
+  if (LOG.isDebugEnabled()) {
+LOG.debug("RM has confirmed increased resource allocation for "
++ "container " + containerId + ". Current resource allocation:"
++ changed + ". Remove pending change request:"
++ target);
+  }
+  pendingChange.remove(containerId);
+}
+  } else {
+if (Resources.fitsIn(changed, target)) {
+  if (LOG.isDebugEnabled()) {
+LOG.debug("RM has confirmed decreased resource allocation for "
++ "container " + containerId + ". Current resource allocation:"
++ changed + ". Remove pending change request:"
++ target);
+  }
+  pendingChange.remove(containerId);
+}
+  }
+}
+  }
{code}
* There is no need to check {{target}} for null, as under no circumstance will 
it become null.
* Better yet, there is no need to compare {{changed}} with {{target}} at all, 
because {{Resources.fitsIn(target, changed)}} will always be true for a 
confirmed increase request, and likewise {{Resources.fitsIn(changed, target)}} 
for a confirmed decrease request. I added these checks originally to be 
defensive, but there is really no need for them.

Attaching latest patch that addresses the above.
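
For readers following along, a hedged sketch of what the method could look like once those checks are dropped, assuming the same surrounding class and fields ({{pendingChange}}, {{LOG}}) as the block quoted above; this is a simplification for illustration, not the exact contents of the latest patch.

{code}
// Simplified sketch based on the discussion above: once the RM confirms a
// change for a container that still has a pending change request, the pending
// entry can simply be removed without re-comparing resources.
protected void removePendingChangeRequests(
    List<Container> changedContainers, boolean isIncrease) {
  for (Container changedContainer : changedContainers) {
    ContainerId containerId = changedContainer.getId();
    // Map.remove returns null when there was no pending request for this id.
    if (pendingChange.remove(containerId) != null && LOG.isDebugEnabled()) {
      LOG.debug("RM has confirmed " + (isIncrease ? "increased" : "decreased")
          + " resource allocation for container " + containerId
          + ". Current resource allocation: " + changedContainer.getResource());
    }
  }
}
{code}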

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.2.patch, YARN-1509.3.patch, 
> YARN-1509.4.patch, YARN-1509.5.patch, YARN-1509.6.patch, YARN-1509.7.patch, 
> YARN-1509.8.patch, YARN-1509.9.patch
>
>
> As described in YARN-1197, we need to add APIs in AMRMClient to support
> 1) Adding an increase request
> 2) Getting successfully increased/decreased containers from the RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1509) Make AMRMClient support send increase container request and get increased/decreased containers

2015-11-03 Thread MENG DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-1509:

Attachment: YARN-1509.10.patch

Please ignore the previous patch, and see the latest one.

> Make AMRMClient support send increase container request and get 
> increased/decreased containers
> --
>
> Key: YARN-1509
> URL: https://issues.apache.org/jira/browse/YARN-1509
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1509.1.patch, YARN-1509.10.patch, 
> YARN-1509.2.patch, YARN-1509.3.patch, YARN-1509.4.patch, YARN-1509.5.patch, 
> YARN-1509.6.patch, YARN-1509.7.patch, YARN-1509.8.patch, YARN-1509.9.patch
>
>
> As described in YARN-1197, we need to add APIs in AMRMClient to support
> 1) Adding an increase request
> 2) Getting successfully increased/decreased containers from the RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4127) RM fail with noAuth error if switched from failover mode to non-failover mode

2015-11-03 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987591#comment-14987591
 ] 

Varun Saxena commented on YARN-4127:


The test failures are unrelated. Except for the one related to node labels, 
there are already JIRAs corresponding to them; in any case, all are unrelated.
The findbugs warning is also unrelated; I will raise a separate JIRA for it.
The whitespace issues seem unrelated too.



> RM fail with noAuth error if switched from failover mode to non-failover mode 
> --
>
> Key: YARN-4127
> URL: https://issues.apache.org/jira/browse/YARN-4127
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Jian He
>Assignee: Varun Saxena
> Attachments: YARN-4127-branch-2.7.01.patch, 
> YARN-4127-branch-2.7.02.patch, YARN-4127.01.patch, YARN-4127.02.patch
>
>
> The scenario is that RM failover was initially enabled, so the zkRootNodeAcl 
> is by default set with the *RM ID* in the ACL string.
> If RM failover is then switched to disabled, the RM cannot load data from ZK 
> and fails with a noAuth error. After I reset the root node ACL, it can access 
> the data again.
> {code}
> 15/09/08 14:28:34 ERROR resourcemanager.ResourceManager: Failed to 
> load/recover state
> org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125)
>   at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
>   at 
> org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$SafeTransaction.commit(ZKRMStateStore.java:1009)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.safeSetData(ZKRMStateStore.java:985)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getAndIncrementEpoch(ZKRMStateStore.java:374)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:579)
>   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:973)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1014)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1010)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1010)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1050)
>   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1194)
> {code}
> The problem may be that in non-failover mode, the RM doesn't use the *RM-ID* 
> to connect with ZK and thus fails with a noAuth error.
> We should be able to switch failover on and off with no interruption to the 
> user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)