[jira] [Comment Edited] (YARN-9957) The first container we recover may not be the AM

2019-11-06 Thread Xianghao Lu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969006#comment-16969006
 ] 

Xianghao Lu edited comment on YARN-9957 at 11/7/19 7:47 AM:


The failure above has nothing to do with the patch. [~asuresh] [~rkanter], can 
you please review the patch?


was (Author: luxianghao):
The failure above has nothing to do with the patch. [~arun.sur...@gmail.com] 
[~rkanter], can you please review the patch?

> The first container we recover may not be the AM
> 
>
> Key: YARN-9957
> URL: https://issues.apache.org/jira/browse/YARN-9957
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.1
>Reporter: Xianghao Lu
>Assignee: Xianghao Lu
>Priority: Major
> Fix For: 2.9.1
>
> Attachments: 1.jpg, 2.jpg, YARN-9957-branch-2.9.1.001.patch
>
>
> YARN-7382 says that if not running unmanaged, the first container we recover 
> is always the AM. However, the actual situation is not like this, and it can 
> lead to a wrong AM resource usage after RM recovery.






[jira] [Commented] (YARN-9957) The first container we recover may not be the AM

2019-11-06 Thread Xianghao Lu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969006#comment-16969006
 ] 

Xianghao Lu commented on YARN-9957:
---

The failure above has nothing to do with the patch. [~arun.sur...@gmail.com] 
[~rkanter], can you please review the patch?

> The first container we recover may not be the AM
> 
>
> Key: YARN-9957
> URL: https://issues.apache.org/jira/browse/YARN-9957
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.1
>Reporter: Xianghao Lu
>Assignee: Xianghao Lu
>Priority: Major
> Fix For: 2.9.1
>
> Attachments: 1.jpg, 2.jpg, YARN-9957-branch-2.9.1.001.patch
>
>
> YARN-7382 says that if not running unmanaged, the first container we recover 
> is always the AM. However, the actual situation is not like this, and it can 
> lead to a wrong AM resource usage after RM recovery.






[jira] [Commented] (YARN-9957) The first container we recover may not be the AM

2019-11-06 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968999#comment-16968999
 ] 

Hadoop QA commented on YARN-9957:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 21m 
50s{color} | {color:red} Docker failed to build yetus/hadoop:ef54f78530d. 
{color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-9957 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12985156/YARN-9957-branch-2.9.1.001.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/25112/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> The first container we recover may not be the AM
> 
>
> Key: YARN-9957
> URL: https://issues.apache.org/jira/browse/YARN-9957
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.1
>Reporter: Xianghao Lu
>Assignee: Xianghao Lu
>Priority: Major
> Fix For: 2.9.1
>
> Attachments: 1.jpg, 2.jpg, YARN-9957-branch-2.9.1.001.patch
>
>
> YARN-7382 says that if not running unmanaged, the first container we recover 
> is always the AM. However, the actual situation is not like this, and it can 
> lead to a wrong AM resource usage after RM recovery.






[jira] [Commented] (YARN-9957) The first container we recover may not be the AM

2019-11-06 Thread Xianghao Lu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968976#comment-16968976
 ] 

Xianghao Lu commented on YARN-9957:
---

The test sample is as described below:
 # Submit an MR job where the AM container capacity is 2048 MB and the map and 
reduce container capacities are 4096 MB each; resource usage is shown in 1.jpg.
 # Restart the RM. After the RM recovered, resource usage is shown in 2.jpg, and 
I found the AM used resource is 4096 MB. !1.jpg!
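
For illustration, a minimal sketch of the idea behind such a fix, assuming it 
sits inside the container recovery loop (variable names like rmApp and 
recoveredContainer are assumptions, not the attached patch):

{code}
// Instead of assuming the first recovered container is the AM, compare each
// recovered container against the attempt's stored master container id.
Container masterContainer = rmApp.getCurrentAppAttempt().getMasterContainer();
ContainerId amContainerId =
    (masterContainer == null) ? null : masterContainer.getId();
boolean isAMContainer =
    recoveredContainer.getContainerId().equals(amContainerId);
// Only attribute AM resource usage when the ids match, so a 4096 MB map
// container recovered first is no longer mistaken for the 2048 MB AM.
{code}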

> The first container we recover may not be the AM
> 
>
> Key: YARN-9957
> URL: https://issues.apache.org/jira/browse/YARN-9957
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.1
>Reporter: Xianghao Lu
>Assignee: Xianghao Lu
>Priority: Major
> Fix For: 2.9.1
>
> Attachments: 1.jpg, 2.jpg
>
>
> YARN-7382 says that if not running unmanaged, the first container we recover 
> is always the AM. However, the actual situation is not like this, and it can 
> lead to a wrong AM resource usage after RM recovery.






[jira] [Updated] (YARN-9957) The first container we recover may not be the AM

2019-11-06 Thread Xianghao Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghao Lu updated YARN-9957:
--
Attachment: 2.jpg
1.jpg

> The first container we recover may not be the AM
> 
>
> Key: YARN-9957
> URL: https://issues.apache.org/jira/browse/YARN-9957
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.9.1
>Reporter: Xianghao Lu
>Assignee: Xianghao Lu
>Priority: Major
> Fix For: 2.9.1
>
> Attachments: 1.jpg, 2.jpg
>
>
> YARN-7382 says that if not running unmanaged, the first container we recover 
> is always the AM. However, the actual situation is not like this, and it can 
> lead to a wrong AM resource usage after RM recovery.






[jira] [Commented] (YARN-9956) Improve connection error message for YARN ApiServerClient

2019-11-06 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968973#comment-16968973
 ] 

Prabhu Joseph commented on YARN-9956:
-

[~eyang] Yes, sure, I will work on this. Thanks.

> Improve connection error message for YARN ApiServerClient
> -
>
> Key: YARN-9956
> URL: https://issues.apache.org/jira/browse/YARN-9956
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Priority: Major
>
> In HA environment, yarn.resourcemanager.webapp.address configuration is 
> optional.  ApiServiceClient may produce confusing error message like this:
> {code}
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host1.example.com:8090
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host2.example.com:8090
> 19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
> GSSException: No valid credentials provided (Mechanism level: Server not 
> found in Kerberos database (7) - LOOKING_UP_SERVER)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: KrbException: Server not found in Kerberos database (7) - 
> LOOKING_UP_SERVER
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
>   at 
> java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
>   ... 15 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>   at 
> java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
>   ... 21 more
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Fail to launch application: 
> java.io.IOException: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:293)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
>

[jira] [Created] (YARN-9957) The first container we recover may not be the AM

2019-11-06 Thread Xianghao Lu (Jira)
Xianghao Lu created YARN-9957:
-

 Summary: The first container we recover may not be the AM
 Key: YARN-9957
 URL: https://issues.apache.org/jira/browse/YARN-9957
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.9.1
Reporter: Xianghao Lu
Assignee: Xianghao Lu
 Fix For: 2.9.1


YARN-7382 says that if not running unmanaged, the first container we recover is 
always the AM. However, the actual situation is not like this, and it can lead 
to a wrong AM resource usage after RM recovery.






[jira] [Assigned] (YARN-9956) Improve connection error message for YARN ApiServerClient

2019-11-06 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-9956:
---

Assignee: Prabhu Joseph

> Improve connection error message for YARN ApiServerClient
> -
>
> Key: YARN-9956
> URL: https://issues.apache.org/jira/browse/YARN-9956
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Assignee: Prabhu Joseph
>Priority: Major
>
> In HA environment, yarn.resourcemanager.webapp.address configuration is 
> optional.  ApiServiceClient may produce confusing error message like this:
> {code}
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host1.example.com:8090
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host2.example.com:8090
> 19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
> GSSException: No valid credentials provided (Mechanism level: Server not 
> found in Kerberos database (7) - LOOKING_UP_SERVER)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: KrbException: Server not found in Kerberos database (7) - 
> LOOKING_UP_SERVER
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
>   at 
> java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
>   ... 15 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>   at 
> java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
>   ... 21 more
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Fail to launch application: 
> java.io.IOException: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:293)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
>   at 
> 

[jira] [Commented] (YARN-9537) Add configuration to disable AM preemption

2019-11-06 Thread zhoukang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968947#comment-16968947
 ] 

zhoukang commented on YARN-9537:


[~tangzhankun], could you help review this? Thanks.

> Add configuration to disable AM preemption
> --
>
> Key: YARN-9537
> URL: https://issues.apache.org/jira/browse/YARN-9537
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.2.0, 3.1.2
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: YARN-9537-002.patch, YARN-9537.001.patch, 
> YARN-9537.003.patch
>
>
> In this issue, I will add a configuration to support disabling AM preemption.
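
For illustration, a hedged sketch of what such a switch could look like in the 
FairScheduler preemption path (the config key, loop, and preemptContainer 
helper are assumptions, not the attached patches):

{code}
// Skip AM containers when choosing preemption victims while the (assumed)
// switch is off; RMContainer#isAMContainer is the existing AM marker.
boolean amPreemptionEnabled = conf.getBoolean(
    "yarn.scheduler.fair.am-preemption.enabled", true);  // assumed key name
for (RMContainer container : candidates) {
  if (!amPreemptionEnabled && container.isAMContainer()) {
    continue;  // never pick the AM as a victim
  }
  preemptContainer(container);  // hypothetical helper
}
{code}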






[jira] [Assigned] (YARN-9612) Support using ip to register NodeID

2019-11-06 Thread zhoukang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang reassigned YARN-9612:
--

Assignee: zhoukang

> Support using ip to register NodeID
> ---
>
> Key: YARN-9612
> URL: https://issues.apache.org/jira/browse/YARN-9612
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> In environments like k8s, we should support using the IP when registering the 
> NodeID with the RM, since the hostname will be the pod name, which cannot be 
> resolved by the DNS of k8s.
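
For illustration, a minimal sketch of the registration change being proposed 
(the config key is an assumption; UnknownHostException handling is elided):

{code}
// Let the NM register its NodeId with an IP address instead of a hostname
// that the cluster DNS (e.g. k8s) cannot resolve.
String host;
if (conf.getBoolean("yarn.nodemanager.register-with-ip", false)) {  // assumed key
  host = InetAddress.getLocalHost().getHostAddress();
} else {
  host = InetAddress.getLocalHost().getHostName();
}
NodeId nodeId = NodeId.newInstance(host, rpcPort);
{code}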






[jira] [Assigned] (YARN-9739) appsTableData in AppsBlock may cause OOM

2019-11-06 Thread zhoukang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang reassigned YARN-9739:
--

Assignee: zhoukang

> appsTableData in AppsBlock may cause OOM
> 
>
> Key: YARN-9739
> URL: https://issues.apache.org/jira/browse/YARN-9739
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: heap0.png, heap1.png, stack.png
>
>
> If many users list the applications, it may cause an RM OOM.
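
For illustration, one hedged mitigation sketch (the config key and renderRow 
helper are hypothetical): bound how many rows a single request renders into 
the appsTableData string, since the block builds one large string per request.

{code}
int limit = conf.getInt("yarn.resourcemanager.webapp.apps-render-limit", 10000);
StringBuilder appsTableData = new StringBuilder();
int rendered = 0;
for (AppInfo app : apps) {
  if (rendered++ >= limit) {
    break;  // stop before the string grows unbounded under concurrent requests
  }
  appsTableData.append(renderRow(app));  // renderRow: hypothetical helper
}
{code}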






[jira] [Commented] (YARN-9768) RM Renew Delegation token thread should timeout and retry

2019-11-06 Thread Jira


[ 
https://issues.apache.org/jira/browse/YARN-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968848#comment-16968848
 ] 

Íñigo Goiri commented on YARN-9768:
---

+1 on [^YARN-9768.008.patch].

> RM Renew Delegation token thread should timeout and retry
> -
>
> Key: YARN-9768
> URL: https://issues.apache.org/jira/browse/YARN-9768
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9768.001.patch, YARN-9768.002.patch, 
> YARN-9768.003.patch, YARN-9768.004.patch, YARN-9768.005.patch, 
> YARN-9768.006.patch, YARN-9768.007.patch, YARN-9768.008.patch
>
>
> The delegation token renewer thread in the RM (DelegationTokenRenewer.java) 
> renews the HDFS tokens it receives to check their validity and expiration time.
> This call is made to an underlying HDFS NN or Router node (which has the same 
> APIs as the HDFS NN). If one of the nodes is bad and the renew call is stuck, 
> the thread remains stuck indefinitely. The thread should ideally time out the 
> renewToken call and retry from the client's perspective.
>  
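
For illustration, a minimal sketch of the timeout idea, wrapping the blocking 
renew call in an executor (the timeout value and retry policy are assumptions, 
not the attached patch; checked-exception handling is elided):

{code}
ExecutorService pool = Executors.newSingleThreadExecutor();
Future<Long> renewal = pool.submit(() -> token.renew(conf));
try {
  long newExpiration = renewal.get(60, TimeUnit.SECONDS);  // assumed timeout
} catch (TimeoutException e) {
  renewal.cancel(true);  // interrupt the stuck renew, then let the caller retry
}
{code}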






[jira] [Commented] (YARN-9937) Add missing queue configs in RMWebService#CapacitySchedulerQueueInfo

2019-11-06 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968792#comment-16968792
 ] 

Hadoop QA commented on YARN-9937:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 36m 
16s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 45s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 43s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 98m 
12s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
51s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}202m 44s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | YARN-9937 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12985112/YARN-9937-addendum-01.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 5dfdf54c1216 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / dd90025 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/25111/testReport/ |
| Max. process+thread count | 830 (vs. ulimit of 5500) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/25111/console |
| Powered by | Apache Yetus 

[jira] [Commented] (YARN-8292) Fix the dominant resource preemption cannot happen when some of the resource vector becomes negative

2019-11-06 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968732#comment-16968732
 ] 

Eric Payne commented on YARN-8292:
--

It looks like the extended resource functionality for preemption unit tests may 
be dependent on changes made in YARN-7411 to 
ProportionalCapacityPreemptionPolicyMockFramework#parseResourceFromString. 
YARN-7411 was only put into 3.1.0.

> Fix the dominant resource preemption cannot happen when some of the resource 
> vector becomes negative
> 
>
> Key: YARN-8292
> URL: https://issues.apache.org/jira/browse/YARN-8292
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8292.001.patch, YARN-8292.002.patch, 
> YARN-8292.003.patch, YARN-8292.004.patch, YARN-8292.005.patch, 
> YARN-8292.006.patch, YARN-8292.007.patch, YARN-8292.008.patch, 
> YARN-8292.009.patch, YARN-8292.branch-2.009.patch
>
>
> This is an example of the problem: 
>   
> {code}
> //   guaranteed,  max,used,   pending
> "root(=[30:18:6  30:18:6 12:12:6 1:1:1]);" + //root
> "-a(=[10:6:2 10:6:2  6:6:3   0:0:0]);" + // a
> "-b(=[10:6:2 10:6:2  6:6:3   0:0:0]);" + // b
> "-c(=[10:6:2 10:6:2  0:0:0   1:1:1])"; // c
> {code}
> There are 3 resource types. The total resource of the cluster is 30:18:6.
> For both a and b, there are 3 containers running, each of which is 2:2:1.
> Queue c uses 0 resources and has 1:1:1 pending.
> Under the existing logic, preemption cannot happen.
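
Working through the numbers above for queue a (a sketch, not scheduler code) 
shows why a whole-vector comparison fails here:

{code}
// queue a: guaranteed <10, 6, 2>, used <6, 6, 3>
int[] guaranteed = {10, 6, 2};
int[] used       = { 6, 6, 3};
int[] overage = new int[3];
for (int i = 0; i < 3; i++) {
  overage[i] = used[i] - guaranteed[i];  // <-4, 0, +1>: mixed signs
}
// The third resource is over its guarantee while the first is under, so any
// check that needs the whole vector to be non-negative (or non-positive)
// never fires, and queue c's 1:1:1 pending demand is never satisfied.
{code}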






[jira] [Commented] (YARN-8292) Fix the dominant resource preemption cannot happen when some of the resource vector becomes negative

2019-11-06 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968656#comment-16968656
 ] 

Eric Payne commented on YARN-8292:
--

I backported this to branch-2 and attached YARN-8292.branch-2.009.patch. In my 
manual tests on a 4-node pseudo cluster, it allows preemptions to proceed in 
the case where the dominant resource is above the queue capacity and 
non-dominant resource(s) is (are) less. However, I have not put the JIRA into 
patch-submitted state because the two unit tests added to test preemption with 
3 resources are failing. I dug into it a little bit and see that in 2.10, when 
it allocates resources to the Mock queue, the extended resource is not added to 
the current configuration or usage of the queue.
[~leftnoteasy] / [~sunilg] / [~jhung], are you aware of any missing extended 
resource configuration that should be backported for the 2.10 RM / CS mocks?

Here is one of the test failures:
{noformat}
[ERROR]   
TestProportionalCapacityPreemptionPolicyInterQueueWithDRF.test3ResourceTypesInterQueuePreemption:117
 
Wanted but not invoked:
eventHandler.handle(

);
-> at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicyInterQueueWithDRF.test3ResourceTypesInterQueuePreemption(TestProportionalCapacityPreemptionPolicyInterQueueWithDRF.java:117)
Actually, there were zero interactions with this mock.
{noformat}



> Fix the dominant resource preemption cannot happen when some of the resource 
> vector becomes negative
> 
>
> Key: YARN-8292
> URL: https://issues.apache.org/jira/browse/YARN-8292
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8292.001.patch, YARN-8292.002.patch, 
> YARN-8292.003.patch, YARN-8292.004.patch, YARN-8292.005.patch, 
> YARN-8292.006.patch, YARN-8292.007.patch, YARN-8292.008.patch, 
> YARN-8292.009.patch, YARN-8292.branch-2.009.patch
>
>
> This is an example of the problem: 
>   
> {code}
> //   guaranteed,  max,used,   pending
> "root(=[30:18:6  30:18:6 12:12:6 1:1:1]);" + //root
> "-a(=[10:6:2 10:6:2  6:6:3   0:0:0]);" + // a
> "-b(=[10:6:2 10:6:2  6:6:3   0:0:0]);" + // b
> "-c(=[10:6:2 10:6:2  0:0:0   1:1:1])"; // c
> {code}
> There are 3 resource types. The total resource of the cluster is 30:18:6.
> For both a and b, there are 3 containers running, each of which is 2:2:1.
> Queue c uses 0 resources and has 1:1:1 pending.
> Under the existing logic, preemption cannot happen.






[jira] [Updated] (YARN-9937) Add missing queue configs in RMWebService#CapacitySchedulerQueueInfo

2019-11-06 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-9937:

Attachment: YARN-9937-addendum-01.patch

> Add missing queue configs in RMWebService#CapacitySchedulerQueueInfo
> 
>
> Key: YARN-9937
> URL: https://issues.apache.org/jira/browse/YARN-9937
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Screen Shot 2019-10-28 at 8.54.53 PM.png, 
> YARN-9937-001.patch, YARN-9937-002.patch, YARN-9937-003.patch, 
> YARN-9937-004.patch, YARN-9937-addendum-01.patch, 
> YARN-9937-branch-3.2.001.patch, YARN-9937-branch-3.2.002.patch
>
>
> Below are the missing queue configs which are not part of RMWebServices 
> scheduler endpoint. 
> 1. Maximum Allocation
> 2. Queue ACLs
> 3. Queue Priority
> 4. Application Lifetime






[jira] [Updated] (YARN-8292) Fix the dominant resource preemption cannot happen when some of the resource vector becomes negative

2019-11-06 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-8292:
-
Attachment: YARN-8292.branch-2.009.patch

> Fix the dominant resource preemption cannot happen when some of the resource 
> vector becomes negative
> 
>
> Key: YARN-8292
> URL: https://issues.apache.org/jira/browse/YARN-8292
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8292.001.patch, YARN-8292.002.patch, 
> YARN-8292.003.patch, YARN-8292.004.patch, YARN-8292.005.patch, 
> YARN-8292.006.patch, YARN-8292.007.patch, YARN-8292.008.patch, 
> YARN-8292.009.patch, YARN-8292.branch-2.009.patch
>
>
> This is an example of the problem: 
>   
> {code}
> //   guaranteed,  max,used,   pending
> "root(=[30:18:6  30:18:6 12:12:6 1:1:1]);" + //root
> "-a(=[10:6:2 10:6:2  6:6:3   0:0:0]);" + // a
> "-b(=[10:6:2 10:6:2  6:6:3   0:0:0]);" + // b
> "-c(=[10:6:2 10:6:2  0:0:0   1:1:1])"; // c
> {code}
> There are 3 resource types. The total resource of the cluster is 30:18:6.
> For both a and b, there are 3 containers running, each of which is 2:2:1.
> Queue c uses 0 resources and has 1:1:1 pending.
> Under the existing logic, preemption cannot happen.






[jira] [Commented] (YARN-9956) Improve connection error message for YARN ApiServerClient

2019-11-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968632#comment-16968632
 ] 

Eric Yang commented on YARN-9956:
-

[~prabhujoseph] can you help out with this issue?  Thanks

> Improve connection error message for YARN ApiServerClient
> -
>
> Key: YARN-9956
> URL: https://issues.apache.org/jira/browse/YARN-9956
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Priority: Major
>
> In HA environment, yarn.resourcemanager.webapp.address configuration is 
> optional.  ApiServiceClient may produce confusing error message like this:
> {code}
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host1.example.com:8090
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host2.example.com:8090
> 19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
> GSSException: No valid credentials provided (Mechanism level: Server not 
> found in Kerberos database (7) - LOOKING_UP_SERVER)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: KrbException: Server not found in Kerberos database (7) - 
> LOOKING_UP_SERVER
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
>   at 
> java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
>   ... 15 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>   at 
> java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
>   at 
> java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
>   ... 21 more
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Fail to launch application: 
> java.io.IOException: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:293)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)

[jira] [Commented] (YARN-9956) Improve connection error message for YARN ApiServerClient

2019-11-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968629#comment-16968629
 ] 

Eric Yang commented on YARN-9956:
-

The krb5 log evidence shows the ApiServiceClient attempting to acquire a TGS for 
both resource managers and also for the non-existent RM.

{code}
Oct 30 22:01:41 host1.example.com krb5kdc[4157](info): TGS_REQ (8 etypes {18 17 
20 19 16 23 1 3}) 172.27.135.195: ISSUE: authtime 1572472015, etypes {rep=16 
tkt=16 ses=16}, hb...@example.com for HTTP/host1.example@example.com
Oct 30 22:01:41 host1.example.com krb5kdc[4157](info): TGS_REQ (8 etypes {18 17 
20 19 16 23 1 3}) 172.27.135.195: ISSUE: authtime 1572472015, etypes {rep=16 
tkt=16 ses=16}, hb...@example.com for krbtgt/example@example.com
Oct 30 22:01:42 host1.example.com krb5kdc[4157](info): TGS_REQ (8 etypes {18 17 
20 19 16 23 1 3}) 172.27.135.195: ISSUE: authtime 1572472015, etypes {rep=16 
tkt=16 ses=16}, hb...@example.com for HTTP/host2.example@example.com
Oct 30 22:01:42 host1.example.com krb5kdc[4157](info): TGS_REQ (8 etypes {18 17 
20 19 16 23 1 3}) 172.27.135.195: ISSUE: authtime 1572472015, etypes {rep=16 
tkt=16 ses=16}, hb...@example.com for krbtgt/example@example.com
Oct 30 22:01:42 host1.example.com krb5kdc[4157](info): TGS_REQ (8 etypes {18 17 
20 19 16 23 1 3}) 172.27.135.195: LOOKING_UP_SERVER: authtime 0,  
hb...@example.com for HTTP/0.0@example.com, Server not found in Kerberos 
database
{code}
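
A hedged sketch of the direction this suggests (the check and message text are 
assumptions): fail fast with an explicit error before attempting SPNEGO when 
the webapp address falls back to the 0.0.0.0 default, instead of surfacing a 
GSSException.

{code}
String addr = conf.get("yarn.resourcemanager.webapp.address");
if (addr == null || addr.startsWith("0.0.0.0")) {
  throw new IOException(
      "yarn.resourcemanager.webapp.address is not set for this RM; "
      + "set it (or the per-RM HA variant) so the client can request a "
      + "Kerberos service ticket for a resolvable host.");
}
{code}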


> Improve connection error message for YARN ApiServerClient
> -
>
> Key: YARN-9956
> URL: https://issues.apache.org/jira/browse/YARN-9956
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Yang
>Priority: Major
>
> In HA environment, yarn.resourcemanager.webapp.address configuration is 
> optional.  ApiServiceClient may produce confusing error message like this:
> {code}
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host1.example.com:8090
> 19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
> host2.example.com:8090
> 19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
> 19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
> GSSException: No valid credentials provided (Mechanism level: Server not 
> found in Kerberos database (7) - LOOKING_UP_SERVER)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
>   at 
> java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
>   at 
> org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
> Caused by: KrbException: Server not found in Kerberos database (7) - 
> LOOKING_UP_SERVER
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
>   at 
> java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
>   at 
> java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
>   at 
> java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
>   at 
> java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
>   ... 15 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>   at 
> 

[jira] [Commented] (YARN-9768) RM Renew Delegation token thread should timeout and retry

2019-11-06 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968628#comment-16968628
 ] 

Hadoop QA commented on YARN-9768:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
38s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
39s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
 6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 46s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  
0s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
15s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 30s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
57s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
52s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
46s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 87m 
28s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
39s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}180m  8s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | YARN-9768 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12985083/YARN-9768.008.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 29be7dd0f16b 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 

[jira] [Created] (YARN-9956) Improve connection error message for YARN ApiServerClient

2019-11-06 Thread Eric Yang (Jira)
Eric Yang created YARN-9956:
---

 Summary: Improve connection error message for YARN ApiServerClient
 Key: YARN-9956
 URL: https://issues.apache.org/jira/browse/YARN-9956
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Eric Yang


In HA environment, yarn.resourcemanager.webapp.address configuration is 
optional.  ApiServiceClient may produce confusing error message like this:

{code}
19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
host1.example.com:8090
19/10/30 20:13:42 INFO client.ApiServiceClient: Fail to connect to: 
host2.example.com:8090
19/10/30 20:13:42 INFO util.log: Logging initialized @2301ms
19/10/30 20:13:42 ERROR client.ApiServiceClient: Error: {}
GSSException: No valid credentials provided (Mechanism level: Server not found 
in Kerberos database (7) - LOOKING_UP_SERVER)
at 
java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:771)
at 
java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:266)
at 
java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:125)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient$1.run(ApiServiceClient.java:105)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
Caused by: KrbException: Server not found in Kerberos database (7) - 
LOOKING_UP_SERVER
at 
java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
at 
java.security.jgss/sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
at 
java.security.jgss/sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
at 
java.security.jgss/sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
at 
java.security.jgss/sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
at 
java.security.jgss/sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
at 
java.security.jgss/sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:695)
... 15 more
Caused by: KrbException: Identifier doesn't match expected value (906)
at 
java.security.jgss/sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
at 
java.security.jgss/sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
at 
java.security.jgss/sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
at 
java.security.jgss/sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
... 21 more
19/10/30 20:13:42 ERROR client.ApiServiceClient: Fail to launch application: 
java.io.IOException: java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:293)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:271)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.actionLaunch(ApiServiceClient.java:416)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:589)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:125)
Caused by: java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.generateToken(ApiServiceClient.java:105)
at 
org.apache.hadoop.yarn.service.client.ApiServiceClient.getApiClient(ApiServiceClient.java:290)
... 6 more
Caused by: 
org.apache.hadoop.security.authentication.client.AuthenticationException: 
GSSException: No valid credentials provided (Mechanism level: Server 

[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-11-06 Thread Shane Kumpf (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968539#comment-16968539
 ] 

Shane Kumpf commented on YARN-9562:
---

bq. So you're setting linux-container-executor.nonsecure-mode.limit-users to 
true with linux-container-executor.nonsecure-mode.local-user set to nobody in 
your yarn-site.xml? Is that the use case here?

Exactly correct
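
For reference, the configuration being confirmed, expressed as a sketch through 
the Configuration API (the property names exist in yarn-default.xml; the values 
reflect this scenario):

{code}
Configuration conf = new YarnConfiguration();
conf.setBoolean(
    "yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users",
    true);   // restrict container users to the single configured local user
conf.set(
    "yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user",
    "nobody");  // run all containers as the nobody user
{code}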

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch, YARN-9562.010.patch, YARN-9562.011.patch, 
> YARN-9562.012.patch, YARN-9562.013.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 






[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-11-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968537#comment-16968537
 ] 

Jim Brennan commented on YARN-9562:
---

[~ebadger], [~shaneku...@gmail.com] for the record, I ran with 
{{linux-container-executor.nonsecure-mode.limit-users=false}}

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch, YARN-9562.010.patch, YARN-9562.011.patch, 
> YARN-9562.012.patch, YARN-9562.013.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 






[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-11-06 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968529#comment-16968529
 ] 

Eric Badger commented on YARN-9562:
---

[~shaneku...@gmail.com], thanks for the review!

bq. I am running all containers as the nobody user in this case.
So you're setting {{linux-container-executor.nonsecure-mode.limit-users}} to 
true with {{linux-container-executor.nonsecure-mode.local-user}} set to nobody 
in your yarn-site.xml? Is that the use case here?
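
For reference, that configuration would look roughly like this in 
yarn-site.xml (a sketch assuming the standard {{yarn.nodemanager}} prefix on 
the property names quoted above):

{code:xml}
<!-- Sketch only: in nonsecure mode, run all containers as the configured
     local user rather than as the submitting user. -->
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
  <value>nobody</value>
</property>
{code}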

I'll address the rest of your comments in a followup patch.

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch, YARN-9562.010.patch, YARN-9562.011.patch, 
> YARN-9562.012.patch, YARN-9562.013.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9768) RM Renew Delegation token thread should timeout and retry

2019-11-06 Thread Manikandan R (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manikandan R updated YARN-9768:
---
Attachment: YARN-9768.008.patch

> RM Renew Delegation token thread should timeout and retry
> -
>
> Key: YARN-9768
> URL: https://issues.apache.org/jira/browse/YARN-9768
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9768.001.patch, YARN-9768.002.patch, 
> YARN-9768.003.patch, YARN-9768.004.patch, YARN-9768.005.patch, 
> YARN-9768.006.patch, YARN-9768.007.patch, YARN-9768.008.patch
>
>
> The delegation token renewer thread in the RM (DelegationTokenRenewer.java) 
> renews received HDFS tokens to check their validity and expiration time.
> This call is made to an underlying HDFS NN or Router node (which exposes the 
> same APIs as the HDFS NN). If one of these nodes is bad and the renew call 
> gets stuck, the thread remains stuck indefinitely. The thread should ideally 
> time out the renewToken call and retry from the client's perspective.
>  
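
A minimal sketch of the timeout-and-retry pattern the description asks for, 
bounding each renew attempt with a Future. The renewToken Callable is a 
placeholder, not the actual DelegationTokenRenewer API:

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RenewWithTimeoutSketch {
  private static final ExecutorService POOL =
      Executors.newSingleThreadExecutor();

  // Bound each renew attempt; interrupt and retry if it hangs.
  static long renewWithTimeout(Callable<Long> renewToken, long timeoutSecs,
      int maxAttempts) throws Exception {
    for (int attempt = 1; ; attempt++) {
      Future<Long> result = POOL.submit(renewToken);
      try {
        return result.get(timeoutSecs, TimeUnit.SECONDS);
      } catch (TimeoutException e) {
        result.cancel(true); // interrupt the stuck renew call
        if (attempt >= maxAttempts) {
          throw e; // give up after the configured retries
        }
      }
    }
  }
}
{code}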



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9768) RM Renew Delegation token thread should timeout and retry

2019-11-06 Thread Manikandan R (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968482#comment-16968482
 ] 

Manikandan R commented on YARN-9768:


Taken care of.

> RM Renew Delegation token thread should timeout and retry
> -
>
> Key: YARN-9768
> URL: https://issues.apache.org/jira/browse/YARN-9768
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9768.001.patch, YARN-9768.002.patch, 
> YARN-9768.003.patch, YARN-9768.004.patch, YARN-9768.005.patch, 
> YARN-9768.006.patch, YARN-9768.007.patch, YARN-9768.008.patch
>
>
> The delegation token renewer thread in the RM (DelegationTokenRenewer.java) 
> renews received HDFS tokens to check their validity and expiration time.
> This call is made to an underlying HDFS NN or Router node (which exposes the 
> same APIs as the HDFS NN). If one of these nodes is bad and the renew call 
> gets stuck, the thread remains stuck indefinitely. The thread should ideally 
> time out the renewToken call and retry from the client's perspective.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9011) Race condition during decommissioning

2019-11-06 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968409#comment-16968409
 ] 

Adam Antal commented on YARN-9011:
--

Thanks for the patch, [~pbacsko].

I am not entirely sure that I understood this patch perfectly, but the 
discussion above has made me confident that this is the right approach.

I think it would be a good idea to stick to the generic naming convention by 
renaming {{HostsFileReader$refresh(String,String,boolean)}} to 
{{refreshInternal}}. Could you please do that to make the class clearer?

I was a bit concerned about what happens if an unchecked exception occurs 
inside {{NodesListManager$handleExcludeNodeList}} (between the calls to 
{{HostsFileReader$lazyRefresh}} and {{HostsFileReader$finishRefresh}}), but I 
am assured that the internal structure will not get damaged by this.

Also, a side question: why did you move the following line inside 
{{ResourceTrackerService$nodeHeartbeat}}? Since the same object is received, 
and it is not possible for this object to be added or removed between the two 
calls, I would not touch it.
{code:java}
RMNode rmNode = this.rmContext.getRMNodes().get(nodeId);
{code}

I liked the extra null check you perform on {{rmNode.getState()}} in 
{{ResourceTrackerService$updateAppCollectorsMap}}. I'd take it one step 
further: if you started that function with an rmNode == null condition 
(returning false in that case), we could save the extra null checks in that 
function, and it would also make the null check in 
{{NodesListManager$isGracefullyDecommissionableNode}} unnecessary; a sketch 
follows below.
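
A sketch of that guard clause (the method shape and names here are 
illustrative, not the actual patch):
{code:java}
// Illustrative only: an early return on a null rmNode lets the rest of the
// method, and callers such as isGracefullyDecommissionableNode, drop their
// own null checks.
private boolean updateAppCollectorsMap(RMNode rmNode) {
  if (rmNode == null) {
    return false;
  }
  // ...the remainder can safely dereference rmNode...
  return rmNode.getState() != null;
}
{code}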

I'd give a +1 (non-binding) pending these changes.

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch, 
> YARN-9011-006.patch, YARN-9011-007.patch, YARN-9011-008.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> 

[jira] [Commented] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA

2019-11-06 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968284#comment-16968284
 ] 

Hadoop QA commented on YARN-9605:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
35s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
23s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
 4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
19m 56s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  6m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
25s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 17m 
35s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} cc {color} | {color:red} 17m 35s{color} | 
{color:red} root generated 5 new + 21 unchanged - 5 fixed = 26 total (was 26) 
{color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 17m 
35s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
2m 40s{color} | {color:orange} root: The patch generated 4 new + 22 unchanged - 
0 fixed = 26 total (was 22) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 51s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  7m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  9m 43s{color} 
| {color:red} hadoop-common in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
57s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
55s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 94m  4s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
46s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}233m 52s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.fs.TestTrash |
|   | hadoop.yarn.server.resourcemanager.TestRMRestart |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | YARN-9605 |
| JIRA Patch URL | 

[jira] [Commented] (YARN-9920) YarnAuthorizationProvider AccessRequest gets Null RemoteAddress from FairScheduler

2019-11-06 Thread Sunil G (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968219#comment-16968219
 ] 

Sunil G commented on YARN-9920:
---

cc [~wilfreds], could you also please take a look?

> YarnAuthorizationProvider AccessRequest gets Null RemoteAddress from 
> FairScheduler
> --
>
> Key: YARN-9920
> URL: https://issues.apache.org/jira/browse/YARN-9920
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, security
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9920-001.patch, YARN-9920-002.patch, 
> YARN-9920-003.patch
>
>
> YarnAuthorizationProvider AccessRequest has a null RemoteAddress in the case 
> of FairScheduler. FSQueue#hasAccess uses Server.getRemoteAddress(), which 
> will be null when the call comes from RMWebServices or the EventDispatcher. 
> It works fine when called by the IPC Server Handler.
> FSQueue#hasAccess is called in three places, of which (2) and (3) return null.
> *1. IPC Server -> RMAppManager#createAndPopulateNewRMApp -> FSQueue#hasAccess 
> -> Server.getRemoteAddress returns correct Remote IP.*
>  
> *2. IPC Server -> RMAppManager#createAndPopulateNewRMApp -> 
> AppAddedSchedulerEvent*
>     *EventDispatcher -> FairScheduler#addApplication -> FSQueue.hasAccess -> 
> Server.getRemoteAddress returns null*
>   
> {code:java}
> org.apache.hadoop.yarn.security.ConfiguredYarnAuthorizer.checkPermission(ConfiguredYarnAuthorizer.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.hasAccess(FSQueue.java:316)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:509)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1268)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:133)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> {code}
>  
> *3. RMWebServices -> QueueACLsManager#checkAccess -> FSQueue.hasAccess -> 
> Server.getRemoteAddress returns null.*
> {code:java}
> org.apache.hadoop.yarn.security.ConfiguredYarnAuthorizer.checkPermission(ConfiguredYarnAuthorizer.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.hasAccess(FSQueue.java:316)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.checkAccess(FairScheduler.java:1610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.QueueACLsManager.checkAccess(QueueACLsManager.java:84)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.hasAccess(RMWebServices.java:270)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:553)
> {code}
>  
> I have verified with CapacityScheduler, and it works fine.
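
For digest readers, a minimal illustration of why cases (2) and (3) above see 
null (a sketch of the symptom, not the fix): {{Server.getRemoteAddress()}} 
reads thread-local state set by the IPC layer, so it only has a value on IPC 
handler threads.

{code:java}
// Sketch only: the thread-local remote address is set by the IPC layer.
String remoteAddress = org.apache.hadoop.ipc.Server.getRemoteAddress();
// Non-null only on IPC handler threads; on EventDispatcher and
// RMWebServices threads this is null, so a fix must capture or pass the
// caller's address explicitly.
boolean calledFromIpcHandler = (remoteAddress != null);
{code}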



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org