[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2018-11-16 Thread Hadoop QA (JIRA)


[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690334#comment-16690334 ]

Hadoop QA commented on YARN-4721:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 6s | YARN-4721 does not apply to trunk. Rebase required? Wrong branch? See https://wiki.apache.org/hadoop/HowToContribute for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-4721 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12799703/HADOOP-12289-003.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/22584/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> RM to try to auth with HDFS on startup, retry with max diagnostics on failure
> ------------------------------------------------------------------------------
>
> Key: YARN-4721
> URL: https://issues.apache.org/jira/browse/YARN-4721
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, security
>Affects Versions: 2.8.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: oct16-medium
> Attachments: HADOOP-12289-002.patch, HADOOP-12289-003.patch, 
> HADOOP-12889-001.patch
>
>
> If the RM can't auth with HDFS, this can first surface during job submission, 
> which can cause confusion about what's wrong and whose credentials are 
> playing up.
> Instead, the RM could try to talk to HDFS on launch; an {{ls /}} should suffice. 
> If it can't auth, it can then tell UGI to log more and retry.
> I don't know what the policy should be if the RM can't auth to HDFS at this 
> point. Certainly it can't currently accept work. But should it fail fast or 
> keep going in the hope that the problem is in the KDC or NN and will fix 
> itself without an RM restart?
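
For illustration, a minimal sketch of the startup probe the description asks for. FileSystem, Path, Configuration, and UserGroupInformation are real Hadoop APIs; the method itself and the enableVerboseKerberosLogging() helper are hypothetical, not code from any attached patch:

{noformat}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

// Hedged sketch of the proposed startup check. On auth failure it raises
// Kerberos logging and retries once, so the failure surfaces at RM startup
// rather than at job submission.
void probeDefaultFilesystem(Configuration conf) throws IOException {
  if (!UserGroupInformation.isSecurityEnabled()) {
    return;                                // nothing to diagnose without Kerberos
  }
  FileSystem fs = FileSystem.get(conf);    // the cluster's default filesystem
  try {
    fs.listStatus(new Path("/"));          // the "ls /" from the description
  } catch (IOException e) {
    enableVerboseKerberosLogging();        // hypothetical helper: raise UGI/JAAS log levels
    fs.listStatus(new Path("/"));          // retry with maximum diagnostics
  }
}
{noformat}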






[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2017-10-11 Thread Subru Krishnan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201229#comment-16201229 ]

Subru Krishnan commented on YARN-4721:
--------------------------------------

Pushing it out from 2.9.0 due to lack of recent activity. Feel free to revert 
if required.




[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2017-09-29 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186549#comment-16186549 ]

Hadoop QA commented on YARN-4721:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 6s | YARN-4721 does not apply to trunk. Rebase required? Wrong branch? See https://wiki.apache.org/hadoop/HowToContribute for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-4721 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12799703/HADOOP-12289-003.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/17706/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.






[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-05-13 Thread Junping Du (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282657#comment-15282657 ]

Junping Du commented on YARN-4721:
----------------------------------

[~ste...@apache.org], any update on the patch?




[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-04-28 Thread Junping Du (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262130#comment-15262130 ]

Junping Du commented on YARN-4721:
----------------------------------

Thanks [~ste...@apache.org] for updating the patch. The 003 patch looks pretty good.
Just one question: it seems KDiagBinding uses the builder pattern for all optional parameters in construction. The only exception is withKeytabConfKey:
{noformat}
+public void withKeytabConfKey(String key) {
+  this.keytabConfKey = key;
+}
{noformat}
Shall we follow the same builder pattern here?
If so, in ResourceManager.doSecureLogin(),
{noformat}
+String keytabFilename = conf.get(YarnConfiguration.RM_KEYTAB);
+if (keytabFilename != null) {
+  binding.withKeytab(new File(keytabFilename));
+}
{noformat}
could we simply call binding.withKeytabConfKey(YarnConfiguration.RM_KEYTAB) and handle the null case inside withKeytabConfKey()?
Everything else looks good to me.
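
A minimal sketch of that suggestion. KDiagBinding, keytabConfKey, and withKeytab come from the patch excerpts above; the Configuration field and everything else are assumptions for illustration, not the actual patch:

{noformat}
// Hedged sketch. Returning "this" makes the method chainable like the other
// with* builders, and resolving the key against the Configuration internally
// absorbs the caller's null check. Assumes java.io.File is imported and the
// binding holds a Configuration field named conf.
public KDiagBinding withKeytabConfKey(String key) {
  this.keytabConfKey = key;
  String keytabFilename = conf.get(key);
  if (keytabFilename != null) {
    withKeytab(new File(keytabFilename));  // reuse the existing builder method
  }
  return this;
}

// ResourceManager.doSecureLogin() then collapses to:
//   binding.withKeytabConfKey(YarnConfiguration.RM_KEYTAB);
{noformat}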



[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-04-19 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248684#comment-15248684 ]

Hadoop QA commented on YARN-4721:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 16s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| 0 | mvndep | 0m 53s | Maven dependency ordering for branch |
| +1 | mvninstall | 6m 55s | trunk passed |
| +1 | compile | 5m 42s | trunk passed with JDK v1.8.0_77 |
| +1 | compile | 6m 49s | trunk passed with JDK v1.7.0_95 |
| +1 | checkstyle | 1m 9s | trunk passed |
| +1 | mvnsite | 1m 33s | trunk passed |
| +1 | mvneclipse | 0m 30s | trunk passed |
| +1 | findbugs | 2m 41s | trunk passed |
| +1 | javadoc | 1m 20s | trunk passed with JDK v1.8.0_77 |
| +1 | javadoc | 1m 33s | trunk passed with JDK v1.7.0_95 |
| 0 | mvndep | 0m 15s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 14s | the patch passed |
| +1 | compile | 6m 2s | the patch passed with JDK v1.8.0_77 |
| +1 | javac | 6m 2s | the patch passed |
| +1 | compile | 6m 48s | the patch passed with JDK v1.7.0_95 |
| +1 | javac | 6m 48s | the patch passed |
| -1 | checkstyle | 1m 8s | root: patch generated 19 new + 228 unchanged - 3 fixed = 247 total (was 231) |
| +1 | mvnsite | 1m 33s | the patch passed |
| +1 | mvneclipse | 0m 30s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| -1 | findbugs | 1m 53s | hadoop-common-project/hadoop-common generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) |
| +1 | javadoc | 1m 16s | the patch passed with JDK v1.8.0_77 |
| +1 | javadoc | 1m 34s | the patch passed with JDK v1.7.0_95 |
| +1 | unit | 7m 25s | hadoop-common in the patch passed with JDK v1.8.0_77. |
| -1 | unit | 79m 59s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_77. |
| +1 | unit | 7m 43s | hadoop-common in the patch passed with JDK v1.7.0_95. |
| -1 | unit | 82m 35s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. |

[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-04-19 Thread Junping Du (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247975#comment-15247975 ]

Junping Du commented on YARN-4721:
----------------------------------

bq. I'm going to propose "hadoop.security.kerberos.diagnostics", rather than an RM-specific one. That way, the code can be replicated in the NMs and the HDFS components, with the same outcome: Hadoop-wide diagnostics.

+1. That sounds reasonable. Shall we split the JIRA into two: one for the Hadoop-common code and one for YARN?



[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-04-19 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247864#comment-15247864 ]

Steve Loughran commented on YARN-4721:
--------------------------------------

I had a quick code review of this with [~djp]; summary:

# make this an option which can be turned on or off
# catch & swallow (log at WARN) all KDiag exceptions, to increase confidence that the diagnostics won't break things

Regarding configuration, I'm going to propose "hadoop.security.kerberos.diagnostics" rather than an RM-specific key. That way, the code can be replicated in the NMs and the HDFS components, with the same outcome: Hadoop-wide diagnostics.
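
A hedged sketch of what those two review points could look like together. The property name is the one proposed above; the runKerberosDiagnostics() entry point, the LOG field, and the surrounding method are assumptions, not the committed code:

{noformat}
// Illustrative only. Point (1): the probe is opt-in via configuration.
// Point (2): any exception from the diagnostics is logged at WARN and
// swallowed, so a broken diagnostic can never take the daemon down.
public static final String KEY_KERBEROS_DIAGNOSTICS =
    "hadoop.security.kerberos.diagnostics";

void maybeRunKerberosDiagnostics(Configuration conf) {
  if (!conf.getBoolean(KEY_KERBEROS_DIAGNOSTICS, false)) {
    return;                                     // off by default
  }
  try {
    runKerberosDiagnostics(conf);               // hypothetical KDiag entry point
  } catch (Exception e) {
    LOG.warn("Kerberos diagnostics failed", e); // log & swallow, never rethrow
  }
}
{noformat}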



[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-03-08 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184752#comment-15184752 ]

Steve Loughran commented on YARN-4721:
--------------------------------------

I'm trying to say the test is "if this cluster has a default filesystem, it can be listed".

Not HDFS specifically, just that fs.default.name is not empty. We could even make the check conditional on the cluster having a default FS. But if you do have a default FS, YARN had better be able to talk to it. I will change the title to make clear that this is broader than just HDFS.

> One could argue for a stand-alone service (outside of YARN) that does these 
> validations.

That doesn't address the problem I'm looking at, which is: validate that a specific process, started under a specific principal on a specific host, has the credentials needed to access a critical part of the cluster infrastructure, the default FS.

> So, the notion of "this cluster cannot talk to my HDFS" doesn't generalize. 
> It is context-dependent and almost always "my app cannot talk to this and 
> that HDFS instance".

I agree, which is why distcp will need special attention. However, YARN does have a specific notion of the cluster's default FS; ATS 1.5 ramps this up by only working with an FS whose flush() makes data durable and visible to other readers (though it doesn't require the metadata to be complete/visible).

It's the authentication of that YARN process to the cluster FS, or more specifically identifying why it sometimes doesn't happen, that I'm trying to look at.

Anyway, this initial patch doesn't attempt any of that: it checks UGI.isSecurityEnabled and, if security is on, runs some extra diagnostics, failing fast on a few conditions guaranteed to stop Hadoop working. Do you have any issues with that part of the patch?



[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-03-07 Thread Vinod Kumar Vavilapalli (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183684#comment-15183684 ]

Vinod Kumar Vavilapalli commented on YARN-4721:
-----------------------------------------------

bq. rather than the more fundamental "your RM doesn't have the credentials to talk to HDFS"

The thing is, YARN is built agnostic of file-systems, and your proposal of "ls /" breaks that fundamental assumption; that is why I am against it. One could argue for a stand-alone service (outside of YARN) that does these validations.

There are apps that do not depend on file-systems, e.g. Samza. And there are apps that depend on multiple file-systems, e.g. distcp. So the notion of "this cluster cannot talk to my HDFS" doesn't generalize. It is context-dependent and almost always "my app cannot talk to this and that HDFS instance".



[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-03-07 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182810#comment-15182810 ]

Steve Loughran commented on YARN-4721:
--------------------------------------

This patch, initially, sets up Kerberos diagnostics without side effects.

I'd also like to do, after this, an {{ls /}} of the filesystem. Maybe just make this another option that runs if yarn.resourcemanager.kdiag.enabled=true. Against a kerberized FS this would trigger negotiation early and, if there are problems, report them. (This would have to be done asynchronously, obviously.)

The problem with the current talk-during-token-renewal process is that if any problem surfaces, it doesn't surface until someone submits work. It then surfaces as "job submit failed", rather than the more fundamental "your RM doesn't have the credentials to talk to HDFS".

This would not be a hard-coded binding to HDFS; it is purely a check that the cluster FS is readable by the RM principal.
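
A hedged sketch of that async option. The config key is the one named above; the thread setup, method name, and log messages are illustrative, not from the patch:

{noformat}
// Illustrative only: run the "ls /" probe on a daemon thread so a slow or
// unreachable KDC/NN can never block RM startup. Assumes FileSystem, Path,
// Configuration, and a LOG field are in scope.
void scheduleDefaultFsProbe(Configuration conf) {
  if (!conf.getBoolean("yarn.resourcemanager.kdiag.enabled", false)) {
    return;
  }
  Thread probe = new Thread(() -> {
    try {
      FileSystem.get(conf).listStatus(new Path("/")); // forces authentication early
      LOG.info("Default filesystem probe succeeded");
    } catch (Exception e) {
      LOG.warn("Default filesystem probe failed", e); // report, don't fail the RM
    }
  }, "rm-default-fs-probe");
  probe.setDaemon(true);
  probe.start();
}
{noformat}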



[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-03-04 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181354#comment-15181354 ]

Hadoop QA commented on YARN-4721:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 12s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| 0 | mvndep | 0m 16s | Maven dependency ordering for branch |
| +1 | mvninstall | 6m 37s | trunk passed |
| +1 | compile | 6m 5s | trunk passed with JDK v1.8.0_74 |
| +1 | compile | 6m 46s | trunk passed with JDK v1.7.0_95 |
| +1 | checkstyle | 1m 7s | trunk passed |
| +1 | mvnsite | 1m 30s | trunk passed |
| +1 | mvneclipse | 0m 30s | trunk passed |
| +1 | findbugs | 2m 39s | trunk passed |
| +1 | javadoc | 1m 15s | trunk passed with JDK v1.8.0_74 |
| +1 | javadoc | 1m 32s | trunk passed with JDK v1.7.0_95 |
| 0 | mvndep | 0m 14s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 12s | the patch passed |
| +1 | compile | 6m 2s | the patch passed with JDK v1.8.0_74 |
| +1 | javac | 6m 2s | the patch passed |
| +1 | compile | 6m 45s | the patch passed with JDK v1.7.0_95 |
| +1 | javac | 6m 45s | the patch passed |
| -1 | checkstyle | 1m 7s | root: patch generated 19 new + 228 unchanged - 3 fixed = 247 total (was 231) |
| +1 | mvnsite | 1m 30s | the patch passed |
| +1 | mvneclipse | 0m 29s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| -1 | findbugs | 1m 51s | hadoop-common-project/hadoop-common generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) |
| +1 | javadoc | 1m 17s | the patch passed with JDK v1.8.0_74 |
| +1 | javadoc | 1m 33s | the patch passed with JDK v1.7.0_95 |
| +1 | unit | 7m 13s | hadoop-common in the patch passed with JDK v1.8.0_74. |
| -1 | unit | 72m 19s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_74. |
| +1 | unit | 7m 18s | hadoop-common in the patch passed with JDK v1.7.0_95. |
| -1 | unit | 71m 46s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. |

[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-03-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180595#comment-15180595 ]

Vinod Kumar Vavilapalli commented on YARN-4721:
-----------------------------------------------

bq. I don't know what the policy should be if the RM can't auth to HDFS at this point.

By design, (most of) the RM is agnostic of file-systems.

bq. Instead, the RM could try to talk to HDFS on launch, ls / should suffice. If it can't auth, it can then tell UGI to log more and retry.

There are only a couple of places with run-time dependencies: (a) the user passes HDFS delegation-tokens for auto-renewal; (b) some of the generic-history / Timeline Service implementations are file-system based. But those are run-time dependencies, and we should actively avoid any static dependencies like "ls /".

I don't understand the patch completely, but it seems like you are adding extra validation checks to make sure that the RM can authenticate successfully with *kerberos* (and log diagnostics in case of failure), not with HDFS itself specifically. If I am reading that right, it should be okay to do such diagnostics.



[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

2016-03-04 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180522#comment-15180522 ]

Steve Loughran commented on YARN-4721:
--------------------------------------

HADOOP-12889 covers the changes to KDiag & UGI for this.
