[ 
https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180595#comment-15180595
 ] 

Vinod Kumar Vavilapalli commented on YARN-4721:
-----------------------------------------------

bq. I don't know what the policy should be if the RM can't auth to HDFS at this 
point.
By design, (most of) the RM is agnostic of file systems.

bq. Instead, the RM could try to talk to HDFS on launch, ls / should suffice. 
If it can't auth, it can then tell UGI to log more and retry.
The RM has run-time dependencies on a file system in only a couple of places: 
(a) when a user passes HDFS delegation tokens for auto-renewal, and (b) in some 
of the generic-history / Timeline-Service implementations, which are 
file-system based. Both arise only at run time, so we should actively avoid 
adding any static dependencies like "ls /".

I don't understand the patch completely, but it seems like you are adding 
extra validation checks to make sure that the RM can authenticate successfully 
with *Kerberos* (and log diagnostics in case of failure), rather than with HDFS 
itself specifically. If I am getting that right, it should be okay to do such 
diagnostics.
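
To make the shape of such a check concrete, here is a minimal sketch, not the 
attached patch. It has no Hadoop dependency: the login attempt and the "log 
more" switch are passed in as hooks, and every name in it (StartupAuthCheck, 
check, login, enableDiagnostics) is hypothetical. The assumption is that in a 
real RM the login hook would wrap the Kerberos login and that nothing in it 
would touch HDFS.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch only; not a Hadoop API and not the YARN-4721 patch. */
final class StartupAuthCheck {

    /** Outcome of the startup check. */
    enum Result { OK, OK_AFTER_RETRY, FAILED }

    private StartupAuthCheck() {}

    /**
     * Runs the login attempt once. If it fails, turns on extra diagnostics
     * and retries exactly once.
     *
     * @param login hypothetical stand-in for the Kerberos login attempt
     * @param enableDiagnostics hypothetical hook that raises security logging
     */
    static Result check(Callable<Void> login, Runnable enableDiagnostics) {
        try {
            login.call();
            return Result.OK;
        } catch (Exception first) {
            System.err.println("Login failed, retrying with diagnostics: " + first);
            enableDiagnostics.run();
        }
        try {
            login.call();
            return Result.OK_AFTER_RETRY;
        } catch (Exception second) {
            System.err.println("Login failed again: " + second);
            return Result.FAILED;
        }
    }
}
```

Whether a FAILED result should stop the RM or only be logged is the policy 
question raised in the issue description; the sketch deliberately leaves that 
decision to the caller.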

> RM to try to auth with HDFS on startup, retry with max diagnostics on failure
> -----------------------------------------------------------------------------
>
>                 Key: YARN-4721
>                 URL: https://issues.apache.org/jira/browse/YARN-4721
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-12889-001.patch
>
>
> If the RM can't auth with HDFS, this can first surface during job submission, 
> which can cause confusion about what's wrong and whose credentials are 
> playing up.
> Instead, the RM could try to talk to HDFS on launch, {{ls /}} should suffice. 
> If it can't auth, it can then tell UGI to log more and retry.
> I don't know what the policy should be if the RM can't auth to HDFS at this 
> point. Certainly it can't currently accept work. But should it fail fast or 
> keep going in the hope that the problem is in the KDC or NN and will fix 
> itself without an RM restart?



