[ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180595#comment-15180595 ]
Vinod Kumar Vavilapalli commented on YARN-4721:
-----------------------------------------------
bq. I don't know what the policy should be if the RM can't auth to HDFS at this
point.
By design, (most of) the RM is agnostic of file systems.
bq. Instead, the RM could try to talk to HDFS on launch, ls / should suffice.
If it can't auth, it can then tell UGI to log more and retry.
There are only a couple of places with run-time dependencies: (a) the user
passes HDFS delegation tokens for auto-renewal, and (b) some of the
generic-history / Timeline-Service implementations are file-system based. Both
arise only at run-time, though, and we should actively avoid any static
dependencies like "ls /".
I don't fully understand the patch, but it seems you are adding extra
validation checks to make sure that the RM can authenticate successfully with
*Kerberos* (logging diagnostics in case of failures), not with HDFS
specifically. If I am getting that right, it should be okay to do such
diagnostics.
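
To make the idea concrete, here is a minimal sketch of a file-system-agnostic startup check: try to authenticate once quietly, and on failure retry with diagnostics enabled. This is not the attached patch. The class StartupAuthCheck, the LoginAttempt interface and the verbose flag are hypothetical names used only for illustration; in a real RM the login step would presumably wrap a UGI keytab login, and "verbose" would map to whatever extra Kerberos logging is available.

```java
/**
 * Hypothetical sketch of a startup authentication check.
 * Not the YARN-4721 patch; all names here are illustrative assumptions.
 */
class StartupAuthCheck {

    /** Stand-in for one Kerberos login attempt (hypothetical interface). */
    interface LoginAttempt {
        // verbose == true means "run this attempt with extra diagnostics enabled"
        void login(boolean verbose) throws Exception;
    }

    /**
     * Tries to log in up to maxAttempts times. The first attempt runs quietly;
     * every retry runs with diagnostics switched on.
     *
     * @return the 1-based number of the attempt that succeeded
     * @throws IllegalStateException if every attempt fails
     */
    static int loginWithDiagnostics(LoginAttempt attempt, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception lastFailure = null;
        for (int i = 1; i <= maxAttempts; i++) {
            boolean verbose = i > 1; // only retries ask for more logging
            try {
                attempt.login(verbose);
                return i;
            } catch (Exception e) {
                lastFailure = e;
                System.err.println("Kerberos login attempt " + i + " of "
                        + maxAttempts + " failed: " + e);
            }
        }
        throw new IllegalStateException(
                "Could not authenticate after " + maxAttempts + " attempts",
                lastFailure);
    }

    public static void main(String[] args) {
        // Simulated login that only succeeds once diagnostics are on.
        int used = loginWithDiagnostics(verbose -> {
            if (!verbose) {
                throw new Exception("simulated failure: no valid credentials");
            }
        }, 3);
        System.out.println("authenticated on attempt " + used);
    }
}
```

Note that nothing in the sketch touches a file system, so it adds no static dependency such as "ls /". Whether to fail fast or keep running once all attempts are exhausted is left to the caller, since that policy question is still open on this issue.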
> RM to try to auth with HDFS on startup, retry with max diagnostics on failure
> -----------------------------------------------------------------------------
>
> Key: YARN-4721
> URL: https://issues.apache.org/jira/browse/YARN-4721
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-12889-001.patch
>
>
> If the RM can't auth with HDFS, this can first surface during job submission,
> which can cause confusion about what's wrong and whose credentials are
> playing up.
> Instead, the RM could try to talk to HDFS on launch, {{ls /}} should suffice.
> If it can't auth, it can then tell UGI to log more and retry.
> I don't know what the policy should be if the RM can't auth to HDFS at this
> point. Certainly it can't currently accept work. But should it fail fast or
> keep going in the hope that the problem is in the KDC or NN and will fix
> itself without an RM restart?