[jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure

Vinod Kumar Vavilapalli (JIRA) Mon, 07 Mar 2016 12:57:19 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183684#comment-15183684
 ]


Vinod Kumar Vavilapalli commented on YARN-4721:
-----------------------------------------------

bq. rather than the more fundamental "your RM doesn't have the credentials to 
talk to HDFS"
YARN is built to be agnostic of file systems, and your proposal of "ls /" 
breaks this very fundamental assumption; that is why I am against it. One 
could argue for a stand-alone service (outside of YARN) that does these 
validations.

There are apps that do not depend on file systems, e.g. Samza. And there 
are apps that depend on multiple file systems, e.g. distcp. So the notion 
of "this cluster cannot talk to my HDFS" doesn't generalize. It is context 
dependent, and is almost always "my app cannot talk to this or that HDFS 
instance".
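
For illustration only, here is a rough sketch of what a stand-alone, 
per-application check could look like. It is not part of any patch on this 
issue. The class name, the Probe interface and the example URIs are all 
hypothetical; a real probe would wrap whatever call the application relies 
on (for HDFS, something like a root listing), and the "verbose" flag stands 
in for turning up security diagnostics before the retry.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FsReachabilityCheck {

    /** Hypothetical probe: attempts one cheap operation and throws on failure. */
    @FunctionalInterface
    public interface Probe {
        void run(boolean verbose) throws Exception;
    }

    /**
     * Runs every probe once. A probe that fails is retried once with
     * verbose set to true. Returns one entry per file system that failed
     * both attempts, in the form "name: message".
     */
    public static List<String> check(Map<String, Probe> probes) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Probe> e : probes.entrySet()) {
            try {
                e.getValue().run(false);
            } catch (Exception first) {
                try {
                    // Second attempt, asking the probe for more diagnostics.
                    e.getValue().run(true);
                } catch (Exception second) {
                    failed.add(e.getKey() + ": " + second.getMessage());
                }
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Each app lists only the file systems it depends on; the list may be empty.
        Map<String, Probe> probes = new LinkedHashMap<>();
        probes.put("hdfs://example-nn1", verbose -> { /* reachable in this demo */ });
        probes.put("hdfs://example-nn2", verbose -> {
            throw new SecurityException("no credentials");
        });
        System.out.println(check(probes));
    }
}
```

The set of probes is supplied by the caller, so an app with no file-system 
dependency passes an empty map and an app like distcp passes one entry per 
instance it needs.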

> RM to try to auth with HDFS on startup, retry with max diagnostics on failure
> -----------------------------------------------------------------------------
>
>                 Key: YARN-4721
>                 URL: https://issues.apache.org/jira/browse/YARN-4721
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-12889-001.patch
>
>
> If the RM can't auth with HDFS, this can first surface during job submission, 
> which can cause confusion about what's wrong and whose credentials are 
> playing up.
> Instead, the RM could try to talk to HDFS on launch, {{ls /}} should suffice. 
> If it can't auth, it can then tell UGI to log more and retry.
> I don't know what the policy should be if the RM can't auth to HDFS at this 
> point. Certainly it can't currently accept work. But should it fail fast or 
> keep going in the hope that the problem is in the KDC or NN and will fix 
> itself without an RM restart?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
