[ https://issues.apache.org/jira/browse/YARN-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195099#comment-14195099 ]

Wangda Tan commented on YARN-2795:
----------------------------------

Just tried this on a security-enabled cluster. Without the patch, the RM fails to 
start because we don't log in before accessing HDFS. With the patch, the RM starts 
successfully with labels stored on HDFS, and an MR job submitted after startup also 
completes successfully.
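
To illustrate the ordering being fixed, here is a minimal sketch (not taken from the 
attached patches; the class name, the keytab/principal config keys used, and the 
root-dir value are illustrative): on a Kerberos-enabled cluster, the keytab login has 
to happen before the label store's first HDFS access, which is the mkdirs call shown 
in the stack trace below.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.SecurityUtil;

public class LoginBeforeHdfsAccess {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative label-store location; matches the shape of
    // yarn.node-labels.fs-store.root-dir, not an actual cluster value.
    conf.set("yarn.node-labels.fs-store.root-dir",
        "hdfs://namenode:8020/yarn/node-labels");

    // 1. Log in from the RM keytab first; this primes the Kerberos credentials
    //    that the HDFS RPC layer needs.
    SecurityUtil.login(conf,
        "yarn.resourcemanager.keytab",       // config key pointing at the keytab file
        "yarn.resourcemanager.principal");   // config key holding the principal name

    // 2. Only after the login is it safe to create the label-store directory,
    //    which is the mkdirs call that fails in the reported stack trace.
    Path root = new Path(conf.get("yarn.node-labels.fs-store.root-dir"));
    FileSystem fs = root.getFileSystem(conf);
    fs.mkdirs(root);
  }
}
{code}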

> Resource Manager fails startup with HDFS label storage and secure cluster
> -------------------------------------------------------------------------
>
>                 Key: YARN-2795
>                 URL: https://issues.apache.org/jira/browse/YARN-2795
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Phil D'Amore
>            Assignee: Wangda Tan
>         Attachments: YARN-2795-20141101-1.patch, YARN-2795-20141102-1.patch, 
> YARN-2795-20141102-2.patch
>
>
> When node labels are in use, yarn.node-labels.fs-store.root-dir is set to an 
> hdfs:// path, and the cluster is using Kerberos, the RM fails to start while 
> trying to initialize the label store.  The following error/stack trace is 
> observed:
> {code}
> 2014-10-31 11:55:53,807 INFO  service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state INITED; cause: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host.running.rm/10.0.0.34"; destination host is: "host.running.nn":8020;
> java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host.running.rm/10.0.0.34"; destination host is: "host.running.nn":8020;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy14.mkdirs(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
>         at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy15.mkdirs(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2731)
>         at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2702)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859)
>         at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1817)
>         at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.init(FileSystemNodeLabelsStore.java:87)
>         at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:206)
>         at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceInit(CommonNodeLabelsManager.java:199)
>         at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.serviceInit(RMNodeLabelsManager.java:62)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:547)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:986)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:245)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1216)
> Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730)
>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>         ... 30 more
> {code}
> I think this is a startup ordering issue: the scheduler is initialized before 
> the RM primes the cred cache.  My reasoning is based on what happens when I 
> don't set the yarn.node-labels.fs-store.root-dir property, so that no HDFS 
> interaction happens when the scheduler initializes.  Here is the relevant 
> snippet from the log:
> {code}
> 2014-10-31 12:04:09,739 INFO  capacity.CapacityScheduler (CapacityScheduler.java:parseQueue(602)) - Initialized queue: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0
> 2014-10-31 12:04:09,739 INFO  capacity.CapacityScheduler (CapacityScheduler.java:parseQueue(602)) - Initialized queue: root: numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
> 2014-10-31 12:04:09,742 INFO  capacity.CapacityScheduler (CapacityScheduler.java:initializeQueues(466)) - Initialized root queue root: numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>usedCapacity=0.0, numApps=0, numContainers=0
> 2014-10-31 12:04:09,742 INFO  capacity.CapacityScheduler (CapacityScheduler.java:initializeQueueMappings(435)) - Initialized queue mappings, override: false
> 2014-10-31 12:04:09,742 INFO  capacity.CapacityScheduler (CapacityScheduler.java:initScheduler(304)) - Initialized CapacityScheduler with calculator=class org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, minimumAllocation=<<memory:256, vCores:1>>, maximumAllocation=<<memory:2048, vCores:32>>, asynchronousScheduling=false, asyncScheduleInterval=5ms
> 2014-10-31 12:04:09,866 INFO  security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(938)) - Login successful for user rm/host.running...@slider1.example.com using keytab file /etc/security/keytabs/rm.service.keytab
> {code}
> You can see the scheduler initializes, and only then does the cred cache get 
> primed.  This results in a successful RM start, but of course my HDFS-backed 
> labels are not loaded.
> I think that if the cred cache were initialized before the scheduler, this 
> error would not happen.
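
To make the ordering argument above concrete, here is a small, hypothetical sketch 
using Hadoop's service framework: the keytab login happens at the top of 
serviceInit(), before any child service is initialized, so children that touch HDFS 
during their own init already have credentials. The class and its wiring are 
illustrative only and are not taken from the ResourceManager source or the attached 
patches.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.service.CompositeService;

public class SecureServiceOrdering extends CompositeService {

  public SecureServiceOrdering() {
    super(SecureServiceOrdering.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Prime the Kerberos credentials before any child service is initialized...
    SecurityUtil.login(conf,
        "yarn.resourcemanager.keytab",
        "yarn.resourcemanager.principal");

    // ...so that children added here (label store, scheduler, etc.) can safely
    // reach HDFS from their own serviceInit(). Placeholder only; the real RM
    // wires in RMActiveServices, RMNodeLabelsManager, and the scheduler.
    super.serviceInit(conf);
  }
}
{code}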



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
