[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969379#comment-15969379 ]

Michael Gummelt commented on SPARK-20328:
-----------------------------------------

Ah, yes, of course. Thanks.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -----------------------------------------------------------------
>
>                 Key: SPARK-20328
>                 URL: https://issues.apache.org/jira/browse/SPARK-20328
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.1.1, 2.1.2
>            Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a
> Hadoop MapReduce job. Practically, this is a problem because this line:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
> results in this MapReduce-specific security code being called:
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
> which assumes the MapReduce master is configured (e.g. via
> {{yarn.resourcemanager.*}}). If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support
> for the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
>         at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>         at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>         at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}}
> refactoring to make split information available without going through
> {{JobConf}}, so I'm not yet sure what the best course of action is.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968474#comment-15968474 ]

Marcelo Vanzin commented on SPARK-20328:
----------------------------------------

bq. Since the driver is authenticated, it can request further delegation tokens

No. To create a delegation token you need a TGT. You can't create a delegation token just with an existing delegation token. If that were possible, all the shenanigans to distribute the user's keytab for long-running applications wouldn't be needed.
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968469#comment-15968469 ]

Michael Gummelt commented on SPARK-20328:
-----------------------------------------

bq. I have no idea what that means.

I'm pretty sure a delegation token is just another way for a subject to authenticate. So the driver uses the delegation token provided to it by {{spark-submit}} to authenticate. This is what I mean by "driver is already logged in via the delegation token". Since the driver is authenticated, it can request further delegation tokens. But my point is that it shouldn't need to, because that code is not "delegating" the tokens to any other process, which is the only thing delegation tokens are needed for.

But this is neither here nor there. I think I know what I have to do.
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968440#comment-15968440 ]

Marcelo Vanzin commented on SPARK-20328:
----------------------------------------

bq. that the driver is already logged in via the delegation token

I have no idea what that means. Yes, "spark-submit" has a TGT. It uses it to log in to e.g. HDFS and generate delegation tokens. But in this situation, the *driver* has no TGT; it only has the delegation token generated by the spark-submit process, which has no communication with the driver.

So when the driver calls into that code you linked that tries to fetch delegation tokens, it should fail. But it doesn't. Which tells me the code detects whether there is already a valid delegation token and, in that case, doesn't try to create a new one.

So what I'm saying is that if the above is correct, all you have to do is create delegation tokens yourself when the Mesos backend initializes (i.e. before any HadoopRDD code is run), and you'll avoid the issue with setting the configuration options. That's something you'd have to do anyway, because delegation tokens are needed by the executors to talk to the data nodes.
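The suggestion above (creating delegation tokens yourself when the Mesos backend initializes, before any HadoopRDD code runs) might look roughly like the following. This is a hypothetical sketch, not actual backend code: {{hadoopConf}} and {{renewerPrincipal}} are assumed to be in scope, and it assumes the process still holds a TGT at that point.

{code}
// Hypothetical sketch: pre-create HDFS delegation tokens at backend
// init, so that later token-fetching code finds them already present.
val creds = new org.apache.hadoop.security.Credentials()
// FileSystem.addDelegationTokens obtains tokens for this filesystem
// (and any filesystems it depends on) with the given renewer.
org.apache.hadoop.fs.FileSystem.get(hadoopConf)
  .addDelegationTokens(renewerPrincipal, creds)
// Make the tokens visible to subsequent Hadoop calls in this process.
org.apache.hadoop.security.UserGroupInformation.getCurrentUser
  .addCredentials(creds)
{code}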
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968432#comment-15968432 ]

Michael Gummelt commented on SPARK-20328:
-----------------------------------------

bq. It depends. e.g. on YARN, when you submit in cluster mode, the driver is running in the cluster and all it has are delegation tokens. (The TGT is only available to the launcher process.)

Right, but my understanding is that the driver is already logged in via the delegation token provided to it by the {{spark-submit}} process (via {{amContainer.setTokens}}), so it wouldn't need to then fetch further delegation tokens.
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968423#comment-15968423 ]

Marcelo Vanzin commented on SPARK-20328:
----------------------------------------

bq. But it shouldn't need delegation tokens at all, right?

It depends. e.g. on YARN, when you submit in cluster mode, the driver is running in the cluster and all it has are delegation tokens. (The TGT is only available to the launcher process.)

Actually it would be interesting to understand how that case works internally; because if that code is trying to generate delegation tokens, it should theoretically fail in the above scenario. So maybe it doesn't generate tokens if they're already there, and that could be a workaround for your case too.
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968416#comment-15968416 ]

Michael Gummelt commented on SPARK-20328:
-----------------------------------------

bq. It shouldn't need to do it, not for the reasons you mention, but because Spark already has the necessary credentials available (either a TGT, or a valid delegation token for HDFS).

But it shouldn't need delegation tokens at all, right?
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968411#comment-15968411 ]

Michael Gummelt commented on SPARK-20328:
-----------------------------------------

bq. The Mesos backend (I mean the code in Spark, not the Mesos service) can set the configs in the SparkContext's "hadoopConfiguration" object, can't it?

I suppose this would work. It would rely on the assumption that the Mesos scheduler backend is started before the HadoopRDD is created, which happens to be true, but ideally we wouldn't have to rely on that ordering. Right now I'm just setting it in {{SparkSubmit}}, but that's not great either.

I filed a Hadoop ticket for the {{FileInputFormat}} issue and linked it here.
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968396#comment-15968396 ]

Marcelo Vanzin commented on SPARK-20328:
----------------------------------------

bq. The problem can't be solved in the Mesos backend

I meant setting the configs. The Mesos backend (I mean the code in Spark, not the Mesos service) can set the configs in the SparkContext's "hadoopConfiguration" object, can't it? Otherwise you'd be putting a burden on the user to have a proper Hadoop config around with those properties set.

bq. is why in the world is FileInputFormat fetching delegation tokens

That's actually a good question. It shouldn't need to do it, not for the reasons you mention, but because Spark already has the necessary credentials available (either a TGT, or a valid delegation token for HDFS).
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968388#comment-15968388 ]

Michael Gummelt commented on SPARK-20328:
-----------------------------------------

Hey [~vanzin], thanks for the response. Everything you said is correct, but I want to clarify one thing:

bq. You just need to make the Mesos backend in Spark do that automatically for the submitting user.

The problem can't be solved in the Mesos backend. When I fetch delegation tokens for transmission to executors in the Mesos backend, there's no problem. I can set whatever renewer I want. The problem is that there's a second location where delegation tokens are fetched: {{HadoopRDD}}. This is entirely separate from the fetching that the scheduler backends do (either Mesos or YARN). {{HadoopRDD}} tries to fetch split data, and ultimately calls into {{TokenCache}} in the hadoop library, which fetches delegation tokens with the renewer set to the YARN ResourceManager's principal:
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213

The big question I have, which I suppose is more for the {{hadoop}} team, is why in the world {{FileInputFormat}} is fetching delegation tokens at all. AFAICT, they're not sending those tokens to any other process. They're just fetching split data directly from the NameNodes, and there should be no delegation required.
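For concreteness, the call path described above can be triggered by nothing more than materializing a {{HadoopRDD}}'s partitions. A hypothetical reproduction sketch (the path and app name are placeholders; assumes a Kerberized HDFS with no {{yarn.resourcemanager.*}} settings in the Hadoop config):

{code}
val sc = new SparkContext(new SparkConf().setAppName("repro"))
// textFile builds a HadoopRDD; forcing .partitions runs
// getPartitions -> FileInputFormat.getSplits -> listStatus ->
// TokenCache.obtainTokensForNamenodes, which throws
// "Can't get Master Kerberos principal for use as renewer".
sc.textFile("hdfs://namenode:8020/some/path").partitions
{code}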
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
>         at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>         at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>         at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to trick {{TokenCache}} into thinking YARN is configured, but this is obviously suboptimal.
> The proper fix to this would likely require significant {{hadoop}} refactoring to make split information available without going through {{JobConf}}, so I'm not yet sure what the best course of action is.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968343#comment-15968343 ] Marcelo Vanzin commented on SPARK-20328:
-
Hmm... that seems related to delegation token support. Delegation tokens need a "renewer", and in YARN applications the renewer is generally the YARN service (IIRC), since it takes care of renewing delegation tokens submitted with your application (and cancels them after the application is done).

In your case Mesos doesn't know about Kerberos, so the user submitting the app needs to be the renewer; and, aside from this particular issue, you may need to add code to actually renew those tokens periodically (which is different from creating new tokens after their max lifetime).

I don't think you'll find a different way around this from the one you have (setting the YARN configs). You just need to make the Mesos backend in Spark do that automatically for the submitting user. As for the Hadoop library, you could open a bug so they can add an explicit option that lets non-MR, non-YARN applications set the renewer more easily.
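The periodic renewal mentioned in Vanzin's comment (renewing an existing token within its lifetime, as opposed to creating a new token after the max lifetime expires) could be sketched roughly as below. This is a hypothetical illustration, not Spark or Hadoop code: the {{RenewableToken}} interface stands in for {{org.apache.hadoop.security.token.Token}}, and the 0.75 safety fraction is an arbitrary choice.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a periodic delegation-token renewal loop.
public class TokenRenewerSketch {
    // Stand-in for a real Hadoop token; renew() would contact the NameNode
    // and return the new expiration time in epoch milliseconds.
    interface RenewableToken {
        long renew() throws Exception;
    }

    // Renew at a safety fraction of the remaining validity window, so the
    // renewal happens well before the token expires.
    static long renewalDelayMs(long nowMs, long expirationMs, double fraction) {
        long window = Math.max(0, expirationMs - nowMs);
        return (long) (window * fraction);
    }

    static void scheduleRenewal(ScheduledExecutorService scheduler,
                                RenewableToken token,
                                long nowMs, long expirationMs) {
        long delay = renewalDelayMs(nowMs, expirationMs, 0.75);
        scheduler.schedule(() -> {
            try {
                long newExpiration = token.renew();
                // Renewal extends the expiration, so re-schedule; this works
                // only up to the token's max lifetime.
                scheduleRenewal(scheduler, token,
                        System.currentTimeMillis(), newExpiration);
            } catch (Exception e) {
                // Past the max lifetime renewal fails, and a brand-new token
                // must be obtained instead.
                System.err.println("renewal failed: " + e.getMessage());
            }
        }, delay, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // e.g. a token valid for 24h is first renewed after 18h:
        System.out.println("delay=" + renewalDelayMs(0, 24 * 3600 * 1000L, 0.75));
    }
}
```

The design point is the distinction Vanzin draws: the loop above only *renews* within the token's lifetime; creating replacement tokens after the max lifetime is a separate mechanism.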
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968232#comment-15968232 ] Michael Gummelt commented on SPARK-20328:
-
cc [~colorant] [~hfeng] [~vanzin]