[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-18 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r441986877



##
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
##
@@ -74,10 +76,11 @@ package object config {
 .doc("Whether to populate Hadoop classpath from `yarn.application.classpath` and " +
   "`mapreduce.application.classpath` Note that if this is set to `false`, it requires " +
   "a `with-Hadoop` Spark distribution that bundles Hadoop runtime or user has to provide " +
-  "a Hadoop installation separately.")
+  "a Hadoop installation separately. By default, for `with-hadoop` Spark distribution, " +
+  "this is set to `false`; for `no-hadoop` distribution, this is set to `true`.")
 .version("2.4.6")
 .booleanConf
-.createWithDefault(true)
+.createWithDefault(isHadoopProvided())

Review comment:
   I'll add it as a followup PR.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-18 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r441986680



##
File path: docs/running-on-yarn.md
##
@@ -82,6 +82,19 @@ In `cluster` mode, the driver runs on a different machine than the client, so `S
 
 Running Spark on YARN requires a binary distribution of Spark which is built with YARN support.
 Binary distributions can be downloaded from the [downloads page](https://spark.apache.org/downloads.html) of the project website.
+There are two variants of Spark binary distributions you can download. One is pre-built with a certain
+version of Apache Hadoop; this Spark distribution contains built-in Hadoop runtime, so we call it with-hadoop Spark

Review comment:
   addressed. thanks.








[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-18 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r441985622



##
File path: docs/running-on-yarn.md
##
@@ -82,6 +82,19 @@ In `cluster` mode, the driver runs on a different machine than the client, so `S
 
 Running Spark on YARN requires a binary distribution of Spark which is built with YARN support.
 Binary distributions can be downloaded from the [downloads page](https://spark.apache.org/downloads.html) of the project website.
+There are two variants of Spark binary distributions you can download. One is pre-built with a certain
+version of Apache Hadoop; this Spark distribution contains built-in Hadoop runtime, so we call it `with-hadoop` Spark
+distribution. The other one is pre-built with user-provided Hadoop; since this Spark distribution
+doesn't contain a built-in Hadoop runtime, it's smaller, but users have to provide a Hadoop installation separately.
+We call this variant `no-hadoop` Spark distribution. For `with-hadoop` Spark distribution, since
+it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not
+populate Yarn's classpath into Spark. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true.
+For `no-hadoop` Spark distribution, Spark will populate Yarn's classpath by default in order to get Hadoop runtime. Note that some features such
+as Hive support are not available in `no-hadoop` Spark distribution. For `with-hadoop` Spark distribution,

Review comment:
   Okay, I'm getting old now, bad memory :) I somehow remember you told me this, and that was why our internal `no-hadoop` build could not support Hive. I removed those lines. Thanks.











[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-16 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r440618141



##
File path: docs/running-on-yarn.md
##
@@ -82,6 +82,19 @@ In `cluster` mode, the driver runs on a different machine than the client, so `S
 
 Running Spark on YARN requires a binary distribution of Spark which is built with YARN support.
 Binary distributions can be downloaded from the [downloads page](https://spark.apache.org/downloads.html) of the project website.
+There are two variants of Spark binary distributions you can download. One is pre-built with a certain
+version of Apache Hadoop; this Spark distribution contains built-in Hadoop runtime, so we call it `with-hadoop` Spark
+distribution. The other one is pre-built with user-provided Hadoop; since this Spark distribution
+doesn't contain a built-in Hadoop runtime, it's smaller, but users have to provide a Hadoop installation separately.
+We call this variant `no-hadoop` Spark distribution. For `with-hadoop` Spark distribution, since
+it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not
+populate Yarn's classpath into Spark. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true.
+For `no-hadoop` Spark distribution, Spark will populate Yarn's classpath by default in order to get Hadoop runtime. Note that some features such
+as Hive support are not available in `no-hadoop` Spark distribution. For `with-hadoop` Spark distribution,

Review comment:
   Maybe I'm wrong, but I got the impression from @dongjoon-hyun that the `no-hadoop` Spark distribution doesn't support Hive.








[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-11 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r439203266



##
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
##
@@ -74,10 +76,11 @@ package object config {
 .doc("Whether to populate Hadoop classpath from `yarn.application.classpath` and " +
   "`mapreduce.application.classpath` Note that if this is set to `false`, it requires " +
   "a `with-Hadoop` Spark distribution that bundles Hadoop runtime or user has to provide " +
-  "a Hadoop installation separately.")
+  "a Hadoop installation separately. By default, for `with-hadoop` Spark distribution, " +
+  "this is set to `false`; for `no-hadoop` distribution, this is set to `true`.")
 .version("2.4.6")
 .booleanConf
-.createWithDefault(true)
+.createWithDefault(isHadoopProvided())

Review comment:
   I'll add it to the migration guide as well.








[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-11 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r439203081



##
File path: docs/running-on-yarn.md
##
@@ -82,6 +82,19 @@ In `cluster` mode, the driver runs on a different machine than the client, so `S
 
 Running Spark on YARN requires a binary distribution of Spark which is built with YARN support.
 Binary distributions can be downloaded from the [downloads page](https://spark.apache.org/downloads.html) of the project website.
+There are two variants of Spark binary distributions you can download. One is pre-built with a certain

Review comment:
   On the downloads page, we use the terminology `pre-built` and `pre-built with user-provided Hadoop`, while in some other places we use the `with-hadoop` and `no-hadoop` distributions. That is why I used both of them here, to make it clear.








[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-11 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r439136561



##
File path: resource-managers/yarn/pom.xml
##
@@ -30,8 +30,18 @@
   <properties>
     <sbt.project.name>yarn</sbt.project.name>
     <jersey-1.version>1.19</jersey-1.version>
+    <spark.yarn.isHadoopProvided>false</spark.yarn.isHadoopProvided>
   </properties>
 
+  <profiles>
+    <profile>
+      <id>hadoop-provided</id>
+      <properties>
+        <spark.yarn.isHadoopProvided>true</spark.yarn.isHadoopProvided>
+      </properties>
+    </profile>
+  </profiles>
+

Review comment:
   The next line of the added doc, `To build Spark yourself, refer to [Building Spark](building-spark.html).`, has the link to the doc you mentioned. I think they should be able to figure it out when they read the Building Spark section.








[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-11 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r439125137



##
File path: resource-managers/yarn/pom.xml
##
@@ -30,8 +30,18 @@
   <properties>
     <sbt.project.name>yarn</sbt.project.name>
     <jersey-1.version>1.19</jersey-1.version>
+    <spark.yarn.isHadoopProvided>false</spark.yarn.isHadoopProvided>
   </properties>
 
+  <profiles>
+    <profile>
+      <id>hadoop-provided</id>
+      <properties>
+        <spark.yarn.isHadoopProvided>true</spark.yarn.isHadoopProvided>
+      </properties>
+    </profile>
+  </profiles>
+

Review comment:
   This profile already exists in the parent POM, so when we build with `hadoop-provided` from the root project, this only adds the property to the YARN module's scope; we don't need this property globally for the entire Spark project.








[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-11 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r439088312



##
File path: docs/running-on-yarn.md
##
@@ -82,6 +82,19 @@ In `cluster` mode, the driver runs on a different machine than the client, so `S
 
 Running Spark on YARN requires a binary distribution of Spark which is built with YARN support.
 Binary distributions can be downloaded from the [downloads page](https://spark.apache.org/downloads.html) of the project website.
+There are two variants of Spark binary distributions you can download. One is pre-built with a certain
+version of Apache Hadoop; this Spark distribution contains built-in Hadoop runtime, so we call it with-hadoop Spark
+distribution. The other one is pre-built with user-provided Hadoop; since this Spark distribution
+doesn't contain built-in Hadoop runtime, so users have to provide a Hadoop installation separately.

Review comment:
   Fixed. Thanks!








[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-10 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r438463484



##
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
##
@@ -393,5 +395,19 @@ package object config {
   private[yarn] val YARN_EXECUTOR_RESOURCE_TYPES_PREFIX = "spark.yarn.executor.resource."
   private[yarn] val YARN_DRIVER_RESOURCE_TYPES_PREFIX = "spark.yarn.driver.resource."
   private[yarn] val YARN_AM_RESOURCE_TYPES_PREFIX = "spark.yarn.am.resource."
-
+  lazy val IS_HADOOP_PROVIDED: Boolean = {
+    val configPath = "org/apache/spark/deploy/yarn/config.properties"
+    val propertyKey = "spark.yarn.isHadoopProvided"
+    try {
+      val prop = new Properties()
+      prop.load(ClassLoader.getSystemClassLoader.getResourceAsStream(configPath))
+      prop.getProperty(propertyKey).toBoolean
+    } catch {
+      case e: Exception =>
+        log.warn(s"Can not load the default value of `$propertyKey` from " +
+          s"$configPath with error, ${e.toString}. Using false as a default value.")
+        false

Review comment:
   When Hadoop is provided, that means our Spark distribution doesn't have built-in Hadoop; thus, we need to populate the Hadoop classpath from YARN. I can add a comment there so people can understand it more easily.
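The fallback pattern in the quoted diff can be exercised standalone. Below is a runnable sketch using plain `java.util.Properties`, reading from a string rather than a classpath resource so it is self-contained; the key name mirrors the diff, while `loadBooleanFlag` and everything else is illustrative, not Spark code.

```scala
import java.io.StringReader
import java.util.Properties

// Load a boolean flag from a properties source, defaulting to false when the
// source is unreadable, the key is absent, or the value is not a boolean --
// the same fallback strategy as the quoted IS_HADOOP_PROVIDED code.
def loadBooleanFlag(source: java.io.Reader, key: String): Boolean =
  try {
    val props = new Properties()
    props.load(source)
    props.getProperty(key).toBoolean // absent key yields null -> exception
  } catch {
    case _: Exception => false
  }

// A build where Maven's hadoop-provided profile filtered in `true`:
val provided = loadBooleanFlag(
  new StringReader("spark.yarn.isHadoopProvided = true"),
  "spark.yarn.isHadoopProvided")

// A build where the property is absent falls back to false:
val fallback = loadBooleanFlag(
  new StringReader("some.other.key = 1"),
  "spark.yarn.isHadoopProvided")

println(provided) // true
println(fallback) // false
```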








[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-10 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r438430412



##
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
##
@@ -77,7 +79,7 @@ package object config {
   "a Hadoop installation separately.")
 .version("2.4.6")
 .booleanConf
-.createWithDefault(true)
+.createWithDefault(IS_HADOOP_PROVIDED)

Review comment:
   Because we have two flavors of builds: for the `with-hadoop` build, it's better not to populate the Hadoop classpath, while for the `no-hadoop` build, we need to populate the Hadoop classpath from YARN. Thus, we cannot have one default config value for both cases.








[GitHub] [spark] dbtsai commented on a change in pull request #28788: [SPARK-31960][Yarn][Build] Only populate Hadoop classpath for no-hadoop build

2020-06-10 Thread GitBox


dbtsai commented on a change in pull request #28788:
URL: https://github.com/apache/spark/pull/28788#discussion_r438429459



##
File path: resource-managers/yarn/src/main/resources/org/apache/spark/deploy/yarn/config.properties
##
@@ -0,0 +1 @@
+spark.yarn.populateHadoopClasspath = ${spark.yarn.populateHadoopClasspath}

Review comment:
   I messed up while cherry-picking from our internal branch to this PR. Fixed. Thanks.




