[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131329214
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -134,6 +135,16 @@ private[hive] object SparkSQLCLIDriver extends Logging {
   // Hive 1.2 + not supported in CLI
   throw new RuntimeException("Remote operations not supported")
 }
+// Respect the configurations set by --hiveconf from the command line
+// (based on Hive's CliDriver).
+val hiveConfFromCmd = sessionState.getOverriddenConfigurations.entrySet().asScala
+val newHiveConf = hiveConfFromCmd.map { kv =>
+  // If the same property is configured by spark.hadoop.xxx, we ignore it and
+  // obey settings from spark properties
+  val k = kv.getKey
+  val v = sys.props.getOrElseUpdate(SPARK_HADOOP_PROP_PREFIX + k, kv.getValue)
--- End diff --

I checked the whole project: `newClientForExecution` is only used at
[HiveThriftServer2.scala#L58](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L58) and
[HiveThriftServer2.scala#L86](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L86).





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131323098
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala ---
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
 propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
 propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")
 
+// Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
+sys.props.foreach { case (key, value) =>
--- End diff --

As I mentioned above, we should not do this for `newClientForExecution`.





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131322795
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -134,6 +135,16 @@ private[hive] object SparkSQLCLIDriver extends Logging {
   // Hive 1.2 + not supported in CLI
   throw new RuntimeException("Remote operations not supported")
 }
+// Respect the configurations set by --hiveconf from the command line
+// (based on Hive's CliDriver).
+val hiveConfFromCmd = sessionState.getOverriddenConfigurations.entrySet().asScala
+val newHiveConf = hiveConfFromCmd.map { kv =>
+  // If the same property is configured by spark.hadoop.xxx, we ignore it and
+  // obey settings from spark properties
+  val k = kv.getKey
+  val v = sys.props.getOrElseUpdate(SPARK_HADOOP_PROP_PREFIX + k, kv.getValue)
--- End diff --

`newClientForExecution` is used for us to read/write Hive SerDe tables.
This is the major concern I have. Let us add another parameter to
`newTemporaryConfiguration`. When `newClientForExecution` is calling
`newTemporaryConfiguration`, we should not get the Hive conf from `sys.props`.
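
For illustration, here is a minimal sketch of that suggestion (the extra parameter name and its default are assumptions, not the actual change):

```scala
// Hypothetical extra flag: callers decide whether spark.hadoop.* system
// properties are copied in. newClientForExecution would pass false.
def newTemporaryConfiguration(
    useInMemoryDerby: Boolean,
    copySparkHadoopProps: Boolean = true): Map[String, String] = {
  val propMap = scala.collection.mutable.HashMap[String, String]()
  // ... existing temporary metastore settings ...
  if (copySparkHadoopProps) {
    sys.props.filterKeys(_.startsWith("spark.hadoop.")).foreach { case (key, value) =>
      propMap.put(key.stripPrefix("spark.hadoop."), value)
    }
  }
  propMap.toMap
}
```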





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131322107
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -50,6 +50,7 @@ private[hive] object SparkSQLCLIDriver extends Logging {
   private val prompt = "spark-sql"
   private val continuedPrompt = "".padTo(prompt.length, ' ')
   private var transport: TSocket = _
+  private final val SPARK_HADOOP_PROP_PREFIX = "spark.hadoop."
--- End diff --

After thinking more, I think we should just consider `spark.hadoop.` in 
this PR, unless we get the other feedbacks from the community. 





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131321807
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -134,6 +135,16 @@ private[hive] object SparkSQLCLIDriver extends Logging {
   // Hive 1.2 + not supported in CLI
   throw new RuntimeException("Remote operations not supported")
 }
+// Respect the configurations set by --hiveconf from the command line
+// (based on Hive's CliDriver).
+val hiveConfFromCmd = sessionState.getOverriddenConfigurations.entrySet().asScala
+val newHiveConf = hiveConfFromCmd.map { kv =>
+  // If the same property is configured by spark.hadoop.xxx, we ignore it and
+  // obey settings from spark properties
+  val k = kv.getKey
+  val v = sys.props.getOrElseUpdate(SPARK_HADOOP_PROP_PREFIX + k, kv.getValue)
--- End diff --

`newClientForExecution` is used ONLY in HiveThriftServer2, where it is used
to get a HiveConf. There is no longer a separate execution Hive client, so IMO this
method could be removed. This activity happens after `SparkSQLEnv.init`, so it is
fine with regard to `spark.hadoop.` properties.

I realize that `--hiveconf` entries should be added to `sys.props` as
`spark.hadoop.xxx` before `SparkSQLEnv.init`.
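
A minimal sketch of the ordering described above (illustrative only, not the exact PR code):

```scala
import scala.collection.JavaConverters._

// Copy --hiveconf overrides into system properties as spark.hadoop.* BEFORE
// SparkSQLEnv.init(), so that SparkConf(loadDefaults = true) picks them up.
def propagateHiveConfToSysProps(overridden: java.util.Map[String, String]): Unit = {
  overridden.asScala.foreach { case (key, value) =>
    // An explicit spark.hadoop.* Spark property wins over the --hiveconf value.
    sys.props.getOrElseUpdate("spark.hadoop." + key, value)
  }
}

// propagateHiveConfToSysProps(sessionState.getOverriddenConfigurations)
// SparkSQLEnv.init()  // builds SparkConf from system properties afterwards
```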





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131320806
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -134,6 +135,16 @@ private[hive] object SparkSQLCLIDriver extends Logging {
   // Hive 1.2 + not supported in CLI
   throw new RuntimeException("Remote operations not supported")
 }
+// Respect the configurations set by --hiveconf from the command line
+// (based on Hive's CliDriver).
+val hiveConfFromCmd = sessionState.getOverriddenConfigurations.entrySet().asScala
+val newHiveConf = hiveConfFromCmd.map { kv =>
+  // If the same property is configured by spark.hadoop.xxx, we ignore it and
+  // obey settings from spark properties
+  val k = kv.getKey
+  val v = sys.props.getOrElseUpdate(SPARK_HADOOP_PROP_PREFIX + k, kv.getValue)
--- End diff --

When we build `SparkConf` in `SparkSQLEnv`, we get the conf from system
properties because `loadDefaults` is set to `true`. That is the way we pass
`--hiveconf` values to `sc.hadoopConfiguration`.
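
A small sketch of that flow, assuming the value has been exported as a JVM system property before the context is built:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// e.g. what the CLI does for --hiveconf hive.exec.scratchdir=/tmp/my-scratch
sys.props("spark.hadoop.hive.exec.scratchdir") = "/tmp/my-scratch"

// loadDefaults = true copies every "spark.*" JVM system property into the conf
val conf = new SparkConf(loadDefaults = true)
  .setMaster("local[1]")
  .setAppName("spark-hadoop-prop-demo")

val sc = new SparkContext(conf)
// Spark strips the "spark.hadoop." prefix when building sc.hadoopConfiguration
assert(sc.hadoopConfiguration.get("hive.exec.scratchdir") == "/tmp/my-scratch")
sc.stop()
```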





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131320240
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala ---
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
 propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
 propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")
 
+// Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
--- End diff --

ok





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131320143
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -157,12 +168,8 @@ private[hive] object SparkSQLCLIDriver extends Logging {
 // Execute -i init files (always in silent mode)
 cli.processInitFiles(sessionState)
 
-// Respect the configurations set by --hiveconf from the command line
-// (based on Hive's CliDriver).
-val it = sessionState.getOverriddenConfigurations.entrySet().iterator()
-while (it.hasNext) {
-  val kv = it.next()
-  SparkSQLEnv.sqlContext.setConf(kv.getKey, kv.getValue)
+newHiveConf.foreach{ kv =>
--- End diff --

thanks





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131320120
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -157,12 +168,8 @@ private[hive] object SparkSQLCLIDriver extends Logging {
 // Execute -i init files (always in silent mode)
 cli.processInitFiles(sessionState)
 
-// Respect the configurations set by --hiveconf from the command line
-// (based on Hive's CliDriver).
-val it = sessionState.getOverriddenConfigurations.entrySet().iterator()
--- End diff --

`--hiveconf abc.def` will be added to system properties as `spark.hadoop.abc.def`
if it does not already exist there, before `SparkSQLEnv.init`.





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131318739
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -157,12 +168,8 @@ private[hive] object SparkSQLCLIDriver extends Logging {
 // Execute -i init files (always in silent mode)
 cli.processInitFiles(sessionState)
 
-// Respect the configurations set by --hiveconf from the command line
-// (based on Hive's CliDriver).
-val it = sessionState.getOverriddenConfigurations.entrySet().iterator()
--- End diff --

What is the reason you moved it to line 140?





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131317729
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -134,6 +135,16 @@ private[hive] object SparkSQLCLIDriver extends Logging {
   // Hive 1.2 + not supported in CLI
   throw new RuntimeException("Remote operations not supported")
 }
+// Respect the configurations set by --hiveconf from the command line
+// (based on Hive's CliDriver).
+val hiveConfFromCmd = sessionState.getOverriddenConfigurations.entrySet().asScala
+val newHiveConf = hiveConfFromCmd.map { kv =>
+  // If the same property is configured by spark.hadoop.xxx, we ignore it and
+  // obey settings from spark properties
+  val k = kv.getKey
+  val v = sys.props.getOrElseUpdate(SPARK_HADOOP_PROP_PREFIX + k, kv.getValue)
--- End diff --

Let me try to summarize the impact of these changes. The [initial call](https://github.com/yaooqinn/spark/blob/5043eb69b41d1d0263e8814da27a934491bc936c/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L86)
of `newTemporaryConfiguration` happens before we set `sys.props`. The
subsequent call of `newTemporaryConfiguration` in `newClientForExecution` will
be used for Hive execution clients. Thus, the changes will affect Hive
execution clients.

Could you check all the code in Spark that uses `sys.props`? Will this
change impact it?





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131315665
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -157,12 +168,8 @@ private[hive] object SparkSQLCLIDriver extends Logging {
 // Execute -i init files (always in silent mode)
 cli.processInitFiles(sessionState)
 
-// Respect the configurations set by --hiveconf from the command line
-// (based on Hive's CliDriver).
-val it = sessionState.getOverriddenConfigurations.entrySet().iterator()
-while (it.hasNext) {
-  val kv = it.next()
-  SparkSQLEnv.sqlContext.setConf(kv.getKey, kv.getValue)
+newHiveConf.foreach{ kv =>
--- End diff --

`foreach{` -> `foreach {`





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131314418
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala ---
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
 propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
 propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")
 
+// Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
--- End diff --

@yaooqinn Please follow what @tejasapatil said and create a util function. 
In addition, `newTemporaryConfiguration` is being used for `SparkSQLCLIDriver`, 
and thus, please update the function description of 
`newTemporaryConfiguration`. 





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131194926
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -50,6 +50,7 @@ private[hive] object SparkSQLCLIDriver extends Logging {
   private val prompt = "spark-sql"
   private val continuedPrompt = "".padTo(prompt.length, ' ')
   private var transport: TSocket = _
+  private final val SPARK_HADOOP_PROP_PREFIX = "spark.hadoop."
--- End diff --

`spark.hadoop.` was tribal knowledge and a sneaky way to stick values
into the Hadoop `Configuration` object (which can later also be passed on to
`HiveConf`; a rough sketch of that path is below). What does `spark.hive.` do? I have
never seen such configs and would like to know.

Keeping that aside, are you proposing to drop that prefix at L145?
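
A rough sketch of the `Configuration`-to-`HiveConf` path mentioned above (assumes Hive classes on the classpath; the property and value are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf

val hadoopConf = new Configuration()
// e.g. populated from a spark.hadoop.* property by Spark
hadoopConf.set("hive.exec.scratchdir", "/tmp/scratch")

// A HiveConf built from a Hadoop Configuration inherits those entries
val hiveConf = new HiveConf(hadoopConf, classOf[HiveConf])
assert(hiveConf.get("hive.exec.scratchdir") == "/tmp/scratch")
```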





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131188713
  
--- Diff: docs/configuration.md ---
@@ -2326,7 +2326,7 @@ from this directory.
 # Inheriting Hadoop Cluster Configuration
 
 If you plan to read and write from HDFS using Spark, there are two Hadoop 
configuration files that
-should be included on Spark's classpath:
+should be included on Spark's class path:
--- End diff --

nit: everywhere in the documentation `classpath` is used, so changing
just one instance will make the doc inconsistent. Let's keep this as it is.





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131187701
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,61 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark applications interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
+configuration files in Spark's class path.
+
+Multiple running applications might require different Hadoop/Hive client 
side configurations.
+You can copy and modify `hdfs-site.xml`, `core-site.xml`, `yarn-site.xml`, 
`hive-site.xml` in
+Spark's class path for each application, but it is not very convenient and 
these
+files are best to be shared with common properties to avoid hard-coding 
certain configurations.
+
+The better choice is to use spark hadoop properties in the form of 
`spark.hadoop.*`. 
+They can be considered as same as normal spark properties which can be set 
in `$SPARK_HOME/conf/spark-defalut.conf`
+
+In some cases, you may want to avoid hard-coding certain configurations in 
a `SparkConf`. For
+instance, Spark allows you to simply create an empty conf and set 
spark/spark hadoop properties.
+
+{% highlight scala %}
+val conf = new SparkConf().set("spark.hadoop.abc.def","xyz")
+val sc = new SparkContext(conf)
+{% endhighlight %}
+
+Also, you can modify or add configurations at runtime:
+{% highlight bash %}
+./bin/spark-submit \ 
+  --name "My app" \ 
+  --master local[4] \  
+  --conf spark.eventLog.enabled=false \ 
+  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps" \ 
+  --conf spark.hadoop.abc.def=xyz \ 
+  myApp.jar
+{% endhighlight %}
+
+## Typical Hadoop/Hive Configurations
--- End diff --

Curious: what's the motive behind having this section? I feel that we
should not get into suggesting these configs, which are external to Spark.





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131185632
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala ---
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
 propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
 propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")
 
+// Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
--- End diff --

Let's move this to a util method so that we know this is done in two places.
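
One possible shape for that shared helper (the name and signature are assumptions, not the merged code):

```scala
import scala.collection.mutable

// Copy "spark.hadoop.foo=bar" entries from Spark/system properties into a
// Hadoop-style key/value map as "foo=bar".
def appendSparkHadoopConfigs(
    srcProps: Iterable[(String, String)],
    dest: mutable.Map[String, String]): Unit = {
  srcProps.foreach {
    case (key, value) if key.startsWith("spark.hadoop.") =>
      dest.put(key.stripPrefix("spark.hadoop."), value)
    case _ => // ignore properties without the prefix
  }
}

// Both call sites could then do: appendSparkHadoopConfigs(sys.props, propMap)
```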





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131095350
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -50,6 +50,7 @@ private[hive] object SparkSQLCLIDriver extends Logging {
   private val prompt = "spark-sql"
   private val continuedPrompt = "".padTo(prompt.length, ' ')
   private var transport: TSocket = _
+  private final val SPARK_HADOOP_PROP_PREFIX = "spark.hadoop."
--- End diff --

good point. I see `spark.hive` in some of my configs





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131094720
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,61 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark applications interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
+configuration files in Spark's class path.
+
+Multiple running applications might require different Hadoop/Hive client 
side configurations.
+You can copy and modify `hdfs-site.xml`, `core-site.xml`, `yarn-site.xml`, 
`hive-site.xml` in
+Spark's class path for each application, but it is not very convenient and 
these
+files are best to be shared with common properties to avoid hard-coding 
certain configurations.
+
+The better choice is to use spark hadoop properties in the form of 
`spark.hadoop.*`. 
+They can be considered as same as normal spark properties which can be set 
in `$SPARK_HOME/conf/spark-defalut.conf`
+
+In some cases, you may want to avoid hard-coding certain configurations in 
a `SparkConf`. For
+instance, Spark allows you to simply create an empty conf and set 
spark/spark hadoop properties.
+
+{% highlight scala %}
+val conf = new SparkConf().set("spark.hadoop.abc.def","xyz")
+val sc = new SparkContext(conf)
+{% endhighlight %}
+
+Also, you can modify or add configurations at runtime:
+{% highlight bash %}
+./bin/spark-submit \ 
+  --name "My app" \ 
+  --master local[4] \  
+  --conf spark.eventLog.enabled=false \ 
+  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps" \ 
+  --conf spark.hadoop.abc.def=xyz \ 
+  myApp.jar
+{% endhighlight %}
+
+## Typical Hadoop/Hive Configurations
+
+
+
+  spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
+  1
+  
+The file output committer algorithm version, valid algorithm version number: 1 or 2.
+Version 2 may have better performance, but version 1 may handle failures better in certain situations,
+as per <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4815">MAPREDUCE-4815</a>.
+  
+
+
+
+  spark.hadoop.fs.hdfs.impl.disable.cache
--- End diff --

This is a pretty dangerous one to point people at, especially since it's
fixed in future Hadoop versions and backported to some distros. The cost of
creating a new HDFS client on every worker can get very expensive if you have a
Spark process with many threads, all fielding work from the same user (thread
pools, IPC connections, and so on).





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131093892
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,61 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark applications interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
+configuration files in Spark's class path.
+
+Multiple running applications might require different Hadoop/Hive client 
side configurations.
+You can copy and modify `hdfs-site.xml`, `core-site.xml`, `yarn-site.xml`, 
`hive-site.xml` in
+Spark's class path for each application, but it is not very convenient and 
these
+files are best to be shared with common properties to avoid hard-coding 
certain configurations.
--- End diff --

"best shared"

You can'd do that anyway on a production Spark on Yarn cluster as if you 
did., lots of other things would break. How about

```
In a Spark cluster running on YARN, these configuration files are set 
cluster-wide, and cannot safely be changed by the application.
```





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131093320
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,61 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark applications interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
--- End diff --

s/applications/application is/





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131074348
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,59 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark Application interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
+configuration files in Spark's ClassPath.
--- End diff --

ok





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131068575
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,59 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark Application interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
+configuration files in Spark's ClassPath.
+
+In most cases, you may have more than one applications running and rely on 
some different Hadoop/Hive
--- End diff --

OK





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131068501
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,59 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark Application interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
+configuration files in Spark's ClassPath.
+
+In most cases, you may have more than one applications running and rely on 
some different Hadoop/Hive
+client side configurations. You can copy and modify `hdfs-site.xml`, 
`core-site.xml`, `yarn-site.xml`,
+`hive-site.xml` in Spark's ClassPath for each application, but it is not 
very convenient and these
+files are best to be shared with common properties to avoid hard-coding 
certain configurations.
+
+The better choice is to use spark hadoop properties in the form of 
`spark.hadoop.*`. 
+They can be considered as same as normal spark properties which can be set 
in `$SPARK_HOME/conf/spark-defalut.conf`
+
+In some cases, you may want to avoid hard-coding certain configurations in 
a `SparkConf`. For
+instance. Spark allows you to simply create an empty conf and set 
spark/spark hadoop properties.
+
+{% highlight scala %}
+val conf = new SparkConf().set("spark.hadoop.abc.def","xyz")
+val sc = new SparkContext(conf)
+{% endhighlight %}
+
+Also, you can modify or add configurations at runtime:
+{% highlight bash %}
+./bin/spark-submit \ 
+  --name "My app" \ 
+  --master local[4] \  
+  --conf spark.eventLog.enabled=false \ 
+  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps" \ 
+  --conf spark.hadoop.abc.def=xyz
+  myApp.jar
+{% endhighlight %}
+
+## Typical Hadoop/Hive Configurations
+
+
+
+  spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
+  1
+  
+The file output committer algorithm version, valid algorithm version number: 1 or 2.
+Version 2 may have better performance, but version 1 may handle failures better in certain situations,
+as per <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4815">MAPREDUCE-4815</a>.
+  
+
+
+
+  spark.hadoop.fs.hdfs.impl.disable.cache
+  false
+  
+Don't cache 'hdfs' filesystem instances. Set true if HDFS Token Expiry in long-running spark applicaitons.<a href="https://issues.apache.org/jira/browse/HDFS-9276">HDFS-9276</a>.
--- End diff --

@gatorsmile I guess `fs.hdfs.impl.disable.cache=true` means disabling the caching of
DFS client instances, not tokens; `FileSystem.get()` will always create a new one.
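
A small sketch of what the flag controls in the Hadoop `FileSystem` API, shown with the local filesystem so it runs anywhere; `fs.hdfs.impl.disable.cache` behaves the same way for the `hdfs` scheme:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val conf = new Configuration()
val uri = new URI("file:///")

// Default: FileSystem.get() returns one cached client per (scheme, authority, user)
val a = FileSystem.get(uri, conf)
val b = FileSystem.get(uri, conf)
assert(a eq b)

// With the cache disabled, every call builds a fresh FileSystem instance
conf.setBoolean("fs.file.impl.disable.cache", true)
val c = FileSystem.get(uri, conf)
val d = FileSystem.get(uri, conf)
assert(!(c eq d))
```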





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131060237
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,59 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark Application interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
+configuration files in Spark's ClassPath.
+
+In most cases, you may have more than one applications running and rely on 
some different Hadoop/Hive
--- End diff --

`In most cases, you may have more than one applications running and rely on 
some different Hadoop/Hive`
->
`Multiple running applications might require different Hadoop/Hive client 
side configurations.`





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131059952
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,59 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark Application interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
+configuration files in Spark's ClassPath.
--- End diff --

`ClassPath ` -> `class path`





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131059673
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,59 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark Application interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
--- End diff --

`Application ` -> `applications`





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131059429
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,59 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
 to a location containing the configuration files.
+
+# Custom Hadoop/Hive Configuration
+
+If your Spark Application interacting with Hadoop, Hive, or both, there 
are probably Hadoop/Hive
+configuration files in Spark's ClassPath.
+
+In most cases, you may have more than one applications running and rely on 
some different Hadoop/Hive
+client side configurations. You can copy and modify `hdfs-site.xml`, 
`core-site.xml`, `yarn-site.xml`,
+`hive-site.xml` in Spark's ClassPath for each application, but it is not 
very convenient and these
+files are best to be shared with common properties to avoid hard-coding 
certain configurations.
+
+The better choice is to use spark hadoop properties in the form of 
`spark.hadoop.*`. 
+They can be considered as same as normal spark properties which can be set 
in `$SPARK_HOME/conf/spark-defalut.conf`
+
+In some cases, you may want to avoid hard-coding certain configurations in 
a `SparkConf`. For
+instance. Spark allows you to simply create an empty conf and set 
spark/spark hadoop properties.
+
+{% highlight scala %}
+val conf = new SparkConf().set("spark.hadoop.abc.def","xyz")
+val sc = new SparkContext(conf)
+{% endhighlight %}
+
+Also, you can modify or add configurations at runtime:
+{% highlight bash %}
+./bin/spark-submit \ 
+  --name "My app" \ 
+  --master local[4] \  
+  --conf spark.eventLog.enabled=false \ 
+  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps" \ 
+  --conf spark.hadoop.abc.def=xyz
+  myApp.jar
+{% endhighlight %}
+
+## Typical Hadoop/Hive Configurations
+
+
+
+  spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
+  1
+  
+The file output committer algorithm version, valid algorithm version number: 1 or 2.
+Version 2 may have better performance, but version 1 may handle failures better in certain situations,
+as per <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4815">MAPREDUCE-4815</a>.
+  
+
+
+
+  spark.hadoop.fs.hdfs.impl.disable.cache
+  false
+  
+Don't cache 'hdfs' filesystem instances. Set true if HDFS Token Expiry in long-running spark applicaitons.<a href="https://issues.apache.org/jira/browse/HDFS-9276">HDFS-9276</a>.
--- End diff --

`When true, HDFS instances do not cache delegation tokens. With the cached 
tokens, HDFS delegation token updates might fail in long-running Spark 
applications.`





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r131057644
  
--- Diff: docs/configuration.md ---
@@ -2335,5 +2335,59 @@ The location of these configuration files varies 
across Hadoop versions, but
 a common location is inside of `/etc/hadoop/conf`. Some tools create
 configurations on-the-fly, but offer a mechanisms to download copies of 
them.
 
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/spark-env.sh`
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in 
`$SPARK_HOME/conf/spark-env.sh`
--- End diff --

@zsxwing @liancheng Could you please take a look at the documentation?
Is anything missing or inaccurate?





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-02 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r130792745
  
--- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala ---
@@ -283,4 +283,17 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging {
   "SET conf3;" -> "conftest"
 )
   }
+
+  test("SPARK-21451: spark.sql.warehouse.dir should respect options in --hiveconf") {
+runCliWithin(1.minute)("set spark.sql.warehouse.dir;" -> warehousePath.getAbsolutePath)
+  }
+
+  test("SPARK-21451: Apply spark.hadoop.* configurations") {
--- End diff --

Yes, after `sc` is initialized, `spark.hadoop.hive.metastore.warehouse.dir` will
be translated into the Hadoop conf `hive.metastore.warehouse.dir`, which serves as an
alternative warehouse dir. This test case couldn't tell whether this PR works; CliSuite
may not see these values unless we explicitly set them in SQLConf.

The original code did break another test case anyway.





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r130536572
  
--- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala ---
@@ -50,6 +50,7 @@ private[hive] object SparkSQLCLIDriver extends Logging {
   private val prompt = "spark-sql"
   private val continuedPrompt = "".padTo(prompt.length, ' ')
   private var transport: TSocket = _
+  private final val SPARK_HADOOP_PROP_PREFIX = "spark.hadoop."
--- End diff --

Just a question, why the prefix has to be `spark.hadoop.`?

See the related PR: https://github.com/apache/spark/pull/2379





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r130534640
  
--- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala ---
@@ -283,4 +283,17 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging {
   "SET conf3;" -> "conftest"
 )
   }
+
+  test("SPARK-21451: spark.sql.warehouse.dir should respect options in --hiveconf") {
+runCliWithin(1.minute)("set spark.sql.warehouse.dir;" -> warehousePath.getAbsolutePath)
+  }
+
+  test("SPARK-21451: Apply spark.hadoop.* configurations") {
--- End diff --

Without the fix, this test case can still succeed.





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-07-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r130029786
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveUtilsSuite.scala ---
@@ -33,4 +33,13 @@ class HiveUtilsSuite extends QueryTest with SQLTestUtils with TestHiveSingleton
   assert(conf(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname) === "")
 }
   }
+
+  test("newTemporaryConfiguration respect spark.hadoop.foo=bar in SparkConf") {
+sys.props.put("spark.hadoop.foo", "bar")
+Seq(true, false) foreach { useInMemoryDerby =>
+  val hiveConf = HiveUtils.newTemporaryConfiguration(useInMemoryDerby)
+  intercept[NoSuchElementException](hiveConf("spark.hadoop.foo") === "bar")
--- End diff --

nit: assert(!hiveConf.contains("spark.hadoop.foo"))





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-07-27 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r129998159
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveUtilsSuite.scala ---
@@ -33,4 +33,13 @@ class HiveUtilsSuite extends QueryTest with SQLTestUtils with TestHiveSingleton
   assert(conf(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname) === "")
 }
   }
+
+  test("newTemporaryConfiguration respect spark.hadoop.foo=bar in SparkConf") {
+sys.props.put("spark.hadoop.foo", "bar")
--- End diff --

@cloud-fan At the very beginning, spark-submit does the same thing: it adds
properties from `--conf` and `spark-defaults.conf` to `sys.props`.





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-07-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r129774912
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveUtilsSuite.scala ---
@@ -33,4 +33,13 @@ class HiveUtilsSuite extends QueryTest with SQLTestUtils with TestHiveSingleton
       assert(conf(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname) === "")
     }
   }
+
+  test("newTemporaryConfiguration respect spark.hadoop.foo=bar in SparkConf") {
+    sys.props.put("spark.hadoop.foo", "bar")
--- End diff --

The test says we should respect the Hadoop conf in `SparkConf`, but why do we handle system properties?





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-07-19 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r128186557
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala ---
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
     propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
     propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")
 
+    // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
--- End diff --

check 

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L102
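
That helper copies `spark.hadoop.*` entries from the `SparkConf` into a Hadoop `Configuration` with the prefix stripped, roughly along these lines (a paraphrased sketch, not the exact source; the method name here is only illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Paraphrased sketch of the behavior referenced above: copy any
// "spark.hadoop.foo=bar" entry into the Hadoop conf as "foo=bar".
def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
  conf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.substring("spark.hadoop.".length), value)
    }
  }
}
```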





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-07-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r128184486
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala ---
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
     propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
     propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")
 
+    // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
--- End diff --

Do we have documentation saying that `spark.hadoop.xxx` is supported, or are you proposing a new feature?





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-07-19 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r128170401
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala ---
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
     propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
     propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")
 
+    // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
--- End diff --

If we run `bin/spark-sql --conf spark.hadoop.hive.exec.scratchdir=/some/dir`, or set it in spark-defaults.conf, SessionState.start(cliSessionState) will not use this dir but the default.
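
One way to address this, sketched here only to illustrate the idea (the helper name and the `HiveConf` parameter are assumptions, not the patch itself), is to fold the `spark.hadoop.*` system properties into the HiveConf before `SessionState.start` is called:

```scala
import org.apache.hadoop.hive.conf.HiveConf

// Illustrative sketch only: propagate spark.hadoop.* system properties into
// the HiveConf used by the CLI, so that e.g. hive.exec.scratchdir set via
// --conf or spark-defaults.conf takes effect when SessionState starts.
def applySparkHadoopProps(hiveConf: HiveConf): Unit = {
  val prefix = "spark.hadoop."
  sys.props.foreach { case (key, value) =>
    if (key.startsWith(prefix)) {
      hiveConf.set(key.stripPrefix(prefix), value)
    }
  }
}
```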





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-07-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r128143343
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala ---
@@ -404,6 +404,13 @@ private[spark] object HiveUtils extends Logging {
     propMap.put(ConfVars.METASTORE_EVENT_LISTENERS.varname, "")
     propMap.put(ConfVars.METASTORE_END_FUNCTION_LISTENERS.varname, "")
 
+    // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
--- End diff --

Why should we do this?





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-07-18 Thread yaooqinn
GitHub user yaooqinn opened a pull request:

https://github.com/apache/spark/pull/18668

[SPARK-21451][SQL]get `spark.hadoop.*` properties from sysProps to hiveconf 



## What changes were proposed in this pull request?

Get `spark.hadoop.*` properties from system properties (sysProps) and apply them to the HiveConf, with the `spark.hadoop.` prefix stripped.
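
The core transformation is small: a prefixed system property becomes a plain Hadoop/Hive setting. A minimal sketch of the intent (illustrative, not the exact diff):

```scala
// Illustrative sketch: strip the spark.hadoop. prefix from system properties
// so they can be handed to the Hive configuration as ordinary settings.
val hadoopProps: Map[String, String] = sys.props.toMap.collect {
  case (key, value) if key.startsWith("spark.hadoop.") =>
    key.stripPrefix("spark.hadoop.") -> value
}
// e.g. -Dspark.hadoop.hive.exec.scratchdir=/some/dir
//      yields Map("hive.exec.scratchdir" -> "/some/dir")
```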

## How was this patch tested?
UT

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yaooqinn/spark SPARK-21451

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18668.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18668


commit 89d9b86616196fde5d0b3a08fb284e6af6afe588
Author: Kent Yao 
Date:   2017-07-18T06:41:24Z

HiveConf in SparkSQLCLIDriver doesn't respect spark.hadoop.some.hive.variables



