[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155166993
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -587,7 +601,8 @@ object DataSource extends Logging {
 if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
--- End diff --

Yep. I'll remove this, `provider1.toLowerCase(Locale.ROOT) == "orc" ||`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155165795
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
   "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
   "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
   "org.apache.spark.sql.hive.orc" -> orc,
+  "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+  "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

These changes are pretty safe. In case we move the ORC data source to another location, these entries will still refer to the right one.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155165558
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
   "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
   "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
   "org.apache.spark.sql.hive.orc" -> orc,
+  "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+  "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

The `org.apache.spark.sql.execution.datasources.*` path is meant to be private too, but I think it's okay to leave these entries for practical use cases, with some comments.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155165388
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -587,7 +601,8 @@ object DataSource extends Logging {
 if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
--- End diff --

`provider1` can't be "orc" anymore. It can only be `classOf[OrcFileFormat].getCanonicalName` or `"org.apache.spark.sql.hive.orc.OrcFileFormat"`.
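The invariant described here can be sketched as a stand-alone simplification of the lookup (a hedged sketch: `resolveProvider`, the hard-coded class-name strings, and the `orcImpl` parameter are illustrative stand-ins for the real `backwardCompatibilityMap`/`SQLConf` plumbing in `DataSource.scala`):

```scala
object LookupSketch {
  // Hypothetical stand-ins for the class names resolved in DataSource.scala.
  val nativeOrcClass = "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
  val hiveOrcClass = "org.apache.spark.sql.hive.orc.OrcFileFormat"

  // After this match, the short name "orc" is always rewritten to a fully
  // qualified class name, so the resolved provider can never remain "orc".
  def resolveProvider(provider: String, orcImpl: String): String = provider match {
    case name if name.equalsIgnoreCase("orc") && orcImpl == "native" => nativeOrcClass
    case name if name.equalsIgnoreCase("orc") && orcImpl == "hive" => hiveOrcClass
    case name => name
  }

  def main(args: Array[String]): Unit = {
    assert(resolveProvider("orc", "native") == nativeOrcClass)
    assert(resolveProvider("ORC", "hive") == hiveOrcClass)
    // Any name other than the short name "orc" passes through unchanged.
    assert(resolveProvider("json", "native") == "json")
    println("ok")
  }
}
```

Either branch replaces the short name before the class is loaded, which is why only the two fully qualified names can survive the match.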


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155165377
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
   "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
   "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
   "org.apache.spark.sql.hive.orc" -> orc,
+  "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+  "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

I think we should rename the variable and/or fix the comments there the next time we touch the code around there, though, to prevent confusion.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155164921
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
   "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
   "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
   "org.apache.spark.sql.hive.orc" -> orc,
+  "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+  "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

For parquet, this is for historical reasons; see #13311.

Previously you could use parquet via `org.apache.spark.sql.execution.datasources.parquet` and `org.apache.spark.sql.execution.datasources.parquet.DefaultSource`, so it is also for backward compatibility.

For this new ORC source, it is not the same case.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155163332
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
   "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
   "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
   "org.apache.spark.sql.hive.orc" -> orc,
+  "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+  "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

This is for safety. We also do it for parquet.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155162795
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
   "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
   "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
   "org.apache.spark.sql.hive.orc" -> orc,
+  "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+  "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

When I received [the advice](https://github.com/apache/spark/pull/19871#discussion_r154580007), I thought it was for consistency.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155162572
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
   "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
   "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
   "org.apache.spark.sql.hive.orc" -> orc,
+  "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+  "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

Like `USING org.apache.spark.sql.hive.orc`, we want to use `USING 
org.apache.spark.sql.execution.datasources.orc`, don't we?


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155160863
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
   "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
   "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
   "org.apache.spark.sql.hive.orc" -> orc,
+  "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+  "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

Ah, good catch! Sounds like we don't need a compatibility rule for the new ORC source.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155160106
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
   "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
   "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
   "org.apache.spark.sql.hive.orc" -> orc,
+  "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+  "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

This map is for backward compatibility in case we move data sources around. I think this `datasources.orc` is newly added. Why do we need to add these entries here?


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155135243
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -621,4 +621,21 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
       makeOrcFile((1 to 10).map(Tuple1.apply), path2)
       assertResult(20)(read.orc(path1.getCanonicalPath, path2.getCanonicalPath).count())
     }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
+    Seq(
+      ("native", classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat]),
+      ("hive", classOf[org.apache.spark.sql.hive.orc.OrcFileFormat])).foreach { case (i, format) =>
--- End diff --

For this one, I will update https://github.com/apache/spark/pull/19882 .
I updated it in my local branch and am running some tests.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r155135145
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -587,7 +601,8 @@ object DataSource extends Logging {
 if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
     provider1.startsWith("org.apache.spark.sql.hive.orc")) {
   throw new AnalysisException(
-    "The ORC data source must be used with Hive support enabled")
+    "Hive-based ORC data source must be used with Hive support enabled. " +
+      "Please use native ORC data source instead")
--- End diff --

@HyukjinKwon .
For this one, I made https://github.com/apache/spark/pull/19903.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19871


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154881449
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -587,7 +601,8 @@ object DataSource extends Logging {
 if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
     provider1.startsWith("org.apache.spark.sql.hive.orc")) {
   throw new AnalysisException(
-    "The ORC data source must be used with Hive support enabled")
+    "Hive-based ORC data source must be used with Hive support enabled. " +
+      "Please use native ORC data source instead")
--- End diff --

I think we should make this more actionable by saying that `spark.sql.orc.impl` should be set to `native` explicitly.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154884137
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -621,4 +621,21 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
       makeOrcFile((1 to 10).map(Tuple1.apply), path2)
       assertResult(20)(read.orc(path1.getCanonicalPath, path2.getCanonicalPath).count())
     }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
+    Seq(
+      ("native", classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat]),
+      ("hive", classOf[org.apache.spark.sql.hive.orc.OrcFileFormat])).foreach { case (i, format) =>
--- End diff --

nit: `i` => `orcImpl`


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154865943
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

And here, I added the following to prevent the `Multiple sources found` error. Last time, I missed this way. My bad.
```
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "hive" =>
+        "org.apache.spark.sql.hive.orc.OrcFileFormat"
```


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154864703
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

sounds good


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154862662
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

So, there is no more `The ORC data source must be used with Hive support 
enabled`.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154860853
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

I agree with both of you.

Just for explanation: The original design completely preserves the previous 
behavior.
Without `SQLConf.ORC_IMPLEMENTATION` option, Spark doesn't know 
OrcFileFormat. So, in case of non-Hive support, creating data source with "orc" 
will fail with unknown data source.

Anyway, I'm happy to update according to your advice. :)
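The fail-fast behavior being preserved here can be sketched like this (illustrative names only, not the real `DataSource` internals: `lookup`, `hiveSupport`, and the hard-coded class name are assumptions for the sketch):

```scala
object FailFastSketch {
  val hiveOrcClass = "org.apache.spark.sql.hive.orc.OrcFileFormat"

  // Without an ORC_IMPLEMENTATION-style conf, the short name "orc" only ever
  // meant the Hive implementation, so a build without Hive support cannot
  // resolve it at all and must fail fast.
  def lookup(provider: String, hiveSupport: Boolean): Either[String, String] =
    if (provider.equalsIgnoreCase("orc")) {
      if (hiveSupport) Right(hiveOrcClass)
      else Left("The ORC data source must be used with Hive support enabled")
    } else Right(provider)

  def main(args: Array[String]): Unit = {
    assert(lookup("orc", hiveSupport = true) == Right(hiveOrcClass))
    assert(lookup("orc", hiveSupport = false).isLeft) // fails fast without Hive
    println("ok")
  }
}
```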


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r15480
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

+1 for ^ 


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154852231
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

and also add comments here.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154852193
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

This sounds counter-intuitive; I think we should register the new ORC source instead of the old one.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154848336
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

@cloud-fan . To avoid that issue, the new OrcFileFormat is intentionally not registered.
@HyukjinKwon 's comment is correct.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154848218
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

I was looking at the exact same path. It seems not, because it's not registered with `ServiceLoader` (`src/main/resources/org.apache.spark.sql.sources.DataSourceRegister`). So the short name for the newer ORC source would not be found.
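A minimal illustration of the `ServiceLoader` mechanism being discussed (a sketch with a toy `Register` trait, not Spark's actual `DataSourceRegister`): a provider with no `META-INF/services` registration is invisible to short-name lookup, so it can neither be found by the name `orc` nor cause a `Multiple sources found` collision.

```scala
import java.util.ServiceLoader
import scala.jdk.CollectionConverters._

// Toy stand-in for Spark's DataSourceRegister: implementations advertise a
// short name and are discovered through META-INF/services files.
trait Register { def shortName: String }

object ServiceLoaderSketch {
  def findByShortName(name: String): List[Register] =
    ServiceLoader.load(classOf[Register]).asScala
      .filter(_.shortName.equalsIgnoreCase(name)).toList

  def main(args: Array[String]): Unit = {
    // No META-INF/services entry for Register exists in this sketch, so the
    // loader finds nothing -- analogous to the unregistered native ORC source
    // being invisible to (and unable to collide on) short-name lookup.
    assert(findByShortName("orc").isEmpty)
    println("ok")
  }
}
```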


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154847064
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

If `ORC_IMPLEMENTATION` is `hive`, we leave the provider as it was, which may be `orc`. Then we will hit the `Multiple sources found` issue, won't we? Both the old and new ORC sources have the same short name, `orc`.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154756706
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -537,6 +540,7 @@ object DataSource extends Logging {
 val csv = classOf[CSVFileFormat].getCanonicalName
 val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
 val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
+val newOrc = classOf[OrcFileFormat].getCanonicalName
--- End diff --

Yep. It sounds better.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154755799
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -537,6 +540,7 @@ object DataSource extends Logging {
 val csv = classOf[CSVFileFormat].getCanonicalName
 val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
 val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
+val newOrc = classOf[OrcFileFormat].getCanonicalName
--- End diff --

Please do not use a name like `newXYZ`. When an even newer one is added, the name will become confusing.

How about `nativeOrc`?


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154752978
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")
 
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
+      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
+      "more stable and faster.")
--- End diff --

Thanks!


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154752924
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")
 
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
+      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
+      "more stable and faster.")
+    .internal()
+    .booleanConf
--- End diff --

Yep.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154752812
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
--- End diff --

No problem to change to it. But, since the name is given by @cloud-fan 
before, ping @cloud-fan .


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154743731
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+.doc("When true, use new OrcFileFormat in sql/core module instead of 
the one in sql/hive. " +
+  "Since new OrcFileFormat uses Apache ORC library instead of ORC 
library Hive 1.2.1, it is " +
+  "more stable and faster.")
--- End diff --

> When true, use the native version of ORC support instead of the ORC library in Hive 1.2.1, which is the default prior to Spark 2.3.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154742876
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+.doc("When true, use new OrcFileFormat in sql/core module instead of 
the one in sql/hive. " +
+  "Since new OrcFileFormat uses Apache ORC library instead of ORC 
library Hive 1.2.1, it is " +
+  "more stable and faster.")
+.internal()
+.booleanConf
--- End diff --

```
.checkValues(Set("hive", "native"))
.createWithDefault("native")
```
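Putting this string-valued suggestion together with the `spark.sql.orc.impl` rename proposed elsewhere in this thread, the conf definition might look roughly like the following. This is a sketch against SQLConf's internal `buildConf` builder, not necessarily the exact merged code:

```scala
// Sketch only: combines the checkValues/createWithDefault suggestion above
// with the `spark.sql.orc.impl` name proposed in this review.
// Assumes SQLConf's internal `buildConf` helper is in scope.
val ORC_IMPLEMENTATION = buildConf("spark.sql.orc.impl")
  .doc("When native, use the native version of ORC support instead of the ORC " +
    "library in Hive 1.2.1.")
  .internal()
  .stringConf
  .checkValues(Set("hive", "native"))
  .createWithDefault("native")
```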


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154742713
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
--- End diff --

`spark.sql.orc.impl`


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154709256
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,12 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
--- End diff --

So, are you suggesting `lookupDataSource(provider, useNewOrc=true)`, 
@jiangxb1987 ?


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154707683
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -85,7 +87,8 @@ case class DataSource(
 
   case class SourceInfo(name: String, schema: StructType, 
partitionColumns: Seq[String])
 
-  lazy val providingClass: Class[_] = DataSource.lookupDataSource(className)
+  lazy val providingClass: Class[_] =
+DataSource.lookupDataSource(sparkSession.sessionState.conf, className)
--- End diff --

Sure!


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154707557
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+.doc("When true, use new OrcFileFormat in sql/core module instead of 
the one in sql/hive. " +
+  "Since new OrcFileFormat uses Apache ORC library instead of ORC 
library Hive 1.2.1, it is " +
+  "more stable and faster.")
--- End diff --

Thank you for the review, @HyukjinKwon.
Do you mean `the Apache ORC library is more stable, but the new OrcFileFormat is
not` because it's newly introduced?
Actually, that's true from Spark's viewpoint, but the new OrcFileFormat also
contains more bug fixes and new features. If you allow, I'd like to keep
this. :)


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154640805
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,12 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
--- End diff --

After more thought, I don't think it's worth passing the whole SQLConf into
this function; we just need to know whether `SQLConf.ORC_USE_NEW_VERSION` is
enabled.
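A sketch of the narrower signature being suggested, with the flag defaulted so existing call sites keep working. The parameter name is illustrative and the body only indicates where the existing lookup logic would continue:

```scala
// Illustrative sketch only: take just the flag instead of the whole SQLConf.
// A caller would pass conf.getConf(SQLConf.ORC_USE_NEW_VERSION).
def lookupDataSource(provider: String, useNativeOrc: Boolean = true): Class[_] = {
  val resolved = backwardCompatibilityMap.getOrElse(provider, provider)
  val provider1 = if (useNativeOrc && resolved.equalsIgnoreCase("orc")) {
    classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName
  } else {
    resolved
  }
  // ...the existing class-loading logic continues with provider1...
  ???
}
```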


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154638371
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+.doc("When true, use new OrcFileFormat in sql/core module instead of 
the one in sql/hive. " +
+  "Since new OrcFileFormat uses Apache ORC library instead of ORC 
library Hive 1.2.1, it is " +
+  "more stable and faster.")
--- End diff --

tiny nit: let's take out `more stable` ..


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154638156
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -85,7 +87,8 @@ case class DataSource(
 
   case class SourceInfo(name: String, schema: StructType, 
partitionColumns: Seq[String])
 
-  lazy val providingClass: Class[_] = DataSource.lookupDataSource(className)
+  lazy val providingClass: Class[_] =
+DataSource.lookupDataSource(sparkSession.sessionState.conf, className)
--- End diff --

I'd put this conf as the last argument actually if you wouldn't mind .. 


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154583136
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -195,8 +195,18 @@ case class RelationConversions(
 .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
 } else {
   val options = relation.tableMeta.storage.properties
-  sessionCatalog.metastoreCatalog
-.convertToLogicalRelation(relation, options, classOf[OrcFileFormat], "orc")
+  if (conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
+sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+  relation,
+  options,
+  classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
+  "orc")
+  } else {
+sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+  relation,
+  options,
+  classOf[org.apache.spark.sql.hive.orc.OrcFileFormat], "orc")
--- End diff --

Done.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154582712
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala 
---
@@ -2153,4 +2153,21 @@ class SQLQuerySuite extends QueryTest with 
SQLTestUtils with TestHiveSingleton {
   }
 }
   }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and 
sql/core") {
--- End diff --

Yep.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154582670
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Thanks. Sure.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154580498
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -195,8 +195,18 @@ case class RelationConversions(
 .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
 } else {
   val options = relation.tableMeta.storage.properties
-  sessionCatalog.metastoreCatalog
-.convertToLogicalRelation(relation, options, classOf[OrcFileFormat], "orc")
+  if (conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
+sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+  relation,
+  options,
+  classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
+  "orc")
+  } else {
+sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+  relation,
+  options,
+  classOf[org.apache.spark.sql.hive.orc.OrcFileFormat], "orc")
--- End diff --

indents.
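For reference, a correctly indented sketch of that branch (the mailer's line wrapping flattens the diff's indentation above; class names are as in the diff):

```scala
// Sketch of the branch with conventional two-space Scala indentation.
if (conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
  sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    relation,
    options,
    classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    "orc")
} else {
  sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    relation,
    options,
    classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    "orc")
}
```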


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154580323
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala 
---
@@ -2153,4 +2153,21 @@ class SQLQuerySuite extends QueryTest with 
SQLTestUtils with TestHiveSingleton {
   }
 }
   }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and 
sql/core") {
--- End diff --

Move it to `OrcQuerySuite`


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154580007
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Also add the maps for new orc format to `backwardCompatibilityMap`
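A sketch of the kind of entries being asked for, mirroring the existing Parquet/Hive-ORC entries in `backwardCompatibilityMap` (class paths as discussed in this PR):

```scala
// Illustrative: map the native ORC class paths to the sql/core provider so
// that code referencing them keeps resolving even if the class moves later.
val nativeOrc = classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName
Map(
  "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
  "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc)
```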


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154577760
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

instead of using `var`, you can use the pattern match
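A minimal sketch of one possible shape without a `var` (illustrative only, not necessarily the merged code):

```scala
// Sketch: resolve the provider with a match expression instead of
// reassigning a var, remapping "orc" to the native implementation.
val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
  case name if name.equalsIgnoreCase("orc") && conf.getConf(SQLConf.ORC_USE_NEW_VERSION) =>
    classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName
  case name => name
}
```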


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154567195
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/DDLSourceLoadSuite.scala 
---
@@ -54,11 +55,17 @@ class DDLSourceLoadSuite extends DataSourceTest with 
SharedSQLContext {
   .load().schema == StructType(Seq(StructField("stringType", 
StringType, nullable = false
   }
 
-  test("should fail to load ORC without Hive Support") {
-val e = intercept[AnalysisException] {
-  spark.read.format("orc").load()
+  test("should fail to load ORC only if spark.sql.orc.enabled=false and 
without Hive Support") {
--- End diff --

Oh, I confused this with
[SQLQuerySuite.scala](https://github.com/apache/spark/pull/19871/files#diff-1ea02a6fab84e938582f7f87cc4d9ea1R2157)
in sql/hive. Sorry, I'll remove this.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154565988
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +571,12 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+if (Seq("orc").contains(provider1.toLowerCase) && 
conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
--- End diff --

Oh. Yep.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154565883
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/DDLSourceLoadSuite.scala 
---
@@ -54,11 +55,17 @@ class DDLSourceLoadSuite extends DataSourceTest with 
SharedSQLContext {
   .load().schema == StructType(Seq(StructField("stringType", 
StringType, nullable = false
   }
 
-  test("should fail to load ORC without Hive Support") {
-val e = intercept[AnalysisException] {
-  spark.read.format("orc").load()
+  test("should fail to load ORC only if spark.sql.orc.enabled=false and 
without Hive Support") {
--- End diff --

that test also checks the exception:
https://github.com/apache/spark/pull/19871/files#diff-5a2e7f03d14856c8769fd3ddea8742bdR2790


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154565755
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +571,12 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+if (Seq("orc").contains(provider1.toLowerCase) && 
conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
--- End diff --

"orc".equalsIgnoreCast(...)


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154565348
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
+var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+if (Seq("orc", 
"org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) 
&&
+sparkSession.conf.get(SQLConf.ORC_ENABLED)) {
--- End diff --

It's done.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154563993
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,11 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_ENABLED = buildConf("spark.sql.orc.enabled")
+.doc("When true, use OrcFileFormat in sql/core module instead of the 
one in sql/hive module.")
--- End diff --

Yep. I'll elaborate more.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154563902
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/DDLSourceLoadSuite.scala 
---
@@ -54,11 +55,17 @@ class DDLSourceLoadSuite extends DataSourceTest with 
SharedSQLContext {
   .load().schema == StructType(Seq(StructField("stringType", 
StringType, nullable = false
   }
 
-  test("should fail to load ORC without Hive Support") {
-val e = intercept[AnalysisException] {
-  spark.read.format("orc").load()
+  test("should fail to load ORC only if spark.sql.orc.enabled=false and 
without Hive Support") {
--- End diff --

Ur, those tests cover different cases.
- In this test: `true` -> Use new OrcFileFormat, `false` -> Throw Exception 
(the existing behavior)
- In that test: `true` -> Use new OrcFileFormat, `false` -> Use old 
OrcFileFormat (the existing behavior).


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154563532
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): 
Class[_] = {
+var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+if (Seq("orc", 
"org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) 
&&
--- End diff --

I see.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154563501
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
--- End diff --

Yep.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154563142
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,11 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_ENABLED = buildConf("spark.sql.orc.enabled")
--- End diff --

Sure!


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154558750
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
+var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+if (Seq("orc", 
"org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) 
&&
+sparkSession.conf.get(SQLConf.ORC_ENABLED)) {
--- End diff --

Shouldn't we get the conf from `sessionState`?


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154558153
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,11 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_ENABLED = buildConf("spark.sql.orc.enabled")
+.doc("When true, use OrcFileFormat in sql/core module instead of the 
one in sql/hive module.")
--- End diff --

The description should include the major differences between these two ORC versions.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154556299
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/DDLSourceLoadSuite.scala 
---
@@ -54,11 +55,17 @@ class DDLSourceLoadSuite extends DataSourceTest with 
SharedSQLContext {
   .load().schema == StructType(Seq(StructField("stringType", 
StringType, nullable = false
   }
 
-  test("should fail to load ORC without Hive Support") {
-val e = intercept[AnalysisException] {
-  spark.read.format("orc").load()
+  test("should fail to load ORC only if spark.sql.orc.enabled=false and 
without Hive Support") {
--- End diff --

I think this test is replaced by 
https://github.com/apache/spark/pull/19871/files#diff-5a2e7f03d14856c8769fd3ddea8742bdR2788


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154556243
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
+var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+if (Seq("orc", 
"org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) 
&&
--- End diff --

"org.apache.spark.sql.hive.orc.OrcFileFormat" should still point to the old 
implementation


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154556215
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
--- End diff --

instead of passing the `SparkSession`, I think we only need `SQLConf`


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154556132
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,11 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_ENABLED = buildConf("spark.sql.orc.enabled")
--- End diff --

how about `spark.sql.orc.useNewVersion`? Also let's make it an internal 
config and enable it by default.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-03 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/19871

[SPARK-20728][SQL] Make OrcFileFormat configurable between sql/hive and 
sql/core

## What changes were proposed in this pull request?

This PR aims to provide a configuration to choose the default
`OrcFileFormat` between the legacy `sql/hive` module and the new `sql/core` module.

For example, this configuration will affect the following operations.
```scala
spark.read.orc(...)
```
```sql
CREATE TABLE t
USING ORC
...
```
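During review the switch itself was still being named (`spark.sql.orc.enabled`, later `spark.sql.orc.useNewVersion`, then `spark.sql.orc.impl`), so a usage sketch is necessarily tentative:

```scala
// Tentative usage sketch; the conf name was still under discussion in review.
// With this PR's first revision, the boolean toggle selects the sql/core reader:
spark.conf.set("spark.sql.orc.useNewVersion", "true")
spark.read.orc("/path/to/table")  // resolves to the sql/core OrcFileFormat
```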

## How was this patch tested?

Pass the Jenkins with new test suites.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark spark-sql-orc-enabled

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19871.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19871


commit 37e240cf12ea7463c2e0ea56501b812e12745869
Author: Dongjoon Hyun 
Date:   2017-12-03T23:33:55Z

[SPARK-20728][SQL] Make ORCFileFormat configurable between sql/hive and 
sql/core




---
