[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155166993

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -587,7 +601,8 @@ object DataSource extends Logging {
     if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
--- End diff --

Yep. I'll remove this, `provider1.toLowerCase(Locale.ROOT) == "orc" ||`.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155165795

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

These changes are pretty safe. In case we move the ORC source to another location, these entries will still refer to the right one.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155165558

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

The `org.apache.spark.sql.execution.datasources.*` path is meant to be private too. But I think it's okay to leave it for practical use cases, with some comments.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155165388

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -587,7 +601,8 @@ object DataSource extends Logging {
     if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
--- End diff --

`provider1` can't be "orc" anymore. It can only be `classOf[OrcFileFormat].getCanonicalName` or `"org.apache.spark.sql.hive.orc.OrcFileFormat"`.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155165377

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

I think we should rename the variable and/or fix the comments there the next time we touch the code around here, though, to prevent confusion.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155164921

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

For Parquet, this is for historical reasons; see #13311. Previously you could use Parquet via `org.apache.spark.sql.execution.datasources.parquet` and `org.apache.spark.sql.execution.datasources.parquet.DefaultSource`, so those entries are also there for backward compatibility. For this new ORC source, that is not the case.
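The backward-compatibility map being discussed can be pictured with a small self-contained sketch. The two map entries below are taken from the diff context above; the object name `CompatSketch`, the helper `resolve`, and the target class-name strings are illustrative stand-ins, not Spark's actual code.

```scala
// Standalone sketch of the backwardCompatibilityMap idea: an old
// fully-qualified provider name is rewritten to the class's current
// location, so `USING <old name>` keeps working after a package move.
object CompatSketch {
  val backwardCompatibilityMap: Map[String, String] = Map(
    "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" ->
      "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat",
    "org.apache.spark.sql.hive.orc.DefaultSource" ->
      "org.apache.spark.sql.hive.orc.OrcFileFormat")

  // Providers with no entry pass through unchanged.
  def resolve(provider: String): String =
    backwardCompatibilityMap.getOrElse(provider, provider)
}
```

This also illustrates viirya's point: the map only matters for names that once resolved to something else, which is why a brand-new package has no strict need for an entry.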
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155163332

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

This is for safety. We also do it for Parquet.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155162795

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

When I received [the advice](https://github.com/apache/spark/pull/19871#discussion_r154580007), I thought it was for consistency.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155162572

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

Like `USING org.apache.spark.sql.hive.orc`, we want to use `USING org.apache.spark.sql.execution.datasources.orc`, don't we?
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155160863

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

Ah, good catch! Sounds like we don't need a compatibility rule for the new ORC source.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155160106

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

This map is for backward compatibility in case we move data sources around. I think this `datasources.orc` is newly added. Why do we need to add the entries here?
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155135243

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -621,4 +621,21 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     makeOrcFile((1 to 10).map(Tuple1.apply), path2)
     assertResult(20)(read.orc(path1.getCanonicalPath, path2.getCanonicalPath).count())
   }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
+    Seq(
+      ("native", classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat]),
+      ("hive", classOf[org.apache.spark.sql.hive.orc.OrcFileFormat])).foreach { case (i, format) =>
--- End diff --

For this one, I will update https://github.com/apache/spark/pull/19882 . I updated it locally and am running some tests.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r155135145

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -587,7 +601,8 @@ object DataSource extends Logging {
     if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
         provider1.startsWith("org.apache.spark.sql.hive.orc")) {
       throw new AnalysisException(
-        "The ORC data source must be used with Hive support enabled")
+        "Hive-based ORC data source must be used with Hive support enabled. " +
+        "Please use native ORC data source instead")
--- End diff --

@HyukjinKwon . For this one, I made https://github.com/apache/spark/pull/19903.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19871
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154881449

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -587,7 +601,8 @@ object DataSource extends Logging {
     if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
         provider1.startsWith("org.apache.spark.sql.hive.orc")) {
       throw new AnalysisException(
-        "The ORC data source must be used with Hive support enabled")
+        "Hive-based ORC data source must be used with Hive support enabled. " +
+        "Please use native ORC data source instead")
--- End diff --

I think we should make this more actionable, saying `spark.sql.orc.impl` should be set to `native` explicitly.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154884137

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -621,4 +621,21 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     makeOrcFile((1 to 10).map(Tuple1.apply), path2)
     assertResult(20)(read.orc(path1.getCanonicalPath, path2.getCanonicalPath).count())
   }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
+    Seq(
+      ("native", classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat]),
+      ("hive", classOf[org.apache.spark.sql.hive.orc.OrcFileFormat])).foreach { case (i, format) =>
--- End diff --

nit: `i` => `orcImpl`
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154865943

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
       "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

And for here, I added the following to prevent `Multiple sources found`. Last time, I missed this approach. My bad.

```
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "hive" =>
+        "org.apache.spark.sql.hive.orc.OrcFileFormat"
```
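The final shape of the dispatch being discussed can be sketched as a plain function with the config value passed in as a string. This is a simplified illustration under that assumption, not the actual `lookupDataSource` signature; the class-name strings and the name `resolveProvider` are illustrative.

```scala
// Simplified sketch of the "orc" dispatch: the short name resolves to the
// sql/core class under "native" and to the sql/hive class under "hive", so
// the ambiguous short name never reaches the ServiceLoader lookup.
object OrcDispatchSketch {
  val nativeOrc = "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
  val hiveOrc = "org.apache.spark.sql.hive.orc.OrcFileFormat"

  def resolveProvider(provider: String, orcImpl: String): String =
    provider match {
      case name if name.equalsIgnoreCase("orc") && orcImpl == "native" => nativeOrc
      case name if name.equalsIgnoreCase("orc") && orcImpl == "hive" => hiveOrc
      case name => name
    }
}
```

With both cases present, every spelling of "orc" maps to exactly one fully-qualified class, while any other provider string passes through unchanged.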
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154864703

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
       "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

Sounds good.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154862662

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
       "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

So, the `The ORC data source must be used with Hive support enabled` error no longer occurs.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154860853

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
       "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

I agree with both of you. Just for explanation: the original design completely preserves the previous behavior. Without the `SQLConf.ORC_IMPLEMENTATION` option, Spark doesn't know about the new OrcFileFormat. So, without Hive support, creating a data source with "orc" will fail as an unknown data source. Anyway, I'm happy to update it according to your advice. :)
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r15480

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
       "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

+1 for ^
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154852231

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
       "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

And also add comments here.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154852193

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
       "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

This sounds counter-intuitive; I think we should register the new ORC source instead of the old one.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154848336

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
       "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

@cloud-fan . To avoid that issue, the new OrcFileFormat is intentionally not registered. @HyukjinKwon 's comment is correct.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154848218

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
       "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

I was looking at the exact same path. It seems not, because it's not registered with the `ServiceLoader` (`src/main/resources/org.apache.spark.sql.sources.DataSourceRegister`). So, the short name for the newer ORC source would not be found.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154847064

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
       "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
+      case name if name.equalsIgnoreCase("orc") &&
+          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+        classOf[OrcFileFormat].getCanonicalName
+      case name => name
--- End diff --

If `ORC_IMPLEMENTATION` is `hive`, we leave the provider as it was, which may be `orc`. Then we will hit the `Multiple sources found` issue, won't we? Both the old and new ORC sources have the same short name, `orc`.
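The failure mode cloud-fan raises can be modeled with a toy version of short-name lookup: resolution is only well-defined when exactly one registered source claims a name. The types and names below (`RegisteredSource`, `lookupByShortName`) are hypothetical, not Spark's `DataSourceRegister` API.

```scala
// Toy model of the "Multiple sources found" issue: if two registered
// sources share the short name "orc", lookup by short name is ambiguous.
final case class RegisteredSource(className: String, shortName: String)

def lookupByShortName(
    registered: Seq[RegisteredSource],
    name: String): Either[String, RegisteredSource] =
  registered.filter(_.shortName.equalsIgnoreCase(name)) match {
    case Seq(single) => Right(single)                       // unambiguous
    case Seq() => Left(s"Failed to find data source: $name")
    case multiple => Left(
      s"Multiple sources found for $name: ${multiple.map(_.className).mkString(", ")}")
  }
```

This is why the thread lands on leaving the new source unregistered and resolving `orc` to a fully-qualified class name before registration is consulted.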
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154756706

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -537,6 +540,7 @@ object DataSource extends Logging {
     val csv = classOf[CSVFileFormat].getCanonicalName
     val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
     val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
+    val newOrc = classOf[OrcFileFormat].getCanonicalName
--- End diff --

Yep. It sounds better.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154755799

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -537,6 +540,7 @@ object DataSource extends Logging {
     val csv = classOf[CSVFileFormat].getCanonicalName
     val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
     val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
+    val newOrc = classOf[OrcFileFormat].getCanonicalName
--- End diff --

Please do not use a name like `newXYZ`. When an even newer one is added, the name will be confusing. How about `nativeOrc`?
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154752978

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")

+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
+      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
+      "more stable and faster.")
--- End diff --

Thanks!
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154752924

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")

+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
+      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
+      "more stable and faster.")
+    .internal()
+    .booleanConf
--- End diff --

Yep.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154752812

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")

+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
--- End diff --

No problem changing to it. But since the name was given by @cloud-fan before, ping @cloud-fan .
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154743731

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")

+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
+      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
+      "more stable and faster.")
--- End diff --

> When true, use the native version of ORC support instead of the ORC library in Hive 1.2.1, which is the default prior to Spark 2.3.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19871#discussion_r154742876

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")

+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
+      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
+      "more stable and faster.")
+    .internal()
+    .booleanConf
--- End diff --

```
.checkValues(Set("hive", "native"))
.createWithDefault("native")
```
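The behavior of the `checkValues` suggestion above can be sketched without Spark's `ConfigBuilder`: a string conf validated against a fixed value set, where an unset value falls back to the default and an out-of-range value fails fast. `StringConf` and `resolve` below are illustrative stand-ins, not Spark's API.

```scala
// Minimal sketch of a value-checked string conf, mirroring
// checkValues(Set("hive", "native")).createWithDefault("native").
final case class StringConf(key: String, allowed: Set[String], default: String) {
  require(allowed.contains(default), s"default '$default' must be one of $allowed")

  def resolve(value: Option[String]): String = value match {
    case Some(v) if allowed.contains(v) => v
    case Some(v) => throw new IllegalArgumentException(
      s"The value of $key should be one of ${allowed.mkString(", ")}, but was $v")
    case None => default
  }
}

val orcImpl = StringConf("spark.sql.orc.impl", Set("hive", "native"), "native")
```

Validating at set time keeps the downstream dispatch simple: code reading the conf only ever sees one of the two sanctioned values.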
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
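Putting the string-valued-config suggestions in this review together, the resulting `SQLConf` entry might look like the following sketch. The value name `ORC_IMPLEMENTATION` and the exact doc text are illustrative guesses based on the review comments, not the merged code:

```scala
val ORC_IMPLEMENTATION = buildConf("spark.sql.orc.impl")
  .doc("When 'native', use the native version of ORC support instead of the ORC library " +
    "in Hive 1.2.1, which is the default prior to Spark 2.3.")
  .internal()
  .stringConf
  .checkValues(Set("hive", "native"))  // replaces the boolean conf in the patch at this point
  .createWithDefault("native")
```

With a string conf, the lookup code can branch on `"native"`/`"hive"` rather than a boolean flag, which leaves room for future implementations.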
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154742713

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
--- End diff --

`spark.sql.orc.impl`

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154709256

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,12 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
--- End diff --

So, are you suggesting `lookupDataSource(provider, useNewOrc=true)`, @jiangxb1987?

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154707683

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -85,7 +87,8 @@ case class DataSource(
   case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String])

-  lazy val providingClass: Class[_] = DataSource.lookupDataSource(className)
+  lazy val providingClass: Class[_] =
+    DataSource.lookupDataSource(sparkSession.sessionState.conf, className)
--- End diff --

Sure!

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154707557

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
+      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
+      "more stable and faster.")
--- End diff --

Thank you for the review, @HyukjinKwon. Do you mean `the Apache ORC library is more stable, but the new OrcFileFormat is not` because it is newly introduced? That's true from Spark's viewpoint, but the new OrcFileFormat contains more bug fixes and new features too. If you don't mind, I'd like to keep this. :)

---
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154640805

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +574,12 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
--- End diff --

After more thought, I don't think it's worth passing the whole SQLConf into this function; we just need to know whether `SQLConf.ORC_USE_NEW_VERSION` is enabled.

---
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154638371

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
+      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
+      "more stable and faster.")
--- End diff --

tiny nit: let's take out `more stable` ...

---
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154638156

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -85,7 +87,8 @@ case class DataSource(
   case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String])

-  lazy val providingClass: Class[_] = DataSource.lookupDataSource(className)
+  lazy val providingClass: Class[_] =
+    DataSource.lookupDataSource(sparkSession.sessionState.conf, className)
--- End diff --

I'd put this conf as the last argument actually, if you wouldn't mind ...

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154583136

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -195,8 +195,18 @@ case class RelationConversions(
         .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
     } else {
       val options = relation.tableMeta.storage.properties
-      sessionCatalog.metastoreCatalog
-        .convertToLogicalRelation(relation, options, classOf[OrcFileFormat], "orc")
+      if (conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
+        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+          relation,
+          options,
+          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
+          "orc")
+      } else {
+        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+          relation,
+          options,
+          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat], "orc")
--- End diff --

Done.

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154582712

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2153,4 +2153,21 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
--- End diff --

Yep.

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154582670

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Thanks. Sure.

---
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154580498

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -195,8 +195,18 @@ case class RelationConversions(
         .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
     } else {
       val options = relation.tableMeta.storage.properties
-      sessionCatalog.metastoreCatalog
-        .convertToLogicalRelation(relation, options, classOf[OrcFileFormat], "orc")
+      if (conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
+        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+          relation,
+          options,
+          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
+          "orc")
+      } else {
+        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+          relation,
+          options,
+          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat], "orc")
--- End diff --

Fix the indentation.

---
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154580323

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2153,4 +2153,21 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
--- End diff --

Move it to `OrcQuerySuite`.

---
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154580007

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Also add the maps for the new ORC format to `backwardCompatibilityMap`.

---
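Concretely, adding the fully-qualified names of the new ORC source to `backwardCompatibilityMap` might look like the following sketch (assuming `nativeOrc` holds the canonical class name of the new `OrcFileFormat`, mirroring how the existing `orc` alias is defined):

```scala
// Sketch: alias the old-style provider names of the native ORC source,
// so they keep resolving to the right class even if it moves later.
"org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
"org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
```

This mirrors the existing entries for the Hive-based ORC source (`org.apache.spark.sql.hive.orc.DefaultSource` and `org.apache.spark.sql.hive.orc`).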
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154577760

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Instead of using `var`, you can use a pattern match.

---
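A `match`-based version of the lookup, along the lines suggested above, might look like this sketch (the rest of the method body is elided; the conf name follows the patch as it stood at this point in the review):

```scala
def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
  // Resolve the provider name without a mutable var: the match either
  // redirects "orc" to the native implementation or keeps the name as-is.
  val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
    case name if name.equalsIgnoreCase("orc") && conf.getConf(SQLConf.ORC_USE_NEW_VERSION) =>
      classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName
    case name => name
  }
  // ... existing class-loading logic using provider1 ...
}
```

The `val` plus pattern match keeps the resolution logic in one expression, which is the idiomatic Scala alternative to reassigning a `var`.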
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154567195

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/DDLSourceLoadSuite.scala ---
@@ -54,11 +55,17 @@ class DDLSourceLoadSuite extends DataSourceTest with SharedSQLContext {
       .load().schema == StructType(Seq(StructField("stringType", StringType, nullable = false))))
   }

-  test("should fail to load ORC without Hive Support") {
-    val e = intercept[AnalysisException] {
-      spark.read.format("orc").load()
+  test("should fail to load ORC only if spark.sql.orc.enabled=false and without Hive Support") {
--- End diff --

Oh, I confused it with [SQLQuerySuite.scala](https://github.com/apache/spark/pull/19871/files#diff-1ea02a6fab84e938582f7f87cc4d9ea1R2157) in sql/hive. Sorry, I'll remove this.

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154565988

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,12 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+    if (Seq("orc").contains(provider1.toLowerCase) && conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
--- End diff --

Oh. Yep.

---
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154565883

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/DDLSourceLoadSuite.scala ---
@@ -54,11 +55,17 @@ class DDLSourceLoadSuite extends DataSourceTest with SharedSQLContext {
       .load().schema == StructType(Seq(StructField("stringType", StringType, nullable = false))))
   }

-  test("should fail to load ORC without Hive Support") {
-    val e = intercept[AnalysisException] {
-      spark.read.format("orc").load()
+  test("should fail to load ORC only if spark.sql.orc.enabled=false and without Hive Support") {
--- End diff --

That test also checks the exception: https://github.com/apache/spark/pull/19871/files#diff-5a2e7f03d14856c8769fd3ddea8742bdR2790

---
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154565755

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,12 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+    if (Seq("orc").contains(provider1.toLowerCase) && conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
--- End diff --

`"orc".equalsIgnoreCase(...)`

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154565348

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+    if (Seq("orc", "org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) &&
+        sparkSession.conf.get(SQLConf.ORC_ENABLED)) {
--- End diff --

It's done.

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154563993

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,11 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")
+  val ORC_ENABLED = buildConf("spark.sql.orc.enabled")
+    .doc("When true, use OrcFileFormat in sql/core module instead of the one in sql/hive module.")
--- End diff --

Yep. I'll elaborate more.

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154563902

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/DDLSourceLoadSuite.scala ---
@@ -54,11 +55,17 @@ class DDLSourceLoadSuite extends DataSourceTest with SharedSQLContext {
       .load().schema == StructType(Seq(StructField("stringType", StringType, nullable = false))))
   }

-  test("should fail to load ORC without Hive Support") {
-    val e = intercept[AnalysisException] {
-      spark.read.format("orc").load()
+  test("should fail to load ORC only if spark.sql.orc.enabled=false and without Hive Support") {
--- End diff --

Hmm, those tests cover different cases.

- In this test: `true` -> use the new OrcFileFormat, `false` -> throw an exception (the existing behavior).
- In that test: `true` -> use the new OrcFileFormat, `false` -> use the old OrcFileFormat (the existing behavior).

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154563532

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+    if (Seq("orc", "org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) &&
--- End diff --

I see.

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154563501

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
--- End diff --

Yep.

---
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154563142

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,11 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")
+  val ORC_ENABLED = buildConf("spark.sql.orc.enabled")
--- End diff --

Sure!

---
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154558750

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+    if (Seq("orc", "org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) &&
+        sparkSession.conf.get(SQLConf.ORC_ENABLED)) {
--- End diff --

Shouldn't we get the conf from `sessionState`?

---
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154558153

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,11 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")
+  val ORC_ENABLED = buildConf("spark.sql.orc.enabled")
+    .doc("When true, use OrcFileFormat in sql/core module instead of the one in sql/hive module.")
--- End diff --

The description should include the major differences between these two ORC versions.

---
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154556299

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/DDLSourceLoadSuite.scala ---
@@ -54,11 +55,17 @@ class DDLSourceLoadSuite extends DataSourceTest with SharedSQLContext {
       .load().schema == StructType(Seq(StructField("stringType", StringType, nullable = false))))
   }

-  test("should fail to load ORC without Hive Support") {
-    val e = intercept[AnalysisException] {
-      spark.read.format("orc").load()
+  test("should fail to load ORC only if spark.sql.orc.enabled=false and without Hive Support") {
--- End diff --

I think this test is replaced by https://github.com/apache/spark/pull/19871/files#diff-5a2e7f03d14856c8769fd3ddea8742bdR2788

---
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154556243

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+    if (Seq("orc", "org.apache.spark.sql.hive.orc.OrcFileFormat").contains(provider1.toLowerCase) &&
--- End diff --

"org.apache.spark.sql.hive.orc.OrcFileFormat" should still point to the old implementation.

---
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154556215

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +570,13 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(sparkSession: SparkSession, provider: String): Class[_] = {
--- End diff --

Instead of passing the `SparkSession`, I think we only need `SQLConf`.

---
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154556132

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,11 @@ object SQLConf {
     .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
     .createWithDefault("snappy")
+  val ORC_ENABLED = buildConf("spark.sql.orc.enabled")
--- End diff --

How about `spark.sql.orc.useNewVersion`? Also, let's make it an internal config and enable it by default.

---
GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/19871

    [SPARK-20728][SQL] Make OrcFileFormat configurable between sql/hive and sql/core

## What changes were proposed in this pull request?

This PR aims to provide a configuration to choose the default `OrcFileFormat` from the legacy `sql/hive` module or the new `sql/core` module. For example, this configuration will affect the following operations.

```scala
spark.read.orc(...)
```

```sql
CREATE TABLE t
USING ORC
...
```

## How was this patch tested?

Pass the Jenkins with new test suites.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark spark-sql-orc-enabled

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19871.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19871

commit 37e240cf12ea7463c2e0ea56501b812e12745869
Author: Dongjoon Hyun
Date: 2017-12-03T23:33:55Z

    [SPARK-20728][SQL] Make ORCFileFormat configurable between sql/hive and sql/core

---
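For reference, toggling between the two implementations from user code might look like the following sketch. The config key shown (`spark.sql.orc.enabled`) is only one of the candidate names discussed in this thread (`spark.sql.orc.useNewVersion` and `spark.sql.orc.impl` were also proposed), so it is an assumption, not the final API:

```scala
// Hypothetical: enable the new sql/core OrcFileFormat for this session.
spark.conf.set("spark.sql.orc.enabled", "true")
val df = spark.read.orc("/path/to/data")

// Hypothetical: fall back to the legacy sql/hive implementation.
spark.conf.set("spark.sql.orc.enabled", "false")
```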