[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r238524452

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
     }
   }
 
+  protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
+    val tableName1 = "spark_orc1"
+    val tableName2 = "spark_orc2"
+
+    withTempDir { dir =>
+      val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
+      withTable(tableName1, tableName2) {
+        val dataDir = s"${dir.getCanonicalPath}/dir1/"
+        val parentDir = s"${dir.getCanonicalPath}/"
+        val wildCardDir = new File(s"${dir}/*").toURI
+        someDF1.write.orc(dataDir)
+        val parentDirStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName1(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '${parentDir}'""".stripMargin
+        sql(parentDirStatement)
+        val parentDirSqlStatement = s"select * from ${tableName1}"
+        if (isConvertMetastore) {
+          checkAnswer(sql(parentDirSqlStatement), Nil)
+        } else {
+          checkAnswer(sql(parentDirSqlStatement),
+            (1 to 2).map(i => Row(i, i, s"orc$i")))
+        }
+
+        val wildCardStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName2(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '$wildCardDir'""".stripMargin
--- End diff --

Thanks, @kevinyu98. Also, please update the PR title:

```
- [Spark-25993][SQL][TEST]Add test cases for resolution of ORC table location
+ [SPARK-25993][SQL][TEST] Add test cases for CREATE EXTERNAL TABLE with subdirectories
```
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user kevinyu98 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r238469695

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
     }
   }
 
+  protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
+    val tableName1 = "spark_orc1"
+    val tableName2 = "spark_orc2"
+
+    withTempDir { dir =>
+      val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
+      withTable(tableName1, tableName2) {
+        val dataDir = s"${dir.getCanonicalPath}/dir1/"
+        val parentDir = s"${dir.getCanonicalPath}/"
+        val wildCardDir = new File(s"${dir}/*").toURI
+        someDF1.write.orc(dataDir)
+        val parentDirStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName1(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '${parentDir}'""".stripMargin
+        sql(parentDirStatement)
+        val parentDirSqlStatement = s"select * from ${tableName1}"
+        if (isConvertMetastore) {
+          checkAnswer(sql(parentDirSqlStatement), Nil)
+        } else {
+          checkAnswer(sql(parentDirSqlStatement),
+            (1 to 2).map(i => Row(i, i, s"orc$i")))
+        }
+
+        val wildCardStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName2(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '$wildCardDir'""".stripMargin
--- End diff --

@dongjoon-hyun I didn't add three-level subdirectories in this PR; should I? I was thinking of adding the three levels in a follow-up PR. Let me know what you prefer. Thanks.
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user kevinyu98 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r238367919

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
     }
   }
 
+  protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
+    val tableName1 = "spark_orc1"
+    val tableName2 = "spark_orc2"
+
+    withTempDir { dir =>
+      val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
+      withTable(tableName1, tableName2) {
+        val dataDir = s"${dir.getCanonicalPath}/dir1/"
+        val parentDir = s"${dir.getCanonicalPath}/"
+        val wildCardDir = new File(s"${dir}/*").toURI
+        someDF1.write.orc(dataDir)
+        val parentDirStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName1(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '${parentDir}'""".stripMargin
+        sql(parentDirStatement)
+        val parentDirSqlStatement = s"select * from ${tableName1}"
+        if (isConvertMetastore) {
+          checkAnswer(sql(parentDirSqlStatement), Nil)
+        } else {
+          checkAnswer(sql(parentDirSqlStatement),
+            (1 to 2).map(i => Row(i, i, s"orc$i")))
+        }
+
+        val wildCardStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName2(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '$wildCardDir'""".stripMargin
--- End diff --

@dongjoon-hyun Sorry for the delay; I had some issues with my IntelliJ environment. Sure, I will add three-level subdirectories for this PR. FYI, I also tried `convertMetastoreParquet` for Parquet, and the behavior is consistent (see the sketch after this message).

sql("set spark.sql.hive.convertMetastoreParquet = true"), three-level Parquet:
- "/" can only read the current directory
- "/*" can read subdirectory data, but not three-level subdirectories

sql("set spark.sql.hive.convertMetastoreParquet = false")
- "/" can only read the current directory
- "/*" can read subdirectory data, but not three-level subdirectories
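A minimal sketch of the Parquet probe described above, assuming a Hive-enabled spark-shell session; the paths and table names are illustrative, not taken from the PR:

```
import spark.implicits._

// Data sits three directories below the table root.
val df = Seq((1, "a"), (2, "b")).toDF("id", "v").repartition(1)
df.write.parquet("/tmp/pqtab1/dir1/dir2/dir3")

spark.sql("""
  CREATE EXTERNAL TABLE pq_parent(id INT, v STRING)
  STORED AS parquet LOCATION '/tmp/pqtab1/'""")

// Per the observation above, Parquet behaves the same under both settings:
// '/' reads only the table root, and '/*' reads one level down but not deeper.
Seq("true", "false").foreach { flag =>
  spark.sql(s"SET spark.sql.hive.convertMetastoreParquet=$flag")
  spark.sql("SELECT * FROM pq_parent").show()  // expected: empty either way
}
```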
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r237948799

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
     }
   }
 
+  protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
+    val tableName1 = "spark_orc1"
+    val tableName2 = "spark_orc2"
+
+    withTempDir { dir =>
+      val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
+      withTable(tableName1, tableName2) {
+        val dataDir = s"${dir.getCanonicalPath}/dir1/"
+        val parentDir = s"${dir.getCanonicalPath}/"
+        val wildCardDir = new File(s"${dir}/*").toURI
+        someDF1.write.orc(dataDir)
+        val parentDirStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName1(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '${parentDir}'""".stripMargin
+        sql(parentDirStatement)
+        val parentDirSqlStatement = s"select * from ${tableName1}"
+        if (isConvertMetastore) {
+          checkAnswer(sql(parentDirSqlStatement), Nil)
+        } else {
+          checkAnswer(sql(parentDirSqlStatement),
+            (1 to 2).map(i => Row(i, i, s"orc$i")))
+        }
+
+        val wildCardStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName2(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '$wildCardDir'""".stripMargin
--- End diff --

Thank you for investigating. I agree with you on (1). For the test case, please add three-level subdirectories. That will help us improve Spark later. You may file another JIRA issue for that as a new feature.
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user kevinyu98 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r237691564

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
     }
   }
 
+  protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
+    val tableName1 = "spark_orc1"
+    val tableName2 = "spark_orc2"
+
+    withTempDir { dir =>
+      val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
+      withTable(tableName1, tableName2) {
+        val dataDir = s"${dir.getCanonicalPath}/dir1/"
+        val parentDir = s"${dir.getCanonicalPath}/"
+        val wildCardDir = new File(s"${dir}/*").toURI
+        someDF1.write.orc(dataDir)
+        val parentDirStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName1(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '${parentDir}'""".stripMargin
+        sql(parentDirStatement)
+        val parentDirSqlStatement = s"select * from ${tableName1}"
+        if (isConvertMetastore) {
+          checkAnswer(sql(parentDirSqlStatement), Nil)
+        } else {
+          checkAnswer(sql(parentDirSqlStatement),
+            (1 to 2).map(i => Row(i, i, s"orc$i")))
+        }
+
+        val wildCardStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName2(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '$wildCardDir'""".stripMargin
--- End diff --

@dongjoon-hyun Thanks for the suggestions. I tried three-level subdirectories for Parquet/ORC. Here is the result (a sketch of this probe follows below):

sql("set spark.sql.hive.convertMetastoreOrc=true"), three-level directories:

ORC:
- "/*" can read subdirectory data, but not three-level subdirectories
- "/" can only read the current directory

Parquet:
- "/*" can read subdirectory data, but not three-level subdirectories
- "/" can only read the current directory

sql("set spark.sql.hive.convertMetastoreOrc=false")

ORC:
- "/" can read three-level subdirectories
- "/*" can't read any data

Parquet:
- "/" can only read the current directory
- "/*" can read subdirectory data, but not three-level subdirectories

With sql("set spark.sql.hive.convertMetastoreOrc=true"), the ORC and Parquet behavior is consistent.

1. I think this PR is aiming at only one-level subdirectories.
2. Sure, I will add one more for Parquet.
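The ORC side of this matrix can be reproduced with a sketch like the following (a Hive-enabled spark-shell session is assumed; paths and table names are illustrative):

```
import spark.implicits._

// ORC files live three directories below the table root: /tmp/orctab1/dir1/dir2/dir3
val df = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
df.write.orc("/tmp/orctab1/dir1/dir2/dir3")

spark.sql("""
  CREATE EXTERNAL TABLE orc_parent(c1 INT, c2 INT, c3 STRING)
  STORED AS orc LOCATION '/tmp/orctab1/'""")
spark.sql("""
  CREATE EXTERNAL TABLE orc_wild(c1 INT, c2 INT, c3 STRING)
  STORED AS orc LOCATION '/tmp/orctab1/*'""")

// Native ORC reader: '/' sees only files directly under the root, '/*' sees
// one level down -- neither reaches dir1/dir2/dir3, so both come back empty.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("SELECT * FROM orc_parent").show()
spark.sql("SELECT * FROM orc_wild").show()

// Hive SerDe reader: '/' recurses into the three-level subdirectories and
// returns the rows, while '/*' reads nothing.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
spark.sql("SELECT * FROM orc_parent").show()
spark.sql("SELECT * FROM orc_wild").show()
```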
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r237337683

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
     }
   }
 
+  protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
+    val tableName1 = "spark_orc1"
+    val tableName2 = "spark_orc2"
+
+    withTempDir { dir =>
+      val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
+      withTable(tableName1, tableName2) {
+        val dataDir = s"${dir.getCanonicalPath}/dir1/"
+        val parentDir = s"${dir.getCanonicalPath}/"
+        val wildCardDir = new File(s"${dir}/*").toURI
+        someDF1.write.orc(dataDir)
+        val parentDirStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName1(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '${parentDir}'""".stripMargin
+        sql(parentDirStatement)
+        val parentDirSqlStatement = s"select * from ${tableName1}"
+        if (isConvertMetastore) {
+          checkAnswer(sql(parentDirSqlStatement), Nil)
+        } else {
+          checkAnswer(sql(parentDirSqlStatement),
+            (1 to 2).map(i => Row(i, i, s"orc$i")))
+        }
+
+        val wildCardStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName2(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '$wildCardDir'""".stripMargin
--- End diff --

I have two suggestions.

1. Is this PR aiming at only one-level subdirectories? Could you check the behavior on one-, two-, and three-level subdirectories in Parquet Hive tables first?
2. Since the test case looks general for both Parquet/ORC, please add a test case for Parquet while you are here.
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user kevinyu98 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r237272654

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -111,6 +111,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
 
+  - In version 2.3 and earlier, `spark.sql.hive.converMetastoreOrc` default is `false`, if you specify a directory in the `LOCATION` clause in the `CREATE EXTERNAL TABLE STORED AS ORC LOCATION` sql statement, Spark will use the Hive ORC reader to read the data into the table if the directory or sub-directory contains the matching data, if you specify the wild card(*), the Hive ORC reader will not be able to read the data, because it is treating the wild card as a directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, `spark.sql.hive.convertMetaStoreOrc` default is `true`, Spark will use native ORC reader, if you specify the wild card, it will try to read the matching data from current directory and sub-directory, if you specify a directory which does not contain the matching data, native ORC reader will not be able to read, even the data is in the sub-directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab3(...) stored as orc location '/tmp/orctab1/'` will not read the data from sub-directory into the table. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
--- End diff --

Thanks, I will make changes.
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user kevinyu98 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r237272454

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
     }
   }
 
+  protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
+    val tableName1 = "spark_orc1"
+    val tableName2 = "spark_orc2"
+
+    withTempDir { dir =>
+      val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
+      withTable(tableName1, tableName2) {
+        val dataDir = s"${dir.getCanonicalPath}/dir1/"
+        val parentDir = s"${dir.getCanonicalPath}/"
+        val wildCardDir = new File(s"${dir}/*").toURI
+        someDF1.write.orc(dataDir)
+        val parentDirStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName1(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '${parentDir}'""".stripMargin
+        sql(parentDirStatement)
+        val parentDirSqlStatement = s"select * from ${tableName1}"
+        if (isConvertMetastore) {
+          checkAnswer(sql(parentDirSqlStatement), Nil)
+        } else {
+          checkAnswer(sql(parentDirSqlStatement),
+            (1 to 2).map(i => Row(i, i, s"orc$i")))
+        }
+
+        val wildCardStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName2(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '$wildCardDir'""".stripMargin
--- End diff --

@dongjoon-hyun Hello Dongjoon, yes, you are right. It will create a directory whose name is '*', and that is the same behavior as prior to Spark 2.4. I was just following the examples from the JIRA. Do you have any suggestions here? Thanks.
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r236866087

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -111,6 +111,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
 
+  - In version 2.3 and earlier, `spark.sql.hive.converMetastoreOrc` default is `false`, if you specify a directory in the `LOCATION` clause in the `CREATE EXTERNAL TABLE STORED AS ORC LOCATION` sql statement, Spark will use the Hive ORC reader to read the data into the table if the directory or sub-directory contains the matching data, if you specify the wild card(*), the Hive ORC reader will not be able to read the data, because it is treating the wild card as a directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, `spark.sql.hive.convertMetaStoreOrc` default is `true`, Spark will use native ORC reader, if you specify the wild card, it will try to read the matching data from current directory and sub-directory, if you specify a directory which does not contain the matching data, native ORC reader will not be able to read, even the data is in the sub-directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab3(...) stored as orc location '/tmp/orctab1/'` will not read the data from sub-directory into the table. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
--- End diff --

I've read this again. In fact, this is not a new behavior for Spark users, because Apache Spark has used Parquet as its default format since 2.0 and the default behavior of `STORED AS PARQUET` works like this. In order to give the richer context to the users and to avoid irrelevant confusion, we had better merge this part into the line above (line 112). For example, I'd like to update line 112 like the following:

> applied. **In addition, this makes Spark's Hive table behavior more consistent over different formats. For example, for both ORC/Parquet Hive tables, `LOCATION '/table/*'` is required instead of `LOCATION '/table/'` to create an external table reading its direct sub-directories.** To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
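A concrete illustration of the suggested wording, as a sketch (a Hive-enabled spark-shell session is assumed; `/table` and the table names are made up):

```
import spark.implicits._

// Data is written to a direct subdirectory of the table root.
Seq(1, 2, 3).toDF("id").repartition(1).write.orc("/table/part1")

// With the Spark 2.4 default (spark.sql.hive.convertMetastoreOrc=true),
// the wildcard form is required to pick up the direct subdirectory.
spark.sql("CREATE EXTERNAL TABLE t_parent(id INT) STORED AS orc LOCATION '/table/'")
spark.sql("CREATE EXTERNAL TABLE t_wild(id INT) STORED AS orc LOCATION '/table/*'")

spark.sql("SELECT * FROM t_parent").show()  // expected: no rows
spark.sql("SELECT * FROM t_wild").show()    // expected: 1, 2, 3
```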
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r236835472

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
     }
   }
 
+  protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
+    val tableName1 = "spark_orc1"
+    val tableName2 = "spark_orc2"
+
+    withTempDir { dir =>
+      val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
+      withTable(tableName1, tableName2) {
+        val dataDir = s"${dir.getCanonicalPath}/dir1/"
+        val parentDir = s"${dir.getCanonicalPath}/"
+        val wildCardDir = new File(s"${dir}/*").toURI
+        someDF1.write.orc(dataDir)
+        val parentDirStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName1(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '${parentDir}'""".stripMargin
+        sql(parentDirStatement)
+        val parentDirSqlStatement = s"select * from ${tableName1}"
+        if (isConvertMetastore) {
+          checkAnswer(sql(parentDirSqlStatement), Nil)
+        } else {
+          checkAnswer(sql(parentDirSqlStatement),
+            (1 to 2).map(i => Row(i, i, s"orc$i")))
+        }
+
+        val wildCardStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName2(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '$wildCardDir'""".stripMargin
--- End diff --

@kevinyu98. This works, but there is a side effect: it creates an additional directory whose name is literally '*'.
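The side effect is easy to observe directly; a sketch with an illustrative path (a Hive-enabled spark-shell session is assumed):

```
import java.io.File

// Per the comment above, creating an external table over a wildcard LOCATION
// produces a directory literally named '*' under the parent path.
spark.sql("CREATE EXTERNAL TABLE wild_tab(id INT) STORED AS orc LOCATION '/tmp/wildtab/*'")

new File("/tmp/wildtab").listFiles().foreach(f => println(f.getName))
// expected output includes an entry named "*"
```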
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r236797263

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -111,6 +111,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
 
+  - In version 2.3 and earlier, `spark.sql.hive.converMetastoreOrc` default is `false`, if you specify a directory in the `LOCATION` clause in the `CREATE EXTERNAL TABLE STORED AS ORC LOCATION` sql statement, Spark will use the Hive ORC reader to read the data into the table if the directory or sub-directory contains the matching data, if you specify the wild card(*), the Hive ORC reader will not be able to read the data, because it is treating the wild card as a directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, `spark.sql.hive.convertMetaStoreOrc` default is `true`, Spark will use native ORC reader, if you specify the wild card, it will try to read the matching data from current directory and sub-directory, if you specify a directory which does not contain the matching data, native ORC reader will not be able to read, even the data is in the sub-directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab3(...) stored as orc location '/tmp/orctab1/'` will not read the data from sub-directory into the table. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
--- End diff --

`spark.sql.hive.converMetastoreOrc` -> `spark.sql.hive.convertMetastoreOrc`
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user kevinyu98 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r235790938

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala ---
@@ -597,6 +597,38 @@ abstract class OrcQueryTest extends OrcTest {
       assert(m4.contains("Malformed ORC file"))
     }
   }
+
+  test("SPARK-25993 Add test cases for resolution of ORC table location") {
--- End diff --

OK, I will move the test case there. Thanks.
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user kevinyu98 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r235790826

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -111,6 +111,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
 
+  - In version 2.3 and earlier, `spark.sql.hive.converMetastoreOrc` default is `false`, if you specify a directory in the `LOCATION` clause in the `CREATE EXTERNAL TABLE STORED AS ORC LOCATION` sql statement, Spark will use the Hive ORC reader to read the data into the table if the directory or sub-directory contains the matching data, if you specify the wild card(*), the Hive ORC reader will not be able to read the data, because it is treating the wild card as a directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, `spark.sql.hive.convertMetaStoreOrc` default is `true`, Spark will use native ORC reader, it will read the data if you specify the wild card, but will not if you specify the parent directory. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
--- End diff --

Sure.
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r235671486

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala ---
@@ -597,6 +597,38 @@ abstract class OrcQueryTest extends OrcTest {
       assert(m4.contains("Malformed ORC file"))
     }
   }
+
+  test("SPARK-25993 Add test cases for resolution of ORC table location") {
--- End diff --

`HiveOrcSourceSuite.scala` will be the better place. And we had better have the following, covering the behavior of both cases, `true` and `false`:

```
Seq(true, false).foreach { convertMetastore =>
  withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
```
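Spelled out, the suggested wrapper could look like the following sketch in `HiveOrcSourceSuite.scala` (the test name is illustrative; `testORCTableLocation` is the helper from the diff above):

```
import org.apache.spark.sql.hive.HiveUtils

test("SPARK-25993 CREATE EXTERNAL TABLE with subdirectories") {
  Seq(true, false).foreach { convertMetastore =>
    withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
      // exercises both the native ORC reader (true) and Hive SerDe (false)
      testORCTableLocation(convertMetastore)
    }
  }
}
```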
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/23108#discussion_r235670505

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -111,6 +111,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
 
+  - In version 2.3 and earlier, `spark.sql.hive.converMetastoreOrc` default is `false`, if you specify a directory in the `LOCATION` clause in the `CREATE EXTERNAL TABLE STORED AS ORC LOCATION` sql statement, Spark will use the Hive ORC reader to read the data into the table if the directory or sub-directory contains the matching data, if you specify the wild card(*), the Hive ORC reader will not be able to read the data, because it is treating the wild card as a directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, `spark.sql.hive.convertMetaStoreOrc` default is `true`, Spark will use native ORC reader, it will read the data if you specify the wild card, but will not if you specify the parent directory. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
--- End diff --

Could you make `but will not if you specify the parent directory` clearer, with examples like the other sentence?
[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...
GitHub user kevinyu98 opened a pull request:

https://github.com/apache/spark/pull/23108

[Spark-25993][SQL][TEST]Add test cases for resolution of ORC table location

## What changes were proposed in this pull request?

Add test cases for the resolution of the ORC table location reported by [SPARK-25993](https://issues.apache.org/jira/browse/SPARK-25993), and update the `sql-migration-guide-upgrade` doc.

## How was this patch tested?

This is a new test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kevinyu98/spark spark-25993

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23108.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23108

commit 4e45ef90fba26b34bd4d9b575b6bf793d0500fdc
Author: Kevin Yu
Date: 2018-11-21T16:28:41Z

    add test case for orc table location

commit e238764f278883b05d4bc88243facf897d357e84
Author: Kevin Yu
Date: 2018-11-21T19:43:47Z

    doc the change in migration-guide