[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-12-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r238524452
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -186,6 +186,54 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
     }
   }
 
+  protected def testORCTableLocation(isConvertMetastore: Boolean): Unit = {
+    val tableName1 = "spark_orc1"
+    val tableName2 = "spark_orc2"
+
+    withTempDir { dir =>
+      val someDF1 = Seq((1, 1, "orc1"), (2, 2, "orc2")).toDF("c1", "c2", "c3").repartition(1)
+      withTable(tableName1, tableName2) {
+        val dataDir = s"${dir.getCanonicalPath}/dir1/"
+        val parentDir = s"${dir.getCanonicalPath}/"
+        val wildCardDir = new File(s"${dir}/*").toURI
+        someDF1.write.orc(dataDir)
+        val parentDirStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName1(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '${parentDir}'""".stripMargin
+        sql(parentDirStatement)
+        val parentDirSqlStatement = s"select * from ${tableName1}"
+        if (isConvertMetastore) {
+          checkAnswer(sql(parentDirSqlStatement), Nil)
+        } else {
+          checkAnswer(sql(parentDirSqlStatement),
+            (1 to 2).map(i => Row(i, i, s"orc$i")))
+        }
+
+        val wildCardStatement =
+          s"""
+             |CREATE EXTERNAL TABLE $tableName2(
+             |  c1 int,
+             |  c2 int,
+             |  c3 string)
+             |STORED AS orc
+             |LOCATION '$wildCardDir'""".stripMargin
--- End diff --

Thanks, @kevinyu98. Also, please update the PR title:
```
- [Spark-25993][SQL][TEST]Add test cases for resolution of ORC table location
+ [SPARK-25993][SQL][TEST] Add test cases for CREATE EXTERNAL TABLE with subdirectories
```


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-12-03 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r238469695
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
[same diff hunk as quoted above]
--- End diff --

@dongjoon-hyun I didn't add three-level subdirectories in this PR; should I? I was thinking of adding the three levels in a follow-up PR. Let me know what you prefer. Thanks.


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-12-03 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r238367919
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
[same diff hunk as quoted above]
--- End diff --

@dongjoon-hyun Sorry for the delay. I had some issues with my IntelliJ environment. Sure, I will add three-level subdirectories for this PR. FYI, I also tried `convertMetastoreParquet` for Parquet, and the behavior is consistent.

With sql("set spark.sql.hive.convertMetastoreParquet = true") and three-level directories:

Parquet:

 - "/"  can only read the current directory
 - "/*" can read subdirectory data, but not three-level subdirectories

With sql("set spark.sql.hive.convertMetastoreParquet = false"):

Parquet:

 - "/"  can only read the current directory
 - "/*" can read subdirectory data, but not three-level subdirectories



---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r237948799
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
[same diff hunk as quoted above]
--- End diff --

Thank you for investigating. I agree with you on (1). For the test case, please add three-level subdirectories; that will help us improve Spark later. You may file another JIRA issue for that as a new feature.


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-29 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r237691564
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
[same diff hunk as quoted above]
--- End diff --

@dongjoon-hyun Thanks for the suggestions. I tried three-level subdirectories for Parquet/ORC. Here is the result:

With sql("set spark.sql.hive.convertMetastoreOrc=true") and three-level directories:

ORC:

 - "/*" can read subdirectory data, but not three-level subdirectories
 - "/"  can only read the current directory

Parquet:

 - "/*" can read subdirectory data, but not three-level subdirectories
 - "/"  can only read the current directory

With sql("set spark.sql.hive.convertMetastoreOrc=false"):

ORC:

 - "/"  can read three-level subdirectories
 - "/*" can't read any data

Parquet:

 - "/"  can only read the current directory
 - "/*" can read subdirectory data, but not three-level subdirectories

With sql("set spark.sql.hive.convertMetastoreOrc=true"), the ORC and Parquet behavior is consistent.
1. I think this PR is aiming at only one-level subdirectories.
2. Sure, I will add one more test for Parquet (see the sketch below).
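A minimal sketch of the probe behind these results, assuming the suite's helpers (`withTempDir`, `withTable`, `sql`, `checkAnswer`, and `toDF` via implicits); the table and directory names are illustrative, not the PR's code:

```
Seq("orc", "parquet").foreach { format =>
  withTempDir { dir =>
    withTable(s"${format}_parent", s"${format}_wild") {
      // Write a single file three directories below the table location.
      val deepDir = s"${dir.getCanonicalPath}/l1/l2/l3/"
      Seq((1, "a")).toDF("id", "v").write.format(format).save(deepDir)
      sql(s"set spark.sql.hive.convertMetastore${format.capitalize}=true")
      sql(s"CREATE EXTERNAL TABLE ${format}_parent(id int, v string) " +
        s"STORED AS $format LOCATION '${dir.getCanonicalPath}/'")
      sql(s"CREATE EXTERNAL TABLE ${format}_wild(id int, v string) " +
        s"STORED AS $format LOCATION '${dir.getCanonicalPath}/*'")
      // "/" sees only files directly under dir; "/*" reads one level down.
      // Neither reaches data three levels deep, so both come back empty.
      checkAnswer(sql(s"SELECT * FROM ${format}_parent"), Nil)
      checkAnswer(sql(s"SELECT * FROM ${format}_wild"), Nil)
    }
  }
}
```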




---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r237337683
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
[same diff hunk as quoted above]
--- End diff --

I have two suggestions.

1. Is this PR aiming at only one-level subdirectories? Could you check the behavior on one-, two-, and three-level subdirectories in Parquet Hive tables first?
2. Since the test case looks general for both Parquet/ORC, please add a test case for Parquet while you are here.


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-28 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r237272654
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -111,6 +111,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
 
+  - In version 2.3 and earlier, `spark.sql.hive.converMetastoreOrc` default is `false`, if you specify a directory in the `LOCATION` clause in the `CREATE EXTERNAL TABLE STORED AS ORC LOCATION` sql statement, Spark will use the Hive ORC reader to read the data into the table if the directory or sub-directory contains the matching data, if you specify the wild card(*), the Hive ORC reader will not be able to read the data, because it is treating the wild card as a directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, `spark.sql.hive.convertMetaStoreOrc` default is `true`, Spark will use native ORC reader, if you specify the wild card, it will try to read the matching data from current directory and sub-directory, if you specify a directory which does not contains the matching data, native ORC reader will not be able to read, even the data is in the sub-directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab3(...) stored as orc location '/tmp/orctab1/'` will not read the data from sub-directory into the table. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
--- End diff --

Thanks, I will make changes.


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-28 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r237272454
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
[same diff hunk as quoted above]
--- End diff --

@dongjoon-hyun Hello Dongjoon, yes, you are right. It will create a directory named '*', which is the same behavior as prior to Spark 2.4. I was just following the examples from the JIRA. Do you have any suggestions here? Thanks.


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-27 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r236866087
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
[same doc text as quoted above]
--- End diff --

I've read this again. In fact, this is not new behavior for Spark users, because Apache Spark has used Parquet as the default format since 2.0 and the default behavior of `STORED AS PARQUET` works like this.

In order to give richer context to the users and to avoid irrelevant confusion, we had better merge this part into the above line (line 112). For example, I'd like to update line 112 like the following.

> applied. **In addition, this makes Spark's Hive table behavior more consistent across different formats. For example, for both ORC/Parquet Hive tables, `LOCATION '/table/*'` is required instead of `LOCATION '/table/'` to create an external table reading its direct sub-directories.** To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
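To make the suggested sentence concrete, here is an illustrative snippet (the paths and table names are hypothetical; it assumes `convertMetastoreOrc=true` and ORC files stored under `/table/dir1/`):

```
sql("CREATE EXTERNAL TABLE t_plain(id int) STORED AS orc LOCATION '/table/'")
sql("CREATE EXTERNAL TABLE t_star(id int) STORED AS orc LOCATION '/table/*'")
sql("SELECT * FROM t_plain").show() // empty: the native reader does not descend into dir1
sql("SELECT * FROM t_star").show()  // returns the rows stored under /table/dir1/
```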


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-27 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r236835472
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
[same diff hunk as quoted above]
--- End diff --

@kevinyu98. This works, but there is a side effect: it creates an additional directory whose name is literally '*'.
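A quick way to see that side effect (a sketch assuming the test's `dir` and `sql` helpers; the table name is illustrative):

```
import java.io.File

sql(s"CREATE EXTERNAL TABLE t_wild(c1 int) STORED AS orc LOCATION '${dir}/*'")
// Per the comment above, the table location is created on disk with the
// wildcard treated as a literal path, so a directory named "*" appears.
assert(new File(dir, "*").exists())
```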


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-27 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r236797263
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
[same doc text as quoted above]
--- End diff --

`spark.sql.hive.converMetastoreOrc` -> `spark.sql.hive.convertMetastoreOrc`


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-22 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r235790938
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala ---
@@ -597,6 +597,38 @@ abstract class OrcQueryTest extends OrcTest {
       assert(m4.contains("Malformed ORC file"))
     }
   }
+
+  test("SPARK-25993 Add test cases for resolution of ORC table location") {
--- End diff --

OK, I will move the test case there. Thanks.


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-22 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r235790826
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -111,6 +111,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
 
+  - In version 2.3 and earlier, `spark.sql.hive.converMetastoreOrc` default is `false`, if you specify a directory in the `LOCATION` clause in the `CREATE EXTERNAL TABLE STORED AS ORC LOCATION` sql statement, Spark will use the Hive ORC reader to read the data into the table if the directory or sub-directory contains the matching data, if you specify the wild card(*), the Hive ORC reader will not be able to read the data, because it is treating the wild card as a directory. For example: ORC data is stored at `/tmp/orctab1/dir1/`, `create external table tab1(...) stored as orc location '/tmp/orctab1/'` will read the data into the table, `create external table tab2(...) stored as orc location '/tmp/orctab1/*'` will not. Since Spark 2.4, `spark.sql.hive.convertMetaStoreOrc` default is `true`, Spark will use native ORC reader, it will read the data if you specify the wild card, but will not if you specify the parent directory. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
--- End diff --

sure.


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-22 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r235671486
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala ---
[same diff hunk as quoted above]
--- End diff --

`HiveOrcSourceSuite.scala` will be the better place. And we had better have the following, to cover the behavior in both cases, `true` and `false`:
```
Seq(true, false).foreach { convertMetastore =>
  withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
```
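Filling that skeleton out, the suite method could look roughly like this (a sketch; it assumes the PR's `testORCTableLocation` helper and the standard `HiveUtils`/`withSQLConf` test utilities, and the test name is illustrative):

```
Seq(true, false).foreach { convertMetastore =>
  test(s"SPARK-25993 ORC table location (convertMetastoreOrc=$convertMetastore)") {
    withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
      // Runs the shared assertions under both reader configurations.
      testORCTableLocation(isConvertMetastore = convertMetastore)
    }
  }
}
```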


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-22 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23108#discussion_r235670505
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
[same doc text as quoted above]
--- End diff --

Could you state `but will not if you specify the parent directory` more clearly, with examples like the other sentence?


---




[GitHub] spark pull request #23108: [Spark-25993][SQL][TEST]Add test cases for resolu...

2018-11-21 Thread kevinyu98
GitHub user kevinyu98 opened a pull request:

https://github.com/apache/spark/pull/23108

[Spark-25993][SQL][TEST]Add test cases for resolution of ORC table location

## What changes were proposed in this pull request?

Add test cases for the resolution of ORC table location reported by [SPARK-25993](https://issues.apache.org/jira/browse/SPARK-25993), and update the `sql-migration-guide-upgrade` doc.

## How was this patch tested?

These are new test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kevinyu98/spark spark-25993

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23108.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23108


commit 4e45ef90fba26b34bd4d9b575b6bf793d0500fdc
Author: Kevin Yu 
Date:   2018-11-21T16:28:41Z

add test case for orc table location

commit e238764f278883b05d4bc88243facf897d357e84
Author: Kevin Yu 
Date:   2018-11-21T19:43:47Z

doc the change in migration-guide




---
