[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-25 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r460375329



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##
@@ -199,4 +205,25 @@ object OrcUtils extends Logging {
       s"map<${orcTypeDescriptionString(m.keyType)},${orcTypeDescriptionString(m.valueType)}>"
     case _ => dt.catalogString
   }
+
+  /**
+   * @return the result schema string, depending on the canPruneCols flag:
+   *         when canPruneCols is true it is built from resultSchema, and when
+   *         it is false it is built from the actual dataSchema.

Review comment:
   > Shall I raise the PR against the new JIRA or with this same JIRA?
   
   It's okay to refer to this JIRA ticket. Then, please add `[FOLLOWUP]` to the PR title.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-12 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r453376631



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##
@@ -116,47 +116,53 @@ object OrcUtils extends Logging {
   }
 
   /**
-   * Returns the requested column ids from the given ORC file. Column id can be -1, which means the
-   * requested column doesn't exist in the ORC file. Returns None if the given ORC file is empty.
+   * Returns the requested column ids from the given ORC file and Boolean flag to use actual
+   * schema or result schema. Column id can be -1, which means the requested column doesn't
+   * exist in the ORC file. Returns None if the given ORC file is empty.

Review comment:
   Could you add `@return` to describe the return value `(Option[Array[Int]], Boolean)`?
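For illustration, here is the kind of `@return` documentation being requested, attached to a toy stand-in for the method. The object name, parameters, and lookup logic below are assumptions for this sketch, not the real Spark implementation:

```scala
// Toy stand-in for the method under review, carrying the requested @return doc.
object RequestedColumnIdsSketch {

  /**
   * @return a pair `(columnIds, sendActualSchema)`: `columnIds` is `None` when
   *         the ORC file is empty; otherwise it holds one id per requested
   *         column, with -1 marking columns absent from the file.
   *         `sendActualSchema` is true when the physical field names are Hive
   *         placeholders (`_col0`, `_col1`, ...), meaning the caller must pass
   *         the entire dataSchema to the reader instead of the pruned schema.
   */
  def requestedColumnIds(
      requiredFields: Seq[String],
      orcFieldNames: Seq[String]): (Option[Array[Int]], Boolean) = {
    if (orcFieldNames.isEmpty) {
      (None, false) // empty ORC file: nothing to resolve
    } else {
      val sendActualSchema = orcFieldNames.forall(_.startsWith("_col"))
      // indexOf returns -1 for a missing column, matching the contract above.
      val ids = requiredFields.map(f => orcFieldNames.indexOf(f)).toArray
      (Some(ids), sendActualSchema)
    }
  }
}
```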








[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-11 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r453170403



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##
@@ -181,10 +183,19 @@ class OrcFileFormat
   val readerOptions = OrcFile.readerOptions(conf).filesystem(fs)
   val requestedColIdsOrEmptyFile =
     Utils.tryWithResource(OrcFile.createReader(filePath, readerOptions)) { reader =>
+      // For ORC files written by Hive, there are no field names in the
+      // physical schema, so we need to send the entire dataSchema instead
+      // of the required schema.
+      val orcFieldNames = reader.getSchema.getFieldNames.asScala
+      if (orcFieldNames.forall(_.startsWith("_col"))) {

Review comment:
   What does this code mean? Does `_col` need to be hard-coded?
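For background, Hive's ORC writer historically records positional placeholder names (`_col0`, `_col1`, ...) in the file's physical schema instead of the table's real field names, which is what the hard-coded prefix check is detecting. A minimal, self-contained sketch of that check (the helper name is hypothetical, and the empty-schema guard is an addition for the sketch; the diff inlines the `forall` directly):

```scala
// Hypothetical helper making the prefix check explicit: a schema where every
// field name looks like a Hive positional placeholder (_col0, _col1, ...) is
// treated as Hive-written, so column pruning by name cannot work.
object HiveOrcSchemaCheck {
  def isHiveWrittenOrcSchema(orcFieldNames: Seq[String]): Boolean =
    orcFieldNames.nonEmpty && orcFieldNames.forall(_.startsWith("_col"))
}
```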








[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-11 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r453170193



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala
##
@@ -77,4 +77,43 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
   assert(p1.getUTF8String(0) === partitionValues.getUTF8String(0))
 }
   }
+
+  test("SPARK-32234: orc data created by the hive tables having _col fields name") {
+var error: Throwable = null
+withTable("test_date_hive_orc") {
+  spark.sql(
+"""
+  |CREATE TABLE `test_date_hive_orc`
+  | (`_col1` INT,`_col2` STRING,`_col3` INT)
+  |  USING orc
+""".stripMargin)
+  spark.sql(
+"""insert into
+  | test_date_hive_orc
+  |  values(9, '12', 2020)
+""".stripMargin)
+
+  val df = spark.sql("select _col2 from test_date_hive_orc")
+  checkAnswer(df, Row("12"))
+}
+  }
+
+  test("SPARK-32234: orc data created by the spark having proper fields name") {
+withTable("test_date_spark_orc") {
+  spark.sql(
+"""
+  |CREATE TABLE `test_date_spark_orc`
+  | (`d_date_sk` INT,`d_date_id` STRING,`d_year` INT)
+  |  USING orc
+""".stripMargin)
+  spark.sql(
+"""insert into
+  | test_date_spark_orc
+  |  values(9, '12', 2020)
+""".stripMargin)
+
+  val df = spark.sql("select d_date_id from test_date_spark_orc")

Review comment:
   nit: please use uppercase for SQL keywords (e.g., SELECT) where possible








[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-11 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r453170133



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala
##
@@ -77,4 +77,43 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
   assert(p1.getUTF8String(0) === partitionValues.getUTF8String(0))
 }
   }
+
+  test("SPARK-32234: orc data created by the hive tables having _col fields name") {
+var error: Throwable = null
+withTable("test_date_hive_orc") {
+  spark.sql(
+"""
+  |CREATE TABLE `test_date_hive_orc`

Review comment:
   nit: do we need the backquotes?








[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-11 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r453170070



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala
##
@@ -77,4 +77,43 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
   assert(p1.getUTF8String(0) === partitionValues.getUTF8String(0))
 }
   }
+
+  test("SPARK-32234: orc data created by the hive tables having _col fields name") {
+var error: Throwable = null

Review comment:
   please remove this.








[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-10 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r453103909



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala
##
@@ -77,4 +77,64 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
   assert(p1.getUTF8String(0) === partitionValues.getUTF8String(0))
 }
   }
+
+  test("SPARK-32234: orc data created by the hive tables having _col fields name") {
+var error: Throwable = null
+withTable("test_date_hive_orc") {
+  spark.sql(
+"""
+  |CREATE TABLE `test_date_hive_orc`
+  | (`_col1` INT,`_col2` STRING,`_col3` INT)
+  |  USING orc
+""".stripMargin)
+  spark.sql(
+"""insert into
+  | test_date_hive_orc
+  |  values(9, '12', 2020)
+""".stripMargin)
+  try {
+val df = spark.sql("select _col2 from test_date_hive_orc")
+checkAnswer(df, Row("12"))
+  } catch {
+case e: Throwable =>
+  error = e
+  }
+  assert(error == null)

Review comment:
   I think you don't need this check.








[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-10 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r453103816



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala
##
@@ -77,4 +77,64 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
   assert(p1.getUTF8String(0) === partitionValues.getUTF8String(0))
 }
   }
+
+  test("SPARK-32234: orc data created by the hive tables having _col fields name") {
+var error: Throwable = null
+withTable("test_date_hive_orc") {
+  spark.sql(
+"""
+  |CREATE TABLE `test_date_hive_orc`
+  | (`_col1` INT,`_col2` STRING,`_col3` INT)
+  |  USING orc
+""".stripMargin)
+  spark.sql(
+"""insert into
+  | test_date_hive_orc
+  |  values(9, '12', 2020)
+""".stripMargin)
+  try {
+val df = spark.sql("select _col2 from test_date_hive_orc")
+checkAnswer(df, Row("12"))
+  } catch {
+case e: Throwable =>
+  error = e
+  }
+  assert(error == null)
+  spark.sql(
+s"""
+   |DROP TABLE IF

Review comment:
   You don't need to drop the table. Please see the implementation of `withTable`.
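For reference, `withTable` in Spark's SQL test utilities runs the test body and then drops the listed tables in a `finally` block, which is why an explicit `DROP TABLE` at the end of the test is redundant. A simplified, dependency-free sketch of that pattern (the `dropTable` callback stands in for issuing `DROP TABLE IF EXISTS` via `spark.sql`; this is an illustration, not the exact Spark implementation):

```scala
// Simplified sketch of the withTable test helper: run the body, then
// unconditionally drop the tables, even when the body throws.
object WithTableSketch {
  def withTable(dropTable: String => Unit)(tableNames: String*)(f: => Unit): Unit = {
    try f finally {
      // Cleanup always runs, so tests need no manual DROP TABLE statements.
      tableNames.foreach(dropTable)
    }
  }
}
```

With this shape, the trailing `spark.sql("DROP TABLE IF ...")` calls in the quoted test become unnecessary.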









[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-09 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r452623395



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala
##
@@ -77,4 +77,43 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
   assert(p1.getUTF8String(0) === partitionValues.getUTF8String(0))
 }
   }
+
+  test("orc data created by the hive tables having _col fields name") {
+var error: Throwable = null
+val table = """CREATE TABLE `test_date_hive_orc`
+  | (`_col1` INT,`_col2` STRING,`_col3` INT)
+  |  USING orc""".stripMargin
+spark.sql(table).collect
+spark.sql("insert into test_date_hive_orc values(9, '12', 2020)").collect
+val df = spark.sql("select _col2 from test_date_hive_orc")
+try {
+  val data = df.collect()
+  assert(data.length == 1)
+} catch {
+  case e: Throwable =>
+error = e
+}
+assert(error == null)
+spark.sql(s"DROP TABLE IF EXISTS test_date_hive_orc")

Review comment:
   Please check the other tests carefully and follow how tests are written there. How about refactoring it like this?
   ```
   withTable("test_date_hive_orc") {
 spark.sql(
   s"""
  |CREATE TABLE test_date_hive_orc
  |  (col1 INT, col2 STRING, col3 INT)
  |  USING orc
""".stripMargin)
 spark.sql(
   s"""
  |INSERT INTO test_date_hive_orc VALUES
  |  (9, '12', 2020)
""".stripMargin)
   
 val df = spark.sql("SELECT col2 FROM test_date_hive_orc")
 checkAnswer(df, Row(...))
   }
   ```








[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-09 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r452621076



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala
##
@@ -77,4 +77,43 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
   assert(p1.getUTF8String(0) === partitionValues.getUTF8String(0))
 }
   }
+
+  test("orc data created by the hive tables having _col fields name") {

Review comment:
   Please add the prefix `test("SPARK-32234: orc data...`.

##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala
##
@@ -77,4 +77,43 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
   assert(p1.getUTF8String(0) === partitionValues.getUTF8String(0))
 }
   }
+
+  test("orc data created by the hive tables having _col fields name") {
+var error: Throwable = null
+val table = """CREATE TABLE `test_date_hive_orc`
+  | (`_col1` INT,`_col2` STRING,`_col3` INT)
+  |  USING orc""".stripMargin
+spark.sql(table).collect
+spark.sql("insert into test_date_hive_orc values(9, '12', 2020)").collect
+val df = spark.sql("select _col2 from test_date_hive_orc")
+try {
+  val data = df.collect()
+  assert(data.length == 1)
+} catch {
+  case e: Throwable =>
+error = e
+}
+assert(error == null)
+spark.sql(s"DROP TABLE IF EXISTS test_date_hive_orc")
+  }
+
+  test("orc data created by the spark having proper fields name") {

Review comment:
   ditto








[GitHub] [spark] maropu commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables

2020-07-09 Thread GitBox


maropu commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r452620989



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala
##
@@ -77,4 +77,43 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
   assert(p1.getUTF8String(0) === partitionValues.getUTF8String(0))
 }
   }
+
+  test("orc data created by the hive tables having _col fields name") {
+var error: Throwable = null
+val table = """CREATE TABLE `test_date_hive_orc`
+  | (`_col1` INT,`_col2` STRING,`_col3` INT)
+  |  USING orc""".stripMargin
+spark.sql(table).collect
+spark.sql("insert into test_date_hive_orc values(9, '12', 2020)").collect
+val df = spark.sql("select _col2 from test_date_hive_orc")
+try {
+  val data = df.collect()
+  assert(data.length == 1)
+} catch {
+  case e: Throwable =>
+error = e
+}
+assert(error == null)
+spark.sql(s"DROP TABLE IF EXISTS test_date_hive_orc")
+  }
+
+  test("orc data created by the spark having proper fields name") {
+var error: Throwable = null
+val table = """CREATE TABLE `test_date_spark_orc`
+  | (`d_date_sk` INT,`d_date_id` STRING,`d_year` INT)
+  |  USING orc""".stripMargin
+spark.sql(table).collect
+spark.sql("insert into test_date_spark_orc values(9, '12', 2020)").collect
+val df = spark.sql("select d_date_id from test_date_spark_orc")
+try {
+  val data = df.collect()
+  assert(data.length == 1)
+} catch {
+  case e: Throwable =>
+error = e
+}
+assert(error == null)
+spark.sql(s"DROP TABLE IF EXISTS test_date_spark_orc")
+  }
+

Review comment:
   nit: remove this blank line.




