[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-22 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r977797330


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -66,6 +66,10 @@ import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
 import org.apache.spark.util.{CircularBuffer, Utils}
 
+private[hive] class CatalogAndHiveTableImpl(

Review Comment:
   does this really help? We need to cast an `Object` to `HiveTable` anyway.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-22 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r977795842


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -435,6 +439,14 @@ private[hive] class HiveClientImpl(
     getRawTableOption(dbName, tableName).map(convertHiveTableToCatalogTable)
   }
 
+  override def getCatalogAndHiveTableOption(
+      dbName: String,
+      tableName: String): Option[CatalogAndHiveTable] = withHiveState {
+    logDebug(s"Looking up $dbName.$tableName")
+    getRawTableOption(dbName, tableName)
+      .map(t => new CatalogAndHiveTableImpl(convertHiveTableToCatalogTable(t), t))

Review Comment:
   I think this is still sub-optimal. Sometimes we just need the raw hive table, and this adds an extra `convertHiveTableToCatalogTable` invocation. How about we make it lazy?
   ```
   class RawHiveTable(val rawTable: Object) {
     def toSparkTable: CatalogTable = ...
   }
   ```
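   For illustration, a minimal self-contained sketch of the lazy wrapper suggested above. `CatalogTable` and the converter function here are stand-ins for the real Spark and Hive types (`org.apache.spark.sql.catalyst.catalog.CatalogTable` and Hive's `Table`); the names are hypothetical.
   ```scala
   // Hypothetical stand-in for Spark's CatalogTable.
   final case class CatalogTable(identifier: String, schema: Seq[String])

   class RawHiveTable(val rawTable: Object, convert: Object => CatalogTable) {
     // `lazy val` runs the conversion at most once, and only if a caller
     // actually asks for the Spark-side representation.
     lazy val toSparkTable: CatalogTable = convert(rawTable)
   }

   // Callers that only need the raw Hive table never pay for the conversion:
   var conversions = 0
   val wrapped = new RawHiveTable(
     "raw-hms-table",
     { raw => conversions += 1; CatalogTable(raw.toString, Seq("c", "b_c")) })
   val raw = wrapped.rawTable         // no conversion has happened yet
   val spark1 = wrapped.toSparkTable  // converted on first use
   val spark2 = wrapped.toSparkTable  // cached thereafter
   ```
   With this shape, `HiveClient.getTableOption`-style callers can keep returning the converted table, while callers that only forward the raw table to Hive APIs skip the conversion entirely.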






[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-20 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r976029766


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1026,7 +1026,14 @@ private[hive] object HiveClientImpl extends Logging {
     } else {
       CharVarcharUtils.getRawTypeString(c.metadata).getOrElse(c.dataType.catalogString)
     }
-    new FieldSchema(c.name, typeString, c.getComment().orNull)
+    val name = if (lowerCase) {
+      // scalastyle:off caselocale
+      c.name.toLowerCase
+      // scalastyle:on caselocale
+    } else {
+      c.name

Review Comment:
   We can use `Object` to store the raw hive table, so that we don't need to 
expose the Hive classes.






[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-20 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r975330647


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1026,7 +1026,14 @@ private[hive] object HiveClientImpl extends Logging {
     } else {
       CharVarcharUtils.getRawTypeString(c.metadata).getOrElse(c.dataType.catalogString)
     }
-    new FieldSchema(c.name, typeString, c.getComment().orNull)
+    val name = if (lowerCase) {
+      // scalastyle:off caselocale
+      c.name.toLowerCase
+      // scalastyle:on caselocale
+    } else {
+      c.name

Review Comment:
   Looking at the related code again, I feel it's unnecessary to have this 
`hive table -> spark table -> hive table` roundtrip when we need to pass the 
raw table to call `HiveClient` APIs. Shall we expose a `getHiveTable` function 
in `HiveClient` and let `HiveExternalCatalog` call it? This avoids the `spark 
table -> hive table` conversion and we should be able to fix this bug.






[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-20 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r975321833


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1026,7 +1026,14 @@ private[hive] object HiveClientImpl extends Logging {
     } else {
       CharVarcharUtils.getRawTypeString(c.metadata).getOrElse(c.dataType.catalogString)
     }
-    new FieldSchema(c.name, typeString, c.getComment().orNull)
+    val name = if (lowerCase) {
+      // scalastyle:off caselocale
+      c.name.toLowerCase
+      // scalastyle:on caselocale
+    } else {
+      c.name

Review Comment:
   Ah I see. This is very tricky. Although HMS is not case-preserving, we still need to pass the original column names to Hive, as this may affect the column names in the schema of the physical parquet files.
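   A small illustration of why the original case matters. If HMS lower-cases a column name but the physical parquet footer keeps the name as it was written, a case-sensitive lookup by the stored name misses the column. The schema below is a made-up example, not taken from the PR.
   ```scala
   val parquetFooterNames = Seq("c", "B_C")  // names as written into the file
   val metastoreName = "b_c"                 // HMS stores the lower-cased name

   // A case-sensitive match by the metastore name fails...
   val caseSensitiveHit = parquetFooterNames.contains(metastoreName)

   // ...while a case-insensitive match still resolves the column.
   val caseInsensitiveHit =
     parquetFooterNames.exists(_.equalsIgnoreCase(metastoreName))
   ```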






[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-20 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r974926525


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1026,7 +1026,14 @@ private[hive] object HiveClientImpl extends Logging {
     } else {
       CharVarcharUtils.getRawTypeString(c.metadata).getOrElse(c.dataType.catalogString)
     }
-    new FieldSchema(c.name, typeString, c.getComment().orNull)
+    val name = if (lowerCase) {
+      // scalastyle:off caselocale
+      c.name.toLowerCase
+      // scalastyle:on caselocale
+    } else {
+      c.name

Review Comment:
   when shall we not lower-case it?






[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-13 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r969667603


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1095,7 +1095,11 @@ private[hive] object HiveClientImpl extends Logging {
     table.bucketSpec match {
       case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
         hiveTable.setNumBuckets(bucketSpec.numBuckets)
-        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)

Review Comment:
   then let's lower-case the normal column names as well? Just don't trust the Hive metastore and normalize all names on the Spark side.






[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-09 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r966824596


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1095,7 +1095,11 @@ private[hive] object HiveClientImpl extends Logging {
     table.bucketSpec match {
       case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
         hiveTable.setNumBuckets(bucketSpec.numBuckets)
-        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)

Review Comment:
   I can't recall why we need to pass `rawTable` in some cases... Let's go in a different direction: Spark assumes that HMS always lower-cases table/column names, and it's a bit tricky if the bucket spec does not lower-case column names. Let's help Hive fix it in https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1114-L1118 by adding a `.map(_.toLowerCase)`.
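   A minimal sketch of that normalization. `BucketSpec` here is a stand-in for `org.apache.spark.sql.catalyst.catalog.BucketSpec`; the real fix would apply the `.map(_.toLowerCase)` where the bucket spec is restored from the Hive table.
   ```scala
   import java.util.Locale

   // Hypothetical stand-in for Spark's BucketSpec.
   final case class BucketSpec(
       numBuckets: Int,
       bucketColumnNames: Seq[String],
       sortColumnNames: Seq[String])

   // Locale.ROOT avoids locale-sensitive surprises (e.g. the Turkish dotless i).
   def lowerCaseBucketSpec(spec: BucketSpec): BucketSpec =
     spec.copy(
       bucketColumnNames = spec.bucketColumnNames.map(_.toLowerCase(Locale.ROOT)),
       sortColumnNames = spec.sortColumnNames.map(_.toLowerCase(Locale.ROOT)))

   val normalized = lowerCaseBucketSpec(BucketSpec(4, Seq("B_C"), Seq("P_C")))
   ```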






[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-09 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r966801663


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1095,7 +1095,11 @@ private[hive] object HiveClientImpl extends Logging {
     table.bucketSpec match {
       case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
         hiveTable.setNumBuckets(bucketSpec.numBuckets)
-        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)

Review Comment:
   this seems like a bug. We already did `val catalogTable = restoreTableMetadata(rawTable)`, so why don't we pass `catalogTable`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-09 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r966724740


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1095,7 +1095,11 @@ private[hive] object HiveClientImpl extends Logging {
     table.bucketSpec match {
       case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
         hiveTable.setNumBuckets(bucketSpec.numBuckets)
-        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)

Review Comment:
   AFAIK Spark stores the table schema and bucket spec as table properties, and restores them when reading the table back from the Hive metastore. So it is always case-preserving in Spark, for both the schema and the bucket spec. Is there something wrong?






[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour

2022-09-07 Thread GitBox


cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r964768062


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -1095,7 +1095,11 @@ private[hive] object HiveClientImpl extends Logging {
     table.bucketSpec match {
       case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
         hiveTable.setNumBuckets(bucketSpec.numBuckets)
-        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)

Review Comment:
   shall we just lower-case the bucket column names?



##
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:
##
@@ -2694,6 +2694,23 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
       }
     }
   }
+
+  test("SPARK-38717: Handle Hive's bucket spec case preserving behaviour") {
+    withTable("t") {
+      sql(
+        s"""
+           |CREATE TABLE t(
+           |  c STRING,
+           |  B_C STRING
+           |)
+           |PARTITIONED BY (p_c STRING)
+           |CLUSTERED BY (B_C) INTO 4 BUCKETS
+           |STORED AS PARQUET
+           |""".stripMargin)
+      val df = sql("SELECT * FROM t")
+      df.collect()

Review Comment:
   let's test with `checkAnswer`
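   For reference, a sketch of how the bare `collect()` above might become a `checkAnswer` assertion. This is not runnable on its own: it assumes the enclosing `SQLQuerySuiteBase` context (which provides `sql` and `QueryTest.checkAnswer`), plus an `INSERT` invented here so the query has a known result.
   ```scala
   // Sketch only: relies on the surrounding Spark test suite.
   sql("INSERT INTO t PARTITION (p_c = 'p') VALUES ('v1', 'v2')")
   checkAnswer(
     sql("SELECT c, B_C, p_c FROM t"),
     Row("v1", "v2", "p") :: Nil)
   ```
   Unlike `collect()`, which only proves the scan doesn't throw, `checkAnswer` also verifies the returned rows.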


