[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r977797330

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -66,6 +66,10 @@ import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
 import org.apache.spark.util.{CircularBuffer, Utils}
 
+private[hive] class CatalogAndHiveTableImpl(
```

Review Comment:
does this really help? We need to cast an `Object` to `HiveTable` anyway.
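A small sketch of the concern, for context: the cast target follows HiveClientImpl's existing `Table => HiveTable` import, and the helper name is made up.

```
import org.apache.hadoop.hive.ql.metadata.{Table => HiveTable}

// Made-up helper: whether or not the pair is wrapped in an Impl class, any
// consumer of the raw table still has to perform this unchecked cast.
def asHiveTable(raw: Object): HiveTable = raw.asInstanceOf[HiveTable]
```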
[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r977795842

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -435,6 +439,14 @@ private[hive] class HiveClientImpl(
     getRawTableOption(dbName, tableName).map(convertHiveTableToCatalogTable)
   }
 
+  override def getCatalogAndHiveTableOption(
+      dbName: String,
+      tableName: String): Option[CatalogAndHiveTable] = withHiveState {
+    logDebug(s"Looking up $dbName.$tableName")
+    getRawTableOption(dbName, tableName)
+      .map(t => new CatalogAndHiveTableImpl(convertHiveTableToCatalogTable(t), t))
```

Review Comment:
I think this is still sub-optimal. Sometimes we just need the raw hive table, and this always pays for an extra `convertHiveTableToCatalogTable` invocation. How about we make it lazy?

```
class RawHiveTable(val rawTable: Object) {
  def toSparkTable: CatalogTable = ...
}
```
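A slightly fuller, hedged sketch of the lazy variant; the `convert` parameter stands in for the existing `convertHiveTableToCatalogTable`, and the wiring is illustrative rather than the merged API.

```
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Sketch only: `lazy val` defers and caches the conversion, so callers that
// only need `rawTable` never pay for convertHiveTableToCatalogTable.
class RawHiveTable(val rawTable: Object, convert: Object => CatalogTable) {
  lazy val toSparkTable: CatalogTable = convert(rawTable)
}
```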
[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r976029766

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -1026,7 +1026,14 @@ private[hive] object HiveClientImpl extends Logging {
     } else {
       CharVarcharUtils.getRawTypeString(c.metadata).getOrElse(c.dataType.catalogString)
     }
-    new FieldSchema(c.name, typeString, c.getComment().orNull)
+    val name = if (lowerCase) {
+      // scalastyle:off caselocale
+      c.name.toLowerCase
+      // scalastyle:on caselocale
+    } else {
+      c.name
```

Review Comment:
We can use `Object` to store the raw hive table, so that we don't need to expose the Hive classes.
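A one-liner sketch of the idea, with a hypothetical trait name: once the raw table is typed as `Object`, the shared interface compiles without any `org.apache.hadoop.hive.*` import.

```
// Hypothetical sketch: no Hive type appears in the signature, so callers
// outside the hive module need no Hive classes on the classpath.
trait RawTableAccess {
  def getRawTableOption(dbName: String, tableName: String): Option[Object]
}
```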
[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r975330647

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -1026,7 +1026,14 @@ private[hive] object HiveClientImpl extends Logging {
     } else {
       CharVarcharUtils.getRawTypeString(c.metadata).getOrElse(c.dataType.catalogString)
     }
-    new FieldSchema(c.name, typeString, c.getComment().orNull)
+    val name = if (lowerCase) {
+      // scalastyle:off caselocale
+      c.name.toLowerCase
+      // scalastyle:on caselocale
+    } else {
+      c.name
```

Review Comment:
Looking at the related code again, I feel it's unnecessary to have this `hive table -> spark table -> hive table` roundtrip when we need to pass the raw table to call `HiveClient` APIs. Shall we expose a `getHiveTable` function in `HiveClient` and let `HiveExternalCatalog` call it? This avoids the `spark table -> hive table` conversion and we should be able to fix this bug.
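A hedged sketch of the proposed flow; all names below are illustrative stand-ins, not the merged API.

```
// Sketch: expose the raw table on the client and let the external catalog
// pass it straight back, skipping the lossy spark -> hive conversion.
trait HiveClientSketch {
  def getHiveTable(dbName: String, tableName: String): Object
  def alterRawTable(rawTable: Object): Unit
}

class CatalogSketch(client: HiveClientSketch) {
  def alterWithoutRoundtrip(db: String, table: String): Unit = {
    // Hive table in, Hive table out: no `hive -> spark -> hive` roundtrip,
    // so nothing is lost to the non-case-preserving conversion.
    val raw = client.getHiveTable(db, table)
    client.alterRawTable(raw)
  }
}
```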
[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r975321833

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -1026,7 +1026,14 @@ private[hive] object HiveClientImpl extends Logging {
     } else {
       CharVarcharUtils.getRawTypeString(c.metadata).getOrElse(c.dataType.catalogString)
     }
-    new FieldSchema(c.name, typeString, c.getComment().orNull)
+    val name = if (lowerCase) {
+      // scalastyle:off caselocale
+      c.name.toLowerCase
+      // scalastyle:on caselocale
+    } else {
+      c.name
```

Review Comment:
Ah I see. This is very tricky. Although HMS is not case-preserving, we still need to pass the original column names to Hive, as this may affect the column names in the schema of the physical parquet files.
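A hedged illustration of why the original casing matters; the table and column names are made up, and a SparkSession named `spark` is assumed.

```
// Hive-format table: HMS will report the column lower-cased as `b_c`...
spark.sql("CREATE TABLE t (B_C STRING) STORED AS PARQUET")
spark.sql("INSERT INTO t VALUES ('x')")
// ...but the Parquet footers of the written files carry whatever column name
// the writer was handed. Lower-casing the name Spark passes to Hive here
// could therefore change the physical file schema, not just the metadata.
```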
[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r974926525

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -1026,7 +1026,14 @@ private[hive] object HiveClientImpl extends Logging {
     } else {
       CharVarcharUtils.getRawTypeString(c.metadata).getOrElse(c.dataType.catalogString)
     }
-    new FieldSchema(c.name, typeString, c.getComment().orNull)
+    val name = if (lowerCase) {
+      // scalastyle:off caselocale
+      c.name.toLowerCase
+      // scalastyle:on caselocale
+    } else {
+      c.name
```

Review Comment:
when shall we not lower-case it?
[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r969667603

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -1095,7 +1095,11 @@ private[hive] object HiveClientImpl extends Logging {
     table.bucketSpec match {
       case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
         hiveTable.setNumBuckets(bucketSpec.numBuckets)
-        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
```

Review Comment:
then let's lower-case normal column names as well? Just don't trust the Hive metastore and normalize all names on the Spark side.
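A minimal sketch of the "normalize everything on the Spark side" idea; the object and method names are hypothetical.

```
object HmsNameNormalization {
  // Hypothetical helper: treat every name coming back from the metastore as
  // untrusted and lower-case it before any comparison on the Spark side.
  // scalastyle:off caselocale
  def normalize(names: Seq[String]): Seq[String] = names.map(_.toLowerCase)
  // scalastyle:on caselocale
}
```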
[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r966824596

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -1095,7 +1095,11 @@ private[hive] object HiveClientImpl extends Logging {
     table.bucketSpec match {
       case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
         hiveTable.setNumBuckets(bucketSpec.numBuckets)
-        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
```

Review Comment:
I can't recall why we need to pass `rawTable` in some cases... Let's go in a different direction: Spark assumes that HMS always lower-cases table/column names, and it's a bit tricky if the bucket spec does not lower-case column names. Let's help Hive fix it in https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1114-L1118 by adding a `.map(_.toLowerCase)`.
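A hedged sketch of what the suggested change could look like. The real fix would live inside `convertHiveTableToCatalogTable` at the linked lines; this standalone helper is paraphrased, not quoted, and only the `.map(_.toLowerCase)` is the actual suggestion.

```
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.ql.metadata.{Table => HiveTable}
import org.apache.spark.sql.catalyst.catalog.BucketSpec

// Sketch: lower-case the bucket columns when restoring the spec from the
// metastore, so Spark sees consistently normalized names. Sort columns are
// omitted here for brevity.
def restoreBucketSpec(h: HiveTable): Option[BucketSpec] = {
  if (h.getNumBuckets > 0) {
    // scalastyle:off caselocale
    val cols = h.getBucketCols.asScala.toSeq.map(_.toLowerCase)
    // scalastyle:on caselocale
    Some(BucketSpec(h.getNumBuckets, cols, Nil))
  } else {
    None
  }
}
```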
[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r966801663

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -1095,7 +1095,11 @@ private[hive] object HiveClientImpl extends Logging {
     table.bucketSpec match {
       case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
         hiveTable.setNumBuckets(bucketSpec.numBuckets)
-        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
```

Review Comment:
this seems like a bug. We already did `val catalogTable = restoreTableMetadata(rawTable)`, so why don't we pass `catalogTable`?
[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r966724740

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -1095,7 +1095,11 @@ private[hive] object HiveClientImpl extends Logging {
     table.bucketSpec match {
       case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
         hiveTable.setNumBuckets(bucketSpec.numBuckets)
-        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
```

Review Comment:
AFAIK Spark stores the table schema and bucket spec as table properties, and restores them when reading the table from the Hive metastore. So it is always case-preserving in Spark, for both the schema and the bucket spec. Is there something wrong?
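An illustrative, hedged look at the round-trip described above. The property keys follow the `spark.sql.sources.schema.*` convention `HiveExternalCatalog` is known to use, but exact keys vary across Spark versions, and the values here are abridged.

```
// Illustrative property bag for a bucketed datasource table.
val tableProps = Map(
  "spark.sql.sources.schema.numParts" -> "1",
  "spark.sql.sources.schema.part.0" -> """{"type":"struct","fields":[...]}""",
  "spark.sql.sources.schema.numBuckets" -> "4",
  "spark.sql.sources.schema.numBucketCols" -> "1",
  "spark.sql.sources.schema.bucketCol.0" -> "B_C") // casing preserved here
// restoreTableMetadata reads these back, so Spark-side metadata keeps the
// original casing even though HMS lower-cased the actual columns.
```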
[GitHub] [spark] cloud-fan commented on a diff in pull request #36027: [SPARK-38717][SQL] Handle Hive's bucket spec case preserving behaviour
cloud-fan commented on code in PR #36027:
URL: https://github.com/apache/spark/pull/36027#discussion_r964768062

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -1095,7 +1095,11 @@ private[hive] object HiveClientImpl extends Logging {
     table.bucketSpec match {
       case Some(bucketSpec) if !HiveExternalCatalog.isDatasourceTable(table) =>
         hiveTable.setNumBuckets(bucketSpec.numBuckets)
-        hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
```

Review Comment:
shall we just lower-case the bucket column names?

## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:

```
@@ -2694,6 +2694,23 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
       }
     }
   }
+
+  test("SPARK-38717: Handle Hive's bucket spec case preserving behaviour") {
+    withTable("t") {
+      sql(
+        s"""
+           |CREATE TABLE t(
+           |  c STRING,
+           |  B_C STRING
+           |)
+           |PARTITIONED BY (p_c STRING)
+           |CLUSTERED BY (B_C) INTO 4 BUCKETS
+           |STORED AS PARQUET
+           |""".stripMargin)
+      val df = sql("SELECT * FROM t")
+      df.collect()
```

Review Comment:
let's test with `checkAnswer`
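A hedged sketch of the `checkAnswer` follow-up; the fragment assumes the suite's `sql` and `checkAnswer` helpers and that the test still inserts no rows.

```
// Assert on contents instead of merely forcing execution; with no rows
// inserted, the expected answer is empty.
checkAnswer(sql("SELECT * FROM t"), Seq.empty)
```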