Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-23 Thread via GitHub


cloud-fan closed pull request #46083: [SPARK-47873][SQL] Write collated strings 
to Hive metastore using the regular string type
URL: https://github.com/apache/spark/pull/46083


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-23 Thread via GitHub


cloud-fan commented on PR #46083:
URL: https://github.com/apache/spark/pull/46083#issuecomment-2072351055

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-23 Thread via GitHub


stefankandic commented on PR #46083:
URL: https://github.com/apache/spark/pull/46083#issuecomment-2071987485

   @cloud-fan all tests passing after scalastyle fix


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-23 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1575780017


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -268,23 +296,6 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
 val provider = table.provider.get
 val options = new SourceOptions(table.storage.properties)
 
-// To work around some hive metastore issues, e.g. not case-preserving, 
bad decimal type

Review Comment:
   https://github.com/apache/spark/assets/154237371/d761434c-e1bd-418c-bf3c-3b36b368101a;>
   
   checked, it is indeed called only once



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-23 Thread via GitHub


cloud-fan commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1575698057


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -268,23 +296,6 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
 val provider = table.provider.get
 val options = new SourceOptions(table.storage.properties)
 
-// To work around some hive metastore issues, e.g. not case-preserving, 
bad decimal type

Review Comment:
   can you double check if `createDataSourceTable` is only called in one place 
so that it's safe to move the code?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-22 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1575038248


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -320,6 +322,7 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
   }
 
   table.copy(
+schema = getHiveCompatibleSchema(table.schema),

Review Comment:
   actually i really liked you previous comment about converting the schema at 
the similar place so I refactored table property creation to be done before the 
call to `createDataSourceTable`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-22 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1575038248


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -320,6 +322,7 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
   }
 
   table.copy(
+schema = getHiveCompatibleSchema(table.schema),

Review Comment:
   actually i liked you previous comment about converting the schema at the 
similar place so I refactored table property creation to be done before the 
call to `createDataSourceTable`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-22 Thread via GitHub


cloud-fan commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1574582893


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -320,6 +322,7 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
   }
 
   table.copy(
+schema = getHiveCompatibleSchema(table.schema),

Review Comment:
   Can we add some comments saying that we convert the table schema at the last 
step here, so that the table schema JSON string in the table properties still 
uses the original schema.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-22 Thread via GitHub


cloud-fan commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1574579235


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -239,6 +240,7 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
 ignoreIfExists)

Review Comment:
   ah you are right



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-22 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1574440520


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -239,6 +240,7 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
 ignoreIfExists)

Review Comment:
   But isn't schema json being saved to table properties inside 
`createDataSourceTable`? 
   
   As far as I see it if we convert the schema before we will lose the ability 
to read back the collations



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-22 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1574440520


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -239,6 +240,7 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
 ignoreIfExists)

Review Comment:
   But isn't schema json being saved to table properties inside 
`createDataSourceTable`? 
   
   As far as I see it if we convert the schema before we will lose the ability 
to read it back



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-22 Thread via GitHub


cloud-fan commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1574411630


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -239,6 +240,7 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
 ignoreIfExists)

Review Comment:
   shall we convert the table schema here, other than inside 
`createDataSourceTable`? The code looks clearer if we convert the table schema 
at the similar place for both Hive serde and data source tables.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-18 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1570322161


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   I fixed this in a separate PR so please check it out #46116



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568655624


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   when I wrote this method a while back it was called 
`hasNonDefaultCollatedString` and the logic was `collationId != 0`.  However 
the name and logic were changed a while back by @mihailom-db when he introduced 
a concept of default collation (session collation).
   
   All in all, I think the method should be used for exactly this use-case but 
we should revisit naming/logic of it separately



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568655624


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   when I wrote this method a while back it was called 
`hasNonDefaultCollatedString` and the logic was `collationId != 0`.  However 
the name and logic were changed a while back by @mihailom-db when he introduced 
a concept of default collation (session collation).
   
   All in all, method here should be used for exactly this use-case but we 
should revisit naming/logic of it separately



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568655624


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   when I wrote this method a while back it was called 
`hasNonDefaultCollatedString` and the logic was `collationId != 0`.  However 
the name and logic were changed a while back by @mihailom-db when he introduced 
a concept of default collation (session collation).
   
   All in all, method here should be used for exactly this use-case but we 
revisit naming/logic of it separately



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568655624


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   when a wrote this method a while back it was called 
`hasNonDefaultCollatedString` and the logic was `collationId != 0`.  However 
the name and logic were changed a while back by @mihailom-db when he introduced 
a concept of default collation (session collation).
   
   All in all, method here should be used for exactly this use-case but we 
revisit naming/logic of it separately



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


cloud-fan commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568644785


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -454,4 +454,29 @@ object DataType {
   case (fromDataType, toDataType) => fromDataType == toDataType
 }
   }
+
+  /**
+   * Compares two types, ignoring nullability of ArrayType, MapType, 
StructType, ignoring case
+   * sensitivity of field names in StructType as well as differences in 
collation for String types.
+   */
+  def equalsIgnoreCaseNullabilityAndCollation(from: DataType, to: DataType): 
Boolean = {

Review Comment:
   I noticed that this is in `sql/api`. If we are not going to use it in the 
Spark Connect client,  we should just add this method to `DataTypeUtils`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568622375


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   i'm not a fan of the name, but that is what the function will actually do. 
It will check if the collation `supportsBinaryOrdering` which only UTF8_BINARY 
collation does



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568655624


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   when a wrote this method a while back it was called 
`hasNonDefaultCollatedString` and the logic was `collationId != 0`.  However 
the name was changed a while back by @mihailom-db when he introduced a concept 
of default collation (session collation).
   
   All in all, method here should be used for exactly this use-case but we 
revisit naming/logic of it separately



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


cloud-fan commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568607199


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   I think `binarySortable` is not the property we are looking for. We need to 
fine collation id != 0, according to how `StringType.typeName` is implemented.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568652156


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -454,4 +454,29 @@ object DataType {
   case (fromDataType, toDataType) => fromDataType == toDataType
 }
   }
+
+  /**
+   * Compares two types, ignoring nullability of ArrayType, MapType, 
StructType, ignoring case
+   * sensitivity of field names in StructType as well as differences in 
collation for String types.
+   */
+  def equalsIgnoreCaseNullabilityAndCollation(from: DataType, to: DataType): 
Boolean = {

Review Comment:
   sure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


cloud-fan commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568645699


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   I don't think this is future-proof. I think it's better to write new code to 
explicitly check `collationId != 0`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568617954


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   i'm not a fan of the name `hasNonBinarySortableCollatedString`, but that 
method does what you described.
   It will check if a collation `supportsBinaryEquality` => if id == 0



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568617954


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   i'm not a fan of the name `hasNonBinarySortableCollatedString`, but that 
method does what you described.
   It will check if a collation `supportsBinaryEquality` => if id == 0



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568617954


##
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##
@@ -303,4 +303,35 @@ private[spark] object SchemaUtils {
   case _ => false
 }
   }
+
+  /**
+   * Replaces any collated string type with non collated StringType
+   * recursively in the given schema.
+   */
+  def replaceCollatedStringWithStringInSchema(schema: StructType): StructType 
= {
+StructType(schema.map { field =>
+  if (hasNonBinarySortableCollatedString(field.dataType)) {

Review Comment:
   i'm not a fan of the name `hasNonBinarySortableCollatedString`, but that 
method does what you described it will check if a collation 
`supportsBinaryEquality` => if id == 0



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


stefankandic commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568616448


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/DataTypeUtils.scala:
##
@@ -47,6 +47,14 @@ object DataTypeUtils {
 DataType.equalsIgnoreCaseAndNullability(from, to)
   }
 
+  /**
+   * Compares two types, ignoring nullability of ArrayType, MapType, 
StructType, ignoring case
+   * sensitivity of field names in StructType as well as differences in 
collation for String types.
+   */
+  def equalsIgnoreCaseNullabilityAndStringTypes(from: DataType, to: DataType): 
Boolean = {

Review Comment:
   both `equalsIgnoreNullability` and `equalsIgnoreCaseAndNullability` do the 
same thing so I just wanted to be consistent



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


cloud-fan commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568607999


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/DataTypeUtils.scala:
##
@@ -47,6 +47,14 @@ object DataTypeUtils {
 DataType.equalsIgnoreCaseAndNullability(from, to)
   }
 
+  /**
+   * Compares two types, ignoring nullability of ArrayType, MapType, 
StructType, ignoring case
+   * sensitivity of field names in StructType as well as differences in 
collation for String types.
+   */
+  def equalsIgnoreCaseNullabilityAndStringTypes(from: DataType, to: DataType): 
Boolean = {

Review Comment:
   why do we need this extra level of indirection?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47873][SQL] Write collated strings to Hive metastore using the regular string type [spark]

2024-04-17 Thread via GitHub


cloud-fan commented on code in PR #46083:
URL: https://github.com/apache/spark/pull/46083#discussion_r1568600521


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -454,4 +454,29 @@ object DataType {
   case (fromDataType, toDataType) => fromDataType == toDataType
 }
   }
+
+  /**
+   * Compares two types, ignoring nullability of ArrayType, MapType, 
StructType, ignoring case
+   * sensitivity of field names in StructType as well as differences in 
collation for String types.
+   */
+  def equalsIgnoreCaseNullabilityAndStringTypes(from: DataType, to: DataType): 
Boolean = {

Review Comment:
   ```suggestion
 def equalsIgnoreCaseNullabilityAndCollation(from: DataType, to: DataType): 
Boolean = {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org