Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-20 Thread via GitHub


cloud-fan commented on PR #46280:
URL: https://github.com/apache/spark/pull/46280#issuecomment-2121377372

   I think so. A string type with collation should be a normal string type in the Hive table schema, so that other engines can still read it. We only keep the collation info in the Spark-specific table schema JSON string stored in the table properties.
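   For illustration, a minimal sketch of the split (the `__COLLATIONS` metadata key below is the one introduced in this PR; the exact table-property key that carries the Spark schema JSON is an assumption here). Hive and other engines would only see `c1 string`, while the Spark schema JSON kept in table properties would record:

   ```json
   {
     "name": "c1",
     "type": "string",
     "nullable": true,
     "metadata": {
       "__COLLATIONS": {
         "c1": "ICU.UNICODE"
       }
     }
   }
   ```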





Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-20 Thread via GitHub


stefankandic commented on PR #46280:
URL: https://github.com/apache/spark/pull/46280#issuecomment-2120702050

   @cloud-fan I looked into HMS a bit, and it seems that we can't save column metadata there, so I guess we will still have to keep converting a schema with collations to one without them when creating a table in Hive, even though collations are no longer a type?
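   A minimal sketch of what that conversion could look like (illustrative only; `stripCollations` is a hypothetical helper, not the actual code path Spark uses when writing to HMS):

   ```scala
   import org.apache.spark.sql.types._

   // Recursively replace collated string types with plain StringType so the
   // schema handed to the Hive metastore stays readable by other engines.
   def stripCollations(dt: DataType): DataType = dt match {
     case _: StringType => StringType
     case ArrayType(et, containsNull) => ArrayType(stripCollations(et), containsNull)
     case MapType(kt, vt, valueContainsNull) =>
       MapType(stripCollations(kt), stripCollations(vt), valueContainsNull)
     case StructType(fields) =>
       StructType(fields.map(f => f.copy(dataType = stripCollations(f.dataType))))
     case other => other
   }
   ```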





Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-18 Thread via GitHub


cloud-fan closed pull request #46280: [SPARK-48175][SQL][PYTHON] Store 
collation information in metadata and not in type for SER/DE
URL: https://github.com/apache/spark/pull/46280





Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-18 Thread via GitHub


cloud-fan commented on PR #46280:
URL: https://github.com/apache/spark/pull/46280#issuecomment-2118674278

   thanks, merging to master!





Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-17 Thread via GitHub


stefankandic commented on PR #46280:
URL: https://github.com/apache/spark/pull/46280#issuecomment-2117427722

   @cloud-fan all checks are passing, can we merge this?





Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-17 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1604650741


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment:
   that makes sense






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-17 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1604650376


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,57 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;
+
+    public CollationIdentifier(String provider, String collationName, String version) {
+      this.provider = provider;
+      this.name = collationName;
+      this.version = version;
+    }
+
+    public static CollationIdentifier fromString(String identifier) {
+      long numDots = identifier.chars().filter(ch -> ch == '.').count();
+      assert(numDots > 0);
+
+      if (numDots == 1) {
+        String[] parts = identifier.split("\\.", 2);
+        return new CollationIdentifier(parts[0], parts[1], null);
+      }
+
+      String[] parts = identifier.split("\\.", 3);
+      return new CollationIdentifier(parts[0], parts[1], parts[2]);
+    }
+
+    @Override
+    public String toString() {
+      if (version != null) {
+        return String.format("%s.%s.%s", provider, name, version);
+      }
+
+      return toStringWithoutVersion();
+    }
+
+    /**
+     * Returns the identifier's string value without the version.
+     */
+    public String toStringWithoutVersion() {

Review Comment:
   added
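   For reference, the `fromString` contract in the snippet above implies the following behavior (hypothetical usage; the input values are invented):

   ```java
   // Two or more dots: everything after the second dot is the version.
   CollationIdentifier withVersion = CollationIdentifier.fromString("ICU.UNICODE_CI.75.1");
   // provider = "ICU", name = "UNICODE_CI", version = "75.1"

   // Exactly one dot: no version.
   CollationIdentifier noVersion = CollationIdentifier.fromString("ICU.UNICODE_CI");
   // provider = "ICU", name = "UNICODE_CI", version = null
   // toString() falls back to toStringWithoutVersion() -> "ICU.UNICODE_CI"
   ```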






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-17 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1604604686


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,57 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;
+
+    public CollationIdentifier(String provider, String collationName, String version) {
+      this.provider = provider;
+      this.name = collationName;
+      this.version = version;
+    }
+
+    public static CollationIdentifier fromString(String identifier) {
+      long numDots = identifier.chars().filter(ch -> ch == '.').count();
+      assert(numDots > 0);
+
+      if (numDots == 1) {
+        String[] parts = identifier.split("\\.", 2);
+        return new CollationIdentifier(parts[0], parts[1], null);
+      }
+
+      String[] parts = identifier.split("\\.", 3);
+      return new CollationIdentifier(parts[0], parts[1], parts[2]);
+    }
+
+    @Override
+    public String toString() {
+      if (version != null) {
+        return String.format("%s.%s.%s", provider, name, version);
+      }
+
+      return toStringWithoutVersion();
+    }
+
+    /**
+     * Returns the identifier's string value without the version.
+     */
+    public String toStringWithoutVersion() {

Review Comment:
   can we add more comments to explain when we should call 
`toStringWithoutVersion` instead of `toString`?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-17 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r160462


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment:
   when can `version` be null?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-17 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1604598115


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment:
   To strictly follow Java style, I think these fields should be `private` and we should add `getXXX` methods to return them. The field type should not be `Optional`, but the `getXXX` return type can be `Optional`.
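   A hedged sketch of that pattern (illustrative only, not necessarily the exact shape the PR adopted):

   ```java
   import java.util.Optional;

   public static class CollationIdentifier {
     private final String provider;
     private final String name;
     private final String version;  // may be null; null means "unversioned"

     public CollationIdentifier(String provider, String name, String version) {
       this.provider = provider;
       this.name = name;
       this.version = version;
     }

     public String getProvider() { return provider; }

     public String getName() { return name; }

     // The field stays a plain String; only the accessor exposes Optional.
     public Optional<String> getVersion() {
       return Optional.ofNullable(version);
     }
   }
   ```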






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603733508


##
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala:
##
@@ -712,4 +714,181 @@ class DataTypeSuite extends SparkFunSuite {
 
     assert(result === expected)
   }
+
+  test("schema with collation should not change during ser/de") {
+    val simpleStruct = StructType(
+      StructField("c1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+    val nestedStruct = StructType(
+      StructField("nested", simpleStruct) :: Nil)
+
+    val caseInsensitiveNames = StructType(
+      StructField("c1", StringType(UNICODE_COLLATION_ID)) ::
+      StructField("C1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+    val specialCharsInName = StructType(
+      StructField("c1.*23?", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+    val arrayInSchema = StructType(
+      StructField("arrayField", ArrayType(StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+    val mapInSchema = StructType(
+      StructField("mapField",
+        MapType(StringType(UNICODE_COLLATION_ID), StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+    val mapWithKeyInNameInSchema = StructType(
+      StructField("name.key", StringType) ::
+      StructField("name",
+        MapType(StringType(UNICODE_COLLATION_ID), StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+    val arrayInMapInNestedSchema = StructType(
+      StructField("arrInMap",
+        MapType(StringType(UNICODE_COLLATION_ID),
+          ArrayType(StringType(UNICODE_COLLATION_ID)))) :: Nil)
+
+    val nestedArrayInMap = StructType(
+      StructField("nestedArrayInMap",
+        ArrayType(MapType(StringType(UNICODE_COLLATION_ID),
+          ArrayType(ArrayType(StringType(UNICODE_COLLATION_ID)))))) :: Nil)
+
+    val schemaWithMultipleFields = StructType(
+      simpleStruct.fields ++ nestedStruct.fields ++ arrayInSchema.fields ++ mapInSchema.fields ++
+        mapWithKeyInNameInSchema ++ arrayInMapInNestedSchema.fields ++ nestedArrayInMap.fields)
+
+    Seq(
+      simpleStruct, caseInsensitiveNames, specialCharsInName, nestedStruct, arrayInSchema,
+      mapInSchema, mapWithKeyInNameInSchema, nestedArrayInMap, arrayInMapInNestedSchema,
+      schemaWithMultipleFields)
+      .foreach { schema =>
+        val json = schema.json
+        val parsed = DataType.fromJson(json)
+        assert(parsed === schema)
+      }
+  }
+
+  test("non string field has collation metadata") {

Review Comment:
   `StructTypeSuite` is used just to validate the `toJson` output; deserialization is tested in `DataTypeSuite`.






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603732173


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -208,22 +206,35 @@ object DataType {
   }
 
   // NOTE: Map fields must be sorted in alphabetical order to keep consistent with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+      json: JValue,
+      fieldPath: String = "",
+      collationsMap: Map[String, String] = Map.empty): DataType = json match {
     case JString(name) =>
-      nameToType(name)
+      collationsMap.get(fieldPath) match {
+        case Some(collation) =>
+          assertValidTypeForCollations(fieldPath, name, collationsMap)
+          stringTypeWithCollation(collation)
+        case _ => nameToType(name)
+      }
 
     case JSortedObject(
         ("containsNull", JBool(n)),
         ("elementType", t: JValue),
         ("type", JString("array"))) =>
-      ArrayType(parseDataType(t), n)
+      assertValidTypeForCollations(fieldPath, "array", collationsMap)
+      val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", collationsMap)
+      ArrayType(elementType, n)
 
     case JSortedObject(
         ("keyType", k: JValue),
         ("type", JString("map")),
         ("valueContainsNull", JBool(n)),
         ("valueType", v: JValue)) =>
-      MapType(parseDataType(k), parseDataType(v), n)
+      assertValidTypeForCollations(fieldPath, "map", collationsMap)
+      val keyType = parseDataTypeWithCollation(k, fieldPath + ".key", collationsMap)
+      val valueType = parseDataTypeWithCollation(v, fieldPath + ".value", collationsMap)

Review Comment:
   Good catch! After some earlier refactoring, this is no longer needed.






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603727528


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment:
   So you also think we should use `Optional` here? Even IntelliJ complains when it is used for a field.
   
   (screenshot: https://github.com/apache/spark/assets/154237371/ea7bd5c5-43cb-4826-aae5-bf76588b6198)
   
   Also, you can read below that it's intended to be used for method return types.






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603651070


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,61 @@ case class StructField(
     ("name" -> name) ~
       ("type" -> dataType.jsonValue) ~
       ("nullable" -> nullable) ~
-      ("metadata" -> metadata.jsonValue)
+      ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+    val metadataJsonValue = metadata.jsonValue
+    metadataJsonValue match {
+      case JObject(fields) if collationMetadata.nonEmpty =>
+        val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList
+        JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields)))
+
+      case _ => metadataJsonValue
+    }
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: Map[String, String] = {
+    val fieldToCollationMap = mutable.Map[String, String]()
+
+    def visitRecursively(dt: DataType, path: String): Unit = dt match {
+      case at: ArrayType =>
+        processDataType(at.elementType, path + ".element")
+
+      case mt: MapType =>
+        processDataType(mt.keyType, path + ".key")
+        processDataType(mt.valueType, path + ".value")
+
+      case st: StringType if isCollatedString(st) =>
+        fieldToCollationMap(path) = schemaCollationValue(st)
+
+      case _ =>
+    }
+
+    def processDataType(dt: DataType, path: String): Unit = {
+      if (isCollatedString(dt)) {
+        fieldToCollationMap(path) = schemaCollationValue(dt)
+      } else {
+        visitRecursively(dt, path)
+      }
+    }
+
+    visitRecursively(dataType, name)
+    fieldToCollationMap.toMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+    case st: StringType => !st.isUTF8BinaryCollation
+    case _ => false
+  }
+
+  private def schemaCollationValue(dt: DataType): String = dt match {
+    case st: StringType =>
+      val collation = CollationFactory.fetchCollation(st.collationId)
+      collation.identifier().toStringWithoutVersion()

Review Comment:
   For writing statistics - a version change invalidates them.






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603406813


##
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala:
##
@@ -712,4 +714,181 @@ class DataTypeSuite extends SparkFunSuite {
 
     assert(result === expected)
   }
+
+  test("schema with collation should not change during ser/de") {
+    val simpleStruct = StructType(
+      StructField("c1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+    val nestedStruct = StructType(
+      StructField("nested", simpleStruct) :: Nil)
+
+    val caseInsensitiveNames = StructType(
+      StructField("c1", StringType(UNICODE_COLLATION_ID)) ::
+      StructField("C1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+    val specialCharsInName = StructType(
+      StructField("c1.*23?", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+    val arrayInSchema = StructType(
+      StructField("arrayField", ArrayType(StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+    val mapInSchema = StructType(
+      StructField("mapField",
+        MapType(StringType(UNICODE_COLLATION_ID), StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+    val mapWithKeyInNameInSchema = StructType(
+      StructField("name.key", StringType) ::
+      StructField("name",
+        MapType(StringType(UNICODE_COLLATION_ID), StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+    val arrayInMapInNestedSchema = StructType(
+      StructField("arrInMap",
+        MapType(StringType(UNICODE_COLLATION_ID),
+          ArrayType(StringType(UNICODE_COLLATION_ID)))) :: Nil)
+
+    val nestedArrayInMap = StructType(
+      StructField("nestedArrayInMap",
+        ArrayType(MapType(StringType(UNICODE_COLLATION_ID),
+          ArrayType(ArrayType(StringType(UNICODE_COLLATION_ID)))))) :: Nil)
+
+    val schemaWithMultipleFields = StructType(
+      simpleStruct.fields ++ nestedStruct.fields ++ arrayInSchema.fields ++ mapInSchema.fields ++
+        mapWithKeyInNameInSchema ++ arrayInMapInNestedSchema.fields ++ nestedArrayInMap.fields)
+
+    Seq(
+      simpleStruct, caseInsensitiveNames, specialCharsInName, nestedStruct, arrayInSchema,
+      mapInSchema, mapWithKeyInNameInSchema, nestedArrayInMap, arrayInMapInNestedSchema,
+      schemaWithMultipleFields)
+      .foreach { schema =>
+        val json = schema.json
+        val parsed = DataType.fromJson(json)
+        assert(parsed === schema)
+      }
+  }
+
+  test("non string field has collation metadata") {

Review Comment:
   how do we decide whether to put a test here or in `StructTypeSuite`?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603405105


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,61 @@ case class StructField(
     ("name" -> name) ~
       ("type" -> dataType.jsonValue) ~
       ("nullable" -> nullable) ~
-      ("metadata" -> metadata.jsonValue)
+      ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+    val metadataJsonValue = metadata.jsonValue
+    metadataJsonValue match {
+      case JObject(fields) if collationMetadata.nonEmpty =>
+        val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList
+        JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields)))
+
+      case _ => metadataJsonValue
+    }
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: Map[String, String] = {
+    val fieldToCollationMap = mutable.Map[String, String]()
+
+    def visitRecursively(dt: DataType, path: String): Unit = dt match {
+      case at: ArrayType =>
+        processDataType(at.elementType, path + ".element")
+
+      case mt: MapType =>
+        processDataType(mt.keyType, path + ".key")
+        processDataType(mt.valueType, path + ".value")
+
+      case st: StringType if isCollatedString(st) =>
+        fieldToCollationMap(path) = schemaCollationValue(st)
+
+      case _ =>
+    }
+
+    def processDataType(dt: DataType, path: String): Unit = {
+      if (isCollatedString(dt)) {
+        fieldToCollationMap(path) = schemaCollationValue(dt)
+      } else {
+        visitRecursively(dt, path)
+      }
+    }
+
+    visitRecursively(dataType, name)
+    fieldToCollationMap.toMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+    case st: StringType => !st.isUTF8BinaryCollation
+    case _ => false
+  }
+
+  private def schemaCollationValue(dt: DataType): String = dt match {
+    case st: StringType =>
+      val collation = CollationFactory.fetchCollation(st.collationId)
+      collation.identifier().toStringWithoutVersion()

Review Comment:
   when will we use the version?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603404296


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,61 @@ case class StructField(
     ("name" -> name) ~
       ("type" -> dataType.jsonValue) ~
       ("nullable" -> nullable) ~
-      ("metadata" -> metadata.jsonValue)
+      ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+    val metadataJsonValue = metadata.jsonValue
+    metadataJsonValue match {
+      case JObject(fields) if collationMetadata.nonEmpty =>
+        val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList
+        JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields)))
+
+      case _ => metadataJsonValue
+    }
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: Map[String, String] = {
+    val fieldToCollationMap = mutable.Map[String, String]()
+
+    def visitRecursively(dt: DataType, path: String): Unit = dt match {
+      case at: ArrayType =>
+        processDataType(at.elementType, path + ".element")
+
+      case mt: MapType =>
+        processDataType(mt.keyType, path + ".key")
+        processDataType(mt.valueType, path + ".value")
+
+      case st: StringType if isCollatedString(st) =>
+        fieldToCollationMap(path) = schemaCollationValue(st)
+
+      case _ =>
+    }
+
+    def processDataType(dt: DataType, path: String): Unit = {
+      if (isCollatedString(dt)) {
+        fieldToCollationMap(path) = schemaCollationValue(dt)
+      } else {
+        visitRecursively(dt, path)
+      }
+    }
+
+    visitRecursively(dataType, name)
+    fieldToCollationMap.toMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+    case st: StringType => !st.isUTF8BinaryCollation
+    case _ => false
+  }
+
+  private def schemaCollationValue(dt: DataType): String = dt match {
+    case st: StringType =>
+      val collation = CollationFactory.fetchCollation(st.collationId)
+      collation.identifier().toStringWithoutVersion()
+    case _ =>
+      throw new IllegalStateException(s"Unexpected data type $dt")

Review Comment:
   we should use `SparkException.internalError`
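   A sketch of the suggested change (`SparkException.internalError` is the existing helper on `org.apache.spark.SparkException`):

   ```scala
   import org.apache.spark.SparkException

   private def schemaCollationValue(dt: DataType): String = dt match {
     case st: StringType =>
       val collation = CollationFactory.fetchCollation(st.collationId)
       collation.identifier().toStringWithoutVersion()
     case _ =>
       throw SparkException.internalError(s"Unexpected data type $dt")
   }
   ```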






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603399289


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -208,22 +206,35 @@ object DataType {
   }
 
   // NOTE: Map fields must be sorted in alphabetical order to keep consistent with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+      json: JValue,
+      fieldPath: String = "",
+      collationsMap: Map[String, String] = Map.empty): DataType = json match {
     case JString(name) =>
-      nameToType(name)
+      collationsMap.get(fieldPath) match {
+        case Some(collation) =>
+          assertValidTypeForCollations(fieldPath, name, collationsMap)
+          stringTypeWithCollation(collation)
+        case _ => nameToType(name)
+      }
 
     case JSortedObject(
         ("containsNull", JBool(n)),
         ("elementType", t: JValue),
         ("type", JString("array"))) =>
-      ArrayType(parseDataType(t), n)
+      assertValidTypeForCollations(fieldPath, "array", collationsMap)
+      val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", collationsMap)
+      ArrayType(elementType, n)
 
     case JSortedObject(
         ("keyType", k: JValue),
         ("type", JString("map")),
         ("valueContainsNull", JBool(n)),
         ("valueType", v: JValue)) =>
-      MapType(parseDataType(k), parseDataType(v), n)
+      assertValidTypeForCollations(fieldPath, "map", collationsMap)
+      val keyType = parseDataTypeWithCollation(k, fieldPath + ".key", collationsMap)
+      val valueType = parseDataTypeWithCollation(v, fieldPath + ".value", collationsMap)

Review Comment:
   why do we need `parseDataTypeWithCollation`? It looks correct to just call 
`parseDataType`.






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603393035


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -208,22 +206,35 @@ object DataType {
   }
 
   // NOTE: Map fields must be sorted in alphabetical order to keep consistent with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+      json: JValue,
+      fieldPath: String = "",
+      collationsMap: Map[String, String] = Map.empty): DataType = json match {
     case JString(name) =>
-      nameToType(name)
+      collationsMap.get(fieldPath) match {
+        case Some(collation) =>
+          assertValidTypeForCollations(fieldPath, name, collationsMap)
+          stringTypeWithCollation(collation)
+        case _ => nameToType(name)
+      }
 
     case JSortedObject(
         ("containsNull", JBool(n)),
         ("elementType", t: JValue),
         ("type", JString("array"))) =>
-      ArrayType(parseDataType(t), n)
+      assertValidTypeForCollations(fieldPath, "array", collationsMap)
+      val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", collationsMap)
+      ArrayType(elementType, n)
 
     case JSortedObject(
         ("keyType", k: JValue),
         ("type", JString("map")),
         ("valueContainsNull", JBool(n)),
         ("valueType", v: JValue)) =>
-      MapType(parseDataType(k), parseDataType(v), n)
+      assertValidTypeForCollations(fieldPath, "map", collationsMap)

Review Comment:
   shall we do this assert in the StructType branch as well?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603384345


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -110,6 +158,8 @@ public Collation(
       // No Collation can simultaneously support binary equality and lowercase equality
       assert(!supportsBinaryEquality || !supportsLowercaseEquality);
 
+      assert(SUPPORTED_PROVIDERS.contains(provider) || provider == null);

Review Comment:
   When can `provider` be null?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603382702


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment:
   Attributes are not optional... It's bad practice to leave attributes uninitialized, which means null for a reference type.






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


olaky commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603195283


##
python/pyspark/sql/tests/test_types.py:
##
@@ -549,6 +549,129 @@ def test_convert_list_to_str(self):
         self.assertEqual(df.count(), 1)
         self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+    def test_schema_with_collations_json_ser_de(self):
+        from pyspark.sql.types import _parse_datatype_json_string
+
+        unicode_collation = "UNICODE"
+
+        simple_struct = StructType([StructField("c1", StringType(unicode_collation))])
+
+        nested_struct = StructType([StructField("nested", simple_struct)])
+
+        array_in_schema = StructType(
+            [StructField("array", ArrayType(StringType(unicode_collation)))]
+        )
+
+        map_in_schema = StructType(
+            [
+                StructField(
+                    "map", MapType(StringType(unicode_collation), StringType(unicode_collation))
+                )
+            ]
+        )
+
+        array_in_map = StructType(
+            [
+                StructField(
+                    "arrInMap",
+                    MapType(
+                        StringType(unicode_collation), ArrayType(StringType(unicode_collation))
+                    ),
+                )
+            ]
+        )
+
+        nested_array_in_map = StructType(
+            [
+                StructField(
+                    "nestedArrayInMap",
+                    ArrayType(
+                        MapType(
+                            StringType(unicode_collation),
+                            ArrayType(ArrayType(StringType(unicode_collation))),
+                        )
+                    ),
+                )
+            ]
+        )
+
+        schema_with_multiple_fields = StructType(
+            simple_struct.fields
+            + nested_struct.fields
+            + array_in_schema.fields
+            + map_in_schema.fields
+            + array_in_map.fields
+            + nested_array_in_map.fields
+        )
+
+        schemas = [
+            simple_struct,
+            nested_struct,
+            array_in_schema,
+            map_in_schema,
+            nested_array_in_map,
+            array_in_map,
+            schema_with_multiple_fields,
+        ]
+
+        for schema in schemas:
+            scala_datatype = self.spark._jsparkSession.parseDataType(schema.json())

Review Comment:
   right, my bad






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


olaky commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603191384


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment:
   Honestly, I'm not too familiar with how this is usually done in Java, but `Optional` would clearly express that the absence is not a bug (which a null often indicates).






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601881695


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,60 @@ case class StructField(
     ("name" -> name) ~
       ("type" -> dataType.jsonValue) ~
       ("nullable" -> nullable) ~
-      ("metadata" -> metadata.jsonValue)
+      ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+    val metadataJsonValue = metadata.jsonValue
+    metadataJsonValue match {
+      case JObject(fields) if collationMetadata.nonEmpty =>
+        val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList
+        JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields)))
+
+      case _ => metadataJsonValue
+    }
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: mutable.Map[String, String] = {
+    val fieldToCollationMap = mutable.Map[String, String]()
+
+    def visitRecursively(dt: DataType, path: String): Unit = dt match {
+      case at: ArrayType =>
+        processDataType(at.elementType, path + ".element")
+
+      case mt: MapType =>
+        processDataType(mt.keyType, path + ".key")
+        processDataType(mt.valueType, path + ".value")
+
+      case st: StringType if isCollatedString(st) =>
+        fieldToCollationMap(path) = collationName(st)
+
+      case _ =>
+    }
+
+    def processDataType(dt: DataType, path: String): Unit = {
+      if (isCollatedString(dt)) {
+        fieldToCollationMap(path) = collationName(dt)
+      } else {
+        visitRecursively(dt, path)
+      }
+    }
+
+    visitRecursively(dataType, name)

Review Comment:
   Each column has separate metadata, so the map and the string column can never clash. Here is how it would look in JSON:
   
   ```json
   {
     "type": "struct",
     "fields": [
       {
         "name": "c1.key",
         "type": "string",
         "nullable": true,
         "metadata": {
           "__COLLATIONS": {
             "c1.key": "ICU.UNICODE_CI"
           }
         }
       },
       {
         "name": "c1",
         "type": {
           "type": "map",
           "keyType": "string",
           "valueType": "string",
           "valueContainsNull": true
         },
         "nullable": true,
         "metadata": {
           "__COLLATIONS": {
             "c1.value": "ICU.UNICODE",
             "c1.key": "ICU.UNICODE"
           }
         }
       }
     ]
   }
   ```
   
   "c1.key" means different things in each column: for the first it's just the entire column, and for the second it is the key of the map.






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601878212


##
python/pyspark/sql/types.py:
##
@@ -1702,9 +1791,16 @@ def _parse_datatype_json_string(json_string: str) -> DataType:
     return _parse_datatype_json_value(json.loads(json_string))
 
 
-def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType:
+def _parse_datatype_json_value(
+    json_value: Union[dict, str],
+    fieldPath: str = "",
+    collationsMap: Optional[Dict[str, str]] = None,
+) -> DataType:
     if not isinstance(json_value, dict):
         if json_value in _all_atomic_types.keys():
+            if collationsMap is not None and fieldPath in collationsMap:
+                collationName = collationsMap[fieldPath].split(".")[1]

Review Comment:
   sure!






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601839139


##
python/pyspark/sql/types.py:
##
@@ -876,30 +894,86 @@ def __init__(
         self.dataType = dataType
         self.nullable = nullable
         self.metadata = metadata or {}
+        self._collationMetadata: Optional[Dict[str, str]] = None
 
     def simpleString(self) -> str:
         return "%s:%s" % (self.name, self.dataType.simpleString())
 
     def __repr__(self) -> str:
         return "StructField('%s', %s, %s)" % (self.name, self.dataType, str(self.nullable))
 
+    def __eq__(self, other: Any) -> bool:
+        # since collationMetadata is lazily evaluated, we should not use it in the equality check

Review Comment:
   I agree, doing this for every `json` call is annoying, but since we don't have the safety of immutable data like in Scala, perhaps we should avoid lazy evaluation.
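   A small illustration of the hazard (hypothetical sequence; it relies on Python fields being mutable):

   ```python
   field = StructField("c1", StringType("UNICODE"))
   field.json()                   # suppose this lazily computed and cached collation metadata
   field.dataType = StringType()  # mutation is possible in Python, unlike Scala case classes
   field.json()                   # a cached value would now report a stale collation
   ```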







Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601793860


##
python/pyspark/sql/tests/test_types.py:
##
@@ -549,6 +549,129 @@ def test_convert_list_to_str(self):
         self.assertEqual(df.count(), 1)
         self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+    def test_schema_with_collations_json_ser_de(self):
+        from pyspark.sql.types import _parse_datatype_json_string
+
+        unicode_collation = "UNICODE"
+
+        simple_struct = StructType([StructField("c1", StringType(unicode_collation))])
+
+        nested_struct = StructType([StructField("nested", simple_struct)])
+
+        array_in_schema = StructType(
+            [StructField("array", ArrayType(StringType(unicode_collation)))]
+        )
+
+        map_in_schema = StructType(
+            [
+                StructField(
+                    "map", MapType(StringType(unicode_collation), StringType(unicode_collation))
+                )
+            ]
+        )
+
+        array_in_map = StructType(
+            [
+                StructField(
+                    "arrInMap",
+                    MapType(
+                        StringType(unicode_collation), ArrayType(StringType(unicode_collation))
+                    ),
+                )
+            ]
+        )
+
+        nested_array_in_map = StructType(
+            [
+                StructField(
+                    "nestedArrayInMap",
+                    ArrayType(
+                        MapType(
+                            StringType(unicode_collation),
+                            ArrayType(ArrayType(StringType(unicode_collation))),
+                        )
+                    ),
+                )
+            ]
+        )
+
+        schema_with_multiple_fields = StructType(
+            simple_struct.fields
+            + nested_struct.fields
+            + array_in_schema.fields
+            + map_in_schema.fields
+            + array_in_map.fields
+            + nested_array_in_map.fields
+        )
+
+        schemas = [
+            simple_struct,
+            nested_struct,
+            array_in_schema,
+            map_in_schema,
+            nested_array_in_map,
+            array_in_map,
+            schema_with_multiple_fields,
+        ]
+
+        for schema in schemas:
+            scala_datatype = self.spark._jsparkSession.parseDataType(schema.json())
+            python_datatype = _parse_datatype_json_string(scala_datatype.json())
+            assert schema == python_datatype
+            assert schema == _parse_datatype_json_string(schema.json())
+
+    def test_schema_with_collations_on_non_string_types(self):
+        from pyspark.sql.types import _parse_datatype_json_string, _COLLATIONS_METADATA_KEY
+
+        collations_on_int_col_json = f"""
+        {{
+          "type": "struct",
+          "fields": [
+            {{
+              "name": "c1",
+              "type": "integer",
+              "nullable": true,
+              "metadata": {{
+                "{_COLLATIONS_METADATA_KEY}": {{
+                  "c1": "icu.UNICODE"
+                }}
+              }}
+            }}
+          ]
+        }}
+        """
+
+        collations_in_map_json = f"""
+        {{
+          "type": "struct",
+          "fields": [
+            {{
+              "name": "mapField",
+              "type": {{
+                "type": "map",
+                "keyType": "string",
+                "valueType": "integer",
+                "valueContainsNull": true
+              }},
+              "nullable": true,
+              "metadata": {{
+                "{_COLLATIONS_METADATA_KEY}": {{
+                  "mapField.value": "icu.UNICODE"

Review Comment:
   We talked about this one a bit offline, but I would rather tackle this as a separate issue than treat it as just a collation protocol error. Currently, neither the Python nor the Scala code fails when encountering duplicate keys; Python will just pick one to put in the dictionary, and Scala will keep both in the `JObject`. What do you think @cloud-fan?
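   For reference, the Python side of that behavior comes straight from the standard library, where the last duplicate key wins:

   ```python
   import json

   json.loads('{"c1.key": "ICU.UNICODE", "c1.key": "ICU.UNICODE_CI"}')
   # -> {'c1.key': 'ICU.UNICODE_CI'}
   ```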






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601784552


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment:
   I thought that `Optional` is used in Java for method return types, and not for attributes, since those are "optional" anyway?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601770280


##
python/pyspark/sql/tests/test_types.py:
##
@@ -549,6 +549,129 @@ def test_convert_list_to_str(self):
 self.assertEqual(df.count(), 1)
 self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+def test_schema_with_collations_json_ser_de(self):
+from pyspark.sql.types import _parse_datatype_json_string
+
+unicode_collation = "UNICODE"
+
+simple_struct = StructType([StructField("c1", StringType(unicode_collation))])
+
+nested_struct = StructType([StructField("nested", simple_struct)])
+
+array_in_schema = StructType(
+[StructField("array", ArrayType(StringType(unicode_collation)))]
+)
+
+map_in_schema = StructType(
+[
+StructField(
+"map", MapType(StringType(unicode_collation), 
StringType(unicode_collation))
+)
+]
+)
+
+array_in_map = StructType(
+[
+StructField(
+"arrInMap",
+MapType(
+StringType(unicode_collation), ArrayType(StringType(unicode_collation))
+),
+)
+]
+)
+
+nested_array_in_map = StructType(
+[
+StructField(
+"nestedArrayInMap",
+ArrayType(
+MapType(
+StringType(unicode_collation),
+ArrayType(ArrayType(StringType(unicode_collation))),
+)
+),
+)
+]
+)
+
+schema_with_multiple_fields = StructType(
+simple_struct.fields
++ nested_struct.fields
++ array_in_schema.fields
++ map_in_schema.fields
++ array_in_map.fields
++ nested_array_in_map.fields
+)
+
+schemas = [
+simple_struct,
+nested_struct,
+array_in_schema,
+map_in_schema,
+nested_array_in_map,
+array_in_map,
+schema_with_multiple_fields,
+]
+
+for schema in schemas:
+scala_datatype = self.spark._jsparkSession.parseDataType(schema.json())

Review Comment:
   aren't we doing that in the following line `python_datatype = 
_parse_datatype_json_string(scala_datatype.json())`?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601767087


##
sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala:
##
@@ -606,4 +612,181 @@ class StructTypeSuite extends SparkFunSuite with 
SQLHelper {
 "b STRING NOT NULL,c STRING COMMENT 'nullable comment'")
 assert(fromDDL(struct.toDDL) === struct)
   }
+
+  test("simple struct with collations to json") {
+val simpleStruct = StructType(
+  StructField("c1", StringType(UNICODE_COLLATION)) :: Nil)
+
+val expectedJson =
+  s"""
+ |{
+ |  "type": "struct",
+ |  "fields": [
+ |{
+ |  "name": "c1",

Review Comment:
   there are already tests with dots in `DataTypeSuite` because we care about deserialization there as well, which is why I didn't put them here. I also added tests there for case sensitivity and special characters.






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601765377


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -208,22 +206,33 @@ object DataType {
   }
 
   // NOTE: Map fields must be sorted in alphabetical order to keep consistent 
with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+  json: JValue,
+  fieldPath: String = "",

Review Comment:
   there can be no ambiguity with the dots in the names; whether the path is stored as a string or as a list makes no difference
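
   For concreteness, a sketch of the worst case (using this PR's `StringType(collation)` constructor): both fields below produce the path string `a.key`, but each path lives in its own `StructField`'s metadata, so the two can never collide.

   ```python
   from pyspark.sql.types import StructType, StructField, MapType, IntegerType, StringType

   schema = StructType([
       # A string column literally named "a.key" -> path "a.key" in its metadata.
       StructField("a.key", StringType("UNICODE")),
       # A map column named "a" with a collated key -> path "a.key" in its own metadata.
       StructField("a", MapType(StringType("UNICODE"), IntegerType())),
   ])
   ```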






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601763659


##
python/pyspark/sql/types.py:
##
@@ -263,21 +263,30 @@ def __init__(self, collation: Optional[str] = None):
 def fromCollationId(self, collationId: int) -> "StringType":
 return StringType(StringType.collationNames[collationId])
 
-def collationIdToName(self) -> str:
-if self.collationId == 0:
-return ""
-else:
-return " collate %s" % StringType.collationNames[self.collationId]
+@classmethod
+def collationIdToName(cls, collationId: int) -> str:
+return StringType.collationNames[collationId]
 
 @classmethod
 def collationNameToId(cls, collationName: str) -> int:
 return StringType.collationNames.index(collationName)
 
+@classmethod
+def collationProvider(cls, collationName: str) -> str:
+if collationName.startswith("UTF8"):

Review Comment:
   I think that we will probably have to revamp this class completely to support all collations anyway, so it will probably look more like the Scala side, am I correct @nikolamand-db?
   
   That's why I think we should be fine with this for now, but I've added a `TODO` to do it properly.
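
   For reference, a standalone sketch of the heuristic the diff above encodes (the function name here is just for illustration):

   ```python
   def collation_provider(collation_name: str) -> str:
       # UTF8-prefixed collations (UTF8_BINARY, ...) are Spark-native; the rest are ICU.
       return "spark" if collation_name.startswith("UTF8") else "icu"

   assert collation_provider("UTF8_BINARY") == "spark"
   assert collation_provider("UNICODE") == "icu"
   assert collation_provider("UNICODE_CI") == "icu"
   ```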






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


olaky commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601024762


##
python/pyspark/sql/tests/test_types.py:
##
@@ -549,6 +549,129 @@ def test_convert_list_to_str(self):
 self.assertEqual(df.count(), 1)
 self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+def test_schema_with_collations_json_ser_de(self):
+from pyspark.sql.types import _parse_datatype_json_string
+
+unicode_collation = "UNICODE"
+
+simple_struct = StructType([StructField("c1", StringType(unicode_collation))])
+
+nested_struct = StructType([StructField("nested", simple_struct)])
+
+array_in_schema = StructType(
+[StructField("array", ArrayType(StringType(unicode_collation)))]
+)
+
+map_in_schema = StructType(
+[
+StructField(
+"map", MapType(StringType(unicode_collation), 
StringType(unicode_collation))
+)
+]
+)
+
+array_in_map = StructType(
+[
+StructField(
+"arrInMap",
+MapType(
+StringType(unicode_collation), ArrayType(StringType(unicode_collation))
+),
+)
+]
+)
+
+nested_array_in_map = StructType(

Review Comment:
   ```suggestion
   nested_array_in_map_value = StructType(
   ```



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined 
collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+public final String provider;
+public final String name;
+public final String version;
+
+public CollationIdentifier(String provider, String collationName, String version) {
+  this.provider = provider;
+  this.name = collationName;
+  this.version = version;
+}
+
+public static CollationIdentifier fromString(String identifier) {
+  String[] parts = identifier.split("\\.", 3);
+  return new CollationIdentifier(parts[0], parts[1], parts[2]);
+}
+
+@Override
+public String toString() {
+  return String.format("%s.%s.%s", provider, name, version);
+}
+
+/**
+ * Returns the identifier's string value without the version.
+ */
+public String valueWithoutVersion() {

Review Comment:
   ```suggestion
   public String toStringWithoutVersion() {
   ```



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined 
collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+public final String provider;
+public final String name;
+public final String version;

Review Comment:
   How are we representing identifiers without version? Should we make this 
Optional[String]? The provider might also be optional, but this will make 
parsing a lot more difficult



##
python/pyspark/sql/tests/test_types.py:
##
@@ -549,6 +549,129 @@ def test_convert_list_to_str(self):
 self.assertEqual(df.count(), 1)
 self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+def test_schema_with_collations_json_ser_de(self):
+from pyspark.sql.types import _parse_datatype_json_string
+
+unicode_collation = "UNICODE"
+
+simple_struct = StructType([StructField("c1", StringType(unicode_collation))])
+
+nested_struct = StructType([StructField("nested", simple_struct)])
+
+array_in_schema = StructType(
+[StructField("array", ArrayType(StringType(unicode_collation)))]
+)
+
+map_in_schema = StructType(
+[
+StructField(
+"map", MapType(StringType(unicode_collation), 
StringType(unicode_collation))
+)
+]
+)
+
+array_in_map = StructType(
+[
+StructField(
+"arrInMap",
+MapType(
+StringType(unicode_collation), ArrayType(StringType(unicode_collation))
+),
+)
+]
+)
+
+nested_array_in_map = StructType(
+[
+StructField(
+"nestedArrayInMap",
+ArrayType(
+MapType(
+StringType(unicode_collation),

Review Comment:
   What about nested collations in map keys?
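
   (For concreteness, a schema that would exercise that case, assuming this PR's `StringType(collation)` constructor; under the path scheme here the inner collated key would be addressed as `nestedMap.key.key`:)

   ```python
   from pyspark.sql.types import StructType, StructField, MapType, IntegerType, StringType

   nested_key_schema = StructType([
       StructField(
           "nestedMap",
           MapType(
               MapType(StringType("UNICODE"), IntegerType()),  # collated key inside the map key
               IntegerType(),
           ),
       )
   ])
   ```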



##

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-13 Thread via GitHub


cstavr commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1598916467


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,42 @@
  * Provides functionality to the UTF8String object which respects defined 
collation settings.
  */
 public final class CollationFactory {
+
+  public static class CollationIdentifier {
+public final String provider;
+public final String name;
+public final String version;
+
+public CollationIdentifier(String provider, String collationName, String version) {
+  this.provider = provider;
+  this.name = collationName;
+  this.version = version;
+}
+
+public static CollationIdentifier fromString(String identifier) {
+  String[] parts = identifier.split("\\.", 3);
+  return new CollationIdentifier(parts[0], parts[1], parts[2]);
+}
+
+@Override
+public String toString() {
+  return String.format("%s.%s.%s", provider, name, version);
+}
+
+/**
+ * Returns

Review Comment:
   ?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-13 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1598823433


##
python/pyspark/sql/types.py:
##
@@ -1702,9 +1791,16 @@ def _parse_datatype_json_string(json_string: str) -> 
DataType:
 return _parse_datatype_json_value(json.loads(json_string))
 
 
-def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType:
+def _parse_datatype_json_value(
+json_value: Union[dict, str],
+fieldPath: str = "",
+collationsMap: Optional[Dict[str, str]] = None,
+) -> DataType:
 if not isinstance(json_value, dict):
 if json_value in _all_atomic_types.keys():
+if collationsMap is not None and fieldPath in collationsMap:
+collationName = collationsMap[fieldPath].split(".")[1]

Review Comment:
   In Spark we only use the name to identify collations and don't need the provider at all. That is why I asked whether the provider could be optional in the Delta protocol, but the issue would then be parsing the identifier in that case
   
   https://github.com/delta-io/delta/pull/3068#discussion_r1593742078
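
   A minimal sketch of the identifier handling being discussed (mirroring the diff's `collationsMap[fieldPath].split(".")[1]`): the name sits between the provider and the optional version, so extracting it works for both forms.

   ```python
   def collation_name(identifier: str) -> str:
       # "provider.name" or "provider.name.version" -- Spark only needs the name.
       return identifier.split(".")[1]

   assert collation_name("icu.UNICODE") == "UNICODE"
   assert collation_name("icu.UNICODE.75.1") == "UNICODE"
   ```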






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-13 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1598821762


##
python/pyspark/sql/types.py:
##
@@ -1756,6 +1854,15 @@ def _parse_datatype_json_value(json_value: Union[dict, 
str]) -> DataType:
 )
 
 
+def _parse_type_with_collation(
+json_value: Dict[str, Any], fieldPath: str, collationsMap: Optional[Dict[str, str]]
+) -> DataType:

Review Comment:
   yes, added now!






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-13 Thread via GitHub


cstavr commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1598693494


##
python/pyspark/sql/types.py:
##
@@ -1702,9 +1791,16 @@ def _parse_datatype_json_string(json_string: str) -> 
DataType:
 return _parse_datatype_json_value(json.loads(json_string))
 
 
-def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType:
+def _parse_datatype_json_value(
+json_value: Union[dict, str],
+fieldPath: str = "",
+collationsMap: Optional[Dict[str, str]] = None,
+) -> DataType:
 if not isinstance(json_value, dict):
 if json_value in _all_atomic_types.keys():
+if collationsMap is not None and fieldPath in collationsMap:
+collationName = collationsMap[fieldPath].split(".")[1]

Review Comment:
   So we are using only the name? What about the rest of the parts?



##
python/pyspark/sql/types.py:
##
@@ -263,21 +263,31 @@ def __init__(self, collation: Optional[str] = None):
 def fromCollationId(self, collationId: int) -> "StringType":
 return StringType(StringType.collationNames[collationId])
 
-def collationIdToName(self) -> str:
-if self.collationId == 0:
-return ""
-else:
-return " collate %s" % StringType.collationNames[self.collationId]
+@classmethod
+def collationIdToName(cls, collationId: int) -> str:
+return StringType.collationNames[collationId]
 
 @classmethod
 def collationNameToId(cls, collationName: str) -> int:
 return StringType.collationNames.index(collationName)
 
+@classmethod
+def collationProvider(cls, collationName: str) -> str:
+if collationName.startswith("UTF8"):
+return "spark"
+return "icu"
+
 def simpleString(self) -> str:
-return "string" + self.collationIdToName()
+return (
+"string"
+if self.isUTF8BinaryCollation()
+else "string collate " + self.collationIdToName(self.collationId)
+)

Review Comment:
   nit: Since this is multi-line anyway, I think that standard if/else block is 
better here.



##
python/pyspark/sql/types.py:
##
@@ -1756,6 +1854,15 @@ def _parse_datatype_json_value(json_value: Union[dict, 
str]) -> DataType:
 )
 
 
+def _parse_type_with_collation(
+json_value: Dict[str, Any], fieldPath: str, collationsMap: Optional[Dict[str, str]]
+) -> DataType:
+if collationsMap and fieldPath in collationsMap:
+collationName = collationsMap[fieldPath].split(".")[1]

Review Comment:
   ditto



##
python/pyspark/sql/types.py:
##
@@ -1756,6 +1854,15 @@ def _parse_datatype_json_value(json_value: Union[dict, 
str]) -> DataType:
 )
 
 
+def _parse_type_with_collation(
+json_value: Dict[str, Any], fieldPath: str, collationsMap: Optional[Dict[str, str]]
+) -> DataType:

Review Comment:
   Don't we also need the string type check here?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-13 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1598434466


##
python/pyspark/sql/types.py:
##
@@ -876,30 +894,86 @@ def __init__(
 self.dataType = dataType
 self.nullable = nullable
 self.metadata = metadata or {}
+self._collationMetadata: Optional[Dict[str, str]] = None
 
 def simpleString(self) -> str:
 return "%s:%s" % (self.name, self.dataType.simpleString())
 
 def __repr__(self) -> str:
 return "StructField('%s', %s, %s)" % (self.name, self.dataType, 
str(self.nullable))
 
+def __eq__(self, other: Any) -> bool:
+# since collationMetadata is lazy evaluated we should not use it in 
equality check

Review Comment:
   I would argue that using `self.__dict__ == other.__dict__` is even more dangerous, and yet that is the current implementation of equals in all data types. This was only done so that the asserts in the tests pass.
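
   A hypothetical sketch (not this PR's exact code) of the alternative: excluding the lazily computed cache from a `__dict__`-based equality check, so an instance that has materialized its metadata still equals one that has not.

   ```python
   from typing import Any, Dict, Optional

   class Field:
       def __init__(self, name: str) -> None:
           self.name = name
           self._collationMetadata: Optional[Dict[str, str]] = None  # lazy cache

       def __eq__(self, other: Any) -> bool:
           if not isinstance(other, Field):
               return False
           # Compare everything except the lazy cache.
           exclude = {"_collationMetadata"}
           left = {k: v for k, v in self.__dict__.items() if k not in exclude}
           right = {k: v for k, v in other.__dict__.items() if k not in exclude}
           return left == right
   ```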



Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-13 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1598430811


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,60 @@ case class StructField(
 ("name" -> name) ~
   ("type" -> dataType.jsonValue) ~
   ("nullable" -> nullable) ~
-  ("metadata" -> metadata.jsonValue)
+  ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+val metadataJsonValue = metadata.jsonValue
+metadataJsonValue match {
+  case JObject(fields) if collationMetadata.nonEmpty =>
+val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList
+JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields)))
+
+  case _ => metadataJsonValue
+}
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: mutable.Map[String, String] = {
+val fieldToCollationMap = mutable.Map[String, String]()
+
+def visitRecursively(dt: DataType, path: String): Unit = dt match {
+  case at: ArrayType =>
+processDataType(at.elementType, path + ".element")
+
+  case mt: MapType =>
+processDataType(mt.keyType, path + ".key")
+processDataType(mt.valueType, path + ".value")
+
+  case st: StringType if isCollatedString(st) =>
+fieldToCollationMap(path) = collationName(st)
+
+  case _ =>
+}
+
+def processDataType(dt: DataType, path: String): Unit = {
+  if (isCollatedString(dt)) {
+fieldToCollationMap(path) = collationName(dt)
+  } else {
+visitRecursively(dt, path)
+  }
+}
+
+visitRecursively(dataType, name)
+fieldToCollationMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+case st: StringType => !st.isUTF8BinaryCollation
+case _ => false
+  }
+
+  private def collationName(dt: DataType): String = dt match {
+case st: StringType =>

Review Comment:
   Casting to a specific `StringType(collationId)` is kind of ugly, so I would rather avoid it. However, even though we only call this method when we know it's a string type, I added an exception to the match clause to make the error more explicit.
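
   Roughly the shape of the change, as a Python sketch (the Scala version raises from the default case of the `match`):

   ```python
   from pyspark.sql.types import StringType

   def collation_name(dt) -> str:
       if isinstance(dt, StringType):
           return StringType.collationNames[dt.collationId]
       # Explicit error instead of a silent cast when a non-string type slips in.
       raise TypeError("expected a collated string type, got %s" % type(dt).__name__)
   ```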



Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-13 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1598427469


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -274,6 +288,39 @@ object DataType {
   messageParameters = Map("other" -> compact(render(other
   }
 
+  /**
+   * Checks if the current field is in the collation map, and if it is it 
returns
+   * a StringType with the given collation. Otherwise, it further parses its 
type.
+   */
+  private def resolveType(
+  json: JValue,
+  path: String,
+  collationsMap: Map[String, String]): DataType = {
+collationsMap.get(path) match {
+  case Some(collation) => stringTypeWithCollation(collation)

Review Comment:
   good catch! Added new error class and tests for that case
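
   A sketch of the check the new error class guards on the deserialization path (names here are illustrative, not the PR's exact code): a collations-map entry is only legal when the underlying JSON type is the plain `"string"` atom.

   ```python
   def resolve_type(json_value, path, collations_map):
       if path in collations_map and json_value != "string":
           raise ValueError(
               "collation %s declared for non-string type %r at path %s"
               % (collations_map[path], json_value, path)
           )
       return json_value, collations_map.get(path)

   # The non-string case covered by the new tests:
   # resolve_type("integer", "c1", {"c1": "icu.UNICODE"})  -> raises ValueError
   ```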






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-13 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1598426717


##
python/pyspark/sql/tests/test_types.py:
##
@@ -549,6 +549,76 @@ def test_convert_list_to_str(self):
 self.assertEqual(df.count(), 1)
 self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+def test_schema_with_collations_json_ser_de(self):
+from pyspark.sql.types import _parse_datatype_json_string
+
+unicode_collation = "UNICODE"
+
+simple_struct = StructType([StructField("c1", 
StringType(unicode_collation))])
+
+nested_struct = StructType([StructField("nested", simple_struct)])
+
+array_in_schema = StructType(
+[StructField("array", ArrayType(StringType(unicode_collation)))]
+)
+
+map_in_schema = StructType(
+[
+StructField(
+"map", MapType(StringType(unicode_collation), 
StringType(unicode_collation))
+)
+]
+)
+
+array_in_map_in_nested_schema = StructType(

Review Comment:
   you are right we don't need that






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-13 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1598424420


##
python/pyspark/sql/tests/test_types.py:
##
@@ -549,6 +549,76 @@ def test_convert_list_to_str(self):
 self.assertEqual(df.count(), 1)
 self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+def test_schema_with_collations_json_ser_de(self):
+from pyspark.sql.types import _parse_datatype_json_string
+
+unicode_collation = "UNICODE"
+
+simple_struct = StructType([StructField("c1", StringType(unicode_collation))])
+
+nested_struct = StructType([StructField("nested", simple_struct)])
+
+array_in_schema = StructType(
+[StructField("array", ArrayType(StringType(unicode_collation)))]
+)
+
+map_in_schema = StructType(
+[
+StructField(
+"map", MapType(StringType(unicode_collation), 
StringType(unicode_collation))
+)
+]
+)
+
+array_in_map_in_nested_schema = StructType(
+[
+StructField(
+"arrInMap",
+MapType(
+StringType(unicode_collation), ArrayType(StringType(unicode_collation))
+),
+)
+]
+)
+
+nested_array_in_map = StructType(
+[
+StructField(
+"nestedArrayInMap",
+ArrayType(
+MapType(
+StringType(unicode_collation),
+ArrayType(ArrayType(StringType(unicode_collation))),
+)
+),
+)
+]
+)
+
+schema_with_multiple_fields = StructType(
+simple_struct.fields
++ nested_struct.fields
++ array_in_schema.fields
++ map_in_schema.fields
++ array_in_map_in_nested_schema.fields
++ nested_array_in_map.fields
+)
+
+schemas = [
+simple_struct,
+nested_struct,
+array_in_schema,
+map_in_schema,
+nested_array_in_map,
+array_in_map_in_nested_schema,
+schema_with_multiple_fields,
+]
+
+for schema in schemas:
+scala_datatype = self.spark._jsparkSession.parseDataType(schema.json())
+python_datatype = _parse_datatype_json_string(scala_datatype.json())
+assert schema == python_datatype

Review Comment:
   `scala_datatype` is a JVM object, so we can't just compare it to the `schema`






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-13 Thread via GitHub


cstavr commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1597674391


##
python/pyspark/sql/types.py:
##
@@ -263,21 +263,23 @@ def __init__(self, collation: Optional[str] = None):
 def fromCollationId(self, collationId: int) -> "StringType":
 return StringType(StringType.collationNames[collationId])
 
-def collationIdToName(self) -> str:
-if self.collationId == 0:
-return ""
-else:
-return " collate %s" % StringType.collationNames[self.collationId]
+@classmethod
+def collationIdToName(cls, collationId: int) -> str:
+return StringType.collationNames[collationId]
 
 @classmethod
 def collationNameToId(cls, collationName: str) -> int:
 return StringType.collationNames.index(collationName)
 
 def simpleString(self) -> str:
-return "string" + self.collationIdToName()
+return (
+"string"
+if self.isUTF8BinaryCollation()
+else "string collate" + self.collationIdToName(self.collationId)

Review Comment:
   ```suggestion
   else "string collate " + self.collationIdToName(self.collationId)
   ```



##
python/pyspark/sql/tests/test_types.py:
##
@@ -549,6 +549,76 @@ def test_convert_list_to_str(self):
 self.assertEqual(df.count(), 1)
 self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+def test_schema_with_collations_json_ser_de(self):
+from pyspark.sql.types import _parse_datatype_json_string
+
+unicode_collation = "UNICODE"
+
+simple_struct = StructType([StructField("c1", StringType(unicode_collation))])
+
+nested_struct = StructType([StructField("nested", simple_struct)])
+
+array_in_schema = StructType(
+[StructField("array", ArrayType(StringType(unicode_collation)))]
+)
+
+map_in_schema = StructType(
+[
+StructField(
+"map", MapType(StringType(unicode_collation), 
StringType(unicode_collation))
+)
+]
+)
+
+array_in_map_in_nested_schema = StructType(

Review Comment:
   What does in "in nested schema" mean here?
   
   ```suggestion
   array_in_map = StructType(
   ```



##
python/pyspark/sql/tests/test_types.py:
##
@@ -549,6 +549,76 @@ def test_convert_list_to_str(self):
 self.assertEqual(df.count(), 1)
 self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+def test_schema_with_collations_json_ser_de(self):
+from pyspark.sql.types import _parse_datatype_json_string
+
+unicode_collation = "UNICODE"
+
+simple_struct = StructType([StructField("c1", StringType(unicode_collation))])
+
+nested_struct = StructType([StructField("nested", simple_struct)])
+
+array_in_schema = StructType(
+[StructField("array", ArrayType(StringType(unicode_collation)))]
+)
+
+map_in_schema = StructType(
+[
+StructField(
+"map", MapType(StringType(unicode_collation), 
StringType(unicode_collation))
+)
+]
+)
+
+array_in_map_in_nested_schema = StructType(
+[
+StructField(
+"arrInMap",
+MapType(
+StringType(unicode_collation), ArrayType(StringType(unicode_collation))
+),
+)
+]
+)
+
+nested_array_in_map = StructType(
+[
+StructField(
+"nestedArrayInMap",
+ArrayType(
+MapType(
+StringType(unicode_collation),
+ArrayType(ArrayType(StringType(unicode_collation))),
+)
+),
+)
+]
+)
+
+schema_with_multiple_fields = StructType(
+simple_struct.fields
++ nested_struct.fields
++ array_in_schema.fields
++ map_in_schema.fields
++ array_in_map_in_nested_schema.fields
++ nested_array_in_map.fields
+)
+
+schemas = [
+simple_struct,
+nested_struct,
+array_in_schema,
+map_in_schema,
+nested_array_in_map,
+array_in_map_in_nested_schema,
+schema_with_multiple_fields,
+]
+
+for schema in schemas:
+scala_datatype = self.spark._jsparkSession.parseDataType(schema.json())
+python_datatype = _parse_datatype_json_string(scala_datatype.json())
+assert schema == python_datatype

Review Comment:
   ```suggestion
   scala_datatype = self.spark._jsparkSession.parseDataType(schema.json())
   assert schema == 

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-09 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1595221619


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,60 @@ case class StructField(
 ("name" -> name) ~
   ("type" -> dataType.jsonValue) ~
   ("nullable" -> nullable) ~
-  ("metadata" -> metadata.jsonValue)
+  ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+val metadataJsonValue = metadata.jsonValue
+metadataJsonValue match {
+  case JObject(fields) if collationMetadata.nonEmpty =>
+val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList
+JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields)))
+
+  case _ => metadataJsonValue
+}
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: mutable.Map[String, String] = {
+val fieldToCollationMap = mutable.Map[String, String]()
+
+def visitRecursively(dt: DataType, path: String): Unit = dt match {
+  case at: ArrayType =>
+processDataType(at.elementType, path + ".element")
+
+  case mt: MapType =>
+processDataType(mt.keyType, path + ".key")
+processDataType(mt.valueType, path + ".value")
+
+  case st: StringType if isCollatedString(st) =>
+fieldToCollationMap(path) = collationName(st)
+
+  case _ =>
+}
+
+def processDataType(dt: DataType, path: String): Unit = {
+  if (isCollatedString(dt)) {
+fieldToCollationMap(path) = collationName(dt)
+  } else {
+visitRecursively(dt, path)
+  }
+}
+
+visitRecursively(dataType, name)

Review Comment:
   I added a test to be sure this works, but I don't see why we would need any special logic for this case.
   
   Even if we have a column `name.key`, this will never be ambiguous in the metadata: a struct field is either a string or a map, never both, so parsing just works and we can never end up adding the collation twice.



Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-09 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1595169435


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,60 @@ case class StructField(
 ("name" -> name) ~
   ("type" -> dataType.jsonValue) ~
   ("nullable" -> nullable) ~
-  ("metadata" -> metadata.jsonValue)
+  ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+val metadataJsonValue = metadata.jsonValue
+metadataJsonValue match {
+  case JObject(fields) if collationMetadata.nonEmpty =>
+val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList
+JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields)))
+
+  case _ => metadataJsonValue
+}
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: mutable.Map[String, String] = {
+val fieldToCollationMap = mutable.Map[String, String]()
+
+def visitRecursively(dt: DataType, path: String): Unit = dt match {
+  case at: ArrayType =>
+processDataType(at.elementType, path + ".element")
+
+  case mt: MapType =>
+processDataType(mt.keyType, path + ".key")
+processDataType(mt.valueType, path + ".value")
+
+  case st: StringType if isCollatedString(st) =>
+fieldToCollationMap(path) = collationName(st)
+
+  case _ =>
+}
+
+def processDataType(dt: DataType, path: String): Unit = {
+  if (isCollatedString(dt)) {
+fieldToCollationMap(path) = collationName(dt)
+  } else {
+visitRecursively(dt, path)
+  }
+}
+
+visitRecursively(dataType, name)

Review Comment:
   This is per-field metadata, so do we really need to put the field name in the path key? My worry is that the field name can contain special chars, e.g. `name.key`, which may cause issues without proper quoting. How about we use an empty string as the path key for the top-level field?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-08 Thread via GitHub


stefankandic commented on PR #46280:
URL: https://github.com/apache/spark/pull/46280#issuecomment-2100170222

   @cloud-fan please take a look when you have the time

