Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-19 Thread via GitHub


nikolamand-db closed pull request #45856: [SPARK-47297][SQL] Add collations 
support to split regex expression
URL: https://github.com/apache/spark/pull/45856





Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-19 Thread via GitHub


nikolamand-db commented on PR #45856:
URL: https://github.com/apache/spark/pull/45856#issuecomment-2066549783

   Closing as we have a new approach for all regex functions: https://github.com/apache/spark/pull/46077.





Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


nikolamand-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1566339138


##
sql/core/src/test/scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala:
##
@@ -116,30 +116,39 @@ class CollationRegexpExpressionsSuite
 
   test("Support StringSplit string expression with collation") {
 // Supported collations
-case class StringSplitTestCase[R](l: String, r: String, c: String, result: R)
+case class StringSplitTestCase[R](l: String, r: String, c: String, result: R, limit: Int = -1)
 val testCases = Seq(
-  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C"))
+  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY", Seq("ABC")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C")),
+  StringSplitTestCase("AAA", "[a]", "UTF8_BINARY_LCASE", Seq("", "", "", "")),
+  StringSplitTestCase("AAA", "[b]", "UTF8_BINARY_LCASE", Seq("AAA")),
+  StringSplitTestCase("aAbB", "[ab]", "UTF8_BINARY_LCASE", Seq("", "", "", "", "")),

Review Comment:
   Rewritten tests, please check.






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


nikolamand-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565992929


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##
@@ -543,25 +543,40 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
+  override def dataType: DataType = ArrayType(str.dataType, containsNull = false)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeBinaryLcase, StringTypeAnyCollation, IntegerType)

Review Comment:
   This is to support the session-level default collation. If the user changes it and passes a regex string literal, it will be interpreted as a collated string. We don't want to throw an exception in such cases.
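
   A minimal sketch of the scenario described above, in Scala; the config key name is an assumption used only for illustration, not something defined in this PR:

   // With a session-level default collation in place, a bare string literal such as
   // '[x]' is resolved with that collation, so the regex slot has to accept any
   // collation instead of failing analysis. The pattern is still treated as a plain
   // regex, so its collation does not influence the result.
   spark.sql("SET spark.sql.session.collation.default = UTF8_BINARY_LCASE")  // assumed config key
   spark.sql("SELECT split('aXbXc', '[x]')").show()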






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565989321


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -143,6 +143,40 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
* Collation-aware regexp expressions.
*/
 
+  public static class StringSplit {
+public static UTF8String[] exec(final UTF8String string, final UTF8String regex,
+final int limit, final int collationId) {
+  CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+  if (collation.supportsBinaryEquality) {
+return execBinary(string, regex, limit);
+  } else {
+return execLowercase(string, regex, limit);

Review Comment:
   I think `final lazy val collationId: Int = str.dataType.asInstanceOf[StringType].collationId` for `StringTypeBinaryLcase` won't allow that to happen. But an additional assert in the else branch couldn't hurt either.






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565979015


##
sql/core/src/test/scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala:
##
@@ -116,30 +116,39 @@ class CollationRegexpExpressionsSuite
 
   test("Support StringSplit string expression with collation") {
 // Supported collations
-case class StringSplitTestCase[R](l: String, r: String, c: String, result: R)
+case class StringSplitTestCase[R](l: String, r: String, c: String, result: R, limit: Int = -1)
 val testCases = Seq(
-  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C"))
+  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY", Seq("ABC")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C")),
+  StringSplitTestCase("AAA", "[a]", "UTF8_BINARY_LCASE", Seq("", "", "", "")),
+  StringSplitTestCase("AAA", "[b]", "UTF8_BINARY_LCASE", Seq("AAA")),
+  StringSplitTestCase("aAbB", "[ab]", "UTF8_BINARY_LCASE", Seq("", "", "", "", "")),

Review Comment:
   please write unit tests for all corner cases, instead of end-to-end tests
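
   A minimal sketch of what such a unit-level check could look like, calling the collation-aware split from this PR directly instead of going through a SQL query (test values taken from the cases above; the exact suite and placement are assumptions):

   import org.apache.spark.sql.catalyst.util.{CollationFactory, CollationSupport}
   import org.apache.spark.unsafe.types.UTF8String

   // Resolve the collation id once, then exercise CollationSupport.StringSplit directly.
   val lcaseId = CollationFactory.collationNameToId("UTF8_BINARY_LCASE")
   val parts = CollationSupport.StringSplit.exec(
     UTF8String.fromString("aAbB"), UTF8String.fromString("[ab]"), -1, lcaseId)
   // A case-insensitive split on [ab] leaves only empty segments between the matches.
   assert(parts.map(_.toString).toSeq == Seq("", "", "", "", ""))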






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565976473


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##
@@ -543,25 +543,40 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
+  override def dataType: DataType = ArrayType(str.dataType, containsNull = false)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeBinaryLcase, StringTypeAnyCollation, IntegerType)

Review Comment:
   +1, why would we allow any collation for the regex string?






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565975383


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -143,6 +143,40 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
* Collation-aware regexp expressions.
*/
 
+  public static class StringSplit {
+public static UTF8String[] exec(final UTF8String string, final UTF8String regex,
+final int limit, final int collationId) {
+  CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+  if (collation.supportsBinaryEquality) {
+return execBinary(string, regex, limit);
+  } else {
+return execLowercase(string, regex, limit);

Review Comment:
   If it's not supported yet, we should add 
`assert(collation.supportsLowercaseEquality)`
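
   A small sketch of the suggested guard, written in Scala for illustration only (the actual dispatch lives in the Java CollationSupport.StringSplit.exec quoted above):

   import org.apache.spark.sql.catalyst.util.{CollationFactory, CollationSupport}
   import org.apache.spark.unsafe.types.UTF8String

   // Mirrors exec above: collations that support neither binary nor lowercase
   // equality should fail the assertion rather than silently taking the lowercase path.
   def splitWithGuard(string: UTF8String, regex: UTF8String, limit: Int, collationId: Int): Array[UTF8String] = {
     val collation = CollationFactory.fetchCollation(collationId)
     if (collation.supportsBinaryEquality) {
       CollationSupport.StringSplit.execBinary(string, regex, limit)
     } else {
       assert(collation.supportsLowercaseEquality)  // ICU collations are not supported here yet
       CollationSupport.StringSplit.execLowercase(string, regex, limit)
     }
   }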






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565974429


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -143,6 +143,40 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
* Collation-aware regexp expressions.
*/
 
+  public static class StringSplit {
+public static UTF8String[] exec(final UTF8String string, final UTF8String regex,
+final int limit, final int collationId) {
+  CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+  if (collation.supportsBinaryEquality) {
+return execBinary(string, regex, limit);
+  } else {
+return execLowercase(string, regex, limit);

Review Comment:
   there is no `execICU` for this operation? 






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565835313


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -143,6 +143,41 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
* Collation-aware regexp expressions.
*/
 
+  public static class StringSplit {
+public static UTF8String[] exec(final UTF8String string, final UTF8String regex,
+final int limit, final int collationId) {
+  CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+  if (collation.supportsBinaryEquality) {
+return execBinary(string, regex, limit);
+  } else {
+return execLowercase(string, regex, limit);
+  }
+}
+public static String genCode(final String string, final String regex, final String limit,
+final int collationId) {
+  CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+  String expr = "CollationSupport.StringSplit.exec";
+  if (collation.supportsBinaryEquality) {
+return String.format(expr + "Binary(%s, %s, %s)", string, regex, limit);
+  } else {
+return String.format(expr + "Lowercase(%s, %s, %s)", string, regex, limit);
+  }
+}
+public static UTF8String[] execBinary(final UTF8String string, final UTF8String regex,
+final int limit) {
+  return string.split(regex, limit);
+}
+public static UTF8String[] execLowercase(final UTF8String string, final UTF8String regex,
+final int limit) {
+  if (string.numBytes() != 0 && regex.numBytes() == 0) {
+return string.split(regex, limit);
+  } else {
+// ui flags toggle unicode case-insensitive matching
+return string.split(UTF8String.fromString("(?ui)" + regex.toString()), limit);

Review Comment:
   I wonder if `UTF8String.concat(UTF8String.fromString("(?ui)"), regex)` would be better.
   It does allocate a new byte array either way, but I think we should use it instead.
   
   Maybe even add something like `CollationAwareUTF8String.getLowercaseRegex` that does that.
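
   A minimal sketch of such a helper, written in Scala for illustration (the suggested home, CollationAwareUTF8String, is a Java class, and the helper name comes from the comment above):

   import org.apache.spark.unsafe.types.UTF8String

   object LowercaseRegexSketch {
     // "(?ui)" turns on Unicode-aware, case-insensitive matching; concat prepends it
     // at the byte level and avoids the regex.toString() round trip.
     private val uiFlags = UTF8String.fromString("(?ui)")

     def getLowercaseRegex(regex: UTF8String): UTF8String =
       UTF8String.concat(uiFlags, regex)
   }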






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565828173


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -143,6 +143,41 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
* Collation-aware regexp expressions.
*/
 
+  public static class StringSplit {
+public static UTF8String[] exec(final UTF8String string, final UTF8String regex,
+final int limit, final int collationId) {
+  CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+  if (collation.supportsBinaryEquality) {
+return execBinary(string, regex, limit);
+  } else {
+return execLowercase(string, regex, limit);
+  }
+}
+public static String genCode(final String string, final String regex, final String limit,
+final int collationId) {
+  CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+  String expr = "CollationSupport.StringSplit.exec";
+  if (collation.supportsBinaryEquality) {
+return String.format(expr + "Binary(%s, %s, %s)", string, regex, limit);
+  } else {
+return String.format(expr + "Lowercase(%s, %s, %s)", string, regex, limit);
+  }
+}
+public static UTF8String[] execBinary(final UTF8String string, final UTF8String regex,
+final int limit) {
+  return string.split(regex, limit);
+}
+public static UTF8String[] execLowercase(final UTF8String string, final UTF8String regex,
+final int limit) {
+  if (string.numBytes() != 0 && regex.numBytes() == 0) {
+return string.split(regex, limit);
+  } else {
+// ui flags toggle unicode case-insensitive matching
+return string.split(UTF8String.fromString("(?ui)" + regex.toString()), limit);

Review Comment:
   do we really need to allocate a new string "regex.toString()" just to prepend "(?ui)"?






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


nikolamand-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565804085


##
sql/core/src/test/scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala:
##
@@ -116,26 +116,37 @@ class CollationRegexpExpressionsSuite
 
   test("Support StringSplit string expression with collation") {
 // Supported collations
-case class StringSplitTestCase[R](l: String, r: String, c: String, result: R)
+case class StringSplitTestCase[R](l: String, r: String, c: String, result: R, limit: Int = -1)
 val testCases = Seq(
-  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C"))
+  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY", Seq("ABC")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C")),
+  StringSplitTestCase("AAA", "[a]", "UTF8_BINARY_LCASE", Seq("", "", "", "")),
+  StringSplitTestCase("AAA", "[b]", "UTF8_BINARY_LCASE", Seq("AAA")),
+  StringSplitTestCase("aAbB", "[ab]", "UTF8_BINARY_LCASE", Seq("", "", "", "", "")),
+  StringSplitTestCase("", "", "UTF8_BINARY_LCASE", Seq("")),
+  StringSplitTestCase("", "[a]", "UTF8_BINARY_LCASE", Seq("")),
+  StringSplitTestCase("xAxBxaxbx", "[AB]", "UTF8_BINARY_LCASE", Seq("x", "x", "x", "x", "x")),
+  StringSplitTestCase("ABC", "", "UTF8_BINARY_LCASE", Seq("A", "B", "C")),
+  // test split with limit
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("ABC"), 1),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C"), 2),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C"), 3),
+  StringSplitTestCase("ABC", "[B]", "UNICODE", Seq("A", "C")),
+  StringSplitTestCase("ABC", "[b]", "UNICODE", Seq("ABC"))
 )
 testCases.foreach(t => {
-  val query = s"SELECT split(collate('${t.l}', '${t.c}'), collate('${t.r}', '${t.c}'))"
+  val query = s"SELECT split(collate('${t.l}', '${t.c}'), '${t.r}', ${t.limit})"
   // Result & data type
   checkAnswer(sql(query), Row(t.result))
   assert(sql(query).schema.fields.head.dataType.sameType(ArrayType(StringType(t.c
   // TODO: Implicit casting (not currently supported)

Review Comment:
   Removed both TODOs since string split doesn't have any custom collation cast logic, as the regex parameter's collation is irrelevant.






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565671721


##
sql/core/src/test/scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala:
##
@@ -116,26 +116,37 @@ class CollationRegexpExpressionsSuite
 
   test("Support StringSplit string expression with collation") {
 // Supported collations
-case class StringSplitTestCase[R](l: String, r: String, c: String, result: R)
+case class StringSplitTestCase[R](l: String, r: String, c: String, result: R, limit: Int = -1)
 val testCases = Seq(
-  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C"))
+  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY", Seq("ABC")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C")),
+  StringSplitTestCase("AAA", "[a]", "UTF8_BINARY_LCASE", Seq("", "", "", "")),
+  StringSplitTestCase("AAA", "[b]", "UTF8_BINARY_LCASE", Seq("AAA")),
+  StringSplitTestCase("aAbB", "[ab]", "UTF8_BINARY_LCASE", Seq("", "", "", "", "")),
+  StringSplitTestCase("", "", "UTF8_BINARY_LCASE", Seq("")),
+  StringSplitTestCase("", "[a]", "UTF8_BINARY_LCASE", Seq("")),
+  StringSplitTestCase("xAxBxaxbx", "[AB]", "UTF8_BINARY_LCASE", Seq("x", "x", "x", "x", "x")),
+  StringSplitTestCase("ABC", "", "UTF8_BINARY_LCASE", Seq("A", "B", "C")),
+  // test split with limit
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("ABC"), 1),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C"), 2),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C"), 3),
+  StringSplitTestCase("ABC", "[B]", "UNICODE", Seq("A", "C")),
+  StringSplitTestCase("ABC", "[b]", "UNICODE", Seq("ABC"))
 )
 testCases.foreach(t => {
-  val query = s"SELECT split(collate('${t.l}', '${t.c}'), collate('${t.r}', '${t.c}'))"
+  val query = s"SELECT split(collate('${t.l}', '${t.c}'), '${t.r}', ${t.limit})"
   // Result & data type
   checkAnswer(sql(query), Row(t.result))
   assert(sql(query).schema.fields.head.dataType.sameType(ArrayType(StringType(t.c
   // TODO: Implicit casting (not currently supported)

Review Comment:
   don't leave any TODOs
   
   if there is no casting to be done for this expression, note that in the 
comment here and explain why that's the case






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565669167


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##
@@ -543,25 +544,29 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
+  override def dataType: DataType = ArrayType(str.dataType, containsNull = false)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeBinaryLcase, StringTypeAnyCollation, IntegerType)
   override def first: Expression = str
   override def second: Expression = regex
   override def third: Expression = limit
 
+  final lazy val collationId: Int = str.dataType.asInstanceOf[StringType].collationId
+
   def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1))
 
   override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
-val strings = string.asInstanceOf[UTF8String].split(
-  regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int])
+val strings = CollationSupport.StringSplit.exec(string.asInstanceOf[UTF8String],
+  regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int], collationId)
 new GenericArrayData(strings.asInstanceOf[Array[Any]])
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 val arrayClass = classOf[GenericArrayData].getName
 nullSafeCodeGen(ctx, ev, (str, regex, limit) => {

Review Comment:
   for a single function call (like here), use `defineCodeGen`
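
   A sketch of how that could look for StringSplit, reusing the genCode helper from CollationSupport shown earlier in this thread (the exact wrapping is illustrative, not the final implementation):

   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val arrayClass = classOf[GenericArrayData].getName
     // defineCodeGen emits the null checks and a single assignment of the generated expression.
     defineCodeGen(ctx, ev, (str, regex, limit) =>
       s"new $arrayClass(${CollationSupport.StringSplit.genCode(str, regex, limit, collationId)})")
   }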






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565667394


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -143,6 +145,36 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
* Collation-aware regexp expressions.
*/
 
+  public static class StringSplit {
+public static UTF8String[] exec(final UTF8String l, final UTF8String r, final int limit,

Review Comment:
   let's be consistent with naming
   
   `string, regex, limit` instead of `l, r, limit`






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-11 Thread via GitHub


uros-db commented on PR #45856:
URL: https://github.com/apache/spark/pull/45856#issuecomment-2049814727

   heads up: we’ve done some major code restructuring in 
https://github.com/apache/spark/pull/45978, so please sync these changes before 
moving on
   
   @nikolamand-db you’ll likely need to rewrite the code in this PR, so please 
follow the guidelines outlined in 
https://issues.apache.org/jira/browse/SPARK-47410





Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-08 Thread via GitHub


mihailom-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1555843730


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -1078,7 +1078,22 @@ public UTF8String[] split(UTF8String pattern, int limit) {
   }
   return result;
 }
-return split(pattern.toString(), limit);
+return split(pattern.toString(), limit, regexFlags);
+  }
+
+  public UTF8String[] split(UTF8String pattern, int limit) {
+return split(pattern, limit, 0); // Pattern without regex flags
+  }
+
+  public UTF8String[] splitCollationAware(UTF8String pattern, int limit, int collationId) {
+if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+  return split(pattern, limit);
+}
+if (collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {
+  return split(pattern, limit, Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
+}
+throw new UnsupportedOperationException("Unsupported collation " +
+  CollationFactory.fetchCollation(collationId).collationName);

Review Comment:
   +1, this should fail in inputTypeCheck of the related expression. I am not 
sure if we need this double guard.






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-04 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1552941872


##
sql/core/src/test/scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala:
##
@@ -165,15 +165,23 @@ class CollationRegexpExpressionsSuite
 
 // Supported collations
 val checks = Seq(
-  CollationTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C"))
+  CollationTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C")),
+  CollationTestCase("ABC", "[b]", "UTF8_BINARY", Seq("ABC")),
+  CollationTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C")),
+  CollationTestCase("AAA", "[a]", "UTF8_BINARY_LCASE", Seq("", "", "", 
"")),
+  CollationTestCase("AAA", "[b]", "UTF8_BINARY_LCASE", Seq("AAA")),
+  CollationTestCase("aAbB", "[ab]", "UTF8_BINARY_LCASE", Seq("", "", "", 
"", "")),
+  CollationTestCase("", "", "UTF8_BINARY_LCASE", Seq("")),
+  CollationTestCase("", "[a]", "UTF8_BINARY_LCASE", Seq("")),
+  CollationTestCase("xAxBxaxbx", "[AB]", "UTF8_BINARY_LCASE", Seq("x", 
"x", "x", "x", "x")),
+  CollationTestCase("ABC", "[B]", "UNICODE", Seq("A", "C")),
+  CollationTestCase("ABC", "[b]", "UNICODE", Seq("ABC"))
 )
 checks.foreach(ct =>
   checkEvaluation(prepareStringSplit(ct.s1, ct.s2, ct.collation), 
ct.expectedResult)

Review Comment:
   I think we should consider rewriting `prepareStringSplit` here, as it 
doesn't make much sense to do:
   `val splitByExpr = Literal.create(splitBy, StringType(collationId))` 
considering that **collationAwareSplit** doesn't respect collation for this 
parameter at all
   
   furthermore, consider how casting makes sense in this instance, and whether 
it's needed at all
   






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-04 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1552938423


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##
@@ -543,25 +543,40 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
+  override def dataType: DataType = ArrayType(str.dataType, containsNull = false)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeBinaryLcase, StringTypeAnyCollation, IntegerType)

Review Comment:
   what does it mean for "regex" to be of type `StringTypeAnyCollation`, as it 
doesn't seem to me that collation is respected/needed for this parameter to 
begin with? for example, consider: `[,]` with UNICODE_CI collation






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-04 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1552929298


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -1078,7 +1078,22 @@ public UTF8String[] split(UTF8String pattern, int limit) {
   }
   return result;
 }
-return split(pattern.toString(), limit);
+return split(pattern.toString(), limit, regexFlags);
+  }
+
+  public UTF8String[] split(UTF8String pattern, int limit) {
+return split(pattern, limit, 0); // Pattern without regex flags
+  }
+
+  public UTF8String[] splitCollationAware(UTF8String pattern, int limit, int collationId) {
+if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+  return split(pattern, limit);
+}
+if (collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {
+  return split(pattern, limit, Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
+}
+throw new UnsupportedOperationException("Unsupported collation " +
+  CollationFactory.fetchCollation(collationId).collationName);

Review Comment:
   if you are using `StringTypeBinaryLcase` for this expression, it should 
never come to this point
   that said, we may want to keep this here anyway because this method is part 
of a public interface






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-04 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1552920335


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -1078,7 +1078,22 @@ public UTF8String[] split(UTF8String pattern, int limit) {
   }
   return result;
 }
-return split(pattern.toString(), limit);
+return split(pattern.toString(), limit, regexFlags);
+  }
+
+  public UTF8String[] split(UTF8String pattern, int limit) {
+return split(pattern, limit, 0); // Pattern without regex flags
+  }
+
+  public UTF8String[] splitCollationAware(UTF8String pattern, int limit, int collationId) {

Review Comment:
   let's go with: `collationAwareSplit` in order to unify naming






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-03 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1550949300


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##
@@ -543,25 +543,52 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
+  override def dataType: DataType = ArrayType(str.dataType, containsNull = false)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation, IntegerType)
+
+  override def checkInputDataTypes(): TypeCheckResult = str.dataType match {
+case s: StringType =>
+  if (s.supportsBinaryEquality || s.isUTF8BinaryLcaseCollation) {
+super.checkInputDataTypes()
+  } else {
+TypeCheckFailure("Unsupported collation: " +
+  CollationFactory.fetchCollation(s.collationId).collationName)
+  }
+case _ => super.checkInputDataTypes()
+  }
+
   override def first: Expression = str
   override def second: Expression = regex
   override def third: Expression = limit
 
+  private def collationId = str.dataType.asInstanceOf[StringType].collationId
+  private def binarySplit = CollationFactory.fetchCollation(collationId).supportsBinaryEquality
+
   def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1))
 
   override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
-val strings = string.asInstanceOf[UTF8String].split(
-  regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int])
+val strings = if (binarySplit) {
+  string.asInstanceOf[UTF8String].split(
+regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int])
+} else {
+  string.asInstanceOf[UTF8String].splitCaseInsensitive(
+regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int])

Review Comment:
   I don't think branching here is a good idea; consider rewriting the UTF8String methods to accommodate this






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-03 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1550949300


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##
@@ -543,25 +543,52 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
+  override def dataType: DataType = ArrayType(str.dataType, containsNull = false)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation, IntegerType)
+
+  override def checkInputDataTypes(): TypeCheckResult = str.dataType match {
+case s: StringType =>
+  if (s.supportsBinaryEquality || s.isUTF8BinaryLcaseCollation) {
+super.checkInputDataTypes()
+  } else {
+TypeCheckFailure("Unsupported collation: " +
+  CollationFactory.fetchCollation(s.collationId).collationName)
+  }
+case _ => super.checkInputDataTypes()
+  }
+
   override def first: Expression = str
   override def second: Expression = regex
   override def third: Expression = limit
 
+  private def collationId = str.dataType.asInstanceOf[StringType].collationId
+  private def binarySplit = CollationFactory.fetchCollation(collationId).supportsBinaryEquality
+
   def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1))
 
   override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
-val strings = string.asInstanceOf[UTF8String].split(
-  regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int])
+val strings = if (binarySplit) {
+  string.asInstanceOf[UTF8String].split(
+regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int])
+} else {
+  string.asInstanceOf[UTF8String].splitCaseInsensitive(
+regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int])

Review Comment:
   I don't think branching here is a good idea






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-03 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1550948477


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##
@@ -543,25 +543,52 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
+  override def dataType: DataType = ArrayType(str.dataType, containsNull = false)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation, IntegerType)
+
+  override def checkInputDataTypes(): TypeCheckResult = str.dataType match {
+case s: StringType =>
+  if (s.supportsBinaryEquality || s.isUTF8BinaryLcaseCollation) {
+super.checkInputDataTypes()
+  } else {
+TypeCheckFailure("Unsupported collation: " +
+  CollationFactory.fetchCollation(s.collationId).collationName)
+  }
+case _ => super.checkInputDataTypes()
+  }
+
   override def first: Expression = str
   override def second: Expression = regex
   override def third: Expression = limit
 
+  private def collationId = str.dataType.asInstanceOf[StringType].collationId

Review Comment:
   use:
   `final lazy val collationId: Int = str.dataType.asInstanceOf[StringType].collationId`






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-03 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1550946908


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##
@@ -543,25 +543,52 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
+  override def dataType: DataType = ArrayType(str.dataType, containsNull = false)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation, IntegerType)
+
+  override def checkInputDataTypes(): TypeCheckResult = str.dataType match {
+case s: StringType =>
+  if (s.supportsBinaryEquality || s.isUTF8BinaryLcaseCollation) {

Review Comment:
   we shouldn't do this manually for every function
   instead, a more general way to handle this has been introduced in 
`CollationTypeConstraints.scala`
   
   use the appropriate case object stemming from `StringTypeCollated`
   (in this case: StringTypeBinary)






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-03 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1550947385


##
sql/core/src/test/scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala:
##
@@ -149,38 +149,40 @@ class CollationRegexpExpressionsSuite extends QueryTest with SharedSparkSession
   test("Support StringSplit string expression with Collation") {
 // Supported collations
 val checks = Seq(
-  CollationTestCase("ABC", "[B]", "UTF8_BINARY", 2)
+  CollationTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C")),
+  CollationTestCase("ABC", "[b]", "UTF8_BINARY", Seq("ABC")),
+  CollationTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C")),
+  CollationTestCase("AAA", "[a]", "UTF8_BINARY_LCASE", Seq("", "", "", "")),
+  CollationTestCase("xAxBxaxbx", "[AB]", "UTF8_BINARY_LCASE", Seq("x", "x", "x", "x", "x")),
+  CollationTestCase("ABC", "[B]", "UNICODE", Seq("A", "C")),
+  CollationTestCase("ABC", "[b]", "UNICODE", Seq("ABC"))
 )
 checks.foreach(ct => {
-  checkAnswer(sql(s"SELECT size(split(collate('${ct.s1}', '${ct.collation}')" +
-s",collate('${ct.s2}', '${ct.collation}')))"), Row(ct.expectedResult))
+  checkAnswer(sql(s"SELECT split(collate('${ct.s1}', '${ct.collation}')" +
+s",collate('${ct.s2}', '${ct.collation}'))"), Row(ct.expectedResult))
 })
 // Unsupported collations
 val fails = Seq(
-  CollationTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", 0),
-  CollationTestCase("ABC", "[B]", "UNICODE", 2),
-  CollationTestCase("ABC", "[b]", "UNICODE_CI", 0)
+  CollationTestFail("ABC", "[b]", "UNICODE_CI")
 )
 fails.foreach(ct => {
   checkError(
 exception = intercept[ExtendedAnalysisException] {
-  sql(s"SELECT size(split(collate('${ct.s1}', '${ct.collation}')" +
-s",collate('${ct.s2}', '${ct.collation}')))")
+  sql(s"SELECT split(collate('${ct.s1}', '${ct.collation}')" +
+s",collate('${ct.s2}', '${ct.collation}'))")
 },
-errorClass = "DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE",

Review Comment:
   moving over to `StringTypeCollated`, you'll need to update these tests as 
well






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-03 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1550945764


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##
@@ -543,25 +543,52 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
+  override def dataType: DataType = ArrayType(str.dataType, containsNull = false)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation, IntegerType)

Review Comment:
   `StringTypeAnyCollation` is used for functions that support "all" collation 
types:
   
   1. UTF8_BINARY
   2. UTF8_BINARY_LCASE
   3. others - ICU collations such as UNICODE, UNICODE_CI, etc.
   
   if we only allow 1. and 2. then we should use `StringTypeBinaryLcase`






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-03 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1550942822


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -1078,7 +1078,15 @@ public UTF8String[] split(UTF8String pattern, int limit) {
   }
   return result;
 }
-return split(pattern.toString(), limit);
+return split(pattern.toString(), limit, regexFlags);
+  }
+
+  public UTF8String[] split(UTF8String pattern, int limit) {
+return split(pattern, limit, 0);

Review Comment:
   "0" may seem a bit vague here, consider adding a comment or use something 
like `NO_REGEX_FLAGS`






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-03 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1550943142


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -1095,21 +1103,30 @@ public UTF8String[] splitSQL(UTF8String delimiter, int limit) {
 }
   }
 
-  private UTF8String[] split(String delimiter, int limit) {
+  private UTF8String[] split(String delimiter, int limit, int regexFlags) {
 // Java String's split method supports "ignore empty string" behavior when the limit is 0
 // whereas other languages do not. To avoid this java specific behavior, we fall back to
 // -1 when the limit is 0.
 if (limit == 0) {
   limit = -1;
 }
-String[] splits = toString().split(delimiter, limit);
+String[] splits;
+if (regexFlags == 0) {

Review Comment:
   (see comment above)






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-03 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1550938707


##
sql/core/src/test/scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala:
##
@@ -149,38 +149,40 @@ class CollationRegexpExpressionsSuite extends QueryTest 
with SharedSparkSession
   test("Support StringSplit string expression with Collation") {

Review Comment:
   try covering some more edge cases in these tests, such as: empty strings, 
uppercase and lowercase mix, different byte-length chars, etc.


