Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


nikolamand-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561110745


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -172,19 +183,31 @@ public Collation(
   }
 
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
*/
-
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+String pattern = patternUTF8String.toString();
+CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
 Collator collator = CollationFactory.fetchCollation(collationId).collator;
 return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
   > That part becomes a little complicated with expressions that need to 
preserve the original string value, consider StringReplace: 
'BAbaBA'.replace('ab', 'XX') with UTF8_BINARY_LCASE should return 'BXXaBA' (not 
'bxxaba')
   
   Shouldn't this return `BA` with UTF8_BINARY_LCASE collation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


cloud-fan closed pull request #45978: [SPARK-47410][SQL] Refactor UTF8String 
and CollationFactory
URL: https://github.com/apache/spark/pull/45978





Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


cloud-fan commented on PR #45978:
URL: https://github.com/apache/spark/pull/45978#issuecomment-2049808333

   thanks, merging to master!





Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


uros-db commented on PR #45978:
URL: https://github.com/apache/spark/pull/45978#issuecomment-2049806525

   @cloud-fan all checks look good, ready to merge





Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


miland-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561064293


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -172,19 +183,31 @@ public Collation(
   }
 
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
*/
-
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+String pattern = patternUTF8String.toString();
+CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
 Collator collator = CollationFactory.fetchCollation(collationId).collator;
 return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
   @uros-db +1 on this reasoning






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561064861


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {
+
+  /**
+   * Collation aware string expressions.
+   */
+  public static class Contains {
+public static boolean containsCollationAware(UTF8String l, UTF8String r, int collationId) {
+  if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+return containsBinary(l, r);

Review Comment:
   update: we ended up doing something similar
   https://github.com/apache/spark/pull/45978#discussion_r1560761873






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


cloud-fan commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561059364


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -172,19 +183,31 @@ public Collation(
   }
 
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
*/
-
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+String pattern = patternUTF8String.toString();
+CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
 Collator collator = CollationFactory.fetchCollation(collationId).collator;
 return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
   Ah let's keep it then






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561054469


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -172,19 +183,31 @@ public Collation(
   }
 
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
*/
-
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+String pattern = patternUTF8String.toString();
+CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
 Collator collator = CollationFactory.fetchCollation(collationId).collator;
 return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
   That part becomes a little complicated with expressions that need to 
preserve the original string value, consider StringReplace: 
`'BAbaBA'.replace('ab', 'XX')` with UTF8_BINARY_LCASE should return 'BXXaBA' 
(not 'bxxaba')
   so the `.toLowerCase` version is used for matching, and the original 
(non-lowercased) version is used for replacement
   
   Two of @miland-db's currently open PRs use this variant of StringSearch, although 
it's not yet clear whether that's a good approach. That said, I do think we 
can figure out ways to do this so that we rely on `UTF8String` 
functions rather than `StringSearch` for UTF8_BINARY_LCASE, but we have yet to 
figure out if that's really more efficient (or even easier to implement) - for 
example, Milan noticed that the `UTF8String` implementation of "translate" 
actually does ".toString()" followed by a bunch of ".substring" calls, while 
StringSearch uses a `CharacterIterator`, which should probably be more efficient
   
   we may consider benchmarking this, but in terms of this PR @cloud-fan I 
think we could agree on either:
   1. keeping this version of `getStringSearch` here until we're sure that 
we won't need it for any expression (we could remove it once we've 
gone through all expressions)
   2. removing this version of `getStringSearch` here until we're sure that we 
need it somewhere (we could add it back once there's a guaranteed need for it)
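
The "match on the lowercased copy, replace in the original" idea described above can be sketched as follows. This is only an illustration under stated assumptions: `java.lang.String` stands in for Spark's `UTF8String`, plain `indexOf` stands in for the ICU `StringSearch`, and the names `LowercaseReplaceSketch` and `replaceLowercase` are hypothetical, not part of the PR.

```java
// Illustrative sketch only: java.lang.String stands in for Spark's UTF8String,
// and LowercaseReplaceSketch / replaceLowercase are hypothetical names.
final class LowercaseReplaceSketch {

  // Replaces every case-insensitive occurrence of `pattern` in `target` with
  // `replacement`, keeping the unmatched parts of `target` in their original case.
  static String replaceLowercase(String target, String pattern, String replacement) {
    if (pattern.isEmpty()) {
      return target; // nothing to search for
    }
    String lowerTarget = target.toLowerCase(java.util.Locale.ROOT);
    String lowerPattern = pattern.toLowerCase(java.util.Locale.ROOT);
    // Assumption: lowercasing keeps the string length, so match positions in the
    // lowercased copy line up with the original. True for ASCII, not for all of Unicode.
    if (lowerTarget.length() != target.length()) {
      throw new IllegalArgumentException("lowercasing changed the string length");
    }
    StringBuilder out = new StringBuilder();
    int from = 0;
    int hit = lowerTarget.indexOf(lowerPattern, from);
    while (hit >= 0) {
      out.append(target, from, hit); // original-case text before the match
      out.append(replacement);
      from = hit + lowerPattern.length();
      hit = lowerTarget.indexOf(lowerPattern, from);
    }
    out.append(target, from, target.length()); // original-case tail
    return out.toString();
  }
}
```

Under these assumptions, `replaceLowercase("BAbaBA", "ab", "XX")` yields `BXXXXA`, because both `Ab` and `aB` match once case is ignored. What the expected UTF8_BINARY_LCASE result for this example should be is questioned elsewhere in this thread.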






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561054469


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -172,19 +183,31 @@ public Collation(
   }
 
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
*/
-
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+String pattern = patternUTF8String.toString();
+CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
 Collator collator = CollationFactory.fetchCollation(collationId).collator;
 return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
   That part becomes a little complicated with expressions that need to 
preserve the original string value, consider StringReplace:
   `'BAbaBA'.replace('ab', 'XX')` with UTF8_BINARY_LCASE should return 'BXXaBA' 
(not 'bxxaba')
   so the `.toLowerCase` version is used for matching, and the original version is used 
for replacement
   
   Two of @miland-db's currently open PRs use this variant of StringSearch, although 
it's not yet clear whether that's a good approach. That said, I do think we 
can figure out ways to do this so that we rely on `UTF8String` 
functions rather than `StringSearch` for UTF8_BINARY_LCASE, but we have yet to 
figure out if that's really more efficient (or even easier to implement) - for 
example, Milan noticed that the `UTF8String` implementation of "translate" 
actually does ".toString()" followed by a bunch of ".substring" calls, while 
StringSearch uses a `CharacterIterator`, which should probably be more efficient
   
   we may consider benchmarking this, but in terms of this PR @cloud-fan I 
think we could agree on either:
   1. keeping this version of `getStringSearch` here until we're sure that 
we won't need it for any expression (we could remove it once we've 
gone through all expressions)
   2. removing this version of `getStringSearch` here until we're sure that we 
need it somewhere (we could add it back once there's a guaranteed need for it)






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


cloud-fan commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560959880


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -172,19 +183,31 @@ public Collation(
   }
 
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
*/
-
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+String pattern = patternUTF8String.toString();
+CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
 Collator collator = CollationFactory.fetchCollation(collationId).collator;
 return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
   If we call `.toLowerCase` first, then we can use the same implementation as 
`UTF8_BINARY`, which directly calls `UTF8String` methods, right?
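
As a rough illustration of this suggestion, the lowercase path can delegate to the binary path once both arguments are lowercased. This is a sketch under assumptions: `java.lang.String` stands in for `UTF8String`, and `ContainsLowercaseSketch` is a hypothetical class name; the method names only mirror the `containsBinary` / `containsLowercase` helpers that appear in the diff.

```java
// Illustrative sketch only: java.lang.String stands in for Spark's UTF8String,
// and ContainsLowercaseSketch is a hypothetical name.
final class ContainsLowercaseSketch {

  // Binary (byte-for-byte in Spark, char-for-char here) containment check.
  static boolean containsBinary(String l, String r) {
    return l.contains(r);
  }

  // Lowercase both sides first, then reuse the binary implementation unchanged.
  static boolean containsLowercase(String l, String r) {
    return containsBinary(
        l.toLowerCase(java.util.Locale.ROOT),
        r.toLowerCase(java.util.Locale.ROOT));
  }
}
```

This works for a boolean result such as `contains`, where nothing from the original string needs to be returned; expressions that must preserve the original casing in their output need more than this.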






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560772584


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation-aware expressions (StringExpressions, RegexpExpressions, and
+ * other expressions that require custom collation support), as well as private utility methods for
+ * collation-aware UTF8String operations needed to implement .
+ */
+public final class CollationSupport {
+
+  /**
+   * Collation-aware string expressions.
+   */
+
+  public static class Contains {
+public static boolean contains(final UTF8String l, final UTF8String r, final int collationId) {

Review Comment:
   done



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -93,9 +101,12 @@ public Collation(
   this.hashFunction = hashFunction;
   this.supportsBinaryEquality = supportsBinaryEquality;
   this.supportsBinaryOrdering = supportsBinaryOrdering;
+  this.supportsLowercaseEquality = collationName.equals("UTF8_BINARY_LCASE");

Review Comment:
   done






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


dbatomic commented on PR #45978:
URL: https://github.com/apache/spark/pull/45978#issuecomment-2049342650

   Just wanted to thank you for doing this. IMO, things are much cleaner than 
they used to be.
   Also, @mihailom-db , @nikolamand-db , @stefankandic and @stevomitric as FYI.





Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


dbatomic commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560761873


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation-aware expressions (StringExpressions, RegexpExpressions, and
+ * other expressions that require custom collation support), as well as private utility methods for
+ * collation-aware UTF8String operations needed to implement .
+ */
+public final class CollationSupport {
+
+  /**
+   * Collation-aware string expressions.
+   */
+
+  public static class Contains {
+public static boolean contains(final UTF8String l, final UTF8String r, final int collationId) {

Review Comment:
   Do we need to repeat class name in every method?
   Can't you say something like:
   Contains.exec(...)
   Contains.genCode(...)
   Contains.binaryExec(...)
   Contains.lowercaseExec(...)
   Contains.icuExec(...)
   
   instead of
   Contains.contains(...)
   
   My hope is still that we will be able to add some base classes in future 
that will unify parts of this logic for some expressions. So let's try to keep 
naming uniform.
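
The naming scheme proposed above might look roughly like the sketch below. It is not the PR's code: `java.lang.String` stands in for `UTF8String`, the `Kind` enum is a hypothetical stand-in for the `supportsBinaryEquality` / `supportsLowercaseEquality` flags, and the ICU path is left as a stub because it would need the external ICU library.

```java
// Illustrative sketch only: ContainsSketch and Kind are hypothetical names,
// and java.lang.String stands in for Spark's UTF8String.
final class ContainsSketch {

  // Hypothetical stand-in for the flags exposed by CollationFactory.Collation.
  enum Kind { BINARY, LOWERCASE, ICU }

  // Single entry point; the per-collation variants share a uniform "...Exec" suffix.
  static boolean exec(String l, String r, Kind kind) {
    switch (kind) {
      case BINARY:
        return binaryExec(l, r);
      case LOWERCASE:
        return lowercaseExec(l, r);
      default:
        return icuExec(l, r);
    }
  }

  // Returns the source text of the call that generated code would embed.
  static String genCode(String l, String r, Kind kind) {
    String prefix = "ContainsSketch.";
    switch (kind) {
      case BINARY:
        return String.format(prefix + "binaryExec(%s, %s)", l, r);
      case LOWERCASE:
        return String.format(prefix + "lowercaseExec(%s, %s)", l, r);
      default:
        return String.format(prefix + "icuExec(%s, %s)", l, r);
    }
  }

  static boolean binaryExec(String l, String r) {
    return l.contains(r);
  }

  static boolean lowercaseExec(String l, String r) {
    return l.toLowerCase(java.util.Locale.ROOT).contains(r.toLowerCase(java.util.Locale.ROOT));
  }

  // Stub: the real variant would use an ICU StringSearch built from the collator.
  static boolean icuExec(String l, String r) {
    throw new UnsupportedOperationException("ICU path omitted in this sketch");
  }
}
```

With every expression class exposing the same `exec` / `genCode` / `...Exec` method names, a shared base class or interface could later capture the dispatch logic once.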






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


dbatomic commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560754473


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -93,9 +101,12 @@ public Collation(
   this.hashFunction = hashFunction;
   this.supportsBinaryEquality = supportsBinaryEquality;
   this.supportsBinaryOrdering = supportsBinaryOrdering;
+  this.supportsLowercaseEquality = collationName.equals("UTF8_BINARY_LCASE");

Review Comment:
   Can you pass `supportsLowercaseEquality` as a constructor argument instead of 
doing a string equality check here? You can make it default to false if needed, but I would 
rather be explicit in Collation construction.
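
A minimal sketch of what this request amounts to, assuming a simplified class: `CollationSketch` is hypothetical and carries only the name and the three flags discussed here, whereas the real `Collation` constructor takes further arguments (collator, comparator, hash function, and so on).

```java
// Illustrative sketch only: CollationSketch is a hypothetical, trimmed-down
// stand-in for CollationFactory.Collation.
final class CollationSketch {
  final String collationName;
  final boolean supportsBinaryEquality;
  final boolean supportsBinaryOrdering;
  final boolean supportsLowercaseEquality;

  // The flag is passed explicitly instead of being derived from the name.
  CollationSketch(
      String collationName,
      boolean supportsBinaryEquality,
      boolean supportsBinaryOrdering,
      boolean supportsLowercaseEquality) {
    this.collationName = collationName;
    this.supportsBinaryEquality = supportsBinaryEquality;
    this.supportsBinaryOrdering = supportsBinaryOrdering;
    this.supportsLowercaseEquality = supportsLowercaseEquality;
  }

  // Optional convenience overload: the flag defaults to false when omitted.
  CollationSketch(
      String collationName,
      boolean supportsBinaryEquality,
      boolean supportsBinaryOrdering) {
    this(collationName, supportsBinaryEquality, supportsBinaryOrdering, false);
  }
}
```

Passing the flag at each construction site keeps the behaviour visible where collations are defined and avoids a silent dependency on the exact collation name string.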






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-11 Thread via GitHub


uros-db commented on PR #45978:
URL: https://github.com/apache/spark/pull/45978#issuecomment-2049008530

   updated PR description, stand by for some more small changes before merging





Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560453598


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -172,19 +183,31 @@ public Collation(
   }
 
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
*/
-
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+String pattern = patternUTF8String.toString();
+CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
 Collator collator = CollationFactory.fetchCollation(collationId).collator;
 return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
   it will be needed for custom UTF8_BINARY_LCASE implementations for certain 
expressions (for example: StringReplace, StringTranslate)
   
   since UTF8_BINARY_LCASE doesn't have an ICU collator instance, we can only 
call `.toLowerCase` on both arguments and then use this raw (binary) 
`StringSearch` instance to implement those expressions






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560454235


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation-aware expressions (StringExpressions, RegexpExpressions, and
+ * other expressions that require custom collation support), as well as private utility methods for
+ * collation-aware UTF8String operations needed to implement .
+ */
+public final class CollationSupport {
+
+  /**
+   * Collation-aware string expressions.
+   */
+
+  public static class Contains {
+    public static boolean contains(final UTF8String l, final UTF8String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return containsBinary(l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return containsLowercase(l, r);
+      } else {
+        return containsICU(l, r, collationId);
+      }
+    }
+    public static String containsGenCode(final String l, final String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.Contains.contains";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s)", l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s)", l, r);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId);
+      }
+    }
+    public static boolean containsBinary(final UTF8String l, final UTF8String r) {
+      return l.contains(r);
+    }
+    public static boolean containsLowercase(final UTF8String l, final UTF8String r) {
+      return l.toLowerCase().contains(r.toLowerCase());

Review Comment:
   agreed, I'll open another ticket for improvements like that






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


cloud-fan commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560452406


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation-aware expressions (StringExpressions, RegexpExpressions, and
+ * other expressions that require custom collation support), as well as private utility methods for
+ * collation-aware UTF8String operations needed to implement .
+ */
+public final class CollationSupport {
+
+  /**
+   * Collation-aware string expressions.
+   */
+
+  public static class Contains {
+    public static boolean contains(final UTF8String l, final UTF8String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return containsBinary(l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return containsLowercase(l, r);
+      } else {
+        return containsICU(l, r, collationId);
+      }
+    }
+    public static String containsGenCode(final String l, final String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.Contains.contains";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s)", l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s)", l, r);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId);
+      }
+    }
+    public static boolean containsBinary(final UTF8String l, final UTF8String r) {
+      return l.contains(r);
+    }
+    public static boolean containsLowercase(final UTF8String l, final UTF8String r) {
+      return l.toLowerCase().contains(r.toLowerCase());

Review Comment:
   We can probably improve this with `compareLowerCase`, introduced in https://github.com/apache/spark/pull/45816. We can do it later.
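
   To illustrate the kind of improvement suggested here, the sketch below checks containment without materializing lowercased copies of either argument. It is not Spark's `compareLowerCase` (whose implementation is not shown in this thread); it assumes the goal is to avoid the two `toLowerCase()` allocations, and it works on plain `java.lang.String` via the standard `String.regionMatches(boolean, int, String, int, int)` method. The class and method names are hypothetical.

```java
// Hypothetical sketch: the same idea on java.lang.String, not Spark's UTF8String API.
final class ContainsIgnoreCaseSketch {

  // Returns true if pattern occurs in target, ignoring case, without
  // allocating lowercased copies of either string.
  static boolean containsIgnoreCase(String target, String pattern) {
    int n = pattern.length();
    for (int i = 0; i + n <= target.length(); i++) {
      // regionMatches with ignoreCase=true compares character by character.
      if (target.regionMatches(true, i, pattern, 0, n)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(containsIgnoreCase("Spark SQL", "sql")); // prints "true"
  }
}
```

   Note that `regionMatches` folds case one character at a time, which is not guaranteed to agree with `toLowerCase()` for every Unicode character, so this is a sketch of the idea and not a drop-in replacement.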






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


cloud-fan commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560451327


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -172,19 +183,31 @@ public Collation(
   }
 
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
    */
-
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
 
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
   why do we need it?






Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


HyukjinKwon commented on PR #45978:
URL: https://github.com/apache/spark/pull/45978#issuecomment-2048651969

   Can you please fill in the PR description?





Re: [PR] [SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559445525


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationSupport {
+
+  /**
+   * Collation aware string expressions.
+   */
+  public static class Contains {
+    public static boolean containsCollationAware(UTF8String l, UTF8String r, int collationId) {

Review Comment:
   agreed, going with `contains` ([ref](https://github.com/apache/spark/pull/45978#discussion_r1559302476))






Re: [PR] [SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559441689


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationSupport {

Review Comment:
   Another reason is that I wanted to have `private static class CollationAwareUTF8String` in the same outer class (there's no reason to expose this API outside of collation awareness for expressions).
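
   The layout being argued for can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual code: it uses `java.lang.String` in place of `UTF8String`, and the names `CollationSupportLayoutSketch`, `StartsWith`, and `LowercaseHelper` are hypothetical stand-ins (the last one for `CollationAwareUTF8String`). The point is that public nested classes act as per-expression entry points, while the private nested helper is only reachable from inside the outer class.

```java
import java.util.Locale;

// Hypothetical sketch of the class layout; String stands in for UTF8String.
final class CollationSupportLayoutSketch {

  private CollationSupportLayoutSketch() {}

  // Public nested class: entry point for one expression.
  public static final class Contains {
    public static boolean containsLowercase(String l, String r) {
      return LowercaseHelper.lower(l).contains(LowercaseHelper.lower(r));
    }
  }

  // Another public nested class sharing the same private helper.
  public static final class StartsWith {
    public static boolean startsWithLowercase(String l, String r) {
      return LowercaseHelper.lower(l).startsWith(LowercaseHelper.lower(r));
    }
  }

  // Private nested class: usable by the nested entry points above, but not
  // visible to any code outside CollationSupportLayoutSketch.
  private static final class LowercaseHelper {
    static String lower(String s) {
      return s.toLowerCase(Locale.ROOT);
    }
  }

  public static void main(String[] args) {
    System.out.println(Contains.containsLowercase("Spark SQL", "sql")); // prints "true"
  }
}
```

   Splitting the entry points into separate top-level classes would force the helper to be at least package-private, which is the exposure this comment wants to avoid.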






Re: [PR] [SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559438477


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationSupport {

Review Comment:
   that could be troublesome for importing in general, and especially in genCode
   
   imagine `evaluator.setDefaultImports` getting flooded with:
   classOf[Expr1].getName
   classOf[Expr2].getName
   classOf[Expr3].getName
   ...
   classOf[Expr99].getName
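
   The import concern can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: it does not call Spark's code generator or `evaluator.setDefaultImports`, it just contrasts the list of class names that would have to be registered when each expression is its own top-level class with the single name needed when the expressions are nested under `CollationSupport`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: compares import lists for the two layouts.
final class DefaultImportsSketch {

  static final String PKG = "org.apache.spark.sql.catalyst.util";

  // One top-level class per expression: one default import per expression.
  static List<String> importsForTopLevelClasses(List<String> expressionNames) {
    List<String> imports = new ArrayList<>();
    for (String name : expressionNames) {
      imports.add(PKG + "." + name);
    }
    return imports;
  }

  // Nested classes under one outer class: a single default import suffices.
  static List<String> importsForNestedClasses() {
    return Collections.singletonList(PKG + ".CollationSupport");
  }

  // Generated code then reaches each expression through the outer class name.
  static String nestedCall(String expressionName, String method, String l, String r) {
    return String.format("CollationSupport.%s.%s(%s, %s)", expressionName, method, l, r);
  }

  public static void main(String[] args) {
    System.out.println(nestedCall("Contains", "containsBinary", "l", "r"));
    // prints "CollationSupport.Contains.containsBinary(l, r)"
  }
}
```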






Re: [PR] [SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


dbatomic commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559428506


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationSupport {
+
+  /**
+   * Collation aware string expressions.
+   */
+  public static class Contains {
+    public static boolean containsCollationAware(UTF8String l, UTF8String r, int collationId) {

Review Comment:
   Maybe just `contains`? There's no need for the `CollationAware` suffix in this context.






Re: [PR] [SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


dbatomic commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559427040


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationSupport {

Review Comment:
   Maybe we can just put this in a new namespace? Is there any need for a nested class?


