Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-28 Thread via GitHub


cloud-fan closed pull request #46511: [SPARK-48221][SQL] Alter string search 
logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, 
StringLocate)
URL: https://github.com/apache/spark/pull/46511


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-28 Thread via GitHub


cloud-fan commented on PR #46511:
URL: https://github.com/apache/spark/pull/46511#issuecomment-2135733655

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-23 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1612011134


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -430,7 +430,7 @@ public static int execBinary(final UTF8String string, final 
UTF8String substring
 }
 public static int execLowercase(final UTF8String string, final UTF8String 
substring,
 final int start) {
-  return string.toLowerCase().indexOf(substring.toLowerCase(), start);

Review Comment:
   so as part of this PR, we actually changed the core definition of 
string-searching in UTF8_BINARY_LCASE, i.e. what it means for one substring 
(_pattern_) to be found in another string (_target_) under UTF8_BINARY_LCASE
   
   in the old implementation, `contains("İ", "i")` would return `true` - 
however, this behaviour is incorrect because it relies on the fact that 
`substr(lower("İ"), 1, 1)` == `"i"` (incorrect, old implementation), instead of 
`lower(substr("İ", 1, 1))` != `"i"` (correct, new implementation)
   
   and this is all due to the fact that `lower("İ") = "i\u0307"` (1 uppercase 
character -> 2 lowercase characters)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-23 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1612011134


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -430,7 +430,7 @@ public static int execBinary(final UTF8String string, final 
UTF8String substring
 }
 public static int execLowercase(final UTF8String string, final UTF8String 
substring,
 final int start) {
-  return string.toLowerCase().indexOf(substring.toLowerCase(), start);

Review Comment:
   so as part of this PR, we actually changed the core definition of 
string-searching in UTF8_BINARY_LCASE, i.e. what it means for one substring 
(_pattern_) to be found in another string (_target_) under UTF8_BINARY_LCASE
   
   in the old implementation, `contains("İ", "i")` would return `true` - 
however, this behaviour is incorrect because it relies on the fact that 
`substr(lower("İ"), 1, 1)` == `"i"` (incorrect, old implementation), instead of 
`lower(substr("İ", 1, 1))` != `"i"` (correct, new implementation)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-23 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1612011134


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -430,7 +430,7 @@ public static int execBinary(final UTF8String string, final 
UTF8String substring
 }
 public static int execLowercase(final UTF8String string, final UTF8String 
substring,
 final int start) {
-  return string.toLowerCase().indexOf(substring.toLowerCase(), start);

Review Comment:
   so as part of this PR, we actually changed the core definition of 
string-searching in UTF8_BINARY_LCASE, i.e. what it means for one substring 
(_pattern_) to be found in another string (_target_) under UTF8_BINARY_LCASE
   
   in the old implementation, `contains("İ", "i")` would return `true` - 
however, this behaviour is incorrect because it relies on the fact that 
`substr(lower("İ"), 1, 1)` == `"i"` (incorrect, old implementation), instead of 
`lower(substr("İ"substr))` != `"i"` (correct, new implementation)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-23 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1612011134


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -430,7 +430,7 @@ public static int execBinary(final UTF8String string, final 
UTF8String substring
 }
 public static int execLowercase(final UTF8String string, final UTF8String 
substring,
 final int start) {
-  return string.toLowerCase().indexOf(substring.toLowerCase(), start);

Review Comment:
   so as part of this PR, we actually changed the core definition of 
string-searching in UTF8_BINARY_LCASE, i.e. what it means for one substring 
(_pattern_) to be found in another string (_target_) under UTF8_BINARY_LCASE
   
   in the old implementation, `contains("İ", "i")` would return `true` - 
however, this behaviour is incorrect because it relies on the fact that 
`substr(lower("İ"), 1, 1)` == `"i"` (incorrect, old implementation), instead of 
`lower(substr("İ"substr))` == `"i"` (correct, new implementation)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-23 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1612011134


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -430,7 +430,7 @@ public static int execBinary(final UTF8String string, final 
UTF8String substring
 }
 public static int execLowercase(final UTF8String string, final UTF8String 
substring,
 final int start) {
-  return string.toLowerCase().indexOf(substring.toLowerCase(), start);

Review Comment:
   so as part of this PR, we actually changed the core definition of 
string-searching in UTF8_BINARY_LCASE, i.e. what it means for one substring 
(_pattern_) to be found in another string (_target_) under UTF8_BINARY_LCASE
   
   in the old implementation, `contains("İ", "i")` would return `true` - 
however, this behaviour is incorrect because it relies on the fact that 
`substr(lower("İ"))` == `"i"` (incorrect, old implementation), instead of 
`lower(substr("İ"))` == `"i"` (correct, new implementation)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-23 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1612004770


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -430,7 +430,7 @@ public static int execBinary(final UTF8String string, final 
UTF8String substring
 }
 public static int execLowercase(final UTF8String string, final UTF8String 
substring,
 final int start) {
-  return string.toLowerCase().indexOf(substring.toLowerCase(), start);

Review Comment:
   no, unfortunately it's not - while it works fine for ASCII, it actually 
gives wrong results in some special cases featuring conditional case mapping, 
when a character has a lowercase equivalent that consists of multiple 
characters, or is found at a particular place in the string (context-awareness)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-22 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1609894240


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,155 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
prefix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  public static int lowercaseFind(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int i = startPos; i <= target.numChars(); ++i) {
+  if (lowercaseMatchFrom(target, lowercasePattern, i)) {
+return i;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns whether the target string ends with the specified suffix, ending 
at the specified
+   * position (0-based index referring to character position in UTF8String), 
with respect to the
+   * UTF8_BINARY_LCASE collation. The method assumes that the suffix is 
already lowercased prior
+   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
+   * suffix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return whether the target string ends with the specified suffix in 
lowercase
+   */
+  public static boolean lowercaseMatchUntil(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int endPos) {
+return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that ends with 
the specified
+   * suffix, ending at the specified position (0-based index 

Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-22 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1609894240


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,155 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
prefix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  public static int lowercaseFind(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int i = startPos; i <= target.numChars(); ++i) {
+  if (lowercaseMatchFrom(target, lowercasePattern, i)) {
+return i;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns whether the target string ends with the specified suffix, ending 
at the specified
+   * position (0-based index referring to character position in UTF8String), 
with respect to the
+   * UTF8_BINARY_LCASE collation. The method assumes that the suffix is 
already lowercased prior
+   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
+   * suffix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return whether the target string ends with the specified suffix in 
lowercase
+   */
+  public static boolean lowercaseMatchUntil(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int endPos) {
+return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that ends with 
the specified
+   * suffix, ending at the specified position (0-based index 

Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-22 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1609894240


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,155 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
prefix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  public static int lowercaseFind(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int i = startPos; i <= target.numChars(); ++i) {
+  if (lowercaseMatchFrom(target, lowercasePattern, i)) {
+return i;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns whether the target string ends with the specified suffix, ending 
at the specified
+   * position (0-based index referring to character position in UTF8String), 
with respect to the
+   * UTF8_BINARY_LCASE collation. The method assumes that the suffix is 
already lowercased prior
+   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
+   * suffix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return whether the target string ends with the specified suffix in 
lowercase
+   */
+  public static boolean lowercaseMatchUntil(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int endPos) {
+return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that ends with 
the specified
+   * suffix, ending at the specified position (0-based index 

Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-22 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1609894240


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,155 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
prefix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  public static int lowercaseFind(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int i = startPos; i <= target.numChars(); ++i) {
+  if (lowercaseMatchFrom(target, lowercasePattern, i)) {
+return i;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns whether the target string ends with the specified suffix, ending 
at the specified
+   * position (0-based index referring to character position in UTF8String), 
with respect to the
+   * UTF8_BINARY_LCASE collation. The method assumes that the suffix is 
already lowercased prior
+   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
+   * suffix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return whether the target string ends with the specified suffix in 
lowercase
+   */
+  public static boolean lowercaseMatchUntil(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int endPos) {
+return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that ends with 
the specified
+   * suffix, ending at the specified position (0-based index 

Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-22 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1609894240


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,155 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
prefix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  public static int lowercaseFind(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int i = startPos; i <= target.numChars(); ++i) {
+  if (lowercaseMatchFrom(target, lowercasePattern, i)) {
+return i;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns whether the target string ends with the specified suffix, ending 
at the specified
+   * position (0-based index referring to character position in UTF8String), 
with respect to the
+   * UTF8_BINARY_LCASE collation. The method assumes that the suffix is 
already lowercased prior
+   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
+   * suffix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return whether the target string ends with the specified suffix in 
lowercase
+   */
+  public static boolean lowercaseMatchUntil(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int endPos) {
+return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that ends with 
the specified
+   * suffix, ending at the specified position (0-based index 

Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-22 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1609897510


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,155 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
prefix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  public static int lowercaseFind(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int i = startPos; i <= target.numChars(); ++i) {
+  if (lowercaseMatchFrom(target, lowercasePattern, i)) {
+return i;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns whether the target string ends with the specified suffix, ending 
at the specified
+   * position (0-based index referring to character position in UTF8String), 
with respect to the
+   * UTF8_BINARY_LCASE collation. The method assumes that the suffix is 
already lowercased prior
+   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
+   * suffix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return whether the target string ends with the specified suffix in 
lowercase
+   */
+  public static boolean lowercaseMatchUntil(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int endPos) {
+return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that ends with 
the specified
+   * suffix, ending at the specified position (0-based index 

Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-22 Thread via GitHub


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1609894240


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,155 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
prefix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  public static int lowercaseFind(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int i = startPos; i <= target.numChars(); ++i) {
+  if (lowercaseMatchFrom(target, lowercasePattern, i)) {
+return i;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns whether the target string ends with the specified suffix, ending 
at the specified
+   * position (0-based index referring to character position in UTF8String), 
with respect to the
+   * UTF8_BINARY_LCASE collation. The method assumes that the suffix is 
already lowercased prior
+   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
+   * suffix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return whether the target string ends with the specified suffix in 
lowercase
+   */
+  public static boolean lowercaseMatchUntil(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int endPos) {
+return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that ends with 
the specified
+   * suffix, ending at the specified position (0-based index 

Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-22 Thread via GitHub


cloud-fan commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1609843913


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,155 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
prefix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  public static int lowercaseFind(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int i = startPos; i <= target.numChars(); ++i) {
+  if (lowercaseMatchFrom(target, lowercasePattern, i)) {
+return i;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns whether the target string ends with the specified suffix, ending 
at the specified
+   * position (0-based index referring to character position in UTF8String), 
with respect to the
+   * UTF8_BINARY_LCASE collation. The method assumes that the suffix is 
already lowercased prior
+   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
+   * suffix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return whether the target string ends with the specified suffix in 
lowercase
+   */
+  public static boolean lowercaseMatchUntil(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int endPos) {
+return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that ends with 
the specified
+   * suffix, ending at the specified position (0-based index 

Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-22 Thread via GitHub


cloud-fan commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1609842094


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,155 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
prefix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  public static int lowercaseFind(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int i = startPos; i <= target.numChars(); ++i) {
+  if (lowercaseMatchFrom(target, lowercasePattern, i)) {
+return i;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns whether the target string ends with the specified suffix, ending 
at the specified
+   * position (0-based index referring to character position in UTF8String), 
with respect to the
+   * UTF8_BINARY_LCASE collation. The method assumes that the suffix is 
already lowercased prior
+   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
+   * suffix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return whether the target string ends with the specified suffix in 
lowercase
+   */
+  public static boolean lowercaseMatchUntil(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int endPos) {
+return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that ends with 
the specified
+   * suffix, ending at the specified position (0-based index 

Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-22 Thread via GitHub


cloud-fan commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1609839724


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -430,7 +430,7 @@ public static int execBinary(final UTF8String string, final 
UTF8String substring
 }
 public static int execLowercase(final UTF8String string, final UTF8String 
substring,
 final int start) {
-  return string.toLowerCase().indexOf(substring.toLowerCase(), start);

Review Comment:
   to confirm, the previous implementation here is correct, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-21 Thread via GitHub


mkaravel commented on PR #46511:
URL: https://github.com/apache/spark/pull/46511#issuecomment-2123824322

   @uros-db 
   
   I am reading this in the PR description:
   > Why are the changes needed?
   Fix functions that give unexpected results due to the conditional case 
mapping when performing string comparisons under UTF8_BINARY_LCASE.
   
   This is not really the reason we we are making this change (conditional case 
mapping is the reason why we want to change the comparator for 
UTF8_BINARY_LCASE, but not related to string search directly).
   
   The real reason is that we need to do string search at the UT8 char-by-char 
level because otherwise indices returned or considered during string search can 
return strange and unusable results due to the fact that case mappings are 
one-to-many, that is one UTF8 character can map to two or more UTF8 characters 
under the case mapping (both for upper and lower casing).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org