Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
nikolamand-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561110745

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ##
@@ -172,19 +183,31 @@ public Collation(
   }
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
    */
-
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
> That part becomes a little complicated with expressions that need to preserve the original string value, consider StringReplace: 'BAbaBA'.replace('ab', 'XX') with UTF8_BINARY_LCASE should return 'BXXaBA' (not 'bxxaba')

Shouldn't this return `BA` with UTF8_BINARY_LCASE collation?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
cloud-fan closed pull request #45978: [SPARK-47410][SQL] Refactor UTF8String and CollationFactory
URL: https://github.com/apache/spark/pull/45978
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
cloud-fan commented on PR #45978:
URL: https://github.com/apache/spark/pull/45978#issuecomment-2049808333

thanks, merging to master!
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
uros-db commented on PR #45978:
URL: https://github.com/apache/spark/pull/45978#issuecomment-2049806525

@cloud-fan all checks look good, ready to merge
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
miland-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561064293

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ##
@@ -172,19 +183,31 @@ public Collation(
   }
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
    */
-
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
@uros-db +1 on this reasoning
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561064861

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java: ##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {
+
+  /**
+   * Collation aware string expressions.
+   */
+  public static class Contains {
+    public static boolean containsCollationAware(UTF8String l, UTF8String r, int collationId) {
+      if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+        return containsBinary(l, r);

Review Comment:
update: we ended up doing something similar https://github.com/apache/spark/pull/45978#discussion_r1560761873
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
cloud-fan commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561059364

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ##
@@ -172,19 +183,31 @@ public Collation(
   }
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
    */
-
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
Ah let's keep it then
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1561054469

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ##
@@ -172,19 +183,31 @@ public Collation(
   }
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
    */
-
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
That part becomes a little complicated with expressions that need to preserve the original string value. Consider StringReplace: `'BAbaBA'.replace('ab', 'XX')` with UTF8_BINARY_LCASE should return 'BXXaBA' (not 'bxxaba'), so the `.toLowerCase` version is used for matching, and the original (non-lowercased) version is used for replacement.

@miland-db currently has 2 open PRs that use this variant of StringSearch, although it's not clear whether that's definitely a good approach.

That said, I do think we can figure out ways to do this so that we rely on `UTF8String` functions rather than `StringSearch` for UTF8_BINARY_LCASE, but we have yet to figure out whether that's really more efficient (or even easier to implement). For example, Milan noticed that the `UTF8String` implementation of "translate" actually does ".toString()" followed by a bunch of ".substring" calls, while StringSearch uses a `CharacterIterator`, which should probably be more efficient.

We may consider benchmarking this, but in terms of this PR, @cloud-fan, I think we could agree on either:

1. keeping this version of `getStringSearch` here until we're sure that we're not going to need it for any expression (we could remove it once we've gone through all expressions)
2. removing this version of `getStringSearch` here until we're sure that we need it somewhere (we could add it back once there's a guaranteed need for it)
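The "match on the lowercased copy, replace in the original" idea described in this comment can be sketched in plain Java. This is only an illustration under stated assumptions: `java.lang.String` stands in for `UTF8String`, the class and method names are hypothetical, and lowercasing is assumed not to change string length (true for ASCII input). Under this sketch's left-to-right, non-overlapping matching, `'BAbaBA'` contains two case-insensitive matches of `'ab'`, so it produces `BXXXXA`; the point illustrated is only that unmatched characters keep their original case.

```java
import java.util.Locale;

// Hypothetical helper, not part of Spark: java.lang.String stands in for UTF8String.
final class LcaseReplaceSketch {
  private LcaseReplaceSketch() {}

  /**
   * Replaces every non-overlapping, case-insensitive occurrence of pattern in target.
   * Matching runs on lowercased copies; the output is assembled from the original target.
   * Assumes lowercasing preserves string length (true for ASCII input).
   */
  static String replace(String target, String pattern, String replacement) {
    if (pattern.isEmpty()) {
      return target; // nothing to match against
    }
    String lowerTarget = target.toLowerCase(Locale.ROOT);
    String lowerPattern = pattern.toLowerCase(Locale.ROOT);
    StringBuilder out = new StringBuilder();
    int from = 0;
    int hit = lowerTarget.indexOf(lowerPattern, from);
    while (hit >= 0) {
      out.append(target, from, hit); // untouched characters keep their original case
      out.append(replacement);
      from = hit + lowerPattern.length();
      hit = lowerTarget.indexOf(lowerPattern, from);
    }
    out.append(target, from, target.length()); // tail after the last match
    return out.toString();
  }
}
```

Because offsets found in the lowercased copy are reused to slice the original, the length-preservation assumption matters; input where lowercasing changes length would need an offset mapping that this sketch does not attempt.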
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
cloud-fan commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560959880

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ##
@@ -172,19 +183,31 @@ public Collation(
   }
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
    */
-
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
If we call `.toLowerCase` first, then we can use the same implementation as `UTF8_BINARY`, which directly calls `UTF8String` methods, right?
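For predicates that only return a boolean, the suggestion above can be sketched as follows. This is a minimal illustration, assuming `java.lang.String` as a stand-in for `UTF8String`; the class name is hypothetical, and the method names merely mirror the `containsBinary` / `containsLowercase` naming quoted elsewhere in this thread.

```java
import java.util.Locale;

// Hypothetical stand-in: java.lang.String plays the role of UTF8String here.
final class LcasePredicateSketch {
  private LcasePredicateSketch() {}

  // Binary path: plain, case-sensitive substring check.
  static boolean containsBinary(String l, String r) {
    return l.contains(r);
  }

  // Lowercase path: lowercase both arguments, then reuse the binary implementation.
  static boolean containsLowercase(String l, String r) {
    return containsBinary(l.toLowerCase(Locale.ROOT), r.toLowerCase(Locale.ROOT));
  }
}
```

This works for predicates because the lowercased copies are discarded once the boolean is known; expressions that must return part of the original string cannot discard them so easily.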
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560772584

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation-aware expressions (StringExpressions, RegexpExpressions, and
+ * other expressions that require custom collation support), as well as private utility methods for
+ * collation-aware UTF8String operations needed to implement .
+ */
+public final class CollationSupport {
+
+  /**
+   * Collation-aware string expressions.
+   */
+
+  public static class Contains {
+    public static boolean contains(final UTF8String l, final UTF8String r, final int collationId) {

Review Comment:
done

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ##
@@ -93,9 +101,12 @@ public Collation(
       this.hashFunction = hashFunction;
       this.supportsBinaryEquality = supportsBinaryEquality;
       this.supportsBinaryOrdering = supportsBinaryOrdering;
+      this.supportsLowercaseEquality = collationName.equals("UTF8_BINARY_LCASE");

Review Comment:
done
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
dbatomic commented on PR #45978:
URL: https://github.com/apache/spark/pull/45978#issuecomment-2049342650

Just wanted to thank you for doing this. IMO, things are much cleaner than they used to be. Also, @mihailom-db, @nikolamand-db, @stefankandic and @stevomitric as FYI.
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
dbatomic commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560761873

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation-aware expressions (StringExpressions, RegexpExpressions, and
+ * other expressions that require custom collation support), as well as private utility methods for
+ * collation-aware UTF8String operations needed to implement .
+ */
+public final class CollationSupport {
+
+  /**
+   * Collation-aware string expressions.
+   */
+
+  public static class Contains {
+    public static boolean contains(final UTF8String l, final UTF8String r, final int collationId) {

Review Comment:
Do we need to repeat the class name in every method? Can't you say something like:

Contains.exec(...)
Contains.genCode(...)
Contains.binaryExec(...)
Contains.lowercaseExec(...)
Contains.icuExec(...)

instead of Contains.contains(...)

My hope is still that we will be able to add some base classes in future that will unify parts of this logic for some expressions. So let's try to keep naming uniform.
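The uniform naming proposed in this comment might look roughly like the sketch below. It is only an illustration: `java.lang.String` stands in for `UTF8String`, the `Kind` enum is a hypothetical replacement for the collation lookup, and the ICU path is left unimplemented because it would need the external ICU library.

```java
import java.util.Locale;

// Hypothetical sketch of the proposed naming; not the code that was merged.
final class UniformNamingSketch {
  private UniformNamingSketch() {}

  /** Hypothetical stand-in for the three code paths discussed in the review. */
  enum Kind { BINARY, LOWERCASE, ICU }

  static final class Contains {
    private Contains() {}

    // Runtime entry point: dispatches to the path that fits the collation kind.
    static boolean exec(String l, String r, Kind kind) {
      switch (kind) {
        case BINARY: return binaryExec(l, r);
        case LOWERCASE: return lowercaseExec(l, r);
        default: throw new UnsupportedOperationException("ICU path omitted in this sketch");
      }
    }

    // Codegen entry point: returns the source text of the call to emit.
    static String genCode(String l, String r, Kind kind) {
      String expr = "UniformNamingSketch.Contains.";
      switch (kind) {
        case BINARY: return String.format(expr + "binaryExec(%s, %s)", l, r);
        case LOWERCASE: return String.format(expr + "lowercaseExec(%s, %s)", l, r);
        default: return String.format(expr + "icuExec(%s, %s)", l, r);
      }
    }

    static boolean binaryExec(String l, String r) {
      return l.contains(r);
    }

    static boolean lowercaseExec(String l, String r) {
      return l.toLowerCase(Locale.ROOT).contains(r.toLowerCase(Locale.ROOT));
    }
  }
}
```

With every expression class exposing the same method names, a shared base class or interface could later be introduced without renaming call sites, which is the motivation the comment gives.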
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
dbatomic commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560754473

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ##
@@ -93,9 +101,12 @@ public Collation(
       this.hashFunction = hashFunction;
       this.supportsBinaryEquality = supportsBinaryEquality;
       this.supportsBinaryOrdering = supportsBinaryOrdering;
+      this.supportsLowercaseEquality = collationName.equals("UTF8_BINARY_LCASE");

Review Comment:
Can you pass `supportsLowercaseEquality` as a constructor argument instead of using a string equality check here? You can make it default to false if needed, but I would rather be explicit in Collation construction.
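A minimal sketch of what this request could look like, assuming a heavily simplified `Collation` class: only the fields named in the hunk are modelled, and the class layout and the two-argument overload are hypothetical.

```java
// Hypothetical, simplified stand-in for CollationFactory.Collation.
final class CollationFlagSketch {
  private CollationFlagSketch() {}

  static final class Collation {
    final String collationName;
    final boolean supportsBinaryEquality;
    final boolean supportsLowercaseEquality;

    // Explicit form: the caller states the flag instead of it being derived from the name.
    Collation(
        String collationName,
        boolean supportsBinaryEquality,
        boolean supportsLowercaseEquality) {
      this.collationName = collationName;
      this.supportsBinaryEquality = supportsBinaryEquality;
      this.supportsLowercaseEquality = supportsLowercaseEquality;
    }

    // Convenience overload: the flag defaults to false when not given.
    Collation(String collationName, boolean supportsBinaryEquality) {
      this(collationName, supportsBinaryEquality, false);
    }
  }
}
```

Passing the flag explicitly keeps the behaviour independent of how a collation happens to be named, so renaming or adding a collation cannot silently change which code path it takes.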
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
uros-db commented on PR #45978:
URL: https://github.com/apache/spark/pull/45978#issuecomment-2049008530

updated PR description, stand by for some more small changes before merging
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560453598

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ##
@@ -172,19 +183,31 @@ public Collation(
   }
   /**
-   * Auxiliary methods for collation aware string operations.
+   * Returns a StringSearch object for the given pattern and target strings, under collation
+   * rules corresponding to the given collationId. The external ICU library StringSearch object can
+   * be used to find occurrences of the pattern in the target string, while respecting collation.
    */
-
   public static StringSearch getStringSearch(
-      final UTF8String left,
-      final UTF8String right,
+      final UTF8String targetUTF8String,
+      final UTF8String patternUTF8String,
       final int collationId) {
-    String pattern = right.toString();
-    CharacterIterator target = new StringCharacterIterator(left.toString());
+    String pattern = patternUTF8String.toString();
+    CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString());
     Collator collator = CollationFactory.fetchCollation(collationId).collator;
     return new StringSearch(pattern, target, (RuleBasedCollator) collator);
   }
+  /**
+   * Returns a collation-unaware StringSearch object for the given pattern and target strings.
+   * While this object does not respect collation, it can be used to find occurrences of the pattern
+   * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased).
+   */
+  public static StringSearch getStringSearch(

Review Comment:
It will be needed for custom UTF8_BINARY_LCASE implementations for certain expressions (for example: StringReplace, StringTranslate). Since UTF8_BINARY_LCASE doesn't have an ICU collator instance, we can only call `.toLowerCase` on both arguments and then use this raw (binary) `StringSearch` instance to implement those expressions.
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1560454235

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation-aware expressions (StringExpressions, RegexpExpressions, and
+ * other expressions that require custom collation support), as well as private utility methods for
+ * collation-aware UTF8String operations needed to implement .
+ */
+public final class CollationSupport {
+
+  /**
+   * Collation-aware string expressions.
+   */
+
+  public static class Contains {
+    public static boolean contains(final UTF8String l, final UTF8String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return containsBinary(l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return containsLowercase(l, r);
+      } else {
+        return containsICU(l, r, collationId);
+      }
+    }
+    public static String containsGenCode(final String l, final String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.Contains.contains";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s)", l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s)", l, r);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId);
+      }
+    }
+    public static boolean containsBinary(final UTF8String l, final UTF8String r) {
+      return l.contains(r);
+    }
+    public static boolean containsLowercase(final UTF8String l, final UTF8String r) {
+      return l.toLowerCase().contains(r.toLowerCase());

Review Comment:
agreed, I'll open another ticket for improvements like that
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
cloud-fan commented on code in PR #45978: URL: https://github.com/apache/spark/pull/45978#discussion_r1560452406 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ## @@ -0,0 +1,174 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.util; + +import com.ibm.icu.text.StringSearch; + +import org.apache.spark.unsafe.types.UTF8String; + +/** + * Static entry point for collation-aware expressions (StringExpressions, RegexpExpressions, and + * other expressions that require custom collation support), as well as private utility methods for + * collation-aware UTF8String operations needed to implement . + */ +public final class CollationSupport { + + /** + * Collation-aware string expressions. 
+ */ + + public static class Contains { +public static boolean contains(final UTF8String l, final UTF8String r, final int collationId) { + CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId); + if (collation.supportsBinaryEquality) { +return containsBinary(l, r); + } else if (collation.supportsLowercaseEquality) { +return containsLowercase(l, r); + } else { +return containsICU(l, r, collationId); + } +} +public static String containsGenCode(final String l, final String r, final int collationId) { + CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId); + String expr = "CollationSupport.Contains.contains"; + if (collation.supportsBinaryEquality) { +return String.format(expr + "Binary(%s, %s)", l, r); + } else if (collation.supportsLowercaseEquality) { +return String.format(expr + "Lowercase(%s, %s)", l, r); + } else { +return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId); + } +} +public static boolean containsBinary(final UTF8String l, final UTF8String r) { + return l.contains(r); +} +public static boolean containsLowercase(final UTF8String l, final UTF8String r) { + return l.toLowerCase().contains(r.toLowerCase()); Review Comment: We can probably improve it with `compareLowerCase`, introduced in https://github.com/apache/spark/pull/45816. We can do it later.
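For readers following the `containsLowercase` remark: the quoted implementation lowercases both operands, which allocates two temporary copies per call. A minimal sketch of the kind of improvement hinted at, written against plain `java.lang.String` rather than `UTF8String`, might look like the following. The class and method names are hypothetical, and this does not show the actual `compareLowerCase` from the linked PR; note also that `regionMatches(true, ...)` folds case per character, so it can differ from `toLowerCase()` for characters whose lowercase form changes length.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of the Spark code base.
final class LowercaseContainsSketch {

  // Baseline in the spirit of the quoted containsLowercase:
  // lowercase both operands (two temporary copies), then search.
  public static boolean containsViaCopies(String target, String pattern) {
    return target.toLowerCase(Locale.ROOT).contains(pattern.toLowerCase(Locale.ROOT));
  }

  // Copy-free variant: slide a window over the target and compare each
  // region case-insensitively, without materializing lowercased strings.
  public static boolean containsInPlace(String target, String pattern) {
    int last = target.length() - pattern.length();
    for (int i = 0; i <= last; i++) {
      if (target.regionMatches(true, i, pattern, 0, pattern.length())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(containsViaCopies("Apache SPARK", "spark"));
    System.out.println(containsInPlace("Apache SPARK", "spark"));
    System.out.println(containsInPlace("Apache SPARK", "flink"));
  }
}
```

The trade-off is the usual one: the in-place version avoids allocation but remains a naive O(n*m) scan, so whether it is a win depends on typical operand sizes.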
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
cloud-fan commented on code in PR #45978: URL: https://github.com/apache/spark/pull/45978#discussion_r1560451327 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -172,19 +183,31 @@ public Collation( } /** - * Auxiliary methods for collation aware string operations. + * Returns a StringSearch object for the given pattern and target strings, under collation + * rules corresponding to the given collationId. The external ICU library StringSearch object can + * be used to find occurrences of the pattern in the target string, while respecting collation. */ - public static StringSearch getStringSearch( - final UTF8String left, - final UTF8String right, + final UTF8String targetUTF8String, + final UTF8String patternUTF8String, final int collationId) { -String pattern = right.toString(); -CharacterIterator target = new StringCharacterIterator(left.toString()); +String pattern = patternUTF8String.toString(); +CharacterIterator target = new StringCharacterIterator(targetUTF8String.toString()); Collator collator = CollationFactory.fetchCollation(collationId).collator; return new StringSearch(pattern, target, (RuleBasedCollator) collator); } + /** + * Returns a collation-unaware StringSearch object for the given pattern and target strings. + * While this object does not respect collation, it can be used to find occurrences of the pattern + * in the target string for UTF8_BINARY or UTF8_BINARY_LCASE (if arguments are lowercased). + */ + public static StringSearch getStringSearch( Review Comment: why do we need it?
Re: [PR] [SPARK-47410][SQL] Refactor UTF8String and CollationFactory [spark]
HyukjinKwon commented on PR #45978: URL: https://github.com/apache/spark/pull/45978#issuecomment-2048651969 Can you please fill the PR description?
Re: [PR] [SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]
uros-db commented on code in PR #45978: URL: https://github.com/apache/spark/pull/45978#discussion_r1559445525 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.util; + +import com.ibm.icu.text.StringSearch; + +import org.apache.spark.unsafe.types.UTF8String; + +/** + * Static entry point for collation aware string expressions. + */ +public final class CollationSupport { + + /** + * Collation aware string expressions. + */ + public static class Contains { +public static boolean containsCollationAware(UTF8String l, UTF8String r, int collationId) { Review Comment: agreed, going with `contains` ([ref](https://github.com/apache/spark/pull/45978#discussion_r1559302476))
Re: [PR] [SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]
uros-db commented on code in PR #45978: URL: https://github.com/apache/spark/pull/45978#discussion_r1559441689 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.util; + +import com.ibm.icu.text.StringSearch; + +import org.apache.spark.unsafe.types.UTF8String; + +/** + * Static entry point for collation aware string expressions. + */ +public final class CollationSupport { Review Comment: and another reason is I wanted to have `private static class CollationAwareUTF8String` in the same outer class (there's no reason to expose this API outside of collation awareness for expressions) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
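The visibility argument in this comment can be illustrated with a small sketch: a public nested class acts as the per-expression entry point, while a private nested helper stays reachable only from inside the same outer class. All names below are hypothetical stand-ins (plain `String` instead of `UTF8String`), not the PR's actual code.

```java
import java.util.Locale;

// Hypothetical stand-in for the outer CollationSupport class.
final class OuterEntryPointSketch {
  private OuterEntryPointSketch() {}

  /** Public per-expression entry point, analogous to CollationSupport.Contains. */
  public static class Contains {
    public static boolean exec(String target, String pattern) {
      // Nested classes of the same outer class may use the private helper.
      return CollationAwareHelper.lowercaseIndexOf(target, pattern) >= 0;
    }
  }

  /** Private helper, playing the role of the CollationAwareUTF8String mentioned above. */
  private static class CollationAwareHelper {
    static int lowercaseIndexOf(String target, String pattern) {
      return target.toLowerCase(Locale.ROOT).indexOf(pattern.toLowerCase(Locale.ROOT));
    }
  }

  public static void main(String[] args) {
    System.out.println(Contains.exec("Apache SPARK", "spark"));
  }
}
```

Code outside `OuterEntryPointSketch` can call `OuterEntryPointSketch.Contains.exec(...)`, but referencing `CollationAwareHelper` from another top-level class would not compile, which is the encapsulation the comment is after.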
Re: [PR] [SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]
uros-db commented on code in PR #45978: URL: https://github.com/apache/spark/pull/45978#discussion_r1559438477 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.util; + +import com.ibm.icu.text.StringSearch; + +import org.apache.spark.unsafe.types.UTF8String; + +/** + * Static entry point for collation aware string expressions. + */ +public final class CollationSupport { Review Comment: that could be troublesome for importing in general, and especially in genCode imagine `evaluator.setDefaultImports` getting flooded with: classOf[Expr1].getName classOf[Expr2].getName classOf[Expr3].getName ... classOf[Expr99].getName -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
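To make the import concern concrete: with one outer class, generated code only needs that single class on its default imports, and every expression is addressed through a qualified name. The sketch below mirrors the `containsGenCode` pattern quoted elsewhere in this thread, with a hypothetical `Kind` enum standing in for the `supportsBinaryEquality` / `supportsLowercaseEquality` flags so that it runs without Spark on the classpath.

```java
// Hypothetical illustration of the single-entry-point codegen pattern.
final class GenCodeSketch {

  // Assumed stand-in for the flags carried by CollationFactory.Collation.
  enum Kind { BINARY, LOWERCASE, ICU }

  // Returns the Java source fragment a code generator would emit.
  // Only the outer class name appears, so one import covers all expressions.
  public static String containsGenCode(String l, String r, Kind kind, int collationId) {
    String expr = "CollationSupport.Contains.contains";
    switch (kind) {
      case BINARY:
        return String.format(expr + "Binary(%s, %s)", l, r);
      case LOWERCASE:
        return String.format(expr + "Lowercase(%s, %s)", l, r);
      default:
        return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId);
    }
  }

  public static void main(String[] args) {
    System.out.println(containsGenCode("a", "b", Kind.BINARY, 0));
    System.out.println(containsGenCode("a", "b", Kind.ICU, 7));
  }
}
```

Because the dispatch happens at code-generation time, the emitted fragment calls the specialized method directly and skips the runtime `if`/`else` chain.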
Re: [PR] [SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]
dbatomic commented on code in PR #45978: URL: https://github.com/apache/spark/pull/45978#discussion_r1559428506 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.util; + +import com.ibm.icu.text.StringSearch; + +import org.apache.spark.unsafe.types.UTF8String; + +/** + * Static entry point for collation aware string expressions. + */ +public final class CollationSupport { + + /** + * Collation aware string expressions. + */ + public static class Contains { +public static boolean containsCollationAware(UTF8String l, UTF8String r, int collationId) { Review Comment: Maybe just contains? No need for "collationAware" suffix in this context? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]
dbatomic commented on code in PR #45978: URL: https://github.com/apache/spark/pull/45978#discussion_r1559427040 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java: ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.util; + +import com.ibm.icu.text.StringSearch; + +import org.apache.spark.unsafe.types.UTF8String; + +/** + * Static entry point for collation aware string expressions. + */ +public final class CollationSupport { Review Comment: Maybe we can just put this in new namespace? No need for nested class? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org