Re: [PR] [DRAFT][SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559355890


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {
+
+  /**
+   * Collation aware string expressions.
+   */
+  public static class Contains {
+public static boolean containsCollationAware(UTF8String l, UTF8String r, 
int collationId) {
+  if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) 
{
+return containsBinary(l, r);
+  } else if 
(CollationFactory.fetchCollation(collationId).supportsLowercaseEquality) {
+return containsLowercase(l, r);
+  } else {
+return containsICU(l, r, collationId);
+  }
+}
+public static String containsGenCode(String l, String r, int collationId) {
+  if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) 
{
+return 
String.format("CollationStringExpressions.Contains.containsBinary(%s, %s)", l, 
r);
+  } else if 
(CollationFactory.fetchCollation(collationId).supportsLowercaseEquality) {
+return 
String.format("CollationStringExpressions.Contains.containsLowercase(%s, %s)", 
l, r);
+  } else {
+return 
String.format("CollationStringExpressions.Contains.containsICU(%s, %s, %d)", l, 
r, collationId);
+  }
+}
+public static boolean containsBinary(UTF8String l, UTF8String r) {

Review Comment:
   correction: this is incorrect
   it must remain public because of codeGen



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [DRAFT][SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559298235


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {

Review Comment:
   I think most of the code is in relation to `StringExpressions` collation 
awareness, rather than `UTF8String` operations per se
   that said, `CollationStringExpressions` may sound a bit misleading too
   
   maybe we could go with just `CollationSupport.java`? (for example, this will 
eventually cover `StringExpressions`, as well as `RegexpExpressions`, and 
possibly some other expressions as well that are in neither of these - like 
`Between`, alongside any internal helper functions for collation-aware 
`UTF8String` operations)



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {

Review Comment:
   I think most of the code is in relation to `StringExpressions` collation 
awareness, rather than `UTF8String` operations per se
   that said, `CollationStringExpressions` may sound a bit misleading too
   
   maybe we could go with just `CollationSupport.java`? (for example, this will 
eventually cover `StringExpressions`, as well as `RegexpExpressions`, and 
possibly some other expressions as well that are in neither of these - like 
`Between`; alongside any internal helper functions for collation-aware 
`UTF8String` operations)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [DRAFT][SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559304749


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {
+
+  /**
+   * Collation aware string expressions.
+   */
+  public static class Contains {
+public static boolean containsCollationAware(UTF8String l, UTF8String r, 
int collationId) {
+  if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) 
{
+return containsBinary(l, r);
+  } else if 
(CollationFactory.fetchCollation(collationId).supportsLowercaseEquality) {
+return containsLowercase(l, r);
+  } else {
+return containsICU(l, r, collationId);
+  }
+}
+public static String containsGenCode(String l, String r, int collationId) {
+  if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) 
{
+return 
String.format("CollationStringExpressions.Contains.containsBinary(%s, %s)", l, 
r);
+  } else if 
(CollationFactory.fetchCollation(collationId).supportsLowercaseEquality) {
+return 
String.format("CollationStringExpressions.Contains.containsLowercase(%s, %s)", 
l, r);
+  } else {
+return 
String.format("CollationStringExpressions.Contains.containsICU(%s, %s, %d)", l, 
r, collationId);
+  }
+}
+public static boolean containsBinary(UTF8String l, UTF8String r) {

Review Comment:
   these can (and should) probably be private



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [DRAFT][SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559304159


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {
+
+  /**
+   * Collation aware string expressions.
+   */
+  public static class Contains {
+public static boolean containsCollationAware(UTF8String l, UTF8String r, 
int collationId) {
+  if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) 
{
+return containsBinary(l, r);

Review Comment:
   and more on the uniformity side, we could go with:
   binary(l, r);
   lowercase(l, r);
   icu(l, r, collationId);
   
   instead of: containsBinary, containsLowercase, containsICU
   
   although the shorter names may end up causing some confusion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [DRAFT][SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559302476


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {
+
+  /**
+   * Collation aware string expressions.
+   */
+  public static class Contains {
+public static boolean containsCollationAware(UTF8String l, UTF8String r, 
int collationId) {

Review Comment:
   here, we could go with `contains` instead of `containsCollationAware`, so 
that calls look like:
   CollationSupport.Contains.contains(l, r, collationId)
   CollationSupport.StartsWith.startsWith(l, r, collationId)
   CollationSupport.EndsWith.endsWith(l, r, collationId)
   etc.
   
   or even something more general such as `execute`, so that we retain some 
uniformity over different expressions:
   CollationSupport.Contains.execute(l, r, collationId)
   CollationSupport.StartsWith.execute(l, r, collationId)
   CollationSupport.EndsWith.execute(l, r, collationId)
   etc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [DRAFT][SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559298235


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {

Review Comment:
   I think most of the code is in relation to `StringExpressions` collation 
awareness, rather than `UTF8String` operations per se
   that said, `CollationStringExpressions` may sound a bit misleading too
   
   maybe we could go with just `CollationSupport.java`? (for example, this will 
eventually cover `StringExpressions`, as well as `RegexpExpressions`, and 
possibly some other expressions as well that are in neither of these - like 
`Between`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [DRAFT][SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


uros-db commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559299218


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {

Review Comment:
   and I believe calls such as: 
`CollationSupport.Contains.containsCollationAware(l, r, collationId)`
   
   would look / sound descriptive enough



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [DRAFT][SPARK-47410][SQL] refactor UTF8String and CollationFactory [spark]

2024-04-10 Thread via GitHub


dbatomic commented on code in PR #45978:
URL: https://github.com/apache/spark/pull/45978#discussion_r1559290838


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationStringExpressions.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.text.StringSearch;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Static entry point for collation aware string expressions.
+ */
+public final class CollationStringExpressions {

Review Comment:
   Probably some different name would be better, given that these are not 
expressions in Spark sense. E.g. UTF8StringCollationOperations?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org