Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
coderfender commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2666346598
##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,81 @@ trait CommonStringExprs {
}
}
}
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
+          !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+          withInfo(
+            expr,
+            s"Regexp pattern $pattern is not compatible with Spark. " +
+              s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+              "to allow it anyway.")
+          return Incompatible()
+        }
+      case _ =>
+        return Unsupported(Some("Only literal regexp patterns are supported"))
Review Comment:
minor / nit / nice to have: "reg exp" could be written out as "regular expression" or "regex" (that should be more helpful for non-native English speakers).
##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
}
}
}
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
+          !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+          withInfo(
+            expr,
+            s"Regexp pattern $pattern is not compatible with Spark. " +
+              s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+              "to allow it anyway.")
+          return Incompatible()
+        }
+      case _ =>
+        return Unsupported(Some("Only literal regexp patterns are supported"))
+    }
+
+    // Check if idx is a literal
+    expr.idx match {
+      case Literal(_, DataTypes.IntegerType) =>
+        Compatible()
+      case _ =>
+        Unsupported(Some("Only literal group index is supported"))
Review Comment:
thank you
##
docs/source/user-guide/latest/configs.md:
##
@@ -294,6 +294,8 @@ These settings can be used to determine which parts of the plan are accelerated
| `spark.comet.expression.RLike.enabled` | Enable Comet acceleration for `RLike` | true |
| `spark.comet.expression.Rand.enabled` | Enable Comet acceleration for `Rand` | true |
| `spark.comet.expression.Randn.enabled` | Enable Comet acceleration for `Randn` | true |
+| `spark.comet.expression.RegExpExtract.enabled` | Enable Comet acceleration for `RegExpExtract` | true |
+| `spark.comet.expression.RegExpExtractAll.enabled` | Enable Comet acceleration for `RegExpExtractAll` | true |
Review Comment:
Perhaps these configs should default to false? (given that COMET_REGEXP_ALLOW_INCOMPATIBLE defaults to false)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2650434033
##
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:
##
@@ -391,4 +391,315 @@ class CometStringExpressionSuite extends CometTestBase {
}
}
+ test("regexp_extract basic") {
Review Comment:
Also migrated in the latest commit, thanks!
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2650416405
##
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:
##
@@ -391,4 +391,315 @@ class CometStringExpressionSuite extends CometTestBase {
}
}
+  test("regexp_extract basic") {
+    withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") {
+      val data = Seq(
+        ("100-200", 1),
+        ("300-400", 1),
+        (null, 1), // NULL input
+        ("no-match", 1), // no match → should return ""
+        ("abc123def456", 1),
+        ("", 1) // empty string
+      )
+
+      withParquetTable(data, "tbl") {
+        // Test basic extraction: group 0 (full match)
+        checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 0) FROM tbl")
+        // Test group 1
+        checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 1) FROM tbl")
Review Comment:
Thanks for catching this issue, already fixed it in latest commit!
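As an aside for readers following the thread, the NULL and no-match semantics called out in the test data comments above can be sketched outside Spark. This Python model is illustrative only (the helper name and the use of `re.search` are assumptions, not Comet's implementation), but it captures the behavior the test expects: NULL input yields NULL, and a non-matching pattern yields an empty string.

```python
import re

def regexp_extract(subject, pattern, idx):
    # Sketch of Spark's regexp_extract semantics:
    #   NULL input         -> NULL output
    #   no match           -> empty string
    #   non-matching group -> empty string
    if subject is None:
        return None
    m = re.search(pattern, subject)
    if m is None:
        return ""
    return m.group(idx) or ""

assert regexp_extract("100-200", r"(\d+)-(\d+)", 1) == "100"
assert regexp_extract("no-match", r"(\d+)-(\d+)", 1) == ""
assert regexp_extract(None, r"(\d+)-(\d+)", 1) is None
assert regexp_extract("", r"(\d+)-(\d+)", 0) == ""
```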
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2648928758
##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
}
}
}
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
+          !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+          withInfo(
+            expr,
+            s"Regexp pattern $pattern is not compatible with Spark. " +
+              s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+              "to allow it anyway.")
+          return Incompatible()
+        }
+      case _ =>
+        return Unsupported(Some("Only literal regexp patterns are supported"))
+    }
+
+    // Check if idx is a literal
+    expr.idx match {
+      case Literal(_, DataTypes.IntegerType) =>
+        Compatible()
+      case _ =>
+        Unsupported(Some("Only literal group index is supported"))
+    }
+  }
+
+  override def convert(
+      expr: RegExpExtract,
+      inputs: Seq[Attribute],
+      binding: Boolean): Option[Expr] = {
+    val subjectExpr = exprToProtoInternal(expr.subject, inputs, binding)
+    val patternExpr = exprToProtoInternal(expr.regexp, inputs, binding)
+    val idxExpr = exprToProtoInternal(expr.idx, inputs, binding)
+    val optExpr = scalarFunctionExprToProto("regexp_extract", subjectExpr, patternExpr, idxExpr)
+    optExprWithInfo(optExpr, expr, expr.subject, expr.regexp, expr.idx)
+  }
+}
+
+object CometRegExpExtractAll extends CometExpressionSerde[RegExpExtractAll] {
+  override def getSupportLevel(expr: RegExpExtractAll): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
Review Comment:
Thanks for the clarification!
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#issuecomment-3693765968

> Is there a way to be confident that an arbitrary user-provided regular expression, which Spark expects to be in Java format, is:
>
> 1. valid in the intended native version
> 2. has the same semantics 😅
>
> e.g.
>
> * [[CH] fallback for unsupported regex in re2 incubator-gluten#7866](https://github.com/apache/incubator-gluten/issues/7866)
> * https://docs.rs/regex/latest/regex/ doesn't support backreferences--in this case at least you could get a plan-time error, assuming the pattern argument is constant

Good point! I share the same concern about the semantic differences between Java's regex engine and the native implementation. That is exactly why `isSupportedPattern` is currently hardcoded to return false. We are taking a conservative approach here: the feature is effectively disabled until we can prove via testing that specific patterns are compatible (as mentioned by Andy: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640735448).
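To make the backreference point above concrete: a backreference re-matches the text captured by an earlier group, which requires backtracking that the Rust `regex` crate deliberately does not implement. Python's `re` engine supports backreferences like Java's does, so it can illustrate the semantics; this snippet is illustrative only and says nothing about Comet's native behavior.

```python
import re

# (\w)\1 means "a word character followed by the same character again".
# Java's java.util.regex (used by Spark) accepts this pattern; the Rust
# regex crate documents that it rejects backreferences when the pattern
# is compiled, which is why a plan-time check is attractive.
pattern = r"(\w)\1"

assert re.search(pattern, "look") is not None   # "oo" is a doubled character
assert re.search(pattern, "spark") is None      # no character appears twice in a row
```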
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
andygrove commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640824998
##
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:
##
@@ -391,4 +391,315 @@ class CometStringExpressionSuite extends CometTestBase {
}
}
+  test("regexp_extract basic") {
+    withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") {
+      val data = Seq(
+        ("100-200", 1),
+        ("300-400", 1),
+        (null, 1), // NULL input
+        ("no-match", 1), // no match → should return ""
+        ("abc123def456", 1),
+        ("", 1) // empty string
+      )
+
+      withParquetTable(data, "tbl") {
+        // Test basic extraction: group 0 (full match)
+        checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 0) FROM tbl")
+        // Test group 1
+        checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 1) FROM tbl")
Review Comment:
```suggestion
      checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(d+)-(d+)', 1) FROM tbl")
```
The escaping is incorrect for many of these queries. Here is debug logging
for this query now, which demonstrates that it is not actually extracting
anything:
```
SELECT regexp_extract(_1, '(\d+)-(\d+)', 1) FROM tbl

*(1) CometColumnarToRow
+- CometProject [regexp_extract(_1, (d+)-(d+), 1)#33], [regexp_extract(_1#6, (d+)-(d+), 1) AS regexp_extract(_1, (d+)-(d+), 1)#33]
   +- CometScan [native_iceberg_compat] parquet [_1#6] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-46241ff7-ee37-48fe-ad88-ea06bceae81f], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string>

+--------------------------------+
|regexp_extract(_1, (d+)-(d+), 1)|
+--------------------------------+
|                                |
|                                |
|                                |
|                                |
|                                |
|NULL                            |
+--------------------------------+
```
The issue is double-escaping. In the Scala string:
- `(\\d+)-(\\d+)` → Scala produces `(\d+)-(\d+)`
- But SQL then interprets `\d` as an escape sequence → becomes `(d+)-(d+)`
After fixing this, I see:
```
SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 0) FROM tbl

*(1) CometColumnarToRow
+- CometProject [regexp_extract(_1, (\d+)-(\d+), 0)#15], [regexp_extract(_1#6, (\d+)-(\d+), 0) AS regexp_extract(_1, (\d+)-(\d+), 0)#15]
   +- CometScan [native_iceberg_compat] parquet [_1#6] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-901feb7b-b26a-4f2a-99ff-fc594af56804], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string>

+----------------------------------+
|regexp_extract(_1, (\d+)-(\d+), 0)|
+----------------------------------+
|                                  |
|                                  |
|                                  |
|100-200                           |
|300-400                           |
|NULL                              |
+----------------------------------+
```
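The two unescaping layers described above can be modeled outside Spark. This Python sketch is only a rough approximation of Spark's SQL string-literal parsing (assuming the default `spark.sql.parser.escapedStringLiterals=false`, under which an unrecognized sequence like `\d` collapses to `d`, matching the debug output above):

```python
import re

# Layer 1: the host-language string literal "(\\d+)-(\\d+)" (Scala and
# Python escape identically here) compiles to the characters (\d+)-(\d+).
scala_literal = "(\\d+)-(\\d+)"
assert scala_literal == r"(\d+)-(\d+)"

# Layer 2: a crude model of Spark's SQL literal unescaping, which consumes
# the backslash of an unrecognized escape sequence, leaving just "d".
sql_sees = re.sub(r"\\(.)", r"\1", scala_literal)
assert sql_sees == "(d+)-(d+)"
```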
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
andygrove commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640795004
##
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:
##
@@ -391,4 +391,315 @@ class CometStringExpressionSuite extends CometTestBase {
}
}
+ test("regexp_extract basic") {
Review Comment:
nit: We should probably move these tests to a new
`CometRegexpExpressionSuite`, but it does not need to happen in this PR.
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
andygrove commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640782456
##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
}
}
}
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
+          !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+          withInfo(
+            expr,
+            s"Regexp pattern $pattern is not compatible with Spark. " +
+              s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+              "to allow it anyway.")
+          return Incompatible()
+        }
+      case _ =>
+        return Unsupported(Some("Only literal regexp patterns are supported"))
+    }
+
+    // Check if idx is a literal
+    expr.idx match {
+      case Literal(_, DataTypes.IntegerType) =>
+        Compatible()
+      case _ =>
+        Unsupported(Some("Only literal group index is supported"))
Review Comment:
Spark only supports i32 for `idx`, so the current code seems correct
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
andygrove commented on PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#issuecomment-3683202184

Thanks @danielhumanmod. This PR is looking good. I am going to review in more detail later today. Apologies for the delay in getting to this one.
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
andygrove commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640735448
##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
}
}
}
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
+          !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+          withInfo(
+            expr,
+            s"Regexp pattern $pattern is not compatible with Spark. " +
+              s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+              "to allow it anyway.")
+          return Incompatible()
+        }
+      case _ =>
+        return Unsupported(Some("Only literal regexp patterns are supported"))
+    }
+
+    // Check if idx is a literal
+    expr.idx match {
+      case Literal(_, DataTypes.IntegerType) =>
+        Compatible()
+      case _ =>
+        Unsupported(Some("Only literal group index is supported"))
+    }
+  }
+
+  override def convert(
+      expr: RegExpExtract,
+      inputs: Seq[Attribute],
+      binding: Boolean): Option[Expr] = {
+    val subjectExpr = exprToProtoInternal(expr.subject, inputs, binding)
+    val patternExpr = exprToProtoInternal(expr.regexp, inputs, binding)
+    val idxExpr = exprToProtoInternal(expr.idx, inputs, binding)
+    val optExpr = scalarFunctionExprToProto("regexp_extract", subjectExpr, patternExpr, idxExpr)
+    optExprWithInfo(optExpr, expr, expr.subject, expr.regexp, expr.idx)
+  }
+}
+
+object CometRegExpExtractAll extends CometExpressionSerde[RegExpExtractAll] {
+  override def getSupportLevel(expr: RegExpExtractAll): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
Review Comment:
The goal is to determine (via extensive fuzz testing) whether Comet is actually already compatible with Spark for certain simple cases; if so, we can update `RegExp.isSupportedPattern`. It may be that there are no such cases.
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
acruise commented on PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#issuecomment-3637992442

Is there a way to be confident that an arbitrary user-provided regular expression, which Spark expects to be in Java format, is:
1. valid in the intended native version
2. has the same semantics 😅
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#issuecomment-3623432768

Hey @coderfender @martin-g, I’ve addressed the comments and refactored the Rust implementation based on the approach used in the similar UDF [regexp_replace](https://github.com/apache/datafusion/blob/fc6d0a4287ab8054d9f56c9ce0a13ffc0ec3c170/datafusion/functions/src/regex/regexpreplace.rs#L88) in DataFusion. Whenever you have time, I’d appreciate another look. Thanks again for the helpful feedback!
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2596611672
##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,401 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::sync::Arc;
+use std::any::Any;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtract {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        "regexp_extract"
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        // regexp_extract always returns String
+        Ok(DataType::Utf8)
+    }
Review Comment:
Added in latest commit, thanks!
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2596610973
##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
}
}
}
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
+          !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+          withInfo(
+            expr,
+            s"Regexp pattern $pattern is not compatible with Spark. " +
+              s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+              "to allow it anyway.")
+          return Incompatible()
+        }
+      case _ =>
+        return Unsupported(Some("Only literal regexp patterns are supported"))
+    }
+
+    // Check if idx is a literal
+    expr.idx match {
+      case Literal(_, DataTypes.IntegerType) =>
+        Compatible()
+      case _ =>
+        Unsupported(Some("Only literal group index is supported"))
Review Comment:
Looking back at this comment: given that Spark only converts the index to i32 in the UDF implementation, do we want to keep the behavior aligned and support only IntegerType here? Let me know if that's acceptable; I can adjust the patch.
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583897542
##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
}
}
}
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
+          !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+          withInfo(
+            expr,
+            s"Regexp pattern $pattern is not compatible with Spark. " +
+              s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+              "to allow it anyway.")
+          return Incompatible()
+        }
+      case _ =>
+        return Unsupported(Some("Only literal regexp patterns are supported"))
+    }
+
+    // Check if idx is a literal
+    expr.idx match {
+      case Literal(_, DataTypes.IntegerType) =>
+        Compatible()
+      case _ =>
+        Unsupported(Some("Only literal group index is supported"))
Review Comment:
Good catch, will add support for these data types
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583879942
##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
}
}
}
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
+          !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+          withInfo(
+            expr,
+            s"Regexp pattern $pattern is not compatible with Spark. " +
+              s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+              "to allow it anyway.")
+          return Incompatible()
+        }
+      case _ =>
+        return Unsupported(Some("Only literal regexp patterns are supported"))
+    }
+
+    // Check if idx is a literal
+    expr.idx match {
+      case Literal(_, DataTypes.IntegerType) =>
+        Compatible()
+      case _ =>
+        Unsupported(Some("Only literal group index is supported"))
+    }
+  }
+
+  override def convert(
+      expr: RegExpExtract,
+      inputs: Seq[Attribute],
+      binding: Boolean): Option[Expr] = {
+    val subjectExpr = exprToProtoInternal(expr.subject, inputs, binding)
+    val patternExpr = exprToProtoInternal(expr.regexp, inputs, binding)
+    val idxExpr = exprToProtoInternal(expr.idx, inputs, binding)
+    val optExpr =
+      scalarFunctionExprToProto("regexp_extract", subjectExpr, patternExpr, idxExpr)
+    optExprWithInfo(optExpr, expr, expr.subject, expr.regexp, expr.idx)
+  }
+}
+
+object CometRegExpExtractAll extends CometExpressionSerde[RegExpExtractAll] {
+  override def getSupportLevel(expr: RegExpExtractAll): SupportLevel = {
+    // Check if the pattern is compatible with Spark or allow incompatible patterns
+    expr.regexp match {
+      case Literal(pattern, DataTypes.StringType) =>
+        if (!RegExp.isSupportedPattern(pattern.toString) &&
Review Comment:
For now it is a placeholder introduced in this
[PR](https://github.com/danielhumanmod/datafusion-comet/commit/e33d56050187c75afcbbdc2def96a093a019db6a#diff-8aa4d7fa45d8c9eb2d228424f3c64fa2979c4072ab28dd9582729b20ccc7ade5R25)
@andygrove when you have a chance, could you share a bit more context on its
purpose? I’d really appreciate it. Thank you!
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583867063
##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,452 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::any::Any;
+use std::sync::Arc;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtract {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        "regexp_extract"
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        // regexp_extract always returns String
+        Ok(DataType::Utf8)
+    }
+
+    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+        // regexp_extract(subject, pattern, idx)
+        if args.args.len() != 3 {
+            return exec_err!(
+                "regexp_extract expects 3 arguments, got {}",
+                args.args.len()
+            );
+        }
+
+        let subject = &args.args[0];
+        let pattern = &args.args[1];
+        let idx = &args.args[2];
+
+        // Pattern must be a literal string
+        let pattern_str = match pattern {
+            ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
+            _ => {
+                return exec_err!("regexp_extract pattern must be a string literal");
+            }
+        };
+
+        // idx must be a literal int
+        let idx_val = match idx {
+            ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
+            _ => {
+                return exec_err!("regexp_extract idx must be an integer literal");
+            }
+        };
+
+        // Compile regex once
+        let regex = Regex::new(&pattern_str).map_err(|e| {
+            internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
+        })?;
+
+        match subject {
+            ColumnarValue::Array(array) => {
+                let result = regexp_extract_array(array, &regex, idx_val)?;
+                Ok(ColumnarValue::Array(result))
+            }
+            ColumnarValue::Scalar(ScalarValue::Utf8(s)) => {
+                let result = match s {
+                    Some(text) => Some(extract_group(text, &regex, idx_val)?),
+                    None => None, // NULL input → NULL output
+                };
+                Ok(ColumnarValue::Scalar(ScalarValue::Utf8(result)))
+            }
+            _ => exec_err!("regexp_extract expects string input"),
+        }
+    }
+
+    fn aliases(&self) -> &[String] {
+        &self.aliases
+    }
+}
+
+/// Spark-compatible regexp_extract_all function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtractAll {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtractAll {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtractAll {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegExpExtractAll {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        "regexp_extract_all"
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        // regexp_extrac
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583619102
##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,401 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::sync::Arc;
+use std::any::Any;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtract {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        "regexp_extract"
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        // regexp_extract always returns String
+        Ok(DataType::Utf8)
+    }
+
+    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+        // regexp_extract(subject, pattern, idx)
+        if args.args.len() != 3 {
+            return exec_err!(
+                "regexp_extract expects 3 arguments, got {}",
+                args.args.len()
+            );
+        }
+
+        let subject = &args.args[0];
+        let pattern = &args.args[1];
+        let idx = &args.args[2];
+
+        // Pattern must be a literal string
+        let pattern_str = match pattern {
+            ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
+            _ => {
+                return exec_err!("regexp_extract pattern must be a string literal");
+            }
+        };
+
+        // idx must be a literal int
+        let idx_val = match idx {
+            ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
+            _ => {
+                return exec_err!("regexp_extract idx must be an integer literal");
+            }
+        };
+
+        // Compile regex once
+        let regex = Regex::new(&pattern_str).map_err(|e| {
+            internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
+        })?;
+
+        match subject {
Review Comment:
Will extract some shared code, thanks for the suggestion!
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583618322
##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,401 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::sync::Arc;
+use std::any::Any;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtract {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        "regexp_extract"
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        // regexp_extract always returns String
+        Ok(DataType::Utf8)
+    }
+
+    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+        // regexp_extract(subject, pattern, idx)
+        if args.args.len() != 3 {
+            return exec_err!(
+                "regexp_extract expects 3 arguments, got {}",
+                args.args.len()
+            );
+        }
+
+        let subject = &args.args[0];
+        let pattern = &args.args[1];
+        let idx = &args.args[2];
+
+        // Pattern must be a literal string
+        let pattern_str = match pattern {
+            ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
+            _ => {
+                return exec_err!("regexp_extract pattern must be a string literal");
+            }
+        };
+
+        // idx must be a literal int
+        let idx_val = match idx {
+            ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
+            _ => {
+                return exec_err!("regexp_extract idx must be an integer literal");
+            }
Review Comment:
Good point. I checked Spark's implementation; the index there is only cast to i32. I was wondering if we should follow their behavior:
https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L811C21-L811C33
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583613359
##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,452 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::any::Any;
+use std::sync::Arc;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtract {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        "regexp_extract"
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        // regexp_extract always returns String
+        Ok(DataType::Utf8)
+    }
+
+    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+        // regexp_extract(subject, pattern, idx)
+        if args.args.len() != 3 {
+            return exec_err!(
+                "regexp_extract expects 3 arguments, got {}",
+                args.args.len()
+            );
+        }
+
+        let subject = &args.args[0];
+        let pattern = &args.args[1];
+        let idx = &args.args[2];
+
+        // Pattern must be a literal string
+        let pattern_str = match pattern {
+            ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
+            _ => {
+                return exec_err!("regexp_extract pattern must be a string literal");
+            }
+        };
+
+        // idx must be a literal int
+        let idx_val = match idx {
+            ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
Review Comment:
Good call, will fix this, thanks!
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
codecov-commenter commented on PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#issuecomment-3597346236

## Codecov Report

:x: Patch coverage is 9.09% with 40 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 21.18%, comparing base (`f09f8af`) to head (`ff1ebd6`). :warning: Report is 730 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| spark/src/main/scala/org/apache/comet/serde/strings.scala | 4.76% | 40 Missing :warning: |

Additional details and impacted files:

```diff
@@             Coverage Diff              @@
##             main    #2831       +/-   ##
===========================================
- Coverage   56.12%   21.18%   -34.95%
+ Complexity    976      456      -520
===========================================
  Files         119      167       +48
  Lines       11743    15232     +3489
  Branches     2251     2531      +280
===========================================
- Hits         6591     3227     -3364
- Misses       4012    11480     +7468
+ Partials     1140      525      -615
```
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
martin-g commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2577039322
##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,452 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::any::Any;
+use std::sync::Arc;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtract {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        "regexp_extract"
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        // regexp_extract always returns String
+        Ok(DataType::Utf8)
+    }
+
+    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+        // regexp_extract(subject, pattern, idx)
+        if args.args.len() != 3 {
+            return exec_err!(
+                "regexp_extract expects 3 arguments, got {}",
+                args.args.len()
+            );
+        }
+
+        let subject = &args.args[0];
+        let pattern = &args.args[1];
+        let idx = &args.args[2];
+
+        // Pattern must be a literal string
+        let pattern_str = match pattern {
+            ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
+            _ => {
+                return exec_err!("regexp_extract pattern must be a string literal");
+            }
+        };
+
+        // idx must be a literal int
+        let idx_val = match idx {
+            ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
Review Comment:
Negative `i` will lead to a big `idx_val` here. Does it need some kind of validation?
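
For context, Spark itself guards the group index before extracting. A sketch of that guard, loosely modeled on Spark's `checkGroupIndex` helper (the method name is real; the exact messages here are illustrative): a negative index, or one larger than the pattern's group count, raises an error instead of silently wrapping.

```scala
// Illustrative sketch of Spark-style group-index validation.
def checkGroupIndex(groupCount: Int, groupIndex: Int): Unit = {
  if (groupIndex < 0) {
    // Negative indexes are rejected outright, never wrapped to a large value.
    throw new IllegalArgumentException(
      "The specified group index cannot be less than zero")
  }
  if (groupIndex > groupCount) {
    throw new IllegalArgumentException(
      s"Regex group count is $groupCount, but the specified group index is $groupIndex")
  }
}
```

On the native side the analogous fix would be to reject a negative `idx` before the `as usize` cast (for example with a checked conversion such as `usize::try_from`), since `-1i32 as usize` wraps to `usize::MAX`.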
##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,452 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::any::Any;
+use std::sync::Arc;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtract {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegE
Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]
coderfender commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2575900748
##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,401 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::sync::Arc;
+use std::any::Any;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtract {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        "regexp_extract"
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        // regexp_extract always returns String
+        Ok(DataType::Utf8)
+    }
Review Comment:
@danielhumanmod, could we verify we don't return a `LargeUtf8` as well?
##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,401 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::sync::Arc;
+use std::any::Any;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtract {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        "regexp_extract"
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        // regexp_extract always returns String
+        Ok(DataType::Utf8)
+    }
+
+    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+        // regexp_extract(subject, pattern, idx)
+        if args.args.len() != 3 {
+            return exec_err!(
+                "regexp_extract expects 3 arguments, got {}",
+                args.args.len()
+            );
+        }
+
+        let subject = &args.args[0];
+        let pattern = &args.args[1];
+        let idx = &args.args[2];
+
+        // Pattern must be a literal string
+        let pattern_str = match pattern {
+            Column
