Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2026-01-06 Thread via GitHub


coderfender commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2666346598


##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,81 @@ trait CommonStringExprs {
 }
   }
 }
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&
+  !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+  withInfo(
+expr,
+s"Regexp pattern $pattern is not compatible with Spark. " +
+  s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+  "to allow it anyway.")
+  return Incompatible()
+}
+  case _ =>
+return Unsupported(Some("Only literal regexp patterns are supported"))

Review Comment:
   minor / nit / nice to have: "reg exp" could be spelled out as "regular expression" or "regex" (which should be more helpful for non-native English speakers)



##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
 }
   }
 }
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&
+  !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+  withInfo(
+expr,
+s"Regexp pattern $pattern is not compatible with Spark. " +
+  s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+  "to allow it anyway.")
+  return Incompatible()
+}
+  case _ =>
+return Unsupported(Some("Only literal regexp patterns are supported"))
+}
+
+// Check if idx is a literal
+expr.idx match {
+  case Literal(_, DataTypes.IntegerType) =>
+Compatible()
+  case _ =>
+Unsupported(Some("Only literal group index is supported"))

Review Comment:
   thank you



##
docs/source/user-guide/latest/configs.md:
##
@@ -294,6 +294,8 @@ These settings can be used to determine which parts of the plan are accelerated
 | `spark.comet.expression.RLike.enabled` | Enable Comet acceleration for `RLike` | true |
 | `spark.comet.expression.Rand.enabled` | Enable Comet acceleration for `Rand` | true |
 | `spark.comet.expression.Randn.enabled` | Enable Comet acceleration for `Randn` | true |
+| `spark.comet.expression.RegExpExtract.enabled` | Enable Comet acceleration for `RegExpExtract` | true |
+| `spark.comet.expression.RegExpExtractAll.enabled` | Enable Comet acceleration for `RegExpExtractAll` | true |

Review Comment:
   Perhaps the default configs can be false? (given that COMET_REGEXP_ALLOW_INCOMPATIBLE is defaulted to false)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-29 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2650434033


##
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:
##
@@ -391,4 +391,315 @@ class CometStringExpressionSuite extends CometTestBase {
 }
   }
 
+  test("regexp_extract basic") {

Review Comment:
   Also migrated in the latest commit, thanks!






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-29 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2650416405


##
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:
##
@@ -391,4 +391,315 @@ class CometStringExpressionSuite extends CometTestBase {
 }
   }
 
+  test("regexp_extract basic") {
+withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") {
+  val data = Seq(
+("100-200", 1),
+("300-400", 1),
+(null, 1), // NULL input
+("no-match", 1), // no match → should return ""
+("abc123def456", 1),
+("", 1) // empty string
+  )
+
+  withParquetTable(data, "tbl") {
+// Test basic extraction: group 0 (full match)
+checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 0) FROM tbl")
+// Test group 1
+checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 1) FROM tbl")

Review Comment:
   Thanks for catching this issue, already fixed it in latest commit!
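
   For reference, the Spark `regexp_extract` edge cases these tests exercise (NULL input returns NULL; no match or empty input returns the empty string) can be modeled with a minimal sketch on top of `java.util.regex`, the same engine Spark uses on the JVM side. This is an illustrative model only, not Comet's native implementation:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexpExtractModel {
    // Illustrative model of Spark's regexp_extract semantics:
    // null subject -> null; no match -> ""; otherwise the requested
    // capture group (group 0 is the full match).
    static String regexpExtract(String subject, String pattern, int idx) {
        if (subject == null) {
            return null;
        }
        Matcher m = Pattern.compile(pattern).matcher(subject);
        if (!m.find()) {
            return "";
        }
        String group = m.group(idx);
        return group == null ? "" : group;
    }

    public static void main(String[] args) {
        if (!regexpExtract("100-200", "(\\d+)-(\\d+)", 1).equals("100")) throw new AssertionError();
        if (!regexpExtract("no-match", "(\\d+)-(\\d+)", 1).equals("")) throw new AssertionError();
        if (regexpExtract(null, "(\\d+)-(\\d+)", 0) != null) throw new AssertionError();
        if (!regexpExtract("", "(\\d+)-(\\d+)", 0).equals("")) throw new AssertionError();
        System.out.println("ok");
    }
}
```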






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-26 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2648928758


##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
 }
   }
 }
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&
+  !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+  withInfo(
+expr,
+s"Regexp pattern $pattern is not compatible with Spark. " +
+  s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+  "to allow it anyway.")
+  return Incompatible()
+}
+  case _ =>
+return Unsupported(Some("Only literal regexp patterns are supported"))
+}
+
+// Check if idx is a literal
+expr.idx match {
+  case Literal(_, DataTypes.IntegerType) =>
+Compatible()
+  case _ =>
+Unsupported(Some("Only literal group index is supported"))
+}
+  }
+
+  override def convert(
+  expr: RegExpExtract,
+  inputs: Seq[Attribute],
+  binding: Boolean): Option[Expr] = {
+val subjectExpr = exprToProtoInternal(expr.subject, inputs, binding)
+val patternExpr = exprToProtoInternal(expr.regexp, inputs, binding)
+val idxExpr = exprToProtoInternal(expr.idx, inputs, binding)
+val optExpr = scalarFunctionExprToProto("regexp_extract", subjectExpr, patternExpr, idxExpr)
+optExprWithInfo(optExpr, expr, expr.subject, expr.regexp, expr.idx)
+  }
+}
+
+object CometRegExpExtractAll extends CometExpressionSerde[RegExpExtractAll] {
+  override def getSupportLevel(expr: RegExpExtractAll): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&

Review Comment:
   Thanks for the clarification!






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-26 Thread via GitHub


danielhumanmod commented on PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#issuecomment-3693765968

   > Is there a way to be confident that an arbitrary user-provided regular expression, which Spark expects to be in Java format, is:
   > 
   > 1. valid in the intended native version
   > 2. has the same semantics 😅
   > 
   > e.g.
   > 
   > * [[CH] fallback for unsupported regex in re2 incubator-gluten#7866](https://github.com/apache/incubator-gluten/issues/7866)
   > * https://docs.rs/regex/latest/regex/ doesn't support backreferences--in this case at least you could get a plan-time error, assuming the pattern argument is constant
   
   Good point! I share the same concern regarding the semantic differences between Java's regex engine and the native implementation.
   
   That is exactly why `isSupportedPattern` is currently hardcoded to return false. We are taking a conservative approach here: the feature is effectively disabled until we can prove compatibility via testing (as mentioned by Andy: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640735448).




Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-22 Thread via GitHub


andygrove commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640824998


##
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:
##
@@ -391,4 +391,315 @@ class CometStringExpressionSuite extends CometTestBase {
 }
   }
 
+  test("regexp_extract basic") {
+withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") {
+  val data = Seq(
+("100-200", 1),
+("300-400", 1),
+(null, 1), // NULL input
+("no-match", 1), // no match → should return ""
+("abc123def456", 1),
+("", 1) // empty string
+  )
+
+  withParquetTable(data, "tbl") {
+// Test basic extraction: group 0 (full match)
+checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 0) FROM tbl")
+// Test group 1
+checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 1) FROM tbl")

Review Comment:
   ```suggestion
       checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\\\d+)-(\\\\d+)', 1) FROM tbl")
   ```
   
   The escaping is incorrect for many of these queries. Here is debug logging for this query now, which demonstrates that it is not actually extracting anything:
   
   ```
   SELECT regexp_extract(_1, '(\d+)-(\d+)', 1) FROM tbl
   *(1) CometColumnarToRow
   +- CometProject [regexp_extract(_1, (d+)-(d+), 1)#33], [regexp_extract(_1#6, (d+)-(d+), 1) AS regexp_extract(_1, (d+)-(d+), 1)#33]
      +- CometScan [native_iceberg_compat] parquet [_1#6] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-46241ff7-ee37-48fe-ad88-ea06bceae81f], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string>
   
   +--------------------------------+
   |regexp_extract(_1, (d+)-(d+), 1)|
   +--------------------------------+
   |                                |
   |                                |
   |                                |
   |                                |
   |                                |
   |NULL                            |
   +--------------------------------+
   ```
   
   The issue is double-escaping. In the Scala string:
   - `(\\d+)-(\\d+)` → Scala produces `(\d+)-(\d+)`
   - But SQL then interprets `\d` as an escape sequence → becomes `(d+)-(d+)`
   
   After fixing this, I see:
   
   ```
   SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 0) FROM tbl
   *(1) CometColumnarToRow
   +- CometProject [regexp_extract(_1, (\d+)-(\d+), 0)#15], [regexp_extract(_1#6, (\d+)-(\d+), 0) AS regexp_extract(_1, (\d+)-(\d+), 0)#15]
      +- CometScan [native_iceberg_compat] parquet [_1#6] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-901feb7b-b26a-4f2a-99ff-fc594af56804], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string>
   
   +----------------------------------+
   |regexp_extract(_1, (\d+)-(\d+), 0)|
   +----------------------------------+
   |                                  |
   |                                  |
   |                                  |
   |100-200                           |
   |300-400                           |
   |NULL                              |
   +----------------------------------+
   ```
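
   The two escaping layers can be reproduced outside Spark. In the sketch below, `sqlUnescape` is a simplified stand-in (an assumption, handling only backslash escapes) for Spark SQL's default string-literal unescaping with `spark.sql.parser.escapedStringLiterals=false`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapingDemo {
    // Simplified model of Spark SQL's default string-literal unescaping:
    // a backslash followed by any character collapses to that character
    // ('\d' -> 'd', '\\' -> '\'). Real parsing handles more cases.
    static String sqlUnescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                out.append(s.charAt(++i));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Layer 1: the compiler turns source "(\\d+)-(\\d+)" into (\d+)-(\d+)
        String compiled = "(\\d+)-(\\d+)";
        // Layer 2: SQL string-literal unescaping then eats the backslashes
        String seenByEngine = sqlUnescape(compiled);
        if (!seenByEngine.equals("(d+)-(d+)")) throw new AssertionError();
        // (d+) matches literal 'd' characters, not digits, so nothing matches:
        if (Pattern.compile(seenByEngine).matcher("100-200").find()) throw new AssertionError();
        // Doubling again in the source ("\\\\d") survives both layers:
        String doubled = sqlUnescape("(\\\\d+)-(\\\\d+)");
        Matcher m = Pattern.compile(doubled).matcher("100-200");
        if (!(m.find() && m.group(1).equals("100"))) throw new AssertionError();
        System.out.println("ok");
    }
}
```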
   







Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-22 Thread via GitHub


andygrove commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640795004


##
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:
##
@@ -391,4 +391,315 @@ class CometStringExpressionSuite extends CometTestBase {
 }
   }
 
+  test("regexp_extract basic") {

Review Comment:
   nit: We should probably move these tests to a new `CometRegexpExpressionSuite`, but it does not need to happen in this PR.






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-22 Thread via GitHub


andygrove commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640782456


##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
 }
   }
 }
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&
+  !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+  withInfo(
+expr,
+s"Regexp pattern $pattern is not compatible with Spark. " +
+  s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+  "to allow it anyway.")
+  return Incompatible()
+}
+  case _ =>
+return Unsupported(Some("Only literal regexp patterns are supported"))
+}
+
+// Check if idx is a literal
+expr.idx match {
+  case Literal(_, DataTypes.IntegerType) =>
+Compatible()
+  case _ =>
+Unsupported(Some("Only literal group index is supported"))

Review Comment:
   Spark only supports i32 for `idx`, so the current code seems correct






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-22 Thread via GitHub


andygrove commented on PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#issuecomment-3683202184

   Thanks @danielhumanmod. This PR is looking good. I am going to review in 
more detail later today. Apologies for the delay in getting to this one.





Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-22 Thread via GitHub


andygrove commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640735448


##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
 }
   }
 }
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&
+  !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+  withInfo(
+expr,
+s"Regexp pattern $pattern is not compatible with Spark. " +
+  s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+  "to allow it anyway.")
+  return Incompatible()
+}
+  case _ =>
+return Unsupported(Some("Only literal regexp patterns are supported"))
+}
+
+// Check if idx is a literal
+expr.idx match {
+  case Literal(_, DataTypes.IntegerType) =>
+Compatible()
+  case _ =>
+Unsupported(Some("Only literal group index is supported"))
+}
+  }
+
+  override def convert(
+  expr: RegExpExtract,
+  inputs: Seq[Attribute],
+  binding: Boolean): Option[Expr] = {
+val subjectExpr = exprToProtoInternal(expr.subject, inputs, binding)
+val patternExpr = exprToProtoInternal(expr.regexp, inputs, binding)
+val idxExpr = exprToProtoInternal(expr.idx, inputs, binding)
+val optExpr = scalarFunctionExprToProto("regexp_extract", subjectExpr, patternExpr, idxExpr)
+optExprWithInfo(optExpr, expr, expr.subject, expr.regexp, expr.idx)
+  }
+}
+
+object CometRegExpExtractAll extends CometExpressionSerde[RegExpExtractAll] {
+  override def getSupportLevel(expr: RegExpExtractAll): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&

Review Comment:
   The goal is to determine (via extensive fuzz testing) whether Comet actually is already compatible with Spark for certain simple cases and if so we can update `RegExp.isSupportedPattern`. It may be that there are no such cases.
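
   Once fuzzing identifies constructs that behave identically, such an `isSupportedPattern` could take a denylist shape along these lines. This is a hypothetical sketch; the class name and the rejected-construct list are illustrative assumptions, not Comet's actual code:

```java
import java.util.List;

public class RegExpSupportSketch {
    // Constructs known or suspected to differ between Java regex and the
    // Rust regex crate (e.g. backreferences and look-around are not
    // supported natively). The real list would come from fuzz testing;
    // this one is purely illustrative.
    private static final List<String> REJECTED =
        List.of("\\1", "\\2", "(?=", "(?!", "(?<", "\\G");

    static boolean isSupportedPattern(String pattern) {
        // Conservative: reject any pattern containing a flagged construct.
        return REJECTED.stream().noneMatch(pattern::contains);
    }

    public static void main(String[] args) {
        if (isSupportedPattern("(\\w+)\\s\\1")) throw new AssertionError();
        if (!isSupportedPattern("(\\d+)-(\\d+)")) throw new AssertionError();
        System.out.println("ok");
    }
}
```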






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-10 Thread via GitHub


acruise commented on PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#issuecomment-3637992442

   Is there a way to be confident that an arbitrary user-provided regular expression, which Spark expects to be in Java format, is:
   1. valid in the intended native version
   2. has the same semantics 😅
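
   One concrete instance of this divergence: Java's engine (which Spark uses) accepts backreferences, while the Rust `regex` crate rejects the same pattern when it is compiled, so a constant pattern could at least fail at plan time rather than silently diverge. A small Java-side sketch:

```java
import java.util.regex.Pattern;

public class BackrefDemo {
    public static void main(String[] args) {
        // Java regex accepts a backreference to capture group 1...
        Pattern p = Pattern.compile("(\\w+) \\1");
        if (!p.matcher("hey hey").matches()) throw new AssertionError();
        if (p.matcher("hey you").matches()) throw new AssertionError();
        // ...whereas regex::Regex::new(r"(\w+) \1") in the Rust regex
        // crate returns an error at pattern-compile time, because the
        // crate does not support backreferences.
        System.out.println("ok");
    }
}
```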





Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-07 Thread via GitHub


danielhumanmod commented on PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#issuecomment-3623432768

   Hey @coderfender @martin-g , I’ve addressed the comments and refactored the Rust implementation based on the approach used in the similar UDF [regexp_replace](https://github.com/apache/datafusion/blob/fc6d0a4287ab8054d9f56c9ce0a13ffc0ec3c170/datafusion/functions/src/regex/regexpreplace.rs#L88) in DataFusion.
   
   Whenever you have time, I’d appreciate another look. Thanks again for the helpful feedback!





Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-07 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2596611672


##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,401 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::sync::Arc;
+use std::any::Any;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+    signature: Signature,
+    aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl SparkRegExpExtract {
+    pub fn new() -> Self {
+        Self {
+            signature: Signature::user_defined(Volatility::Immutable),
+            aliases: vec![],
+        }
+    }
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        "regexp_extract"
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+        // regexp_extract always returns String
+        Ok(DataType::Utf8)
+    }

Review Comment:
   Added in latest commit, thanks!







Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-07 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2596610973


##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
 }
   }
 }
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&
+  !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+  withInfo(
+expr,
+s"Regexp pattern $pattern is not compatible with Spark. " +
+  s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+  "to allow it anyway.")
+  return Incompatible()
+}
+  case _ =>
+return Unsupported(Some("Only literal regexp patterns are supported"))
+}
+
+// Check if idx is a literal
+expr.idx match {
+  case Literal(_, DataTypes.IntegerType) =>
+Compatible()
+  case _ =>
+Unsupported(Some("Only literal group index is supported"))

Review Comment:
   Looking back at this comment: given that Spark only converts the index into an i32 in the UDF implementation, do we want to keep the behavior aligned and support only `IntegerType` here?
   
   Let me know if that's acceptable, and I can adjust the patch.






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-02 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583897542


##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
 }
   }
 }
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&
+  !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+  withInfo(
+expr,
+s"Regexp pattern $pattern is not compatible with Spark. " +
+  s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+  "to allow it anyway.")
+  return Incompatible()
+}
+  case _ =>
+return Unsupported(Some("Only literal regexp patterns are supported"))
+}
+
+// Check if idx is a literal
+expr.idx match {
+  case Literal(_, DataTypes.IntegerType) =>
+Compatible()
+  case _ =>
+Unsupported(Some("Only literal group index is supported"))

Review Comment:
   Good catch, will add support for these data types.






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-02 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583879942


##
spark/src/main/scala/org/apache/comet/serde/strings.scala:
##
@@ -286,3 +286,83 @@ trait CommonStringExprs {
 }
   }
 }
+
+object CometRegExpExtract extends CometExpressionSerde[RegExpExtract] {
+  override def getSupportLevel(expr: RegExpExtract): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&
+  !CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.get()) {
+  withInfo(
+expr,
+s"Regexp pattern $pattern is not compatible with Spark. " +
+  s"Set ${CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key}=true " +
+  "to allow it anyway.")
+  return Incompatible()
+}
+  case _ =>
+return Unsupported(Some("Only literal regexp patterns are supported"))
+}
+
+// Check if idx is a literal
+expr.idx match {
+  case Literal(_, DataTypes.IntegerType) =>
+Compatible()
+  case _ =>
+Unsupported(Some("Only literal group index is supported"))
+}
+  }
+
+  override def convert(
+  expr: RegExpExtract,
+  inputs: Seq[Attribute],
+  binding: Boolean): Option[Expr] = {
+val subjectExpr = exprToProtoInternal(expr.subject, inputs, binding)
+val patternExpr = exprToProtoInternal(expr.regexp, inputs, binding)
+val idxExpr = exprToProtoInternal(expr.idx, inputs, binding)
+val optExpr = scalarFunctionExprToProto("regexp_extract", subjectExpr, patternExpr, idxExpr)
+optExprWithInfo(optExpr, expr, expr.subject, expr.regexp, expr.idx)
+  }
+}
+
+object CometRegExpExtractAll extends CometExpressionSerde[RegExpExtractAll] {
+  override def getSupportLevel(expr: RegExpExtractAll): SupportLevel = {
+// Check if the pattern is compatible with Spark or allow incompatible patterns
+expr.regexp match {
+  case Literal(pattern, DataTypes.StringType) =>
+if (!RegExp.isSupportedPattern(pattern.toString) &&

Review Comment:
   For now it is a placeholder introduced in this [PR](https://github.com/danielhumanmod/datafusion-comet/commit/e33d56050187c75afcbbdc2def96a093a019db6a#diff-8aa4d7fa45d8c9eb2d228424f3c64fa2979c4072ab28dd9582729b20ccc7ade5R25).
   
   @andygrove when you have a chance, could you share a bit more context on its purpose? I'd really appreciate it. Thank you!






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-02 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583867063


##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,452 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::any::Any;
+use std::sync::Arc;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+signature: Signature,
+aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl SparkRegExpExtract {
+pub fn new() -> Self {
+Self {
+signature: Signature::user_defined(Volatility::Immutable),
+aliases: vec![],
+}
+}
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn name(&self) -> &str {
+"regexp_extract"
+}
+
+fn signature(&self) -> &Signature {
+&self.signature
+}
+
+fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+// regexp_extract always returns String
+Ok(DataType::Utf8)
+}
+
+fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+// regexp_extract(subject, pattern, idx)
+if args.args.len() != 3 {
+return exec_err!(
+"regexp_extract expects 3 arguments, got {}",
+args.args.len()
+);
+}
+
+let subject = &args.args[0];
+let pattern = &args.args[1];
+let idx = &args.args[2];
+
+// Pattern must be a literal string
+let pattern_str = match pattern {
+ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
+_ => {
+return exec_err!("regexp_extract pattern must be a string literal");
+}
+};
+
+// idx must be a literal int
+let idx_val = match idx {
+ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
+_ => {
+return exec_err!("regexp_extract idx must be an integer literal");
+}
+};
+
+// Compile regex once
+let regex = Regex::new(&pattern_str).map_err(|e| {
+internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
+})?;
+
+match subject {
+ColumnarValue::Array(array) => {
+let result = regexp_extract_array(array, &regex, idx_val)?;
+Ok(ColumnarValue::Array(result))
+}
+ColumnarValue::Scalar(ScalarValue::Utf8(s)) => {
+let result = match s {
+Some(text) => Some(extract_group(text, &regex, idx_val)?),
+None => None, // NULL input → NULL output
+};
+Ok(ColumnarValue::Scalar(ScalarValue::Utf8(result)))
+}
+_ => exec_err!("regexp_extract expects string input"),
+}
+}
+
+fn aliases(&self) -> &[String] {
+&self.aliases
+}
+}
+
+/// Spark-compatible regexp_extract_all function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtractAll {
+signature: Signature,
+aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtractAll {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl SparkRegExpExtractAll {
+pub fn new() -> Self {
+Self {
+signature: Signature::user_defined(Volatility::Immutable),
+aliases: vec![],
+}
+}
+}
+
+impl ScalarUDFImpl for SparkRegExpExtractAll {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn name(&self) -> &str {
+"regexp_extract_all"
+}
+
+fn signature(&self) -> &Signature {
+&self.signature
+}
+
+fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+// regexp_extrac

Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-02 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583619102


##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,401 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::sync::Arc;
+use std::any::Any;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+signature: Signature,
+aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl SparkRegExpExtract {
+pub fn new() -> Self {
+Self {
+signature: Signature::user_defined(Volatility::Immutable),
+aliases: vec![],
+}
+}
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn name(&self) -> &str {
+"regexp_extract"
+}
+
+fn signature(&self) -> &Signature {
+&self.signature
+}
+
+fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+// regexp_extract always returns String
+Ok(DataType::Utf8)
+}
+
+fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+// regexp_extract(subject, pattern, idx)
+if args.args.len() != 3 {
+return exec_err!(
+"regexp_extract expects 3 arguments, got {}",
+args.args.len()
+);
+}
+
+let subject = &args.args[0];
+let pattern = &args.args[1];
+let idx = &args.args[2];
+
+// Pattern must be a literal string
+let pattern_str = match pattern {
+ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
+_ => {
+return exec_err!("regexp_extract pattern must be a string literal");
+}
+};
+
+// idx must be a literal int
+let idx_val = match idx {
+ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
+_ => {
+return exec_err!("regexp_extract idx must be an integer literal");
+}
+};
+
+// Compile regex once
+let regex = Regex::new(&pattern_str).map_err(|e| {
+internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
+})?;
+
+match subject {

Review Comment:
   Will extract some shared code, thanks for the suggestion!






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-02 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583618322


##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,401 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::sync::Arc;
+use std::any::Any;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+signature: Signature,
+aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl SparkRegExpExtract {
+pub fn new() -> Self {
+Self {
+signature: Signature::user_defined(Volatility::Immutable),
+aliases: vec![],
+}
+}
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn name(&self) -> &str {
+"regexp_extract"
+}
+
+fn signature(&self) -> &Signature {
+&self.signature
+}
+
+fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+// regexp_extract always returns String
+Ok(DataType::Utf8)
+}
+
+fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+// regexp_extract(subject, pattern, idx)
+if args.args.len() != 3 {
+return exec_err!(
+"regexp_extract expects 3 arguments, got {}",
+args.args.len()
+);
+}
+
+let subject = &args.args[0];
+let pattern = &args.args[1];
+let idx = &args.args[2];
+
+// Pattern must be a literal string
+let pattern_str = match pattern {
+ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
+_ => {
+return exec_err!("regexp_extract pattern must be a string literal");
+}
+};
+
+// idx must be a literal int
+let idx_val = match idx {
+ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
+_ => {
+return exec_err!("regexp_extract idx must be an integer literal");
+}

Review Comment:
   Good point. I checked Spark's implementation; the index there is cast to i32 only. I was wondering if we should follow their behavior.
   
   https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L811C21-L811C33






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-02 Thread via GitHub


danielhumanmod commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2583613359


##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,452 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::any::Any;
+use std::sync::Arc;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+signature: Signature,
+aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl SparkRegExpExtract {
+pub fn new() -> Self {
+Self {
+signature: Signature::user_defined(Volatility::Immutable),
+aliases: vec![],
+}
+}
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn name(&self) -> &str {
+"regexp_extract"
+}
+
+fn signature(&self) -> &Signature {
+&self.signature
+}
+
+fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+// regexp_extract always returns String
+Ok(DataType::Utf8)
+}
+
+fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+// regexp_extract(subject, pattern, idx)
+if args.args.len() != 3 {
+return exec_err!(
+"regexp_extract expects 3 arguments, got {}",
+args.args.len()
+);
+}
+
+let subject = &args.args[0];
+let pattern = &args.args[1];
+let idx = &args.args[2];
+
+// Pattern must be a literal string
+let pattern_str = match pattern {
+ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
+_ => {
+return exec_err!("regexp_extract pattern must be a string literal");
+}
+};
+
+// idx must be a literal int
+let idx_val = match idx {
+ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,

Review Comment:
   Good call, will fix this, thanks!






Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-01 Thread via GitHub


codecov-commenter commented on PR #2831:
URL: 
https://github.com/apache/datafusion-comet/pull/2831#issuecomment-3597346236

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2831?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   :x: Patch coverage is `9.09091%` with `40 lines` in your changes missing coverage. Please review.
   :white_check_mark: Project coverage is 21.18%. Comparing base ([`f09f8af`](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)) to head ([`ff1ebd6`](https://app.codecov.io/gh/apache/datafusion-comet/commit/ff1ebd6b3bebe85c0f393b268660dc8031614bc1?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)).
   :warning: Report is 730 commits behind head on main.
   
   | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/2831?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
   |---|---|---|
   | [...rc/main/scala/org/apache/comet/serde/strings.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/2831?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fserde%2Fstrings.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9zZXJkZS9zdHJpbmdzLnNjYWxh) | 4.76% | [40 Missing :warning:](https://app.codecov.io/gh/apache/datafusion-comet/pull/2831?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   
   Additional details and impacted files
   
   
   
   ```diff
   @@  Coverage Diff  @@
   ##   main#2831   +/-   ##
   =
   - Coverage 56.12%   21.18%   -34.95% 
   + Complexity  976  456  -520 
   =
 Files   119  167   +48 
 Lines 1174315232 +3489 
 Branches   2251 2531  +280 
   =
   - Hits   6591 3227 -3364 
   - Misses 401211480 +7468 
   + Partials   1140  525  -615 
   ```
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/2831?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
:rocket: New features to boost your workflow: 
   
   - :snowflake: [Test 
Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, 
report on failures, and find test suite problems.
   





Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-12-01 Thread via GitHub


martin-g commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2577039322


##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,452 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::any::Any;
+use std::sync::Arc;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+signature: Signature,
+aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl SparkRegExpExtract {
+pub fn new() -> Self {
+Self {
+signature: Signature::user_defined(Volatility::Immutable),
+aliases: vec![],
+}
+}
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn name(&self) -> &str {
+"regexp_extract"
+}
+
+fn signature(&self) -> &Signature {
+&self.signature
+}
+
+fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+// regexp_extract always returns String
+Ok(DataType::Utf8)
+}
+
+fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+// regexp_extract(subject, pattern, idx)
+if args.args.len() != 3 {
+return exec_err!(
+"regexp_extract expects 3 arguments, got {}",
+args.args.len()
+);
+}
+
+let subject = &args.args[0];
+let pattern = &args.args[1];
+let idx = &args.args[2];
+
+// Pattern must be a literal string
+let pattern_str = match pattern {
+ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
+_ => {
+return exec_err!("regexp_extract pattern must be a string literal");
+}
+};
+
+// idx must be a literal int
+let idx_val = match idx {
+ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,

Review Comment:
   Negative `i` will lead to a big `idx_val` here. Does it need some kind of validation?
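The wrap-around being flagged is easy to reproduce: casting a negative `i32` with `as usize` sign-extends rather than failing. A self-contained sketch contrasting the unchecked cast with a validated conversion (function names are illustrative, not from the PR):

```rust
// Demonstrates the hazard: a negative group index cast with `as usize`
// silently becomes a huge value instead of an error.
fn unchecked_idx(i: i32) -> usize {
    i as usize // -1 sign-extends and wraps to usize::MAX
}

// A validated alternative: reject negative indices up front.
fn checked_idx(i: i32) -> Option<usize> {
    usize::try_from(i).ok()
}

fn main() {
    assert_eq!(unchecked_idx(-1), usize::MAX);
    assert_eq!(checked_idx(-1), None);
    assert_eq!(checked_idx(2), Some(2));
}
```

With `checked_idx`, a negative index can be surfaced as an execution error (or Spark's own "group index must be non-negative" behavior) instead of an out-of-bounds capture-group lookup.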



##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,452 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::any::Any;
+use std::sync::Arc;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+signature: Signature,
+aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl SparkRegExpExtract {
+pub fn new() -> Self {
+Self {
+signature: Signature::user_defined(Volatility::Immutable),
+aliases: vec![],
+}
+}
+}
+
+impl ScalarUDFImpl for SparkRegE

Re: [PR] feat: Add support for `RegExpExtract`/`RegExpExtractAll` [datafusion-comet]

2025-11-30 Thread via GitHub


coderfender commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2575900748


##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,401 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::sync::Arc;
+use std::any::Any;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+signature: Signature,
+aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl SparkRegExpExtract {
+pub fn new() -> Self {
+Self {
+signature: Signature::user_defined(Volatility::Immutable),
+aliases: vec![],
+}
+}
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn name(&self) -> &str {
+"regexp_extract"
+}
+
+fn signature(&self) -> &Signature {
+&self.signature
+}
+
+fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+// regexp_extract always returns String
+Ok(DataType::Utf8)
+}

Review Comment:
   @danielhumanmod, could we verify we don't return a `LargeUtf8` as well?
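One way to address this is to have `return_type` echo the input's string type rather than hard-coding `Utf8`, so a `LargeUtf8` subject is not silently narrowed. A sketch with a hypothetical `StrType` enum standing in for the real arrow-rs `DataType` (illustration only, not the PR's actual code):

```rust
// Hypothetical stand-in for the two Arrow string types under discussion.
#[derive(Debug, PartialEq, Clone)]
enum StrType {
    Utf8,
    LargeUtf8,
}

// Mirror the first argument's string type; default to Utf8 if absent.
fn return_type(arg_types: &[StrType]) -> StrType {
    arg_types.first().cloned().unwrap_or(StrType::Utf8)
}

fn main() {
    assert_eq!(return_type(&[StrType::LargeUtf8]), StrType::LargeUtf8);
    assert_eq!(return_type(&[StrType::Utf8]), StrType::Utf8);
}
```

Alternatively, the serde layer could reject `LargeUtf8` inputs outright so the UDF only ever sees `Utf8`; either choice just needs to be made explicitly.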



##
native/spark-expr/src/string_funcs/regexp_extract.rs:
##
@@ -0,0 +1,401 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{Array, ArrayRef, GenericStringArray};
+use arrow::datatypes::DataType;
+use datafusion::common::{exec_err, internal_datafusion_err, Result, ScalarValue};
+use datafusion::logical_expr::{
+ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
+};
+use regex::Regex;
+use std::sync::Arc;
+use std::any::Any;
+
+/// Spark-compatible regexp_extract function
+#[derive(Debug, PartialEq, Eq, Hash)]
+pub struct SparkRegExpExtract {
+signature: Signature,
+aliases: Vec<String>,
+}
+
+impl Default for SparkRegExpExtract {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl SparkRegExpExtract {
+pub fn new() -> Self {
+Self {
+signature: Signature::user_defined(Volatility::Immutable),
+aliases: vec![],
+}
+}
+}
+
+impl ScalarUDFImpl for SparkRegExpExtract {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn name(&self) -> &str {
+"regexp_extract"
+}
+
+fn signature(&self) -> &Signature {
+&self.signature
+}
+
+fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
+// regexp_extract always returns String
+Ok(DataType::Utf8)
+}
+
+fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+// regexp_extract(subject, pattern, idx)
+if args.args.len() != 3 {
+return exec_err!(
+"regexp_extract expects 3 arguments, got {}",
+args.args.len()
+);
+}
+
+let subject = &args.args[0];
+let pattern = &args.args[1];
+let idx = &args.args[2];
+
+// Pattern must be a literal string
+let pattern_str = match pattern {
+Column