Robert Joseph Evans created SPARK-44500:
-------------------------------------------

             Summary: parse_url treats key as regular expression
                 Key: SPARK-44500
                 URL: https://issues.apache.org/jira/browse/SPARK-44500
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.1, 3.4.0, 3.3.0, 3.2.0
            Reporter: Robert Joseph Evans


To be clear, I am not 100% sure that this is a bug. parse_url appears to insert 
the key argument into a regular expression without quoting it. That might be a 
feature, but I don't see it used as one anywhere. If it is a feature, it should 
be documented, because there are pitfalls. If it is a bug, it should be fixed, 
because it is confusing and makes it easy to shoot yourself in the foot.

```scala
> val urls = Seq("http://foo/bar?abc=BAD&a.c=GOOD", 
> "http://foo/bar?a.c=GOOD&abc=BAD").toDF
> urls.selectExpr("parse_url(value, 'QUERY', 'a.c')").show(false)

+----------------------------+
|parse_url(value, QUERY, a.c)|
+----------------------------+
|BAD                         |
|GOOD                        |
+----------------------------+

> urls.selectExpr("parse_url(value, 'QUERY', 'a[c')").show(false)
java.util.regex.PatternSyntaxException: Unclosed character class near index 15
(&|^)a[c=([^&]*)
               ^
  at java.util.regex.Pattern.error(Pattern.java:1969)
  at java.util.regex.Pattern.clazz(Pattern.java:2562)
  at java.util.regex.Pattern.sequence(Pattern.java:2077)
  at java.util.regex.Pattern.expr(Pattern.java:2010)
  at java.util.regex.Pattern.compile(Pattern.java:1702)
  at java.util.regex.Pattern.<init>(Pattern.java:1352)
  at java.util.regex.Pattern.compile(Pattern.java:1028)

```

The simple fix is to quote the key when making the pattern.

```scala
  private def getPattern(key: UTF8String): Pattern = {
    Pattern.compile(REGEXPREFIX + Pattern.quote(key.toString) + REGEXSUBFIX)
  }
```
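
To check the effect of quoting outside of Spark, here is a minimal standalone 
sketch that uses only java.util.regex. The prefix and suffix strings are 
inferred from the pattern printed in the exception above. The object name, the 
helper names, and the find()/group(2) extraction are assumptions made for 
illustration and are not Spark's actual code.

```scala
import java.util.regex.Pattern

object ParseUrlKeyQuoting {
  // Inferred from the pattern in the PatternSyntaxException message above;
  // hypothetical stand-ins for REGEXPREFIX and REGEXSUBFIX.
  val Prefix = "(&|^)"
  val Suffix = "=([^&]*)"

  // Key inserted verbatim, so regex metacharacters in the key stay active.
  def unquoted(key: String): Pattern =
    Pattern.compile(Prefix + key + Suffix)

  // Key wrapped with Pattern.quote, so it is matched as a literal string.
  def quoted(key: String): Pattern =
    Pattern.compile(Prefix + Pattern.quote(key) + Suffix)

  // Returns the value of the first matching key in the query string, if any.
  // Group 1 is the (&|^) prefix; group 2 is the captured value.
  def extract(pattern: Pattern, query: String): Option[String] = {
    val m = pattern.matcher(query)
    if (m.find()) Some(m.group(2)) else None
  }

  def main(args: Array[String]): Unit = {
    val query = "abc=BAD&a.c=GOOD"
    println(extract(unquoted("a.c"), query))   // Some(BAD): '.' matches the 'b' in "abc"
    println(extract(quoted("a.c"), query))     // Some(GOOD)
    println(extract(quoted("a[c"), "a[c=OK"))  // Some(OK): no PatternSyntaxException
  }
}
```

With the quoted key, the order of the query parameters no longer changes the 
result, and a key such as a[c compiles instead of throwing.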


