[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-24 Thread Jeff Evans (Jira)


[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023019#comment-17023019 ]

Jeff Evans commented on SPARK-19248:


I'm not a Spark maintainer, so I can't answer definitively.  However, I would 
guess they won't change the default value.  It was deliberately added in 2.0 
with a default of false, and breaking changes like this are usually introduced 
only in new major versions (speaking in general terms).

> Regex_replace works in 1.6 but not in 2.0
> -----------------------------------------
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2's execution of regexes. Using PySpark in 
> 1.6.2, we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex behaves differently depending on the 
> Spark version. We checked both regexes in Java, and both are correct and 
> should work. Regex execution in 2.0.2 therefore seems to be erroneous. I am 
> not able to confirm the behaviour in 2.1 at the moment.



[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022576#comment-17022576 ]

Nicholas Chammas commented on SPARK-19248:
--

Thanks for getting to the bottom of the issue, [~jeff.w.evans], and for 
providing a workaround.

Would an appropriate solution be to make {{escapedStringLiterals}} default to 
{{True}}? Or does that cause other problems?
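
(For completeness, the parser analysis further down the thread suggests a 
per-query alternative that avoids touching the config at all: double the 
backslash so that one level survives the 2.x literal un-escaping. A hedged 
sketch, reusing the {{df}} from the original report:)

{code:python}
# Per-query alternative (sketch): under the 2.x default parser, the SQL
# string literal '\\.' is un-escaped once, so the regex engine receives
# '\.' (a literal dot). `df` is the DataFrame from the original report.
df.selectExpr(r"regexp_replace(col, '( |\\.)*', '') AS col").collect()
# expected: [Row(col='5')]
{code}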

[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Jeff Evans (Jira)


[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022430#comment-17022430 ]

Jeff Evans commented on SPARK-19248:


After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
If you start the PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.
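
For illustration, a minimal end-to-end sketch of that workaround (assuming a 
PySpark shell where a {{SparkSession}} named {{spark}} is already available; 
expected output per the comments above):

{code:python}
# Sketch of the workaround; `spark` is assumed to be an existing
# SparkSession. With 1.6-style literal parsing, the SQL parser no longer
# consumes the backslash in '\.'.
spark.conf.set("spark.sql.parser.escapedStringLiterals", True)

df = spark.createDataFrame([('..   5.',)], ['col'])

# The regex engine now receives '( |\.)*', so only spaces and literal
# dots are stripped.
df.selectExpr(r"regexp_replace(col, '( |\.)*', '') AS col").collect()
# expected: [Row(col='5')]
{code}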

[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2019-05-21 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844577#comment-16844577 ]

Hyukjin Kwon commented on SPARK-19248:
--

Yea, thanks.

[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2019-05-20 Thread Nicholas Chammas (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844547#comment-16844547 ]

Nicholas Chammas commented on SPARK-19248:
--

Looks like Spark 2.4.3 still exhibits the behavior reported in the original 
issue: 
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 13:25:00)
SparkSession available as 'spark'.
>>> df = spark.createDataFrame([('..   5.',)], ['col'])
>>> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
>>> dfout
[Row(col='5')]
>>> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
>>> dfout2
[Row(col='')]
>>>
{code}
[~hyukjin.kwon] - I'm going to reopen this issue.

[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2017-05-15 Thread Yuming Wang (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010349#comment-16010349 ]

Yuming Wang commented on SPARK-19248:
-

The *Scala* API works fine:
{code:java}
scala> val df = spark.createDataFrame(Seq((0, "..   5."))).toDF("id","col")
df: org.apache.spark.sql.DataFrame = [id: int, col: string]

scala> df.select(regexp_replace($"col", "[ \\.]*", "")).show()
+-----------------------------+
|regexp_replace(col, [ \.]*, )|
+-----------------------------+
|                            5|
+-----------------------------+


scala> df.select(regexp_replace($"col", "( |\\.)*", "")).show()
+------------------------------+
|regexp_replace(col, ( |\.)*, )|
+------------------------------+
|                             5|
+------------------------------+

{code}
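
This is consistent with the string-literal analysis elsewhere in this thread: 
with the DataFrame-function API the pattern is an ordinary program string and 
never passes through the SQL parser's literal un-escaping. A rough PySpark 
equivalent of the Scala snippet above (a sketch, assuming a {{SparkSession}} 
named {{spark}}):

{code:python}
# Rough PySpark equivalent of the Scala snippet above; `spark` is assumed
# to be an existing SparkSession. The raw pattern strings go straight to
# regexp_replace, bypassing SQL string-literal un-escaping.
from pyspark.sql.functions import col, regexp_replace

df = spark.createDataFrame([(0, '..   5.')], ['id', 'col'])

df.select(regexp_replace(col('col'), r'[ \.]*', '')).show()
df.select(regexp_replace(col('col'), r'( |\.)*', '')).show()
# Both are expected to show a single row containing '5'.
{code}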

[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2017-01-16 Thread Hyukjin Kwon (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825173#comment-15825173 ]

Hyukjin Kwon commented on SPARK-19248:
--

I just looked into this out of curiosity. It seems related to the parser 
behaviour.

1.6.1

{code}
>>> sqlContext.createDataFrame([[]]).selectExpr("'.'").show()
+---+
|  .|
+---+
|  .|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\.'").show()
+---+
| \.|
+---+
| \.|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\\.'").show()
+---+
| \.|
+---+
| \.|
+---+


>>> sqlContext.createDataFrame([[]]).selectExpr("'\\\.'").show()
+---+
|\\.|
+---+
|\\.|
+---+
{code}

current master

{code}
>>> sqlContext.createDataFrame([[]]).selectExpr("'.'").show()
+---+
|  .|
+---+
|  .|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\.'").show()
+---+
|  .|
+---+
|  .|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\\.'").show()
+---+
|  .|
+---+
|  .|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\\\.'").show()
+---+
| \.|
+---+
| \.|
+---+
{code}
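
In other words, the 2.x parser consumes one level of backslash escaping in 
string literals before the pattern ever reaches the regex engine, so {{'\.'}} 
arrives as {{'.'}}. Plain Python's {{re}} module makes the consequence for the 
reported case easy to see (an illustration of the parsing difference, not 
Spark code):

{code:python}
import re

# What the regex engine sees in 1.6 (or in 2.x with
# spark.sql.parser.escapedStringLiterals=true): '\.' is a literal dot.
re.sub(r'( |\.)*', '', '..   5.')   # -> '5'

# What it effectively sees with 2.x defaults, after the parser has turned
# '\.' into '.': '.' matches any character, so the whole string is
# consumed.
re.sub(r'( |.)*', '', '..   5.')    # -> ''
{code}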

[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2017-01-16 Thread Nicholas Chammas (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824334#comment-15824334 ]

Nicholas Chammas commented on SPARK-19248:
--

Testing this out, it looks like 2.1 shows the same behavior as 2.0.

Testing in pure Python suggests that 1.6 has the correct behavior and 2.0/2.1 
are incorrect:

{code}
>>> import re
>>> re.sub(pattern='( |\.)*', repl='', string='..   5.')
'5'
{code}


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org