[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023019#comment-17023019 ]

Jeff Evans commented on SPARK-19248:
------------------------------------

I'm not a Spark maintainer, so I can't answer definitively. However, I would guess they won't change the default value. This was deliberately added in 2.0 with a default value of false, and breaking changes like this are usually introduced in new major versions (speaking in general terms).

> Regex_replace works in 1.6 but not in 2.0
> -----------------------------------------
>
>                 Key: SPARK-19248
>                 URL: https://issues.apache.org/jira/browse/SPARK-19248
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2, 2.4.3
>            Reporter: Lucas Tittmann
>            Priority: Major
>              Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark in 1.6.2,
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the
> Spark version. We checked the regex in Java, and both should be correct and
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not
> have the possibility to confirm in 2.1 at the moment.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022576#comment-17022576 ]

Nicholas Chammas commented on SPARK-19248:
------------------------------------------

Thanks for getting to the bottom of the issue, [~jeff.w.evans], and for providing a workaround. Would an appropriate solution be to make {{escapedStringLiterals}} default to {{True}}? Or does that cause other problems?
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022430#comment-17022430 ]

Jeff Evans commented on SPARK-19248:
------------------------------------

After some debugging, I figured out what's going on here. The crux of this is the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under SPARK-20399. This behavior changed in 2.0 (see [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]). If you start the PySpark sessions described above with this line:

{code}
spark.conf.set("spark.sql.parser.escapedStringLiterals", True)
{code}

then you should see the 1.6 behavior.
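To make the escaping effect concrete, here is a minimal plain-Python sketch (an editorial illustration, not part of the original thread). It assumes the only relevant difference is whether the SQL parser strips the backslash before the pattern reaches the regex engine, which also explains why the reporter's first pattern kept working while the second broke:

```python
import re

s = '.. 5.'

# Pattern 1: '[ \.]*'. If the 2.x parser strips the backslash, the class
# becomes '[ .]*', but '.' is literal inside a character class anyway,
# so the result is unchanged:
assert re.sub('[ .]*', '', s) == '5'

# Pattern 2: '( |\.)*'. Stripping the backslash yields '( |.)*', where '.'
# is now the any-character wildcard, so the whole string is consumed:
assert re.sub('( |.)*', '', s) == ''

# With the backslash preserved (1.6 behavior, or escapedStringLiterals
# enabled), the dot stays literal and the digit survives:
assert re.sub('( |\\.)*', '', s) == '5'
```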
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844577#comment-16844577 ]

Hyukjin Kwon commented on SPARK-19248:
--------------------------------------

Yea, thanks.
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844547#comment-16844547 ]

Nicholas Chammas commented on SPARK-19248:
------------------------------------------

Looks like Spark 2.4.3 still exhibits the behavior reported in the original issue:

{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 13:25:00)
SparkSession available as 'spark'.
>>> df = spark.createDataFrame([('.. 5.',)], ['col'])
>>> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
>>> dfout
[Row(col='5')]
>>> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
>>> dfout2
[Row(col='')]
>>>
{code}

[~hyukjin.kwon] - I'm going to reopen this issue.
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010349#comment-16010349 ]

Yuming Wang commented on SPARK-19248:
-------------------------------------

The Scala API works fine:

{code:java}
scala> val df = spark.createDataFrame(Seq((0, ".. 5."))).toDF("id", "col")
df: org.apache.spark.sql.DataFrame = [id: int, col: string]

scala> df.select(regexp_replace($"col", "[ \\.]*", "")).show()
+-----------------------------+
|regexp_replace(col, [ \.]*, )|
+-----------------------------+
|                            5|
+-----------------------------+

scala> df.select(regexp_replace($"col", "( |\\.)*", "")).show()
+------------------------------+
|regexp_replace(col, ( |\.)*, )|
+------------------------------+
|                             5|
+------------------------------+
{code}
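A likely reason the Scala call behaves differently (my inference from Jeff Evans's analysis above, not stated explicitly in the thread): {{regexp_replace($"col", ...)}} receives the pattern as an already-evaluated host-language string and bypasses the SQL parser, whereas {{selectExpr()}} routes the pattern through the parser's string-literal unescaping. A rough plain-Python analogue of the two code paths:

```python
import re

# Direct API path: the host-language literal "( |\\.)*" evaluates to the
# seven characters ( |\.)* and goes straight to the regex engine:
pattern_direct = '( |\\.)*'
assert re.sub(pattern_direct, '', '.. 5.') == '5'

# selectExpr path in 2.x: the SQL parser unescapes '\.' to '.' before the
# regex engine ever sees it, turning the dot into a wildcard:
pattern_via_sql = '( |.)*'
assert re.sub(pattern_via_sql, '', '.. 5.') == ''
```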
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825173#comment-15825173 ]

Hyukjin Kwon commented on SPARK-19248:
--------------------------------------

I just looked into this out of curiosity. This seems related to the parser behaviour.

1.6.1:

{code}
>>> sqlContext.createDataFrame([[]]).selectExpr("'.'").show()
+---+
| .|
+---+
| .|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\.'").show()
+---+
| \.|
+---+
| \.|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\\.'").show()
+---+
| \.|
+---+
| \.|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\\\.'").show()
+---+
|\\.|
+---+
|\\.|
+---+
{code}

current master:

{code}
>>> sqlContext.createDataFrame([[]]).selectExpr("'.'").show()
+---+
| .|
+---+
| .|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\.'").show()
+---+
| .|
+---+
| .|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\\.'").show()
+---+
| .|
+---+
| .|
+---+

>>> sqlContext.createDataFrame([[]]).selectExpr("'\\\.'").show()
+---+
| \.|
+---+
| \.|
+---+
{code}
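The comparison above can be summarized as: in 2.x the SQL parser applies one extra round of backslash unescaping to string literals. A simplified simulation of that extra round (an editorial sketch; the real parser handles many more escape sequences). Note that Hyukjin's inputs first pass through Python's own literal rules, so '\.' and '\\.' both reach SQL as backslash-dot, and '\\\.' reaches it as backslash-backslash-dot:

```python
def unescape_once(s: str) -> str:
    """Collapse one level of backslash escapes ('\\x' -> 'x')."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s):
            out.append(s[i + 1])  # keep the escaped char, drop the backslash
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

# What the 2.x ("current master") column shows for each literal:
assert unescape_once(r'\.') == '.'     # why '\.' and '\\.' both display '.'
assert unescape_once(r'\\.') == r'\.'  # why '\\\.' displays '\.'
```

In 1.6 this extra round simply doesn't happen, which matches the first table: the value after Python's literal processing is displayed as-is.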
[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824334#comment-15824334 ]

Nicholas Chammas commented on SPARK-19248:
------------------------------------------

Testing this out, it looks like 2.1 shows the same behavior as 2.0. If I test this in pure Python, it looks like 1.6 has the correct behavior and 2.0/2.1 are incorrect:

{code}
>>> import re
>>> re.sub(pattern='( |\.)*', repl='', string='.. 5.')
'5'
{code}