[jira] [Updated] (SPARK-26366) Except with transform regression

2020-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26366:
--
Affects Version/s: 2.3.0
   2.3.1

> Except with transform regression
> 
>
> Key: SPARK-26366
> URL: https://issues.apache.org/jira/browse/SPARK-26366
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Dan Osipov
>Assignee: Marco Gaido
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> There appears to be a regression between Spark 2.2 and 2.3. Below is the code 
> to reproduce it:
>  
> {code:java}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val inputDF = spark.sqlContext.createDataFrame(
>   spark.sparkContext.parallelize(Seq(
> Row("0", "john", "smith", "j...@smith.com"),
> Row("1", "jane", "doe", "j...@doe.com"),
> Row("2", "apache", "spark", "sp...@apache.org"),
> Row("3", "foo", "bar", null)
>   )),
>   StructType(List(
> StructField("id", StringType, nullable=true),
> StructField("first_name", StringType, nullable=true),
> StructField("last_name", StringType, nullable=true),
> StructField("email", StringType, nullable=true)
>   ))
> )
> val exceptDF = inputDF.transform( toProcessDF =>
>   toProcessDF.filter(
>   (
> col("first_name").isin(Seq("john", "jane"): _*)
>   and col("last_name").isin(Seq("smith", "doe"): _*)
>   )
>   or col("email").isin(List(): _*)
>   )
> )
> inputDF.except(exceptDF).show()
> {code}
> Output with Spark 2.2:
> {noformat}
> +---+--+-++
> | id|first_name|last_name| email|
> +---+--+-++
> | 2| apache| spark|sp...@apache.org|
> | 3| foo| bar| null|
> +---+--+-++{noformat}
> Output with Spark 2.3:
> {noformat}
> +---+--+-++
> | id|first_name|last_name| email|
> +---+--+-++
> | 2| apache| spark|sp...@apache.org|
> +---+--+-++{noformat}
> Note, changing the last line to 
> {code:java}
> inputDF.except(exceptDF.cache()).show()
> {code}
> produces identical output for both Spark 2.3 and 2.2
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26366) Except with transform regression

2019-01-04 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26366:

Labels: correctness  (was: )

> Except with transform regression
> 
>
> Key: SPARK-26366
> URL: https://issues.apache.org/jira/browse/SPARK-26366
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.2
>Reporter: Dan Osipov
>Assignee: Marco Gaido
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> There appears to be a regression between Spark 2.2 and 2.3. Below is the code 
> to reproduce it:
>  
> {code:java}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val inputDF = spark.sqlContext.createDataFrame(
>   spark.sparkContext.parallelize(Seq(
> Row("0", "john", "smith", "j...@smith.com"),
> Row("1", "jane", "doe", "j...@doe.com"),
> Row("2", "apache", "spark", "sp...@apache.org"),
> Row("3", "foo", "bar", null)
>   )),
>   StructType(List(
> StructField("id", StringType, nullable=true),
> StructField("first_name", StringType, nullable=true),
> StructField("last_name", StringType, nullable=true),
> StructField("email", StringType, nullable=true)
>   ))
> )
> val exceptDF = inputDF.transform( toProcessDF =>
>   toProcessDF.filter(
>   (
> col("first_name").isin(Seq("john", "jane"): _*)
>   and col("last_name").isin(Seq("smith", "doe"): _*)
>   )
>   or col("email").isin(List(): _*)
>   )
> )
> inputDF.except(exceptDF).show()
> {code}
> Output with Spark 2.2:
> {noformat}
> +---+--+-++
> | id|first_name|last_name| email|
> +---+--+-++
> | 2| apache| spark|sp...@apache.org|
> | 3| foo| bar| null|
> +---+--+-++{noformat}
> Output with Spark 2.3:
> {noformat}
> +---+--+-++
> | id|first_name|last_name| email|
> +---+--+-++
> | 2| apache| spark|sp...@apache.org|
> +---+--+-++{noformat}
> Note, changing the last line to 
> {code:java}
> inputDF.except(exceptDF.cache()).show()
> {code}
> produces identical output for both Spark 2.3 and 2.2
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26366) Except with transform regression

2018-12-22 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26366:
--
Fix Version/s: 2.3.3

> Except with transform regression
> 
>
> Key: SPARK-26366
> URL: https://issues.apache.org/jira/browse/SPARK-26366
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.2
>Reporter: Dan Osipov
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> There appears to be a regression between Spark 2.2 and 2.3. Below is the code 
> to reproduce it:
>  
> {code:java}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val inputDF = spark.sqlContext.createDataFrame(
>   spark.sparkContext.parallelize(Seq(
> Row("0", "john", "smith", "j...@smith.com"),
> Row("1", "jane", "doe", "j...@doe.com"),
> Row("2", "apache", "spark", "sp...@apache.org"),
> Row("3", "foo", "bar", null)
>   )),
>   StructType(List(
> StructField("id", StringType, nullable=true),
> StructField("first_name", StringType, nullable=true),
> StructField("last_name", StringType, nullable=true),
> StructField("email", StringType, nullable=true)
>   ))
> )
> val exceptDF = inputDF.transform( toProcessDF =>
>   toProcessDF.filter(
>   (
> col("first_name").isin(Seq("john", "jane"): _*)
>   and col("last_name").isin(Seq("smith", "doe"): _*)
>   )
>   or col("email").isin(List(): _*)
>   )
> )
> inputDF.except(exceptDF).show()
> {code}
> Output with Spark 2.2:
> {noformat}
> +---+--+-++
> | id|first_name|last_name| email|
> +---+--+-++
> | 2| apache| spark|sp...@apache.org|
> | 3| foo| bar| null|
> +---+--+-++{noformat}
> Output with Spark 2.3:
> {noformat}
> +---+--+-++
> | id|first_name|last_name| email|
> +---+--+-++
> | 2| apache| spark|sp...@apache.org|
> +---+--+-++{noformat}
> Note, changing the last line to 
> {code:java}
> inputDF.except(exceptDF.cache()).show()
> {code}
> produces identical output for both Spark 2.3 and 2.2
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org