[jira] [Commented] (SPARK-37270) Incorect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444956#comment-17444956 ]

Apache Spark commented on SPARK-37270:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/34627

> Incorect result of filter using isNull condition
> ------------------------------------------------
>
>                 Key: SPARK-37270
>                 URL: https://issues.apache.org/jira/browse/SPARK-37270
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.2.0
>            Reporter: Tomasz Kus
>            Priority: Major
>              Labels: correctness
>
> Simple code that reproduces this issue:
> {code:java}
> val frame = Seq((false, 1)).toDF("bool", "number")
> frame
>   .checkpoint()
>   .withColumn("conditions", when(col("bool"), "I am not null"))
>   .filter(col("conditions").isNull)
>   .show(false)
> {code}
> Although the "conditions" column is null:
> {code:java}
> +-----+------+----------+
> |bool |number|conditions|
> +-----+------+----------+
> |false|1     |null      |
> +-----+------+----------+
> {code}
> an empty result is shown.
>
> Execution plans:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
>
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
>
> == Optimized Logical Plan ==
> LocalRelation <empty>, [bool#124, number#125, conditions#252]
>
> == Physical Plan ==
> LocalTableScan <empty>, [bool#124, number#125, conditions#252]
> {code}
> After removing the checkpoint, the proper result is returned and the execution plans are as follows:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
>
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
>
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
>
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
> {code}
> It seems that the most important difference is LogicalRDD -> LocalRelation.
>
> There are the following workarounds to retrieve the correct result:
> 1) remove the checkpoint
> 2) add an explicit .otherwise(null) to when
> 3) add checkpoint() or cache() just before the filter
> 4) downgrade to Spark 3.1.2

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
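The workarounds above hinge on standard CASE WHEN semantics: `when(cond, value)` without an `.otherwise(...)` has an implicit ELSE NULL, so for the `bool=false` row the "conditions" column is NULL and `isNull` should keep the row. A minimal Python model of that evaluation (illustrative only, not Spark code; `case_when` is a hypothetical helper):

```python
# Model SQL's CASE WHEN with no ELSE branch: if no condition matches,
# the result is NULL (represented here as Python's None).
def case_when(branches, otherwise=None):
    """branches: list of (condition, value) pairs; otherwise models ELSE."""
    for cond, value in branches:
        if cond:  # a false (or NULL) condition falls through to the next branch
            return value
    return otherwise  # implicit ELSE NULL when no .otherwise is given

# The reported row: bool=false, so the single WHEN branch does not fire.
conditions = case_when([(False, "I am not null")])
assert conditions is None  # the column value is NULL

# Hence filter(col("conditions").isNull) should keep the row,
# and the expected output is the one-row table shown above.
assert (conditions is None) is True
```

This also explains workaround 2): `.otherwise(null)` only makes the implicit ELSE branch explicit, which should not change the semantics at all.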
[jira] [Commented] (SPARK-37270) Incorect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443035#comment-17443035 ]

Apache Spark commented on SPARK-37270:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/34580
[jira] [Commented] (SPARK-37270) Incorect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443018#comment-17443018 ]

Yuming Wang commented on SPARK-37270:
-------------------------------------

Thank you [~bersprockets], I will fix it soon.
[jira] [Commented] (SPARK-37270) Incorect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442848#comment-17442848 ]

Bruce Robbins commented on SPARK-37270:
---------------------------------------

[~yumwang] Seems related to SPARK-33848.
[jira] [Commented] (SPARK-37270) Incorect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442433#comment-17442433 ]

Bruce Robbins commented on SPARK-37270:
---------------------------------------

I can reproduce this locally. In 3.1, the above snippet returns a single row, but on 3.2 and master it returns nothing.

You don't need checkpoint or cache to reproduce it: simply reading from a table will do:
{noformat}
drop table if exists base1;

create table base1
stored as parquet
as select * from values (false) tbl(a);

select count(*) from (
  select case when a then 'I am not null' end condition
  from base1
) s
where s.condition is null;
{noformat}
In 3.1, the above query returns 1. In 3.2 and master, it returns 0.

The issue seems to be [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L680]. Because {{elseValue}} is None, the code doesn't push the {{isnull}} into the else branch, which results in:
{noformat}
Filter CASE WHEN a#12 THEN isnull(I am not null) END
+- Relation default.base1[a#12] parquet
{noformat}
This gets simplified to:
{noformat}
Filter (a#12 AND isnull(I am not null))
+- Relation default.base1[a#12] parquet
{noformat}
This gets simplified to:
{noformat}
Filter false
+- Relation default.base1[a#12] parquet
{noformat}
And finally, this gets simplified to:
{noformat}
LocalRelation <empty>, [a#12]
{noformat}
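The simplification chain above can be checked with plain truth tables. Pushing `isnull` into the CASE expression is only sound if the implicit ELSE NULL branch becomes `isnull(NULL)`, i.e. `true`; dropping it, as the comment describes, collapses the predicate to `a AND isnull('I am not null')`, which is always false. A small Python sketch contrasting the two rewrites (an illustrative model of the described behavior, not the actual Catalyst code):

```python
# Predicate under discussion: isnull(CASE WHEN a THEN lit END),
# where the missing ELSE branch means an implicit ELSE NULL.
LIT = "I am not null"  # a non-null string literal, so isnull(LIT) is false

def correct_rewrite(a):
    # Push isnull into BOTH branches:
    # CASE WHEN a THEN isnull(lit) ELSE isnull(NULL) END
    return (LIT is None) if a else (None is None)

def buggy_rewrite(a):
    # With elseValue = None the else branch is dropped:
    # CASE WHEN a THEN isnull(lit) END  ->  a AND isnull(lit)  ->  false
    return a and (LIT is None)

# The reported row has a = false: it must pass the filter, but the
# buggy plan filters everything out, matching "returns 0" above.
assert correct_rewrite(False) is True
assert buggy_rewrite(False) is False

# For a = true both rewrites agree (the literal is not null).
assert correct_rewrite(True) is False
assert buggy_rewrite(True) is False
```

Once the predicate is constant-folded to `false`, eliminating the scan into `LocalRelation <empty>` is a legitimate step; the defect is confined to the missing else branch in the push-down.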
[jira] [Commented] (SPARK-37270) Incorect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442074#comment-17442074 ]

Hyukjin Kwon commented on SPARK-37270:
--------------------------------------

Hm, I can't reproduce this locally. Are you able to reproduce this when running locally too? e.g.:
{code}
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

val frame = Seq((false, 1)).toDF("bool", "number")
frame
  .checkpoint()
  .withColumn("conditions", when(col("bool"), "I am not null"))
  .filter(col("conditions").isNull)
  .show(false)
{code}