[jira] [Updated] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-17099:
-------------------------------
    Labels: correctness  (was: )

> Incorrect result when HAVING clause is added to group by query
> --------------------------------------------------------------
>
>                 Key: SPARK-17099
>                 URL: https://issues.apache.org/jira/browse/SPARK-17099
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Josh Rosen
>            Assignee: Xiao Li
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 2.0.1, 2.1.0
>
>
> Random query generation uncovered the following query which returns incorrect
> results when run on Spark SQL. This wasn't the original query uncovered by
> the generator, since I performed a bit of minimization to try to make it more
> understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
>     (-769, -244),
>     (-800, -409),
>     (940, 86),
>     (-507, 304),
>     (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>   ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>          COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four
> rows. However, if I omit the {{HAVING}} clause I see that the group's rows
> are being incorrectly filtered by the {{HAVING}} clause:
> {code}
> +--------------------------------------+---------------------------------------+
> | sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2)  |
> +--------------------------------------+---------------------------------------+
> | -507                                 | -1014                                 |
> | 940                                  | 1880                                  |
> | -769                                 | -1538                                 |
> | -367                                 | -734                                  |
> | -800                                 | -1600                                 |
> +--------------------------------------+---------------------------------------+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four
> rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm
> opening this bug to get help in triaging further.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
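As a cross-check of the four-row expectation in the report above, the query can be modelled with plain Scala collections, with no Spark involved. This is only a minimal sketch and not code from the issue: the object name `HavingCheck`, the helper `coalesced`, and the values `joined`, `grouped`, and `kept` are made up for illustration, and the NULL-producing side of the right join is modelled with `Option`.

```scala
// Plain-Scala model of the query in the report; Spark is not involved.
// All names below (HavingCheck, coalesced, joined, grouped, kept) are
// hypothetical and exist only for this illustration.
object HavingCheck {
  // Table contents copied from the report.
  val t1: Seq[Int] = Seq(-234, 145, 367, 975, 298) // int_col_5
  val t2: Seq[(Int, Int)] = Seq(                    // (int_col_2, int_col_5)
    (-769, -244), (-800, -409), (940, 86), (-507, 304), (-367, 158))

  // t1 RIGHT JOIN t2 ON t2.int_col_2 = t1.int_col_5:
  // every t2 row survives; the t1 side is None (NULL) when nothing matches.
  val joined: Seq[(Option[Int], (Int, Int))] = t2.flatMap { row =>
    val matches = t1.filter(_ == row._1)
    if (matches.isEmpty) Seq((Option.empty[Int], row))
    else matches.map(m => (Option(m), row))
  }

  // COALESCE(t1.int_col_5, t2.int_col_2)
  def coalesced(r: (Option[Int], (Int, Int))): Int = r._1.getOrElse(r._2._1)

  // GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
  //          COALESCE(t1.int_col_5, t2.int_col_2)
  // t2.int_col_5 is never NULL in this data, so its COALESCE is the column itself.
  // Each group yields (SUM(COALESCE(...)), COALESCE(...) * 2).
  val grouped: Seq[(Int, Int)] = joined
    .groupBy(r => (math.max(r._2._2, r._1.getOrElse(-449)), coalesced(r)))
    .toSeq
    .map { case ((_, key), rows) => (rows.map(coalesced).sum, key * 2) }

  // HAVING SUM(COALESCE(...)) > COALESCE(...) * 2
  val kept: Seq[(Int, Int)] = grouped.filter { case (sum, doubled) => sum > doubled }

  def main(args: Array[String]): Unit = {
    kept.foreach(println)
    println(s"rows surviving HAVING: ${kept.size}")
  }
}
```

Under this model no t1 value matches any t2.int_col_2, so each t2 row forms its own group. Only the 940 group fails the predicate (940 > 1880 is false), which leaves four rows and agrees with the Postgres result cited in the report.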
[jira] [Updated] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-17099:
----------------------------
    Component/s: SQL

> Incorrect result when HAVING clause is added to group by query
> --------------------------------------------------------------
>
>                 Key: SPARK-17099
>                 URL: https://issues.apache.org/jira/browse/SPARK-17099
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Josh Rosen
>            Assignee: Xiao Li
>            Priority: Blocker
>             Fix For: 2.0.1, 2.1.0
[jira] [Updated] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-17099:
-------------------------------
    Priority: Blocker  (was: Critical)

> Incorrect result when HAVING clause is added to group by query
> --------------------------------------------------------------
>
>                 Key: SPARK-17099
>                 URL: https://issues.apache.org/jira/browse/SPARK-17099
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Josh Rosen
>            Priority: Blocker
[jira] [Updated] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-17099:
-------------------------------
    Target Version/s: 2.0.1, 2.1.0
       Fix Version/s:     (was: 2.1.0)

> Incorrect result when HAVING clause is added to group by query
> --------------------------------------------------------------
>
>                 Key: SPARK-17099
>                 URL: https://issues.apache.org/jira/browse/SPARK-17099
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Josh Rosen
>            Priority: Critical
[jira] [Updated] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-17099:
-------------------------------
    Description:
Random query generation uncovered the following query which returns incorrect results when run on Spark SQL. This wasn't the original query uncovered by the generator, since I performed a bit of minimization to try to make it more understandable.

With the following tables:

{code}
val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
val t2 = sc.parallelize(
  Seq(
    (-769, -244),
    (-800, -409),
    (940, 86),
    (-507, 304),
    (-367, 158))
).toDF("int_col_2", "int_col_5")
t1.registerTempTable("t1")
t2.registerTempTable("t2")
{code}

Run

{code}
SELECT
  (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
  ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
FROM t1
RIGHT JOIN t2
  ON (t2.int_col_2) = (t1.int_col_5)
GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
         COALESCE(t1.int_col_5, t2.int_col_2)
HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
{code}

In Spark SQL, this returns an empty result set, whereas Postgres returns four rows. However, if I omit the {{HAVING}} clause I see that the group's rows are being incorrectly filtered by the {{HAVING}} clause:

{code}
+--------------------------------------+---------------------------------------+
| sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2)  |
+--------------------------------------+---------------------------------------+
| -507                                 | -1014                                 |
| 940                                  | 1880                                  |
| -769                                 | -1538                                 |
| -367                                 | -734                                  |
| -800                                 | -1600                                 |
+--------------------------------------+---------------------------------------+
{code}

Based on this, the output after adding the {{HAVING}} should contain four rows, not zero.

I'm not sure how to further shrink this in a straightforward way, so I'm opening this bug to get help in triaging further.

  was:
Random query generation uncovered the following query which returns incorrect results when run on Spark SQL. This wasn't the original query uncovered by the generator, since I performed a bit of minimization to try to make it more understandable.

With the following tables:

{code}
val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
val t2 = sc.parallelize(
  Seq(
    (-769, -244),
    (-800, -409),
    (940, 86),
    (-507, 304),
    (-367, 158))
).toDF("int_col_2", "int_col_5")
t1.registerTempTable("t1")
t2.registerTempTable("t2")
{code}

Run

{code}
SELECT
  (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
  ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
FROM t1
RIGHT JOIN t2
  ON (t2.int_col_2) = (t1.int_col_5)
GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
         COALESCE(t1.int_col_5, t2.int_col_2)
HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
{code}

In Spark SQL, this returns an empty result set, whereas Postgres returns four rows. However, if I omit the {{HAVING}} clause I see that the group's rows are being incorrectly filtered by it:

{code}
+--------------------------------------+---------------------------------------+
| sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2)  |
+--------------------------------------+---------------------------------------+
| -507                                 | -1014                                 |
| 940                                  | 1880                                  |
| -769                                 | -1538                                 |
| -367                                 | -734                                  |
| -800                                 | -1600                                 |
+--------------------------------------+---------------------------------------+
{code}

Based on this, the output after adding the {{HAVING}} should contain four rows, not zero.

I'm not sure how to further shrink this in a straightforward way, so I'm opening this bug to get help in triaging further.


> Incorrect result when HAVING clause is added to group by query
> --------------------------------------------------------------
>
>                 Key: SPARK-17099
>                 URL: https://issues.apache.org/jira/browse/SPARK-17099
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Josh Rosen
>            Priority: Critical
>             Fix For: 2.1.0
>
>
> Random query generation uncovered the following
[jira] [Updated] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-17099:
-------------------------------
    Summary: Incorrect result when HAVING clause is added to group by query  (was: Incorrect result when complex HAVING clause is added to query)

> Incorrect result when HAVING clause is added to group by query
> --------------------------------------------------------------
>
>                 Key: SPARK-17099
>                 URL: https://issues.apache.org/jira/browse/SPARK-17099
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Josh Rosen
>            Priority: Critical
>             Fix For: 2.1.0