[jira] [Updated] (SPARK-47319) Improve missingInput calculation
[ https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-47319: --- Description: {{QueryPlan.missingInput()}} calculation seems to be the root cause of {{DeduplicateRelations}} slowness in some cases. Let's try to improve it. (was: {{QueryPlan.missingInput()}} calculation seems to be the root cause of {{DeduplicateRelations}} slowness. Let's try to improve it.) > Improve missingInput calculation > > > Key: SPARK-47319 > URL: https://issues.apache.org/jira/browse/SPARK-47319 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {{QueryPlan.missingInput()}} calculation seems to be the root cause of > {{DeduplicateRelations}} slowness in some cases. Let's try to improve it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
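The calculation being optimized can be modelled in a few lines of plain Python. This is a hedged sketch of the semantics only; the `Node`, `references`, and `outputs` names are illustrative, not Catalyst's actual classes, and attribute IDs are shown as strings:

```python
# Hypothetical model (not Spark's Scala code) of what QueryPlan.missingInput()
# computes: the attributes an operator references that none of its children
# produce.

class Node:
    def __init__(self, references, outputs, children=()):
        self.references = frozenset(references)  # attribute IDs this node uses
        self.outputs = frozenset(outputs)        # attribute IDs this node produces
        self.children = tuple(children)

    def input_set(self):
        # union of the children's outputs (Spark calls this `inputSet`)
        s = frozenset()
        for child in self.children:
            s |= child.outputs
        return s

    def missing_input(self):
        # references with no producing child: the set difference that, per the
        # ticket, DeduplicateRelations ends up evaluating in a slow way
        return self.references - self.input_set()

scan = Node(references=[], outputs=["a#1", "b#2"])
proj = Node(references=["a#1", "c#3"], outputs=["a#1"], children=[scan])
assert proj.missing_input() == {"c#3"}  # c#3 has no producer
```

Computing the two sets eagerly on every call over a large plan is one plausible source of the reported slowness; caching or incrementally maintaining them is the kind of improvement the ticket hints at.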
[jira] [Updated] (SPARK-47319) Improve missingInput calculation
[ https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-47319: --- Description: {{QueryPlan.missingInput()}} calculation seems to be the root cause of {{DeduplicateRelations}} slowness. Let's try to improve it. > Improve missingInput calculation > > > Key: SPARK-47319 > URL: https://issues.apache.org/jira/browse/SPARK-47319 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {{QueryPlan.missingInput()}} calculation seems to be the root cause of > {{DeduplicateRelations}} slowness. Let's try to improve it.
[jira] [Updated] (SPARK-47319) Improve missingInput calculation
[ https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-47319: --- Summary: Improve missingInput calculation (was: Fix missingInput calculation) > Improve missingInput calculation > > > Key: SPARK-47319 > URL: https://issues.apache.org/jira/browse/SPARK-47319 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Created] (SPARK-47319) Fix missingInput calculation
Peter Toth created SPARK-47319: -- Summary: Fix missingInput calculation Key: SPARK-47319 URL: https://issues.apache.org/jira/browse/SPARK-47319 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Peter Toth
[jira] [Updated] (SPARK-47217) De-duplication of Relations in Joins, can result in plan resolution failure
[ https://issues.apache.org/jira/browse/SPARK-47217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-47217: --- Shepherd: (was: Peter Toth) > De-duplication of Relations in Joins, can result in plan resolution failure > --- > > Key: SPARK-47217 > URL: https://issues.apache.org/jira/browse/SPARK-47217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > Labels: Spark-SQL, pull-request-available > > In case of some flavours of nested joins involving repetition of relation, > the projected columns, when passed to the DataFrame.select API in the form of > df.column, can result in plan resolution failure due to attribute resolution > not happening. > A scenario in which this happens is > {noformat} > > Project ( dataframe A.column("col-a") ) > | > Join2 > |          | > Join1     DataFrame A > |          | > DataFrame A     DataFrame B > {noformat} > In such cases, if it so happens that Join2 - right leg DataFrame A gets > re-aliased due to De-Duplication of relations, and if the project uses Column > definition obtained from DataFrame A, its exprId will not match the > re-aliased Join2 - right Leg - DataFrame A, causing resolution failure.
[jira] [Resolved] (SPARK-45805) Eliminate magic numbers in withOrigin
[ https://issues.apache.org/jira/browse/SPARK-45805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth resolved SPARK-45805. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43671 [https://github.com/apache/spark/pull/43671] > Eliminate magic numbers in withOrigin > - > > Key: SPARK-45805 > URL: https://issues.apache.org/jira/browse/SPARK-45805 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Refactor `withOrigin`, and make it more generic by eliminating the magic > number from which the traversal of stack traces starts.
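The refactoring idea, as described, is to stop hard-coding how many stack frames to skip before reaching user code. A minimal sketch of that idea, assuming frames are `(filename, lineno)` pairs and that library code lives under an internal path prefix (both assumptions are mine, not Spark's actual implementation):

```python
# Sketch of the "no magic number" withOrigin idea: instead of skipping a
# fixed N frames, scan the stack (innermost first) for the first frame that
# does not belong to the library itself.
INTERNAL_PREFIX = "sparklib/"  # assumed internal source-path prefix

def first_user_frame(stack):
    """stack: innermost-first list of (filename, lineno) tuples."""
    for filename, lineno in stack:
        if not filename.startswith(INTERNAL_PREFIX):
            return filename, lineno  # the user call site, i.e. the origin
    return None

stack = [
    ("sparklib/expressions.py", 42),  # library internals, depth may vary
    ("sparklib/column.py", 7),
    ("SimpleApp.py", 9),              # first user frame -> the origin
    ("runpy.py", 1),
]
assert first_user_frame(stack) == ("SimpleApp.py", 9)
```

The scan stays correct even when the internal call depth changes, which is exactly what a fixed skip count cannot guarantee.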
[jira] [Resolved] (SPARK-45354) Resolve functions bottom-up
[ https://issues.apache.org/jira/browse/SPARK-45354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth resolved SPARK-45354. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43146 [https://github.com/apache/spark/pull/43146] > Resolve functions bottom-up > --- > > Key: SPARK-45354 > URL: https://issues.apache.org/jira/browse/SPARK-45354 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > This PR proposes bottom-up resolution in {{{}ResolveFunctions{}}}, which is > much faster if we have deeply nested {{{}UnresolvedFunctions{}}}. These > structures are more likely to occur after > [#42864|https://github.com/apache/spark/pull/42864].
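One plausible reading of the speedup: a rule that resolves only nodes whose children are already resolved, when run top-down to a fixed point, peels one layer per pass, so an n-deep chain of nested functions costs O(n) passes over the tree; a single post-order (bottom-up) traversal resolves everything in one pass. A toy model of the bottom-up pass (`Unresolved`/`Resolved` here are illustrative stand-ins, not Catalyst's classes):

```python
# Toy model: a deeply nested chain f1(f2(...f100(x)...)) of unresolved
# function nodes, resolved in a single post-order traversal.

class Unresolved:
    def __init__(self, name, child=None):
        self.name, self.child = name, child

class Resolved:
    def __init__(self, name, child=None):
        self.name, self.child = name, child

def resolve_bottom_up(node):
    if node is None:
        return None
    child = resolve_bottom_up(node.child)  # resolve the subtree first
    # with its child resolved, this node can be resolved immediately
    return Resolved(node.name, child)

# build f1(f2(...f100(None)...)): f100 innermost, f1 outermost
tree = None
for i in range(100, 0, -1):
    tree = Unresolved(f"f{i}", tree)

resolved = resolve_bottom_up(tree)
assert resolved.name == "f1" and isinstance(resolved, Resolved)
```

The single traversal visits each of the 100 nodes once, versus roughly 100 fixed-point passes for the layer-at-a-time approach.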
[jira] [Assigned] (SPARK-45354) Resolve functions bottom-up
[ https://issues.apache.org/jira/browse/SPARK-45354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth reassigned SPARK-45354: -- Assignee: Peter Toth > Resolve functions bottom-up > --- > > Key: SPARK-45354 > URL: https://issues.apache.org/jira/browse/SPARK-45354 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Labels: pull-request-available > > > This PR proposes bottom-up resolution in {{{}ResolveFunctions{}}}, which is > much faster if we have deeply nested {{{}UnresolvedFunctions{}}}. These > structures are more likely to occur after > [#42864|https://github.com/apache/spark/pull/42864].
[jira] [Updated] (SPARK-45354) Resolve functions bottom-up
[ https://issues.apache.org/jira/browse/SPARK-45354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-45354: --- Description: This PR proposes bottom-up resolution in {{{}ResolveFunctions{}}}, which is much faster if we have deeply nested {{{}UnresolvedFunctions{}}}. These structures are more likely to occur after [#42864|https://github.com/apache/spark/pull/42864]. > Resolve functions bottom-up > --- > > Key: SPARK-45354 > URL: https://issues.apache.org/jira/browse/SPARK-45354 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Priority: Major > Labels: pull-request-available > > > This PR proposes bottom-up resolution in {{{}ResolveFunctions{}}}, which is > much faster if we have deeply nested {{{}UnresolvedFunctions{}}}. These > structures are more likely to occur after > [#42864|https://github.com/apache/spark/pull/42864].
[jira] [Created] (SPARK-45354) Resolve functions bottom-up
Peter Toth created SPARK-45354: -- Summary: Resolve functions bottom-up Key: SPARK-45354 URL: https://issues.apache.org/jira/browse/SPARK-45354 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Peter Toth
[jira] [Updated] (SPARK-45216) Fix non-deterministic seeded Dataset APIs
[ https://issues.apache.org/jira/browse/SPARK-45216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-45216: --- Description: If we run the following example, the result is two equal columns, as expected: {noformat} val c = rand() df.select(c, c) +--+--+ |rand(-4522010140232537566)|rand(-4522010140232537566)| +--+--+ |0.4520819282997137|0.4520819282997137| +--+--+ {noformat} But if we use other similar APIs, their results are incorrect: {noformat} val r1 = random() val r2 = uuid() val r3 = shuffle(col("x")) val x = df.select(r1, r1, r2, r2, r3, r3) +--+--+++--+--+ |rand()|rand()| uuid()| uuid()|shuffle(x)|shuffle(x)| +--+--+++--+--+ |0.7407604956381952|0.7957319451135009|e55bc4b0-74e6-4b0...|a587163b-d06b-4bb...| [1, 2, 3]| [2, 1, 3]| +--+--+++--+--+ {noformat} > Fix non-deterministic seeded Dataset APIs > - > > Key: SPARK-45216 > URL: https://issues.apache.org/jira/browse/SPARK-45216 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Priority: Major > > If we run the following example, the result is two equal columns, as expected: > {noformat} > val c = rand() > df.select(c, c) > +--+--+ > |rand(-4522010140232537566)|rand(-4522010140232537566)| > +--+--+ > |0.4520819282997137|0.4520819282997137| > +--+--+ > {noformat} > > But if we use other similar APIs, their results are incorrect: > {noformat} > val r1 = random() > val r2 = uuid() > val r3 = shuffle(col("x")) > val x = df.select(r1, r1, r2, r2, r3, r3) > +--+--+++--+--+ > |rand()|rand()| uuid()| > uuid()|shuffle(x)|shuffle(x)| > +--+--+++--+--+ > |0.7407604956381952|0.7957319451135009|e55bc4b0-74e6-4b0...|a587163b-d06b-4bb...| > [1, 2, 3]| [2, 1, 3]| > +--+--+++--+--+ > {noformat}
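The contrast in the examples above (note that `rand()` prints a concrete seed, `rand(-4522010140232537566)`, while the broken APIs print bare `rand()`) suggests the fix: draw the seed once, when the Column object is created, so that reusing that object yields identical values. This is a hedged sketch of that idea in plain Python; `RandColumn` and its `evaluate` method are hypothetical illustrations, not Spark's API:

```python
import random

class RandColumn:
    """Hypothetical seeded column: the seed is fixed at Column-creation
    time (like rand() capturing its seed), not drawn per use. Reusing the
    same object therefore yields equal values; two separately created
    columns get independent seeds and differ with high probability."""
    def __init__(self, seed=None):
        self.seed = random.randrange(2**63) if seed is None else seed

    def evaluate(self, row_index):
        # deterministic given (seed, row): repeated uses of this Column agree
        return random.Random(self.seed * 1_000_003 + row_index).random()

c = RandColumn()
# like `val c = rand(); df.select(c, c)`: both uses produce equal values
assert c.evaluate(0) == c.evaluate(0)
```

Under this scheme `random()`, `uuid()`, and `shuffle()` would behave like `rand()` in the first example instead of producing different values per reference.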
[jira] [Created] (SPARK-45216) Fix non-deterministic seeded Dataset APIs
Peter Toth created SPARK-45216: -- Summary: Fix non-deterministic seeded Dataset APIs Key: SPARK-45216 URL: https://issues.apache.org/jira/browse/SPARK-45216 Project: Spark Issue Type: Bug Components: Connect, SQL Affects Versions: 4.0.0 Reporter: Peter Toth
[jira] [Updated] (SPARK-45112) Use UnresolvedFunction based resolution in SQL Dataset functions
[ https://issues.apache.org/jira/browse/SPARK-45112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-45112: --- Summary: Use UnresolvedFunction based resolution in SQL Dataset functions (was: Use UnresolvedFunction in dataset functions) > Use UnresolvedFunction based resolution in SQL Dataset functions > > > Key: SPARK-45112 > URL: https://issues.apache.org/jira/browse/SPARK-45112 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Priority: Minor > Labels: pull-request-available > >
[jira] [Created] (SPARK-45112) Use UnresolvedFunction in dataset functions
Peter Toth created SPARK-45112: -- Summary: Use UnresolvedFunction in dataset functions Key: SPARK-45112 URL: https://issues.apache.org/jira/browse/SPARK-45112 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Peter Toth
[jira] [Updated] (SPARK-45109) Fix aes_decrypt and ln in connect
[ https://issues.apache.org/jira/browse/SPARK-45109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-45109: --- Description: The current {{aes_decrypt}} reference to {{aes_encrypt}} is clearly a bug. The {{ln}} reference to {{log}} is more like a cosmetic issue, but because {{ln}} and {{log}} function implementations are different in Spark SQL we should use the same implementation in Spark Connect too. > Fix aes_decrypt and ln in connect > - > > Key: SPARK-45109 > URL: https://issues.apache.org/jira/browse/SPARK-45109 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0, 4.0.0 >Reporter: Peter Toth >Priority: Major > Labels: pull-request-available > > The current {{aes_decrypt}} reference to {{aes_encrypt}} is clearly a bug. > The {{ln}} reference to {{log}} is more like a cosmetic issue, but because > {{ln}} and {{log}} function implementations are different in Spark SQL we > should use the same implementation in Spark Connect too.
[jira] [Created] (SPARK-45109) Fix aes_decrypt and ln in connect
Peter Toth created SPARK-45109: -- Summary: Fix aes_decrypt and ln in connect Key: SPARK-45109 URL: https://issues.apache.org/jira/browse/SPARK-45109 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0, 4.0.0 Reporter: Peter Toth
[jira] [Updated] (SPARK-45022) Provide context for dataset API errors
[ https://issues.apache.org/jira/browse/SPARK-45022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-45022: --- Description: SQL failures already provide nice error context when there is a failure: {noformat} org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == SQL(line 1, position 1) == a / b ^ at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) at org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala) ... {noformat} We could add a similar user friendly error context to Dataset APIs. E.g. consider the following Spark app SimpleApp.scala: {noformat} 1 import org.apache.spark.sql.SparkSession 2 import org.apache.spark.sql.functions._ 3 4 object SimpleApp { 5def main(args: Array[String]) { 6 val spark = SparkSession.builder.appName("Simple Application").config("spark.sql.ansi.enabled", true).getOrCreate() 7 import spark.implicits._ 8 9 val c = col("a") / col("b") 10 11 Seq((1, 0)).toDF("a", "b").select(c).show() 12 13 spark.stop() 14} 15 } {noformat} then the error context could be: {noformat} Exception in thread "main" org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == Dataset == "div" was called from SimpleApp$.main(SimpleApp.scala:9) at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) at org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672 ... 
{noformat} > Provide context for dataset API errors > -- > > Key: SPARK-45022 > URL: https://issues.apache.org/jira/browse/SPARK-45022 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Priority: Major > > SQL failures already provide nice error context when there is a failure: > {noformat} > org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. > Use `try_divide` to tolerate divisor being 0 and return NULL instead. If > necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. > == SQL(line 1, position 1) == > a / b > ^ > at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) > at > org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala) > ... > {noformat} > We could add a similar user friendly error context to Dataset APIs. > E.g. consider the following Spark app SimpleApp.scala: > {noformat} >1 import org.apache.spark.sql.SparkSession >2 import org.apache.spark.sql.functions._ >3 >4 object SimpleApp { >5def main(args: Array[String]) { >6 val spark = SparkSession.builder.appName("Simple > Application").config("spark.sql.ansi.enabled", true).getOrCreate() >7 import spark.implicits._ >8 >9 val c = col("a") / col("b") > 10 > 11 Seq((1, 0)).toDF("a", "b").select(c).show() > 12 > 13 spark.stop() > 14} > 15 } > {noformat} > then the error context could be: > {noformat} > Exception in thread "main" org.apache.spark.SparkArithmeticException: > [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being > 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error. 
> == Dataset == > "div" was called from SimpleApp$.main(SimpleApp.scala:9) > at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) > at > org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672 > ... > {noformat}
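The proposed "== Dataset ==" context can be built by capturing the user call site when the expression object is constructed and replaying it when evaluation fails. A sketch under assumed names (`DivExpr` and its `origin` field are hypothetical illustrations, not Spark's API):

```python
import traceback

class DivExpr:
    """Sketch: record where the expression was constructed and surface that
    origin when evaluation fails, mirroring the proposed '== Dataset =='
    error context."""
    def __init__(self, a, b):
        self.a, self.b = a, b
        # capture the construction site: the caller of this constructor
        frame = traceback.extract_stack(limit=2)[0]
        self.origin = f"{frame.filename}:{frame.lineno}"

    def eval(self):
        try:
            return self.a / self.b
        except ZeroDivisionError:
            # the origin points at the user's `col("a") / col("b")` line,
            # not at the evaluator internals where the error surfaced
            raise ArithmeticError(
                f"[DIVIDE_BY_ZERO] 'div' was called from {self.origin}")

expr = DivExpr(1, 0)  # analogous to `val c = col("a") / col("b")`
try:
    expr.eval()       # analogous to the later .show() that triggers the error
except ArithmeticError as e:
    assert "DIVIDE_BY_ZERO" in str(e)
```

The key point the ticket makes is that construction and evaluation happen at different places, so the origin must be captured eagerly at construction time.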
[jira] [Created] (SPARK-45034) Support deterministic mode function
Peter Toth created SPARK-45034: -- Summary: Support deterministic mode function Key: SPARK-45034 URL: https://issues.apache.org/jira/browse/SPARK-45034 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Peter Toth
[jira] [Created] (SPARK-45022) Provide context for dataset API errors
Peter Toth created SPARK-45022: -- Summary: Provide context for dataset API errors Key: SPARK-45022 URL: https://issues.apache.org/jira/browse/SPARK-45022 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Peter Toth
[jira] [Assigned] (SPARK-44934) PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called over CTE with duplicate attributes
[ https://issues.apache.org/jira/browse/SPARK-44934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth reassigned SPARK-44934: -- Assignee: Wen Yuen Pang > PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called > over CTE with duplicate attributes > -- > > Key: SPARK-44934 > URL: https://issues.apache.org/jira/browse/SPARK-44934 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.3.3, 3.4.1 >Reporter: Wen Yuen Pang >Assignee: Wen Yuen Pang >Priority: Minor > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > > When running the query > {code:java} > with cte as ( > select c1, c1, c2, c3 from t where random() > 0 > ) > select cte.c1, cte2.c1, cte.c2, cte2.c3 from > (select c1, c2 from cte) cte > inner join > (select c1, c3 from cte) cte2 > on cte.c1 = cte2.c1 {code} > > The query fails with the error > {code:java} > org.apache.spark.scheduler.DAGScheduler: Failed to update accumulator 9523 > (Unknown class) for task 1 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 9523{code} > Further investigation shows that the rule > PushdownPredicatesAndPruneColumnsForCTEDef creates an invalid plan when the > output of a CTE contains duplicate expression IDs.
[jira] [Updated] (SPARK-44934) PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called over CTE with duplicate attributes
[ https://issues.apache.org/jira/browse/SPARK-44934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-44934: --- Affects Version/s: 3.3.3 > PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called > over CTE with duplicate attributes > -- > > Key: SPARK-44934 > URL: https://issues.apache.org/jira/browse/SPARK-44934 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.3.3, 3.4.1 >Reporter: Wen Yuen Pang >Priority: Minor > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > > When running the query > {code:java} > with cte as ( > select c1, c1, c2, c3 from t where random() > 0 > ) > select cte.c1, cte2.c1, cte.c2, cte2.c3 from > (select c1, c2 from cte) cte > inner join > (select c1, c3 from cte) cte2 > on cte.c1 = cte2.c1 {code} > > The query fails with the error > {code:java} > org.apache.spark.scheduler.DAGScheduler: Failed to update accumulator 9523 > (Unknown class) for task 1 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 9523{code} > Further investigation shows that the rule > PushdownPredicatesAndPruneColumnsForCTEDef creates an invalid plan when the > output of a CTE contains duplicate expression IDs.
[jira] [Resolved] (SPARK-44934) PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called over CTE with duplicate attributes
[ https://issues.apache.org/jira/browse/SPARK-44934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth resolved SPARK-44934. Fix Version/s: 3.3.4 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42635 [https://github.com/apache/spark/pull/42635] > PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called > over CTE with duplicate attributes > -- > > Key: SPARK-44934 > URL: https://issues.apache.org/jira/browse/SPARK-44934 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: Wen Yuen Pang >Priority: Minor > Fix For: 3.3.4, 3.5.0, 4.0.0, 3.4.2 > > > When running the query > {code:java} > with cte as ( > select c1, c1, c2, c3 from t where random() > 0 > ) > select cte.c1, cte2.c1, cte.c2, cte2.c3 from > (select c1, c2 from cte) cte > inner join > (select c1, c3 from cte) cte2 > on cte.c1 = cte2.c1 {code} > > The query fails with the error > {code:java} > org.apache.spark.scheduler.DAGScheduler: Failed to update accumulator 9523 > (Unknown class) for task 1 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 9523{code} > Further investigation shows that the rule > PushdownPredicatesAndPruneColumnsForCTEDef creates an invalid plan when the > output of a CTE contains duplicate expression IDs.
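The failure mode described above, a rule mishandling a CTE whose output repeats an attribute (`select c1, c1, c2, c3` in the example), can be illustrated abstractly: any mapping keyed by expression ID silently collapses the duplicates, so a rule that rebuilds the CTE's output from such a map drops a column. This toy sketch (expression IDs as plain integers, and a hypothetical `dedup` helper standing in for an alias-allocating fix) shows the collapse and one repair in the spirit of the patch:

```python
# The CTE's output: (column name, expression ID), with c1 duplicated
# exactly as in `select c1, c1, c2, c3 from t ...`
cte_output = [("c1", 1), ("c1", 1), ("c2", 2), ("c3", 3)]

# A pruning rule that indexes output columns by expression ID collapses
# the two c1 columns into one entry: the plan it rebuilds is invalid.
by_id = {expr_id: name for name, expr_id in cte_output}
assert len(by_id) == 3 and len(cte_output) == 4

def dedup(output):
    """Hypothetical fix sketch: give repeated IDs fresh ones (aliases)
    so every output column stays individually addressable."""
    seen, fresh, result = set(), 100, []
    for name, expr_id in output:
        if expr_id in seen:
            fresh += 1
            expr_id = fresh  # stand-in for a newly allocated expression ID
        seen.add(expr_id)
        result.append((name, expr_id))
    return result

deduped = dedup(cte_output)
assert len({eid for _, eid in deduped}) == 4  # all four columns survive
```

Whether Spark's actual fix aliases duplicates or changes the rule's bookkeeping, the invariant it must restore is the same: the pruned CTE output must have one distinct ID per output column.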
[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-44871: --- Fix Version/s: 3.5.0 > Fix PERCENTILE_DISC behaviour > - > > Key: SPARK-44871 > URL: https://issues.apache.org/jira/browse/SPARK-44871 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Critical > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > > Currently {{percentile_disc()}} returns incorrect results in some cases: > E.g.: > {code:java} > SELECT > percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0, > percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1, > percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2, > percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3, > percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4, > percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5, > percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6, > percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7, > percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8, > percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9, > percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10 > FROM VALUES (0), (1), (2), (3), (4) AS v(a) > {code} > returns: > {code:java} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {code} > but it should return: > {noformat} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {noformat}
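The "should return" row follows the SQL-standard definition of a discrete percentile: the first value, in sort order, whose cumulative distribution reaches p. A small reference sketch (not Spark's implementation) that reproduces the corrected output from the ticket for all eleven percentiles:

```python
def percentile_disc(values, p):
    """SQL-standard discrete percentile: the first value, in sort order,
    whose cumulative distribution (i/n for the i-th of n rows) is >= p."""
    xs = sorted(values)
    n = len(xs)
    for i, x in enumerate(xs, start=1):
        if i / n >= p:
            return float(x)
    return float(xs[-1])

data = [0, 1, 2, 3, 4]  # the VALUES (0)..(4) from the ticket's query
# expected results from the ticket's "should return" table, p0..p10
expected = {0.0: 0.0, 0.1: 0.0, 0.2: 0.0, 0.3: 1.0, 0.4: 1.0, 0.5: 2.0,
            0.6: 2.0, 0.7: 3.0, 0.8: 3.0, 0.9: 4.0, 1.0: 4.0}
for p, want in expected.items():
    assert percentile_disc(data, p) == want, (p, want)
```

The divergence reported in the ticket shows up at p7: with cumulative distributions 0.2, 0.4, 0.6, 0.8, 1.0 for the five rows, the first value reaching 0.7 is 3 (cume_dist 0.8), not the 2 that the buggy version returned.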
[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-44871: --- Affects Version/s: 3.4.0 3.3.2 3.3.1 3.3.0 > Fix PERCENTILE_DISC behaviour > - > > Key: SPARK-44871 > URL: https://issues.apache.org/jira/browse/SPARK-44871 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1 >Reporter: Peter Toth >Priority: Critical > Fix For: 3.5.0, 4.0.0 > > > Currently {{percentile_disc()}} returns incorrect results in some cases: > E.g.: > {code:java} > SELECT > percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0, > percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1, > percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2, > percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3, > percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4, > percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5, > percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6, > percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7, > percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8, > percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9, > percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10 > FROM VALUES (0), (1), (2), (3), (4) AS v(a) > {code} > returns: > {code:java} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {code} > but it should return: > {noformat} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {noformat}
[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-44871: --- Affects Version/s: 3.4.1 3.3.3 (was: 3.3.0) (was: 3.4.0) (was: 3.5.0) (was: 4.0.0) > Fix PERCENTILE_DISC behaviour > - > > Key: SPARK-44871 > URL: https://issues.apache.org/jira/browse/SPARK-44871 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1 >Reporter: Peter Toth >Priority: Critical > Fix For: 3.5.0, 4.0.0 > > > Currently {{percentile_disc()}} returns incorrect results in some cases: > E.g.: > {code:java} > SELECT > percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0, > percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1, > percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2, > percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3, > percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4, > percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5, > percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6, > percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7, > percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8, > percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9, > percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10 > FROM VALUES (0), (1), (2), (3), (4) AS v(a) > {code} > returns: > {code:java} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {code} > but it should return: > {noformat} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {noformat}
[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-44871: --- Fix Version/s: 3.5.0 4.0.0 > Fix PERCENTILE_DISC behaviour > - > > Key: SPARK-44871 > URL: https://issues.apache.org/jira/browse/SPARK-44871 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0, 4.0.0 >Reporter: Peter Toth >Priority: Critical > Fix For: 3.5.0, 4.0.0 > > > Currently {{percentile_disc()}} returns incorrect results in some cases: > E.g.: > {code:java} > SELECT > percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0, > percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1, > percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2, > percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3, > percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4, > percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5, > percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6, > percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7, > percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8, > percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9, > percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10 > FROM VALUES (0), (1), (2), (3), (4) AS v(a) > {code} > returns: > {code:java} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {code} > but it should return: > {noformat} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {noformat}
[jira] [Commented] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756097#comment-17756097 ] Peter Toth commented on SPARK-44871: [~tgraves], sure, I've just updated it. It looks like my PR didn't get linked here automatically, so here it is: https://github.com/apache/spark/pull/42559 > Fix PERCENTILE_DISC behaviour > - > > Key: SPARK-44871 > URL: https://issues.apache.org/jira/browse/SPARK-44871 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0, 4.0.0 >Reporter: Peter Toth >Priority: Critical > > Currently {{percentile_disc()}} returns incorrect results in some cases: > E.g.: > {code:java} > SELECT > percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0, > percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1, > percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2, > percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3, > percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4, > percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5, > percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6, > percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7, > percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8, > percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9, > percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10 > FROM VALUES (0), (1), (2), (3), (4) AS v(a) > {code} > returns: > {code:java} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {code} > but it should return: > {noformat} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {noformat} -- This message was sent by Atlassian Jira 
(v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-44871: --- Description: Currently {{percentile_disc()}} returns incorrect results in some cases: E.g.: {code:java} SELECT percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0, percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1, percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2, percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3, percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4, percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5, percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6, percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7, percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8, percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9, percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10 FROM VALUES (0), (1), (2), (3), (4) AS v(a) {code} returns: {code:java} +---+---+---+---+---+---+---+---+---+---+---+ | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| +---+---+---+---+---+---+---+---+---+---+---+ |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0| +---+---+---+---+---+---+---+---+---+---+---+ {code} but it should return: {noformat} +---+---+---+---+---+---+---+---+---+---+---+ | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| +---+---+---+---+---+---+---+---+---+---+---+ |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0| +---+---+---+---+---+---+---+---+---+---+---+ {noformat} > Fix PERCENTILE_DISC behaviour > - > > Key: SPARK-44871 > URL: https://issues.apache.org/jira/browse/SPARK-44871 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0, 4.0.0 >Reporter: Peter Toth >Priority: Critical > > Currently {{percentile_disc()}} returns incorrect results in some cases: > E.g.: > {code:java} > SELECT > percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0, > percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1, > percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2, > percentile_disc(0.3) 
WITHIN GROUP (ORDER BY a) as p3, > percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4, > percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5, > percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6, > percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7, > percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8, > percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9, > percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10 > FROM VALUES (0), (1), (2), (3), (4) AS v(a) > {code} > returns: > {code:java} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {code} > but it should return: > {noformat} > +---+---+---+---+---+---+---+---+---+---+---+ > | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| > +---+---+---+---+---+---+---+---+---+---+---+ > |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0| > +---+---+---+---+---+---+---+---+---+---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44871) Fix PERCENTILE_DISC behaviour
Peter Toth created SPARK-44871: -- Summary: Fix PERCENTILE_DISC behaviour Key: SPARK-44871 URL: https://issues.apache.org/jira/browse/SPARK-44871 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0, 3.3.0, 3.5.0, 4.0.0 Reporter: Peter Toth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43266) Move MergeScalarSubqueries to spark-sql
Peter Toth created SPARK-43266: -- Summary: Move MergeScalarSubqueries to spark-sql Key: SPARK-43266 URL: https://issues.apache.org/jira/browse/SPARK-43266 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.5.0 Reporter: Peter Toth This is a step to make SPARK-40193 easier. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43199) Make InlineCTE idempotent
Peter Toth created SPARK-43199: -- Summary: Make InlineCTE idempotent Key: SPARK-43199 URL: https://issues.apache.org/jira/browse/SPARK-43199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714114#comment-17714114 ] Peter Toth edited comment on SPARK-24497 at 4/19/23 2:00 PM: - I've opened a new PR: https://github.com/apache/spark/pull/40744 to support recursive SQL, but for some reason it didn't get automatically linked here. [~gurwls223], you might know what went wrong... was (Author: petertoth): I've opened a new PR: https://github.com/apache/spark/pull/40093 to support recursive SQL, but for some reason it didn't get automatically linked here. [~gurwls223], you might know what went wrong... > ANSI SQL: Recursive query > - > > Key: SPARK-24497 > URL: https://issues.apache.org/jira/browse/SPARK-24497 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > h3. *Examples* > Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" > represents the structure of an organization as an adjacency list. 
> {code:sql} > CREATE TABLE department ( > id INTEGER PRIMARY KEY, -- department ID > parent_department INTEGER REFERENCES department, -- upper department ID > name TEXT -- department name > ); > INSERT INTO department (id, parent_department, "name") > VALUES > (0, NULL, 'ROOT'), > (1, 0, 'A'), > (2, 1, 'B'), > (3, 2, 'C'), > (4, 2, 'D'), > (5, 0, 'E'), > (6, 4, 'F'), > (7, 5, 'G'); > -- department structure represented here is as follows: > -- > -- ROOT-+->A-+->B-+->C > -- | | > -- | +->D-+->F > -- +->E-+->G > {code} > > To extract all departments under A, you can use the following recursive > query: > {code:sql} > WITH RECURSIVE subdepartment AS > ( > -- non-recursive term > SELECT * FROM department WHERE name = 'A' > UNION ALL > -- recursive term > SELECT d.* > FROM > department AS d > JOIN > subdepartment AS sd > ON (d.parent_department = sd.id) > ) > SELECT * > FROM subdepartment > ORDER BY name; > {code} > More details: > [http://wiki.postgresql.org/wiki/CTEReadme] > [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
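The recursive-CTE evaluation described above (seed with the non-recursive term, then repeatedly join the previous step's rows until no new rows appear) can be mimicked with a small fixpoint loop in plain Python. This is a sketch of the semantics only, not Spark's or PostgreSQL's implementation; the table literal mirrors the example data:

```python
# department rows: (id, parent_department, name), as in the example table
department = [
    (0, None, 'ROOT'), (1, 0, 'A'), (2, 1, 'B'), (3, 2, 'C'),
    (4, 2, 'D'), (5, 0, 'E'), (6, 4, 'F'), (7, 5, 'G'),
]

def subdepartments(root_name):
    # non-recursive term: seed with the named department
    result = [r for r in department if r[2] == root_name]
    frontier = list(result)
    # recursive term: join the previous step's rows against the table
    # until the frontier is empty (the fixpoint)
    while frontier:
        ids = {r[0] for r in frontier}
        frontier = [r for r in department if r[1] in ids]
        result.extend(frontier)
    return sorted(r[2] for r in result)

print(subdepartments('A'))  # -> ['A', 'B', 'C', 'D', 'F']
```

Each loop iteration corresponds to one application of the recursive term in `WITH RECURSIVE subdepartment`, and the final sort plays the role of the outer `ORDER BY name`.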
[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714114#comment-17714114 ] Peter Toth commented on SPARK-24497: I've opened a new PR: https://github.com/apache/spark/pull/40093 to support recursive SQL, but for some reason it didn't get automatically linked here. [~gurwls223], you might know what went wrong... > ANSI SQL: Recursive query > - > > Key: SPARK-24497 > URL: https://issues.apache.org/jira/browse/SPARK-24497 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > h3. *Examples* > Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" > represents the structure of an organization as an adjacency list. > {code:sql} > CREATE TABLE department ( > id INTEGER PRIMARY KEY, -- department ID > parent_department INTEGER REFERENCES department, -- upper department ID > name TEXT -- department name > ); > INSERT INTO department (id, parent_department, "name") > VALUES > (0, NULL, 'ROOT'), > (1, 0, 'A'), > (2, 1, 'B'), > (3, 2, 'C'), > (4, 2, 'D'), > (5, 0, 'E'), > (6, 4, 'F'), > (7, 5, 'G'); > -- department structure represented here is as follows: > -- > -- ROOT-+->A-+->B-+->C > -- | | > -- | +->D-+->F > -- +->E-+->G > {code} > > To extract all departments under A, you can use the following recursive > query: > {code:sql} > WITH RECURSIVE subdepartment AS > ( > -- non-recursive term > SELECT * FROM department WHERE name = 'A' > UNION ALL > -- recursive term > SELECT d.* > FROM > department AS d > JOIN > subdepartment AS sd > ON (d.parent_department = sd.id) > ) > SELECT * > FROM subdepartment > ORDER BY name; > {code} > More details: > [http://wiki.postgresql.org/wiki/CTEReadme] > [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43124) Dataset.show should not trigger job execution on CommandResults
Peter Toth created SPARK-43124: -- Summary: Dataset.show should not trigger job execution on CommandResults Key: SPARK-43124 URL: https://issues.apache.org/jira/browse/SPARK-43124 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42852) Revert NamedLambdaVariable related changes from EquivalentExpressions
[ https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-42852: --- Affects Version/s: (was: 3.3.2) > Revert NamedLambdaVariable related changes from EquivalentExpressions > - > > Key: SPARK-42852 > URL: https://issues.apache.org/jira/browse/SPARK-42852 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.4.0 > > > See discussion > https://github.com/apache/spark/pull/40473#issuecomment-1474848224 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42836) Support for recursive queries
[ https://issues.apache.org/jira/browse/SPARK-42836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth resolved SPARK-42836. Resolution: Duplicate > Support for recursive queries > - > > Key: SPARK-42836 > URL: https://issues.apache.org/jira/browse/SPARK-42836 > Project: Spark > Issue Type: Question > Components: Java API, SQL >Affects Versions: 3.4.0 >Reporter: Max >Priority: Blocker > > Hello, a subtask was created a long time ago > https://issues.apache.org/jira/browse/SPARK-24497 > When will this task be completed? We really miss this. > Thx. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42836) Support for recursive queries
[ https://issues.apache.org/jira/browse/SPARK-42836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702138#comment-17702138 ] Peter Toth commented on SPARK-42836: Please ask on the already existing ticket to let others know you also support that new feature. BTW, I've opened PRs to add support for recursive queries but they always got stalled due to lack of reviews. I will try to rebase and adapt the latest in Spark 3.5 timeframe: https://github.com/apache/spark/pull/29210#issuecomment-1387144552 > Support for recursive queries > - > > Key: SPARK-42836 > URL: https://issues.apache.org/jira/browse/SPARK-42836 > Project: Spark > Issue Type: Question > Components: Java API, SQL >Affects Versions: 3.4.0 >Reporter: Max >Priority: Blocker > > Hello, a subtask was created a long time ago > https://issues.apache.org/jira/browse/SPARK-24497 > When will this task be completed? We really miss this. > Thx. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42852) Revert NamedLambdaVariable related changes from EquivalentExpressions
[ https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-42852: --- Summary: Revert NamedLambdaVariable related changes from EquivalentExpressions (was: Revert NamedLambdaVariables related changes from EquivalentExpressions) > Revert NamedLambdaVariable related changes from EquivalentExpressions > - > > Key: SPARK-42852 > URL: https://issues.apache.org/jira/browse/SPARK-42852 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0 >Reporter: Peter Toth >Priority: Major > > See discussion > https://github.com/apache/spark/pull/40473#issuecomment-1474848224 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42852) Revert NamedLambdaVariables related changes from EquivalentExpressions
[ https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-42852: --- Summary: Revert NamedLambdaVariables related changes from EquivalentExpressions (was: Rervert NamedLambdaVariables related changes from EquivalentExpressions) > Revert NamedLambdaVariables related changes from EquivalentExpressions > -- > > Key: SPARK-42852 > URL: https://issues.apache.org/jira/browse/SPARK-42852 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0 >Reporter: Peter Toth >Priority: Major > > See discussion > https://github.com/apache/spark/pull/40473#issuecomment-1474848224 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42852) Rervert NamedLambdaVariables related changes from EquivalentExpressions
Peter Toth created SPARK-42852: -- Summary: Rervert NamedLambdaVariables related changes from EquivalentExpressions Key: SPARK-42852 URL: https://issues.apache.org/jira/browse/SPARK-42852 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.2, 3.4.0 Reporter: Peter Toth See discussion https://github.com/apache/spark/pull/40473#issuecomment-1474848224 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2
[ https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-42745: --- Description: After SPARK-40086 / SPARK-42049 the following, simple subselect expression containing query: {noformat} select (select sum(id) from t1) {noformat} fails with: {noformat} 09:48:57.645 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch$lzycompute(BatchScanExec.scala:47) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch(BatchScanExec.scala:47) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.hashCode(BatchScanExec.scala:60) at scala.runtime.Statics.anyHash(Statics.java:122) ... at org.apache.spark.sql.catalyst.trees.TreeNode.hashCode(TreeNode.scala:249) at scala.runtime.Statics.anyHash(Statics.java:122) at scala.collection.mutable.HashTable$HashUtils.elemHashCode(HashTable.scala:416) at scala.collection.mutable.HashTable$HashUtils.elemHashCode$(HashTable.scala:416) at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:44) at scala.collection.mutable.HashTable.addEntry(HashTable.scala:149) at scala.collection.mutable.HashTable.addEntry$(HashTable.scala:148) at scala.collection.mutable.HashMap.addEntry(HashMap.scala:44) at scala.collection.mutable.HashTable.init(HashTable.scala:110) at scala.collection.mutable.HashTable.init$(HashTable.scala:89) at scala.collection.mutable.HashMap.init(HashMap.scala:44) at scala.collection.mutable.HashMap.readObject(HashMap.scala:195) ... 
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:85) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1520) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {noformat} when DSv2 is enabled. > Improved AliasAwareOutputExpression works with DSv2 > --- > > Key: SPARK-42745 > URL: https://issues.apache.org/jira/browse/SPARK-42745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.4.0 > > > After SPARK-40086 / SPARK-42049 the following, simple subselect expression > containing query: > {noformat} > select (select sum(id) from t1) > {noformat} > fails with: > {noformat} > 09:48:57.645 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 > in stage 3.0 (TID 3) > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch$lzycompute(BatchScanExec.scala:47) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch(BatchScanExec.scala:47) > at > org.apache.spark.sql.execution.datasources.v2.BatchScanExec.hashCode(BatchScanExec.scala:60) > at scala.runtime.Statics.anyHash(Statics.java:122) > ... 
> at > org.apache.spark.sql.catalyst.trees.TreeNode.hashCode(TreeNode.scala:249) > at scala.runtime.Statics.anyHash(Statics.java:122) > at > scala.collection.mutable.HashTable$HashUtils.elemHashCode(HashTable.scala:416) > at > scala.collection.mutable.HashTable$HashUtils.elemHashCode$(HashTable.scala:416) > at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:44) > at scala.collection.mutable.HashTable.addEntry(HashTable.scala:149) > at scala.collection.mutable.HashTable.addEntry$(HashTable.scala:148) > at scala.collection.mutable.HashMap.addEntry(HashMap.scala:44) > at scala.collection.mutable.HashTable.init(HashTable.scala:110) > at scala.collection.mutable.HashTable.init$(HashTable.scala:89) > at scala.collection.mutable.HashMap.init(HashMap.scala:44) > at scala.collection.mutable.HashMap.readObject(HashMap.scala:195) > ... > at
[jira] [Updated] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2
[ https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-42745: --- Summary: Improved AliasAwareOutputExpression works with DSv2 (was: Fix NPE after recent AliasAwareOutputExpression changes) > Improved AliasAwareOutputExpression works with DSv2 > --- > > Key: SPARK-42745 > URL: https://issues.apache.org/jira/browse/SPARK-42745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42745) Fix NPE after recent AliasAwareOutputExpression changes
Peter Toth created SPARK-42745: -- Summary: Fix NPE after recent AliasAwareOutputExpression changes Key: SPARK-42745 URL: https://issues.apache.org/jira/browse/SPARK-42745 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0, 3.5.0 Reporter: Peter Toth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42438) Improve constraint propagation using multiTransform
Peter Toth created SPARK-42438: -- Summary: Improve constraint propagation using multiTransform Key: SPARK-42438 URL: https://issues.apache.org/jira/browse/SPARK-42438 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Peter Toth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42436) Improve multiTransform to generate alternatives dynamically
Peter Toth created SPARK-42436: -- Summary: Improve multiTransform to generate alternatives dynamically Key: SPARK-42436 URL: https://issues.apache.org/jira/browse/SPARK-42436 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Peter Toth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42435) Update DataTables to 1.13.2
Peter Toth created SPARK-42435: -- Summary: Update DataTables to 1.13.2 Key: SPARK-42435 URL: https://issues.apache.org/jira/browse/SPARK-42435 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.5.0 Reporter: Peter Toth Version 1.10.25 of DataTables, which Spark uses, seems vulnerable: https://security.snyk.io/package/npm/datatables.net/1.10.25. It may or may not affect Spark, but updating to the latest 1.13.2 seems doable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685858#comment-17685858 ] Peter Toth edited comment on SPARK-42346 at 2/8/23 11:16 AM: - [~ritikam], you also need to disable the "ConvertToLocalRelation" rule optimization `--conf "spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation"` to get the error from spark-shell. was (Author: petertoth): [~ritikam], you also need to disable the "ConvertToLocalRelation" rule optimization `--conf "spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation"` to get the error from spark-schell. > distinct(count colname) with UNION ALL causes query analyzer bug > > > Key: SPARK-42346 > URL: https://issues.apache.org/jira/browse/SPARK-42346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 >Reporter: Robin >Assignee: Peter Toth >Priority: Major > Fix For: 3.3.2, 3.4.0, 3.5.0 > > > If you combine a UNION ALL with a count(distinct colname) you get a query > analyzer bug. > > This behaviour is introduced in 3.3.0. The bug was not present in 3.2.1. > > Here is a reprex in PySpark: > {{df_pd = pd.DataFrame([}} > {{ \{'surname': 'a', 'first_name': 'b'}}} > {{])}} > {{df_spark = spark.createDataFrame(df_pd)}} > {{df_spark.createOrReplaceTempView("input_table")}} > {{sql = """}} > {{SELECT }} > {{ (SELECT Count(DISTINCT first_name) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table}} > {{UNION ALL}} > {{SELECT }} > {{ (SELECT Count(DISTINCT surname) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table """}} > {{spark.sql(sql).toPandas()}} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685858#comment-17685858 ] Peter Toth commented on SPARK-42346: [~ritikam], you also need to disable the "ConvertToLocalRelation" rule optimization `--conf "spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation"` to get the error from spark-schell. > distinct(count colname) with UNION ALL causes query analyzer bug > > > Key: SPARK-42346 > URL: https://issues.apache.org/jira/browse/SPARK-42346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 >Reporter: Robin >Assignee: Peter Toth >Priority: Major > Fix For: 3.3.2, 3.4.0, 3.5.0 > > > If you combine a UNION ALL with a count(distinct colname) you get a query > analyzer bug. > > This behaviour is introduced in 3.3.0. The bug was not present in 3.2.1. > > Here is a reprex in PySpark: > {{df_pd = pd.DataFrame([}} > {{ \{'surname': 'a', 'first_name': 'b'}}} > {{])}} > {{df_spark = spark.createDataFrame(df_pd)}} > {{df_spark.createOrReplaceTempView("input_table")}} > {{sql = """}} > {{SELECT }} > {{ (SELECT Count(DISTINCT first_name) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table}} > {{UNION ALL}} > {{SELECT }} > {{ (SELECT Count(DISTINCT surname) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table """}} > {{spark.sql(sql).toPandas()}} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685124#comment-17685124 ] Peter Toth commented on SPARK-42346: [~ritikam], please use the Pyspark repro in description or add a 2nd row to your input_table if you use Scala. That's because Spark can optimize out count distinct from one row local relations. > distinct(count colname) with UNION ALL causes query analyzer bug > > > Key: SPARK-42346 > URL: https://issues.apache.org/jira/browse/SPARK-42346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 >Reporter: Robin >Assignee: Peter Toth >Priority: Major > Fix For: 3.3.2, 3.4.0, 3.5.0 > > > If you combine a UNION ALL with a count(distinct colname) you get a query > analyzer bug. > > This behaviour is introduced in 3.3.0. The bug was not present in 3.2.1. > > Here is a reprex in PySpark: > {{df_pd = pd.DataFrame([}} > {{ \{'surname': 'a', 'first_name': 'b'}}} > {{])}} > {{df_spark = spark.createDataFrame(df_pd)}} > {{df_spark.createOrReplaceTempView("input_table")}} > {{sql = """}} > {{SELECT }} > {{ (SELECT Count(DISTINCT first_name) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table}} > {{UNION ALL}} > {{SELECT }} > {{ (SELECT Count(DISTINCT surname) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table """}} > {{spark.sql(sql).toPandas()}} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684272#comment-17684272 ] Peter Toth commented on SPARK-42346: [~yumwang], [~RobinLinacre], https://github.com/apache/spark/pull/39887 will fix the issue. [~viirya], as this is a regression from 3.2 to 3.3, if possible please include this in 3.3.2. > distinct(count colname) with UNION ALL causes query analyzer bug > > > Key: SPARK-42346 > URL: https://issues.apache.org/jira/browse/SPARK-42346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 >Reporter: Robin >Priority: Major > > If you combine a UNION ALL with a count(distinct colname) you get a query > analyzer bug. > > This behaviour is introduced in 3.3.0. The bug was not present in 3.2.1. > > Here is a reprex in PySpark: > {{df_pd = pd.DataFrame([}} > {{ \{'surname': 'a', 'first_name': 'b'}}} > {{])}} > {{df_spark = spark.createDataFrame(df_pd)}} > {{df_spark.createOrReplaceTempView("input_table")}} > {{sql = """}} > {{SELECT }} > {{ (SELECT Count(DISTINCT first_name) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table}} > {{UNION ALL}} > {{SELECT }} > {{ (SELECT Count(DISTINCT surname) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table """}} > {{spark.sql(sql).toPandas()}} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
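The expected behaviour of the repro can be cross-checked against another engine. Running an equivalent query through Python's built-in sqlite3 module (a sanity check of the SQL semantics only, not a Spark repro; the Spark failure is specific to its subquery handling) yields the distinct counts with no analyzer error. A second row is included, matching the later advice that single-row inputs can let the count distinct be optimized away:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input_table (surname TEXT, first_name TEXT)")
conn.executemany("INSERT INTO input_table VALUES (?, ?)",
                 [("a", "b"), ("a", "c")])
rows = conn.execute("""
    SELECT (SELECT COUNT(DISTINCT first_name) FROM input_table)
      AS distinct_value_count
    FROM input_table
    UNION ALL
    SELECT (SELECT COUNT(DISTINCT surname) FROM input_table)
    FROM input_table
""").fetchall()
# two rows per branch: distinct first_name = 2, distinct surname = 1
print(sorted(rows))  # -> [(1,), (1,), (2,), (2,)]
```

A correct engine returns one row per input row per branch; the Spark regression tracked here failed during analysis instead of producing this result.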
[jira] [Updated] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-42346: --- Affects Version/s: 3.3.0 3.4.0 3.5.0 (was: 3.3.1) > distinct(count colname) with UNION ALL causes query analyzer bug > > > Key: SPARK-42346 > URL: https://issues.apache.org/jira/browse/SPARK-42346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 >Reporter: Robin >Priority: Major > > If you combine a UNION ALL with a count(distinct colname) you get a query > analyzer bug. > > This behaviour is introduced in 3.3.0. The bug was not present in 3.2.1. > > Here is a reprex in PySpark: > {{df_pd = pd.DataFrame([}} > {{ \{'surname': 'a', 'first_name': 'b'}}} > {{])}} > {{df_spark = spark.createDataFrame(df_pd)}} > {{df_spark.createOrReplaceTempView("input_table")}} > {{sql = """}} > {{SELECT }} > {{ (SELECT Count(DISTINCT first_name) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table}} > {{UNION ALL}} > {{SELECT }} > {{ (SELECT Count(DISTINCT surname) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table """}} > {{spark.sql(sql).toPandas()}} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684184#comment-17684184 ] Peter Toth commented on SPARK-42346: Thanks for pinging me [~yumwang], this might be subquery merge related. I will look into it. > distinct(count colname) with UNION ALL causes query analyzer bug > > > Key: SPARK-42346 > URL: https://issues.apache.org/jira/browse/SPARK-42346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Robin >Priority: Major > > If you combine a UNION ALL with a count(distinct colname) you get a query > analyzer bug. > > This behaviour is introduced in 3.3.0. The bug was not present in 3.2.1. > > Here is a reprex in PySpark: > {{df_pd = pd.DataFrame([}} > {{ \{'surname': 'a', 'first_name': 'b'}}} > {{])}} > {{df_spark = spark.createDataFrame(df_pd)}} > {{df_spark.createOrReplaceTempView("input_table")}} > {{sql = """}} > {{SELECT }} > {{ (SELECT Count(DISTINCT first_name) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table}} > {{UNION ALL}} > {{SELECT }} > {{ (SELECT Count(DISTINCT surname) FROM input_table) }} > {{ AS distinct_value_count}} > {{FROM input_table """}} > {{spark.sql(sql).toPandas()}} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42136) Refactor BroadcastHashJoinExec output partitioning generation
Peter Toth created SPARK-42136: -- Summary: Refactor BroadcastHashJoinExec output partitioning generation Key: SPARK-42136 URL: https://issues.apache.org/jira/browse/SPARK-42136 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth
[jira] [Created] (SPARK-42134) Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes
Peter Toth created SPARK-42134: -- Summary: Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes Key: SPARK-42134 URL: https://issues.apache.org/jira/browse/SPARK-42134 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth
[jira] [Created] (SPARK-41468) Fix PlanExpression handling in EquivalentExpressions
Peter Toth created SPARK-41468: -- Summary: Fix PlanExpression handling in EquivalentExpressions Key: SPARK-41468 URL: https://issues.apache.org/jira/browse/SPARK-41468 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth
[jira] [Created] (SPARK-41367) Enable V2 file tables in read paths in session catalog
Peter Toth created SPARK-41367: -- Summary: Enable V2 file tables in read paths in session catalog Key: SPARK-41367 URL: https://issues.apache.org/jira/browse/SPARK-41367 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth It would be good if we could use the already available V2 file source implementations with the session catalog.
[jira] [Created] (SPARK-41124) Add DSv2 PlanStabilitySuites
Peter Toth created SPARK-41124: -- Summary: Add DSv2 PlanStabilitySuites Key: SPARK-41124 URL: https://issues.apache.org/jira/browse/SPARK-41124 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth
[jira] [Updated] (SPARK-40874) Fix broadcasts in Python UDFs when encryption is enabled
[ https://issues.apache.org/jira/browse/SPARK-40874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-40874: --- Description: The following Pyspark script: {noformat} bin/pyspark --conf spark.io.encryption.enabled=true ... bar = {"a": "aa", "b": "bb"} foo = spark.sparkContext.broadcast(bar) spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "") spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect() {noformat} fails with: {noformat} 22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1] org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 811, in main func, profiler, deserializer, serializer = read_command(pickleSer, infile) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 87, in read_command command = serializer._read_with_length(file) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length return self.loads(obj) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 471, in loads return cloudpickle.loads(obj, encoding=encoding) EOFError: Ran out of input {noformat} > Fix broadcasts in Python UDFs when encryption is enabled > > > Key: SPARK-40874 > URL: https://issues.apache.org/jira/browse/SPARK-40874 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Peter Toth >Priority: Major > > The following Pyspark script: > {noformat} > bin/pyspark --conf spark.io.encryption.enabled=true > ... 
> bar = {"a": "aa", "b": "bb"} > foo = spark.sparkContext.broadcast(bar) > spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "") > spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect() > {noformat} > fails with: > {noformat} > 22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ > 1] > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 811, in main > func, profiler, deserializer, serializer = read_command(pickleSer, infile) > File > "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 87, in read_command > command = serializer._read_with_length(file) > File > "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 173, in _read_with_length > return self.loads(obj) > File > "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 471, in loads > return cloudpickle.loads(obj, encoding=encoding) > EOFError: Ran out of input > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
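The failure mode in the traceback above, `EOFError: Ran out of input` from `cloudpickle.loads`, is what Python's unpickler raises when handed an empty or truncated stream, which suggests the serialized broadcast payload is not surviving the encrypted channel intact. A minimal illustration of that exact error, using plain `pickle` with no Spark involved:

```python
import pickle

# Unpickling an empty byte stream reproduces the error text seen at the
# bottom of the executor traceback: "EOFError: Ran out of input".
try:
    pickle.loads(b"")
except EOFError as exc:
    message = str(exc)
    print(message)  # Ran out of input
```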
[jira] (SPARK-40874) Fix broadcasts in Python UDFs when encryption is enabled
[ https://issues.apache.org/jira/browse/SPARK-40874 ] Peter Toth deleted comment on SPARK-40874: was (Author: petertoth): The following Pyspark script: {noformat} bin/pyspark --conf spark.io.encryption.enabled=true ... bar = {"a": "aa", "b": "bb"} foo = spark.sparkContext.broadcast(bar) spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "") spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect() {noformat} fails with: {noformat} 22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1] org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 811, in main func, profiler, deserializer, serializer = read_command(pickleSer, infile) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 87, in read_command command = serializer._read_with_length(file) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length return self.loads(obj) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 471, in loads return cloudpickle.loads(obj, encoding=encoding) EOFError: Ran out of input {noformat} > Fix broadcasts in Python UDFs when encryption is enabled > > > Key: SPARK-40874 > URL: https://issues.apache.org/jira/browse/SPARK-40874 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40874) Fix broadcasts in Python UDFs when encryption is enabled
[ https://issues.apache.org/jira/browse/SPARK-40874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-40874: --- The following Pyspark script: {noformat} bin/pyspark --conf spark.io.encryption.enabled=true ... bar = {"a": "aa", "b": "bb"} foo = spark.sparkContext.broadcast(bar) spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "") spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect() {noformat} fails with: {noformat} 22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1] org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 811, in main func, profiler, deserializer, serializer = read_command(pickleSer, infile) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 87, in read_command command = serializer._read_with_length(file) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length return self.loads(obj) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 471, in loads return cloudpickle.loads(obj, encoding=encoding) EOFError: Ran out of input {noformat} > Fix broadcasts in Python UDFs when encryption is enabled > > > Key: SPARK-40874 > URL: https://issues.apache.org/jira/browse/SPARK-40874 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40874) Fix broadcasts in Python UDFs when encryption is enabled
[ https://issues.apache.org/jira/browse/SPARK-40874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-40874: --- Summary: Fix broadcasts in Python UDFs when encryption is enabled (was: Fix Python UDFs with broadcasts when encryption is enabled) > Fix broadcasts in Python UDFs when encryption is enabled > > > Key: SPARK-40874 > URL: https://issues.apache.org/jira/browse/SPARK-40874 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40874) Fix Python UDFs with broadcasts when encryption is enabled
Peter Toth created SPARK-40874: -- Summary: Fix Python UDFs with broadcasts when encryption is enabled Key: SPARK-40874 URL: https://issues.apache.org/jira/browse/SPARK-40874 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.4.0 Reporter: Peter Toth
[jira] [Updated] (SPARK-40599) Add multiTransform methods to TreeNode to generate alternatives
[ https://issues.apache.org/jira/browse/SPARK-40599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-40599: --- Summary: Add multiTransform methods to TreeNode to generate alternatives (was: Add multiTransform methods to TreeNode to generate alternative transformations) > Add multiTransform methods to TreeNode to generate alternatives > --- > > Key: SPARK-40599 > URL: https://issues.apache.org/jira/browse/SPARK-40599 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40599) Add multiTransform methods to TreeNode to generate alternative transformations
Peter Toth created SPARK-40599: -- Summary: Add multiTransform methods to TreeNode to generate alternative transformations Key: SPARK-40599 URL: https://issues.apache.org/jira/browse/SPARK-40599 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth
[jira] [Updated] (SPARK-40259) Support Parquet DSv2 in subquery plan merge
[ https://issues.apache.org/jira/browse/SPARK-40259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-40259: --- Description: We could improve SPARK-34079 with DSv2 support. (was: We could improve SPARK-34079 to support DSv2.) > Support Parquet DSv2 in subquery plan merge > --- > > Key: SPARK-40259 > URL: https://issues.apache.org/jira/browse/SPARK-40259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Priority: Major > > We could improve SPARK-34079 with DSv2 support. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40259) Support Parquet DSv2 in subquery plan merge
Peter Toth created SPARK-40259: -- Summary: Support Parquet DSv2 in subquery plan merge Key: SPARK-40259 URL: https://issues.apache.org/jira/browse/SPARK-40259 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth We could improve SPARK-34079 to support DSv2.
[jira] [Created] (SPARK-40247) Fix BitSet equality check
Peter Toth created SPARK-40247: -- Summary: Fix BitSet equality check Key: SPARK-40247 URL: https://issues.apache.org/jira/browse/SPARK-40247 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth
[jira] [Updated] (SPARK-40245) Fix FileScan equality check when partition or data filter columns are not read
[ https://issues.apache.org/jira/browse/SPARK-40245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-40245: --- Summary: Fix FileScan equality check when partition or data filter columns are not read (was: Fix FileScan canonicalization when partition or data filter columns are not read) > Fix FileScan equality check when partition or data filter columns are not read > -- > > Key: SPARK-40245 > URL: https://issues.apache.org/jira/browse/SPARK-40245 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40245) Fix FileScan canonicalization when partition or data filter columns are not read
Peter Toth created SPARK-40245: -- Summary: Fix FileScan canonicalization when partition or data filter columns are not read Key: SPARK-40245 URL: https://issues.apache.org/jira/browse/SPARK-40245 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth
[jira] [Updated] (SPARK-40193) Merge subquery plans with different filters
[ https://issues.apache.org/jira/browse/SPARK-40193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-40193: --- Summary: Merge subquery plans with different filters (was: Merge different filters when merging subquery plans) > Merge subquery plans with different filters > --- > > Key: SPARK-40193 > URL: https://issues.apache.org/jira/browse/SPARK-40193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Priority: Major > > We could improve SPARK-34079 to be able to merge different filters. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40193) Merge different filters when merging subquery plans
Peter Toth created SPARK-40193: -- Summary: Merge different filters when merging subquery plans Key: SPARK-40193 URL: https://issues.apache.org/jira/browse/SPARK-40193 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth We could improve SPARK-34079 to be able to merge different filters.
[jira] [Created] (SPARK-40086) Improve AliasAwareOutputPartitioning to take all aliases into account
Peter Toth created SPARK-40086: -- Summary: Improve AliasAwareOutputPartitioning to take all aliases into account Key: SPARK-40086 URL: https://issues.apache.org/jira/browse/SPARK-40086 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth Currently AliasAwareOutputPartitioning takes only the last alias of each aliased expression into account.
[jira] [Created] (SPARK-38717) Handle Hive's bucket spec case preserving behaviour
Peter Toth created SPARK-38717: -- Summary: Handle Hive's bucket spec case preserving behaviour Key: SPARK-38717 URL: https://issues.apache.org/jira/browse/SPARK-38717 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Peter Toth {code} CREATE TABLE t( c STRING, B_C STRING ) PARTITIONED BY (p_c STRING) CLUSTERED BY (B_C) INTO 4 BUCKETS STORED AS PARQUET {code} then {code} SELECT * FROM t {code} fails with: {code} Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns B_C is not part of the table columns ([FieldSchema(name:c, type:string, comment:null), FieldSchema(name:b_c, type:string, comment:null)] at org.apache.hadoop.hive.ql.metadata.Table.setBucketCols(Table.java:552) at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1098) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:764) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274) at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:763) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitionsByFilter$1(HiveExternalCatalog.scala:1287) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:101) ... 110 more {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512270#comment-17512270 ] Peter Toth commented on SPARK-26639: [~stubartmess], that's a different issue but it is fixed in SPARK-36447. > The reuse subquery function maybe does not work in SPARK SQL > > > Key: SPARK-26639 > URL: https://issues.apache.org/jira/browse/SPARK-26639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ke Jia >Priority: Major > > The subquery reuse feature has done in > [https://github.com/apache/spark/pull/14548] > In my test, I found the visualized plan do show the subquery is executed > once. But the stage of same subquery execute maybe not once. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28299) Evaluation of multiple CTE uses
[ https://issues.apache.org/jira/browse/SPARK-28299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth resolved SPARK-28299. Resolution: Duplicate > Evaluation of multiple CTE uses > - > > Key: SPARK-28299 > URL: https://issues.apache.org/jira/browse/SPARK-28299 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Priority: Major > > This query returns 2 in Spark SQL (ie. the CTE is evaluated twice), but it > returns 1 in PostgreSQL. > {noformat} > WITH t(x) AS (SELECT random()) > SELECT count(*) FROM ( > SELECT * FROM t > UNION > SELECT * FROM t > ) x > {noformat} > I tested MSSQL too and it returns 2 as Spark SQL does. Further tests are > needed on different DBs... -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37259) JDBC read is always going to wrap the query in a select statement
[ https://issues.apache.org/jira/browse/SPARK-37259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447980#comment-17447980 ] Peter Toth commented on SPARK-37259: I've opened a PR: [https://github.com/apache/spark/pull/34693] to support queries with CTE. {quote}to get the schema and having a way to get that, without running the query twice.{quote} I don't think that running the query twice to get the schema would be an issue as Spark adds a `WHERE 1=0` clause to the query. MSSQL engine should optimize the query and quickly return the schema with empty results. {quote}The other item is the query is going to do something to the query you pass in, so it would need to be based on dbtable being used that is only doing a trim; the query is wrapping: s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}"{quote} I don't think this is an issue, please see new unit tests in the PR. {quote}The other query that uses temp tables, in the sql server it is either #temptable or ##temptable is also still an issue because of how it getting wrapped in the select and the similar item if that runs the query to get the schema, then it actually creates the tables and the query fails when it runs since the table exists{quote} I'm not sure that temp tables fit into Spark's JDBC world. Let me check if we can workaround them with the new `withClause`... > JDBC read is always going to wrap the query in a select statement > - > > Key: SPARK-37259 > URL: https://issues.apache.org/jira/browse/SPARK-37259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kevin Appel >Priority: Major > > The read jdbc is wrapping the query it sends to the database server inside a > select statement and there is no way to override this currently. 
> Initially I ran into this issue when trying to run a CTE query against SQL > server and it fails, the details of the failure is in these cases: > [https://github.com/microsoft/mssql-jdbc/issues/1340] > [https://github.com/microsoft/mssql-jdbc/issues/1657] > [https://github.com/microsoft/sql-spark-connector/issues/147] > https://issues.apache.org/jira/browse/SPARK-32825 > https://issues.apache.org/jira/browse/SPARK-34928 > I started to patch the code to get the query to run and ran into a few > different items, if there is a way to add these features to allow this code > path to run, this would be extremely helpful to running these type of edge > case queries. These are basic examples here the actual queries are much more > complex and would require significant time to rewrite. > Inside JDBCOptions.scala the query is being set to either, using the dbtable > this allows the query to be passed without modification > > {code:java} > name.trim > or > s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}" > {code} > > Inside JDBCRelation.scala this is going to try to get the schema for this > query, and this ends up running dialect.getSchemaQuery which is doing: > {code:java} > s"SELECT * FROM $table WHERE 1=0"{code} > Overriding the dialect here and initially just passing back the $table gets > passed here and to the next issue which is in the compute function in > JDBCRDD.scala > > {code:java} > val sqlText = s"SELECT $columnList FROM ${options.tableOrQuery} > $myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause" > > {code} > > For these two queries, about a CTE query and using temp tables, finding out > the schema is difficult without actually running the query and for the temp > table if you run it in the schema check that will have the table now exist > and fail when it runs the actual query. 
> > The way I patched these is by doing these two items: > JDBCRDD.scala (compute) > > {code:java} > val runQueryAsIs = options.parameters.getOrElse("runQueryAsIs", > "false").toBoolean > val sqlText = if (runQueryAsIs) { > s"${options.tableOrQuery}" > } else { > s"SELECT $columnList FROM ${options.tableOrQuery} $myWhereClause" > } > {code} > JDBCRelation.scala (getSchema) > {code:java} > val useCustomSchema = jdbcOptions.parameters.getOrElse("useCustomSchema", > "false").toBoolean > if (useCustomSchema) { > val myCustomSchema = jdbcOptions.parameters.getOrElse("customSchema", > "").toString > val newSchema = CatalystSqlParser.parseTableSchema(myCustomSchema) > logInfo(s"Going to return the new $newSchema because useCustomSchema is > $useCustomSchema and passed in $myCustomSchema") > newSchema > } else { > val tableSchema = JDBCRDD.resolveTable(jdbcOptions) > jdbcOptions.customSchema match { > case Some(customSchema) => JdbcUtils.getCustomSchema( > tableSchema, customSchema, resolver) > case None
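The comment above argues that running the query twice to resolve the schema is cheap because Spark's probe appends `WHERE 1=0`. A sketch of why such a probe is inexpensive, using stdlib SQLite as a stand-in for the JDBC source (the `SPARK_GEN_SUBQ_0` alias mirrors the wrapping shown in the issue, and is illustrative only):

```python
import sqlite3

# Stand-in for the dialect's getSchemaQuery ("SELECT * FROM $table WHERE 1=0"):
# the always-false predicate lets the engine hand back column metadata via
# the cursor description while materializing zero rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'x')")
cur = conn.execute(
    "SELECT * FROM (SELECT a, b FROM t) SPARK_GEN_SUBQ_0 WHERE 1=0")
columns = [d[0] for d in cur.description]
print(columns)         # ['a', 'b']
print(cur.fetchall())  # []
```

Whether a given engine (e.g. SQL Server with temp-table side effects) can short-circuit such a probe equally cheaply is exactly the open question in this thread.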
[jira] [Commented] (SPARK-37259) JDBC read is always going to wrap the query in a select statement
[ https://issues.apache.org/jira/browse/SPARK-37259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446554#comment-17446554 ] Peter Toth commented on SPARK-37259: [~KevinAppelBofa], how about adding a new `withClause` to the JDBC options? Do you think you could split your CTE query to "with clause" and "regular query" parts manually and specify something like: .option("withClause", withClause).option("query", query)? Because, that way we probably only need a small change to `sqlText` in `compute()` ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L370-L371):] {noformat} val sqlText = s"$withClause SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause" {noformat} and also we could keep its other functionality. Sidenote: technically we could extract the WITH clause in MsSqlServerDialect and assemble a dialect specific `sqlText` there, but it is not that simple to do it... > JDBC read is always going to wrap the query in a select statement > - > > Key: SPARK-37259 > URL: https://issues.apache.org/jira/browse/SPARK-37259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kevin Appel >Priority: Major > > The read jdbc is wrapping the query it sends to the database server inside a > select statement and there is no way to override this currently. 
> Initially I ran into this issue when trying to run a CTE query against SQL > server and it fails, the details of the failure is in these cases: > [https://github.com/microsoft/mssql-jdbc/issues/1340] > [https://github.com/microsoft/mssql-jdbc/issues/1657] > [https://github.com/microsoft/sql-spark-connector/issues/147] > https://issues.apache.org/jira/browse/SPARK-32825 > https://issues.apache.org/jira/browse/SPARK-34928 > I started to patch the code to get the query to run and ran into a few > different items, if there is a way to add these features to allow this code > path to run, this would be extremely helpful to running these type of edge > case queries. These are basic examples here the actual queries are much more > complex and would require significant time to rewrite. > Inside JDBCOptions.scala the query is being set to either, using the dbtable > this allows the query to be passed without modification > > {code:java} > name.trim > or > s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}" > {code} > > Inside JDBCRelation.scala this is going to try to get the schema for this > query, and this ends up running dialect.getSchemaQuery which is doing: > {code:java} > s"SELECT * FROM $table WHERE 1=0"{code} > Overriding the dialect here and initially just passing back the $table gets > passed here and to the next issue which is in the compute function in > JDBCRDD.scala > > {code:java} > val sqlText = s"SELECT $columnList FROM ${options.tableOrQuery} > $myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause" > > {code} > > For these two queries, about a CTE query and using temp tables, finding out > the schema is difficult without actually running the query and for the temp > table if you run it in the schema check that will have the table now exist > and fail when it runs the actual query. 
> > The way I patched these is by doing these two items: > JDBCRDD.scala (compute) > > {code:java} > val runQueryAsIs = options.parameters.getOrElse("runQueryAsIs", > "false").toBoolean > val sqlText = if (runQueryAsIs) { > s"${options.tableOrQuery}" > } else { > s"SELECT $columnList FROM ${options.tableOrQuery} $myWhereClause" > } > {code} > JDBCRelation.scala (getSchema) > {code:java} > val useCustomSchema = jdbcOptions.parameters.getOrElse("useCustomSchema", > "false").toBoolean > if (useCustomSchema) { > val myCustomSchema = jdbcOptions.parameters.getOrElse("customSchema", > "").toString > val newSchema = CatalystSqlParser.parseTableSchema(myCustomSchema) > logInfo(s"Going to return the new $newSchema because useCustomSchema is > $useCustomSchema and passed in $myCustomSchema") > newSchema > } else { > val tableSchema = JDBCRDD.resolveTable(jdbcOptions) > jdbcOptions.customSchema match { > case Some(customSchema) => JdbcUtils.getCustomSchema( > tableSchema, customSchema, resolver) > case None => tableSchema > } > }{code} > > This is allowing the query to run as is, by using the dbtable option and then > provide a custom schema that will bypass the dialect schema check > > Test queries > > {code:java} > query1 = """ > SELECT 1 as DummyCOL > """ > query2 = """ > WITH DummyCTE AS > (
[jira] [Commented] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN
[ https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419388#comment-17419388 ] Peter Toth commented on SPARK-35672: I put up a revert PR: https://github.com/apache/spark/pull/34082 > Spark fails to launch executors with very large user classpath lists on YARN > > > Key: SPARK-35672 > URL: https://issues.apache.org/jira/browse/SPARK-35672 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 3.1.2 > Environment: Linux RHEL7 > Spark 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > When running Spark on YARN, the {{user-class-path}} argument to > {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to > executor processes. The argument is specified once for each JAR, and the URIs > are fully-qualified, so the paths can be quite long. With large user JAR > lists (say 1000+), this can result in system-level argument length limits > being exceeded, typically manifesting as the error message: > {code} > /bin/bash: Argument list too long > {code} > A [Google > search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22] > indicates that this is not a theoretical problem and afflicts real users, > including ours. This issue was originally observed on Spark 2.3, but has been > confirmed to exist in the master branch as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN
[ https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419285#comment-17419285 ] Peter Toth commented on SPARK-35672: [~xkrogen], [~tgraves], unfortunately, I think this is a breaking change and should be reverted. On our clusters we use `{{spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/...}}` and before this change Yarn executor context looked like the following: {noformat} YARN executor launch context: env: ... command: ... {{JAVA_HOME}}/bin/java \ -server \ ... --user-class-path \ file:{{HADOOP_COMMON_HOME}}/...jar \ ... {noformat} and Yarn was able to substitute HADOOP_COMMON_HOME environment variable. But after this change user classpath is distributed in {{SparkConf}} and we can't use environment variables any more. cc [~Gengliang.Wang] > Spark fails to launch executors with very large user classpath lists on YARN > > > Key: SPARK-35672 > URL: https://issues.apache.org/jira/browse/SPARK-35672 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 3.1.2 > Environment: Linux RHEL7 > Spark 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.2.0, 3.1.3 > > > When running Spark on YARN, the {{user-class-path}} argument to > {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to > executor processes. The argument is specified once for each JAR, and the URIs > are fully-qualified, so the paths can be quite long. With large user JAR > lists (say 1000+), this can result in system-level argument length limits > being exceeded, typically manifesting as the error message: > {code} > /bin/bash: Argument list too long > {code} > A [Google > search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22] > indicates that this is not a theoretical problem and afflicts real users, > including ours. 
This issue was originally observed on Spark 2.3, but has been > confirmed to exist in the master branch as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
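The "/bin/bash: Argument list too long" failure comes from the kernel's limit on the combined size of `argv` plus the environment passed to `exec()`. A quick sanity check of that limit, and a rough (hypothetical) estimate of what a 1000-entry `--user-class-path` list consumes:

```python
import os

# OS limit on the combined byte size of argv + environment for exec()
arg_max = os.sysconf("SC_ARG_MAX")

# Hypothetical estimate: 1000 fully qualified JAR URIs at ~200 bytes each,
# each preceded by a "--user-class-path" flag, as in the executor launch command
approx_cmdline = 1000 * (200 + len("--user-class-path") + 2)

print(f"ARG_MAX={arg_max}, estimated classpath args={approx_cmdline} bytes")
```

On systems where `ARG_MAX` is small (or the environment is large), an estimate like this can exceed the limit, which is consistent with the reported failures.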
[jira] [Commented] (SPARK-36065) date_trunc returns incorrect output
[ https://issues.apache.org/jira/browse/SPARK-36065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390421#comment-17390421 ] Peter Toth commented on SPARK-36065: I think the output is correct as there was a time zone change (+00:02:16) at 1891-10-01 00:00:00 in Bratislava and that means that 1891-10-01 00:00:00 = 1891-10-01 00:02:16. I found this site that shows the TZ changes: https://www.timeanddate.com/time/zone/slovakia/bratislava > date_trunc returns incorrect output > --- > > Key: SPARK-36065 > URL: https://issues.apache.org/jira/browse/SPARK-36065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Sumeet >Priority: Major > Labels: date_trunc, sql, timestamp > > Hi, > Running date_trunc on any hour of "1891-10-01" returns incorrect output for > "Europe/Bratislava" timezone. > Use the following steps in order to reproduce the issue: > * Run spark-shell using: > {code:java} > TZ="Europe/Bratislava" ./bin/spark-shell --conf > spark.driver.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf > spark.executor.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf > spark.sql.session.timeZone="Europe/Bratislava"{code} > * Generate test data: > {code:java} > ((0 until 9).map(hour => s"1891-10-01 00:0$hour:00") ++ (10 until > 24).map(hour => s"1891-10-01 > 00:$hour:00")).toDF("ts_string").createOrReplaceTempView("temp_ts") > {code} > * Run query: > {code:java} > sql("select ts_string, cast(ts_string as TIMESTAMP) as ts, date_trunc('day', > ts_string) from temp_ts").show(false) > {code} > * Output: > {code:java} > +---+---+--+ > |ts_string |ts |date_trunc(day, ts_string)| > +---+---+--+ > |1891-10-01 00:00:00|1891-10-01 00:02:16|1891-10-01 00:02:16 | > |1891-10-01 00:01:00|1891-10-01 00:03:16|1891-10-01 00:02:16 | > |1891-10-01 00:02:00|1891-10-01 00:04:16|1891-10-01 00:02:16 | > |1891-10-01 00:03:00|1891-10-01 00:03:00|1891-10-01 00:02:16 | > |1891-10-01 00:04:00|1891-10-01 
00:04:00|1891-10-01 00:02:16 | > |1891-10-01 00:05:00|1891-10-01 00:05:00|1891-10-01 00:02:16 | > |1891-10-01 00:06:00|1891-10-01 00:06:00|1891-10-01 00:02:16 | > |1891-10-01 00:07:00|1891-10-01 00:07:00|1891-10-01 00:02:16 | > |1891-10-01 00:08:00|1891-10-01 00:08:00|1891-10-01 00:02:16 | > |1891-10-01 00:10:00|1891-10-01 00:10:00|1891-10-01 00:02:16 | > |1891-10-01 00:11:00|1891-10-01 00:11:00|1891-10-01 00:02:16 | > |1891-10-01 00:12:00|1891-10-01 00:12:00|1891-10-01 00:02:16 | > |1891-10-01 00:13:00|1891-10-01 00:13:00|1891-10-01 00:02:16 | > |1891-10-01 00:14:00|1891-10-01 00:14:00|1891-10-01 00:02:16 | > |1891-10-01 00:15:00|1891-10-01 00:15:00|1891-10-01 00:02:16 | > |1891-10-01 00:16:00|1891-10-01 00:16:00|1891-10-01 00:02:16 | > |1891-10-01 00:17:00|1891-10-01 00:17:00|1891-10-01 00:02:16 | > |1891-10-01 00:18:00|1891-10-01 00:18:00|1891-10-01 00:02:16 | > |1891-10-01 00:19:00|1891-10-01 00:19:00|1891-10-01 00:02:16 | > |1891-10-01 00:20:00|1891-10-01 00:20:00|1891-10-01 00:02:16 | > +---+---+--+ > only showing top 20 rows > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
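The +00:02:16 shift described in the comment can be checked with plain Python (a sketch; it assumes the system tzdata includes `Europe/Bratislava`, which is an alias of `Europe/Prague`, whose local mean time of UTC+0:57:44 was replaced by CET at 1891-10-01):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

tz = ZoneInfo("Europe/Bratislava")  # tzdata link to Europe/Prague

# UTC offset just before and just after the 1891-10-01 transition
before = datetime(1891, 9, 30, 12, tzinfo=tz).utcoffset()
after = datetime(1891, 10, 2, 12, tzinfo=tz).utcoffset()

# Clocks jumped forward by the difference, so local times in the gap
# (00:00:00 up to the offset change) never existed on 1891-10-01, and
# date_trunc('day', ...) resolves to the first valid instant of the day.
print(before, after, after - before)
```

This supports the comment's conclusion: the output is correct, not a bug, because 1891-10-01 00:00:00 local time did not exist in that zone.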
[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements
[ https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-36073: --- Issue Type: Improvement (was: Bug) > EquivalentExpressions fixes and improvements > > > Key: SPARK-36073 > URL: https://issues.apache.org/jira/browse/SPARK-36073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Peter Toth >Priority: Major > > Currently `EquivalentExpressions` has 2 issues: > - identifying common expressions in conditional expressions is not correct in > all cases > - transparently canonicalized expressions (like `PromotePrecision`) are > considered common subexpressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements
[ https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-36073: --- Description: Currently `EquivalentExpressions` has 2 issues: - identifying common expressions in conditional expressions is not correct in all cases - transparently canonicalized expressions (like `PromotePrecision`) are considered common subexpressions was: Fixes an issue with identifying common expressions in conditional expressions (a side effect of the above). Fixes the issue of transparently canonicalized expressions (like PromotePrecision) are considered common subexpressions. > EquivalentExpressions fixes and improvements > > > Key: SPARK-36073 > URL: https://issues.apache.org/jira/browse/SPARK-36073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Peter Toth >Priority: Minor > > Currently `EquivalentExpressions` has 2 issues: > - identifying common expressions in conditional expressions is not correct in > all cases > - transparently canonicalized expressions (like `PromotePrecision`) are > considered common subexpressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements
[ https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-36073: --- Issue Type: Bug (was: Improvement) > EquivalentExpressions fixes and improvements > > > Key: SPARK-36073 > URL: https://issues.apache.org/jira/browse/SPARK-36073 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Peter Toth >Priority: Major > > Currently `EquivalentExpressions` has 2 issues: > - identifying common expressions in conditional expressions is not correct in > all cases > - transparently canonicalized expressions (like `PromotePrecision`) are > considered common subexpressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements
[ https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-36073: --- Priority: Major (was: Minor) > EquivalentExpressions fixes and improvements > > > Key: SPARK-36073 > URL: https://issues.apache.org/jira/browse/SPARK-36073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Peter Toth >Priority: Major > > Currently `EquivalentExpressions` has 2 issues: > - identifying common expressions in conditional expressions is not correct in > all cases > - transparently canonicalized expressions (like `PromotePrecision`) are > considered common subexpressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements
[ https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-36073: --- Description: Fixes an issue with identifying common expressions in conditional expressions (a side effect of the above). Fixes the issue of transparently canonicalized expressions (like PromotePrecision) are considered common subexpressions. was:SPARK-35410 (https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112) filters out all child expressions, but in some cases that is not necessary. > EquivalentExpressions fixes and improvements > > > Key: SPARK-36073 > URL: https://issues.apache.org/jira/browse/SPARK-36073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Peter Toth >Priority: Minor > > Fixes an issue with identifying common expressions in conditional expressions > (a side effect of the above). > Fixes the issue of transparently canonicalized expressions (like > PromotePrecision) are considered common subexpressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements
[ https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-36073: --- Summary: EquivalentExpressions fixes and improvements (was: SubExpr elimination should include common child exprs of conditional expressions) > EquivalentExpressions fixes and improvements > > > Key: SPARK-36073 > URL: https://issues.apache.org/jira/browse/SPARK-36073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Peter Toth >Priority: Minor > > SPARK-35410 > (https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112) > filters out all child expressions, but in some cases that is not necessary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36073) SubExpr elimination should include common child exprs of conditional expressions
Peter Toth created SPARK-36073: -- Summary: SubExpr elimination should include common child exprs of conditional expressions Key: SPARK-36073 URL: https://issues.apache.org/jira/browse/SPARK-36073 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Peter Toth SPARK-35410 (https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112) filters out all child expressions, but in some cases that is not necessary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
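The core idea behind `EquivalentExpressions` can be illustrated with a toy model (a hedged Python sketch using nested tuples for expression trees; Spark's real implementation works on Catalyst `Expression` nodes and must additionally treat conditionally evaluated branches of `If`/`CaseWhen` specially, which is exactly where the issues above arise):

```python
from collections import Counter

def subtrees(expr):
    """Yield expr and all of its sub-expressions.

    Tuples model operator nodes: (op, child1, child2, ...)."""
    yield expr
    if isinstance(expr, tuple):
        for child in expr[1:]:
            yield from subtrees(child)

def common_subexpressions(exprs):
    """Non-leaf sub-expressions occurring more than once are candidates
    for evaluating once and reusing."""
    counts = Counter(s for e in exprs for s in subtrees(e))
    return {s for s, n in counts.items() if n > 1 and isinstance(s, tuple)}

# ("+", "a", 1) appears under both projections, so it is a common subexpression
exprs = [("*", ("+", "a", 1), 2), ("-", ("+", "a", 1), 3)]
common = common_subexpressions(exprs)
print(common)
```

For conditional expressions the naive count above is too eager: a sub-expression inside only one branch of a CASE may never be evaluated at runtime, so hoisting it unconditionally can change behavior — hence the special-casing that SPARK-36073 fixes and improves.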
[jira] [Created] (SPARK-35855) Unify reuse map data structures in non-AQE and AQE rules
Peter Toth created SPARK-35855: -- Summary: Unify reuse map data structures in non-AQE and AQE rules Key: SPARK-35855 URL: https://issues.apache.org/jira/browse/SPARK-35855 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Peter Toth We can unify reuse map data structures in non-AQE and AQE rules (`ReuseExchangeAndSubquery`, `ReuseAdaptiveSubquery`) to a simple `Map[, ]`. Please find discussion here: [https://github.com/apache/spark/pull/28885#discussion_r655073897] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35798) Fix SparkPlan.sqlContext usage
Peter Toth created SPARK-35798: -- Summary: Fix SparkPlan.sqlContext usage Key: SPARK-35798 URL: https://issues.apache.org/jira/browse/SPARK-35798 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Peter Toth There might be SparkPlan nodes where canonicalization on executor side can cause issues. More details here: https://github.com/apache/spark/pull/32885/files#r651019687 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34801) java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadPartition
[ https://issues.apache.org/jira/browse/SPARK-34801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306093#comment-17306093 ] Peter Toth commented on SPARK-34801: Yes it is. Please use CDS 3 (Cloudera Distribution of Spark 3) on supported CDP versions. > java.lang.NoSuchMethodException: > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition > -- > > Key: SPARK-34801 > URL: https://issues.apache.org/jira/browse/SPARK-34801 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.2 > Environment: HDP3.1.4.0-315 spark 3.0.2 >Reporter: zhaojk >Priority: Major > > use spark-sql run this sql insert overwrite table zry.zjk1 > partition(etl_dt=2) select * from zry.zry; > java.lang.NoSuchMethodException: > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(org.apache.hadoop.fs.Path, > org.apache.hadoop.hive.ql.metadata.Table, java.util.Map, > org.apache.hadoop.hive.ql.plan.LoadTableDesc$LoadFileType, boolean, boolean, > boolean, boolean, boolean, java.lang.Long, int, boolean) > at java.lang.Class.getMethod(Class.java:1786) > at org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:177) > at > org.apache.spark.sql.hive.client.Shim_v3_0.loadPartitionMethod$lzycompute(HiveShim.scala:1289) > at > org.apache.spark.sql.hive.client.Shim_v3_0.loadPartitionMethod(HiveShim.scala:1274) > at > org.apache.spark.sql.hive.client.Shim_v3_0.loadPartition(HiveShim.scala:1337) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:881) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:295) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:277) > at > org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:871) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:915) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:894) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:318) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at 
org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) > at
[jira] [Updated] (SPARK-34581) BoundAttribute issue after optimization by BooleanSimplification and PushFoldableIntoBranches
[ https://issues.apache.org/jira/browse/SPARK-34581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-34581: --- Affects Version/s: 3.0.2 3.1.1 > BoundAttribute issue after optimization by BooleanSimplification and > PushFoldableIntoBranches > - > > Key: SPARK-34581 > URL: https://issues.apache.org/jira/browse/SPARK-34581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Yuming Wang >Priority: Major > > BoundAttribute issue will occur after optimization by BooleanSimplification > and PushFoldableIntoBranches. How to reproduce this issue: > {code:scala} > spark.sql("CREATE TABLE t1 (a INT, b INT) USING parquet") > spark.sql("CREATE TABLE t2 (a INT, b INT) USING parquet") > spark.sql( > """ > |SELECT cnt, > | NOT ( buyer_id ) AS buyer_id2 > |FROM (SELECT t1.a IS NOT NULL AS buyer_id, > | Count(*) AS cnt > |FROM t1 > | INNER JOIN t2 > | ON t1.a = t2.a > |GROUP BY 1) t > |""".stripMargin).collect() > {code} > {noformat} > Couldn't find a#4 in [CASE WHEN isnotnull(a#4) THEN 1 ELSE 2 > END#10,count(1)#3L] > java.lang.IllegalStateException: Couldn't find a#4 in [CASE WHEN > isnotnull(a#4) THEN 1 ELSE 2 END#10,count(1)#3L] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357) > {noformat} > Another case: > {code:scala} > spark.sql( > """ > |SELECT cnt, > | CASE WHEN ( buyer_id = 2 AND cnt > 3 ) THEN 2 ELSE 3 END AS > buyer_id2 > |FROM (SELECT CASE WHEN t1.a IS NOT NULL THEN 1 ELSE 2 END AS buyer_id, > Count(*) AS cnt > |FROM t1 INNER JOIN t2 ON t1.a = t2.a > |GROUP BY 1) t > |""".stripMargin).collect() > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33482) V2 Datasources that extend FileScan preclude exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-33482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-33482: --- Affects Version/s: 3.0.0 3.0.1 3.0.2 3.1.1 > V2 Datasources that extend FileScan preclude exchange reuse > --- > > Key: SPARK-33482 > URL: https://issues.apache.org/jira/browse/SPARK-33482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: Bruce Robbins >Priority: Major > > Sample query: > {noformat} > spark.read.parquet("tbl").createOrReplaceTempView("tbl") > spark.read.parquet("lookup").createOrReplaceTempView("lookup") > sql(""" >select tbl.col1, fk1, fk2 >from tbl, lookup l1, lookup l2 >where fk1 = l1.key >and fk2 = l2.key > """).explain > {noformat} > Test files can be created as so: > {noformat} > import scala.util.Random > val rand = Random > val tbl = spark.range(1, 1).map { x => > (rand.nextLong.abs % 20, >rand.nextLong.abs % 20, >x) > }.toDF("fk1", "fk2", "col1") > tbl.write.mode("overwrite").parquet("tbl") > val lookup = spark.range(0, 20).map { x => > (x + 1, x * 1, (x + 1) * 1) > }.toDF("key", "col1", "col2") > lookup.write.mode("overwrite").parquet("lookup") > {noformat} > Output with V1 Parquet reader: > {noformat} > == Physical Plan == > *(3) Project [col1#2L, fk1#0L, fk2#1L] > +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false >:- *(3) Project [fk1#0L, fk2#1L, col1#2L] >: +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false >: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L)) >: : +- *(3) ColumnarToRow >: : +- FileScan parquet [fk1#0L,fk2#1L,col1#2L] Batched: true, > DataFilters: [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: Parquet, > Location: InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], > PartitionFilters: [], PushedFilters: [IsNotNull(fk1), IsNotNull(fk2)], > ReadSchema: struct >: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > bigint, 
false]),false), [id=#75] >:+- *(1) Filter isnotnull(key#6L) >: +- *(1) ColumnarToRow >: +- FileScan parquet [key#6L] Batched: true, DataFilters: > [isnotnull(key#6L)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], > PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: > struct >+- ReusedExchange [key#12L], BroadcastExchange > HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75] > {noformat} > With V1 Parquet reader, the exchange for lookup is reused (see last line). > Output with V2 Parquet reader (spark.sql.sources.useV1SourceList=""): > {noformat} > == Physical Plan == > *(3) Project [col1#2L, fk1#0L, fk2#1L] > +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false >:- *(3) Project [fk1#0L, fk2#1L, col1#2L] >: +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false >: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L)) >: : +- *(3) ColumnarToRow >: : +- BatchScan[fk1#0L, fk2#1L, col1#2L] ParquetScan DataFilters: > [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], > PartitionFilters: [], PushedFilers: [IsNotNull(fk1), IsNotNull(fk2)], > ReadSchema: struct, PushedFilters: > [IsNotNull(fk1), IsNotNull(fk2)] >: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > bigint, false]),false), [id=#75] >:+- *(1) Filter isnotnull(key#6L) >: +- *(1) ColumnarToRow >: +- BatchScan[key#6L] ParquetScan DataFilters: > [isnotnull(key#6L)], Format: parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], > PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: > struct, PushedFilters: [IsNotNull(key)] >+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false]),false), [id=#83] > +- *(2) Filter isnotnull(key#12L) > +- *(2) ColumnarToRow > +- BatchScan[key#12L] ParquetScan DataFilters: > 
[isnotnull(key#12L)], Format: parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], > PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: > struct, PushedFilters: [IsNotNull(key)] > {noformat} > With the V2 Parquet reader, the exchange for lookup is not reused (see last 4 > lines). > You can see the same issue with the Orc reader (and I assume any other > datasource that extends Filescan). > The issue appears to be this check in
[jira] [Updated] (SPARK-34756) Fix FileScan equality check
[ https://issues.apache.org/jira/browse/SPARK-34756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-34756: --- Affects Version/s: 3.1.0 > Fix FileScan equality check > --- > > Key: SPARK-34756 > URL: https://issues.apache.org/jira/browse/SPARK-34756 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: Peter Toth >Priority: Major > > `&&` is missing from `FileScan.equals()`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34756) Fix FileScan equality check
Peter Toth created SPARK-34756: -- Summary: Fix FileScan equality check Key: SPARK-34756 URL: https://issues.apache.org/jira/browse/SPARK-34756 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 3.0.2, 3.0.1, 3.0.0 Reporter: Peter Toth `&&` is missing from `FileScan.equals()`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
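The bug class behind the missing `&&` is easy to reproduce in miniature (a Python sketch, not Spark's actual `FileScan` code): when one conjunct of an equality check is a discarded statement instead of part of the returned expression, distinct scans compare equal, which is exactly what corrupts reuse decisions like the one in SPARK-33482:

```python
class BuggyScan:
    def __init__(self, path, data_filters):
        self.path = path
        self.data_filters = data_filters

    def __eq__(self, other):
        # BUG: this comparison's result is discarded -- the statement below
        # does nothing, like a conjunct missing its `&&` in Scala
        self.data_filters == other.data_filters
        return self.path == other.path

class FixedScan(BuggyScan):
    def __eq__(self, other):
        # FIX: both conditions contribute to the result
        return (self.path == other.path
                and self.data_filters == other.data_filters)

a, b = BuggyScan("/tbl", ["x > 1"]), BuggyScan("/tbl", [])
fa, fb = FixedScan("/tbl", ["x > 1"]), FixedScan("/tbl", [])
print(a == b, fa == fb)
```

With the buggy version, two scans with different pushed filters are considered interchangeable; with the fix, they are not.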
[jira] [Updated] (SPARK-34726) Fix collectToPython timeouts
[ https://issues.apache.org/jira/browse/SPARK-34726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-34726: --- Affects Version/s: (was: 3.1.1) 2.4.7 > Fix collectToPython timeouts > > > Key: SPARK-34726 > URL: https://issues.apache.org/jira/browse/SPARK-34726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34726) Fix collectToPython timeouts
Peter Toth created SPARK-34726: -- Summary: Fix collectToPython timeouts Key: SPARK-34726 URL: https://issues.apache.org/jira/browse/SPARK-34726 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Reporter: Peter Toth -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30655) Update WebUI Bootstrap to 4.4.1
[ https://issues.apache.org/jira/browse/SPARK-30655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298005#comment-17298005 ] Peter Toth commented on SPARK-30655: [~d.clarke], it looks like you already fixed this in [https://github.com/apache/spark/pull/27370|https://github.com/apache/spark/pull/27370]. Shall we close this ticket? > Update WebUI Bootstrap to 4.4.1 > --- > > Key: SPARK-30655 > URL: https://issues.apache.org/jira/browse/SPARK-30655 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Dale Clarke >Priority: Major > > Spark is using an older version of Bootstrap (v. 2.3.2) for the Web UI pages. > Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to > EOL in July 2019 ([https://github.com/twbs/release]). Older versions of > Bootstrap are also getting flagged in security scans for various CVEs: > * [https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889] > * [https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700] > * [https://snyk.io/vuln/npm:bootstrap:20180529] > * [https://snyk.io/vuln/npm:bootstrap:20160627] > I haven't validated each CVE, but it would probably be good practice to > resolve any potential issues and get on a supported release. > The bad news is that there have been quite a few changes between Bootstrap 2 > and Bootstrap 4. I've tried updating the library, refactoring/tweaking the > CSS and JS to maintain a similar appearance and functionality, and testing > the documentation. As with the ticket created for the outdated Bootstrap > version in the docs (SPARK-30654), this is a fairly large change so I'm sure > additional testing and fixes will be needed.
[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277367#comment-17277367 ] Peter Toth commented on SPARK-24497: Thanks [~ilaurens] for your comment. Recursive queries are indeed very useful for processing hierarchical structures. I will try to update my PRs this week, but the problem is the lack of reviews... > ANSI SQL: Recursive query > - > > Key: SPARK-24497 > URL: https://issues.apache.org/jira/browse/SPARK-24497 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > h3. *Examples* > Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" > represents the structure of an organization as an adjacency list. > {code:sql} > CREATE TABLE department ( > id INTEGER PRIMARY KEY, -- department ID > parent_department INTEGER REFERENCES department, -- upper department ID > name TEXT -- department name > ); > INSERT INTO department (id, parent_department, "name") > VALUES > (0, NULL, 'ROOT'), > (1, 0, 'A'), > (2, 1, 'B'), > (3, 2, 'C'), > (4, 2, 'D'), > (5, 0, 'E'), > (6, 4, 'F'), > (7, 5, 'G'); > -- department structure represented here is as follows: > -- > -- ROOT-+->A-+->B-+->C > -- | | > -- | +->D-+->F > -- +->E-+->G > {code} > > To extract all departments under A, you can use the following recursive > query: > {code:sql} > WITH RECURSIVE subdepartment AS > ( > -- non-recursive term > SELECT * FROM department WHERE name = 'A' > UNION ALL > -- recursive term > SELECT d.* > FROM > department AS d > JOIN > subdepartment AS sd > ON (d.parent_department = sd.id) > ) > SELECT * > FROM subdepartment > ORDER BY name; > {code} > More details: > [http://wiki.postgresql.org/wiki/CTEReadme] > [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html] >
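The semantics of the {{WITH RECURSIVE ... UNION ALL}} query above can be sketched as an iterative fixpoint: start from the non-recursive (anchor) term, then repeatedly join the previous iteration's rows back against the table until no new rows appear. A minimal Scala sketch over the same department data (names like `subdepartments` are illustrative, not Spark API; UNION ALL semantics assume the data is acyclic, as it is here):

```scala
// Department rows mirroring the INSERT in the example above.
case class Dept(id: Int, parent: Option[Int], name: String)

val dept = Seq(
  Dept(0, None, "ROOT"), Dept(1, Some(0), "A"), Dept(2, Some(1), "B"),
  Dept(3, Some(2), "C"), Dept(4, Some(2), "D"), Dept(5, Some(0), "E"),
  Dept(6, Some(4), "F"), Dept(7, Some(5), "G"))

def subdepartments(root: String): Seq[Dept] = {
  // Non-recursive (anchor) term: rows matching the seed predicate.
  var result = dept.filter(_.name == root)
  var last = result
  // Recursive term: join the previous iteration's rows against the table,
  // repeating until an iteration produces no new rows (the fixpoint).
  while (last.nonEmpty) {
    val ids = last.map(_.id).toSet
    last = dept.filter(_.parent.exists(ids))
    result ++= last
  }
  result
}
```

Running `subdepartments("A")` yields A and its transitive children B, C, D, F, matching the result of the SQL query above.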
[jira] [Commented] (SPARK-32041) Exchange reuse won't work in cases when DPP, subqueries are involved
[ https://issues.apache.org/jira/browse/SPARK-32041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272102#comment-17272102 ] Peter Toth commented on SPARK-32041: Let me reopen this ticket as this is not a duplicate of SPARK-29375 but more like a bug. The connection between this ticket, SPARK-29375 and SPARK-28940 is that my PR (https://github.com/apache/spark/pull/28885) fixes all of them. > Exchange reuse won't work in cases when DPP, subqueries are involved > > > Key: SPARK-32041 > URL: https://issues.apache.org/jira/browse/SPARK-32041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Prakhar Jain >Priority: Major > > When an Exchange node is repeated at multiple places in the PhysicalPlan, and > if that exchange has some DPP Subquery filter, then ReuseExchange > doesn't work for such Exchange and different stages are launched to compute > the same thing. > Example: > {noformat} > // generate data > val factData = (1 to 100).map(i => (i%5, i%20, i)) > factData.toDF("store_id", "product_id", "units_sold") > .write > .partitionBy("store_id") > .format("parquet") > .saveAsTable("fact_stats") > val dimData = Seq[(Int, String, String)]( > (1, "AU", "US"), > (2, "CA", "US"), > (3, "KA", "IN"), > (4, "DL", "IN"), > (5, "GA", "PA")) > dimData.toDF("store_id", "state_province", "country") > .write > .format("parquet") > .saveAsTable("dim_stats") > sql("ANALYZE TABLE fact_stats COMPUTE STATISTICS FOR COLUMNS store_id") > sql("ANALYZE TABLE dim_stats COMPUTE STATISTICS FOR COLUMNS store_id") > // Set Configs > spark.sql("set > spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=true") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=1000") > val query = """ > With view1 as ( > SELECT product_id, f.store_id > FROM fact_stats f JOIN dim_stats > ON f.store_id = dim_stats.store_id WHERE dim_stats.country = 'IN') > SELECT * FROM view1 v1 join view1 v2 WHERE v1.product_id = v2.product_id > """ 
> val df = spark.sql(query) > println(df.queryExecution.executedPlan) > {noformat} > {noformat} > Plan: > *(7) SortMergeJoin [product_id#1968|#1968], [product_id#2060|#2060], Inner > :- *(3) Sort [product_id#1968 ASC NULLS FIRST|#1968 ASC NULLS FIRST], false, > 0 > : +- Exchange hashpartitioning(product_id#1968, 5), true, [id=#1140|#1140] > : +- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970] > : +- *(2) BroadcastHashJoin [store_id#1970|#1970], [store_id#1971|#1971], > Inner, BuildRight > : :- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970] > : : +- *(2) Filter isnotnull(product_id#1968) > : : +- *(2) ColumnarToRow > : : +- FileScan parquet > default.fact_stats[product_id#1968,store_id#1970|#1968,store_id#1970] > Batched: true, DataFilters: [isnotnull(product_id#1968)|#1968)], Format: > Parquet, Location: > InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql..., > PartitionFilters: [isnotnull(store_id#1970), > dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)|#1970), > dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)], > PushedFilters: [IsNotNull(product_id)], ReadSchema: struct > : : +- SubqueryBroadcast dynamicpruning#2067, 0, [store_id#1971|#1971], > [id=#1131|#1131] > : : +- ReusedExchange [store_id#1971|#1971], BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), > [id=#1021|#1021] > : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))), [id=#1021|#1021] > : +- *(1) Project [store_id#1971|#1971] > : +- *(1) Filter ((isnotnull(country#1973) AND (country#1973 = IN)) AND > isnotnull(store_id#1971)) > : +- *(1) ColumnarToRow > : +- FileScan parquet > default.dim_stats[store_id#1971,country#1973|#1971,country#1973] Batched: > true, DataFilters: [isnotnull(country#1973), (country#1973 = IN), > isnotnull(store_id#1971)|#1973), (country#1973 = IN), > isnotnull(store_id#1971)], 
Format: Parquet, Location: > InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql..., > PartitionFilters: [], PushedFilters: [IsNotNull(country), > EqualTo(country,IN), IsNotNull(store_id)], ReadSchema: > struct > +- *(6) Sort [product_id#2060 ASC NULLS FIRST|#2060 ASC NULLS FIRST], false, > 0 > +- ReusedExchange [product_id#2060, store_id#2062|#2060, store_id#2062], > Exchange hashpartitioning(product_id#1968, 5), true, [id=#1026|#1026] > {noformat} > Issue: > Note the last line of the plan. It's a ReusedExchange which is pointing to > id=1026, but there is no Exchange node in the plan with ID 1026. The ReusedExchange > node is pointing to an incorrect
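For context, rules like ReuseExchange typically key exchange subtrees by a canonicalized form (runtime IDs normalized away) and replace repeats with a reference to the first occurrence; the bug reported above is that the emitted reference points at a node id that is no longer in the final plan. A much-simplified sketch of that reuse mechanism (hypothetical classes, not Spark's actual ReuseExchange implementation):

```scala
// Simplified model of reuse-by-canonicalization. `Exchange`/`Reused` are
// illustrative stand-ins for Spark's physical plan nodes.
sealed trait Plan { def canonical: String }

case class Exchange(id: Int, child: String) extends Plan {
  // The canonical form drops the runtime id, so semantically equivalent
  // exchanges produce the same key.
  def canonical: String = s"Exchange($child)"
}

// A reuse marker that must point at the id of an exchange that actually
// survives in the final plan -- the bug above violates this invariant.
case class Reused(targetId: Int, canonical: String) extends Plan

def reuseExchanges(plans: Seq[Exchange]): Seq[Plan] = {
  val seen = scala.collection.mutable.Map.empty[String, Exchange]
  plans.map { e =>
    seen.get(e.canonical) match {
      case Some(first) => Reused(first.id, e.canonical) // point at first copy
      case None => seen(e.canonical) = e; e             // first occurrence kept
    }
  }
}
```

In the reported plan the marker references id=1026 while the surviving exchange is id=1140, so no stage is actually shared even though the canonical forms match.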
[jira] [Reopened] (SPARK-32041) Exchange reuse won't work in cases when DPP, subqueries are involved
[ https://issues.apache.org/jira/browse/SPARK-32041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth reopened SPARK-32041: > Exchange reuse won't work in cases when DPP, subqueries are involved > > > Key: SPARK-32041 > URL: https://issues.apache.org/jira/browse/SPARK-32041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Prakhar Jain >Priority: Major > > When an Exchange node is repeated at multiple places in the PhysicalPlan, and > if that exchange has some DPP Subquery filter, then ReuseExchange > doesn't work for such Exchange and different stages are launched to compute > the same thing. > Example: > {noformat} > // generate data > val factData = (1 to 100).map(i => (i%5, i%20, i)) > factData.toDF("store_id", "product_id", "units_sold") > .write > .partitionBy("store_id") > .format("parquet") > .saveAsTable("fact_stats") > val dimData = Seq[(Int, String, String)]( > (1, "AU", "US"), > (2, "CA", "US"), > (3, "KA", "IN"), > (4, "DL", "IN"), > (5, "GA", "PA")) > dimData.toDF("store_id", "state_province", "country") > .write > .format("parquet") > .saveAsTable("dim_stats") > sql("ANALYZE TABLE fact_stats COMPUTE STATISTICS FOR COLUMNS store_id") > sql("ANALYZE TABLE dim_stats COMPUTE STATISTICS FOR COLUMNS store_id") > // Set Configs > spark.sql("set > spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=true") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=1000") > val query = """ > With view1 as ( > SELECT product_id, f.store_id > FROM fact_stats f JOIN dim_stats > ON f.store_id = dim_stats.store_id WHERE dim_stats.country = 'IN') > SELECT * FROM view1 v1 join view1 v2 WHERE v1.product_id = v2.product_id > """ > val df = spark.sql(query) > println(df.queryExecution.executedPlan) > {noformat} > {noformat} > Plan: > *(7) SortMergeJoin [product_id#1968|#1968], [product_id#2060|#2060], Inner > :- *(3) Sort [product_id#1968 ASC NULLS FIRST|#1968 ASC NULLS FIRST], false, > 0 
> : +- Exchange hashpartitioning(product_id#1968, 5), true, [id=#1140|#1140] > : +- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970] > : +- *(2) BroadcastHashJoin [store_id#1970|#1970], [store_id#1971|#1971], > Inner, BuildRight > : :- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970] > : : +- *(2) Filter isnotnull(product_id#1968) > : : +- *(2) ColumnarToRow > : : +- FileScan parquet > default.fact_stats[product_id#1968,store_id#1970|#1968,store_id#1970] > Batched: true, DataFilters: [isnotnull(product_id#1968)|#1968)], Format: > Parquet, Location: > InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql..., > PartitionFilters: [isnotnull(store_id#1970), > dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)|#1970), > dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)], > PushedFilters: [IsNotNull(product_id)], ReadSchema: struct > : : +- SubqueryBroadcast dynamicpruning#2067, 0, [store_id#1971|#1971], > [id=#1131|#1131] > : : +- ReusedExchange [store_id#1971|#1971], BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), > [id=#1021|#1021] > : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))), [id=#1021|#1021] > : +- *(1) Project [store_id#1971|#1971] > : +- *(1) Filter ((isnotnull(country#1973) AND (country#1973 = IN)) AND > isnotnull(store_id#1971)) > : +- *(1) ColumnarToRow > : +- FileScan parquet > default.dim_stats[store_id#1971,country#1973|#1971,country#1973] Batched: > true, DataFilters: [isnotnull(country#1973), (country#1973 = IN), > isnotnull(store_id#1971)|#1973), (country#1973 = IN), > isnotnull(store_id#1971)], Format: Parquet, Location: > InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql..., > PartitionFilters: [], PushedFilters: [IsNotNull(country), > EqualTo(country,IN), IsNotNull(store_id)], ReadSchema: > struct > +- *(6) 
Sort [product_id#2060 ASC NULLS FIRST|#2060 ASC NULLS FIRST], false, > 0 > +- ReusedExchange [product_id#2060, store_id#2062|#2060, store_id#2062], > Exchange hashpartitioning(product_id#1968, 5), true, [id=#1026|#1026] > {noformat} > Issue: > Note the last line of the plan. It's a ReusedExchange which is pointing to > id=1026, but there is no Exchange node in the plan with ID 1026. The ReusedExchange > node is pointing to an incorrect > child node (1026 instead of 1140), so in > practice exchange reuse won't happen in this query. > Another query where the issue is caused by ReuseSubquery: > {noformat} > spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1") > val query1 = """ > | With