[jira] [Updated] (SPARK-47319) Improve missingInput calculation

2024-03-08 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-47319:
---
Description: {{QueryPlan.missingInput()}} calculation seems to be the root 
cause of {{DeduplicateRelations}} slowness in some cases. Let's try to improve 
it.  (was: {{QueryPlan.missingInput()}} calculation seems to be the root cause 
of {{DeduplicateRelations}} slowness. Let's try to improve it.)
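For intuition on where the cost comes from, here is a minimal sketch of the kind of
set difference that {{missingInput}} performs (illustrative only, using a simplified
attribute model; this is not Spark's actual implementation):

{code:scala}
// Hedged sketch: a missingInput-style computation = attributes referenced by an
// operator's expressions minus the attributes its children provide (and minus
// anything the operator produces itself). Rebuilding these sets from scratch for
// every node is what can make a rule such as DeduplicateRelations slow on big plans.
case class Attr(id: Long, name: String)

def missingInputSketch(
    referenced: Set[Attr],         // attributes used by this operator's expressions
    childOutputs: Seq[Set[Attr]],  // output attributes of each child
    produced: Set[Attr]            // attributes the operator itself produces
): Set[Attr] = {
  val inputSet = childOutputs.foldLeft(Set.empty[Attr])(_ ++ _)
  referenced -- inputSet -- produced
}
{code}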

> Improve missingInput calculation
> 
>
> Key: SPARK-47319
> URL: https://issues.apache.org/jira/browse/SPARK-47319
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {{QueryPlan.missingInput()}} calculation seems to be the root cause of 
> {{DeduplicateRelations}} slowness in some cases. Let's try to improve it.






[jira] [Updated] (SPARK-47319) Improve missingInput calculation

2024-03-08 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-47319:
---
Description: {{QueryPlan.missingInput()}} calculation seems to be the root 
cause of {{DeduplicateRelations}} slowness. Let's try to improve it.

> Improve missingInput calculation
> 
>
> Key: SPARK-47319
> URL: https://issues.apache.org/jira/browse/SPARK-47319
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {{QueryPlan.missingInput()}} calculation seems to be the root cause of 
> {{DeduplicateRelations}} slowness. Let's try to improve it.






[jira] [Updated] (SPARK-47319) Improve missingInput calculation

2024-03-08 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-47319:
---
Summary: Improve missingInput calculation  (was: Fix missingInput 
calculation)

> Improve missingInput calculation
> 
>
> Key: SPARK-47319
> URL: https://issues.apache.org/jira/browse/SPARK-47319
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-47319) Fix missingInput calculation

2024-03-07 Thread Peter Toth (Jira)
Peter Toth created SPARK-47319:
--

 Summary: Fix missingInput calculation
 Key: SPARK-47319
 URL: https://issues.apache.org/jira/browse/SPARK-47319
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Peter Toth









[jira] [Updated] (SPARK-47217) De-duplication of Relations in Joins, can result in plan resolution failure

2024-03-01 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-47217:
---
Shepherd:   (was: Peter Toth)

> De-duplication of Relations in Joins, can result in plan resolution failure
> ---
>
> Key: SPARK-47217
> URL: https://issues.apache.org/jira/browse/SPARK-47217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: Spark-SQL, pull-request-available
>
> In some flavours of nested joins involving repetition of a relation, the 
> projected columns, when passed to the DataFrame.select API as df.column, can 
> result in plan resolution failure because attribute resolution does not 
> happen.
> A scenario in which this happens is
> {noformat}
>
>      Project ( dataframe A.column("col-a") )
>                     |
>                   Join2
>                  /     \
>             Join1       DataFrame A
>            /     \
>   DataFrame A     DataFrame B
> {noformat}
> In such cases, if the right leg of Join2 (DataFrame A) gets re-aliased due to 
> de-duplication of relations, and the Project uses a Column definition obtained 
> from DataFrame A, its exprId will not match the re-aliased right leg of Join2 
> (DataFrame A), causing a resolution failure.
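A minimal reproduction sketch of the scenario above (the DataFrame names and the
join condition are illustrative assumptions, not taken from the ticket):

{code:scala}
// Hedged sketch: "DataFrame A" is reused on both sides of the nested join, and the
// final select uses a Column captured from the original dfA.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val dfA = spark.range(5).toDF("col-a")                        // DataFrame A
val dfB = spark.range(5).toDF("col-b")                        // DataFrame B
val join1 = dfA.join(dfB, dfA("col-a") === dfB("col-b"))      // Join1
val join2 = join1.join(dfA, join1("col-a") === dfA("col-a"))  // Join2 repeats DataFrame A
// If de-duplication re-aliases the repeated dfA, the Column below still carries the
// original exprId and may no longer resolve against the re-aliased plan:
join2.select(dfA("col-a"))
{code}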






[jira] [Resolved] (SPARK-45805) Eliminate magic numbers in withOrigin

2023-11-06 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth resolved SPARK-45805.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43671
[https://github.com/apache/spark/pull/43671]

> Eliminate magic numbers in withOrigin
> -
>
> Key: SPARK-45805
> URL: https://issues.apache.org/jira/browse/SPARK-45805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Refactor `withOrigin` and make it more generic by eliminating the magic 
> number from which the traversal of stack traces starts.






[jira] [Resolved] (SPARK-45354) Resolve functions bottom-up

2023-09-27 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth resolved SPARK-45354.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43146
[https://github.com/apache/spark/pull/43146]

> Resolve functions bottom-up
> ---
>
> Key: SPARK-45354
> URL: https://issues.apache.org/jira/browse/SPARK-45354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This PR proposes bottom-up resolution in {{{}ResolveFunctions{}}}, which is 
> much faster if we have deeply nested {{{}UnresolvedFunctions{}}}. These 
> structures are more likely to occur after 
> [#42864|https://github.com/apache/spark/pull/42864].






[jira] [Assigned] (SPARK-45354) Resolve functions bottom-up

2023-09-27 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth reassigned SPARK-45354:
--

Assignee: Peter Toth

> Resolve functions bottom-up
> ---
>
> Key: SPARK-45354
> URL: https://issues.apache.org/jira/browse/SPARK-45354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>  Labels: pull-request-available
>
> This PR proposes bottom-up resolution in {{{}ResolveFunctions{}}}, which is 
> much faster if we have deeply nested {{{}UnresolvedFunctions{}}}. These 
> structures are more likely to occur after 
> [#42864|https://github.com/apache/spark/pull/42864].






[jira] [Updated] (SPARK-45354) Resolve functions bottom-up

2023-09-27 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-45354:
---
Description: This PR proposes bottom-up resolution in 
{{{}ResolveFunctions{}}}, which is much faster if we have deeply nested 
{{{}UnresolvedFunctions{}}}. These structures are more likely to occur after 
[#42864|https://github.com/apache/spark/pull/42864].
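For intuition (an assumed scenario, not the rule's actual code): deeply nested
unresolved function calls arise when Columns are built by repeatedly wrapping one
another, and a single bottom-up traversal can resolve the whole chain at once, while
a top-down pass that can only resolve a node once its children are resolved may need
a number of passes proportional to the nesting depth.

{code:scala}
// Hedged illustration of how a deeply nested chain of function calls can be built
// with the Dataset API (the depth of 500 is arbitrary):
import org.apache.spark.sql.functions._

val deeplyNested = (1 to 500).foldLeft(col("x")) { (expr, _) => upper(expr) }
// upper(upper(...upper(col("x"))...)) -- after the change referenced above (PR #42864),
// such Columns resolve through nested UnresolvedFunctions, which is the case this
// ticket speeds up (assumption based on the description).
{code}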

> Resolve functions bottom-up
> ---
>
> Key: SPARK-45354
> URL: https://issues.apache.org/jira/browse/SPARK-45354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Priority: Major
>  Labels: pull-request-available
>
> This PR proposes bottom-up resolution in {{{}ResolveFunctions{}}}, which is 
> much faster if we have deeply nested {{{}UnresolvedFunctions{}}}. These 
> structures are more likely to occur after 
> [#42864|https://github.com/apache/spark/pull/42864].






[jira] [Created] (SPARK-45354) Resolve functions bottom-up

2023-09-27 Thread Peter Toth (Jira)
Peter Toth created SPARK-45354:
--

 Summary: Resolve functions bottom-up
 Key: SPARK-45354
 URL: https://issues.apache.org/jira/browse/SPARK-45354
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Peter Toth









[jira] [Updated] (SPARK-45216) Fix non-deterministic seeded Dataset APIs

2023-09-19 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-45216:
---
Description: 
If we run the following example, the result is the expected two equal columns:

{noformat}
val c = rand()
df.select(c, c)

+--------------------------+--------------------------+
|rand(-4522010140232537566)|rand(-4522010140232537566)|
+--------------------------+--------------------------+
|        0.4520819282997137|        0.4520819282997137|
+--------------------------+--------------------------+
{noformat}

 
But if we use other similar APIs, their results are incorrect:

{noformat}
val r1 = random()
val r2 = uuid()
val r3 = shuffle(col("x"))
val x = df.select(r1, r1, r2, r2, r3, r3)

+------------------+------------------+--------------------+--------------------+----------+----------+
|            rand()|            rand()|              uuid()|              uuid()|shuffle(x)|shuffle(x)|
+------------------+------------------+--------------------+--------------------+----------+----------+
|0.7407604956381952|0.7957319451135009|e55bc4b0-74e6-4b0...|a587163b-d06b-4bb...| [1, 2, 3]| [2, 1, 3]|
+------------------+------------------+--------------------+--------------------+----------+----------+
{noformat}


> Fix non-deterministic seeded Dataset APIs
> -
>
> Key: SPARK-45216
> URL: https://issues.apache.org/jira/browse/SPARK-45216
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Priority: Major
>
> If we run the following example, the result is the expected two equal columns:
> {noformat}
> val c = rand()
> df.select(c, c)
> +--------------------------+--------------------------+
> |rand(-4522010140232537566)|rand(-4522010140232537566)|
> +--------------------------+--------------------------+
> |        0.4520819282997137|        0.4520819282997137|
> +--------------------------+--------------------------+
> {noformat}
>  
> But if we use other similar APIs, their results are incorrect:
> {noformat}
> val r1 = random()
> val r2 = uuid()
> val r3 = shuffle(col("x"))
> val x = df.select(r1, r1, r2, r2, r3, r3)
> +------------------+------------------+--------------------+--------------------+----------+----------+
> |            rand()|            rand()|              uuid()|              uuid()|shuffle(x)|shuffle(x)|
> +------------------+------------------+--------------------+--------------------+----------+----------+
> |0.7407604956381952|0.7957319451135009|e55bc4b0-74e6-4b0...|a587163b-d06b-4bb...| [1, 2, 3]| [2, 1, 3]|
> +------------------+------------------+--------------------+--------------------+----------+----------+
> {noformat}






[jira] [Created] (SPARK-45216) Fix non-deterministic seeded Dataset APIs

2023-09-19 Thread Peter Toth (Jira)
Peter Toth created SPARK-45216:
--

 Summary: Fix non-deterministic seeded Dataset APIs
 Key: SPARK-45216
 URL: https://issues.apache.org/jira/browse/SPARK-45216
 Project: Spark
  Issue Type: Bug
  Components: Connect, SQL
Affects Versions: 4.0.0
Reporter: Peter Toth









[jira] [Updated] (SPARK-45112) Use UnresolvedFunction based resolution in SQL Dataset functions

2023-09-11 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-45112:
---
Summary: Use UnresolvedFunction based resolution in SQL Dataset functions  
(was: Use UnresolvedFunction in dataset functions)

> Use UnresolvedFunction based resolution in SQL Dataset functions
> 
>
> Key: SPARK-45112
> URL: https://issues.apache.org/jira/browse/SPARK-45112
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45112) Use UnresolvedFunction in dataset functions

2023-09-10 Thread Peter Toth (Jira)
Peter Toth created SPARK-45112:
--

 Summary: Use UnresolvedFunction in dataset functions
 Key: SPARK-45112
 URL: https://issues.apache.org/jira/browse/SPARK-45112
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Peter Toth









[jira] [Updated] (SPARK-45109) Fix aes_decrypt and ln in connect

2023-09-10 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-45109:
---
Description: The current {{aes_decrypt}} reference to {{aes_encrypt}} is clearly 
a bug. The {{ln}} reference to {{log}} is more of a cosmetic issue, but because 
the {{ln}} and {{log}} function implementations differ in Spark SQL, we should 
use the same implementation in Spark Connect too.
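A hedged sketch of the observable symptom (an assumption for illustration, not a test
from the ticket): encrypting and then decrypting with the same key should round-trip,
so a client whose {{aes_decrypt}} forwards to {{aes_encrypt}} would encrypt twice and
fail to return the original value.

{code:scala}
// Illustrative only; assumes a SparkSession named `spark` is in scope.
import org.apache.spark.sql.functions._

val key = lit("0000111122223333")  // 16-byte AES key
val roundTrip = aes_decrypt(aes_encrypt(lit("Spark"), key), key)
spark.range(1).select(roundTrip.cast("string")).show()  // expected output: Spark
{code}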

> Fix aes_decrypt and ln in connect
> -
>
> Key: SPARK-45109
> URL: https://issues.apache.org/jira/browse/SPARK-45109
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Peter Toth
>Priority: Major
>  Labels: pull-request-available
>
> The current {{aes_decrypt}} reference to {{aes_encrypt}} is clearly a bug. The 
> {{ln}} reference to {{log}} is more of a cosmetic issue, but because the {{ln}} 
> and {{log}} function implementations differ in Spark SQL, we should use the 
> same implementation in Spark Connect too.






[jira] [Created] (SPARK-45109) Fix aes_decrypt and ln in connect

2023-09-10 Thread Peter Toth (Jira)
Peter Toth created SPARK-45109:
--

 Summary: Fix aes_decrypt and ln in connect
 Key: SPARK-45109
 URL: https://issues.apache.org/jira/browse/SPARK-45109
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0, 4.0.0
Reporter: Peter Toth









[jira] [Updated] (SPARK-45022) Provide context for dataset API errors

2023-08-31 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-45022:
---
Description: 
SQL failures already provide nice error context when there is a failure:
{noformat}
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. 
Use `try_divide` to tolerate divisor being 0 and return NULL instead. If 
necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
== SQL(line 1, position 1) ==
a / b
^

at 
org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
at 
org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)
...
{noformat}
We could add a similar user-friendly error context to Dataset APIs.

E.g. consider the following Spark app SimpleApp.scala:
{noformat}
   1  import org.apache.spark.sql.SparkSession
   2  import org.apache.spark.sql.functions._
   3
   4  object SimpleApp {
   5def main(args: Array[String]) {
   6  val spark = SparkSession.builder.appName("Simple 
Application").config("spark.sql.ansi.enabled", true).getOrCreate()
   7  import spark.implicits._
   8
   9  val c = col("a") / col("b")
  10
  11  Seq((1, 0)).toDF("a", "b").select(c).show()
  12
  13  spark.stop()
  14}
  15  }
{noformat}
then the error context could be:
{noformat}
Exception in thread "main" org.apache.spark.SparkArithmeticException: 
[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 
and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" 
to bypass this error.
== Dataset ==
"div" was called from SimpleApp$.main(SimpleApp.scala:9)

at 
org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
at 
org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672
...
{noformat}

> Provide context for dataset API errors
> --
>
> Key: SPARK-45022
> URL: https://issues.apache.org/jira/browse/SPARK-45022
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Priority: Major
>
> SQL failures already provide nice error context when there is a failure:
> {noformat}
> org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. 
> Use `try_divide` to tolerate divisor being 0 and return NULL instead. If 
> necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
> == SQL(line 1, position 1) ==
> a / b
> ^
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)
> ...
> {noformat}
> We could add a similar user-friendly error context to Dataset APIs.
> E.g. consider the following Spark app SimpleApp.scala:
> {noformat}
>1  import org.apache.spark.sql.SparkSession
>2  import org.apache.spark.sql.functions._
>3
>4  object SimpleApp {
>5def main(args: Array[String]) {
>6  val spark = SparkSession.builder.appName("Simple 
> Application").config("spark.sql.ansi.enabled", true).getOrCreate()
>7  import spark.implicits._
>8
>9  val c = col("a") / col("b")
>   10
>   11  Seq((1, 0)).toDF("a", "b").select(c).show()
>   12
>   13  spark.stop()
>   14}
>   15  }
> {noformat}
> then the error context could be:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkArithmeticException: 
> [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 
> 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to 
> "false" to bypass this error.
> == Dataset ==
> "div" was called from SimpleApp$.main(SimpleApp.scala:9)
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672
> ...
> {noformat}






[jira] [Created] (SPARK-45034) Support deterministic mode function

2023-08-31 Thread Peter Toth (Jira)
Peter Toth created SPARK-45034:
--

 Summary: Support deterministic mode function
 Key: SPARK-45034
 URL: https://issues.apache.org/jira/browse/SPARK-45034
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Peter Toth









[jira] [Created] (SPARK-45022) Provide context for dataset API errors

2023-08-30 Thread Peter Toth (Jira)
Peter Toth created SPARK-45022:
--

 Summary: Provide context for dataset API errors
 Key: SPARK-45022
 URL: https://issues.apache.org/jira/browse/SPARK-45022
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Peter Toth









[jira] [Assigned] (SPARK-44934) PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called over CTE with duplicate attributes

2023-08-24 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth reassigned SPARK-44934:
--

Assignee: Wen Yuen Pang

> PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called 
> over CTE with duplicate attributes
> --
>
> Key: SPARK-44934
> URL: https://issues.apache.org/jira/browse/SPARK-44934
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.3, 3.4.1
>Reporter: Wen Yuen Pang
>Assignee: Wen Yuen Pang
>Priority: Minor
> Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4
>
>
> When running the query
> {code:java}
> with cte as (
>  select c1, c1, c2, c3 from t where random() > 0
> )
> select cte.c1, cte2.c1, cte.c2, cte2.c3 from
>  (select c1, c2 from cte) cte
>  inner join
>  (select c1, c3 from cte) cte2
>  on cte.c1 = cte2.c1 {code}
>  
> The query fails with the error
> {code:java}
> org.apache.spark.scheduler.DAGScheduler: Failed to update accumulator 9523 
> (Unknown class) for task 1
> org.apache.spark.SparkException: attempted to access non-existent accumulator 
> 9523{code}
> Further investigation shows that the rule 
> PushdownPredicatesAndPruneColumnsForCTEDef creates an invalid plan when the 
> output of a CTE contains duplicate expression IDs.






[jira] [Updated] (SPARK-44934) PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called over CTE with duplicate attributes

2023-08-24 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-44934:
---
Affects Version/s: 3.3.3

> PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called 
> over CTE with duplicate attributes
> --
>
> Key: SPARK-44934
> URL: https://issues.apache.org/jira/browse/SPARK-44934
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.3, 3.4.1
>Reporter: Wen Yuen Pang
>Priority: Minor
> Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4
>
>
> When running the query
> {code:java}
> with cte as (
>  select c1, c1, c2, c3 from t where random() > 0
> )
> select cte.c1, cte2.c1, cte.c2, cte2.c3 from
>  (select c1, c2 from cte) cte
>  inner join
>  (select c1, c3 from cte) cte2
>  on cte.c1 = cte2.c1 {code}
>  
> The query fails with the error
> {code:java}
> org.apache.spark.scheduler.DAGScheduler: Failed to update accumulator 9523 
> (Unknown class) for task 1
> org.apache.spark.SparkException: attempted to access non-existent accumulator 
> 9523{code}
> Further investigation shows that the rule 
> PushdownPredicatesAndPruneColumnsForCTEDef creates an invalid plan when the 
> output of a CTE contains duplicate expression IDs.






[jira] [Resolved] (SPARK-44934) PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called over CTE with duplicate attributes

2023-08-24 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth resolved SPARK-44934.

Fix Version/s: 3.3.4
   3.5.0
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 42635
[https://github.com/apache/spark/pull/42635]

> PushdownPredicatesAndPruneColumnsForCTEDef creates invalid plan when called 
> over CTE with duplicate attributes
> --
>
> Key: SPARK-44934
> URL: https://issues.apache.org/jira/browse/SPARK-44934
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Wen Yuen Pang
>Priority: Minor
> Fix For: 3.3.4, 3.5.0, 4.0.0, 3.4.2
>
>
> When running the query
> {code:java}
> with cte as (
>  select c1, c1, c2, c3 from t where random() > 0
> )
> select cte.c1, cte2.c1, cte.c2, cte2.c3 from
>  (select c1, c2 from cte) cte
>  inner join
>  (select c1, c3 from cte) cte2
>  on cte.c1 = cte2.c1 {code}
>  
> The query fails with the error
> {code:java}
> org.apache.spark.scheduler.DAGScheduler: Failed to update accumulator 9523 
> (Unknown class) for task 1
> org.apache.spark.SparkException: attempted to access non-existent accumulator 
> 9523{code}
> Further investigation shows that the rule 
> PushdownPredicatesAndPruneColumnsForCTEDef creates an invalid plan when the 
> output of a CTE contains duplicate expression IDs.






[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-23 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-44871:
---
Fix Version/s: 3.5.0

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Critical
> Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}






[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-22 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-44871:
---
Affects Version/s: 3.4.0
   3.3.2
   3.3.1
   3.3.0

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Peter Toth
>Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}






[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-22 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-44871:
---
Affects Version/s: 3.4.1
   3.3.3
   (was: 3.3.0)
   (was: 3.4.0)
   (was: 3.5.0)
   (was: 4.0.0)

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1
>Reporter: Peter Toth
>Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}






[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-22 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-44871:
---
Fix Version/s: 3.5.0
   4.0.0

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0, 4.0.0
>Reporter: Peter Toth
>Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}






[jira] [Commented] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-18 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756097#comment-17756097
 ] 

Peter Toth commented on SPARK-44871:


[~tgraves], sure, I've just updated it.

It looks like my PR didn't get linked here automatically, so here it is: 
https://github.com/apache/spark/pull/42559

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0, 4.0.0
>Reporter: Peter Toth
>Priority: Critical
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}






[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-18 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-44871:
---
Description: 
Currently {{percentile_disc()}} returns incorrect results in some cases:

E.g.:
{code:java}
SELECT
  percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
  percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
  percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
  percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
  percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
  percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
  percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
  percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
  percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
  percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
  percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
FROM VALUES (0), (1), (2), (3), (4) AS v(a)
{code}
returns:
{code:java}
+---+---+---+---+---+---+---+---+---+---+---+
| p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
+---+---+---+---+---+---+---+---+---+---+---+
|0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
+---+---+---+---+---+---+---+---+---+---+---+
{code}
but it should return:
{noformat}
+---+---+---+---+---+---+---+---+---+---+---+
| p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
+---+---+---+---+---+---+---+---+---+---+---+
|0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
+---+---+---+---+---+---+---+---+---+---+---+
{noformat}
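For reference, the expected row follows the standard discrete-percentile rule:
{{percentile_disc(p)}} returns the first value in the ordered input whose cumulative
distribution (position / row count) reaches p. A small self-contained sketch of that
rule (not Spark's implementation) reproduces the corrected row:

{code:scala}
// Sketch of the discrete-percentile rule behind the "should return" row above.
val values = Seq(0.0, 1.0, 2.0, 3.0, 4.0)  // ordered input, n = 5

def percentileDisc(p: Double): Double = {
  val n = values.size
  values.zipWithIndex
    .collectFirst { case (v, i) if (i + 1).toDouble / n >= p => v }
    .get
}

percentileDisc(0.7)  // 3.0: cume_dist of 3.0 is 0.8 >= 0.7 (not 2.0, whose cume_dist is 0.6)
percentileDisc(1.0)  // 4.0
{code}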

> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0, 4.0.0
>Reporter: Peter Toth
>Priority: Critical
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
>   percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
>   percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
>   percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
>   percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
>   percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
>   percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
>   percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
>   percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
>   percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
>   percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
>   percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}






[jira] [Created] (SPARK-44871) Fix PERCENTILE_DISC behaviour

2023-08-18 Thread Peter Toth (Jira)
Peter Toth created SPARK-44871:
--

 Summary: Fix PERCENTILE_DISC behaviour
 Key: SPARK-44871
 URL: https://issues.apache.org/jira/browse/SPARK-44871
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0, 3.3.0, 3.5.0, 4.0.0
Reporter: Peter Toth









[jira] [Created] (SPARK-43266) Move MergeScalarSubqueries to spark-sql

2023-04-24 Thread Peter Toth (Jira)
Peter Toth created SPARK-43266:
--

 Summary: Move MergeScalarSubqueries to spark-sql
 Key: SPARK-43266
 URL: https://issues.apache.org/jira/browse/SPARK-43266
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Peter Toth


This is a step to make SPARK-40193 easier.






[jira] [Created] (SPARK-43199) Make InlineCTE idempotent

2023-04-19 Thread Peter Toth (Jira)
Peter Toth created SPARK-43199:
--

 Summary: Make InlineCTE idempotent
 Key: SPARK-43199
 URL: https://issues.apache.org/jira/browse/SPARK-43199
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth









[jira] [Comment Edited] (SPARK-24497) ANSI SQL: Recursive query

2023-04-19 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714114#comment-17714114
 ] 

Peter Toth edited comment on SPARK-24497 at 4/19/23 2:00 PM:
-

I've opened a new PR: https://github.com/apache/spark/pull/40744 to support 
recursive SQL, but for some reason it didn't get automatically linked here. 
[~gurwls223], you might know what went wrong...


was (Author: petertoth):
I've opened a new PR: https://github.com/apache/spark/pull/40093 to support 
recursive SQL, but for some reason it didn't get automatically linked here. 
[~gurwls223], you might know what went wrong...

> ANSI SQL: Recursive query
> -
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
> represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |    |
> --      |    +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  






[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query

2023-04-19 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714114#comment-17714114
 ] 

Peter Toth commented on SPARK-24497:


I've opened a new PR: https://github.com/apache/spark/pull/40093 to support 
recursive SQL, but for some reason it didn't get automatically linked here. 
[~gurwls223], you might know what went wrong...

> ANSI SQL: Recursive query
> -
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
> represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |    |
> --      |    +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  






[jira] [Created] (SPARK-43124) Dataset.show should not trigger job execution on CommandResults

2023-04-13 Thread Peter Toth (Jira)
Peter Toth created SPARK-43124:
--

 Summary: Dataset.show should not trigger job execution on 
CommandResults
 Key: SPARK-43124
 URL: https://issues.apache.org/jira/browse/SPARK-43124
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth









[jira] [Updated] (SPARK-42852) Revert NamedLambdaVariable related changes from EquivalentExpressions

2023-03-20 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-42852:
---
Affects Version/s: (was: 3.3.2)

> Revert NamedLambdaVariable related changes from EquivalentExpressions
> -
>
> Key: SPARK-42852
> URL: https://issues.apache.org/jira/browse/SPARK-42852
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.4.0
>
>
> See discussion 
> https://github.com/apache/spark/pull/40473#issuecomment-1474848224






[jira] [Resolved] (SPARK-42836) Support for recursive queries

2023-03-18 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth resolved SPARK-42836.

Resolution: Duplicate

> Support for recursive queries
> -
>
> Key: SPARK-42836
> URL: https://issues.apache.org/jira/browse/SPARK-42836
> Project: Spark
>  Issue Type: Question
>  Components: Java API, SQL
>Affects Versions: 3.4.0
>Reporter: Max
>Priority: Blocker
>
> Hello, a subtask was created a long time ago 
> https://issues.apache.org/jira/browse/SPARK-24497
> When will this task be completed? We really miss this.
> Thx.






[jira] [Commented] (SPARK-42836) Support for recursive queries

2023-03-18 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702138#comment-17702138
 ] 

Peter Toth commented on SPARK-42836:


Please ask on the already existing ticket, so that others know you also support 
this new feature.

BTW, I've opened PRs to add support for recursive queries, but they always got 
stalled due to a lack of reviews. I will try to rebase and adapt the latest one 
in the Spark 3.5 timeframe: 
https://github.com/apache/spark/pull/29210#issuecomment-1387144552

> Support for recursive queries
> -
>
> Key: SPARK-42836
> URL: https://issues.apache.org/jira/browse/SPARK-42836
> Project: Spark
>  Issue Type: Question
>  Components: Java API, SQL
>Affects Versions: 3.4.0
>Reporter: Max
>Priority: Blocker
>
> Hello, a subtask was created a long time ago 
> https://issues.apache.org/jira/browse/SPARK-24497
> When will this task be completed? We really miss this.
> Thx.






[jira] [Updated] (SPARK-42852) Revert NamedLambdaVariable related changes from EquivalentExpressions

2023-03-18 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-42852:
---
Summary: Revert NamedLambdaVariable related changes from 
EquivalentExpressions  (was: Revert NamedLambdaVariables related changes from 
EquivalentExpressions)

> Revert NamedLambdaVariable related changes from EquivalentExpressions
> -
>
> Key: SPARK-42852
> URL: https://issues.apache.org/jira/browse/SPARK-42852
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> See discussion 
> https://github.com/apache/spark/pull/40473#issuecomment-1474848224






[jira] [Updated] (SPARK-42852) Revert NamedLambdaVariables related changes from EquivalentExpressions

2023-03-18 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-42852:
---
Summary: Revert NamedLambdaVariables related changes from 
EquivalentExpressions  (was: Rervert NamedLambdaVariables related changes from 
EquivalentExpressions)

> Revert NamedLambdaVariables related changes from EquivalentExpressions
> --
>
> Key: SPARK-42852
> URL: https://issues.apache.org/jira/browse/SPARK-42852
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> See discussion 
> https://github.com/apache/spark/pull/40473#issuecomment-1474848224






[jira] [Created] (SPARK-42852) Rervert NamedLambdaVariables related changes from EquivalentExpressions

2023-03-18 Thread Peter Toth (Jira)
Peter Toth created SPARK-42852:
--

 Summary: Rervert NamedLambdaVariables related changes from 
EquivalentExpressions
 Key: SPARK-42852
 URL: https://issues.apache.org/jira/browse/SPARK-42852
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2, 3.4.0
Reporter: Peter Toth


See discussion 
https://github.com/apache/spark/pull/40473#issuecomment-1474848224






[jira] [Updated] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2

2023-03-10 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-42745:
---
Description: 
After SPARK-40086 / SPARK-42049, the following simple query containing a 
subselect expression:
{noformat}
select (select sum(id) from t1)
{noformat}
fails with:

{noformat}
09:48:57.645 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in 
stage 3.0 (TID 3)
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch$lzycompute(BatchScanExec.scala:47)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch(BatchScanExec.scala:47)
at 
org.apache.spark.sql.execution.datasources.v2.BatchScanExec.hashCode(BatchScanExec.scala:60)
at scala.runtime.Statics.anyHash(Statics.java:122)
...
at 
org.apache.spark.sql.catalyst.trees.TreeNode.hashCode(TreeNode.scala:249)
at scala.runtime.Statics.anyHash(Statics.java:122)
at 
scala.collection.mutable.HashTable$HashUtils.elemHashCode(HashTable.scala:416)
at 
scala.collection.mutable.HashTable$HashUtils.elemHashCode$(HashTable.scala:416)
at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:44)
at scala.collection.mutable.HashTable.addEntry(HashTable.scala:149)
at scala.collection.mutable.HashTable.addEntry$(HashTable.scala:148)
at scala.collection.mutable.HashMap.addEntry(HashMap.scala:44)
at scala.collection.mutable.HashTable.init(HashTable.scala:110)
at scala.collection.mutable.HashTable.init$(HashTable.scala:89)
at scala.collection.mutable.HashMap.init(HashMap.scala:44)
at scala.collection.mutable.HashMap.readObject(HashMap.scala:195)
...
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:85)
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1520)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
{noformat}
when DSv2 is enabled.

> Improved AliasAwareOutputExpression works with DSv2
> ---
>
> Key: SPARK-42745
> URL: https://issues.apache.org/jira/browse/SPARK-42745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.4.0
>
>
> After SPARK-40086 / SPARK-42049, the following simple query containing a 
> subselect expression:
> {noformat}
> select (select sum(id) from t1)
> {noformat}
> fails with:
> {noformat}
> 09:48:57.645 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 3.0 (TID 3)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch$lzycompute(BatchScanExec.scala:47)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch(BatchScanExec.scala:47)
>   at 
> org.apache.spark.sql.execution.datasources.v2.BatchScanExec.hashCode(BatchScanExec.scala:60)
>   at scala.runtime.Statics.anyHash(Statics.java:122)
> ...
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.hashCode(TreeNode.scala:249)
>   at scala.runtime.Statics.anyHash(Statics.java:122)
>   at 
> scala.collection.mutable.HashTable$HashUtils.elemHashCode(HashTable.scala:416)
>   at 
> scala.collection.mutable.HashTable$HashUtils.elemHashCode$(HashTable.scala:416)
>   at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:44)
>   at scala.collection.mutable.HashTable.addEntry(HashTable.scala:149)
>   at scala.collection.mutable.HashTable.addEntry$(HashTable.scala:148)
>   at scala.collection.mutable.HashMap.addEntry(HashMap.scala:44)
>   at scala.collection.mutable.HashTable.init(HashTable.scala:110)
>   at scala.collection.mutable.HashTable.init$(HashTable.scala:89)
>   at scala.collection.mutable.HashMap.init(HashMap.scala:44)
>   at scala.collection.mutable.HashMap.readObject(HashMap.scala:195)
> ...
>   at 

[jira] [Updated] (SPARK-42745) Improved AliasAwareOutputExpression works with DSv2

2023-03-10 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-42745:
---
Summary: Improved AliasAwareOutputExpression works with DSv2  (was: Fix NPE 
after recent AliasAwareOutputExpression changes)

> Improved AliasAwareOutputExpression works with DSv2
> ---
>
> Key: SPARK-42745
> URL: https://issues.apache.org/jira/browse/SPARK-42745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42745) Fix NPE after recent AliasAwareOutputExpression changes

2023-03-10 Thread Peter Toth (Jira)
Peter Toth created SPARK-42745:
--

 Summary: Fix NPE after recent AliasAwareOutputExpression changes
 Key: SPARK-42745
 URL: https://issues.apache.org/jira/browse/SPARK-42745
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0, 3.5.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42438) Improve constraint propagation using multiTransform

2023-02-14 Thread Peter Toth (Jira)
Peter Toth created SPARK-42438:
--

 Summary: Improve constraint propagation using multiTransform
 Key: SPARK-42438
 URL: https://issues.apache.org/jira/browse/SPARK-42438
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42436) Improve multiTransform to generate alternatives dynamically

2023-02-14 Thread Peter Toth (Jira)
Peter Toth created SPARK-42436:
--

 Summary: Improve multiTransform to generate alternatives 
dynamically
 Key: SPARK-42436
 URL: https://issues.apache.org/jira/browse/SPARK-42436
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42435) Update DataTables to 1.13.2

2023-02-14 Thread Peter Toth (Jira)
Peter Toth created SPARK-42435:
--

 Summary: Update DataTables to 1.13.2
 Key: SPARK-42435
 URL: https://issues.apache.org/jira/browse/SPARK-42435
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.5.0
Reporter: Peter Toth


The 1.10.25 version of DataTables, which Spark uses, appears to be vulnerable: 
https://security.snyk.io/package/npm/datatables.net/1.10.25.
The vulnerability may or may not affect Spark, but updating to the latest 1.13.2 seems doable.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-08 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685858#comment-17685858
 ] 

Peter Toth edited comment on SPARK-42346 at 2/8/23 11:16 AM:
-

[~ritikam], you also need to disable the "ConvertToLocalRelation" optimizer rule with `--conf 
"spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation"`
 to get the error from spark-shell.
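
For example, a spark-shell session could look like this (sketch only; since this is a dynamic SQL conf it can also be set at runtime instead of via `--conf`):
{code:scala}
// exclude the rule for the current session, then run the repro from the description
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
// ... create input_table as in the description, then: spark.sql(sql).show()
{code}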


was (Author: petertoth):
[~ritikam], you also need to disable the "ConvertToLocalRelation" rule 
optimization `--conf 
"spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation"`
 to get the error from spark-schell.

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: Robin
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.3.2, 3.4.0, 3.5.0
>
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour is introduced in 3.3.0.  The bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{    \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT first_name) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT surname) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table """}}
> {{spark.sql(sql).toPandas()}}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-08 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685858#comment-17685858
 ] 

Peter Toth commented on SPARK-42346:


[~ritikam], you also need to disable the "ConvertToLocalRelation" optimizer rule with `--conf 
"spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation"`
 to get the error from spark-shell.

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: Robin
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.3.2, 3.4.0, 3.5.0
>
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour is introduced in 3.3.0.  The bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{    \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT first_name) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT surname) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table """}}
> {{spark.sql(sql).toPandas()}}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-06 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685124#comment-17685124
 ] 

Peter Toth commented on SPARK-42346:


[~ritikam], please use the PySpark repro in the description, or add a 2nd row to 
your input_table if you use Scala. That's because Spark can optimize the distinct 
count away on single-row local relations.
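
A minimal Scala sketch (hypothetical values) of such a two-row input_table:
{code:scala}
// in spark-shell (spark.implicits._ is already in scope); with two rows the optimizer
// cannot fold the distinct count away on a single-row LocalRelation
Seq(("a", "b"), ("c", "d")).toDF("surname", "first_name").createOrReplaceTempView("input_table")
// then run the SQL from the description: spark.sql(sql).show()
{code}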

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: Robin
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.3.2, 3.4.0, 3.5.0
>
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour is introduced in 3.3.0.  The bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{    \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT first_name) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT surname) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table """}}
> {{spark.sql(sql).toPandas()}}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-05 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684272#comment-17684272
 ] 

Peter Toth commented on SPARK-42346:


[~yumwang], [~RobinLinacre], https://github.com/apache/spark/pull/39887 will 
fix the issue.

[~viirya], as this is a regression from 3.2 to 3.3, if possible please include 
this in 3.3.2.

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: Robin
>Priority: Major
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour is introduced in 3.3.0.  The bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{    \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT first_name) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT surname) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table """}}
> {{spark.sql(sql).toPandas()}}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-05 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-42346:
---
Affects Version/s: 3.3.0
   3.4.0
   3.5.0
   (was: 3.3.1)

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: Robin
>Priority: Major
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour is introduced in 3.3.0.  The bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{    \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT first_name) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT surname) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table """}}
> {{spark.sql(sql).toPandas()}}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-04 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684184#comment-17684184
 ] 

Peter Toth commented on SPARK-42346:


Thanks for pinging me [~yumwang], this might be subquery merge related. I will 
look into it.

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Robin
>Priority: Major
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour is introduced in 3.3.0.  The bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{    \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT first_name) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT surname) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table """}}
> {{spark.sql(sql).toPandas()}}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42136) Refactor BroadcastHashJoinExec output partitioning generation

2023-01-20 Thread Peter Toth (Jira)
Peter Toth created SPARK-42136:
--

 Summary: Refactor BroadcastHashJoinExec output partitioning 
generation
 Key: SPARK-42136
 URL: https://issues.apache.org/jira/browse/SPARK-42136
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42134) Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes

2023-01-20 Thread Peter Toth (Jira)
Peter Toth created SPARK-42134:
--

 Summary: Fix getPartitionFiltersAndDataFilters() to handle filters 
without referenced attributes
 Key: SPARK-42134
 URL: https://issues.apache.org/jira/browse/SPARK-42134
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41468) Fix PlanExpression handling in EquivalentExpressions

2022-12-09 Thread Peter Toth (Jira)
Peter Toth created SPARK-41468:
--

 Summary: Fix PlanExpression handling in EquivalentExpressions
 Key: SPARK-41468
 URL: https://issues.apache.org/jira/browse/SPARK-41468
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41367) Enable V2 file tables in read paths in session catalog

2022-12-02 Thread Peter Toth (Jira)
Peter Toth created SPARK-41367:
--

 Summary: Enable V2 file tables in read paths in session catalog
 Key: SPARK-41367
 URL: https://issues.apache.org/jira/browse/SPARK-41367
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth


It would be good if we could use the already available V2 file source 
implementations with the session catalog.
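
For illustration only (the table name is made up and this is not a switch introduced by this ticket), a session-catalog file table read today typically resolves through the V1 path:
{code:scala}
// Hypothetical illustration: a parquet table registered in the session catalog
spark.sql("CREATE TABLE t_v2 (id BIGINT) USING parquet")
// the physical plan typically shows a V1 FileScan rather than a V2 BatchScan
spark.sql("SELECT * FROM t_v2").explain()
{code}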



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41124) Add DSv2 PlanStabilitySuites

2022-11-13 Thread Peter Toth (Jira)
Peter Toth created SPARK-41124:
--

 Summary: Add DSv2 PlanStabilitySuites
 Key: SPARK-41124
 URL: https://issues.apache.org/jira/browse/SPARK-41124
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40874) Fix broadcasts in Python UDFs when encryption is enabled

2022-10-21 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-40874:
---
Description: 
The following Pyspark script:
{noformat}
bin/pyspark --conf spark.io.encryption.enabled=true

...

bar = {"a": "aa", "b": "bb"}
foo = spark.sparkContext.broadcast(bar)
spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "")
spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect()
{noformat}
fails with:
{noformat}
22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 811, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 87, in read_command
command = serializer._read_with_length(file)
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
 line 173, in _read_with_length
return self.loads(obj)
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
 line 471, in loads
return cloudpickle.loads(obj, encoding=encoding)
EOFError: Ran out of input
{noformat}

> Fix broadcasts in Python UDFs when encryption is enabled
> 
>
> Key: SPARK-40874
> URL: https://issues.apache.org/jira/browse/SPARK-40874
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> The following Pyspark script:
> {noformat}
> bin/pyspark --conf spark.io.encryption.enabled=true
> ...
> bar = {"a": "aa", "b": "bb"}
> foo = spark.sparkContext.broadcast(bar)
> spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "")
> spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect()
> {noformat}
> fails with:
> {noformat}
> 22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 
> 1]
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 811, in main
> func, profiler, deserializer, serializer = read_command(pickleSer, infile)
>   File 
> "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 87, in read_command
> command = serializer._read_with_length(file)
>   File 
> "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 173, in _read_with_length
> return self.loads(obj)
>   File 
> "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 471, in loads
> return cloudpickle.loads(obj, encoding=encoding)
> EOFError: Ran out of input
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-40874) Fix broadcasts in Python UDFs when encryption is enabled

2022-10-21 Thread Peter Toth (Jira)


[ https://issues.apache.org/jira/browse/SPARK-40874 ]


Peter Toth deleted comment on SPARK-40874:


was (Author: petertoth):
The following Pyspark script:
{noformat}
bin/pyspark --conf spark.io.encryption.enabled=true

...

bar = {"a": "aa", "b": "bb"}
foo = spark.sparkContext.broadcast(bar)
spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "")
spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect()
{noformat}
fails with:
{noformat}
22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 811, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 87, in read_command
command = serializer._read_with_length(file)
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
 line 173, in _read_with_length
return self.loads(obj)
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
 line 471, in loads
return cloudpickle.loads(obj, encoding=encoding)
EOFError: Ran out of input
{noformat}

> Fix broadcasts in Python UDFs when encryption is enabled
> 
>
> Key: SPARK-40874
> URL: https://issues.apache.org/jira/browse/SPARK-40874
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40874) Fix broadcasts in Python UDFs when encryption is enabled

2022-10-21 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-40874:
---

The following Pyspark script:
{noformat}
bin/pyspark --conf spark.io.encryption.enabled=true

...

bar = {"a": "aa", "b": "bb"}
foo = spark.sparkContext.broadcast(bar)
spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "")
spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect()
{noformat}
fails with:
{noformat}
22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 811, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 87, in read_command
command = serializer._read_with_length(file)
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
 line 173, in _read_with_length
return self.loads(obj)
  File 
"/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
 line 471, in loads
return cloudpickle.loads(obj, encoding=encoding)
EOFError: Ran out of input
{noformat}

> Fix broadcasts in Python UDFs when encryption is enabled
> 
>
> Key: SPARK-40874
> URL: https://issues.apache.org/jira/browse/SPARK-40874
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40874) Fix broadcasts in Python UDFs when encryption is enabled

2022-10-21 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-40874:
---
Summary: Fix broadcasts in Python UDFs when encryption is enabled  (was: 
Fix Python UDFs with broadcasts when encryption is enabled)

> Fix broadcasts in Python UDFs when encryption is enabled
> 
>
> Key: SPARK-40874
> URL: https://issues.apache.org/jira/browse/SPARK-40874
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40874) Fix Python UDFs with broadcasts when encryption is enabled

2022-10-21 Thread Peter Toth (Jira)
Peter Toth created SPARK-40874:
--

 Summary: Fix Python UDFs with broadcasts when encryption is enabled
 Key: SPARK-40874
 URL: https://issues.apache.org/jira/browse/SPARK-40874
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40599) Add multiTransform methods to TreeNode to generate alternatives

2022-09-28 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-40599:
---
Summary: Add multiTransform methods to TreeNode to generate alternatives  
(was: Add multiTransform methods to TreeNode to generate alternative 
transformations)

> Add multiTransform methods to TreeNode to generate alternatives
> ---
>
> Key: SPARK-40599
> URL: https://issues.apache.org/jira/browse/SPARK-40599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40599) Add multiTransform methods to TreeNode to generate alternative transformations

2022-09-28 Thread Peter Toth (Jira)
Peter Toth created SPARK-40599:
--

 Summary: Add multiTransform methods to TreeNode to generate 
alternative transformations
 Key: SPARK-40599
 URL: https://issues.apache.org/jira/browse/SPARK-40599
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40259) Support Parquet DSv2 in subquery plan merge

2022-08-29 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-40259:
---
Description: We could improve SPARK-34079 with DSv2 support.  (was: We 
could improve SPARK-34079 to support DSv2.)

> Support Parquet DSv2 in subquery plan merge
> ---
>
> Key: SPARK-40259
> URL: https://issues.apache.org/jira/browse/SPARK-40259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> We could improve SPARK-34079 with DSv2 support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40259) Support Parquet DSv2 in subquery plan merge

2022-08-29 Thread Peter Toth (Jira)
Peter Toth created SPARK-40259:
--

 Summary: Support Parquet DSv2 in subquery plan merge
 Key: SPARK-40259
 URL: https://issues.apache.org/jira/browse/SPARK-40259
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth


We could improve SPARK-34079 to support DSv2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40247) Fix BitSet equality check

2022-08-28 Thread Peter Toth (Jira)
Peter Toth created SPARK-40247:
--

 Summary: Fix BitSet equality check
 Key: SPARK-40247
 URL: https://issues.apache.org/jira/browse/SPARK-40247
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40245) Fix FileScan equality check when partition or data filter columns are not read

2022-08-27 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-40245:
---
Summary: Fix FileScan equality check when partition or data filter columns 
are not read  (was: Fix FileScan canonicalization when partition or data filter 
columns are not read)

> Fix FileScan equality check when partition or data filter columns are not read
> --
>
> Key: SPARK-40245
> URL: https://issues.apache.org/jira/browse/SPARK-40245
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40245) Fix FileScan canonicalization when partition or data filter columns are not read

2022-08-27 Thread Peter Toth (Jira)
Peter Toth created SPARK-40245:
--

 Summary: Fix FileScan canonicalization when partition or data 
filter columns are not read
 Key: SPARK-40245
 URL: https://issues.apache.org/jira/browse/SPARK-40245
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40193) Merge subquery plans with different filters

2022-08-23 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-40193:
---
Summary: Merge subquery plans with different filters  (was: Merge different 
filters when merging subquery plans)

> Merge subquery plans with different filters
> ---
>
> Key: SPARK-40193
> URL: https://issues.apache.org/jira/browse/SPARK-40193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Priority: Major
>
> We could improve SPARK-34079 to be able to merge different filters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40193) Merge different filters when merging subquery plans

2022-08-23 Thread Peter Toth (Jira)
Peter Toth created SPARK-40193:
--

 Summary: Merge different filters when merging subquery plans
 Key: SPARK-40193
 URL: https://issues.apache.org/jira/browse/SPARK-40193
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth


We could improve SPARK-34079 to be able to merge different filters.
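
A hypothetical example (table and column names are made up) of two scalar subqueries that differ only in their filters and could be served by one merged scan:
{code:scala}
// the two subqueries read the same table and differ only in the Filter on column b
spark.sql("""
  SELECT
    (SELECT sum(a) FROM t WHERE b = 1) AS sum_b1,
    (SELECT avg(a) FROM t WHERE b = 2) AS avg_b2
""").explain()
{code}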



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40086) Improve AliasAwareOutputPartitioning to take all aliases into account

2022-08-15 Thread Peter Toth (Jira)
Peter Toth created SPARK-40086:
--

 Summary: Improve AliasAwareOutputPartitioning to take all aliases 
into account
 Key: SPARK-40086
 URL: https://issues.apache.org/jira/browse/SPARK-40086
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth


Currently AliasAwareOutputPartitioning takes only the last alias of an aliased 
expression into account.
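
A hypothetical illustration of the limitation (names are made up):
{code:scala}
import org.apache.spark.sql.functions.col

// both "a" and "b" alias the partitioning expression "id", but only the last alias
// used to be taken into account when deriving the output partitioning of the Project
val df = spark.range(10)
  .repartition(col("id"))
  .selectExpr("id AS a", "id AS b")
df.explain()
{code}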



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38717) Handle Hive's bucket spec case preserving behaviour

2022-03-31 Thread Peter Toth (Jira)
Peter Toth created SPARK-38717:
--

 Summary: Handle Hive's bucket spec case preserving behaviour
 Key: SPARK-38717
 URL: https://issues.apache.org/jira/browse/SPARK-38717
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Peter Toth


{code}
CREATE TABLE t(
 c STRING,
 B_C STRING
)
PARTITIONED BY (p_c STRING)
CLUSTERED BY (B_C) INTO 4 BUCKETS
STORED AS PARQUET
{code}
then
{code}
SELECT * FROM t
{code}
fails with:
{code}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns B_C 
is not part of the table columns ([FieldSchema(name:c, type:string, 
comment:null), FieldSchema(name:b_c, type:string, comment:null)]
at 
org.apache.hadoop.hive.ql.metadata.Table.setBucketCols(Table.java:552)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1098)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:764)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:763)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitionsByFilter$1(HiveExternalCatalog.scala:1287)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:101)
... 110 more
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2022-03-25 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512270#comment-17512270
 ] 

Peter Toth commented on SPARK-26639:


[~stubartmess], that's a different issue but it is fixed in SPARK-36447.

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature has done in 
> [https://github.com/apache/spark/pull/14548]
> In my test, I found the visualized plan do show the subquery is executed 
> once. But the stage of same subquery execute maybe not once.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28299) Evaluation of multiple CTE uses

2021-11-24 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth resolved SPARK-28299.

Resolution: Duplicate

> Evaluation of multiple CTE uses  
> -
>
> Key: SPARK-28299
> URL: https://issues.apache.org/jira/browse/SPARK-28299
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Priority: Major
>
> This query returns 2 in Spark SQL (ie. the CTE is evaluated twice), but it 
> returns 1 in PostgreSQL.
> {noformat}
> WITH t(x) AS (SELECT random())
> SELECT count(*) FROM (
>   SELECT * FROM t
>   UNION
>   SELECT * FROM t
> ) x
> {noformat}
> I tested MSSQL too and it returns 2 as Spark SQL does. Further tests are 
> needed on different DBs...



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37259) JDBC read is always going to wrap the query in a select statement

2021-11-23 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447980#comment-17447980
 ] 

Peter Toth commented on SPARK-37259:


I've opened a PR: [https://github.com/apache/spark/pull/34693] to support 
queries with CTE.

{quote}to get the schema and having a way to get that, without running the 
query twice.{quote}
I don't think that running the query twice to get the schema would be an issue 
as Spark adds a `WHERE 1=0` clause to the query. The MSSQL engine should optimize 
such a query and quickly return the schema with an empty result.

{quote}The other item is the query is going to do something to the query you 
pass in, so it would need to be based on dbtable being used that is only doing 
a trim; the query is wrapping:
s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}"{quote}
I don't think this is an issue, please see new unit tests in the PR.

{quote}The other query that uses temp tables, in the sql server it is either 
#temptable or ##temptable is also still an issue because of how it getting 
wrapped in the select and the similar item if that runs the query to get the 
schema, then it actually creates the tables and the query fails when it runs 
since the table exists{quote}
I'm not sure that temp tables fit into Spark's JDBC world. Let me check if we 
can work around them with the new `withClause`...


> JDBC read is always going to wrap the query in a select statement
> -
>
> Key: SPARK-37259
> URL: https://issues.apache.org/jira/browse/SPARK-37259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kevin Appel
>Priority: Major
>
> The read jdbc is wrapping the query it sends to the database server inside a 
> select statement and there is no way to override this currently.
> Initially I ran into this issue when trying to run a CTE query against SQL 
> server and it fails, the details of the failure is in these cases:
> [https://github.com/microsoft/mssql-jdbc/issues/1340]
> [https://github.com/microsoft/mssql-jdbc/issues/1657]
> [https://github.com/microsoft/sql-spark-connector/issues/147]
> https://issues.apache.org/jira/browse/SPARK-32825
> https://issues.apache.org/jira/browse/SPARK-34928
> I started to patch the code to get the query to run and ran into a few 
> different items, if there is a way to add these features to allow this code 
> path to run, this would be extremely helpful to running these type of edge 
> case queries.  These are basic examples here the actual queries are much more 
> complex and would require significant time to rewrite.
> Inside JDBCOptions.scala the query is being set to either, using the dbtable 
> this allows the query to be passed without modification
>  
> {code:java}
> name.trim
> or
> s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}"
> {code}
>  
> Inside JDBCRelation.scala this is going to try to get the schema for this 
> query, and this ends up running dialect.getSchemaQuery which is doing:
> {code:java}
> s"SELECT * FROM $table WHERE 1=0"{code}
> Overriding the dialect here and initially just passing back the $table gets 
> passed here and to the next issue which is in the compute function in 
> JDBCRDD.scala
>  
> {code:java}
> val sqlText = s"SELECT $columnList FROM ${options.tableOrQuery} 
> $myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause"
>  
> {code}
>  
> For these two queries, about a CTE query and using temp tables, finding out 
> the schema is difficult without actually running the query and for the temp 
> table if you run it in the schema check that will have the table now exist 
> and fail when it runs the actual query.
>  
> The way I patched these is by doing these two items:
> JDBCRDD.scala (compute)
>  
> {code:java}
>     val runQueryAsIs = options.parameters.getOrElse("runQueryAsIs", 
> "false").toBoolean
>     val sqlText = if (runQueryAsIs) {
>       s"${options.tableOrQuery}"
>     } else {
>       s"SELECT $columnList FROM ${options.tableOrQuery} $myWhereClause"
>     }
> {code}
> JDBCRelation.scala (getSchema)
> {code:java}
> val useCustomSchema = jdbcOptions.parameters.getOrElse("useCustomSchema", 
> "false").toBoolean
>     if (useCustomSchema) {
>       val myCustomSchema = jdbcOptions.parameters.getOrElse("customSchema", 
> "").toString
>       val newSchema = CatalystSqlParser.parseTableSchema(myCustomSchema)
>       logInfo(s"Going to return the new $newSchema because useCustomSchema is 
> $useCustomSchema and passed in $myCustomSchema")
>       newSchema
>     } else {
>       val tableSchema = JDBCRDD.resolveTable(jdbcOptions)
>       jdbcOptions.customSchema match {
>       case Some(customSchema) => JdbcUtils.getCustomSchema(
>         tableSchema, customSchema, resolver)
>       case None 

[jira] [Commented] (SPARK-37259) JDBC read is always going to wrap the query in a select statement

2021-11-19 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446554#comment-17446554
 ] 

Peter Toth commented on SPARK-37259:


[~KevinAppelBofa], how about adding a new `withClause` to the JDBC options? Do 
you think you could split your CTE query into "with clause" and "regular query" 
parts manually and specify something like: .option("withClause", 
withClause).option("query", query)?
Because, that way we probably only need a small change to `sqlText` in 
`compute()` 
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L370-L371):]
{noformat}
val sqlText = s"$withClause SELECT $columnList FROM ${options.tableOrQuery} 
$myTableSampleClause" +
  s" $myWhereClause $getGroupByClause $myLimitClause"
{noformat}
and also we could keep its other functionality.

Sidenote: technically we could extract the WITH clause in MsSqlServerDialect 
and assemble a dialect specific `sqlText` there, but it is not that simple to 
do it...
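
A sketch of how the proposed option could be used (the `withClause` option does not exist yet, and the URL and CTE below are placeholders):
{code:scala}
// hypothetical usage of the proposed option -- not part of released Spark
val url = "jdbc:sqlserver://host;databaseName=db"  // placeholder
val withClause = "WITH DummyCTE AS (SELECT 1 AS DummyCOL)"
val query = "SELECT DummyCOL FROM DummyCTE"
val df = spark.read.format("jdbc")
  .option("url", url)
  .option("withClause", withClause)  // proposed option
  .option("query", query)
  .load()
{code}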

> JDBC read is always going to wrap the query in a select statement
> -
>
> Key: SPARK-37259
> URL: https://issues.apache.org/jira/browse/SPARK-37259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kevin Appel
>Priority: Major
>
> The read jdbc is wrapping the query it sends to the database server inside a 
> select statement and there is no way to override this currently.
> Initially I ran into this issue when trying to run a CTE query against SQL 
> server and it fails, the details of the failure is in these cases:
> [https://github.com/microsoft/mssql-jdbc/issues/1340]
> [https://github.com/microsoft/mssql-jdbc/issues/1657]
> [https://github.com/microsoft/sql-spark-connector/issues/147]
> https://issues.apache.org/jira/browse/SPARK-32825
> https://issues.apache.org/jira/browse/SPARK-34928
> I started to patch the code to get the query to run and ran into a few 
> different items, if there is a way to add these features to allow this code 
> path to run, this would be extremely helpful to running these type of edge 
> case queries.  These are basic examples here the actual queries are much more 
> complex and would require significant time to rewrite.
> Inside JDBCOptions.scala the query is being set to either, using the dbtable 
> this allows the query to be passed without modification
>  
> {code:java}
> name.trim
> or
> s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}"
> {code}
>  
> Inside JDBCRelation.scala this is going to try to get the schema for this 
> query, and this ends up running dialect.getSchemaQuery which is doing:
> {code:java}
> s"SELECT * FROM $table WHERE 1=0"{code}
> Overriding the dialect here and initially just passing back the $table gets 
> passed here and to the next issue which is in the compute function in 
> JDBCRDD.scala
>  
> {code:java}
> val sqlText = s"SELECT $columnList FROM ${options.tableOrQuery} 
> $myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause"
>  
> {code}
>  
> For these two queries, about a CTE query and using temp tables, finding out 
> the schema is difficult without actually running the query and for the temp 
> table if you run it in the schema check that will have the table now exist 
> and fail when it runs the actual query.
>  
> The way I patched these is by doing these two items:
> JDBCRDD.scala (compute)
>  
> {code:java}
>     val runQueryAsIs = options.parameters.getOrElse("runQueryAsIs", 
> "false").toBoolean
>     val sqlText = if (runQueryAsIs) {
>       s"${options.tableOrQuery}"
>     } else {
>       s"SELECT $columnList FROM ${options.tableOrQuery} $myWhereClause"
>     }
> {code}
> JDBCRelation.scala (getSchema)
> {code:java}
> val useCustomSchema = jdbcOptions.parameters.getOrElse("useCustomSchema", 
> "false").toBoolean
>     if (useCustomSchema) {
>       val myCustomSchema = jdbcOptions.parameters.getOrElse("customSchema", 
> "").toString
>       val newSchema = CatalystSqlParser.parseTableSchema(myCustomSchema)
>       logInfo(s"Going to return the new $newSchema because useCustomSchema is 
> $useCustomSchema and passed in $myCustomSchema")
>       newSchema
>     } else {
>       val tableSchema = JDBCRDD.resolveTable(jdbcOptions)
>       jdbcOptions.customSchema match {
>       case Some(customSchema) => JdbcUtils.getCustomSchema(
>         tableSchema, customSchema, resolver)
>       case None => tableSchema
>       }
>     }{code}
>  
> This is allowing the query to run as is, by using the dbtable option and then 
> provide a custom schema that will bypass the dialect schema check
>  
> Test queries
>  
> {code:java}
> query1 = """ 
> SELECT 1 as DummyCOL
> """
> query2 = """ 
> WITH DummyCTE AS
> (

[jira] [Commented] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN

2021-09-23 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419388#comment-17419388
 ] 

Peter Toth commented on SPARK-35672:


I put up a revert PR: https://github.com/apache/spark/pull/34082

> Spark fails to launch executors with very large user classpath lists on YARN
> 
>
> Key: SPARK-35672
> URL: https://issues.apache.org/jira/browse/SPARK-35672
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.1.2
> Environment: Linux RHEL7
> Spark 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.2.0, 3.1.3
>
>
> When running Spark on YARN, the {{user-class-path}} argument to 
> {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to 
> executor processes. The argument is specified once for each JAR, and the URIs 
> are fully-qualified, so the paths can be quite long. With large user JAR 
> lists (say 1000+), this can result in system-level argument length limits 
> being exceeded, typically manifesting as the error message:
> {code}
> /bin/bash: Argument list too long
> {code}
> A [Google 
> search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22]
>  indicates that this is not a theoretical problem and afflicts real users, 
> including ours. This issue was originally observed on Spark 2.3, but has been 
> confirmed to exist in the master branch as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN

2021-09-23 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419285#comment-17419285
 ] 

Peter Toth commented on SPARK-35672:


[~xkrogen], [~tgraves], unfortunately, I think this is a breaking change and 
should be reverted.

On our clusters we use 
`{{spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/...}}` and before 
this change the YARN executor launch context looked like the following:
{noformat}
YARN executor launch context:
  env:
...

  command:
...
  {{JAVA_HOME}}/bin/java \
  -server \
  ...
  --user-class-path \
  file:{{HADOOP_COMMON_HOME}}/...jar \
  ...
{noformat}
and YARN was able to substitute the HADOOP_COMMON_HOME environment variable.

But after this change the user classpath is distributed in {{SparkConf}} and we 
can't use environment variables anymore.

cc [~Gengliang.Wang]


> Spark fails to launch executors with very large user classpath lists on YARN
> 
>
> Key: SPARK-35672
> URL: https://issues.apache.org/jira/browse/SPARK-35672
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.1.2
> Environment: Linux RHEL7
> Spark 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.2.0, 3.1.3
>
>
> When running Spark on YARN, the {{user-class-path}} argument to 
> {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to 
> executor processes. The argument is specified once for each JAR, and the URIs 
> are fully-qualified, so the paths can be quite long. With large user JAR 
> lists (say 1000+), this can result in system-level argument length limits 
> being exceeded, typically manifesting as the error message:
> {code}
> /bin/bash: Argument list too long
> {code}
> A [Google 
> search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22]
>  indicates that this is not a theoretical problem and afflicts real users, 
> including ours. This issue was originally observed on Spark 2.3, but has been 
> confirmed to exist in the master branch as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36065) date_trunc returns incorrect output

2021-07-30 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390421#comment-17390421
 ] 

Peter Toth commented on SPARK-36065:


I think the output is correct: there was a time zone change (a +00:02:16 shift) at 
1891-10-01 00:00:00 in Bratislava, which means that local time 1891-10-01 00:00:00 
maps to 1891-10-01 00:02:16.
I found this site that shows the TZ changes: 
https://www.timeanddate.com/time/zone/slovakia/bratislava
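
The transition can also be inspected with java.time (sketch only; the exact offsets and the printed format depend on the JDK's tzdata version):
{code:scala}
import java.time.{Instant, ZoneId}

val rules = ZoneId.of("Europe/Bratislava").getRules
// the first transition after 1891-01-01 is the +00:02:16 gap at 1891-10-01 00:00 local time
val transition = rules.nextTransition(Instant.parse("1891-01-01T00:00:00Z"))
println(transition)  // e.g. Transition[Gap at 1891-10-01T00:00+00:57:44 to +01:00]
{code}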

> date_trunc returns incorrect output
> ---
>
> Key: SPARK-36065
> URL: https://issues.apache.org/jira/browse/SPARK-36065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Sumeet
>Priority: Major
>  Labels: date_trunc, sql, timestamp
>
> Hi,
> Running date_trunc on any hour of "1891-10-01" returns incorrect output for 
> "Europe/Bratislava" timezone.
> Use the following steps in order to reproduce the issue:
>  * Run spark-shell using:
> {code:java}
> TZ="Europe/Bratislava" ./bin/spark-shell --conf 
> spark.driver.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf 
> spark.sql.session.timeZone="Europe/Bratislava"{code}
>  * Generate test data:
> {code:java}
> ((0 until 9).map(hour => s"1891-10-01 00:0$hour:00") ++ (10 until 
> 24).map(hour => s"1891-10-01 
> 00:$hour:00")).toDF("ts_string").createOrReplaceTempView("temp_ts")
> {code}
>  * Run query:
> {code:java}
> sql("select ts_string, cast(ts_string as TIMESTAMP) as ts, date_trunc('day', 
> ts_string) from temp_ts").show(false)
> {code}
>  * Output:
> {code:java}
> +---+---+--+
> |ts_string  |ts |date_trunc(day, ts_string)|
> +---+---+--+
> |1891-10-01 00:00:00|1891-10-01 00:02:16|1891-10-01 00:02:16   |
> |1891-10-01 00:01:00|1891-10-01 00:03:16|1891-10-01 00:02:16   |
> |1891-10-01 00:02:00|1891-10-01 00:04:16|1891-10-01 00:02:16   |
> |1891-10-01 00:03:00|1891-10-01 00:03:00|1891-10-01 00:02:16   |
> |1891-10-01 00:04:00|1891-10-01 00:04:00|1891-10-01 00:02:16   |
> |1891-10-01 00:05:00|1891-10-01 00:05:00|1891-10-01 00:02:16   |
> |1891-10-01 00:06:00|1891-10-01 00:06:00|1891-10-01 00:02:16   |
> |1891-10-01 00:07:00|1891-10-01 00:07:00|1891-10-01 00:02:16   |
> |1891-10-01 00:08:00|1891-10-01 00:08:00|1891-10-01 00:02:16   |
> |1891-10-01 00:10:00|1891-10-01 00:10:00|1891-10-01 00:02:16   |
> |1891-10-01 00:11:00|1891-10-01 00:11:00|1891-10-01 00:02:16   |
> |1891-10-01 00:12:00|1891-10-01 00:12:00|1891-10-01 00:02:16   |
> |1891-10-01 00:13:00|1891-10-01 00:13:00|1891-10-01 00:02:16   |
> |1891-10-01 00:14:00|1891-10-01 00:14:00|1891-10-01 00:02:16   |
> |1891-10-01 00:15:00|1891-10-01 00:15:00|1891-10-01 00:02:16   |
> |1891-10-01 00:16:00|1891-10-01 00:16:00|1891-10-01 00:02:16   |
> |1891-10-01 00:17:00|1891-10-01 00:17:00|1891-10-01 00:02:16   |
> |1891-10-01 00:18:00|1891-10-01 00:18:00|1891-10-01 00:02:16   |
> |1891-10-01 00:19:00|1891-10-01 00:19:00|1891-10-01 00:02:16   |
> |1891-10-01 00:20:00|1891-10-01 00:20:00|1891-10-01 00:02:16   |
> +---+---+--+
> only showing top 20 rows
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements

2021-07-12 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-36073:
---
Issue Type: Improvement  (was: Bug)

> EquivalentExpressions fixes and improvements
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Priority: Major
>
> Currently `EquivalentExpressions` has 2 issues:
> - identifying common expressions in conditional expressions is not correct in 
> all cases
> - transparently canonicalized expressions (like `PromotePrecision`) are 
> considered common subexpressions
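For context, a hypothetical example of the first issue (not taken from the ticket), 
runnable in spark-shell: the same non-trivial subexpression appears in the predicate 
and in every CASE WHEN branch, so it is a candidate for subexpression elimination, 
but since branches are evaluated conditionally the logic has to be careful about 
which child expressions are really common to all paths.

{code:scala}
spark.range(10).createOrReplaceTempView("t")

// sqrt(id * id + 1) occurs in the predicate and in both branches, so ideally
// it is evaluated once per row instead of up to three times.
spark.sql("""
  SELECT CASE WHEN sqrt(id * id + 1) > 5 THEN sqrt(id * id + 1) + 1
              ELSE sqrt(id * id + 1) - 1
         END AS r
  FROM t
""").show()
{code}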



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements

2021-07-10 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-36073:
---
Description: 
Currently `EquivalentExpressions` has 2 issues:
- identifying common expressions in conditional expressions is not correct in 
all cases
- transparently canonicalized expressions (like `PromotePrecision`) are 
considered common subexpressions

  was:

Fixes an issue with identifying common expressions in conditional expressions 
(a side effect of the above).
Fixes the issue that transparently canonicalized expressions (like 
PromotePrecision) are considered common subexpressions.


> EquivalentExpressions fixes and improvements
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Priority: Minor
>
> Currently `EquivalentExpressions` has 2 issues:
> - identifying common expressions in conditional expressions is not correct in 
> all cases
> - transparently canonicalized expressions (like `PromotePrecision`) are 
> considered common subexpressions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements

2021-07-10 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-36073:
---
Issue Type: Bug  (was: Improvement)

> EquivalentExpressions fixes and improvements
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Priority: Major
>
> Currently `EquivalentExpressions` has 2 issues:
> - identifying common expressions in conditional expressions is not correct in 
> all cases
> - transparently canonicalized expressions (like `PromotePrecision`) are 
> considered common subexpressions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements

2021-07-10 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-36073:
---
Priority: Major  (was: Minor)

> EquivalentExpressions fixes and improvements
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Priority: Major
>
> Currently `EquivalentExpressions` has 2 issues:
> - identifying common expressions in conditional expressions is not correct in 
> all cases
> - transparently canonicalized expressions (like `PromotePrecision`) are 
> considered common subexpressions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements

2021-07-10 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-36073:
---
Description: 

Fixes an issue with identifying common expressions in conditional expressions 
(a side effect of the above).
Fixes the issue that transparently canonicalized expressions (like 
PromotePrecision) are considered common subexpressions.

  was:SPARK-35410 
(https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112)
 filters out all child expressions, but in some cases that is not necessary.


> EquivalentExpressions fixes and improvements
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Priority: Minor
>
> Fixes an issue with identifying common expressions in conditional expressions 
> (a side effect of the above).
> Fixes the issue that transparently canonicalized expressions (like 
> PromotePrecision) are considered common subexpressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36073) EquivalentExpressions fixes and improvements

2021-07-10 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-36073:
---
Summary: EquivalentExpressions fixes and improvements  (was: SubExpr 
elimination should include common child exprs of conditional expressions)

> EquivalentExpressions fixes and improvements
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Priority: Minor
>
> SPARK-35410 
> (https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112)
>  filters out all child expressions, but in some cases that is not necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36073) SubExpr elimination should include common child exprs of conditional expressions

2021-07-09 Thread Peter Toth (Jira)
Peter Toth created SPARK-36073:
--

 Summary: SubExpr elimination should include common child exprs of 
conditional expressions
 Key: SPARK-36073
 URL: https://issues.apache.org/jira/browse/SPARK-36073
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Peter Toth


SPARK-35410 
(https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112)
 filters out all child expressions, but in some cases that is not necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35855) Unify reuse map data structures in non-AQE and AQE rules

2021-06-22 Thread Peter Toth (Jira)
Peter Toth created SPARK-35855:
--

 Summary: Unify reuse map data structures in non-AQE and AQE rules
 Key: SPARK-35855
 URL: https://issues.apache.org/jira/browse/SPARK-35855
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Peter Toth


We can unify reuse map data structures in non-AQE and AQE rules 
(`ReuseExchangeAndSubquery`, `ReuseAdaptiveSubquery`) to a simple 
`Map[, ]`.

Please find discussion here: 
[https://github.com/apache/spark/pull/28885#discussion_r655073897]
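A rough sketch of what such a unified reuse map could look like (the names and 
shape here are illustrative, not the actual class introduced by the PR): a plain 
mutable map keyed by the canonicalized form of a plan, which hands back the first 
equivalent plan seen so that later occurrences can be replaced with a reuse node.

{code:scala}
import scala.collection.mutable

class ReuseMap[K, P] {
  private val map = mutable.Map.empty[K, P]

  // Return the previously registered equivalent plan if one exists,
  // otherwise register `plan` under its canonicalized key and return it.
  def reuseOrElseAdd(key: K, plan: P): P = map.getOrElseUpdate(key, plan)
}

// Strings stand in for (canonicalized plan, plan) purely for illustration:
val reuse = new ReuseMap[String, String]
assert(reuse.reuseOrElseAdd("canon(scan t)", "scan t #1") == "scan t #1")
assert(reuse.reuseOrElseAdd("canon(scan t)", "scan t #2") == "scan t #1")
{code}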



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35798) Fix SparkPlan.sqlContext usage

2021-06-17 Thread Peter Toth (Jira)
Peter Toth created SPARK-35798:
--

 Summary: Fix SparkPlan.sqlContext usage
 Key: SPARK-35798
 URL: https://issues.apache.org/jira/browse/SPARK-35798
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Peter Toth


There might be SparkPlan nodes where canonicalization on the executor side can 
cause issues.

More details here: https://github.com/apache/spark/pull/32885/files#r651019687
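A self-contained sketch of the failure mode (plain Scala; the names are 
illustrative, not Spark's): `active` plays the role of the driver-only session 
state that `SparkPlan.sqlContext` resolves to, and deriving the canonicalized 
form lazily from it works on the driver but fails if the evaluation first 
happens on an executor, where no session is set.

{code:scala}
object Session {
  // Set on the driver, absent on executor JVMs.
  var active: Option[String] = Some("driver-session")
}

case class Plan(name: String) {
  // Lazily derived state that consults driver-only session state.
  lazy val canonicalized: String =
    Session.active.getOrElse(sys.error("no active session on this JVM")) + "/" + name
}

val p = Plan("scan")
Session.active = None   // roughly what an executor JVM looks like
// p.canonicalized      // would throw here; force it on the driver instead
{code}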



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34801) java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadPartition

2021-03-22 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306093#comment-17306093
 ] 

Peter Toth commented on SPARK-34801:


Yes, it is. Please use CDS 3 (Cloudera Distribution of Spark 3) on supported CDP 
versions.

> java.lang.NoSuchMethodException: 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition
> --
>
> Key: SPARK-34801
> URL: https://issues.apache.org/jira/browse/SPARK-34801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.2
> Environment: HDP3.1.4.0-315  spark 3.0.2
>Reporter: zhaojk
>Priority: Major
>
> Using spark-sql to run this SQL: insert overwrite table zry.zjk1 
> partition(etl_dt=2) select * from zry.zry; throws:
> java.lang.NoSuchMethodException: 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(org.apache.hadoop.fs.Path,
>  org.apache.hadoop.hive.ql.metadata.Table, java.util.Map, 
> org.apache.hadoop.hive.ql.plan.LoadTableDesc$LoadFileType, boolean, boolean, 
> boolean, boolean, boolean, java.lang.Long, int, boolean)
>  at java.lang.Class.getMethod(Class.java:1786)
>  at org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:177)
>  at 
> org.apache.spark.sql.hive.client.Shim_v3_0.loadPartitionMethod$lzycompute(HiveShim.scala:1289)
>  at 
> org.apache.spark.sql.hive.client.Shim_v3_0.loadPartitionMethod(HiveShim.scala:1274)
>  at 
> org.apache.spark.sql.hive.client.Shim_v3_0.loadPartition(HiveShim.scala:1337)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:881)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:295)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:277)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:871)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:915)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:894)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179)
>  at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:318)
>  at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
>  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at 

[jira] [Updated] (SPARK-34581) BoundAttribute issue after optimization by BooleanSimplification and PushFoldableIntoBranches

2021-03-21 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-34581:
---
Affects Version/s: 3.0.2
   3.1.1

> BoundAttribute issue after optimization by BooleanSimplification and 
> PushFoldableIntoBranches
> -
>
> Key: SPARK-34581
> URL: https://issues.apache.org/jira/browse/SPARK-34581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Yuming Wang
>Priority: Major
>
> BoundAttribute issue will occur after optimization by BooleanSimplification 
> and PushFoldableIntoBranches. How to reproduce this issue:
> {code:scala}
> spark.sql("CREATE TABLE t1 (a INT, b INT) USING parquet")
> spark.sql("CREATE TABLE t2 (a INT, b INT) USING parquet")
>   spark.sql(
> """
>   |SELECT cnt,
>   |   NOT ( buyer_id ) AS buyer_id2
>   |FROM   (SELECT t1.a IS NOT NULL AS buyer_id,
>   |   Count(*) AS cnt
>   |FROM   t1
>   |   INNER JOIN t2
>   |   ON t1.a = t2.a
>   |GROUP  BY 1) t 
>   |""".stripMargin).collect()
> {code}
> {noformat}
> Couldn't find a#4 in [CASE WHEN isnotnull(a#4) THEN 1 ELSE 2 
> END#10,count(1)#3L]
> java.lang.IllegalStateException: Couldn't find a#4 in [CASE WHEN 
> isnotnull(a#4) THEN 1 ELSE 2 END#10,count(1)#3L]
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
> {noformat}
> Another case:
> {code:scala}
> spark.sql(
>   """
> |SELECT cnt,
> |   CASE WHEN ( buyer_id = 2 AND cnt > 3 ) THEN 2 ELSE 3 END AS 
> buyer_id2
> |FROM   (SELECT CASE WHEN t1.a IS NOT NULL THEN 1 ELSE 2 END AS buyer_id, 
> Count(*) AS cnt
> |FROM   t1 INNER JOIN t2 ON t1.a = t2.a
> |GROUP  BY 1) t
> |""".stripMargin).collect()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33482) V2 Datasources that extend FileScan preclude exchange reuse

2021-03-16 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-33482:
---
Affects Version/s: 3.0.0
   3.0.1
   3.0.2
   3.1.1

> V2 Datasources that extend FileScan preclude exchange reuse
> ---
>
> Key: SPARK-33482
> URL: https://issues.apache.org/jira/browse/SPARK-33482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Bruce Robbins
>Priority: Major
>
> Sample query:
> {noformat}
> spark.read.parquet("tbl").createOrReplaceTempView("tbl")
> spark.read.parquet("lookup").createOrReplaceTempView("lookup")
> sql("""
>select tbl.col1, fk1, fk2
>from tbl, lookup l1, lookup l2
>where fk1 = l1.key
>and fk2 = l2.key
> """).explain
> {noformat}
> Test files can be created as follows:
> {noformat}
> import scala.util.Random
> val rand = Random
> val tbl = spark.range(1, 1).map { x =>
>   (rand.nextLong.abs % 20,
>rand.nextLong.abs % 20,
>x)
> }.toDF("fk1", "fk2", "col1")
> tbl.write.mode("overwrite").parquet("tbl")
> val lookup = spark.range(0, 20).map { x =>
>   (x + 1, x * 1, (x + 1) * 1)
> }.toDF("key", "col1", "col2")
> lookup.write.mode("overwrite").parquet("lookup")
> {noformat}
> Output with V1 Parquet reader:
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- FileScan parquet [fk1#0L,fk2#1L,col1#2L] Batched: true, 
> DataFilters: [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilters: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- FileScan parquet [key#6L] Batched: true, DataFilters: 
> [isnotnull(key#6L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
>+- ReusedExchange [key#12L], BroadcastExchange 
> HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
> {noformat}
> With V1 Parquet reader, the exchange for lookup is reused (see last line).
> Output with V2 Parquet reader (spark.sql.sources.useV1SourceList=""):
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- BatchScan[fk1#0L, fk2#1L, col1#2L] ParquetScan DataFilters: 
> [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilers: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct, PushedFilters: 
> [IsNotNull(fk1), IsNotNull(fk2)]
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- BatchScan[key#6L] ParquetScan DataFilters: 
> [isnotnull(key#6L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]),false), [id=#83]
>   +- *(2) Filter isnotnull(key#12L)
>  +- *(2) ColumnarToRow
> +- BatchScan[key#12L] ParquetScan DataFilters: 
> [isnotnull(key#12L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
> {noformat}
> With the V2 Parquet reader, the exchange for lookup is not reused (see last 4 
> lines).
> You can see the same issue with the Orc reader (and I assume any other 
> datasource that extends FileScan).
> The issue appears to be this check in 

[jira] [Updated] (SPARK-34756) Fix FileScan equality check

2021-03-16 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-34756:
---
Affects Version/s: 3.1.0

> Fix FileScan equality check
> ---
>
> Key: SPARK-34756
> URL: https://issues.apache.org/jira/browse/SPARK-34756
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Peter Toth
>Priority: Major
>
> `&&` is missing from `FileScan.equals()`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34756) Fix FileScan equality check

2021-03-16 Thread Peter Toth (Jira)
Peter Toth created SPARK-34756:
--

 Summary: Fix FileScan equality check
 Key: SPARK-34756
 URL: https://issues.apache.org/jira/browse/SPARK-34756
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.0.2, 3.0.1, 3.0.0
Reporter: Peter Toth


`&&` is missing from `FileScan.equals()`.
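
A minimal illustration of the bug class (simplified, not the actual FileScan 
code): without the `&&` the first comparison becomes a discarded statement and 
only the last expression is returned, so two scans that differ in `path` can 
still compare as equal.

{code:scala}
case class Scan(path: String, schema: String) {
  def equalsBuggy(other: Scan): Boolean = {
    path == other.path // value discarded because the `&&` is missing
    schema == other.schema
  }

  def equalsFixed(other: Scan): Boolean =
    path == other.path && schema == other.schema
}

val a = Scan("/data/t1", "struct<a:int>")
val b = Scan("/data/t2", "struct<a:int>")
assert(a.equalsBuggy(b))   // wrongly true: the path difference is ignored
assert(!a.equalsFixed(b))  // correct
{code}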



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34726) Fix collectToPython timeouts

2021-03-12 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-34726:
---
Affects Version/s: (was: 3.1.1)
   2.4.7

> Fix collectToPython timeouts
> 
>
> Key: SPARK-34726
> URL: https://issues.apache.org/jira/browse/SPARK-34726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34726) Fix collectToPython timeouts

2021-03-12 Thread Peter Toth (Jira)
Peter Toth created SPARK-34726:
--

 Summary: Fix collectToPython timeouts
 Key: SPARK-34726
 URL: https://issues.apache.org/jira/browse/SPARK-34726
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Peter Toth






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30655) Update WebUI Bootstrap to 4.4.1

2021-03-09 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298005#comment-17298005
 ] 

Peter Toth commented on SPARK-30655:


[~d.clarke], it looks like you already fixed this in 
[https://github.com/apache/spark/pull/27370|https://github.com/apache/spark/pull/27370].
 Shall we close this ticket?

> Update WebUI Bootstrap to 4.4.1
> ---
>
> Key: SPARK-30655
> URL: https://issues.apache.org/jira/browse/SPARK-30655
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Dale Clarke
>Priority: Major
>
> Spark is using an older version of Bootstrap (v. 2.3.2) for the Web UI pages. 
>  Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to 
> EOL in July 2019 ([https://github.com/twbs/release]). Older versions of 
> Bootstrap are also getting flagged in security scans for various CVEs:
>  * [https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889]
>  * [https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700]
>  * [https://snyk.io/vuln/npm:bootstrap:20180529]
>  * [https://snyk.io/vuln/npm:bootstrap:20160627]
> I haven't validated each CVE, but it would probably be good practice to 
> resolve any potential issues and get on a supported release.
> The bad news is that there have been quite a few changes between Bootstrap 2 
> and Bootstrap 4.  I've tried updating the library, refactoring/tweaking the 
> CSS and JS to maintain a similar appearance and functionality, and testing 
> the documentation.  As with the ticket created for the outdated Bootstrap 
> version in the docs (SPARK-30654), this is a fairly large change so I'm sure 
> additional testing and fixes will be needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query

2021-02-02 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277367#comment-17277367
 ] 

Peter Toth commented on SPARK-24497:


Thanks [~ilaurens] for your comment. Recursive queries are indeed very useful for 
processing hierarchical structures.

I will try to update my PRs this week, but the problem is the lack of reviews...

> ANSI SQL: Recursive query
> -
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
> represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |    |
> --      |    +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32041) Exchange reuse won't work in cases when DPP, subqueries are involved

2021-01-26 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272102#comment-17272102
 ] 

Peter Toth commented on SPARK-32041:


Let me reopen this ticket as it is not a duplicate of SPARK-29375 but rather a bug 
in its own right. The connection between this ticket, SPARK-29375 and SPARK-28940 
is that my PR (https://github.com/apache/spark/pull/28885) fixes all of them.

> Exchange reuse won't work in cases when DPP, subqueries are involved
> 
>
> Key: SPARK-32041
> URL: https://issues.apache.org/jira/browse/SPARK-32041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Prakhar Jain
>Priority: Major
>
> When an Exchange node is repeated at multiple places in the PhysicalPlan, and 
> if that exchange has a DPP subquery filter, then ReuseExchange 
> doesn't work for such an Exchange and different stages are launched to compute 
> the same thing.
> Example:
> {noformat}
> // generate data
> val factData = (1 to 100).map(i => (i%5, i%20, i))
> factData.toDF("store_id", "product_id", "units_sold")
>   .write
>   .partitionBy("store_id")
>   .format("parquet")
>   .saveAsTable("fact_stats")
> val dimData = Seq[(Int, String, String)](
>   (1, "AU", "US"),
>   (2, "CA", "US"),
>   (3, "KA", "IN"),
>   (4, "DL", "IN"),
>   (5, "GA", "PA"))
> dimData.toDF("store_id", "state_province", "country")
>   .write
>   .format("parquet")
>   .saveAsTable("dim_stats")
> sql("ANALYZE TABLE fact_stats COMPUTE STATISTICS FOR COLUMNS store_id")
> sql("ANALYZE TABLE dim_stats COMPUTE STATISTICS FOR COLUMNS store_id")
> // Set Configs
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=true")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=1000")
> val query = """
> With view1 as (
>   SELECT product_id, f.store_id
>   FROM fact_stats f JOIN dim_stats
>   ON f.store_id = dim_stats.store_id WHERE dim_stats.country = 'IN')
> SELECT * FROM view1 v1 join view1 v2 WHERE v1.product_id = v2.product_id
> """
> val df = spark.sql(query)
> println(df.queryExecution.executedPlan)
> {noformat}
> {noformat}
> Plan:
>  *(7) SortMergeJoin [product_id#1968|#1968], [product_id#2060|#2060], Inner
>  :- *(3) Sort [product_id#1968 ASC NULLS FIRST|#1968 ASC NULLS FIRST], false, > 0
>  : +- Exchange hashpartitioning(product_id#1968, 5), true, [id=#1140|#1140]
>  : +- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970]
>  : +- *(2) BroadcastHashJoin [store_id#1970|#1970], [store_id#1971|#1971], 
> Inner, BuildRight
>  : :- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970]
>  : : +- *(2) Filter isnotnull(product_id#1968)
>  : : +- *(2) ColumnarToRow
>  : : +- FileScan parquet 
> default.fact_stats[product_id#1968,store_id#1970|#1968,store_id#1970] 
> Batched: true, DataFilters: [isnotnull(product_id#1968)|#1968)], Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql...,
>  PartitionFilters: [isnotnull(store_id#1970), 
> dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)|#1970), 
> dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)], 
> PushedFilters: [IsNotNull(product_id)], ReadSchema: struct
>  : : +- SubqueryBroadcast dynamicpruning#2067, 0, [store_id#1971|#1971], 
> [id=#1131|#1131]
>  : : +- ReusedExchange [store_id#1971|#1971], BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), 
> [id=#1021|#1021]
>  : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> true] as bigint))), [id=#1021|#1021]
>  : +- *(1) Project [store_id#1971|#1971]
>  : +- *(1) Filter ((isnotnull(country#1973) AND (country#1973 = IN)) AND 
> isnotnull(store_id#1971))
>  : +- *(1) ColumnarToRow
>  : +- FileScan parquet 
> default.dim_stats[store_id#1971,country#1973|#1971,country#1973] Batched: 
> true, DataFilters: [isnotnull(country#1973), (country#1973 = IN), 
> isnotnull(store_id#1971)|#1973), (country#1973 = IN), 
> isnotnull(store_id#1971)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(country), 
> EqualTo(country,IN), IsNotNull(store_id)], ReadSchema: 
> struct
>  +- *(6) Sort [product_id#2060 ASC NULLS FIRST|#2060 ASC NULLS FIRST], false, > 0
>  +- ReusedExchange [product_id#2060, store_id#2062|#2060, store_id#2062], 
> Exchange hashpartitioning(product_id#1968, 5), true, [id=#1026|#1026]
> {noformat}
> Issue:
>  Note the last line of plan. Its a ReusedExchange which is pointing to 
> id=1026. But There is no Exchange node in plan with ID 1026. ReusedExchange 
> node is pointing to incorrect 

[jira] [Reopened] (SPARK-32041) Exchange reuse won't work in cases when DPP, subqueries are involved

2021-01-26 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth reopened SPARK-32041:


> Exchange reuse won't work in cases when DPP, subqueries are involved
> 
>
> Key: SPARK-32041
> URL: https://issues.apache.org/jira/browse/SPARK-32041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Prakhar Jain
>Priority: Major
>
> When an Exchange node is repeated at multiple places in the PhysicalPlan, and 
> if that exchange has a DPP subquery filter, then ReuseExchange 
> doesn't work for such an Exchange and different stages are launched to compute 
> the same thing.
> Example:
> {noformat}
> // generate data
> val factData = (1 to 100).map(i => (i%5, i%20, i))
> factData.toDF("store_id", "product_id", "units_sold")
>   .write
>   .partitionBy("store_id")
>   .format("parquet")
>   .saveAsTable("fact_stats")
> val dimData = Seq[(Int, String, String)](
>   (1, "AU", "US"),
>   (2, "CA", "US"),
>   (3, "KA", "IN"),
>   (4, "DL", "IN"),
>   (5, "GA", "PA"))
> dimData.toDF("store_id", "state_province", "country")
>   .write
>   .format("parquet")
>   .saveAsTable("dim_stats")
> sql("ANALYZE TABLE fact_stats COMPUTE STATISTICS FOR COLUMNS store_id")
> sql("ANALYZE TABLE dim_stats COMPUTE STATISTICS FOR COLUMNS store_id")
> // Set Configs
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=true")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=1000")
> val query = """
> With view1 as (
>   SELECT product_id, f.store_id
>   FROM fact_stats f JOIN dim_stats
>   ON f.store_id = dim_stats.store_id WHERE dim_stats.country = 'IN')
> SELECT * FROM view1 v1 join view1 v2 WHERE v1.product_id = v2.product_id
> """
> val df = spark.sql(query)
> println(df.queryExecution.executedPlan)
> {noformat}
> {noformat}
> Plan:
>  *(7) SortMergeJoin [product_id#1968|#1968], [product_id#2060|#2060], Inner
>  :- *(3) Sort [product_id#1968 ASC NULLS FIRST|#1968 ASC NULLS FIRST], false, > 0
>  : +- Exchange hashpartitioning(product_id#1968, 5), true, [id=#1140|#1140]
>  : +- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970]
>  : +- *(2) BroadcastHashJoin [store_id#1970|#1970], [store_id#1971|#1971], 
> Inner, BuildRight
>  : :- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970]
>  : : +- *(2) Filter isnotnull(product_id#1968)
>  : : +- *(2) ColumnarToRow
>  : : +- FileScan parquet 
> default.fact_stats[product_id#1968,store_id#1970|#1968,store_id#1970] 
> Batched: true, DataFilters: [isnotnull(product_id#1968)|#1968)], Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql...,
>  PartitionFilters: [isnotnull(store_id#1970), 
> dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)|#1970), 
> dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)], 
> PushedFilters: [IsNotNull(product_id)], ReadSchema: struct
>  : : +- SubqueryBroadcast dynamicpruning#2067, 0, [store_id#1971|#1971], 
> [id=#1131|#1131]
>  : : +- ReusedExchange [store_id#1971|#1971], BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), 
> [id=#1021|#1021]
>  : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> true] as bigint))), [id=#1021|#1021]
>  : +- *(1) Project [store_id#1971|#1971]
>  : +- *(1) Filter ((isnotnull(country#1973) AND (country#1973 = IN)) AND 
> isnotnull(store_id#1971))
>  : +- *(1) ColumnarToRow
>  : +- FileScan parquet 
> default.dim_stats[store_id#1971,country#1973|#1971,country#1973] Batched: 
> true, DataFilters: [isnotnull(country#1973), (country#1973 = IN), 
> isnotnull(store_id#1971)|#1973), (country#1973 = IN), 
> isnotnull(store_id#1971)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(country), 
> EqualTo(country,IN), IsNotNull(store_id)], ReadSchema: 
> struct
>  +- *(6) Sort [product_id#2060 ASC NULLS FIRST|#2060 ASC NULLS FIRST], false, > 0
>  +- ReusedExchange [product_id#2060, store_id#2062|#2060, store_id#2062], 
> Exchange hashpartitioning(product_id#1968, 5), true, [id=#1026|#1026]
> {noformat}
> Issue:
>  Note the last line of plan. Its a ReusedExchange which is pointing to 
> id=1026. But There is no Exchange node in plan with ID 1026. ReusedExchange 
> node is pointing to incorrect Child node (1026 instead of 1140) and so in 
> actual, exchange reuse won't happen in this query.
> Another query where issue is because of ReuseSubquery:
> {noformat}
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> val query1 = """
>   | With 
