[jira] [Created] (SPARK-43211) Remove Hadoop2 support in IsolatedClientLoader
Cheng Pan created SPARK-43211:

Summary: Remove Hadoop2 support in IsolatedClientLoader
Key: SPARK-43211
URL: https://issues.apache.org/jira/browse/SPARK-43211
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.5.0
Reporter: Cheng Pan
[jira] [Created] (SPARK-43210) Introduce PySparkAssertionError
Haejoon Lee created SPARK-43210:

Summary: Introduce PySparkAssertionError
Key: SPARK-43210
URL: https://issues.apache.org/jira/browse/SPARK-43210
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Introduce PySparkAssertionError.
[jira] [Created] (SPARK-43209) Migrate Expression errors into error class
Haejoon Lee created SPARK-43209:

Summary: Migrate Expression errors into error class
Key: SPARK-43209
URL: https://issues.apache.org/jira/browse/SPARK-43209
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Migrate Expression errors into error class.
[jira] [Commented] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714416#comment-17714416 ]

Hyukjin Kwon commented on SPARK-42945:

Reverted at https://github.com/apache/spark/commit/09a43531d30346bb7c8d213822513dc35c70f82e

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark Connect.
[jira] [Reopened] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reopened SPARK-42945:

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
> Fix For: 3.5.0
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark Connect.
[jira] [Updated] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-42945:
Fix Version/s: (was: 3.5.0)

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark Connect.
[jira] [Created] (SPARK-43208) IsolatedClassLoader should close barrier class InputStream after reading
Cheng Pan created SPARK-43208:

Summary: IsolatedClassLoader should close barrier class InputStream after reading
Key: SPARK-43208
URL: https://issues.apache.org/jira/browse/SPARK-43208
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0
Reporter: Cheng Pan
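SPARK-43208 has no description, but the title points at a stream left open when the isolated client loader reads the bytes of a barrier class. A minimal sketch of the kind of fix the title suggests, assuming the bytes come from getResourceAsStream and commons-io as in Spark's IsolatedClientLoader; this is illustrative, not the actual patch:

{code:scala}
import java.io.InputStream
import org.apache.commons.io.IOUtils

// Read the whole .class resource, then close the stream instead of
// leaving it for the garbage collector (the leak the title describes).
def readClassBytes(in: InputStream): Array[Byte] = {
  try {
    IOUtils.toByteArray(in)
  } finally {
    in.close()
  }
}
{code}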
[jira] [Assigned] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
[ https://issues.apache.org/jira/browse/SPARK-43196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43196:
Assignee: Yang Jie

> Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
> Key: SPARK-43196
> URL: https://issues.apache.org/jira/browse/SPARK-43196
> Project: Spark
> Issue Type: Sub-task
> Components: YARN
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Minor
[jira] [Resolved] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
[ https://issues.apache.org/jira/browse/SPARK-43196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43196.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40855: https://github.com/apache/spark/pull/40855

> Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
> Key: SPARK-43196
> URL: https://issues.apache.org/jira/browse/SPARK-43196
> Project: Spark
> Issue Type: Sub-task
> Components: YARN
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Minor
> Fix For: 3.5.0
[jira] [Resolved] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext
[ https://issues.apache.org/jira/browse/SPARK-43191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43191.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40850: https://github.com/apache/spark/pull/40850

> Replace reflection w/ direct calling for Hadoop CallerContext
> Key: SPARK-43191
> URL: https://issues.apache.org/jira/browse/SPARK-43191
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext
[ https://issues.apache.org/jira/browse/SPARK-43191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43191:
Assignee: Cheng Pan

> Replace reflection w/ direct calling for Hadoop CallerContext
> Key: SPARK-43191
> URL: https://issues.apache.org/jira/browse/SPARK-43191
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
[jira] [Resolved] (SPARK-43200) Remove Hadoop 2 reference in docs
[ https://issues.apache.org/jira/browse/SPARK-43200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43200.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40857: https://github.com/apache/spark/pull/40857

> Remove Hadoop 2 reference in docs
> Key: SPARK-43200
> URL: https://issues.apache.org/jira/browse/SPARK-43200
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43200) Remove Hadoop 2 reference in docs
[ https://issues.apache.org/jira/browse/SPARK-43200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43200:
Assignee: Cheng Pan

> Remove Hadoop 2 reference in docs
> Key: SPARK-43200
> URL: https://issues.apache.org/jira/browse/SPARK-43200
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
[jira] [Commented] (SPARK-43197) Clean up the code written for compatibility with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-43197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714403#comment-17714403 ]

Sun Chao commented on SPARK-43197:

Thanks for the ping [~gurwls223]. Subscribed.

> Clean up the code written for compatibility with Hadoop 2
> Key: SPARK-43197
> URL: https://issues.apache.org/jira/browse/SPARK-43197
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, SQL, YARN
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Priority: Major
>
> SPARK-42452 removed support for Hadoop 2; we can clean up the code written for compatibility with Hadoop 2 to make it more concise.
[jira] [Resolved] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-43195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43195.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40854: https://github.com/apache/spark/pull/40854

> Remove unnecessary serializable wrapper in HadoopFSUtils
> Key: SPARK-43195
> URL: https://issues.apache.org/jira/browse/SPARK-43195
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-43195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43195:
Assignee: Cheng Pan

> Remove unnecessary serializable wrapper in HadoopFSUtils
> Key: SPARK-43195
> URL: https://issues.apache.org/jira/browse/SPARK-43195
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
[jira] [Resolved] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
[ https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif resolved SPARK-43112.
Resolution: Not A Bug

> Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
> Key: SPARK-43112
> URL: https://issues.apache.org/jira/browse/SPARK-43112
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Asif
> Priority: Major
>
> The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its output method implemented as:
>
> // The partition column should always appear after data columns.
> override def output: Seq[AttributeReference] = dataCols ++ partitionCols
>
> But the data-writing commands of Spark, like InsertIntoHiveDirCommand, expect the output from HiveTableRelation to be in the order in which the columns are actually defined in the DDL.
> As a result, multiple mismatch scenarios can happen:
> 1) A data-type casting exception is thrown, even though the DataFrame being inserted has a schema identical to the one used for creating the DDL.
> 2) The wrong column is used for partitioning if the data types are the same or castable, like date and long.
> A PR with a test reproducing the bug will follow.
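A sketch of the reported mismatch, reconstructed from the description above. It assumes a Hive-enabled session and that the unified CREATE TABLE syntax accepts a partition column declared mid-schema for the hive provider; the table and column names are illustrative, not taken from the report:

{code:scala}
// The partition column `dt` sits in the middle of the DDL, but
// HiveTableRelation.output moves it to the end (dataCols ++ partitionCols).
spark.sql("""
  CREATE TABLE t (id INT, dt DATE, value BIGINT)
  USING hive
  PARTITIONED BY (dt)
""")

// Prints id, value, dt -- not the DDL order (id, dt, value). A writer that
// assumes DDL order can bind `dt` to `value`: a cast error, or silent
// mis-partitioning when the types happen to be castable.
spark.table("t").printSchema()
{code}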
[jira] [Updated] (SPARK-43206) Streaming query exception() also include stack trace
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Liu updated SPARK-43206:
Description: [https://github.com/apache/spark/pull/40785#issuecomment-1515522281]

> Streaming query exception() also include stack trace
> Key: SPARK-43206
> URL: https://issues.apache.org/jira/browse/SPARK-43206
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Wei Liu
> Priority: Major
>
> [https://github.com/apache/spark/pull/40785#issuecomment-1515522281]
[jira] [Updated] (SPARK-43206) Streaming query exception() also include stack trace
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Liu updated SPARK-43206:
Epic Link: SPARK-42938

> Streaming query exception() also include stack trace
> Key: SPARK-43206
> URL: https://issues.apache.org/jira/browse/SPARK-43206
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Wei Liu
> Priority: Major
>
> [https://github.com/apache/spark/pull/40785#issuecomment-1515522281]
[jira] [Resolved] (SPARK-43167) Streaming Connect console output format support
[ https://issues.apache.org/jira/browse/SPARK-43167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Liu resolved SPARK-43167.
Resolution: Not A Problem

Automatically supported with the existing Connect implementation.

> Streaming Connect console output format support
> Key: SPARK-43167
> URL: https://issues.apache.org/jira/browse/SPARK-43167
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Wei Liu
> Priority: Major
[jira] [Updated] (SPARK-43206) Streaming query exception() also include stack trace
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Liu updated SPARK-43206:
Environment: (was: https://github.com/apache/spark/pull/40785#issuecomment-1515522281)

> Streaming query exception() also include stack trace
> Key: SPARK-43206
> URL: https://issues.apache.org/jira/browse/SPARK-43206
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Wei Liu
> Priority: Major
[jira] [Updated] (SPARK-43194) PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0
[ https://issues.apache.org/jira/browse/SPARK-43194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-43194:
Parent: SPARK-42618
Issue Type: Sub-task (was: Bug)

> PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0
> Key: SPARK-43194
> URL: https://issues.apache.org/jira/browse/SPARK-43194
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Environment:
> {code}
> In [4]: import pandas as pd
> In [5]: pd.__version__
> Out[5]: '2.0.0'
> In [6]: import pyspark as ps
> In [7]: ps.__version__
> Out[7]: '3.4.0'
> {code}
> Reporter: Phillip Cloud
> Priority: Major
>
> {code}
> In [1]: from pyspark.sql import SparkSession
> In [2]: session = SparkSession.builder.appName("test").getOrCreate()
> 23/04/19 09:21:42 WARN Utils: Your hostname, albatross resolves to a loopback address: 127.0.0.2; using 192.168.1.170 instead (on interface enp5s0)
> 23/04/19 09:21:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 23/04/19 09:21:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> In [3]: session.sql("select now()").toPandas()
> {code}
> Results in:
> {code}
> ...
> TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.
> {code}
[jira] [Commented] (SPARK-43189) No overload variant of "pandas_udf" matches argument type "str"
[ https://issues.apache.org/jira/browse/SPARK-43189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714351#comment-17714351 ]

Hyukjin Kwon commented on SPARK-43189:

[~ei-grad] are you interested in submitting a PR?

> No overload variant of "pandas_udf" matches argument type "str"
> Key: SPARK-43189
> URL: https://issues.apache.org/jira/browse/SPARK-43189
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Reporter: Andrew Grigorev
> Priority: Major
>
> h2. Issue
> Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{# type: ignore[call-overload]}}, but this is not an ideal solution.
> h2. Example
> Here's a code snippet taken from the [docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs] that triggers the error when mypy is enabled:
> {code:python}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
>
> @pandas_udf("col1 string, col2 long")
> def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
>     s3['col2'] = s1 + s2.str.len()
>     return s3 {code}
> Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it.
> h2. Proposed Solution
> We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark.
> Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors.
> h2. Impact
> By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs.
[jira] [Commented] (SPARK-43197) Clean up the code written for compatibility with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-43197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714350#comment-17714350 ]

Hyukjin Kwon commented on SPARK-43197:

cc [~sunchao] FYI

> Clean up the code written for compatibility with Hadoop 2
> Key: SPARK-43197
> URL: https://issues.apache.org/jira/browse/SPARK-43197
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, SQL, YARN
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Priority: Major
>
> SPARK-42452 removed support for Hadoop 2; we can clean up the code written for compatibility with Hadoop 2 to make it more concise.
[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-43201:
Component/s: SQL (was: Spark Core)

> Inconsistency between from_avro and from_json function
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Philip Adetiloye
> Priority: Major
>
> Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: String): Column {code}
> This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.
>
> Here is what I would expect:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
> code example:
> {code:java}
> import org.apache.spark.sql.functions.from_avro
>
> val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
>
> val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
>
> val df = Seq(
>   (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
>   (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
> ).toDF("binaryData", "schema")
>
> val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
> parsed.show()
>
> // Output:
> // +-------------+
> // |   parsedData|
> // +-------------+
> // |[apple1, 1.0]|
> // |[apple2, 2.0]|
> // +-------------+
> {code}
[jira] [Created] (SPARK-43207) Add helper functions for extracting values from literal expressions
Ruifeng Zheng created SPARK-43207:

Summary: Add helper functions for extracting values from literal expressions
Key: SPARK-43207
URL: https://issues.apache.org/jira/browse/SPARK-43207
Project: Spark
Issue Type: Improvement
Components: Connect
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng
[jira] [Created] (SPARK-43206) Streaming query exception() also include stack trace
Wei Liu created SPARK-43206:

Summary: Streaming query exception() also include stack trace
Key: SPARK-43206
URL: https://issues.apache.org/jira/browse/SPARK-43206
Project: Spark
Issue Type: Task
Components: Connect, Structured Streaming
Affects Versions: 3.5.0
Environment: https://github.com/apache/spark/pull/40785#issuecomment-1515522281
Reporter: Wei Liu
[jira] [Resolved] (SPARK-43129) Scala Core API for Streaming Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-43129.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40783: https://github.com/apache/spark/pull/40783

> Scala Core API for Streaming Spark Connect
> Key: SPARK-43129
> URL: https://issues.apache.org/jira/browse/SPARK-43129
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Raghu Angadi
> Assignee: Raghu Angadi
> Priority: Major
> Fix For: 3.5.0
>
> Scala client API for streaming Spark Connect.
[jira] [Assigned] (SPARK-43129) Scala Core API for Streaming Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-43129:
Assignee: Raghu Angadi

> Scala Core API for Streaming Spark Connect
> Key: SPARK-43129
> URL: https://issues.apache.org/jira/browse/SPARK-43129
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Raghu Angadi
> Assignee: Raghu Angadi
> Priority: Major
>
> Scala client API for streaming Spark Connect.
[jira] [Assigned] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-42945:
Assignee: Allison Wang

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark Connect.
[jira] [Resolved] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-42945.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40575: https://github.com/apache/spark/pull/40575

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
> Fix For: 3.5.0
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark Connect.
[jira] [Created] (SPARK-43205) Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier
Serge Rielau created SPARK-43205:

Summary: Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier
Key: SPARK-43205
URL: https://issues.apache.org/jira/browse/SPARK-43205
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 3.5.0
Reporter: Serge Rielau

There is a requirement for SQL templates, where the table and/or column names are provided through substitution. This can be done today using variable substitution:

SET hivevar:tabname = mytab;
SELECT * FROM ${hivevar:tabname};

A straight variable substitution is dangerous since it allows for SQL injection:

SET hivevar:tabname = mytab, someothertab;
SELECT * FROM ${hivevar:tabname};

A way to get around this problem is to wrap the variable substitution with a clause that limits its scope to producing an identifier. This approach is taken by Snowflake:
https://docs.snowflake.com/en/sql-reference/session-variables#using-variables-in-sql

SET hivevar:tabname = 'tabname';
SELECT * FROM IDENTIFIER(${hivevar:tabname});
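A short sketch of the contrast above, expressed as spark.sql calls. The IDENTIFIER() clause is the proposed behavior, not an API that existed at the time of filing, and the table names are illustrative:

{code:scala}
// Plain substitution splices raw text into the query, so a crafted value
// changes the query's shape -- the injection the description warns about:
val unsafe = "mytab, someothertab"
spark.sql(s"SELECT * FROM $unsafe") // now reads from two tables

// Under the proposal, the substituted string must resolve to a single
// identifier, so the value above would be rejected instead of spliced:
val name = "mytab"
spark.sql(s"SELECT * FROM IDENTIFIER('$name')")
{code}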
[jira] [Created] (SPARK-43204) Align MERGE assignments with table attributes
Anton Okolnychyi created SPARK-43204:

Summary: Align MERGE assignments with table attributes
Key: SPARK-43204
URL: https://issues.apache.org/jira/browse/SPARK-43204
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.5.0
Reporter: Anton Okolnychyi

Similar to SPARK-42151, we need to do the same for MERGE assignments.
[jira] [Created] (SPARK-43202) Replace reflection w/ direct calling for YARN Resource API
Cheng Pan created SPARK-43202:

Summary: Replace reflection w/ direct calling for YARN Resource API
Key: SPARK-43202
URL: https://issues.apache.org/jira/browse/SPARK-43202
Project: Spark
Issue Type: Sub-task
Components: YARN
Affects Versions: 3.5.0
Reporter: Cheng Pan
[jira] [Created] (SPARK-43203) Fix DROP table behavior in session catalog
Anton Okolnychyi created SPARK-43203:

Summary: Fix DROP table behavior in session catalog
Key: SPARK-43203
URL: https://issues.apache.org/jira/browse/SPARK-43203
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.0
Reporter: Anton Okolnychyi

DROP table behavior is not working correctly in 3.4.0 because we always invoke V1 drop logic if the identifier looks like a V1 identifier. This is a big blocker for external data sources that provide custom session catalogs. See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for details.
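A sketch of the setup the report concerns. The catalog implementation class below is a placeholder for an external data source's custom session catalog, not a real class; only the spark.sql.catalog.spark_catalog key itself is an actual Spark configuration:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical: plug a custom session catalog in as the V2 session catalog.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.spark_catalog", "com.example.MySessionCatalog")
  .getOrCreate()

// Per the report, a V1-looking identifier makes Spark 3.4.0 route this to
// the built-in V1 drop logic, bypassing the custom catalog configured above.
spark.sql("DROP TABLE default.t")
{code}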
[jira] [Updated] (SPARK-43203) Fix DROP table behavior in session catalog
[ https://issues.apache.org/jira/browse/SPARK-43203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Okolnychyi updated SPARK-43203:
Description:
DROP table behavior is not working correctly in 3.4.0 because we always invoke V1 drop logic if the identifier looks like a V1 identifier. This is a big blocker for external data sources that provide custom session catalogs. See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for details.

was:
DROP table behavior is not working correctly in 3.4.0 because we always invoke V1 drop logic if the identifier looks like a V1 identifier. This is a big blockers for external data sources that provide custom session catalogs. See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for details.

> Fix DROP table behavior in session catalog
> Key: SPARK-43203
> URL: https://issues.apache.org/jira/browse/SPARK-43203
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Anton Okolnychyi
> Priority: Major
>
> DROP table behavior is not working correctly in 3.4.0 because we always invoke V1 drop logic if the identifier looks like a V1 identifier. This is a big blocker for external data sources that provide custom session catalogs. See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for details.
[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Adetiloye updated SPARK-43201:
Description:
Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")

val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
{code}

was:
Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""val val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")

val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
{code}

> Inconsistency between from_avro and from_json function
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Philip Adetiloye
> Priority: Major
[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Adetiloye updated SPARK-43201:
Description:
Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")

val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
{code}

was:
Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")

val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
{code}

> Inconsistency between from_avro and from_json function
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Philip Adetiloye
> Priority: Major
[jira] [Created] (SPARK-43201) Inconsistency between from_avro and from_json function
Philip Adetiloye created SPARK-43201:

Summary: Inconsistency between from_avro and from_json function
Key: SPARK-43201
URL: https://issues.apache.org/jira/browse/SPARK-43201
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Philip Adetiloye

Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""val val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")

val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
{code}
[jira] [Created] (SPARK-43200) Remove Hadoop 2 reference in docs
Cheng Pan created SPARK-43200:

Summary: Remove Hadoop 2 reference in docs
Key: SPARK-43200
URL: https://issues.apache.org/jira/browse/SPARK-43200
Project: Spark
Issue Type: Sub-task
Components: Documentation
Affects Versions: 3.5.0
Reporter: Cheng Pan
[jira] [Resolved] (SPARK-43187) Remove workaround for MiniKdc's BindException
[ https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43187.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40849: https://github.com/apache/spark/pull/40849

> Remove workaround for MiniKdc's BindException
> Key: SPARK-43187
> URL: https://issues.apache.org/jira/browse/SPARK-43187
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43187) Remove workaround for MiniKdc's BindException
[ https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43187:
Assignee: Cheng Pan

> Remove workaround for MiniKdc's BindException
> Key: SPARK-43187
> URL: https://issues.apache.org/jira/browse/SPARK-43187
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
[jira] [Resolved] (SPARK-43186) Remove workaround for FileSinkDesc
[ https://issues.apache.org/jira/browse/SPARK-43186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43186.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40848: https://github.com/apache/spark/pull/40848

> Remove workaround for FileSinkDesc
> Key: SPARK-43186
> URL: https://issues.apache.org/jira/browse/SPARK-43186
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43186) Remove workaround for FileSinkDesc
[ https://issues.apache.org/jira/browse/SPARK-43186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43186:
Assignee: Cheng Pan

> Remove workaround for FileSinkDesc
> Key: SPARK-43186
> URL: https://issues.apache.org/jira/browse/SPARK-43186
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
[jira] [Commented] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714209#comment-17714209 ]

Hudson commented on SPARK-43142:

User 'rshkv' has created a pull request for this issue:
https://github.com/apache/spark/pull/40794

> DSL expressions fail on attribute with special characters
> Key: SPARK-43142
> URL: https://issues.apache.org/jira/browse/SPARK-43142
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Willi Raschkowski
> Priority: Major
>
> Expressions on implicitly converted attributes fail if the attributes have names containing special characters. They fail even if the attributes are backtick-quoted:
> {code:java}
> scala> import org.apache.spark.sql.catalyst.dsl.expressions._
> import org.apache.spark.sql.catalyst.dsl.expressions._
>
> scala> "`slashed/col`".attr
> res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = 'slashed/col
>
> scala> "`slashed/col`".attr.asc
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '/' expecting {<EOF>, '.', '-'}(line 1, pos 7)
>
> == SQL ==
> slashed/col
> -------^^^
> {code}
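A sketch of a workaround on the public Column API, not taken from the report or the linked PR: the public functions parse backtick-quoted names themselves, so sorting by the slashed column works there even while the catalyst DSL path fails.

{code:scala}
import org.apache.spark.sql.functions.col

// Build a column whose name contains a slash, then sort by it via the
// public API instead of the catalyst DSL helpers.
val df = spark.range(1).withColumnRenamed("id", "slashed/col")
df.sort(col("`slashed/col`").asc).show()
{code}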
[jira] [Commented] (SPARK-43124) Dataset.show should not trigger job execution on CommandResults
[ https://issues.apache.org/jira/browse/SPARK-43124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714211#comment-17714211 ]

Hudson commented on SPARK-43124:

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/40779

> Dataset.show should not trigger job execution on CommandResults
> Key: SPARK-43124
> URL: https://issues.apache.org/jira/browse/SPARK-43124
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Peter Toth
> Priority: Major
[jira] [Commented] (SPARK-43137) Improve ArrayInsert if the position is foldable and equals to zero.
[ https://issues.apache.org/jira/browse/SPARK-43137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714214#comment-17714214 ]

Hudson commented on SPARK-43137:

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/40789

> Improve ArrayInsert if the position is foldable and equals to zero.
> Key: SPARK-43137
> URL: https://issues.apache.org/jira/browse/SPARK-43137
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
> Fix For: 3.5.0
>
> We want to make array_prepend reuse the implementation of array_insert, but performance is somewhat worse if the position is foldable and equals zero. The reason is that the generated code always checks whether the position is negative or positive, and the code is too long. Code that is too long can cause JIT compilation to fail.
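For context, a usage sketch of the two functions the description relates (array positions are 1-based); this is illustrative and not taken from the issue, and the codegen remark in the comments paraphrases the description above:

{code:scala}
// array_prepend(arr, e) can be expressed as array_insert(arr, 1, e);
// with a foldable literal position, the generated code can skip the
// runtime negative/positive branch the description mentions.
spark.sql("SELECT array_insert(array(2, 3, 4), 1, 1)").show()  // [1, 2, 3, 4]
spark.sql("SELECT array_prepend(array(2, 3, 4), 1)").show()    // [1, 2, 3, 4]
{code}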
[jira] [Commented] (SPARK-43160) Remove typing.io namespace references as it is being removed
[ https://issues.apache.org/jira/browse/SPARK-43160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714212#comment-17714212 ]

Hudson commented on SPARK-43160:

User 'aimtsou' has created a pull request for this issue:
https://github.com/apache/spark/pull/40819

> Remove typing.io namespace references as it is being removed
> Key: SPARK-43160
> URL: https://issues.apache.org/jira/browse/SPARK-43160
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Reporter: Aimilios Tsouvelekakis
> Priority: Minor
>
> Python 3.11 gives a deprecation warning for the following:
> {code}
> /python/3.11.1/lib/python3.11/site-packages/pyspark/broadcast.py:38: DeprecationWarning: typing.io is deprecated, import directly from typing instead. typing.io will be removed in Python 3.12.
>   from typing.io import BinaryIO  # type: ignore[import]{code}
> The only reference comes from:
> {code}
> spark % git grep typing.io
> python/pyspark/broadcast.py:from typing.io import BinaryIO  # type: ignore[import] {code}
> I will fix the import so it does not cause any deprecation problem.
> This is documented in [1|https://bugs.python.org/issue35089], [2|https://docs.python.org/3/library/typing.html#typing.IO]
[jira] [Created] (SPARK-43199) Make InlineCTE idempotent
Peter Toth created SPARK-43199:

Summary: Make InlineCTE idempotent
Key: SPARK-43199
URL: https://issues.apache.org/jira/browse/SPARK-43199
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth
[jira] [Updated] (SPARK-43198) Fix "Could not initialise class ammonite..." error when using filter
[ https://issues.apache.org/jira/browse/SPARK-43198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venkata Sai Akhil Gudesa updated SPARK-43198:
Description:
When
{code:java}
spark.range(10).filter(n => n % 2 == 0).collectAsList(){code}
is run in the ammonite REPL (Spark Connect), the following error is thrown:
{noformat}
io.grpc.StatusRuntimeException: UNKNOWN: ammonite/repl/ReplBridge$
  io.grpc.Status.asRuntimeException(Status.java:535)
  io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
  org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:62)
  org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:114)
  org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:131)
  org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2687)
  org.apache.spark.sql.Dataset.withResult(Dataset.scala:3088)
  org.apache.spark.sql.Dataset.collect(Dataset.scala:2686)
  org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2700)
  ammonite.$sess.cmd0$.(cmd0.sc:1)
  ammonite.$sess.cmd0$.(cmd0.sc){noformat}

was:
When `spark.range(10).filter(n => n % 2 == 0).collectAsList()` is run in the ammonite REPL (Spark Connect), the following error is thrown:
```
io.grpc.StatusRuntimeException: UNKNOWN: ammonite/repl/ReplBridge$
  io.grpc.Status.asRuntimeException(Status.java:535)
  io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
  org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:62)
  org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:114)
  org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:131)
  org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2687)
  org.apache.spark.sql.Dataset.withResult(Dataset.scala:3088)
  org.apache.spark.sql.Dataset.collect(Dataset.scala:2686)
  org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2700)
  ammonite.$sess.cmd0$.(cmd0.sc:1)
  ammonite.$sess.cmd0$.(cmd0.sc)
```

> Fix "Could not initialise class ammonite..." error when using filter
> Key: SPARK-43198
> URL: https://issues.apache.org/jira/browse/SPARK-43198
> Project: Spark
> Issue Type: Bug
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Venkata Sai Akhil Gudesa
> Priority: Major
[jira] [Created] (SPARK-43198) Fix "Could not initialise class ammonite..." error when using filter
Venkata Sai Akhil Gudesa created SPARK-43198: Summary: Fix "Could not initialise class ammonite..." error when using filter Key: SPARK-43198 URL: https://issues.apache.org/jira/browse/SPARK-43198 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: Venkata Sai Akhil Gudesa When `spark.range(10).filter(n => n % 2 == 0).collectAsList()` is run in the ammonite REPL (Spark Connect), the following error is thrown: ``` io.grpc.StatusRuntimeException: UNKNOWN: ammonite/repl/ReplBridge$ io.grpc.Status.asRuntimeException(Status.java:535) io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:62) org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:114) org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:131) org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2687) org.apache.spark.sql.Dataset.withResult(Dataset.scala:3088) org.apache.spark.sql.Dataset.collect(Dataset.scala:2686) org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2700) ammonite.$sess.cmd0$.(cmd0.sc:1) ammonite.$sess.cmd0$.(cmd0.sc) ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
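For readers hitting this before a fix lands, a possible workaround sketch (not the project's fix): expressing the predicate as a Column instead of a Scala lambda sends the filter as an unresolved expression rather than a serialized closure, so the Connect server never needs the Ammonite-generated classes. The snippet assumes a Spark Connect session named {{spark}}:
{code:scala}
import org.apache.spark.sql.functions.col

// Lambda version (fails in the Ammonite REPL as reported above):
//   spark.range(10).filter(n => n % 2 == 0).collectAsList()

// Column-expression version: no client-side closure has to be shipped,
// so ammonite.repl.ReplBridge is never required on the server.
spark.range(10).filter(col("id") % 2 === 0).collectAsList()
{code}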
[jira] [Commented] (SPARK-43179) Add option for applications to control saving of metadata in the External Shuffle Service LevelDB
[ https://issues.apache.org/jira/browse/SPARK-43179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714149#comment-17714149 ] Ignite TC Bot commented on SPARK-43179: --- User 'otterc' has created a pull request for this issue: https://github.com/apache/spark/pull/40843 > Add option for applications to control saving of metadata in the External > Shuffle Service LevelDB > - > > Key: SPARK-43179 > URL: https://issues.apache.org/jira/browse/SPARK-43179 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.4.0 >Reporter: Chandni Singh >Priority: Major > > Currently, the External Shuffle Service stores application metadata in > LevelDB. This is necessary to enable the shuffle server to resume serving > shuffle data for an application whose executors registered before the > NodeManager restarts. However, the metadata includes the application secret, > which is stored in LevelDB without encryption. This is a potential security > risk, particularly for applications with high security requirements. While > filesystem access control lists (ACLs) can help protect keys and > certificates, they may not be sufficient for some use cases. In response, we > have decided not to store metadata for these high-security applications in > LevelDB. As a result, these applications may experience more failures in the > event of a node restart, but we believe this trade-off is acceptable given > the increased security risk. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
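As a rough application-side sketch of the opt-out this issue proposes (the config key below is a placeholder guess, not a name confirmed by the issue; check the linked pull request for the actual one):
{code:scala}
import org.apache.spark.SparkConf

// Hypothetical opt-out for high-security applications: skip persisting app
// metadata (including the secret) in the shuffle service's LevelDB, at the
// cost of losing shuffle-state recovery across NodeManager restarts.
val conf = new SparkConf()
  .setAppName("high-security-app")
  .set("spark.yarn.shuffle.server.recovery.disabled", "true") // placeholder key
{code}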
[jira] [Assigned] (SPARK-43137) Improve ArrayInsert if the position is foldable and equals to zero.
[ https://issues.apache.org/jira/browse/SPARK-43137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-43137: --- Assignee: jiaan.geng > Improve ArrayInsert if the position is foldable and equals to zero. > --- > > Key: SPARK-43137 > URL: https://issues.apache.org/jira/browse/SPARK-43137 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > We want to make array_prepend reuse the implementation of array_insert, but > performance is somewhat worse when the position is foldable and equals zero. > The reason is that the generated code always checks whether the position is > negative or positive, which makes it too long, and overly long code causes > JIT compilation to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43137) Improve ArrayInsert if the position is foldable and equals to zero.
[ https://issues.apache.org/jira/browse/SPARK-43137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43137. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40833 [https://github.com/apache/spark/pull/40833] > Improve ArrayInsert if the position is foldable and equals to zero. > --- > > Key: SPARK-43137 > URL: https://issues.apache.org/jira/browse/SPARK-43137 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.5.0 > > > We want to make array_prepend reuse the implementation of array_insert, but > performance is somewhat worse when the position is foldable and equals zero. > The reason is that the generated code always checks whether the position is > negative or positive, which makes it too long, and overly long code causes > JIT compilation to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
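For context, a minimal sketch of the operator this issue optimizes, assuming a live SparkSession named {{spark}} on Spark 3.4+; the optimization described above applies when the position argument is a literal (foldable), so its sign is known at planning time:
{code:scala}
// array_insert uses 1-based positions; a literal position lets codegen skip
// the runtime negative-vs-positive branch that the issue says bloats the
// generated code and defeats JIT compilation.
val df = spark.sql("SELECT array_insert(array(1, 2, 3), 1, 0) AS arr")
df.collect() // first row holds [0, 1, 2, 3], i.e. the element is prepended,
             // which is exactly what array_prepend wants to reuse
{code}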
[jira] [Assigned] (SPARK-37829) An outer-join using joinWith on DataFrames returns Rows with null fields instead of null values
[ https://issues.apache.org/jira/browse/SPARK-37829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37829: --- Assignee: Jason Xu > An outer-join using joinWith on DataFrames returns Rows with null fields > instead of null values > --- > > Key: SPARK-37829 > URL: https://issues.apache.org/jira/browse/SPARK-37829 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Clément de Groc >Assignee: Jason Xu >Priority: Major > Fix For: 3.4.1, 3.5.0 > > > Doing an outer-join using {{joinWith}} on {{{}DataFrame{}}}s used to return > missing values as {{null}} in Spark 2.4.8, but returns them as {{Rows}} with > {{null}} values in Spark 3+. > The issue can be reproduced with [the following > test|https://github.com/cdegroc/spark/commit/79f4d6a1ec6c69b10b72dbc8f92ab6490d5ef5e5] > that succeeds on Spark 2.4.8 but fails starting from Spark 3.0.0. > The problem only arises when working with DataFrames: Datasets of case > classes work as expected as demonstrated by [this other > test|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L1200-L1223]. > I couldn't find an explanation for this change in the Migration guide so I'm > assuming this is a bug. > A {{git bisect}} pointed me to [that > commit|https://github.com/apache/spark/commit/cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59]. > Reverting the commit solves the problem. > A similar solution, but without reverting, is shown > [here|https://github.com/cdegroc/spark/commit/684c675bf070876a475a9b225f6c2f92edce4c8a]. > Happy to help if you think of another approach / can provide some guidance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37829) An outer-join using joinWith on DataFrames returns Rows with null fields instead of null values
[ https://issues.apache.org/jira/browse/SPARK-37829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37829. - Fix Version/s: 3.5.0 3.4.1 Resolution: Fixed Issue resolved by pull request 40755 [https://github.com/apache/spark/pull/40755] > An outer-join using joinWith on DataFrames returns Rows with null fields > instead of null values > --- > > Key: SPARK-37829 > URL: https://issues.apache.org/jira/browse/SPARK-37829 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Clément de Groc >Priority: Major > Fix For: 3.5.0, 3.4.1 > > > Doing an outer-join using {{joinWith}} on {{{}DataFrame{}}}s used to return > missing values as {{null}} in Spark 2.4.8, but returns them as {{Rows}} with > {{null}} values in Spark 3+. > The issue can be reproduced with [the following > test|https://github.com/cdegroc/spark/commit/79f4d6a1ec6c69b10b72dbc8f92ab6490d5ef5e5] > that succeeds on Spark 2.4.8 but fails starting from Spark 3.0.0. > The problem only arises when working with DataFrames: Datasets of case > classes work as expected as demonstrated by [this other > test|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L1200-L1223]. > I couldn't find an explanation for this change in the Migration guide so I'm > assuming this is a bug. > A {{git bisect}} pointed me to [that > commit|https://github.com/apache/spark/commit/cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59]. > Reverting the commit solves the problem. > A similar solution, but without reverting, is shown > [here|https://github.com/cdegroc/spark/commit/684c675bf070876a475a9b225f6c2f92edce4c8a]. > Happy to help if you think of another approach / can provide some guidance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
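A small sketch of the behavior under discussion, assuming a local SparkSession; the expected output is the Spark 2.4.8 semantics that this fix restores:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("joinWith-nulls").getOrCreate()
import spark.implicits._

val left = Seq((1, "a"), (2, "b")).toDF("id", "v")
val right = Seq((1, "x")).toDF("id", "w")

// Full outer joinWith: id = 2 has no matching right row.
// Expected (2.4.8 semantics, restored by this fix): the right element is null.
// Buggy behavior (Spark 3.0.0 through 3.4.0): a Row with all-null fields.
left.joinWith(right, left("id") === right("id"), "full_outer")
  .collect()
  .foreach(println)
{code}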
[jira] [Updated] (SPARK-43184) Resume using enumeration to compare `NodeState.DECOMMISSIONING`
[ https://issues.apache.org/jira/browse/SPARK-43184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43184: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Resume using enumeration to compare `NodeState.DECOMMISSIONING` > > > Key: SPARK-43184 > URL: https://issues.apache.org/jira/browse/SPARK-43184 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43185) Inline `hadoop-client` related properties in `pom.xml`
[ https://issues.apache.org/jira/browse/SPARK-43185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43185: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Inline `hadoop-client` related properties in `pom.xml` > -- > > Key: SPARK-43185 > URL: https://issues.apache.org/jira/browse/SPARK-43185 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43186) Remove workaround for FileSinkDesc
[ https://issues.apache.org/jira/browse/SPARK-43186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43186: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Remove workaround for FileSinkDesc > -- > > Key: SPARK-43186 > URL: https://issues.apache.org/jira/browse/SPARK-43186 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43187) Remove workaround for MiniKdc's BindException
[ https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43187: - Parent: SPARK-43197 Issue Type: Sub-task (was: Test) > Remove workaround for MiniKdc's BindException > - > > Key: SPARK-43187 > URL: https://issues.apache.org/jira/browse/SPARK-43187 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext
[ https://issues.apache.org/jira/browse/SPARK-43191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43191: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Replace reflection w/ direct calling for Hadoop CallerContext > -- > > Key: SPARK-43191 > URL: https://issues.apache.org/jira/browse/SPARK-43191 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43193) Remove workaround for HADOOP-12074
[ https://issues.apache.org/jira/browse/SPARK-43193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43193: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Remove workaround for HADOOP-12074 > -- > > Key: SPARK-43193 > URL: https://issues.apache.org/jira/browse/SPARK-43193 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-43195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43195: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Remove unnecessary serializable wrapper in HadoopFSUtils > > > Key: SPARK-43195 > URL: https://issues.apache.org/jira/browse/SPARK-43195 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
[ https://issues.apache.org/jira/browse/SPARK-43196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43196: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Replace reflection w/ direct calling for > `ContainerLaunchContext#setTokensConf` > --- > > Key: SPARK-43196 > URL: https://issues.apache.org/jira/browse/SPARK-43196 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714114#comment-17714114 ] Peter Toth edited comment on SPARK-24497 at 4/19/23 2:00 PM: - I've opened a new PR: https://github.com/apache/spark/pull/40744 to support recursive SQL, but for some reason it didn't get automatically linked here. [~gurwls223], you might know what went wrong... was (Author: petertoth): I've opened a new PR: https://github.com/apache/spark/pull/40093 to support recursive SQL, but for some reason it didn't get automatically linked here. [~gurwls223], you might know what went wrong... > ANSI SQL: Recursive query > - > > Key: SPARK-24497 > URL: https://issues.apache.org/jira/browse/SPARK-24497 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > h3. *Examples* > Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" > represents the structure of an organization as an adjacency list. > {code:sql} > CREATE TABLE department ( > id INTEGER PRIMARY KEY, -- department ID > parent_department INTEGER REFERENCES department, -- upper department ID > name TEXT -- department name > ); > INSERT INTO department (id, parent_department, "name") > VALUES > (0, NULL, 'ROOT'), > (1, 0, 'A'), > (2, 1, 'B'), > (3, 2, 'C'), > (4, 2, 'D'), > (5, 0, 'E'), > (6, 4, 'F'), > (7, 5, 'G'); > -- department structure represented here is as follows: > -- > -- ROOT-+->A-+->B-+->C > -- | | > -- | +->D-+->F > -- +->E-+->G > {code} > > To extract all departments under A, you can use the following recursive > query: > {code:sql} > WITH RECURSIVE subdepartment AS > ( > -- non-recursive term > SELECT * FROM department WHERE name = 'A' > UNION ALL > -- recursive term > SELECT d.* > FROM > department AS d > JOIN > subdepartment AS sd > ON (d.parent_department = sd.id) > ) > SELECT * > FROM subdepartment > ORDER BY name; > {code} > More details: > [http://wiki.postgresql.org/wiki/CTEReadme] > [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43197) Clean up the code written for compatibility with Hadoop 2
Yang Jie created SPARK-43197: Summary: Clean up the code written for compatibility with Hadoop 2 Key: SPARK-43197 URL: https://issues.apache.org/jira/browse/SPARK-43197 Project: Spark Issue Type: Umbrella Components: Spark Core, SQL, YARN Affects Versions: 3.5.0 Reporter: Yang Jie SPARK-42452 removed support for Hadoop 2, so we can clean up the code written for compatibility with Hadoop 2 and make it more concise. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714114#comment-17714114 ] Peter Toth commented on SPARK-24497: I've opened a new PR: https://github.com/apache/spark/pull/40093 to support recursive SQL, but for some reason it didn't get automatically linked here. [~gurwls223], you might know what went wrong... > ANSI SQL: Recursive query > - > > Key: SPARK-24497 > URL: https://issues.apache.org/jira/browse/SPARK-24497 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > h3. *Examples* > Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" > represents the structure of an organization as an adjacency list. > {code:sql} > CREATE TABLE department ( > id INTEGER PRIMARY KEY, -- department ID > parent_department INTEGER REFERENCES department, -- upper department ID > name TEXT -- department name > ); > INSERT INTO department (id, parent_department, "name") > VALUES > (0, NULL, 'ROOT'), > (1, 0, 'A'), > (2, 1, 'B'), > (3, 2, 'C'), > (4, 2, 'D'), > (5, 0, 'E'), > (6, 4, 'F'), > (7, 5, 'G'); > -- department structure represented here is as follows: > -- > -- ROOT-+->A-+->B-+->C > -- | | > -- | +->D-+->F > -- +->E-+->G > {code} > > To extract all departments under A, you can use the following recursive > query: > {code:sql} > WITH RECURSIVE subdepartment AS > ( > -- non-recursive term > SELECT * FROM department WHERE name = 'A' > UNION ALL > -- recursive term > SELECT d.* > FROM > department AS d > JOIN > subdepartment AS sd > ON (d.parent_department = sd.id) > ) > SELECT * > FROM subdepartment > ORDER BY name; > {code} > More details: > [http://wiki.postgresql.org/wiki/CTEReadme] > [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
Yang Jie created SPARK-43196: Summary: Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf` Key: SPARK-43196 URL: https://issues.apache.org/jira/browse/SPARK-43196 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
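To illustrate the general before/after pattern behind this change (the trait below is a hypothetical stand-in for org.apache.hadoop.yarn.api.records.ContainerLaunchContext, modeling only the one relevant method; it is not the real Spark code):
{code:scala}
import java.nio.ByteBuffer

// Hypothetical stand-in for the YARN API surface.
trait LaunchContext { def setTokensConf(conf: ByteBuffer): Unit }

// Before: while Hadoop 2 (which lacks setTokensConf) was still supported,
// the method had to be looked up reflectively so the code compiled everywhere.
def setViaReflection(ctx: LaunchContext, buf: ByteBuffer): Unit = {
  val m = ctx.getClass.getMethod("setTokensConf", classOf[ByteBuffer])
  m.invoke(ctx, buf)
}

// After: with Hadoop 2 support removed, the compile-time dependency is
// always Hadoop 3, so a plain call is safe and simpler.
def setDirectly(ctx: LaunchContext, buf: ByteBuffer): Unit =
  ctx.setTokensConf(buf)
{code}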
[jira] [Created] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils
Cheng Pan created SPARK-43195: - Summary: Remove unnecessary serializable wrapper in HadoopFSUtils Key: SPARK-43195 URL: https://issues.apache.org/jira/browse/SPARK-43195 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43194) PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0
Phillip Cloud created SPARK-43194: - Summary: PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0 Key: SPARK-43194 URL: https://issues.apache.org/jira/browse/SPARK-43194 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.4.0 Environment: {code} In [4]: import pandas as pd In [5]: pd.__version__ Out[5]: '2.0.0' In [6]: import pyspark as ps In [7]: ps.__version__ Out[7]: '3.4.0' {code} Reporter: Phillip Cloud {code} In [1]: from pyspark.sql import SparkSession In [2]: session = SparkSession.builder.appName("test").getOrCreate() 23/04/19 09:21:42 WARN Utils: Your hostname, albatross resolves to a loopback address: 127.0.0.2; using 192.168.1.170 instead (on interface enp5s0) 23/04/19 09:21:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/04/19 09:21:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable In [3]: session.sql("select now()").toPandas() {code} Results in: {code} ... TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43193) Remove workaround for HADOOP-12074
Cheng Pan created SPARK-43193: - Summary: Remove workaround for HADOOP-12074 Key: SPARK-43193 URL: https://issues.apache.org/jira/browse/SPARK-43193 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext
Cheng Pan created SPARK-43191: - Summary: Replace reflection w/ direct calling for Hadoop CallerContext Key: SPARK-43191 URL: https://issues.apache.org/jira/browse/SPARK-43191 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43192) Spark Connect's user agent validations are too restrictive
Niranjan Jayakar created SPARK-43192: Summary: Spark Connect's user agent validations are too restrictive Key: SPARK-43192 URL: https://issues.apache.org/jira/browse/SPARK-43192 Project: Spark Issue Type: Bug Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Niranjan Jayakar The current restrictions on the allowed charset and length are too restrictive: https://github.com/apache/spark/blob/cac6f58318bb84d532f02d245a50d3c66daa3e4b/python/pyspark/sql/connect/client.py#L274-L275 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43190) ListQuery.childOutput should be consistent with child output
Wenchen Fan created SPARK-43190: --- Summary: ListQuery.childOutput should be consistent with child output Key: SPARK-43190 URL: https://issues.apache.org/jira/browse/SPARK-43190 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43189) No overload variant of "pandas_udf" matches argument type "str"
[ https://issues.apache.org/jira/browse/SPARK-43189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Grigorev updated SPARK-43189: Description: h2. Issue Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{{}# type: ignore[call-overload]{}}}, but this is not an ideal solution. h2. Example Here's a code snippet taken from [docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs] that triggers the error when mypy is enabled: {code:python} from pyspark.sql.functions import pandas_udf import pandas as pd @pandas_udf("col1 string, col2 long") def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame: s3['col2'] = s1 + s2.str.len() return s3 {code} Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it. h2. Proposed Solution We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark. Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors. h2. Impact By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs. was: h2. Issue Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{{}# type: ignore[call-overload]{}}}, but this is not an ideal solution. h2. Example Here's a code snippet that triggers the error when mypy is enabled: {code:python} from pyspark.sql.functions import pandas_udf import pandas as pd @pandas_udf("string") def f(s: pd.Series) -> pd.Series: return pd.Series(["a"]*len(s), index=s.index) {code} Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it. h2. Proposed Solution We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark. Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors. h2. Impact By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. 
This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs. > No overload variant of "pandas_udf" matches argument type "str" > --- > > Key: SPARK-43189 > URL: https://issues.apache.org/jira/browse/SPARK-43189 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.4, 3.3.2, 3.4.0 >Reporter: Andrew Grigorev >Priority: Major > > h2. Issue > Users who have mypy enabled in their IDE or CI environment face very verbose > error messages when using the {{pandas_udf}} function in PySpark. The current > typing of the {{pandas_udf}} function seems to be causing these issues. As a > workaround, the official documentation provides examples that use {{{}# type: > ignore[call-overload]{}}}, but this is not an ideal solution. > h2. Example > Here's a code snippet taken from > [docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs] > that triggers the error when mypy is enabled: > {code:python} > from pyspark.sql.functions import pandas_udf > import pandas as pd > @pandas_udf("col1 string, col2 long") > def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
[jira] [Created] (SPARK-43189) No overload variant of "pandas_udf" matches argument type "str"
Andrew Grigorev created SPARK-43189: --- Summary: No overload variant of "pandas_udf" matches argument type "str" Key: SPARK-43189 URL: https://issues.apache.org/jira/browse/SPARK-43189 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0, 3.3.2, 3.2.4 Reporter: Andrew Grigorev h2. Issue Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{{}# type: ignore[call-overload]{}}}, but this is not an ideal solution. h2. Example Here's a code snippet that triggers the error when mypy is enabled: {code:python} from pyspark.sql.functions import pandas_udf import pandas as pd @pandas_udf("string") def f(s: pd.Series) -> pd.Series: return pd.Series(["a"]*len(s), index=s.index) {code} Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it. h2. Proposed Solution We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark. Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors. h2. Impact By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43188) Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-43188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas PHUNG updated SPARK-43188: -- Description: Hello, I have an issue with Spark 3.3.2 & Spark 3.4.0 to write into Azure Data Lake Storage Gen2 (abfs/abfss scheme). I've got the following errors: {code:java} warn 13:12:47.554: StdErr from Kernel Process 23/04/19 13:12:47 ERROR FileFormatWriter: Aborting job 6a75949c-1473-4445-b8ab-d125be3f0f21.org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1) (myhost executor driver): org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for datablock-0001- at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:462) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:165) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) at org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.createTmpFileForWrite(DataBlocks.java:980) at org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.create(DataBlocks.java:960) at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.createBlockIfNeeded(AbfsOutputStream.java:262) at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.(AbfsOutputStream.java:173) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.createFile(AzureBlobFileSystemStore.java:580) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.create(AzureBlobFileSystem.java:301) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175) at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:347) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:314) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:480) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:36) at org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$$anon$1.newInstance(ParquetUtils.scala:490) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:389) at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:
[jira] [Created] (SPARK-43188) Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2
Nicolas PHUNG created SPARK-43188: - Summary: Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2 Key: SPARK-43188 URL: https://issues.apache.org/jira/browse/SPARK-43188 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 3.4.0, 3.3.2 Reporter: Nicolas PHUNG Hello, I have an issue with Spark 3.3.2 & Spark 3.4.0 to write into Azure Data Lake Storage Gen2 (abfs/abfss scheme). I've got the following errors: {code:java} warn 13:12:47.554: StdErr from Kernel Process 23/04/19 13:12:47 ERROR FileFormatWriter: Aborting job 6a75949c-1473-4445-b8ab-d125be3f0f21.org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1) (FR07258024L.dsk.eur.msd.world.socgen executor driver): org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for datablock-0001- at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:462) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:165) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) at org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.createTmpFileForWrite(DataBlocks.java:980) at org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.create(DataBlocks.java:960) at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.createBlockIfNeeded(AbfsOutputStream.java:262) at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.(AbfsOutputStream.java:173) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.createFile(AzureBlobFileSystemStore.java:580) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.create(AzureBlobFileSystem.java:301) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175) at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:347) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:314) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:480) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:36) at org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$$anon$1.newInstance(ParquetUtils.scala:490) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:389) at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGSche
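The stack trace points at hadoop-azure's disk-backed upload buffer (DataBlocks$DiskBlockFactory) failing to find a usable local directory. One mitigation sketch, assuming the ABFS block-buffer option shipped on the Hadoop 3.3 line; verify the key and its supported values against your Hadoop version before relying on it:
{code:scala}
import org.apache.spark.sql.SparkSession

// Assumption: fs.azure.data.blocks.buffer selects how ABFS buffers upload
// blocks ("disk" is the default; "bytebuffer" or "array" keep blocks in
// memory and avoid the failing LocalDirAllocator lookup seen above).
val spark = SparkSession.builder()
  .appName("abfss-write")
  .config("spark.hadoop.fs.azure.data.blocks.buffer", "bytebuffer")
  .getOrCreate()
{code}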
[jira] [Commented] (SPARK-43187) Remove workaround for MiniKdc's BindException
[ https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714071#comment-17714071 ] Ignite TC Bot commented on SPARK-43187: --- User 'pan3793' has created a pull request for this issue: https://github.com/apache/spark/pull/40849 > Remove workaround for MiniKdc's BindException > - > > Key: SPARK-43187 > URL: https://issues.apache.org/jira/browse/SPARK-43187 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43187) Remove workaround for MiniKdc's BindException
Cheng Pan created SPARK-43187: - Summary: Remove workaround for MiniKdc's BindException Key: SPARK-43187 URL: https://issues.apache.org/jira/browse/SPARK-43187 Project: Spark Issue Type: Test Components: Tests Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43186) Remove workaround for FileSinkDesc
Cheng Pan created SPARK-43186: - Summary: Remove workaround for FileSinkDesc Key: SPARK-43186 URL: https://issues.apache.org/jira/browse/SPARK-43186 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43185) Inline `hadoop-client` related properties in `pom.xml`
Yang Jie created SPARK-43185: Summary: Inline `hadoop-client` related properties in `pom.xml` Key: SPARK-43185 URL: https://issues.apache.org/jira/browse/SPARK-43185 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43176) Deduplicate imports in Connect Tests
[ https://issues.apache.org/jira/browse/SPARK-43176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43176. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40839 [https://github.com/apache/spark/pull/40839] > Deduplicate imports in Connect Tests > > > Key: SPARK-43176 > URL: https://issues.apache.org/jira/browse/SPARK-43176 > Project: Spark > Issue Type: Test > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43176) Deduplicate imports in Connect Tests
[ https://issues.apache.org/jira/browse/SPARK-43176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43176: Assignee: Ruifeng Zheng > Deduplicate imports in Connect Tests > > > Key: SPARK-43176 > URL: https://issues.apache.org/jira/browse/SPARK-43176 > Project: Spark > Issue Type: Test > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43184) Resume using enumeration to compare `NodeState.DECOMMISSIONING`
Yang Jie created SPARK-43184: Summary: Resume using enumeration to compare `NodeState.DECOMMISSIONING` Key: SPARK-43184 URL: https://issues.apache.org/jira/browse/SPARK-43184 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43183) Move update event on idleness in streaming query listener to separate callback method
Jungtaek Lim created SPARK-43183: Summary: Move update event on idleness in streaming query listener to separate callback method Key: SPARK-43183 URL: https://issues.apache.org/jira/browse/SPARK-43183 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Jungtaek Lim People have been having a lot of confusion about the update event on idleness; it is not only a matter of understanding but also a source of various complaints. For example, since we attach the latest batch ID to the update event on idleness, a listener implementation that blindly performs an upsert keyed on batch ID risks losing metrics. It also complicates the logic, because we have to remember the execution of the previous batch, which is arguably unnecessary. Because of this, we would do better to move the idle event out of the progress update event and give it a dedicated callback method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
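A sketch of what the proposed separation could look like from a listener author's point of view; the onQueryIdle callback shown in the comment is illustrative of the proposal, not a released API:
{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class MetricsListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  // After the proposed change, this fires only for real batch progress, so an
  // upsert keyed on batchId can no longer be clobbered by idle-time updates.
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch ${p.batchId}: ${p.numInputRows} rows")
  }

  // Proposed dedicated callback for idleness (illustrative only):
  // def onQueryIdle(event: QueryIdleEvent): Unit = ()

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}
{code}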
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Description: When we test AE in Spark3.4.0 with the following case, we find If we disable AE or enable Ae but disable skewJoin, the sql will finish in 20s, but if we enable AE and enable skewJoin,it will take very long time. The test case: {code:java} ###uncompress the part-m-***.zip attachment, and put these files under '/tmp/spark-warehouse/data/' dir. create table source_aqe(c1 int,c18 string) using csv options(path 'file:///tmp/spark-warehouse/data/'); create table hive_snappy_aqe_table1(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table1 partition(c18=1)select c1 from source_aqe; insert into table hive_snappy_aqe_table1 partition(c18=2)select c1 from source_aqe limit 12; insert into table hive_snappy_aqe_table1 partition(c18=3)select c1 from source_aqe limit 15;create table hive_snappy_aqe_table2(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table2 partition(c18=1)select c1 from source_aqe limit 16; insert into table hive_snappy_aqe_table2 partition(c18=2)select c1 from source_aqe limit 12;create table hive_snappy_aqe_table3(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table3 partition(c18=1)select c1 from source_aqe limit 16; insert into table hive_snappy_aqe_table3 partition(c18=2)select c1 from source_aqe limit 12; set spark.sql.adaptive.enabled=false; set spark.sql.adaptive.forceOptimizeSkewedJoin = false; set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1; set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB; set spark.sql.adaptive.advisoryPartitionSizeInBytes=100KB; set spark.sql.autoBroadcastJoinThreshold = 51200; ###it will finish in 20s select * from hive_snappy_aqe_table1 join hive_snappy_aqe_table2 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table2.c18 join hive_snappy_aqe_table3 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table3.c18 limit 10; set spark.sql.adaptive.enabled=true; set spark.sql.adaptive.forceOptimizeSkewedJoin = true; set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1; set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB; set spark.sql.adaptive.advisoryPartitionSizeInBytes=100KB; set spark.sql.autoBroadcastJoinThreshold = 51200; ###it will take very long time select * from hive_snappy_aqe_table1 join hive_snappy_aqe_table2 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table2.c18 join hive_snappy_aqe_table3 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table3.c18 limit 10; {code} was: When we test AE in Spark3.4.0 with the following case, we find If we disable AE or enable Ae but disable skewJoin, the sql will finish in 20s, but if we enable AE and enable skewJoin,it will take very long time. The test case: {code:java} ###uncompress the data.zip, and put files under '/tmp/spark-warehouse/data/' dir. 
create table source_aqe(c1 int,c18 string) using csv options(path 'file:///tmp/spark-warehouse/data/'); create table hive_snappy_aqe_table1(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table1 partition(c18=1)select c1 from source_aqe; insert into table hive_snappy_aqe_table1 partition(c18=2)select c1 from source_aqe limit 12; insert into table hive_snappy_aqe_table1 partition(c18=3)select c1 from source_aqe limit 15;create table hive_snappy_aqe_table2(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table2 partition(c18=1)select c1 from source_aqe limit 16; insert into table hive_snappy_aqe_table2 partition(c18=2)select c1 from source_aqe limit 12;create table hive_snappy_aqe_table3(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table3 partition(c18=1)select c1 from source_aqe limit 16; insert into table hive_snappy_aqe_table3 partition(c18=2)select c1 from source_aqe limit 12; set spark.sql.adaptive.enabled=false; set spark.sql.adaptive.forceOptimizeSkewedJoin = false; set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1; set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB; set spark.sql.adaptive.advisoryPartitionSizeInBytes=100KB; set spark.sql.autoBroadcastJoinThreshold = 51200; ###it will finish in 20s select * from hive_snappy_aqe_table1 join hive_snappy_aqe_table2 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table2.c18 join hive_snappy_aqe_table3 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table3.c18 limit 10; set spark.sql.adaptive.enabled=true; set spark.sql.adaptive.forceOptimizeSkewedJoin = true; set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1; set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB; set spark.sql.adaptive.advisoryPartitionSize
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Attachment: part-m-9.zip part-m-8.zip part-m-7.zip part-m-6.zip part-m-5.zip part-m-4.zip part-m-3.zip part-m-2.zip part-m-00016.zip part-m-00015.zip part-m-00014.zip part-m-00013.zip part-m-00012.zip part-m-00011.zip part-m-00010.zip part-m-1.zip part-m-0.zip part-m-00019.zip part-m-00018.zip part-m-00017.zip
> Multiple tables join with limit when AE is enabled and one table is skewed
> --
>
> Key: SPARK-43182
> URL: https://issues.apache.org/jira/browse/SPARK-43182
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Liu Shuo
> Priority: Critical
> Attachments: part-m-0.zip, part-m-1.zip, part-m-2.zip, part-m-3.zip, part-m-4.zip, part-m-5.zip, part-m-6.zip, part-m-7.zip, part-m-8.zip, part-m-9.zip, part-m-00010.zip, part-m-00011.zip, part-m-00012.zip, part-m-00013.zip, part-m-00014.zip, part-m-00015.zip, part-m-00016.zip, part-m-00017.zip, part-m-00018.zip, part-m-00019.zip
>
> When we test AE (adaptive query execution) in Spark 3.4.0 with the following case, we find that if we disable AE, or enable AE but disable skewJoin, the SQL finishes in 20s; but if we enable both AE and skewJoin, it takes a very long time.
> The test case:
> {code:sql}
> ### uncompress data.zip and put the files under the '/tmp/spark-warehouse/data/' dir.
> create table source_aqe(c1 int, c18 string) using csv options(path 'file:///tmp/spark-warehouse/data/');
> create table hive_snappy_aqe_table1(c1 int) stored as PARQUET partitioned by(c18 string);
> insert into table hive_snappy_aqe_table1 partition(c18=1) select c1 from source_aqe;
> insert into table hive_snappy_aqe_table1 partition(c18=2) select c1 from source_aqe limit 12;
> insert into table hive_snappy_aqe_table1 partition(c18=3) select c1 from source_aqe limit 15;
> create table hive_snappy_aqe_table2(c1 int) stored as PARQUET partitioned by(c18 string);
> insert into table hive_snappy_aqe_table2 partition(c18=1) select c1 from source_aqe limit 16;
> insert into table hive_snappy_aqe_table2 partition(c18=2) select c1 from source_aqe limit 12;
> create table hive_snappy_aqe_table3(c1 int) stored as PARQUET partitioned by(c18 string);
> insert into table hive_snappy_aqe_table3 partition(c18=1) select c1 from source_aqe limit 16;
> insert into table hive_snappy_aqe_table3 partition(c18=2) select c1 from source_aqe limit 12;
> set spark.sql.adaptive.enabled=false;
> set spark.sql.adaptive.forceOptimizeSkewedJoin=false;
> set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1;
> set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB;
> set spark.sql.adaptive.advisoryPartitionSizeInBytes=100KB;
> set spark.sql.autoBroadcastJoinThreshold=51200;
> ### it finishes in 20s
> select * from hive_snappy_aqe_table1 join hive_snappy_aqe_table2 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table2.c18 join hive_snappy_aqe_table3 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table3.c18 limit 10;
> set spark.sql.adaptive.enabled=true;
> set spark.sql.adaptive.forceOptimizeSkewedJoin=true;
> set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1;
> set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB;
> set spark.sql.adaptive.advisoryPartitionSizeInBytes=100KB;
> set spark.sql.autoBroadcastJoinThreshold=51200;
> ### it takes a very long time
> select * from hive_snappy_aqe_table1 join hive_snappy_aqe_table2 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table2.c18 join hive_snappy_aqe_table3 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table3.c18 limit 10;
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
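The repro pins the slowdown to the combination of AE plus skew-join handling; with AE on but skew-join handling off, the reporter's measurements show the query finishing quickly. As a hedged illustration only (a workaround sketch, not a fix recorded in this ticket), the same toggle can be applied programmatically; the sketch assumes an existing SparkSession and the tables created by the repro script above:
{code:scala}
// Workaround sketch for the behavior reported above: keep AQE enabled but
// disable skew-join splitting, the combination the reporter measured as fast.
// Assumes the hive_snappy_aqe_table* tables from the repro script exist.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-43182-workaround")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
// Disabling skew-join handling sidesteps the pathological path described above.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "false")
spark.conf.set("spark.sql.adaptive.forceOptimizeSkewedJoin", "false")

spark.sql(
  """SELECT * FROM hive_snappy_aqe_table1
    |JOIN hive_snappy_aqe_table2
    |  ON hive_snappy_aqe_table1.c18 = hive_snappy_aqe_table2.c18
    |JOIN hive_snappy_aqe_table3
    |  ON hive_snappy_aqe_table1.c18 = hive_snappy_aqe_table3.c18
    |LIMIT 10""".stripMargin).show()
{code}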
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Attachment: (was: part-m-0.zip)
> Multiple tables join with limit when AE is enabled and one table is skewed
> --
>
> Key: SPARK-43182
> URL: https://issues.apache.org/jira/browse/SPARK-43182
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Liu Shuo
> Priority: Critical
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43181) spark-sql console should display the Spark Web UI address
[ https://issues.apache.org/jira/browse/SPARK-43181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713991#comment-17713991 ] ASF GitHub Bot commented on SPARK-43181: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40844
> spark-sql console should display the Spark Web UI address
> --
>
> Key: SPARK-43181
> URL: https://issues.apache.org/jira/browse/SPARK-43181
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: BingKun Pan
> Priority: Minor
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
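Until a change like the linked PR lands, the address the console would print is already exposed on SparkContext; a minimal sketch, assuming a live session (uiWebUrl is the existing public accessor and returns None when the UI is disabled):
{code:scala}
// Print the Spark Web UI address for the current session.
// SparkContext.uiWebUrl is an Option[String]; it is None when
// spark.ui.enabled=false.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("show-web-ui-address")
  .master("local[*]")
  .getOrCreate()

spark.sparkContext.uiWebUrl match {
  case Some(url) => println(s"Spark Web UI available at $url")
  case None      => println("Spark Web UI is disabled")
}
{code}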
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Attachment: part-m-0.zip
> Multiple tables join with limit when AE is enabled and one table is skewed
> --
>
> Key: SPARK-43182
> URL: https://issues.apache.org/jira/browse/SPARK-43182
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Liu Shuo
> Priority: Critical
> Attachments: part-m-0.zip
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42869) cannot analyze window expression on subquery
[ https://issues.apache.org/jira/browse/SPARK-42869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713987#comment-17713987 ] GuangWeiHong commented on SPARK-42869: -- OK, thanks
> cannot analyze window expression on subquery
> --
>
> Key: SPARK-42869
> URL: https://issues.apache.org/jira/browse/SPARK-42869
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: GuangWeiHong
> Priority: Major
> Attachments: image-2023-03-20-18-00-40-578.png, image-2023-04-17-19-06-28-069.png, image-2023-04-17-19-09-41-485.png
>
> CREATE TABLE test_noindex_table(`name` STRING, `age` INT, `city` STRING) PARTITIONED BY (`date` STRING);
>
> SELECT *
> FROM (
>   SELECT *, COUNT(1) OVER itr AS grp_size
>   FROM test_noindex_table
>   WINDOW itr AS (PARTITION BY city)
> ) tbl
> WINDOW itr2 AS (PARTITION BY city)
>
> This fails with: Window specification itr is not defined in the WINDOW clause.
> !image-2023-03-20-18-00-40-578.png|width=560,height=361!
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
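One way to sidestep the resolution error, offered here only as a suggestion rather than anything recorded in the ticket, is to inline the window specification at the point of use instead of naming it in a subquery-level WINDOW clause:
{code:scala}
// Workaround sketch: inline the window spec so nothing has to be resolved
// against a WINDOW clause defined at a different query level.
// Assumes a SparkSession `spark` and the test_noindex_table from the report.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-42869-workaround")
  .master("local[*]")
  .getOrCreate()

spark.sql(
  """SELECT *
    |FROM (
    |  SELECT *, COUNT(1) OVER (PARTITION BY city) AS grp_size
    |  FROM test_noindex_table
    |) tbl""".stripMargin).show()
{code}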
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Summary: Multiple tables join with limit when AE is enabled and one table is skewed (was: Mutiple tables join with limit when AE is enabled and one table is skewed)
> Multiple tables join with limit when AE is enabled and one table is skewed
> --
>
> Key: SPARK-43182
> URL: https://issues.apache.org/jira/browse/SPARK-43182
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Liu Shuo
> Priority: Critical
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Summary: Multiple tables join with limit when AE is enabled and one table is skewed (was: 3 tables join with limit when AE is enabled and one table is skewed)
> Multiple tables join with limit when AE is enabled and one table is skewed
> --
>
> Key: SPARK-43182
> URL: https://issues.apache.org/jira/browse/SPARK-43182
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Liu Shuo
> Priority: Critical
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org