[jira] [Updated] (SPARK-44347) Upgrade janino to 3.1.10

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44347:
---
Labels: pull-request-available  (was: )

> Upgrade janino to 3.1.10
> 
>
> Key: SPARK-44347
> URL: https://issues.apache.org/jira/browse/SPARK-44347
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45023) SPIP: Python Stored Procedures

2023-10-18 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776963#comment-17776963
 ] 

Abhinav Kumar commented on SPARK-45023:
---

Not sure where we are with this; it looks like we are not making progress. I do 
see value in a SQL-based stored procedure (to begin with, just grouped SQL 
statements): the user can reveal the intent of usage and Spark can optimize 
holistically. Should we discuss and modify the proposal accordingly? Please suggest.

> SPIP: Python Stored Procedures
> --
>
> Key: SPARK-45023
> URL: https://issues.apache.org/jira/browse/SPARK-45023
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Stored procedures are an extension of the ANSI SQL standard. They play a 
> crucial role in improving the capabilities of SQL by encapsulating complex 
> logic into reusable routines. 
> This proposal aims to extend Spark SQL by introducing support for stored 
> procedures, starting with Python as the procedural language. This addition 
> will allow users to execute procedural programs, leveraging programming 
> constructs of Python to perform tasks with complex logic. Additionally, users 
> can persist these procedural routines in catalogs such as HMS for future 
> reuse. By providing this functionality, we intend to empower Spark users to 
> seamlessly integrate Python routines into their SQL workflows.
> {*}SPIP{*}: 
> [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing]
>  
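
For a concrete picture of the kind of routine the SPIP targets, here is a minimal, purely illustrative sketch: a Python function encapsulating multi-step logic against a SparkSession. The routine and table names are assumptions, and the actual syntax for persisting and invoking procedures is defined in the SPIP, not here.

{code:python}
from pyspark.sql import SparkSession

def top_customers(spark: SparkSession, table: str, n: int):
    """A reusable procedural routine: multi-step logic that the SPIP would let
    users register in a catalog (e.g. HMS) and call from SQL."""
    orders = spark.table(table)
    totals = orders.groupBy("customer_id").sum("amount")
    return totals.orderBy(totals["sum(amount)"].desc()).limit(n)

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    # Hypothetical usage: one call hides the multi-step logic above.
    top_customers(spark, "sales.orders", 10).show()
{code}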



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45600) Separate the Python data source logic from DataFrameReader

2023-10-18 Thread Allison Wang (Jira)
Allison Wang created SPARK-45600:


 Summary: Separate the Python data source logic from DataFrameReader
 Key: SPARK-45600
 URL: https://issues.apache.org/jira/browse/SPARK-45600
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Currently, we have added a few instance variables to DataFrameReader to store 
information for the Python data source reader. We should introduce a dedicated 
reader class for Python data sources to keep DataFrameReader clean.
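
As a rough sketch of the refactoring direction (the class and attribute names below are hypothetical, not the actual Spark code):

{code:python}
class PythonDataSourceReader:
    """Hypothetical dedicated holder for the state that currently lives as
    loose instance variables on DataFrameReader."""

    def __init__(self, data_source_cls, options, user_specified_schema=None):
        self.data_source_cls = data_source_cls
        self.options = dict(options)
        self.user_specified_schema = user_specified_schema

    def load(self):
        # The real implementation would plan and execute the Python data source;
        # this sketch only shows where that logic would move to.
        raise NotImplementedError


class DataFrameReader:
    """Sketch: DataFrameReader delegates instead of carrying the fields itself."""

    def python_reader(self, data_source_cls, options):
        return PythonDataSourceReader(data_source_cls, options)
{code}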



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45585) Fix time format and redirection issues in SparkSubmit tests

2023-10-18 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-45585.
--
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43421
[https://github.com/apache/spark/pull/43421]

> Fix time format and redirection issues in SparkSubmit tests
> ---
>
> Key: SPARK-45585
> URL: https://issues.apache.org/jira/browse/SPARK-45585
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45585) Fix time format and redirection issues in SparkSubmit tests

2023-10-18 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-45585:


Assignee: Kent Yao

> Fix time format and redirection issues in SparkSubmit tests
> ---
>
> Key: SPARK-45585
> URL: https://issues.apache.org/jira/browse/SPARK-45585
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-45546) Make publish-snapshot support package first then deploy

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-45546:
--

Reverted at 
https://github.com/apache/spark/commit/706872d4de2374d1faf84d8706611a092c0b6e76 
and 
https://github.com/apache/spark/commit/e37dd3ab8e0707eead2cb068bc19456349ccdd86

> Make publish-snapshot support package first then deploy
> ---
>
> Key: SPARK-45546
> URL: https://issues.apache.org/jira/browse/SPARK-45546
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45546) Make publish-snapshot support package first then deploy

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45546.
--
Resolution: Invalid

> Make publish-snapshot support package first then deploy
> ---
>
> Key: SPARK-45546
> URL: https://issues.apache.org/jira/browse/SPARK-45546
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45546) Make publish-snapshot support package first then deploy

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45546:


Assignee: (was: Yang Jie)

> Make publish-snapshot support package first then deploy
> ---
>
> Key: SPARK-45546
> URL: https://issues.apache.org/jira/browse/SPARK-45546
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45546) Make publish-snapshot support package first then deploy

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-45546:
-
Fix Version/s: (was: 4.0.0)

> Make publish-snapshot support package first then deploy
> ---
>
> Key: SPARK-45546
> URL: https://issues.apache.org/jira/browse/SPARK-45546
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45603) merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry

2023-10-18 Thread Kent Yao (Jira)
Kent Yao created SPARK-45603:


 Summary: merge_spark_pr shall notice us about GITHUB_OAUTH_KEY 
expiry
 Key: SPARK-45603
 URL: https://issues.apache.org/jira/browse/SPARK-45603
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45553) Deprecate assertPandasOnSparkEqual

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45553.
--
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43426
[https://github.com/apache/spark/pull/43426]

> Deprecate assertPandasOnSparkEqual
> --
>
> Key: SPARK-45553
> URL: https://issues.apache.org/jira/browse/SPARK-45553
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> We will add new APIs for DataFrame, Series and Index separately, and we 
> should deprecate assertPandasOnSparkEqual.
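
A minimal sketch of the usual deprecation pattern (the warning text and the replacement API names here are assumptions; the actual change is in the linked PR):

{code:python}
import functools
import warnings

def deprecated(replacement: str):
    """Decorator that emits a FutureWarning pointing users at the replacement API."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated; use {replacement} instead.",
                FutureWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return inner
    return wrap

@deprecated("the dedicated DataFrame/Series/Index assertion utilities")
def assertPandasOnSparkEqual(actual, expected):
    # Stand-in body for illustration; the real function lives in pyspark.testing.
    assert actual == expected
{code}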



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45553) Deprecate assertPandasOnSparkEqual

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45553:


Assignee: Haejoon Lee

> Deprecate assertPandasOnSparkEqual
> --
>
> Key: SPARK-45553
> URL: https://issues.apache.org/jira/browse/SPARK-45553
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> We will add new APIs for DataFrame, Series and Index separately, and we 
> should deprecate assertPandasOnSparkEqual.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45603) merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45603:
---
Labels: pull-request-available  (was: )

> merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry
> 
>
> Key: SPARK-45603
> URL: https://issues.apache.org/jira/browse/SPARK-45603
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45588) Minor scaladoc improvement in StreamingForeachBatchHelper

2023-10-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-45588:


Assignee: Raghu Angadi

> Minor scaladoc improvement in StreamingForeachBatchHelper
> -
>
> Key: SPARK-45588
> URL: https://issues.apache.org/jira/browse/SPARK-45588
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Trivial
>  Labels: pull-request-available
>
> Document RunnerCleaner in StreamingForeachBatchHelper.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45588) Minor scaladoc improvement in StreamingForeachBatchHelper

2023-10-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-45588.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43424
[https://github.com/apache/spark/pull/43424]

> Minor scaladoc improvement in StreamingForeachBatchHelper
> -
>
> Key: SPARK-45588
> URL: https://issues.apache.org/jira/browse/SPARK-45588
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Document RunnerCleaner in StreamingForeachBatchHelper.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45589) Supplementary exception class

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45589.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43427
[https://github.com/apache/spark/pull/43427]

> Supplementary exception class
> -
>
> Key: SPARK-45589
> URL: https://issues.apache.org/jira/browse/SPARK-45589
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45589) Supplementary exception class

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45589:


Assignee: BingKun Pan

> Supplementary exception class
> -
>
> Key: SPARK-45589
> URL: https://issues.apache.org/jira/browse/SPARK-45589
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45220) Refine docstring of `DataFrame.join`

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45220:


Assignee: Allison Wang

> Refine docstring of `DataFrame.join`
> 
>
> Key: SPARK-45220
> URL: https://issues.apache.org/jira/browse/SPARK-45220
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Refine the docstring of `DataFrame.join`.
> The examples should also include: left join, left anti join, join on multiple 
> columns and column names, and join on multiple conditions.
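
For reference, a short runnable sketch (assuming a local SparkSession; the data is made up) covering the variants listed above:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
emp = spark.createDataFrame(
    [(1, "US", "Alice"), (2, "US", "Bob"), (3, "EU", "Cara")],
    ["dept_id", "region", "name"])
dept = spark.createDataFrame(
    [(1, "US", "Sales"), (2, "EU", "Eng")],
    ["dept_id", "region", "dept_name"])

# Left join: keep every employee, with department info where available.
emp.join(dept, on="dept_id", how="left").show()

# Left anti join: employees whose dept_id has no match in `dept`.
emp.join(dept, on="dept_id", how="left_anti").show()

# Join on multiple column names.
emp.join(dept, on=["dept_id", "region"]).show()

# Join on multiple conditions expressed as a boolean column.
cond = (emp.dept_id == dept.dept_id) & (emp.region == dept.region)
emp.join(dept, on=cond, how="inner").show()
{code}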



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45220) Refine docstring of `DataFrame.join`

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45220.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43039
[https://github.com/apache/spark/pull/43039]

> Refine docstring of `DataFrame.join`
> 
>
> Key: SPARK-45220
> URL: https://issues.apache.org/jira/browse/SPARK-45220
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Refine the docstring of `DataFrame.join`.
> The examples should also include: left join, left anti join, join on multiple 
> columns and column names, and join on multiple conditions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs

2023-10-18 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1043#comment-1043
 ] 

BingKun Pan commented on SPARK-44734:
-

Okay, let me take a look at the two JIRAs above first.

> Add documentation for type casting rules in Python UDFs/UDTFs
> -
>
> Key: SPARK-44734
> URL: https://issues.apache.org/jira/browse/SPARK-44734
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> In addition to type mappings between Spark data types and Python data types 
> (SPARK-44733), we should add the type casting rules for regular and 
> arrow-optimized Python UDFs/UDTFs. 
> We currently have this table in code:
>  * Arrow: 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329]
>  * Python UDF: 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116]
> We should add a proper documentation page for the type casting rules. 
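
As a concrete illustration of why these rules need a documentation page, here is a small sketch (assuming Spark 3.5+ for the `useArrow` flag) in which a UDF declares `IntegerType` but the Python function returns a float; the regular and arrow-optimized paths can cast the result differently, per the tables linked above:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.range(3)

# Both UDFs declare IntegerType but actually return a Python float.
plain = udf(lambda x: x + 0.5, IntegerType())                 # regular (pickled) UDF
arrow = udf(lambda x: x + 0.5, IntegerType(), useArrow=True)  # arrow-optimized UDF

# The two columns may differ (e.g. nulls vs. truncated integers) because each
# execution path applies its own casting rules.
df.select(plain("id").alias("plain"), arrow("id").alias("arrow")).show()
{code}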



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45543) InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions

2023-10-18 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: InferWindowGroupLimit causes bug if the other window functions 
haven't the same window frame as the rank-like functions  (was: 
InferWindowGroupLimit causes bug if the window frame is different between 
rank-like functions and others)

> InferWindowGroupLimit causes bug if the other window functions haven't the 
> same window frame as the rank-like functions
> ---
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss, pull-request-available
>
> First, this is my first bug report, so I hope I'm doing it right. Also, as I'm 
> not very knowledgeable about Spark internals, I hope I diagnosed the problem 
> correctly.
> I found the degradation in Spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but 
> with different "frame boundaries"), where one window is a ranking function, 
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced 
> by the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> +----+------+-----+----+---+----------+
> |name|row_id|score|year|rn |all_scores|
> +----+------+-----+----+---+----------+
> |Dave|1     |3    |2023|1  |[3, 2, 1] |
> |Dave|1     |2    |2022|2  |[3, 2, 1] |
> |Dave|1     |1    |2020|3  |[3, 2, 1] |
> |Amy |2     |6    |2021|1  |[6]       |
> +----+------+-----+----+---+----------+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> +----+------+-----+----+---+----------+
> |name|row_id|score|year|rn |all_scores|
> +----+------+-----+----+---+----------+
> |Dave|1     |3    |2023|1  |[3]       |
> |Amy |2     |6    |2021|1  |[6]       |
> +----+------+-----+----+---+----------+{noformat}
> As you can see just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`, however, the same result happens with 
> other functions, such as max, min, count, etc)
>  
> Now, if instead of using the two windows we used, I will use the first window 
> and a window with different ordering, or create a completely new window with 
> same partition but no ordering, it will work fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> +----+------+-----+----+---+----------+
> |name|row_id|score|year|rn |all_scores|
> +----+------+-----+----+---+----------+
> |Dave|1     |3    |2023|1  |[3, 2, 1] |
> |Amy |2     |6    |2021|1  |[6]       |
> +----+------+-----+----+---+----------+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> 

[jira] [Resolved] (SPARK-45586) Reduce compilation time for large expression trees

2023-10-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45586.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43420
[https://github.com/apache/spark/pull/43420]

> Reduce compilation time for large expression trees
> --
>
> Key: SPARK-45586
> URL: https://issues.apache.org/jira/browse/SPARK-45586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kelvin Jiang
>Assignee: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Some rules, such as TypeCoercion, are very expensive when the query plan 
> contains very large expression trees.
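
A small sketch of the kind of plan that triggers the problem: analysis rules such as TypeCoercion have to walk a single expression with thousands of nodes (the node count below is arbitrary):

{code:python}
from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Build one column expression containing a few thousand nested Add nodes.
big_expr = reduce(lambda acc, _: acc + F.lit(1), range(5000), F.col("id"))

# Analyzing this plan is dominated by rules that traverse the expression tree,
# which is the compilation time this issue reduces.
df.select(big_expr.alias("total")).explain()
{code}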



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45586) Reduce compilation time for large expression trees

2023-10-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45586:
---

Assignee: Kelvin Jiang

> Reduce compilation time for large expression trees
> --
>
> Key: SPARK-45586
> URL: https://issues.apache.org/jira/browse/SPARK-45586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kelvin Jiang
>Assignee: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
>
> Some rules, such as TypeCoercion, are very expensive when the query plan 
> contains very large expression trees.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45507) Correctness bug in correlated scalar subqueries with COUNT aggregates

2023-10-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45507.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43341
[https://github.com/apache/spark/pull/43341]

> Correctness bug in correlated scalar subqueries with COUNT aggregates
> -
>
> Key: SPARK-45507
> URL: https://issues.apache.org/jira/browse/SPARK-45507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Andy Lam
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
>  
> create view if not exists t1(a1, a2) as values (0, 1), (1, 2);
> create view if not exists t2(b1, b2) as values (0, 2), (0, 3);
> create view if not exists t3(c1, c2) as values (0, 2), (0, 3);
> -- Example 1
> select (
>   select SUM(l.cnt + r.cnt)
>   from (select count(*) cnt from t2 where t1.a1 = t2.b1 having cnt = 0) l
>   join (select count(*) cnt from t3 where t1.a1 = t3.c1 having cnt = 0) r
>   on l.cnt = r.cnt
> ) from t1
> -- Correct answer: (null, 0)
> +--+
> |scalarsubquery(c1, c1)|
> +--+
> |null  |
> |null  |
> +--+
> -- Example 2
> select ( select sum(cnt) from (select count(*) cnt from t2 where t1.c1 = 
> t2.c1) ) from t1
> -- Correct answer: (2, 0)
> +--+
> |scalarsubquery(c1)|
> +--+
> |2 |
> |null  |
> +--+
> -- Example 3
> select ( select count(*) from (select count(*) cnt from t2 where t1.c1 = 
> t2.c1) ) from t1
> -- Correct answer: (1, 1)
> +--+
> |scalarsubquery(c1)|
> +--+
> |1 |
> |0 |
> +--+ {code}
>  
>  
> DB fiddle for correctness 
> check:[https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/10403#]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45596) Use java.lang.ref.Cleaner instead of org.apache.spark.sql.connect.client.util.Cleaner

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45596:
---
Labels: pull-request-available  (was: )

> Use java.lang.ref.Cleaner instead of 
> org.apache.spark.sql.connect.client.util.Cleaner
> -
>
> Key: SPARK-45596
> URL: https://issues.apache.org/jira/browse/SPARK-45596
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Min Zhao
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2023-10-19-02-25-57-966.png
>
>
> Now that we have updated the JDK to 17, we should replace this class with 
> [[java.lang.ref.Cleaner]].
>  
> !image-2023-10-19-02-25-57-966.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45601) stackoverflow when executing rule ExtractWindowExpressions

2023-10-18 Thread JacobZheng (Jira)
JacobZheng created SPARK-45601:
--

 Summary: stackoverflow when executing rule ExtractWindowExpressions
 Key: SPARK-45601
 URL: https://issues.apache.org/jira/browse/SPARK-45601
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.3
Reporter: JacobZheng


I am encountering StackOverflowError while executing the following test case. 
I looked at the source code: ExtractWindowExpressions does not extract the 
window correctly and ends up in an infinite loop inside 
resolveOperatorsDownWithPruning, which causes the error.

{code:scala}
// Some comments here
  test("agg filter contains window") {
    val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3")
      .withColumn("test",
        expr("count(col1) filter (where min(col1) over(partition by col2 order by col3)>1)"))
    src.show()
  }
{code}

Now my question is: is a window function inside an aggregate filter like this 
the correct usage? Or should I add a check, as Spark SQL does elsewhere, and 
throw an error such as "It is not allowed to use window functions inside WHERE 
clause"?




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45601) stackoverflow when executing rule ExtractWindowExpressions

2023-10-18 Thread JacobZheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JacobZheng updated SPARK-45601:
---
Description: 
I am encountering StackOverflowError while executing the following test case. 
I looked at the source code: ExtractWindowExpressions does not extract the 
window correctly and ends up in an infinite loop inside 
resolveOperatorsDownWithPruning, which causes the error.

{code:scala}
  test("agg filter contains window") {
    val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3")
      .withColumn("test",
        expr("count(col1) filter (where min(col1) over(partition by col2 order by col3)>1)"))
    src.show()
  }
{code}

Now my question is: is a window function inside an aggregate filter like this 
the correct usage? Or should I add a check, as Spark SQL does elsewhere, and 
throw an error such as "It is not allowed to use window functions inside WHERE 
clause"?


  was:
I am encountering StackOverflowError while executing the following test case. 
I looked at the source code: ExtractWindowExpressions does not extract the 
window correctly and ends up in an infinite loop inside 
resolveOperatorsDownWithPruning, which causes the error.

{code:scala}
// Some comments here
  test("agg filter contains window") {
    val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3")
      .withColumn("test",
        expr("count(col1) filter (where min(col1) over(partition by col2 order by col3)>1)"))
    src.show()
  }
{code}

Now my question is: is a window function inside an aggregate filter like this 
the correct usage? Or should I add a check, as Spark SQL does elsewhere, and 
throw an error such as "It is not allowed to use window functions inside WHERE 
clause"?



> stackoverflow when executing rule ExtractWindowExpressions
> --
>
> Key: SPARK-45601
> URL: https://issues.apache.org/jira/browse/SPARK-45601
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: JacobZheng
>Priority: Major
>
> I am encountering StackOverflowError while executing the following test case. 
> I looked at the source code: ExtractWindowExpressions does not extract the 
> window correctly and ends up in an infinite loop inside 
> resolveOperatorsDownWithPruning, which causes the error.
> {code:scala}
>   test("agg filter contains window") {
>     val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3")
>       .withColumn("test",
>         expr("count(col1) filter (where min(col1) over(partition by col2 order by col3)>1)"))
>     src.show()
>   }
> {code}
> Now my question is: is a window function inside an aggregate filter like this 
> the correct usage? Or should I add a check, as Spark SQL does elsewhere, and 
> throw an error such as "It is not allowed to use window functions inside WHERE 
> clause"?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45602) Replace `s.c.MapOps.filterKeys` with `s.c.MapOps.view.filterKeys`

2023-10-18 Thread Yang Jie (Jira)
Yang Jie created SPARK-45602:


 Summary: Replace `s.c.MapOps.filterKeys` with 
`s.c.MapOps.view.filterKeys`
 Key: SPARK-45602
 URL: https://issues.apache.org/jira/browse/SPARK-45602
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes, Spark Core, SQL, YARN
Affects Versions: 4.0.0
Reporter: Yang Jie


{code:java}
/** Filters this map by retaining only keys satisfying a predicate.
  *  @param  p   the predicate used to test keys
  *  @return an immutable map consisting only of those key value pairs of this map
  *          where the key satisfies the predicate `p`. The resulting map wraps the
  *          original map without copying any elements.
  */
@deprecated("Use .view.filterKeys(f). A future version will include a strict version of this method (for now, .view.filterKeys(p).toMap).", "2.13.0")
def filterKeys(p: K => Boolean): MapView[K, V] = new MapView.FilterKeys(this, p)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45543) InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions

2023-10-18 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: InferWindowGroupLimit causes bug if the other window functions 
haven't the same window frame as the rank-like functions  (was: 
WindowGroupLimit causes bug if the other window functions haven't the same 
window frame as the rank-like functions)

> InferWindowGroupLimit causes bug if the other window functions haven't the 
> same window frame as the rank-like functions
> ---
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss, pull-request-available
>
> First, this is my first bug report, so I hope I'm doing it right. Also, as I'm 
> not very knowledgeable about Spark internals, I hope I diagnosed the problem 
> correctly.
> I found the degradation in Spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but 
> with different "frame boundaries"), where one window is a ranking function, 
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced 
> by the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> +----+------+-----+----+---+----------+
> |name|row_id|score|year|rn |all_scores|
> +----+------+-----+----+---+----------+
> |Dave|1     |3    |2023|1  |[3, 2, 1] |
> |Dave|1     |2    |2022|2  |[3, 2, 1] |
> |Dave|1     |1    |2020|3  |[3, 2, 1] |
> |Amy |2     |6    |2021|1  |[6]       |
> +----+------+-----+----+---+----------+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> +----+------+-----+----+---+----------+
> |name|row_id|score|year|rn |all_scores|
> +----+------+-----+----+---+----------+
> |Dave|1     |3    |2023|1  |[3]       |
> |Amy |2     |6    |2021|1  |[6]       |
> +----+------+-----+----+---+----------+{noformat}
> As you can see just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`, however, the same result happens with 
> other functions, such as max, min, count, etc)
>  
> Now, if instead of using the two windows we used, I will use the first window 
> and a window with different ordering, or create a completely new window with 
> same partition but no ordering, it will work fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> +----+------+-----+----+---+----------+
> |name|row_id|score|year|rn |all_scores|
> +----+------+-----+----+---+----------+
> |Dave|1     |3    |2023|1  |[3, 2, 1] |
> |Amy |2     |6    |2021|1  |[6]       |
> +----+------+-----+----+---+----------+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> 

[jira] [Updated] (SPARK-45543) InferWindowGroupLimit causes bug if the window frame is different between rank-like functions and others

2023-10-18 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: InferWindowGroupLimit causes bug if the window frame is different 
between rank-like functions and others  (was: InferWindowGroupLimit causes bug 
if the other window functions haven't the same window frame as the rank-like 
functions)

> InferWindowGroupLimit causes bug if the window frame is different between 
> rank-like functions and others
> 
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss, pull-request-available
>
> First, this is my first bug report, so I hope I'm doing it right. Also, as I'm 
> not very knowledgeable about Spark internals, I hope I diagnosed the problem 
> correctly.
> I found the degradation in Spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but 
> with different "frame boundaries"), where one window is a ranking function, 
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced 
> by the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> +----+------+-----+----+---+----------+
> |name|row_id|score|year|rn |all_scores|
> +----+------+-----+----+---+----------+
> |Dave|1     |3    |2023|1  |[3, 2, 1] |
> |Dave|1     |2    |2022|2  |[3, 2, 1] |
> |Dave|1     |1    |2020|3  |[3, 2, 1] |
> |Amy |2     |6    |2021|1  |[6]       |
> +----+------+-----+----+---+----------+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> +----+------+-----+----+---+----------+
> |name|row_id|score|year|rn |all_scores|
> +----+------+-----+----+---+----------+
> |Dave|1     |3    |2023|1  |[3]       |
> |Amy |2     |6    |2021|1  |[6]       |
> +----+------+-----+----+---+----------+{noformat}
> As you can see just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`, however, the same result happens with 
> other functions, such as max, min, count, etc)
>  
> Now, if instead of using the two windows we used, I will use the first window 
> and a window with different ordering, or create a completely new window with 
> same partition but no ordering, it will work fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> +----+------+-----+----+---+----------+
> |name|row_id|score|year|rn |all_scores|
> +----+------+-----+----+---+----------+
> |Dave|1     |3    |2023|1  |[3, 2, 1] |
> |Amy |2     |6    |2021|1  |[6]       |
> +----+------+-----+----+---+----------+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> windowspecdefinition(row_id#1L, name#0, 

[jira] [Commented] (SPARK-44817) SPIP: Incremental Stats Collection

2023-10-18 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776976#comment-17776976
 ] 

Abhinav Kumar commented on SPARK-44817:
---

[~rakson] [~gurwls223] [~cloud_fan] - We find this issue quite common. 
Currently, incremental stats collection is mostly done outside the Spark 
application as an end-of-day process (to avoid SLA breaches), and sometimes 
within the current application if a DML statement materially changes the stats. 
This proposal seems like a good idea, considering users can control it via a 
Spark parameter.

Views?

> SPIP: Incremental Stats Collection
> --
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost Based Optimizer depends on table and column statistics.
> After every execution of a DML query, table and column stats are invalidated if 
> auto update of stats collection is not turned on. To keep stats updated, we 
> need to run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very 
> expensive. It is not feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run 
> itself. This way our table and column stats would be fresh at all times 
> and CBO benefits can be applied. Initially, we can update only table-level 
> stats and gradually start updating column-level stats as well.
> *Pros:*
> 1. Optimize queries over table which is updated frequently.
> 2. Saves Compute cycles by removing dependency over `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.
> [SPIP Document 
> |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]
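
For context, this is the manual refresh the proposal aims to make unnecessary; a minimal sketch (the table and column names are placeholders):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Today: statistics go stale after DML and must be recomputed explicitly.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")                             # table-level
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS region, amount")  # column-level

# Inspect the statistics the CBO currently sees for the table.
spark.sql("DESCRIBE EXTENDED sales").show(truncate=False)
{code}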



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45507) Correctness bug in correlated scalar subqueries with COUNT aggregates

2023-10-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45507:
---

Assignee: Andy Lam

> Correctness bug in correlated scalar subqueries with COUNT aggregates
> -
>
> Key: SPARK-45507
> URL: https://issues.apache.org/jira/browse/SPARK-45507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Andy Lam
>Assignee: Andy Lam
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
>  
> create view if not exists t1(a1, a2) as values (0, 1), (1, 2);
> create view if not exists t2(b1, b2) as values (0, 2), (0, 3);
> create view if not exists t3(c1, c2) as values (0, 2), (0, 3);
> -- Example 1
> select (
>   select SUM(l.cnt + r.cnt)
>   from (select count(*) cnt from t2 where t1.a1 = t2.b1 having cnt = 0) l
>   join (select count(*) cnt from t3 where t1.a1 = t3.c1 having cnt = 0) r
>   on l.cnt = r.cnt
> ) from t1
> -- Correct answer: (null, 0)
> +--+
> |scalarsubquery(c1, c1)|
> +--+
> |null  |
> |null  |
> +--+
> -- Example 2
> select ( select sum(cnt) from (select count(*) cnt from t2 where t1.c1 = 
> t2.c1) ) from t1
> -- Correct answer: (2, 0)
> +--+
> |scalarsubquery(c1)|
> +--+
> |2 |
> |null  |
> +--+
> -- Example 3
> select ( select count(*) from (select count(*) cnt from t2 where t1.c1 = 
> t2.c1) ) from t1
> -- Correct answer: (1, 1)
> +--+
> |scalarsubquery(c1)|
> +--+
> |1 |
> |0 |
> +--+ {code}
>  
>  
> DB fiddle for correctness 
> check:[https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/10403#]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45555) Returning a debuggable object for failed assertion

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45555:
---
Labels: pull-request-available  (was: )

> Returning a debuggable object for failed assertion
> --
>
> Key: SPARK-45555
> URL: https://issues.apache.org/jira/browse/SPARK-45555
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> To facilitate debugging, we should add functionality to return a debuggable 
> object when an assertion from the testing util functions fails.
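
A purely hypothetical sketch of the idea, built on the existing `assertDataFrameEqual` util (available since PySpark 3.5); the wrapper and its fields are assumptions for illustration, not the API introduced by the PR:

{code:python}
from dataclasses import dataclass
from typing import Optional

from pyspark.errors import PySparkAssertionError
from pyspark.testing import assertDataFrameEqual

@dataclass
class AssertionResult:
    passed: bool
    error: Optional[PySparkAssertionError] = None  # keeps the diff around for inspection

def check_dataframe_equal(actual, expected) -> AssertionResult:
    """Return a debuggable result object instead of only raising on mismatch."""
    try:
        assertDataFrameEqual(actual, expected)
        return AssertionResult(passed=True)
    except PySparkAssertionError as exc:
        return AssertionResult(passed=False, error=exc)
{code}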



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45602) Replace `s.c.MapOps.filterKeys` with `s.c.MapOps.view.filterKeys`

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45602:
---
Labels: pull-request-available  (was: )

> Replace `s.c.MapOps.filterKeys` with `s.c.MapOps.view.filterKeys`
> -
>
> Key: SPARK-45602
> URL: https://issues.apache.org/jira/browse/SPARK-45602
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core, SQL, YARN
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> /** Filters this map by retaining only keys satisfying a predicate.
>   *  @param  p   the predicate used to test keys
>   *  @return an immutable map consisting only of those key value pairs of this map
>   *          where the key satisfies the predicate `p`. The resulting map wraps the
>   *          original map without copying any elements.
>   */
> @deprecated("Use .view.filterKeys(f). A future version will include a strict version of this method (for now, .view.filterKeys(p).toMap).", "2.13.0")
> def filterKeys(p: K => Boolean): MapView[K, V] = new MapView.FilterKeys(this, p)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45511) SPIP: State Data Source - Reader

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45511:
---
Labels: SPIP pull-request-available  (was: SPIP)

> SPIP: State Data Source - Reader
> 
>
> Key: SPARK-45511
> URL: https://issues.apache.org/jira/browse/SPARK-45511
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>  Labels: SPIP, pull-request-available
>
> The State Store has been a black box since the introduction of stateful 
> operators. It has been the “internal” data of the streaming query, and Spark 
> does not expose it outside of the streaming query. There is no 
> feature/tool for users to read or modify the content of state stores.
> Specific to the ability to read the state, the lack of such a feature brings 
> various limitations like the following:
>  * Users are unable to see the content of the state store, so they cannot 
> debug it.
>  * Users have to take an indirect approach to verifying the content of 
> the state store in unit tests; the only option they have is relying on 
> the output of the query.
> Given that, we propose to introduce a feature that enables users to read the 
> state from outside of the streaming query.
> SPIP: 
> [https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing]
>  
>  
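
A hypothetical sketch of how such a reader could be used from PySpark once the feature exists; the data source name and option keys below are assumptions for illustration, and the real interface is defined in the SPIP:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical: read the state of a stateful operator from a query's checkpoint.
state = (
    spark.read.format("statestore")                # assumed data source name
    .option("path", "/tmp/checkpoints/my_query")   # checkpoint location of the query
    .option("batchId", 42)                         # assumed option: micro-batch to read
    .load()
)

# Inspect stored keys/values, e.g. while debugging or in a unit test.
state.show(truncate=False)
{code}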



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45534) Use `java.lang.ref.Cleaner` instead of `finalize` for `RemoteBlockPushResolver`

2023-10-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45534:
---

Assignee: Min Zhao

> Use `java.lang.ref.Cleaner` instead of `finalize` for 
> `RemoteBlockPushResolver`
> ---
>
> Key: SPARK-45534
> URL: https://issues.apache.org/jira/browse/SPARK-45534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Min Zhao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45534) Use `java.lang.ref.Cleaner` instead of `finalize` for `RemoteBlockPushResolver`

2023-10-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45534.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43371
[https://github.com/apache/spark/pull/43371]

> Use `java.lang.ref.Cleaner` instead of `finalize` for 
> `RemoteBlockPushResolver`
> ---
>
> Key: SPARK-45534
> URL: https://issues.apache.org/jira/browse/SPARK-45534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Min Zhao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45558) Introduce a metadata file for streaming stateful operator

2023-10-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-45558.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43393
[https://github.com/apache/spark/pull/43393]

> Introduce a metadata file for streaming stateful operator
> -
>
> Key: SPARK-45558
> URL: https://issues.apache.org/jira/browse/SPARK-45558
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The information to store in the metadata file:
>  * operator name (no need to be unique among stateful operators in the query)
>  * state store name
>  * numColumnsPrefixKey: > 0 if prefix scan is enabled, 0 otherwise
> The body of the metadata file will be in JSON format. The metadata file will be 
> versioned, just like other streaming metadata files, to be future-proof.
> The metadata file will expose more information about the state store, improve 
> debuggability, and facilitate the development of state-related features such as 
> reading and writing state and state repartitioning.
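
A small illustration of what such a JSON body might look like, using the fields listed above (the key names and version layout are assumptions, not the committed format):

{code:python}
import json

operator_metadata = {
    "operatorName": "stateStoreSave",  # does not need to be unique within the query
    "stateStores": [
        {"storeName": "default", "numColumnsPrefixKey": 0},  # 0 => prefix scan disabled
    ],
}

# Versioned wrapper, mirroring how other streaming metadata files stay future-proof.
print(json.dumps({"version": 1, "operatorInfo": operator_metadata}, indent=2))
{code}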



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45590) okio-1.15.0 CVE-2023-3635

2023-10-18 Thread Colm O hEigeartaigh (Jira)
Colm O hEigeartaigh created SPARK-45590:
---

 Summary: okio-1.15.0 CVE-2023-3635
 Key: SPARK-45590
 URL: https://issues.apache.org/jira/browse/SPARK-45590
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 3.5.0
Reporter: Colm O hEigeartaigh


CVE-2023-3635 is being flagged against okio-1.15.0 present in the Spark 3.5.0 
build:
 * ./spark-3.5.0-bin-without-hadoop/jars/okio-1.15.0.jar
 * ./spark-3.5.0-bin-hadoop3/jars/okio-1.15.0.jar

I don't see okio in the dependency tree; it must be coming in via some profile.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45558) Introduce a metadata file for streaming stateful operator

2023-10-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-45558:


Assignee: Chaoqin Li

> Introduce a metadata file for streaming stateful operator
> -
>
> Key: SPARK-45558
> URL: https://issues.apache.org/jira/browse/SPARK-45558
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> The information to store in the metadata file:
>  * operator name (no need to be unique among stateful operators in the query)
>  * state store name
>  * numColumnsPrefixKey: > 0 if prefix scan is enabled, 0 otherwise
> The body of the metadata file will be in JSON format. The metadata file will be 
> versioned just like the other streaming metadata files, to be future-proof.
> The metadata file will expose more information about the state store, improve 
> debuggability, and facilitate the development of state-related features such as 
> reading and writing state and state repartitioning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45542) Replace `setSafeMode(HdfsConstants.SafeModeAction, boolean)` with `setSafeMode(SafeModeAction, boolean)`

2023-10-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45542:
-

Assignee: Yang Jie

> Replace `setSafeMode(HdfsConstants.SafeModeAction, boolean)` with 
> `setSafeMode(SafeModeAction, boolean)`
> 
>
> Key: SPARK-45542
> URL: https://issues.apache.org/jira/browse/SPARK-45542
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
> /**
>  * Enter, leave or get safe mode.
>  *
>  * @param action
>  *  One of SafeModeAction.ENTER, SafeModeAction.LEAVE and
>  *  SafeModeAction.GET.
>  * @param isChecked
>  *  If true check only for Active NNs status, else check first NN's
>  *  status.
>  *
>  * @see 
> org.apache.hadoop.hdfs.protocol.ClientProtocol#setSafeMode(HdfsConstants.SafeModeAction,
>  * boolean)
>  *
>  * @deprecated please instead use
>  *   {@link DistributedFileSystem#setSafeMode(SafeModeAction, 
> boolean)}.
>  */
> @Deprecated
> public boolean setSafeMode(HdfsConstants.SafeModeAction action,
> boolean isChecked) throws IOException {
>   return dfs.setSafeMode(action, isChecked);
> } {code}
>  
> `setSafeMode(HdfsConstants.SafeModeAction, boolean)` is `Deprecated`
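> A minimal sketch of the call-site change, assuming a Hadoop client that ships the 
> newer `org.apache.hadoop.fs.SafeModeAction` enum referenced by the deprecation 
> note (variable and method names here are illustrative):
> {code:java}
> import org.apache.hadoop.fs.SafeModeAction
> import org.apache.hadoop.hdfs.DistributedFileSystem
>
> def leaveSafeMode(dfs: DistributedFileSystem): Boolean = {
>   // Before (deprecated): dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_LEAVE, false)
>   dfs.setSafeMode(SafeModeAction.LEAVE, false)
> }
> {code}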



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45574) Add :: syntax as a shorthand for casting

2023-10-18 Thread Ivan Mitic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mitic updated SPARK-45574:
---
Description: Adds the `::` syntax as syntactic sugar for casting columns. 
This is a pretty common syntax, and it was accepted by the SQL API  (was: Adds 
the `::` syntax as syntactic sugar for casting columns. This is a pretty common 
syntax, and it was accepted by the SQL API in the [Semi-Structured Data API 
PRD](

[https://docs.google.com/document/d/1yNf0oE7XNZpLvsWly-MxZaxdlvMdRlZ1ZjSndtmoiWs/edit#heading=h.k50kjbi5yepj]

).)

> Add :: syntax as a shorthand for casting
> 
>
> Key: SPARK-45574
> URL: https://issues.apache.org/jira/browse/SPARK-45574
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Ivan Mitic
>Priority: Major
>  Labels: pull-request-available, release-notes
>
> Adds the `::` syntax as syntactic sugar for casting columns. This is a pretty 
> common syntax, and it was accepted by the SQL API
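> A minimal sketch of the intended shorthand (illustrative only; the exact grammar 
> and type coverage are settled by the implementing PR):
> {code:java}
> // '::' as a cast shorthand: '42'::INT is equivalent to CAST('42' AS INT)
> spark.sql("SELECT '42'::INT AS answer").show()
> {code}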



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45035) Support ignoreCorruptFiles for multiline CSV

2023-10-18 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-45035.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42979
[https://github.com/apache/spark/pull/42979]

> Support ignoreCorruptFiles for multiline CSV
> 
>
> Key: SPARK-45035
> URL: https://issues.apache.org/jira/browse/SPARK-45035
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yaohua Zhao
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Today, `ignoreCorruptFiles` does not work well for multiline CSV mode.
> {code:java}
> spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
> val testCorruptDF0 = spark.read.option("ignoreCorruptFiles", "true")
>   .option("multiline", "true").csv("/tmp/sourcepath/").show() {code}
> It throws an exception instead of ignoring silently:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 4940.0 (TID 4031) (10.68.177.106 executor 0): 
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalStateException - Error reading from input
> Parser Configuration: CsvParserSettings:
>   Auto configuration enabled=true
>   Auto-closing enabled=true
>   Autodetect column delimiter=false
>   Autodetect quotes=false
>   Column reordering enabled=true
>   Delimiters for detection=null
>   Empty value=
>   Escape unquoted values=false
>   Header extraction enabled=null
>   Headers=null
>   Ignore leading whitespaces=false
>   Ignore leading whitespaces in quotes=false
>   Ignore trailing whitespaces=false
>   Ignore trailing whitespaces in quotes=false
>   Input buffer size=1048576
>   Input reading on separate thread=false
>   Keep escape sequences=false
>   Keep quotes=false
>   Length of content displayed on error=1000
>   Line separator detection enabled=true
>   Maximum number of characters per column=-1
>   Maximum number of columns=20480
>   Normalize escaped line separators=true
>   Null value=
>   Number of records to read=all
>   Processor=none
>   Restricting data in exceptions=false
>   RowProcessor error handler=null
>   Selected fields=none
>   Skip bits as whitespace=true
>   Skip empty lines=true
>   Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
>   CsvFormat:
>   Comment character=#
>   Field delimiter=,
>   Line separator (normalized)=\n
>   Line separator sequence=\n
>   Quote character="
>   Quote escape character=\
>   Quote escape escape character=null
> Internal state when error was thrown: line=0, column=0, record=0
>   at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
>   at 
> com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
>   at 
> com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
>   at 
> org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
>   at 
> org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46...
>  {code}
> This is because multiline parsing uses a different RDD (`BinaryFileRDD`) 
> which does not go through `FileScanRDD`. We could potentially add this 
> support to `BinaryFileRDD`, or even reuse `FileScanRDD` for multiline 
> parsing mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45035) Support ignoreCorruptFiles for multiline CSV

2023-10-18 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-45035:


Assignee: Jia Fan

> Support ignoreCorruptFiles for multiline CSV
> 
>
> Key: SPARK-45035
> URL: https://issues.apache.org/jira/browse/SPARK-45035
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yaohua Zhao
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
>
> Today, `ignoreCorruptFiles` does not work well for multiline CSV mode.
> {code:java}
> spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
> val testCorruptDF0 = spark.read.option("ignoreCorruptFiles", "true")
>   .option("multiline", "true").csv("/tmp/sourcepath/").show() {code}
> It throws an exception instead of ignoring silently:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 4940.0 (TID 4031) (10.68.177.106 executor 0): 
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalStateException - Error reading from input
> Parser Configuration: CsvParserSettings:
>   Auto configuration enabled=true
>   Auto-closing enabled=true
>   Autodetect column delimiter=false
>   Autodetect quotes=false
>   Column reordering enabled=true
>   Delimiters for detection=null
>   Empty value=
>   Escape unquoted values=false
>   Header extraction enabled=null
>   Headers=null
>   Ignore leading whitespaces=false
>   Ignore leading whitespaces in quotes=false
>   Ignore trailing whitespaces=false
>   Ignore trailing whitespaces in quotes=false
>   Input buffer size=1048576
>   Input reading on separate thread=false
>   Keep escape sequences=false
>   Keep quotes=false
>   Length of content displayed on error=1000
>   Line separator detection enabled=true
>   Maximum number of characters per column=-1
>   Maximum number of columns=20480
>   Normalize escaped line separators=true
>   Null value=
>   Number of records to read=all
>   Processor=none
>   Restricting data in exceptions=false
>   RowProcessor error handler=null
>   Selected fields=none
>   Skip bits as whitespace=true
>   Skip empty lines=true
>   Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
>   CsvFormat:
>   Comment character=#
>   Field delimiter=,
>   Line separator (normalized)=\n
>   Line separator sequence=\n
>   Quote character="
>   Quote escape character=\
>   Quote escape escape character=null
> Internal state when error was thrown: line=0, column=0, record=0
>   at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
>   at 
> com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
>   at 
> com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
>   at 
> org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
>   at 
> org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46...
>  {code}
> This is because multiline parsing uses a different RDD (`BinaryFileRDD`) 
> which does not go through `FileScanRDD`. We could potentially add this 
> support to `BinaryFileRDD`, or even reuse `FileScanRDD` for multiline 
> parsing mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45589) Supplementary exception class

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45589:
---
Labels: pull-request-available  (was: )

> Supplementary exception class
> -
>
> Key: SPARK-45589
> URL: https://issues.apache.org/jira/browse/SPARK-45589
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45370) Fix python test when ansi mode enabled

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45370.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43168
[https://github.com/apache/spark/pull/43168]

> Fix python test when ansi mode enabled
> --
>
> Key: SPARK-45370
> URL: https://issues.apache.org/jira/browse/SPARK-45370
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45370) Fix python test when ansi mode enabled

2023-10-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45370:


Assignee: BingKun Pan

> Fix python test when ansi mode enabled
> --
>
> Key: SPARK-45370
> URL: https://issues.apache.org/jira/browse/SPARK-45370
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs

2023-10-18 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776533#comment-17776533
 ] 

BingKun Pan commented on SPARK-44734:
-

I'll work on it.

> Add documentation for type casting rules in Python UDFs/UDTFs
> -
>
> Key: SPARK-44734
> URL: https://issues.apache.org/jira/browse/SPARK-44734
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> In addition to type mappings between Spark data types and Python data types 
> (SPARK-44733), we should add the type casting rules for regular and 
> arrow-optimized Python UDFs/UDTFs. 
> We currently have this table in code:
>  * Arrow: 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329]
>  * Python UDF: 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116]
> We should add a proper documentation page for the type casting rules. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45591) Upgrade ASM to 9.6

2023-10-18 Thread Yang Jie (Jira)
Yang Jie created SPARK-45591:


 Summary: Upgrade ASM to 9.6
 Key: SPARK-45591
 URL: https://issues.apache.org/jira/browse/SPARK-45591
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45576) Remove unnecessary debug logs in ReloadingX509TrustManagerSuite

2023-10-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45576:
--
Summary: Remove unnecessary debug logs in ReloadingX509TrustManagerSuite  
(was: [CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite)

> Remove unnecessary debug logs in ReloadingX509TrustManagerSuite
> ---
>
> Key: SPARK-45576
> URL: https://issues.apache.org/jira/browse/SPARK-45576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> These were added accidentally.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45587) Skip UNIDOC and MIMA in build GitHub Action job

2023-10-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45587.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43422
[https://github.com/apache/spark/pull/43422]

> Skip UNIDOC and MIMA in build GitHub Action job
> ---
>
> Key: SPARK-45587
> URL: https://issues.apache.org/jira/browse/SPARK-45587
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45587) Skip UNIDOC and MIMA in build GitHub Action job

2023-10-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45587:
-

Assignee: Dongjoon Hyun

> Skip UNIDOC and MIMA in build GitHub Action job
> ---
>
> Key: SPARK-45587
> URL: https://issues.apache.org/jira/browse/SPARK-45587
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45570) Spark job hangs due to task launch thread failed to create

2023-10-18 Thread lifulong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lifulong updated SPARK-45570:
-
Environment: 
spark.speculation uses the default value false

spark version 3.1.2
 

  was:
spark.speculation uses the default value false
 


> Spark job hangs due to task launch thread failed to create
> --
>
> Key: SPARK-45570
> URL: https://issues.apache.org/jira/browse/SPARK-45570
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.5.0
> Environment: spark.speculation uses the default value false
> spark version 3.1.2
>  
>Reporter: lifulong
>Priority: Major
>
> The Spark job hangs: the web UI shows one task in the running stage that keeps 
> running for multiple hours, while the other tasks finished within a few minutes. 
> The executor never reports the task launch failure back to the driver.
>  
> Below is the executor log from the failed task thread launch:
> 23/10/17 04:45:42 ERROR Inbox: An error happened while processing message in 
> the inbox for Executor
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>         at org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
>         at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:173)
>         at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45570) Spark job hangs due to task launch thread failed to create

2023-10-18 Thread lifulong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lifulong updated SPARK-45570:
-
Affects Version/s: 3.5.0

> Spark job hangs due to task launch thread failed to create
> --
>
> Key: SPARK-45570
> URL: https://issues.apache.org/jira/browse/SPARK-45570
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.5.0
> Environment: spark.speculation uses the default value false
>  
>Reporter: lifulong
>Priority: Major
>
> The Spark job hangs: the web UI shows one task in the running stage that keeps 
> running for multiple hours, while the other tasks finished within a few minutes. 
> The executor never reports the task launch failure back to the driver.
>  
> Below is the executor log from the failed task thread launch:
> 23/10/17 04:45:42 ERROR Inbox: An error happened while processing message in 
> the inbox for Executor
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>         at org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
>         at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:173)
>         at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45553) Deprecate assertPandasOnSparkEqual

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45553:
---
Labels: pull-request-available  (was: )

> Deprecate assertPandasOnSparkEqual
> --
>
> Key: SPARK-45553
> URL: https://issues.apache.org/jira/browse/SPARK-45553
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> We will add new APIs for DataFrame, Series and Index separately, and we 
> should deprecate assertPandasOnSparkEqual.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45574) Add :: syntax as a shorthand for casting

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45574:
---
Labels: pull-request-available release-notes  (was: release-notes)

> Add :: syntax as a shorthand for casting
> 
>
> Key: SPARK-45574
> URL: https://issues.apache.org/jira/browse/SPARK-45574
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Ivan Mitic
>Priority: Major
>  Labels: pull-request-available, release-notes
>
> Adds the `::` syntax as syntactic sugar for casting columns. This is a pretty 
> common syntax, and it was accepted by the SQL API in the [Semi-Structured 
> Data API PRD](
> [https://docs.google.com/document/d/1yNf0oE7XNZpLvsWly-MxZaxdlvMdRlZ1ZjSndtmoiWs/edit#heading=h.k50kjbi5yepj]
> ).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45591) Upgrade ASM to 9.6

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45591:
---
Labels: pull-request-available  (was: )

> Upgrade ASM to 9.6
> --
>
> Key: SPARK-45591
> URL: https://issues.apache.org/jira/browse/SPARK-45591
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45586) Reduce compilation time for large expression trees

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45586:
--

Assignee: Apache Spark

> Reduce compilation time for large expression trees
> --
>
> Key: SPARK-45586
> URL: https://issues.apache.org/jira/browse/SPARK-45586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kelvin Jiang
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Some rules, such as TypeCoercion, are very expensive when the query plan 
> contains very large expression trees.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45586) Reduce compilation time for large expression trees

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45586:
--

Assignee: (was: Apache Spark)

> Reduce compilation time for large expression trees
> --
>
> Key: SPARK-45586
> URL: https://issues.apache.org/jira/browse/SPARK-45586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
>
> Some rules, such as TypeCoercion, are very expensive when the query plan 
> contains very large expression trees.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45589) Supplementary exception class

2023-10-18 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-45589:
---

 Summary: Supplementary exception class
 Key: SPARK-45589
 URL: https://issues.apache.org/jira/browse/SPARK-45589
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44649) Runtime Filter supports passing equivalent creation side expressions

2023-10-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44649.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42317
[https://github.com/apache/spark/pull/42317]

> Runtime Filter supports passing equivalent creation side expressions
> 
>
> Key: SPARK-44649
> URL: https://issues.apache.org/jira/browse/SPARK-44649
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> SELECT
>   d_year,
>   i_brand_id,
>   i_class_id,
>   i_category_id,
>   i_manufact_id,
>   cs_quantity - COALESCE(cr_return_quantity, 0) AS sales_cnt,
>   cs_ext_sales_price - COALESCE(cr_return_amount, 0.0) AS sales_amt
> FROM catalog_sales
>   JOIN item ON i_item_sk = cs_item_sk
>   JOIN date_dim ON d_date_sk = cs_sold_date_sk
>   LEFT JOIN catalog_returns ON (cs_order_number = cr_order_number
> AND cs_item_sk = cr_item_sk)
> WHERE i_category = 'Books'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44649) Runtime Filter supports passing equivalent creation side expressions

2023-10-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44649:
---

Assignee: Jiaan Geng

> Runtime Filter supports passing equivalent creation side expressions
> 
>
> Key: SPARK-44649
> URL: https://issues.apache.org/jira/browse/SPARK-44649
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> SELECT
>   d_year,
>   i_brand_id,
>   i_class_id,
>   i_category_id,
>   i_manufact_id,
>   cs_quantity - COALESCE(cr_return_quantity, 0) AS sales_cnt,
>   cs_ext_sales_price - COALESCE(cr_return_amount, 0.0) AS sales_amt
> FROM catalog_sales
>   JOIN item ON i_item_sk = cs_item_sk
>   JOIN date_dim ON d_date_sk = cs_sold_date_sk
>   LEFT JOIN catalog_returns ON (cs_order_number = cr_order_number
> AND cs_item_sk = cr_item_sk)
> WHERE i_category = 'Books'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45542) Replace `setSafeMode(HdfsConstants.SafeModeAction, boolean)` with `setSafeMode(SafeModeAction, boolean)`

2023-10-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45542.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43377
[https://github.com/apache/spark/pull/43377]

> Replace `setSafeMode(HdfsConstants.SafeModeAction, boolean)` with 
> `setSafeMode(SafeModeAction, boolean)`
> 
>
> Key: SPARK-45542
> URL: https://issues.apache.org/jira/browse/SPARK-45542
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> /**
>  * Enter, leave or get safe mode.
>  *
>  * @param action
>  *  One of SafeModeAction.ENTER, SafeModeAction.LEAVE and
>  *  SafeModeAction.GET.
>  * @param isChecked
>  *  If true check only for Active NNs status, else check first NN's
>  *  status.
>  *
>  * @see 
> org.apache.hadoop.hdfs.protocol.ClientProtocol#setSafeMode(HdfsConstants.SafeModeAction,
>  * boolean)
>  *
>  * @deprecated please instead use
>  *   {@link DistributedFileSystem#setSafeMode(SafeModeAction, 
> boolean)}.
>  */
> @Deprecated
> public boolean setSafeMode(HdfsConstants.SafeModeAction action,
> boolean isChecked) throws IOException {
>   return dfs.setSafeMode(action, isChecked);
> } {code}
>  
> `setSafeMode(HdfsConstants.SafeModeAction, boolean)` is `Deprecated`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45549) Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend`

2023-10-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-45549:
-
Priority: Trivial  (was: Minor)

> Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend`
> ---
>
> Key: SPARK-45549
> URL: https://issues.apache.org/jira/browse/SPARK-45549
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: xiaoping.huang
>Priority: Trivial
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45593) Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.Intern

2023-10-18 Thread yikaifei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikaifei updated SPARK-45593:
-
Description: 
 

Reproducing steps:

First, clone the Spark master code, then:
 # Build runnable distribution from master code by : `/dev/make-distribution.sh 
--name ui --pip --tgz  -Phive -Phive-thriftserver -Pyarn -Pconnect`
 # Install runnable distribution package
 # Run `bin/spark-sql`

 

Got error:
{code:java}
 23/10/18 20:51:46 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3511)
    at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3515)
    at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
    at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
    at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
    at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
    at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
    at 
org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146)
    at org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:545)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:629)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2916)
    at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1100)
    at scala.Option.getOrElse(Option.scala:201)
    at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1094)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:64)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:441)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:177)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
    at 

[jira] [Created] (SPARK-45594) Auto repartition before writing data into partitioned or bucket table

2023-10-18 Thread Wan Kun (Jira)
Wan Kun created SPARK-45594:
---

 Summary:  Auto repartition before writing data into partitioned or 
bucket table
 Key: SPARK-45594
 URL: https://issues.apache.org/jira/browse/SPARK-45594
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wan Kun


Now, when writing data into a partitioned table, there will be at least 
*dynamicPartitions * shuffleNum* files; when writing data into a bucketed table, 
there will be at least *bucketNums * shuffleNum* files.
We can shuffle by the dynamic partition or bucket columns before writing data 
into the table, as in the sketch below.
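A minimal sketch of the manual workaround this proposal would automate, assuming a 
DataFrame `df` with a dynamic partition column named `dt` (names are illustrative):
{code:java}
import org.apache.spark.sql.functions.col

// Shuffling by the partition column first means each dynamic partition is written by
// (roughly) one task, so the file count tracks the number of partition values rather
// than dynamicPartitions * shuffleNum.
df.repartition(col("dt"))
  .write
  .partitionBy("dt")
  .mode("overwrite")
  .saveAsTable("target_table")
{code}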



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45593) Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.Intern

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45593:
---
Labels: pull-request-available  (was: )

> Building a runnable distribution from master code running spark-sql raise 
> error "java.lang.ClassNotFoundException: 
> org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess"
> ---
>
> Key: SPARK-45593
> URL: https://issues.apache.org/jira/browse/SPARK-45593
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: yikaifei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Building a runnable distribution from master code running spark-sql raise 
> error "java.lang.ClassNotFoundException: 
> org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess";
> Reproducing steps: first, clone the Spark master code, then:
>  # Build runnable distribution from master code by : 
> `/dev/make-distribution.sh --name ui --pip --tgz  -Phive -Phive-thriftserver 
> -Pyarn -Pconnect`
>  # Install runnable distribution package
>  # Run `bin/spark-sql`
> Got error:
> {code:java}
>  23/10/18 20:51:46 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3511)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3515)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
>     at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
>     at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
>     at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
>     at 
> 

[jira] [Assigned] (SPARK-45549) Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend`

2023-10-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45549:
-

Assignee: xiaoping.huang

> Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend`
> ---
>
> Key: SPARK-45549
> URL: https://issues.apache.org/jira/browse/SPARK-45549
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: xiaoping.huang
>Assignee: xiaoping.huang
>Priority: Trivial
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45549) Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend`

2023-10-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45549.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43383
[https://github.com/apache/spark/pull/43383]

> Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend`
> ---
>
> Key: SPARK-45549
> URL: https://issues.apache.org/jira/browse/SPARK-45549
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: xiaoping.huang
>Assignee: xiaoping.huang
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45592:
---
Labels: pull-request-available  (was: )

> AQE and InMemoryTableScanExec correctness bug
> -
>
> Key: SPARK-45592
> URL: https://issues.apache.org/jira/browse/SPARK-45592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Emil Ejbyfeldt
>Priority: Major
>  Labels: pull-request-available
>
> The following query should return 100
> {code:java}
> import org.apache.spark.storage.StorageLevel
> val df = spark.range(0, 100, 1, 5).map(l => (l, l))
> val ee = df.select($"_1".as("src"), $"_2".as("dst"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ee.count()
> val minNbrs1 = ee
>   .groupBy("src").agg(min(col("dst")).as("min_number"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> val join = ee.join(minNbrs1, "src")
> join.count(){code}
> but on Spark 3.5.0 there is a correctness bug causing it to return `104800` 
> or some other smaller value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45593) Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.Intern

2023-10-18 Thread yikaifei (Jira)
yikaifei created SPARK-45593:


 Summary: Building a runnable distribution from master code running 
spark-sql raise error "java.lang.ClassNotFoundException: 
org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess"
 Key: SPARK-45593
 URL: https://issues.apache.org/jira/browse/SPARK-45593
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: yikaifei
 Fix For: 4.0.0


Reproducing steps:

First, clone the Spark master code, then:
 # Build runnable distribution from master code by : `/dev/make-distribution.sh 
--name ui --pip --tgz  -Phive -Phive-thriftserver -Pyarn -Pconnect`
 # Install runnable distribution package
 # Run `bin/spark-sql`

 

Got error:

```

23/10/18 20:51:46 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3511)
    at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3515)
    at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
    at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
    at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
    at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
    at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
    at 
org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146)
    at org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:545)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:629)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2916)
    at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1100)
    at scala.Option.getOrElse(Option.scala:201)
    at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1094)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:64)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:441)
    at 

[jira] [Updated] (SPARK-45593) Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.Intern

2023-10-18 Thread yikaifei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikaifei updated SPARK-45593:
-
Description: 
Building a runnable distribution from master code running spark-sql raise error 
"java.lang.ClassNotFoundException: 
org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess";

Reproducing steps: first, clone the Spark master code, then:
 # Build runnable distribution from master code by : `/dev/make-distribution.sh 
--name ui --pip --tgz  -Phive -Phive-thriftserver -Pyarn -Pconnect`
 # Install runnable distribution package
 # Run `bin/spark-sql`

Got error:
{code:java}
 23/10/18 20:51:46 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511)
    at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515)
    at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
    at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
    at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
    at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
    at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
    at 
org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146)
    at org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:545)
    at org.apache.spark.SparkContext.(SparkContext.scala:629)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2916)
    at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1100)
    at scala.Option.getOrElse(Option.scala:201)
    at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1094)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:64)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:441)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:177)
    at 

[jira] [Updated] (SPARK-45593) Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.Intern

2023-10-18 Thread yikaifei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikaifei updated SPARK-45593:
-
Description: 
Building a runnable distribution from master code and running spark-sql raises the error 
"java.lang.ClassNotFoundException: 
org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess".

Reproduction steps: first, clone the Spark master code, then:
 # Build a runnable distribution from the master code with: `./dev/make-distribution.sh 
--name ui --pip --tgz -Phive -Phive-thriftserver -Pyarn -Pconnect`
 # Install the runnable distribution package
 # Run `bin/spark-sql`

Got error:
{code:java}
 23/10/18 20:51:46 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
    at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
    at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511)
    at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515)
    at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
    at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
    at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
    at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
    at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
    at 
org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146)
    at org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:545)
    at org.apache.spark.SparkContext.(SparkContext.scala:629)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2916)
    at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1100)
    at scala.Option.getOrElse(Option.scala:201)
    at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1094)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:64)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:441)
    at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:177)
    at 

[jira] [Updated] (SPARK-45594) Auto repartition before writing data into partitioned or bucket table

2023-10-18 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-45594:

Description: 
Now, when writing data into a partitioned table, there will be at least 
*dynamicPartitions * shuffleNum* files; when writing data into a bucket table, 
there will be at least *bucketNums * shuffleNum* files.
We can shuffle by the dynamic partition or bucket columns before writing data 
into the table, so that only shuffleNum files are created (see the sketch below).
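For illustration, a minimal sketch of the manual workaround this proposal would automate (assumptions: the DataFrame `df`, the dynamic partition column `dt`, and the table name `t1` are placeholders, and the actual change would insert the shuffle in the planner rather than in user code):

{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical example: shuffle by the dynamic partition column before the write,
// so each dynamic partition is produced by a single task instead of by every shuffle task.
def writePartitioned(df: DataFrame): Unit = {
  df.repartition(col("dt"))     // one shuffle keyed on the partition column
    .write
    .partitionBy("dt")          // dynamic partitioning on the same column
    .mode("overwrite")
    .saveAsTable("t1")          // placeholder table name
}
{code}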

  was:
Now, when writing data into partitioned table, there will be at least 
*dynamicPartitions * ShuffleNum* files; when writing data into bucket table, 
there will be at least *bucketNums * shuffleNum* files.
We can shuffle by the dynamic partitions or bucket columns before writing data 
into the table.


>  Auto repartition before writing data into partitioned or bucket table
> --
>
> Key: SPARK-45594
> URL: https://issues.apache.org/jira/browse/SPARK-45594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wan Kun
>Priority: Major
>
> Now, when writing data into partitioned table, there will be at least 
> *dynamicPartitions * ShuffleNum* files; when writing data into bucket table, 
> there will be at least *bucketNums * shuffleNum* files.
> We can shuffle by the dynamic partitions or bucket columns before writing 
> data into the table and will create ShuffleNum files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45595) Expose SQLSTATE in error message

2023-10-18 Thread Serge Rielau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serge Rielau updated SPARK-45595:
-
Summary: Expose SQLSTATE in error message  (was: Expose SQLSTATRE in 
errormessage)

> Expose SQLSTATE in error message
> 
>
> Key: SPARK-45595
> URL: https://issues.apache.org/jira/browse/SPARK-45595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Priority: Major
>
> When using spark.sql.error.messageFormat in MINIMAL or STANDARD mode the 
> SQLSTATE is exposed;
> We want to extend this to PRETTY mode, now that all errors have SQLSTATEs
> We propose to trail the SQLSTATE after the text message, so it does not take 
> away from the reading experience of the message, while still being easily 
> found by tooling or humans.
> [<error class>] <message text> SQLSTATE: <SQLSTATE>
> 
> Example:
> {{[DIVIDE_BY_ZERO] ** Division by zero. Use `try_divide` to tolerate divisor 
> being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to 
> "false" to bypass this error. SQLSTATE: 22013}}
> {{{}== SQL(line 1, position 8){}}}{{{}==
> {}}}{{{}SELECT 1/0
> {}}}{{   ^^^}}
> Other options considered have been:
> {{[DIVIDE_BY_ZERO](22013) ** Division by zero. Use `try_divide` to tolerate 
> divisor being 0 and return NULL instead. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error. }}
> {{{}== SQL(line 1, position 8){}}}{{{}==
> {}}}{{{}SELECT 1/0
> {}}}{{   ^^^}}
> {{and}}
> [DIVIDE_BY_ZERO] ** Division by zero. Use `try_divide` to tolerate 
> divisor being 0 and return NULL instead. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error.}}
> {{{}== SQL(line 1, position 8){}}}{{{}=={}}}
> {{SELECT 1/0}}
> {{   ^^^}}
> SQLSTATE: 22013
> }}{{{}{{}}{}}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45595) Expose SQLSTATRE in errormessage

2023-10-18 Thread Serge Rielau (Jira)
Serge Rielau created SPARK-45595:


 Summary: Expose SQLSTATRE in errormessage
 Key: SPARK-45595
 URL: https://issues.apache.org/jira/browse/SPARK-45595
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Serge Rielau


When using spark.sql.error.messageFormat in MINIMAL or STANDARD mode the 
SQLSTATE is exposed;
We want to extend this to PRETTY mode, now that all errors have SQLSTATEs

We propose to trail the SQLSTATE after the text message, so it does not take 
away from the reading experience of the message, while still being easily found 
by tooling or humans.
[<error class>] <message text> SQLSTATE: <SQLSTATE>


Example:


{{[DIVIDE_BY_ZERO] ** Division by zero. Use `try_divide` to tolerate divisor 
being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to 
"false" to bypass this error. SQLSTATE: 22013}}
{{{}== SQL(line 1, position 8){}}}{{{}==
{}}}{{{}SELECT 1/0
{}}}{{   ^^^}}

Other options considered have been:
{{[DIVIDE_BY_ZERO](22013) ** Division by zero. Use `try_divide` to tolerate 
divisor being 0 and return NULL instead. If necessary set 
"spark.sql.ansi.enabled" to "false" to bypass this error. }}
{{{}== SQL(line 1, position 8){}}}{{{}==
{}}}{{{}SELECT 1/0
{}}}{{   ^^^}}


{{and}}

[DIVIDE_BY_ZERO] ** Division by zero. Use `try_divide` to tolerate divisor 
being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to 
"false" to bypass this error.}}
{{{}== SQL(line 1, position 8){}}}{{{}=={}}}
{{SELECT 1/0}}
{{   ^^^}}
SQLSTATE: 22013
}}{{{}{{}}{}}}
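For context, a hedged sketch of how the current PRETTY-mode message can be observed (assumptions: it relies only on the spark.sql.ansi.enabled and spark.sql.error.messageFormat configs already named above; the exact text varies by version, and the trailing SQLSTATE shown in the proposal is not yet emitted in PRETTY mode):

{code:java}
import org.apache.spark.sql.SparkSession

// Minimal local repro of the DIVIDE_BY_ZERO example used in this proposal.
val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.sql.ansi.enabled", "true")
  .config("spark.sql.error.messageFormat", "PRETTY")
  .getOrCreate()

try spark.sql("SELECT 1/0").collect()
catch { case e: Exception => println(e.getMessage) }   // [DIVIDE_BY_ZERO] ... (no SQLSTATE yet in PRETTY mode)

spark.stop()
{code}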



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45594) Auto repartition before writing data into partitioned or bucket table

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45594:
---
Labels: pull-request-available  (was: )

>  Auto repartition before writing data into partitioned or bucket table
> --
>
> Key: SPARK-45594
> URL: https://issues.apache.org/jira/browse/SPARK-45594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wan Kun
>Priority: Major
>  Labels: pull-request-available
>
> Now, when writing data into partitioned table, there will be at least 
> *dynamicPartitions * ShuffleNum* files; when writing data into bucket table, 
> there will be at least *bucketNums * shuffleNum* files.
> We can shuffle by the dynamic partitions or bucket columns before writing 
> data into the table and will create ShuffleNum files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45552) Introduce flexible parameters to assertDataFrameEqual

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45552:
---
Labels: pull-request-available  (was: )

> Introduce flexible parameters to assertDataFrameEqual
> -
>
> Key: SPARK-45552
> URL: https://issues.apache.org/jira/browse/SPARK-45552
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Add new parameters maxErrors, showOnlyDiff, maxRowsShow, ignoreColumnNames to 
> the assertDataFrameEqual.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45570) Spark job hangs due to task launch thread failed to create

2023-10-18 Thread lifulong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lifulong updated SPARK-45570:
-
Attachment: image-2023-10-18-18-18-36-132.png

> Spark job hangs due to task launch thread failed to create
> --
>
> Key: SPARK-45570
> URL: https://issues.apache.org/jira/browse/SPARK-45570
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.5.0
> Environment: spark.speculation is use default value false
> spark version 3.1.2
>  
>Reporter: lifulong
>Priority: Major
> Attachments: image-2023-10-18-18-18-36-132.png
>
>
> Spark job hangs while the web UI shows one task in the running stage that keeps 
> running for multiple hours, while the other tasks finished in a few minutes. The 
> executor will never report the task launch failure to the driver.
>  
> Below is spark task execute thread launch log:
> 23/10/17 04:45:42 ERROR Inbox: An error happened while processing message in 
> the inbox for Executor
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>         at org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
>         at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:173)
>         at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45570) Spark job hangs due to task launch thread failed to create

2023-10-18 Thread lifulong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776595#comment-17776595
 ] 

lifulong commented on SPARK-45570:
--

!image-2023-10-18-18-18-36-132.png!
Catching the thread-creation exception from the line "threadPool.execute(tr)" and calling
execBackend.statusUpdate(taskDescription.taskId, TaskState.FAILED, 
EMPTY_BYTE_BUFFER)
after the exception should, in theory, fix this problem (a sketch follows below).
Is this solution ok?
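A minimal, self-contained sketch of that idea (assumptions: this is not the actual Spark patch; LaunchGuardSketch and reportTaskFailed are placeholders standing in for Executor.launchTask and execBackend.statusUpdate):

{code:java}
import java.nio.ByteBuffer
import java.util.concurrent.{Executors, RejectedExecutionException}

// Standalone illustration of the proposed guard (not the actual Spark patch):
// if the task-launch thread cannot be created, report the task as FAILED
// instead of silently dropping it and leaving the job hanging.
object LaunchGuardSketch {
  private val threadPool = Executors.newCachedThreadPool()

  // Stand-in for execBackend.statusUpdate(taskId, TaskState.FAILED, EMPTY_BYTE_BUFFER).
  private def reportTaskFailed(taskId: Long): Unit =
    println(s"statusUpdate(taskId=$taskId, state=FAILED, data=${ByteBuffer.allocate(0)})")

  def launchTask(taskId: Long, work: Runnable): Unit = {
    try {
      threadPool.execute(work)
    } catch {
      // "unable to create new native thread" surfaces as an OutOfMemoryError;
      // a shut-down or saturated pool surfaces as a RejectedExecutionException.
      case _: OutOfMemoryError | _: RejectedExecutionException => reportTaskFailed(taskId)
    }
  }

  def main(args: Array[String]): Unit = {
    launchTask(0L, () => println("task body"))
    threadPool.shutdown()
  }
}
{code}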

> Spark job hangs due to task launch thread failed to create
> --
>
> Key: SPARK-45570
> URL: https://issues.apache.org/jira/browse/SPARK-45570
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.5.0
> Environment: spark.speculation is use default value false
> spark version 3.1.2
>  
>Reporter: lifulong
>Priority: Major
> Attachments: image-2023-10-18-18-18-36-132.png
>
>
> Spark job hangs while the web UI shows one task in the running stage that keeps 
> running for multiple hours, while the other tasks finished in a few minutes. The 
> executor will never report the task launch failure to the driver.
>  
> Below is spark task execute thread launch log:
> 23/10/17 04:45:42 ERROR Inbox: An error happened while processing message in 
> the inbox for Executor
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>         at org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
>         at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:173)
>         at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45588) Minor scaladoc improvement in StreamingForeachBatchHelper

2023-10-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-45588:
-
Issue Type: Improvement  (was: Bug)
  Priority: Trivial  (was: Major)

> Minor scaladoc improvement in StreamingForeachBatchHelper
> -
>
> Key: SPARK-45588
> URL: https://issues.apache.org/jira/browse/SPARK-45588
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Raghu Angadi
>Priority: Trivial
>  Labels: pull-request-available
>
> Document RunnerCleaner in StreamingForeachBatchHelper.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug

2023-10-18 Thread Emil Ejbyfeldt (Jira)
Emil Ejbyfeldt created SPARK-45592:
--

 Summary: AQE and InMemoryTableScanExec correctness bug
 Key: SPARK-45592
 URL: https://issues.apache.org/jira/browse/SPARK-45592
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Emil Ejbyfeldt


The following query should return 100
{code:java}
import org.apache.spark.sql.functions.{col, min}
import org.apache.spark.storage.StorageLevel
import spark.implicits._

val df = spark.range(0, 100, 1, 5).map(l => (l, l))
val ee = df.select($"_1".as("src"), $"_2".as("dst"))
  .persist(StorageLevel.MEMORY_AND_DISK)

ee.count()
val minNbrs1 = ee
  .groupBy("src").agg(min(col("dst")).as("min_number"))
  .persist(StorageLevel.MEMORY_AND_DISK)
val join = ee.join(minNbrs1, "src")
join.count(){code}
but on spark 3.5.0 there is a correctness bug causing it to return `104800` or 
some other smaller value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45596) Use java.lang.ref.Cleaner instead of org.apache.spark.sql.connect.client.util.Cleaner

2023-10-18 Thread Min Zhao (Jira)
Min Zhao created SPARK-45596:


 Summary: Use java.lang.ref.Cleaner instead of 
org.apache.spark.sql.connect.client.util.Cleaner
 Key: SPARK-45596
 URL: https://issues.apache.org/jira/browse/SPARK-45596
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Min Zhao


Now that we have updated the JDK to 17, we should replace this class with 
[[java.lang.ref.Cleaner]].
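For reference, a minimal sketch of the JDK 9+ java.lang.ref.Cleaner API that would replace the hand-rolled helper (illustrative only; CleanerSketch, TrackedResource, and State are placeholder names, not Spark Connect client code):

{code:java}
import java.lang.ref.Cleaner

// Illustrative sketch of the JDK 9+ Cleaner API that would replace the
// hand-rolled org.apache.spark.sql.connect.client.util.Cleaner.
object CleanerSketch {
  private val cleaner: Cleaner = Cleaner.create()

  // The cleanup action must not capture the tracked object itself,
  // otherwise the object would never become phantom reachable.
  private final class State(name: String) extends Runnable {
    override def run(): Unit = println(s"releasing resources for $name")
  }

  final class TrackedResource(name: String) extends AutoCloseable {
    private val cleanable = cleaner.register(this, new State(name))
    // Explicit release; the Cleaner also runs the action at GC time if close() is forgotten.
    override def close(): Unit = cleanable.clean()
  }

  def main(args: Array[String]): Unit = {
    val r = new TrackedResource("demo")
    r.close()
  }
}
{code}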



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45596) Use java.lang.ref.Cleaner instead of org.apache.spark.sql.connect.client.util.Cleaner

2023-10-18 Thread Min Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Zhao updated SPARK-45596:
-
Attachment: image-2023-10-19-02-25-57-966.png

> Use java.lang.ref.Cleaner instead of 
> org.apache.spark.sql.connect.client.util.Cleaner
> -
>
> Key: SPARK-45596
> URL: https://issues.apache.org/jira/browse/SPARK-45596
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Min Zhao
>Priority: Minor
> Attachments: image-2023-10-19-02-25-57-966.png
>
>
> Now, we have updated JDK to 17,  so should replace this class by 
> [[java.lang.ref.Cleaner]].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45596) Use java.lang.ref.Cleaner instead of org.apache.spark.sql.connect.client.util.Cleaner

2023-10-18 Thread Min Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Zhao updated SPARK-45596:
-
Description: 
Now that we have updated the JDK to 17, we should replace this class with 
[[java.lang.ref.Cleaner]].

 

!image-2023-10-19-02-25-57-966.png!

  was:Now, we have updated JDK to 17,  so should replace this class by 
[[java.lang.ref.Cleaner]].


> Use java.lang.ref.Cleaner instead of 
> org.apache.spark.sql.connect.client.util.Cleaner
> -
>
> Key: SPARK-45596
> URL: https://issues.apache.org/jira/browse/SPARK-45596
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Min Zhao
>Priority: Minor
> Attachments: image-2023-10-19-02-25-57-966.png
>
>
> Now, we have updated JDK to 17,  so should replace this class by 
> [[java.lang.ref.Cleaner]].
>  
> !image-2023-10-19-02-25-57-966.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45597) Support creating table using a Python data source in SQL

2023-10-18 Thread Allison Wang (Jira)
Allison Wang created SPARK-45597:


 Summary: Support creating table using a Python data source in SQL
 Key: SPARK-45597
 URL: https://issues.apache.org/jira/browse/SPARK-45597
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Support creating a table using a Python data source in a SQL query:

For instance:

`CREATE TABLE tableName() USING <data source name> OPTIONS <options>`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45598) Delta table 3.0-rc2 not working with Spark Connect 3.5.0

2023-10-18 Thread Faiz Halde (Jira)
Faiz Halde created SPARK-45598:
--

 Summary: Delta table 3.0-rc2 not working with Spark Connect 3.5.0
 Key: SPARK-45598
 URL: https://issues.apache.org/jira/browse/SPARK-45598
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0
Reporter: Faiz Halde


Spark version 3.5.0

Spark Connect version 3.5.0

Delta table 3.0-rc2

When trying to run a simple job that writes to a delta table

{{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{val data = spark.read.json("profiles.json")}}
{{data.write.format("delta").save("/tmp/delta")}}

 

{{Error log in connect client}}

{{Exception in thread "main" org.apache.spark.SparkException: 
io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: cannot 
assign instance of java.lang.invoke.SerializedLambda to field 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in 
instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
{{    at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
{{    at 
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{...}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
{{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
{{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)}}
{{    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)}}
{{    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.to(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)}}
{{    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.toBuffer(GrpcExceptionConverter.scala:46)}}
{{    at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:554)}}
{{    at 
org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257)}}
{{    at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221)}}
{{    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:210)}}
{{    at Main$.main(Main.scala:11)}}
{{    at Main.main(Main.scala)}}

 

{{Error log in spark connect 

[jira] [Updated] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2023-10-18 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-45599:

Labels: data-corruption  (was: )

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Major
>  Labels: data-corruption
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. in python/pyspark if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> 

[jira] [Updated] (SPARK-45230) Adjust sorter for Aggregate after SMJ

2023-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45230:
---
Labels: pull-request-available  (was: )

> Adjust sorter for Aggregate after SMJ
> -
>
> Key: SPARK-45230
> URL: https://issues.apache.org/jira/browse/SPARK-45230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wan Kun
>Priority: Major
>  Labels: pull-request-available
>
> If there is an aggregate operator after the SMJ and the grouping expressions 
> of the aggregate operator contain all the join keys of the streamed side, we can 
> add a sorter on the streamed side of the SMJ, so that the aggregate can be 
> converted to a SortAggregate, which will be faster than a HashAggregate.
> For example, with table t1(a, b, c) and t2(x, y, z):
> {code:java}
> SELECT a, b, sum(c)
> FROM t1
> JOIN t2
> ON t1.b = t2.y
> GROUP BY a, b
> {code}
> Before this PR:
> {code:java}
> Scan(t1)Scan(t2)
> |  |
> |  |
> Exchange 1   Exchange 2
>  \ /
>Sort(t1.b)Sort(t2.y) 
>  \ /
>SMJ (t1.b = t2.y)
> |
> |
>  HashAggregate
> {code}
> We can change Sort(t1.b) to Sort(t1.b, t1.a) on the left side of the SMJ, 
> and then the following aggregate can be converted to a SortAggregate, which 
> will be faster.
> {code:java}
> Scan(t1)Scan(t2)
> |  |
> |  |
> Exchange 1   Exchange 2
>  \/
>  Sort(t1.b, t1.a)   Sort(t2.y) 
>  \ /
>   SMJ (t1.b = t2.y)
> |
> |
>   SortAggregate
> {code}
> Benchmark result
> {code:java}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.16
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Aggregate after SMJ:  Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> Hash aggregate after SMJ  50508  50667
>  225  0.42408.4   1.0X
> Sort aggregate after SMJ  27556  27734
>  252  0.81314.0   1.8X
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2023-10-18 Thread Robert Joseph Evans (Jira)
Robert Joseph Evans created SPARK-45599:
---

 Summary: Percentile can produce a wrong answer if -0.0 and 0.0 are 
mixed in the dataset
 Key: SPARK-45599
 URL: https://issues.apache.org/jira/browse/SPARK-45599
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.2.3, 3.3.0
Reporter: Robert Joseph Evans


I think this actually impacts all versions that have ever supported percentile 
and it may impact other things because the bug is in OpenHashMap.
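For context, a minimal Scala sketch of the general hazard (an illustration only, not the OpenHashMap code path itself): -0.0 and 0.0 compare equal, but their bit patterns and Double hash codes differ, so any hash structure that keys on the bit pattern can treat them as two different keys even though == says they are the same.

{code:java}
// Illustration only: equality vs. bit pattern vs. hash code for -0.0 and 0.0 on the JVM.
object SignedZeroSketch {
  def main(args: Array[String]): Unit = {
    val negZero = -0.0
    val posZero = 0.0

    println(negZero == posZero)                              // true
    println(java.lang.Double.doubleToRawLongBits(negZero))   // -9223372036854775808
    println(java.lang.Double.doubleToRawLongBits(posZero))   // 0
    println(java.lang.Double.hashCode(negZero))              // -2147483648
    println(java.lang.Double.hashCode(posZero))              // 0
  }
}
{code}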

 

I am really surprised that we caught this bug because everything has to hit 
just wrong to make it happen. In python/pyspark, if you run

 
{code:python}
from math import *
from pyspark.sql.types import *

data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
(5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
(-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
(2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
(-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
(1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
(-5.682293414619055e+46,), (-4.585039307326895e+166,), 
(-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
(None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
(-5.046677974902737e+132,), (-5.490780063080251e-09,), 
(1.703824427218836e-55,), (-1.1961155424160076e+102,), 
(1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
(5.120795466142678e-215,), (-9.01991342808203e+282,), 
(4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
(3.4543959813437507e-304,), (-7.590734560275502e-63,), 
(9.376528689861087e+117,), (-2.1696969883753554e-292,), 
(7.227411393136537e+206,), (-2.428999624265911e-293,), 
(5.741383583382542e-14,), (-1.4882040107841963e+286,), 
(2.1973064836362255e-159,), (0.028096279323357867,), (8.475809563703283e-64,), 
(3.002803065141241e-139,), (-1.1041009815645263e+203,), 
(1.8461539468514548e-225,), (-5.620339412794757e-251,), 
(3.5103766991437114e-60,), (2.4925669515657655e+165,), 
(3.217759099462207e+108,), (-8.796717685143486e+203,), 
(2.037360925124577e+292,), (-6.542279108216022e+206,), 
(-7.951172614280046e-74,), (6.226527569272003e+152,), 
(-5.673977270111637e-84,), (-1.0186016078084965e-281,), 
(1.7976931348623157e+308,), (4.205809391029644e+137,), 
(-9.871721037428167e+119,), (None,), (-1.6663254121185628e-256,), 
(1.0075153091760986e-236,), (-0.0,), (0.0,), (1.7976931348623157e+308,), 
(4.3214483342777574e-117,), (-7.973642629411105e-89,), 
(-1.1028137694801181e-297,), (2.9000325280299273e-39,), 
(-1.077534929323113e-264,), (-1.1847952892216515e+137,), (nan,), 
(7.849390806334983e+226,), (-1.831402251805194e+65,), 
(-2.664533698035492e+203,), (-2.2385155698231885e+285,), 
(-2.3016388448634844e-155,), (-9.607772864590422e+217,), 
(3.437191836077251e+209,), (1.9846569552093057e-137,), 
(-3.010452936419635e-233,), (1.4309793775440402e-87,), 
(-2.9383643865423363e-103,), (-4.696878567317712e-162,), 
(8.391630779050713e-135,), (nan,), (-3.3885098786542755e-128,), 
(-4.5154178008513483e-122,), (nan,), (nan,), (2.187766760184779e+306,), 
(7.679268835670585e+223,), (6.3131466321042515e+153,), 
(1.779652973678931e+173,), (9.247723870123388e-295,), (5.891823952773268e+98,), 
(inf,), (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
(-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
(2.5212410617263588e-282,), (-2.646144697462316e-35,), 
(-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,), 
(5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,), 
(-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,), 
(4920675036.053339,), (None,), (4.4501477170144023e-308,), 
(2.176024662699802e-210,), (-5.046677974902737e+132,), 
(-5.490780063080251e-09,), (1.703824427218836e-55,), 
(-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,), 
(5.4470705929955455e-86,), (5.120795466142678e-215,), 
(-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,), 
(-1.8891559842111865e+63,), (3.4543959813437507e-304,), 
(-7.590734560275502e-63,), (9.376528689861087e+117,), 
(-2.1696969883753554e-292,), (7.227411393136537e+206,), 
(-2.428999624265911e-293,), (5.741383583382542e-14,), 
(-1.4882040107841963e+286,), (2.1973064836362255e-159,), 
(0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,), 
(-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
(-5.620339412794757e-251,), (3.5103766991437114e-60,), 
(2.4925669515657655e+165,), (3.217759099462207e+108,), 
(-8.796717685143486e+203,), (2.037360925124577e+292,), 
(-6.542279108216022e+206,), (-7.951172614280046e-74,), 
(6.226527569272003e+152,), (-5.673977270111637e-84,), 
(-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
(4.205809391029644e+137,), (-9.871721037428167e+119,), 

[jira] [Commented] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs

2023-10-18 Thread Philip Dakin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776921#comment-17776921
 ] 

Philip Dakin commented on SPARK-44734:
--

[~panbingkun] please make sure changes operate well with 
https://github.com/apache/spark/pull/43369.

> Add documentation for type casting rules in Python UDFs/UDTFs
> -
>
> Key: SPARK-44734
> URL: https://issues.apache.org/jira/browse/SPARK-44734
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> In addition to type mappings between Spark data types and Python data types 
> (SPARK-44733), we should add the type casting rules for regular and 
> arrow-optimized Python UDFs/UDTFs. 
> We currently have this table in code:
>  * Arrow: 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329]
>  * Python UDF: 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116]
> We should add a proper documentation page for the type casting rules. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs

2023-10-18 Thread Philip Dakin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776921#comment-17776921
 ] 

Philip Dakin edited comment on SPARK-44734 at 10/18/23 9:47 PM:


[~panbingkun] please make sure changes operate well with 
https://issues.apache.org/jira/browse/SPARK-44733 in 
[https://github.com/apache/spark/pull/43369]


was (Author: JIRAUSER302581):
[~panbingkun] please make sure changes operate well with 
https://github.com/apache/spark/pull/43369.

> Add documentation for type casting rules in Python UDFs/UDTFs
> -
>
> Key: SPARK-44734
> URL: https://issues.apache.org/jira/browse/SPARK-44734
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> In addition to type mappings between Spark data types and Python data types 
> (SPARK-44733), we should add the type casting rules for regular and 
> arrow-optimized Python UDFs/UDTFs. 
> We currently have this table in code:
>  * Arrow: 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329]
>  * Python UDF: 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116]
> We should add a proper documentation page for the type casting rules. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45582) Streaming aggregation in complete mode should not refer to store instance after commit

2023-10-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-45582:


Assignee: Anish Shrigondekar

> Streaming aggregation in complete mode should not refer to store instance 
> after commit
> --
>
> Key: SPARK-45582
> URL: https://issues.apache.org/jira/browse/SPARK-45582
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Streaming aggregation in complete mode should not refer to store instance 
> after commit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45582) Streaming aggregation in complete mode should not refer to store instance after commit

2023-10-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-45582.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43413
[https://github.com/apache/spark/pull/43413]

> Streaming aggregation in complete mode should not refer to store instance 
> after commit
> --
>
> Key: SPARK-45582
> URL: https://issues.apache.org/jira/browse/SPARK-45582
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Streaming aggregation in complete mode should not refer to store instance 
> after commit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone

2023-10-18 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-44526:
---
Affects Version/s: 3.4.1
   (was: 3.5.0)

> Porting k8s PVC reuse logic to spark standalone
> ---
>
> Key: SPARK-44526
> URL: https://issues.apache.org/jira/browse/SPARK-44526
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.4.1
>Reporter: Faiz Halde
>Priority: Major
>
> Hi,
> This ticket is meant to understand the work that would be involved in porting 
> the k8s PVC reuse feature onto the spark standalone cluster manager which 
> reuses the shuffle files present locally in the disk
> We are a heavy user of spot instances and we suffer from spot terminations 
> impacting our long running jobs
> The logic in `KubernetesLocalDiskShuffleExecutorComponents` itself is not 
> that much. However when I tried this on the 
> `LocalDiskShuffleExecutorComponents` it was not a successful experiment which 
> suggests there is more to recovering shuffle files
> I'd like to understand what will be the work involved for this. We'll be more 
> than happy to contribute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone

2023-10-18 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-44526:
---
Affects Version/s: 3.5.0
   (was: 3.4.1)

> Porting k8s PVC reuse logic to spark standalone
> ---
>
> Key: SPARK-44526
> URL: https://issues.apache.org/jira/browse/SPARK-44526
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> Hi,
> This ticket is meant to understand the work that would be involved in porting 
> the k8s PVC reuse feature onto the spark standalone cluster manager which 
> reuses the shuffle files present locally in the disk
> We are a heavy user of spot instances and we suffer from spot terminations 
> impacting our long running jobs
> The logic in `KubernetesLocalDiskShuffleExecutorComponents` itself is not 
> that much. However when I tried this on the 
> `LocalDiskShuffleExecutorComponents` it was not a successful experiment which 
> suggests there is more to recovering shuffle files
> I'd like to understand what will be the work involved for this. We'll be more 
> than happy to contribute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45598) Delta table 3.0-rc2 not working with Spark Connect 3.5.0

2023-10-18 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45598:
---
Description: 
Spark version 3.5.0

Spark Connect version 3.5.0

Delta table 3.0-rc2

Spark connect server was started using

*{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages 
org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
 --conf 
'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}*

{{Connect client depends on}}
*libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"*
*and the connect libraries*
 

When trying to run a simple job that writes to a delta table

{{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{val data = spark.read.json("profiles.json")}}
{{data.write.format("delta").save("/tmp/delta")}}

 

{{Error log in connect client}}

{{Exception in thread "main" org.apache.spark.SparkException: 
io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    ...
    at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)
    at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)
    at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
    at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.to(GrpcExceptionConverter.scala:46)
    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
    at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.toBuffer(GrpcExceptionConverter.scala:46)
    at

[jira] [Updated] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2023-10-18 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-45599:

Priority: Blocker  (was: Major)

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: data-corruption
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
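> As a minimal illustration of why mixing the two matters (plain Python, not the actual OpenHashMap code): 0.0 and -0.0 compare equal as doubles, but their bit patterns differ, so any map that keys on the raw bits keeps two separate buckets, and per-value counts that a rank-based answer like percentile relies on get split.
> {code:python}
> import struct
>
> print(0.0 == -0.0)                    # True: equal as values
> print(struct.pack('>d', 0.0).hex())   # 0000000000000000
> print(struct.pack('>d', -0.0).hex())  # 8000000000000000: different raw bits
>
> # A map keyed on the raw bit pattern treats them as two distinct keys,
> # so a count that should be 4 ends up split 2 and 2.
> counts = {}
> for v in [0.0, -0.0, 0.0, -0.0]:
>     key = struct.pack('>d', v)
>     counts[key] = counts.get(key, 0) + 1
> print(len(counts), sorted(counts.values()))  # 2 [2, 2]
> {code}
> 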
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. In Python/PySpark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> 

[jira] [Updated] (SPARK-45559) Support spark.read.schema(...) for Python data source API

2023-10-18 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45559:
-
Description: 
Support `spark.read.schema(...)` for Python data source read.

Add test cases that pass the schema as a DDL string instead of a StructType, 
covering both a positive case and a negative case where the string fails to 
parse with fromDDL.

  was:Support `spark.read.schema(...)` for Python data source read


> Support spark.read.schema(...) for Python data source API
> -
>
> Key: SPARK-45559
> URL: https://issues.apache.org/jira/browse/SPARK-45559
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Support `spark.read.schema(...)` for Python data source read.
> Add test cases that pass the schema as a DDL string instead of a StructType, 
> covering both a positive case and a negative case where the string fails to 
> parse with fromDDL.
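>
> A rough sketch of the call shapes such tests would exercise. Only {{DataFrameReader.schema}} accepting either a DDL string or a StructType is assumed; the format name {{my_python_source}} is a hypothetical stand-in for a registered Python data source, and where exactly the parse error surfaces may differ between classic PySpark and Spark Connect.
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
>
> spark = SparkSession.builder.getOrCreate()
>
> # Positive cases: the schema can be given as a DDL string or a StructType.
> ddl_schema = "id INT, name STRING"
> struct_schema = StructType([
>     StructField("id", IntegerType()),
>     StructField("name", StringType()),
> ])
>
> # "my_python_source" is hypothetical; any registered Python data source would do.
> reader_from_ddl = spark.read.schema(ddl_schema).format("my_python_source")
> reader_from_struct = spark.read.schema(struct_schema).format("my_python_source")
>
> # Negative case: a schema string that is not valid DDL is expected to fail parsing
> # (in classic PySpark this raises at schema(); with Spark Connect it may surface later).
> try:
>     spark.read.schema("id INT,,")
> except Exception as exc:
>     print(type(exc).__name__)
> {code}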



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.

2023-10-18 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776783#comment-17776783
 ] 

Bruce Robbins commented on SPARK-45583:
---

Strangely, I cannot reproduce. Is some setting required?
{noformat}
sql("select version()").show(false)
+----------------------------------------------+
|version()                                     |
+----------------------------------------------+
|3.5.0 ce5ddad990373636e94071e7cef2f31021add07b|
+----------------------------------------------+

scala> sql("""WITH people as (
  SELECT * FROM (VALUES 
(1, 'Peter'), 
(2, 'Homer'), 
(3, 'Ned'),
(3, 'Jenny')
  ) AS Idiots(id, FirstName)
), location as (
  SELECT * FROM (VALUES
(1, 'sample0'),
(1, 'sample1'),
(2, 'sample2')  
  ) as Locations(id, address)
)SELECT
  *
FROM
  people
FULL OUTER JOIN
  location
ON
  people.id = location.id""").show(false)
+---+---------+----+-------+
|id |FirstName|id  |address|
+---+---------+----+-------+
|1  |Peter    |1   |sample0|
|1  |Peter    |1   |sample1|
|2  |Homer    |2   |sample2|
|3  |Ned      |NULL|NULL   |
|3  |Jenny    |NULL|NULL   |
+---+---------+----+-------+

scala> 
{noformat}

> Spark SQL returning incorrect values for full outer join on keys with the 
> same name.
> 
>
> Key: SPARK-45583
> URL: https://issues.apache.org/jira/browse/SPARK-45583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Huw
>Priority: Major
>
> The following query gives the wrong results.
>
> {code:sql}
> WITH people as (
>   SELECT * FROM (VALUES
>     (1, 'Peter'),
>     (2, 'Homer'),
>     (3, 'Ned'),
>     (3, 'Jenny')
>   ) AS Idiots(id, FirstName)
> ), location as (
>   SELECT * FROM (VALUES
>     (1, 'sample0'),
>     (1, 'sample1'),
>     (2, 'sample2')
>   ) as Locations(id, address)
> )
> SELECT
>   *
> FROM
>   people
> FULL OUTER JOIN
>   location
> ON
>   people.id = location.id
> {code}
> We find the following table:
> ||id: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |null|Ned|null|null|
> |null|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|
> But clearly the first `id` column is wrong; the nulls should be 3.
> If we rename the id column in (only) the person table to pid we get the 
> correct results:
> ||pid: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |3|Ned|null|null|
> |3|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|
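>
> One way to make the duplicate key names unambiguous (whether or not it sidesteps whatever is happening here) is to project explicit aliases instead of SELECT *. A self-contained PySpark sketch, assuming only an active SparkSession; the alias names are illustrative:
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
>
> # Same data as above, but every output column gets a distinct name
> # instead of relying on SELECT * with two columns both called `id`.
> result = spark.sql("""
> WITH people AS (
>   SELECT * FROM (VALUES
>     (1, 'Peter'), (2, 'Homer'), (3, 'Ned'), (3, 'Jenny')
>   ) AS Idiots(id, FirstName)
> ), location AS (
>   SELECT * FROM (VALUES
>     (1, 'sample0'), (1, 'sample1'), (2, 'sample2')
>   ) AS Locations(id, address)
> )
> SELECT
>   people.id   AS person_id,
>   people.FirstName,
>   location.id AS location_id,
>   location.address
> FROM people
> FULL OUTER JOIN location
> ON people.id = location.id
> """)
> result.show()
> {code}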



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


