[jira] [Updated] (SPARK-44347) Upgrade janino to 3.1.10
[ https://issues.apache.org/jira/browse/SPARK-44347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44347: --- Labels: pull-request-available (was: ) > Upgrade janino to 3.1.10 > > > Key: SPARK-44347 > URL: https://issues.apache.org/jira/browse/SPARK-44347 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45023) SPIP: Python Stored Procedures
[ https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776963#comment-17776963 ] Abhinav Kumar commented on SPARK-45023: --- Not sure where we are with this. Looks like we are not progressing with this. I do see value of SQL based Stored Procedure (to begin with just grouped sqls) - user can reveal the intent of usage and Spark can optimize holistically. Should we discuss and modify proposal accordingly? Please suggest. > SPIP: Python Stored Procedures > -- > > Key: SPARK-45023 > URL: https://issues.apache.org/jira/browse/SPARK-45023 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Stored procedures are an extension of the ANSI SQL standard. They play a > crucial role in improving the capabilities of SQL by encapsulating complex > logic into reusable routines. > This proposal aims to extend Spark SQL by introducing support for stored > procedures, starting with Python as the procedural language. This addition > will allow users to execute procedural programs, leveraging programming > constructs of Python to perform tasks with complex logic. Additionally, users > can persist these procedural routines in catalogs such as HMS for future > reuse. By providing this functionality, we intend to seamlessly empower Spark > users to integrate with Python routines within their SQL workflows. > {*}SPIP{*}: > [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45600) Separate the Python data source logic from DataFrameReader
Allison Wang created SPARK-45600: Summary: Separate the Python data source logic from DataFrameReader Key: SPARK-45600 URL: https://issues.apache.org/jira/browse/SPARK-45600 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Currently we have added a few instance variables to store information for the Python data source reader. We should have a dedicated reader class for the Python data source to keep the current DataFrameReader clean.
[jira] [Resolved] (SPARK-45585) Fix time format and redirection issues in SparkSubmit tests
[ https://issues.apache.org/jira/browse/SPARK-45585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-45585. -- Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 43421 [https://github.com/apache/spark/pull/43421] > Fix time format and redirection issues in SparkSubmit tests > --- > > Key: SPARK-45585 > URL: https://issues.apache.org/jira/browse/SPARK-45585 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45585) Fix time format and redirection issues in SparkSubmit tests
[ https://issues.apache.org/jira/browse/SPARK-45585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-45585: Assignee: Kent Yao > Fix time format and redirection issues in SparkSubmit tests > --- > > Key: SPARK-45585 > URL: https://issues.apache.org/jira/browse/SPARK-45585 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-45546) Make publish-snapshot support package first then deploy
[ https://issues.apache.org/jira/browse/SPARK-45546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-45546: -- Reverted at https://github.com/apache/spark/commit/706872d4de2374d1faf84d8706611a092c0b6e76 and https://github.com/apache/spark/commit/e37dd3ab8e0707eead2cb068bc19456349ccdd86 > Make publish-snapshot support package first then deploy > --- > > Key: SPARK-45546 > URL: https://issues.apache.org/jira/browse/SPARK-45546 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45546) Make publish-snapshot support package first then deploy
[ https://issues.apache.org/jira/browse/SPARK-45546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45546. -- Resolution: Invalid > Make publish-snapshot support package first then deploy > --- > > Key: SPARK-45546 > URL: https://issues.apache.org/jira/browse/SPARK-45546 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45546) Make publish-snapshot support package first then deploy
[ https://issues.apache.org/jira/browse/SPARK-45546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45546: Assignee: (was: Yang Jie) > Make publish-snapshot support package first then deploy > --- > > Key: SPARK-45546 > URL: https://issues.apache.org/jira/browse/SPARK-45546 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45546) Make publish-snapshot support package first then deploy
[ https://issues.apache.org/jira/browse/SPARK-45546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-45546: - Fix Version/s: (was: 4.0.0) > Make publish-snapshot support package first then deploy > --- > > Key: SPARK-45546 > URL: https://issues.apache.org/jira/browse/SPARK-45546 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45603) merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry
Kent Yao created SPARK-45603: Summary: merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry Key: SPARK-45603 URL: https://issues.apache.org/jira/browse/SPARK-45603 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 4.0.0 Reporter: Kent Yao
[jira] [Resolved] (SPARK-45553) Deprecate assertPandasOnSparkEqual
[ https://issues.apache.org/jira/browse/SPARK-45553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45553. -- Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 43426 [https://github.com/apache/spark/pull/43426] > Deprecate assertPandasOnSparkEqual > -- > > Key: SPARK-45553 > URL: https://issues.apache.org/jira/browse/SPARK-45553 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, Tests >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1, 4.0.0 > > > We will add new APIs for DataFrame, Series and Index separately, and we > should deprecate assertPandasOnSparkEqual. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45553) Deprecate assertPandasOnSparkEqual
[ https://issues.apache.org/jira/browse/SPARK-45553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45553: Assignee: Haejoon Lee > Deprecate assertPandasOnSparkEqual > -- > > Key: SPARK-45553 > URL: https://issues.apache.org/jira/browse/SPARK-45553 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, Tests >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > > We will add new APIs for DataFrame, Series and Index separately, and we > should deprecate assertPandasOnSparkEqual. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45603) merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry
[ https://issues.apache.org/jira/browse/SPARK-45603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45603: --- Labels: pull-request-available (was: ) > merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry > > > Key: SPARK-45603 > URL: https://issues.apache.org/jira/browse/SPARK-45603 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45588) Minor scaladoc improvement in StreamingForeachBatchHelper
[ https://issues.apache.org/jira/browse/SPARK-45588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-45588: Assignee: Raghu Angadi > Minor scaladoc improvement in StreamingForeachBatchHelper > - > > Key: SPARK-45588 > URL: https://issues.apache.org/jira/browse/SPARK-45588 > Project: Spark > Issue Type: Improvement > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Raghu Angadi >Assignee: Raghu Angadi >Priority: Trivial > Labels: pull-request-available > > Document RunnerCleaner in StreamingForeachBatchHelper. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45588) Minor scaladoc improvement in StreamingForeachBatchHelper
[ https://issues.apache.org/jira/browse/SPARK-45588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-45588. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43424 [https://github.com/apache/spark/pull/43424] > Minor scaladoc improvement in StreamingForeachBatchHelper > - > > Key: SPARK-45588 > URL: https://issues.apache.org/jira/browse/SPARK-45588 > Project: Spark > Issue Type: Improvement > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Raghu Angadi >Assignee: Raghu Angadi >Priority: Trivial > Labels: pull-request-available > Fix For: 4.0.0 > > > Document RunnerCleaner in StreamingForeachBatchHelper. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45589) Supplementary exception class
[ https://issues.apache.org/jira/browse/SPARK-45589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45589. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43427 [https://github.com/apache/spark/pull/43427] > Supplementary exception class > - > > Key: SPARK-45589 > URL: https://issues.apache.org/jira/browse/SPARK-45589 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45589) Supplementary exception class
[ https://issues.apache.org/jira/browse/SPARK-45589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45589: Assignee: BingKun Pan > Supplementary exception class > - > > Key: SPARK-45589 > URL: https://issues.apache.org/jira/browse/SPARK-45589 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45220) Refine docstring of `DataFrame.join`
[ https://issues.apache.org/jira/browse/SPARK-45220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45220: Assignee: Allison Wang > Refine docstring of `DataFrame.join` > > > Key: SPARK-45220 > URL: https://issues.apache.org/jira/browse/SPARK-45220 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > > Refine the docstring of `DataFrame.join`. > The examples should also include: left join, left anti join, join on multiple > columns and column names, join on multiple conditions
[jira] [Resolved] (SPARK-45220) Refine docstring of `DataFrame.join`
[ https://issues.apache.org/jira/browse/SPARK-45220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45220. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43039 [https://github.com/apache/spark/pull/43039] > Refine docstring of `DataFrame.join` > > > Key: SPARK-45220 > URL: https://issues.apache.org/jira/browse/SPARK-45220 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Refine the docstring of `DataFrame.join`. > The examples should also include: left join, left anti join, join on multiple > columns and column names, join on multiple conditions
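The left anti join called out in the description above can be illustrated without Spark. The sketch below is a plain-Python emulation of the semantics such a docstring example would demonstrate (the `left_anti_join` helper and the sample rows are hypothetical, not taken from the PR):

```python
# Plain-Python sketch of left anti join semantics: keep the rows of `left`
# whose join key has no match anywhere in `right`.
def left_anti_join(left, right, key):
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]

employees = [
    {"dept_id": 1, "name": "Alice"},
    {"dept_id": 2, "name": "Bob"},
    {"dept_id": 3, "name": "Carol"},
]
departments = [{"dept_id": 1}, {"dept_id": 2}]

# Carol's dept_id has no match in `departments`, so only her row survives.
print(left_anti_join(employees, departments, "dept_id"))
```

In PySpark the equivalent would pass `how="left_anti"` to `DataFrame.join`; the point of the helper is only to show which rows such a join keeps.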
[jira] [Commented] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1043#comment-1043 ] BingKun Pan commented on SPARK-44734: - Okay, let me take a look at the two JIRAs above first. > Add documentation for type casting rules in Python UDFs/UDTFs > - > > Key: SPARK-44734 > URL: https://issues.apache.org/jira/browse/SPARK-44734 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > In addition to type mappings between Spark data types and Python data types > (SPARK-44733), we should add the type casting rules for regular and > arrow-optimized Python UDFs/UDTFs. > We currently have this table in code: > * Arrow: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329] > * Python UDF: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116] > We should add a proper documentation page for the type casting rules. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45543) InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions
[ https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-45543: --- Summary: InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions (was: InferWindowGroupLimit causes bug if the window frame is different between rank-like functions and others) > InferWindowGroupLimit causes bug if the other window functions haven't the > same window frame as the rank-like functions > --- > > Key: SPARK-45543 > URL: https://issues.apache.org/jira/browse/SPARK-45543 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core, SQL >Affects Versions: 3.5.0 >Reporter: Ron Serruya >Assignee: Jiaan Geng >Priority: Critical > Labels: correctness, data-loss, pull-request-available > > First, it's my first bug, so I'm hoping I'm doing it right, also, as I'm not > very knowledgeable about spark internals, I hope I diagnosed the problem > correctly > I found the degradation in spark version 3.5.0: > When using multiple windows that share the same partition and ordering (but > with different "frame boundaries", where one window is a ranking function, > "WindowGroupLimit" is added to the plan causing wrong values to be created > from the other windows. 
> *This behavior didn't exist in versions 3.3 and 3.4.* > Example: > > {code:python} > import pyspark > from pyspark.sql import functions as F, Window > df = spark.createDataFrame([ > {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020}, > {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022}, > {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023}, > {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021}, > ]) > # Create first window for row number > window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year')) > # Create additional window from the first window with unbounded frame > unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, > Window.unboundedFollowing) > # Try to keep the first row by year, and also collect all scores into a list > df2 = df.withColumn( > 'rn', > F.row_number().over(window_spec) > ).withColumn( > 'all_scores', > F.collect_list('score').over(unbound_spec) > ){code} > So far everything works, and if we display df2: > > {noformat} > ++--+-++---+--+ > |name|row_id|score|year|rn |all_scores| > ++--+-++---+--+ > |Dave|1 |3|2023|1 |[3, 2, 1] | > |Dave|1 |2|2022|2 |[3, 2, 1] | > |Dave|1 |1|2020|3 |[3, 2, 1] | > |Amy |2 |6|2021|1 |[6] | > ++--+-++---+--+{noformat} > > However, once we filter to keep only the first row number: > > {noformat} > df2.filter("rn=1").show(truncate=False) > ++--+-++---+--+ > |name|row_id|score|year|rn |all_scores| > ++--+-++---+--+ > |Dave|1 |3|2023|1 |[3] | > |Amy |2 |6|2021|1 |[6] | > ++--+-++---+--+{noformat} > As you can see just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`, however, the same result happens with > other functions, such as max, min, count, etc) > > Now, if instead of using the two windows we used, I will use the first window > and a window with different ordering, or create a completely new window with > same partition but no ordering, it will work fine: > {code:python} > new_window = Window.partitionBy('row_id', > 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) > df3 = df.withColumn( > 'rn', > F.row_number().over(window_spec) > ).withColumn( > 'all_scores', > F.collect_list('score').over(new_window) > ) > df3.filter("rn=1").show(truncate=False){code} > {noformat} > ++--+-++---+--+ > |name|row_id|score|year|rn |all_scores| > ++--+-++---+--+ > |Dave|1 |3|2023|1 |[3, 2, 1] | > |Amy |2 |6|2021|1 |[6] | > ++--+-++---+--+ > {noformat} > In addition, if we use all 3 windows to create 3 different columns, it will > also work ok. So it seems the issue happens only when all the windows used > share the same partition and ordering. > Here is the final plan for the faulty dataframe: > {noformat} > df2.filter("rn=1").explain() > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Filter (rn#9 = 1) > +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L > DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), > currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) >
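The expected (pre-3.5.0) behavior in the report can be reproduced without Spark. This plain-Python emulation of `row_number` plus a `collect_list` over an unbounded frame (all names here are illustrative, not Spark APIs) shows why `all_scores` for the rn=1 rows should remain [3, 2, 1] and [6] even after filtering:

```python
from collections import defaultdict

rows = [
    {"row_id": 1, "name": "Dave", "score": 1, "year": 2020},
    {"row_id": 1, "name": "Dave", "score": 2, "year": 2022},
    {"row_id": 1, "name": "Dave", "score": 3, "year": 2023},
    {"row_id": 2, "name": "Amy", "score": 6, "year": 2021},
]

# Partition by (row_id, name), order by year descending.
partitions = defaultdict(list)
for r in rows:
    partitions[(r["row_id"], r["name"])].append(r)

out = []
for part in partitions.values():
    part.sort(key=lambda r: r["year"], reverse=True)
    # Unbounded frame: collect_list must see the whole partition,
    # regardless of each row's rank.
    all_scores = [r["score"] for r in part]
    for rn, r in enumerate(part, start=1):
        out.append({**r, "rn": rn, "all_scores": all_scores})

# Filtering on rn afterwards must not shrink all_scores; the reported
# bug is that Spark 3.5.0's group-limit pushdown truncates it to [3].
first_rows = [r for r in out if r["rn"] == 1]
for r in first_rows:
    print(r["name"], r["all_scores"])  # Dave [3, 2, 1] / Amy [6]
```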
[jira] [Resolved] (SPARK-45586) Reduce compilation time for large expression trees
[ https://issues.apache.org/jira/browse/SPARK-45586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-45586. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43420 [https://github.com/apache/spark/pull/43420] > Reduce compilation time for large expression trees > -- > > Key: SPARK-45586 > URL: https://issues.apache.org/jira/browse/SPARK-45586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Kelvin Jiang >Assignee: Kelvin Jiang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Some rules, such as TypeCoercion, are very expensive when the query plan > contains very large expression trees. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45586) Reduce compilation time for large expression trees
[ https://issues.apache.org/jira/browse/SPARK-45586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-45586: --- Assignee: Kelvin Jiang > Reduce compilation time for large expression trees > -- > > Key: SPARK-45586 > URL: https://issues.apache.org/jira/browse/SPARK-45586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Kelvin Jiang >Assignee: Kelvin Jiang >Priority: Major > Labels: pull-request-available > > Some rules, such as TypeCoercion, are very expensive when the query plan > contains very large expression trees. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45507) Correctness bug in correlated scalar subqueries with COUNT aggregates
[ https://issues.apache.org/jira/browse/SPARK-45507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-45507. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43341 [https://github.com/apache/spark/pull/43341] > Correctness bug in correlated scalar subqueries with COUNT aggregates > - > > Key: SPARK-45507 > URL: https://issues.apache.org/jira/browse/SPARK-45507 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Andy Lam >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code:java} > > create view if not exists t1(a1, a2) as values (0, 1), (1, 2); > create view if not exists t2(b1, b2) as values (0, 2), (0, 3); > create view if not exists t3(c1, c2) as values (0, 2), (0, 3); > -- Example 1 > select ( > select SUM(l.cnt + r.cnt) > from (select count(*) cnt from t2 where t1.a1 = t2.b1 having cnt = 0) l > join (select count(*) cnt from t3 where t1.a1 = t3.c1 having cnt = 0) r > on l.cnt = r.cnt > ) from t1 > -- Correct answer: (null, 0) > +--+ > |scalarsubquery(c1, c1)| > +--+ > |null | > |null | > +--+ > -- Example 2 > select ( select sum(cnt) from (select count(*) cnt from t2 where t1.c1 = > t2.c1) ) from t1 > -- Correct answer: (2, 0) > +--+ > |scalarsubquery(c1)| > +--+ > |2 | > |null | > +--+ > -- Example 3 > select ( select count(*) from (select count(*) cnt from t2 where t1.c1 = > t2.c1) ) from t1 > -- Correct answer: (1, 1) > +--+ > |scalarsubquery(c1)| > +--+ > |1 | > |0 | > +--+ {code} > > > DB fiddle for correctness > check:[https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/10403#] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
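The expected answer (2, 0) in the second example follows from COUNT's behavior on empty groups: COUNT over zero rows yields 0, never NULL, so the outer sum sees a real row. Below is a plain-Python emulation of that correlated subquery; note it assumes the intended correlation is `t1.a1 = t2.b1`, since the views are declared with columns a1/b1 even though the quoted SQL writes c1:

```python
t1 = [(0, 1), (1, 2)]   # rows of view t1(a1, a2)
t2 = [(0, 2), (0, 3)]   # rows of view t2(b1, b2)

def scalar_subquery(a1):
    # Inner query: count(*) over the t2 rows matching the outer row.
    # An empty match still produces one row with cnt = 0 -- COUNT is the
    # one aggregate that never returns NULL on empty input.
    cnt = sum(1 for (b1, _) in t2 if b1 == a1)
    # Outer query: sum(cnt) over that single inner row is just cnt.
    return cnt

result = [scalar_subquery(a1) for (a1, _) in t1]
print(result)  # the correct answer from the report: [2, 0]
```

The reported bug is exactly this distinction: Spark's decorrelation produced NULL where the COUNT-on-empty-group semantics require 0.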
[jira] [Updated] (SPARK-45596) Use java.lang.ref.Cleaner instead of org.apache.spark.sql.connect.client.util.Cleaner
[ https://issues.apache.org/jira/browse/SPARK-45596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45596: --- Labels: pull-request-available (was: ) > Use java.lang.ref.Cleaner instead of > org.apache.spark.sql.connect.client.util.Cleaner > - > > Key: SPARK-45596 > URL: https://issues.apache.org/jira/browse/SPARK-45596 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Min Zhao >Priority: Minor > Labels: pull-request-available > Attachments: image-2023-10-19-02-25-57-966.png > > > Now, we have updated JDK to 17, so should replace this class by > [[java.lang.ref.Cleaner]]. > > !image-2023-10-19-02-25-57-966.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45601) stackoverflow when executing rule ExtractWindowExpressions
JacobZheng created SPARK-45601: -- Summary: stackoverflow when executing rule ExtractWindowExpressions Key: SPARK-45601 URL: https://issues.apache.org/jira/browse/SPARK-45601 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.3 Reporter: JacobZheng I am encountering stackoverflow errors while executing the following test case. I looked at the source code and it is ExtractWindowExpressions not extracting the window correctly and encountering a dead loop at resolveOperatorsDownWithPruning that is causing it. {code:scala} // Some comments here test("agg filter contains window") { val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3") .withColumn("test", expr("count(col1) filter (where min(col1) over(partition by col2 order by col3)>1)")) src.show() } {code} Now my question is this kind of in agg filter (window) is the correct usage? Or should I add a check like spark sql and throw an error "It is not allowed to use window functions inside WHERE clause"? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45601) stackoverflow when executing rule ExtractWindowExpressions
[ https://issues.apache.org/jira/browse/SPARK-45601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JacobZheng updated SPARK-45601: --- Description: I am encountering stackoverflow errors while executing the following test case. I looked at the source code and it is ExtractWindowExpressions not extracting the window correctly and encountering a dead loop at resolveOperatorsDownWithPruning that is causing it. {code:scala} test("agg filter contains window") { val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3") .withColumn("test", expr("count(col1) filter (where min(col1) over(partition by col2 order by col3)>1)")) src.show() } {code} Now my question is this kind of in agg filter (window) is the correct usage? Or should I add a check like spark sql and throw an error "It is not allowed to use window functions inside WHERE clause"? was: I am encountering stackoverflow errors while executing the following test case. I looked at the source code and it is ExtractWindowExpressions not extracting the window correctly and encountering a dead loop at resolveOperatorsDownWithPruning that is causing it. {code:scala} // Some comments here test("agg filter contains window") { val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3") .withColumn("test", expr("count(col1) filter (where min(col1) over(partition by col2 order by col3)>1)")) src.show() } {code} Now my question is this kind of in agg filter (window) is the correct usage? Or should I add a check like spark sql and throw an error "It is not allowed to use window functions inside WHERE clause"? > stackoverflow when executing rule ExtractWindowExpressions > -- > > Key: SPARK-45601 > URL: https://issues.apache.org/jira/browse/SPARK-45601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: JacobZheng >Priority: Major > > I am encountering stackoverflow errors while executing the following test > case. 
I looked at the source code and it is ExtractWindowExpressions not > extracting the window correctly and encountering a dead loop at > resolveOperatorsDownWithPruning that is causing it. > {code:scala} > test("agg filter contains window") { > val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3") > .withColumn("test", > expr("count(col1) filter (where min(col1) over(partition by col2 > order by col3)>1)")) > src.show() > } > {code} > Now my question is this kind of in agg filter (window) is the correct usage? > Or should I add a check like spark sql and throw an error "It is not allowed > to use window functions inside WHERE clause"? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45602) Replace `s.c.MapOps.filterKeys` with `s.c.MapOps.view.filterKeys`
Yang Jie created SPARK-45602: Summary: Replace `s.c.MapOps.filterKeys` with `s.c.MapOps.view.filterKeys` Key: SPARK-45602 URL: https://issues.apache.org/jira/browse/SPARK-45602 Project: Spark Issue Type: Sub-task Components: Kubernetes, Spark Core, SQL, YARN Affects Versions: 4.0.0 Reporter: Yang Jie {code:java} /** Filters this map by retaining only keys satisfying a predicate. * @param p the predicate used to test keys * @return an immutable map consisting only of those key value pairs of this map where the key satisfies * the predicate `p`. The resulting map wraps the original map without copying any elements. */ @deprecated("Use .view.filterKeys(f). A future version will include a strict version of this method (for now, .view.filterKeys(p).toMap).", "2.13.0") def filterKeys(p: K => Boolean): MapView[K, V] = new MapView.FilterKeys(this, p) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45543) InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions
[ https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-45543: --- Summary: InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions (was: WindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions) > InferWindowGroupLimit causes bug if the other window functions haven't the > same window frame as the rank-like functions > --- > > Key: SPARK-45543 > URL: https://issues.apache.org/jira/browse/SPARK-45543 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core, SQL >Affects Versions: 3.5.0 >Reporter: Ron Serruya >Assignee: Jiaan Geng >Priority: Critical > Labels: correctness, data-loss, pull-request-available > > First, this is my first bug report, so I hope I'm doing it right; also, as > I'm not very knowledgeable about Spark internals, I hope I diagnosed the > problem correctly. > I found the degradation in Spark version 3.5.0: > When using multiple windows that share the same partition and ordering (but > with different "frame boundaries"), where one window is a ranking function, > "WindowGroupLimit" is added to the plan, causing wrong values to be produced > by the other windows. 
> *This behavior didn't exist in versions 3.3 and 3.4.* > Example: > > {code:python} > import pyspark > from pyspark.sql import functions as F, Window > df = spark.createDataFrame([ > {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020}, > {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022}, > {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023}, > {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021}, > ]) > # Create first window for row number > window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year')) > # Create additional window from the first window with unbounded frame > unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, > Window.unboundedFollowing) > # Try to keep the first row by year, and also collect all scores into a list > df2 = df.withColumn( > 'rn', > F.row_number().over(window_spec) > ).withColumn( > 'all_scores', > F.collect_list('score').over(unbound_spec) > ){code} > So far everything works, and if we display df2: > > {noformat} > +----+------+-----+----+---+----------+ > |name|row_id|score|year|rn |all_scores| > +----+------+-----+----+---+----------+ > |Dave|1     |3    |2023|1  |[3, 2, 1] | > |Dave|1     |2    |2022|2  |[3, 2, 1] | > |Dave|1     |1    |2020|3  |[3, 2, 1] | > |Amy |2     |6    |2021|1  |[6]       | > +----+------+-----+----+---+----------+{noformat} > > However, once we filter to keep only the first row number: > > {noformat} > df2.filter("rn=1").show(truncate=False) > +----+------+-----+----+---+----------+ > |name|row_id|score|year|rn |all_scores| > +----+------+-----+----+---+----------+ > |Dave|1     |3    |2023|1  |[3]       | > |Amy |2     |6    |2021|1  |[6]       | > +----+------+-----+----+---+----------+{noformat} > As you can see, just filtering changed the "all_scores" array for Dave. 
> (This example uses `collect_list`; however, the same result happens with > other functions, such as max, min, count, etc.) > > Now, if instead of the two windows above I use the first window together > with a window that has a different ordering, or create a completely new > window with the same partition but no ordering, it works fine: > {code:python} > new_window = Window.partitionBy('row_id', > 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) > df3 = df.withColumn( > 'rn', > F.row_number().over(window_spec) > ).withColumn( > 'all_scores', > F.collect_list('score').over(new_window) > ) > df3.filter("rn=1").show(truncate=False){code} > {noformat} > +----+------+-----+----+---+----------+ > |name|row_id|score|year|rn |all_scores| > +----+------+-----+----+---+----------+ > |Dave|1     |3    |2023|1  |[3, 2, 1] | > |Amy |2     |6    |2021|1  |[6]       | > +----+------+-----+----+---+----------+ > {noformat} > In addition, if we use all 3 windows to create 3 different columns, it will > also work fine. So it seems the issue happens only when all the windows used > share the same partition and ordering. > Here is the final plan for the faulty dataframe: > {noformat} > df2.filter("rn=1").explain() > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Filter (rn#9 = 1) > +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L > DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), > currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) >
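A plain-Python emulation (no Spark) of the reported plan makes the correctness issue concrete: filtering on the rank is only safe to push down as a group limit if the unbounded-frame aggregate has already been computed. The data and expected outputs mirror the report; everything else is a sketch, not Spark's implementation.

```python
rows = [  # (name, row_id, score, year), from the report's example
    ("Dave", 1, 1, 2020), ("Dave", 1, 2, 2022),
    ("Dave", 1, 3, 2023), ("Amy", 2, 6, 2021),
]

def partitions(rows):
    # Group by (row_id, name), like Window.partitionBy('row_id', 'name').
    parts = {}
    for r in rows:
        parts.setdefault((r[1], r[0]), []).append(r)
    return list(parts.values())

def compute(rows):
    # row_number over (order by year desc) plus collect_list(score)
    # over the same partition with an unbounded frame.
    out = []
    for part in partitions(rows):
        part = sorted(part, key=lambda r: -r[3])
        all_scores = [r[2] for r in part]
        for rn, r in enumerate(part, start=1):
            out.append((r[0], rn, all_scores))
    return out

# Correct: compute both window columns, then filter rn = 1.
correct = [r for r in compute(rows) if r[1] == 1]

# Faulty "group limit first": keep only the top row per partition
# *before* the unbounded-frame aggregate runs -- the reported bug.
limited = [sorted(p, key=lambda r: -r[3])[:1] for p in partitions(rows)]
faulty = [r for part in limited for r in compute(part) if r[1] == 1]

print(correct)  # [('Dave', 1, [3, 2, 1]), ('Amy', 1, [6])]
print(faulty)   # [('Dave', 1, [3]), ('Amy', 1, [6])]
```

The `faulty` result reproduces the `[3]` that the report observes for Dave, which is why the group limit must only be inferred when every window function shares the rank function's frame.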
[jira] [Updated] (SPARK-45543) InferWindowGroupLimit causes bug if the window frame is different between rank-like functions and others
[ https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-45543: --- Summary: InferWindowGroupLimit causes bug if the window frame is different between rank-like functions and others (was: InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions) > InferWindowGroupLimit causes bug if the window frame is different between > rank-like functions and others > > > Key: SPARK-45543 > URL: https://issues.apache.org/jira/browse/SPARK-45543 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core, SQL >Affects Versions: 3.5.0 >Reporter: Ron Serruya >Assignee: Jiaan Geng >Priority: Critical > Labels: correctness, data-loss, pull-request-available > > First, this is my first bug report, so I hope I'm doing it right; also, as > I'm not very knowledgeable about Spark internals, I hope I diagnosed the > problem correctly. > I found the degradation in Spark version 3.5.0: > When using multiple windows that share the same partition and ordering (but > with different "frame boundaries"), where one window is a ranking function, > "WindowGroupLimit" is added to the plan, causing wrong values to be produced > by the other windows. 
> *This behavior didn't exist in versions 3.3 and 3.4.* > Example: > > {code:python} > import pyspark > from pyspark.sql import functions as F, Window > df = spark.createDataFrame([ > {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020}, > {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022}, > {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023}, > {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021}, > ]) > # Create first window for row number > window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year')) > # Create additional window from the first window with unbounded frame > unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, > Window.unboundedFollowing) > # Try to keep the first row by year, and also collect all scores into a list > df2 = df.withColumn( > 'rn', > F.row_number().over(window_spec) > ).withColumn( > 'all_scores', > F.collect_list('score').over(unbound_spec) > ){code} > So far everything works, and if we display df2: > > {noformat} > +----+------+-----+----+---+----------+ > |name|row_id|score|year|rn |all_scores| > +----+------+-----+----+---+----------+ > |Dave|1     |3    |2023|1  |[3, 2, 1] | > |Dave|1     |2    |2022|2  |[3, 2, 1] | > |Dave|1     |1    |2020|3  |[3, 2, 1] | > |Amy |2     |6    |2021|1  |[6]       | > +----+------+-----+----+---+----------+{noformat} > > However, once we filter to keep only the first row number: > > {noformat} > df2.filter("rn=1").show(truncate=False) > +----+------+-----+----+---+----------+ > |name|row_id|score|year|rn |all_scores| > +----+------+-----+----+---+----------+ > |Dave|1     |3    |2023|1  |[3]       | > |Amy |2     |6    |2021|1  |[6]       | > +----+------+-----+----+---+----------+{noformat} > As you can see, just filtering changed the "all_scores" array for Dave. 
> (This example uses `collect_list`; however, the same result happens with > other functions, such as max, min, count, etc.) > > Now, if instead of the two windows above I use the first window together > with a window that has a different ordering, or create a completely new > window with the same partition but no ordering, it works fine: > {code:python} > new_window = Window.partitionBy('row_id', > 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) > df3 = df.withColumn( > 'rn', > F.row_number().over(window_spec) > ).withColumn( > 'all_scores', > F.collect_list('score').over(new_window) > ) > df3.filter("rn=1").show(truncate=False){code} > {noformat} > +----+------+-----+----+---+----------+ > |name|row_id|score|year|rn |all_scores| > +----+------+-----+----+---+----------+ > |Dave|1     |3    |2023|1  |[3, 2, 1] | > |Amy |2     |6    |2021|1  |[6]       | > +----+------+-----+----+---+----------+ > {noformat} > In addition, if we use all 3 windows to create 3 different columns, it will > also work fine. So it seems the issue happens only when all the windows used > share the same partition and ordering. > Here is the final plan for the faulty dataframe: > {noformat} > df2.filter("rn=1").explain() > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Filter (rn#9 = 1) > +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L > DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), > currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) > windowspecdefinition(row_id#1L, name#0,
[jira] [Commented] (SPARK-44817) SPIP: Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776976#comment-17776976 ] Abhinav Kumar commented on SPARK-44817: --- [~rakson] [~gurwls223] [~cloud_fan] - We find this issue quite common. Currently, the incremental stats collection is done mostly outside the Spark application as an end-of-day process (to avoid SLA breaches), and sometimes within the current application, if DML materially changes the stats. This proposal seems like a good idea, considering users can control it via a Spark parameter. Views? > SPIP: Incremental Stats Collection > -- > > Key: SPARK-44817 > URL: https://issues.apache.org/jira/browse/SPARK-44817 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Rakesh Raushan >Priority: Major > > Spark's Cost Based Optimizer is dependent on the table and column statistics. > After every execution of a DML query, table and column stats are invalidated if > auto update of stats collection is not turned on. To keep stats updated we > need to run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very > expensive. It is not feasible to run this command after every DML query. > Instead, we can incrementally update the stats during each DML query run > itself. This way our table and column stats would be fresh at all times > and CBO benefits can be applied. Initially, we can only update table-level > stats and gradually start updating column-level stats as well. > *Pros:* > 1. Optimizes queries over tables that are updated frequently. > 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE > STATISTICS` for updating stats. 
> [SPIP Document > |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
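A minimal sketch of the idea in the comment and SPIP, folding per-DML deltas into table-level stats instead of rescanning, in plain Python; the class and method names here are hypothetical illustrations, not Spark APIs.

```python
# Hypothetical sketch of incremental table-level stats: instead of a
# full `ANALYZE TABLE ... COMPUTE STATISTICS` rescan after each DML,
# fold the per-batch delta into the existing stats at write time.
class TableStats:
    def __init__(self, row_count=0, size_bytes=0):
        self.row_count = row_count
        self.size_bytes = size_bytes

    def apply_insert(self, rows_added, bytes_added):
        # O(1) update when the write commits; no table scan needed.
        self.row_count += rows_added
        self.size_bytes += bytes_added

    def apply_delete(self, rows_removed, bytes_removed):
        self.row_count -= rows_removed
        self.size_bytes -= bytes_removed

stats = TableStats(row_count=1_000_000, size_bytes=10_000_000)
stats.apply_insert(rows_added=5_000, bytes_added=50_000)      # one DML batch
stats.apply_delete(rows_removed=1_000, bytes_removed=10_000)  # another
print(stats.row_count, stats.size_bytes)  # 1004000 10040000
```

Table-level counters like these are the easy case the SPIP starts with; column-level stats (NDV, min/max, histograms) need mergeable sketches rather than simple addition.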
[jira] [Assigned] (SPARK-45507) Correctness bug in correlated scalar subqueries with COUNT aggregates
[ https://issues.apache.org/jira/browse/SPARK-45507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-45507: --- Assignee: Andy Lam > Correctness bug in correlated scalar subqueries with COUNT aggregates > - > > Key: SPARK-45507 > URL: https://issues.apache.org/jira/browse/SPARK-45507 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Andy Lam >Assignee: Andy Lam >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code:java} > > create view if not exists t1(a1, a2) as values (0, 1), (1, 2); > create view if not exists t2(b1, b2) as values (0, 2), (0, 3); > create view if not exists t3(c1, c2) as values (0, 2), (0, 3); > -- Example 1 > select ( > select SUM(l.cnt + r.cnt) > from (select count(*) cnt from t2 where t1.a1 = t2.b1 having cnt = 0) l > join (select count(*) cnt from t3 where t1.a1 = t3.c1 having cnt = 0) r > on l.cnt = r.cnt > ) from t1 > -- Correct answer: (null, 0) > +--+ > |scalarsubquery(c1, c1)| > +--+ > |null | > |null | > +--+ > -- Example 2 > select ( select sum(cnt) from (select count(*) cnt from t2 where t1.c1 = > t2.c1) ) from t1 > -- Correct answer: (2, 0) > +--+ > |scalarsubquery(c1)| > +--+ > |2 | > |null | > +--+ > -- Example 3 > select ( select count(*) from (select count(*) cnt from t2 where t1.c1 = > t2.c1) ) from t1 > -- Correct answer: (1, 1) > +--+ > |scalarsubquery(c1)| > +--+ > |1 | > |0 | > +--+ {code} > > > DB fiddle for correctness > check:[https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/10403#] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
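The "COUNT bug" behind these examples can be reproduced without Spark. Taking the ticket's Example 2 data (plain Python; a sketch of the SQL semantics, not of Spark's decorrelation code): per-row evaluation yields (2, 0), while a naive outer-join rewrite turns the empty group's COUNT of 0 into NULL.

```python
t1 = [(0, 1), (1, 2)]  # (a1, a2)
t2 = [(0, 2), (0, 3)]  # (b1, b2)

# Example 2, evaluated per outer row (the correct semantics):
#   select (select sum(cnt) from
#           (select count(*) cnt from t2 where t1.a1 = t2.b1)) from t1
correct = []
for a1, _ in t1:
    cnt = sum(1 for b1, _ in t2 if b1 == a1)  # COUNT(*) is 0 on empty input
    correct.append(cnt)  # sum over the single-row derived table = cnt
print(correct)  # [2, 0]

# A naive decorrelation: aggregate counts per join key, then outer-join
# back to t1, leaving NULL (None) where there is no match -- the COUNT
# bug, which silently drops the count-of-empty-group = 0 case.
counts = {}
for b1, _ in t2:
    counts[b1] = counts.get(b1, 0) + 1
buggy = [counts.get(a1) for a1, _ in t1]  # missing key -> None, not 0
print(buggy)  # [2, None]
```

The ticket's observed outputs (`null` where the correct answer is `0`) match the `buggy` list: the aggregate above the outer join must treat the no-match case as a COUNT of 0, not as NULL.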
[jira] [Updated] (SPARK-45555) Returning a debuggable object for failed assertion
[ https://issues.apache.org/jira/browse/SPARK-45555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45555: --- Labels: pull-request-available (was: ) > Returning a debuggable object for failed assertion > -- > > Key: SPARK-45555 > URL: https://issues.apache.org/jira/browse/SPARK-45555 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > To facilitate debugging, we should add a functionality to return debuggable > object when the assertion is failed from testing util function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45602) Replace `s.c.MapOps.filterKeys` with `s.c.MapOps.view.filterKeys`
[ https://issues.apache.org/jira/browse/SPARK-45602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45602: --- Labels: pull-request-available (was: ) > Replace `s.c.MapOps.filterKeys` with `s.c.MapOps.view.filterKeys` > - > > Key: SPARK-45602 > URL: https://issues.apache.org/jira/browse/SPARK-45602 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core, SQL, YARN >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > {code:java} > /** Filters this map by retaining only keys satisfying a predicate. > * @param p the predicate used to test keys > * @return an immutable map consisting only of those key value pairs of > this map where the key satisfies > * the predicate `p`. The resulting map wraps the original map > without copying any elements. > */ > @deprecated("Use .view.filterKeys(f). A future version will include a strict > version of this method (for now, .view.filterKeys(p).toMap).", "2.13.0") > def filterKeys(p: K => Boolean): MapView[K, V] = new MapView.FilterKeys(this, > p) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45511) SPIP: State Data Source - Reader
[ https://issues.apache.org/jira/browse/SPARK-45511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45511: --- Labels: SPIP pull-request-available (was: SPIP) > SPIP: State Data Source - Reader > > > Key: SPARK-45511 > URL: https://issues.apache.org/jira/browse/SPARK-45511 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Jungtaek Lim >Priority: Major > Labels: SPIP, pull-request-available > > State Store has been a black box from the introduction of the stateful > operator. It has been the “internal” data to the streaming query, and Spark > does not expose the data outside of the streaming query. There is no > feature/tool for users to read and modify the content of state stores. > Specific to the ability to read the state, the lack of feature brings up > various limitations like following: > * Users are unable to see the content in the state store, leading to > inability to debug. > * Users have to perform some indirect approach on verifying the content of > the state store in unit tests. The only option they can take is relying on > the output of the query. > Given that, we propose to introduce a feature which enables users to read the > state from the outside of the streaming query. > SPIP: > [https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45534) Use `java.lang.ref.Cleaner` instead of `finalize` for `RemoteBlockPushResolver`
[ https://issues.apache.org/jira/browse/SPARK-45534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-45534: --- Assignee: Min Zhao > Use `java.lang.ref.Cleaner` instead of `finalize` for > `RemoteBlockPushResolver` > --- > > Key: SPARK-45534 > URL: https://issues.apache.org/jira/browse/SPARK-45534 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Min Zhao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45534) Use `java.lang.ref.Cleaner` instead of `finalize` for `RemoteBlockPushResolver`
[ https://issues.apache.org/jira/browse/SPARK-45534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-45534. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43371 [https://github.com/apache/spark/pull/43371] > Use `java.lang.ref.Cleaner` instead of `finalize` for > `RemoteBlockPushResolver` > --- > > Key: SPARK-45534 > URL: https://issues.apache.org/jira/browse/SPARK-45534 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Min Zhao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
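For readers more familiar with Python than with `java.lang.ref.Cleaner`: the same finalize-to-explicit-cleaner shift exists there as `weakref.finalize` replacing `__del__`. An illustrative analogue of the pattern (not the Spark change itself; the class name is made up):

```python
import weakref

class PushResolverLike:
    """Holds a resource that must be released exactly once."""
    def __init__(self, log):
        # Register the cleanup action up front, like java.lang.ref.Cleaner:
        # the callback captures only what it needs (the log), never `self`,
        # so it cannot resurrect the object the way finalize() can.
        self._finalizer = weakref.finalize(self, log.append, "closed")

    def close(self):
        # Explicit, idempotent close: runs the registered action at most once.
        self._finalizer()

log = []
r = PushResolverLike(log)
r.close()
del r       # the finalizer already ran; garbage collection runs nothing twice
print(log)  # ['closed']
```

The key property in both languages is that cleanup is an explicit, idempotent action registered at construction time, rather than a resurrection-prone hook the garbage collector may or may not call promptly.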
[jira] [Resolved] (SPARK-45558) Introduce a metadata file for streaming stateful operator
[ https://issues.apache.org/jira/browse/SPARK-45558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-45558. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43393 [https://github.com/apache/spark/pull/43393] > Introduce a metadata file for streaming stateful operator > - > > Key: SPARK-45558 > URL: https://issues.apache.org/jira/browse/SPARK-45558 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Chaoqin Li >Assignee: Chaoqin Li >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > The information to store in the metadata file: > * operator name (no need to be unique among stateful operators in the query) > * state store name > * numColumnsPrefixKey: greater than 0 if prefix scan is enabled, 0 otherwise > The body of the metadata file will be in JSON format. The metadata file will > be versioned just like other streaming metadata files, to be future-proof. > The metadata file will expose more information about the state store, improve > debuggability, and facilitate the development of state-related features such > as reading and writing state and state repartitioning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
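A sketch of what such a versioned operator-metadata file could contain, following the fields listed in the ticket (the JSON layout and the sample values are illustrative assumptions; the real on-disk format is defined by the resolving PR):

```python
import json

# Illustrative versioned metadata for one stateful operator. The field
# names follow the ticket's list; the "version" envelope mirrors how other
# streaming metadata files stay future-proof.
metadata = {
    "version": 1,
    "operatorName": "stateStoreSave",  # need not be unique in the query
    "stateStoreName": "default",
    "numColumnsPrefixKey": 0,          # > 0 only when prefix scan is enabled
}

encoded = json.dumps(metadata)
decoded = json.loads(encoded)
assert decoded["version"] == 1  # readers should check the version first
print(decoded["operatorName"])  # stateStoreSave
```

Writing the version as the first thing a reader checks is what lets later Spark versions evolve the body without breaking old checkpoints.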
[jira] [Created] (SPARK-45590) okio-1.15.0 CVE-2023-3635
Colm O hEigeartaigh created SPARK-45590: --- Summary: okio-1.15.0 CVE-2023-3635 Key: SPARK-45590 URL: https://issues.apache.org/jira/browse/SPARK-45590 Project: Spark Issue Type: Task Components: Build Affects Versions: 3.5.0 Reporter: Colm O hEigeartaigh CVE-2023-3635 is being flagged against okio-1.15.0, which is present in the Spark 3.5.0 build: * ./spark-3.5.0-bin-without-hadoop/jars/okio-1.15.0.jar * ./spark-3.5.0-bin-hadoop3/jars/okio-1.15.0.jar I don't see okio in the dependency tree; it must be coming in via some profile. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45558) Introduce a metadata file for streaming stateful operator
[ https://issues.apache.org/jira/browse/SPARK-45558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-45558: Assignee: Chaoqin Li > Introduce a metadata file for streaming stateful operator > - > > Key: SPARK-45558 > URL: https://issues.apache.org/jira/browse/SPARK-45558 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Chaoqin Li >Assignee: Chaoqin Li >Priority: Major > Labels: pull-request-available > > The information to store in the metadata file: > * operator name (no need to be unique among stateful operators in the query) > * state store name > * numColumnsPrefixKey: greater than 0 if prefix scan is enabled, 0 otherwise > The body of the metadata file will be in JSON format. The metadata file will > be versioned just like other streaming metadata files, to be future-proof. > The metadata file will expose more information about the state store, improve > debuggability, and facilitate the development of state-related features such > as reading and writing state and state repartitioning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45542) Replace `setSafeMode(HdfsConstants.SafeModeAction, boolean)` with `setSafeMode(SafeModeAction, boolean)`
[ https://issues.apache.org/jira/browse/SPARK-45542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45542: - Assignee: Yang Jie > Replace `setSafeMode(HdfsConstants.SafeModeAction, boolean)` with > `setSafeMode(SafeModeAction, boolean)` > > > Key: SPARK-45542 > URL: https://issues.apache.org/jira/browse/SPARK-45542 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > > {code:java} > /** > * Enter, leave or get safe mode. > * > * @param action > * One of SafeModeAction.ENTER, SafeModeAction.LEAVE and > * SafeModeAction.GET. > * @param isChecked > * If true check only for Active NNs status, else check first NN's > * status. > * > * @see > org.apache.hadoop.hdfs.protocol.ClientProtocol#setSafeMode(HdfsConstants.SafeModeAction, > * boolean) > * > * @deprecated please instead use > * {@link DistributedFileSystem#setSafeMode(SafeModeAction, > boolean)}. > */ > @Deprecated > public boolean setSafeMode(HdfsConstants.SafeModeAction action, > boolean isChecked) throws IOException { > return dfs.setSafeMode(action, isChecked); > } {code} > > `setSafeMode(HdfsConstants.SafeModeAction, boolean)` is `Deprecated` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45574) Add :: syntax as a shorthand for casting
[ https://issues.apache.org/jira/browse/SPARK-45574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Mitic updated SPARK-45574: --- Description: Adds the `::` syntax as syntactic sugar for casting columns. This is a pretty common syntax, and it was accepted by the SQL API (was: Adds the `::` syntax as syntactic sugar for casting columns. This is a pretty common syntax, and it was accepted by the SQL API in the [Semi-Structured Data API PRD]( [https://docs.google.com/document/d/1yNf0oE7XNZpLvsWly-MxZaxdlvMdRlZ1ZjSndtmoiWs/edit#heading=h.k50kjbi5yepj] ).) > Add :: syntax as a shorthand for casting > > > Key: SPARK-45574 > URL: https://issues.apache.org/jira/browse/SPARK-45574 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Ivan Mitic >Priority: Major > Labels: pull-request-available, release-notes > > Adds the `::` syntax as syntactic sugar for casting columns. This is a pretty > common syntax, and it was accepted by the SQL API -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45035) Support ignoreCorruptFiles for multiline CSV
[ https://issues.apache.org/jira/browse/SPARK-45035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-45035. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42979 [https://github.com/apache/spark/pull/42979] > Support ignoreCorruptFiles for multiline CSV > > > Key: SPARK-45035 > URL: https://issues.apache.org/jira/browse/SPARK-45035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yaohua Zhao >Assignee: Jia Fan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Today, `ignoreCorruptFiles` does not work well for multiline CSV mode. > {code:java} > spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")val > testCorruptDF0 = spark.read.option("ignoreCorruptFiles", > "true").option("multiline", "true").csv("/tmp/sourcepath/").show() {code} > It throws an exception instead of ignoring silently: > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 4940.0 (TID 4031) (10.68.177.106 executor 0): > com.univocity.parsers.common.TextParsingException: > java.lang.IllegalStateException - Error reading from input > Parser Configuration: CsvParserSettings: > Auto configuration enabled=true > Auto-closing enabled=true > Autodetect column delimiter=false > Autodetect quotes=false > Column reordering enabled=true > Delimiters for detection=null > Empty value= > Escape unquoted values=false > Header extraction enabled=null > Headers=null > Ignore leading whitespaces=false > Ignore leading whitespaces in quotes=false > Ignore trailing whitespaces=false > Ignore trailing whitespaces in quotes=false > Input buffer size=1048576 > Input reading on separate thread=false > Keep escape sequences=false > Keep quotes=false > Length of content displayed on error=1000 > Line separator detection enabled=true > Maximum number of characters per 
column=-1 > Maximum number of columns=20480 > Normalize escaped line separators=true > Null value= > Number of records to read=all > Processor=none > Restricting data in exceptions=false > RowProcessor error handler=null > Selected fields=none > Skip bits as whitespace=true > Skip empty lines=true > Unescaped quote handling=STOP_AT_DELIMITERFormat configuration: > CsvFormat: > Comment character=# > Field delimiter=, > Line separator (normalized)=\n > Line separator sequence=\n > Quote character=" > Quote escape character=\ > Quote escape escape character=null > Internal state when error was thrown: line=0, column=0, record=0 > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402) > at > com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277) > at > com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.(UnivocityParser.scala:463) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46... > {code} > It is because the multiline parsing uses a different RDD (`BinaryFileRDD`) > which does not go through `FileScanRDD`. We could potentially add this > support to `BinaryFileRDD`, or even reuse the `FileScanRDD` for multiline > parsing mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
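The intended `ignoreCorruptFiles` behavior (skip unreadable inputs rather than abort the whole job) can be sketched in plain Python. This is an illustration of the semantics only, not Spark's `FileScanRDD` implementation; the helper name and in-memory "files" are made up for the example.

```python
import csv
import io

# Minimal sketch of `ignoreCorruptFiles` semantics: parse each input and,
# on failure, either skip the file (flag on) or fail everything (flag off).
def read_all(files, ignore_corrupt_files=False):
    rows = []
    for name, data in files.items():
        try:
            text = data.decode("utf-8")  # corrupt bytes fail here
            rows.extend(csv.reader(io.StringIO(text)))
        except UnicodeDecodeError:
            if not ignore_corrupt_files:
                raise  # mirrors the job abort in the stack trace above
            # flag on: skip silently, as the multiline CSV path should
    return rows

files = {
    "good.csv": b"a,b\n1,2\n",
    "bad.csv": b"\xff\xfe broken",  # undecodable, stands in for a corrupt file
}
print(read_all(files, ignore_corrupt_files=True))  # [['a', 'b'], ['1', '2']]
```

The fix described in the ticket is essentially to route the multiline path through machinery that wraps each file's parse in this kind of per-file error boundary, which `BinaryFileRDD` currently lacks.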
[jira] [Assigned] (SPARK-45035) Support ignoreCorruptFiles for multiline CSV
[ https://issues.apache.org/jira/browse/SPARK-45035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-45035: Assignee: Jia Fan > Support ignoreCorruptFiles for multiline CSV > > > Key: SPARK-45035 > URL: https://issues.apache.org/jira/browse/SPARK-45035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yaohua Zhao >Assignee: Jia Fan >Priority: Major > Labels: pull-request-available > > Today, `ignoreCorruptFiles` does not work well for multiline CSV mode. > {code:java} > spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")val > testCorruptDF0 = spark.read.option("ignoreCorruptFiles", > "true").option("multiline", "true").csv("/tmp/sourcepath/").show() {code} > It throws an exception instead of ignoring silently: > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 4940.0 (TID 4031) (10.68.177.106 executor 0): > com.univocity.parsers.common.TextParsingException: > java.lang.IllegalStateException - Error reading from input > Parser Configuration: CsvParserSettings: > Auto configuration enabled=true > Auto-closing enabled=true > Autodetect column delimiter=false > Autodetect quotes=false > Column reordering enabled=true > Delimiters for detection=null > Empty value= > Escape unquoted values=false > Header extraction enabled=null > Headers=null > Ignore leading whitespaces=false > Ignore leading whitespaces in quotes=false > Ignore trailing whitespaces=false > Ignore trailing whitespaces in quotes=false > Input buffer size=1048576 > Input reading on separate thread=false > Keep escape sequences=false > Keep quotes=false > Length of content displayed on error=1000 > Line separator detection enabled=true > Maximum number of characters per column=-1 > Maximum number of columns=20480 > Normalize escaped line separators=true > Null value= > Number of records to 
read=all > Processor=none > Restricting data in exceptions=false > RowProcessor error handler=null > Selected fields=none > Skip bits as whitespace=true > Skip empty lines=true > Unescaped quote handling=STOP_AT_DELIMITERFormat configuration: > CsvFormat: > Comment character=# > Field delimiter=, > Line separator (normalized)=\n > Line separator sequence=\n > Quote character=" > Quote escape character=\ > Quote escape escape character=null > Internal state when error was thrown: line=0, column=0, record=0 > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402) > at > com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277) > at > com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.(UnivocityParser.scala:463) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46... > {code} > It is because the multiline parsing uses a different RDD (`BinaryFileRDD`) > which does not go through `FileScanRDD`. We could potentially add this > support to `BinaryFileRDD`, or even reuse the `FileScanRDD` for multiline > parsing mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45589) Supplementary exception class
[ https://issues.apache.org/jira/browse/SPARK-45589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45589: --- Labels: pull-request-available (was: ) > Supplementary exception class > - > > Key: SPARK-45589 > URL: https://issues.apache.org/jira/browse/SPARK-45589 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45370) Fix python test when ansi mode enabled
[ https://issues.apache.org/jira/browse/SPARK-45370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45370. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43168 [https://github.com/apache/spark/pull/43168] > Fix python test when ansi mode enabled > -- > > Key: SPARK-45370 > URL: https://issues.apache.org/jira/browse/SPARK-45370 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45370) Fix python test when ansi mode enabled
[ https://issues.apache.org/jira/browse/SPARK-45370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45370: Assignee: BingKun Pan > Fix python test when ansi mode enabled > -- > > Key: SPARK-45370 > URL: https://issues.apache.org/jira/browse/SPARK-45370 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776533#comment-17776533 ] BingKun Pan commented on SPARK-44734: - I work on it. > Add documentation for type casting rules in Python UDFs/UDTFs > - > > Key: SPARK-44734 > URL: https://issues.apache.org/jira/browse/SPARK-44734 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > In addition to type mappings between Spark data types and Python data types > (SPARK-44733), we should add the type casting rules for regular and > arrow-optimized Python UDFs/UDTFs. > We currently have this table in code: > * Arrow: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329] > * Python UDF: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116] > We should add a proper documentation page for the type casting rules. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45591) Upgrade ASM to 9.6
Yang Jie created SPARK-45591: Summary: Upgrade ASM to 9.6 Key: SPARK-45591 URL: https://issues.apache.org/jira/browse/SPARK-45591 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45576) Remove unnecessary debug logs in ReloadingX509TrustManagerSuite
[ https://issues.apache.org/jira/browse/SPARK-45576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45576: -- Summary: Remove unnecessary debug logs in ReloadingX509TrustManagerSuite (was: [CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite) > Remove unnecessary debug logs in ReloadingX509TrustManagerSuite > --- > > Key: SPARK-45576 > URL: https://issues.apache.org/jira/browse/SPARK-45576 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hasnain Lakhani >Assignee: Hasnain Lakhani >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > These were added accidentally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45587) Skip UNIDOC and MIMA in build GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-45587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45587. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43422 [https://github.com/apache/spark/pull/43422] > Skip UNIDOC and MIMA in build GitHub Action job > --- > > Key: SPARK-45587 > URL: https://issues.apache.org/jira/browse/SPARK-45587 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45587) Skip UNIDOC and MIMA in build GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-45587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45587: - Assignee: Dongjoon Hyun > Skip UNIDOC and MIMA in build GitHub Action job > --- > > Key: SPARK-45587 > URL: https://issues.apache.org/jira/browse/SPARK-45587 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45570) Spark job hangs due to task launch thread failed to create
[ https://issues.apache.org/jira/browse/SPARK-45570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lifulong updated SPARK-45570: - Environment: spark.speculation is use default value false spark version 3.1.2 was: spark.speculation is use default value false > Spark job hangs due to task launch thread failed to create > -- > > Key: SPARK-45570 > URL: https://issues.apache.org/jira/browse/SPARK-45570 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.5.0 > Environment: spark.speculation is use default value false > spark version 3.1.2 > >Reporter: lifulong >Priority: Major > > spark job hangs while web ui show there is one task in running stage keep > running for multi hours, while other tasks finished in a few minutes > executor will never report task launch failed info to driver > > Below is spark task execute thread launch log: > 23/10/17 04:45:42 ERROR Inbox: An error happened while processing message in > the inbox for Executor > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378) > at org.apache.spark.executor.Executor.launchTask(Executor.scala:270) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:173) > at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45570) Spark job hangs due to task launch thread failed to create
[ https://issues.apache.org/jira/browse/SPARK-45570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lifulong updated SPARK-45570: - Affects Version/s: 3.5.0 > Spark job hangs due to task launch thread failed to create > -- > > Key: SPARK-45570 > URL: https://issues.apache.org/jira/browse/SPARK-45570 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.5.0 > Environment: spark.speculation is use default value false > >Reporter: lifulong >Priority: Major > > spark job hangs while web ui show there is one task in running stage keep > running for multi hours, while other tasks finished in a few minutes > executor will never report task launch failed info to driver > > Below is spark task execute thread launch log: > 23/10/17 04:45:42 ERROR Inbox: An error happened while processing message in > the inbox for Executor > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378) > at org.apache.spark.executor.Executor.launchTask(Executor.scala:270) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:173) > at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
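The hang happens because the `OutOfMemoryError` thrown while creating the task-launch thread is only logged on the executor, so the driver never learns the task failed. A minimal sketch of the missing step (the `Driver`/`launch_task` names are assumptions for illustration, not Spark's API):

```python
class Driver:
    """Stand-in for the driver endpoint that tracks task state."""
    def __init__(self):
        self.failed = []

    def task_failed(self, task_id, reason):
        self.failed.append((task_id, reason))

def launch_task(driver, task_id, start_thread):
    try:
        start_thread()
    except (MemoryError, RuntimeError) as e:
        # Without this report, the driver keeps the task in RUNNING state
        # forever -- the hang described in the issue.
        driver.task_failed(task_id, str(e))

driver = Driver()
def cannot_start():
    raise RuntimeError("unable to create new native thread")

launch_task(driver, 7, cannot_start)
print(driver.failed)  # [(7, 'unable to create new native thread')]
```

Once the failure is reported, the scheduler can retry the task on another executor instead of waiting indefinitely.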
[jira] [Updated] (SPARK-45553) Deprecate assertPandasOnSparkEqual
[ https://issues.apache.org/jira/browse/SPARK-45553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45553: --- Labels: pull-request-available (was: ) > Deprecate assertPandasOnSparkEqual > -- > > Key: SPARK-45553 > URL: https://issues.apache.org/jira/browse/SPARK-45553 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, Tests >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > We will add new APIs for DataFrame, Series and Index separately, and we > should deprecate assertPandasOnSparkEqual. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45574) Add :: syntax as a shorthand for casting
[ https://issues.apache.org/jira/browse/SPARK-45574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45574: --- Labels: pull-request-available release-notes (was: release-notes) > Add :: syntax as a shorthand for casting > > > Key: SPARK-45574 > URL: https://issues.apache.org/jira/browse/SPARK-45574 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Ivan Mitic >Priority: Major > Labels: pull-request-available, release-notes > > Adds the `::` syntax as syntactic sugar for casting columns. This is a pretty > common syntax, and it was accepted by the SQL API in the [Semi-Structured > Data API PRD]( > [https://docs.google.com/document/d/1yNf0oE7XNZpLvsWly-MxZaxdlvMdRlZ1ZjSndtmoiWs/edit#heading=h.k50kjbi5yepj] > ). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
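The proposed `::` operator is pure syntactic sugar: `expr::type` is equivalent to `CAST(expr AS type)`. The real change belongs in Spark's SQL grammar; the regex below is only a sketch of the desugaring to make the equivalence concrete (it handles simple identifiers, not arbitrary expressions).

```python
import re

# `name::type` (optionally qualified, e.g. t.col::int) -> CAST(name AS type)
CAST_RE = re.compile(r"(\w+(?:\.\w+)*)\s*::\s*(\w+)")

def desugar(sql: str) -> str:
    return CAST_RE.sub(r"CAST(\1 AS \2)", sql)

print(desugar("SELECT price::double, ts::timestamp FROM sales"))
# SELECT CAST(price AS double), CAST(ts AS timestamp) FROM sales
```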
[jira] [Updated] (SPARK-45591) Upgrade ASM to 9.6
[ https://issues.apache.org/jira/browse/SPARK-45591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45591: --- Labels: pull-request-available (was: ) > Upgrade ASM to 9.6 > -- > > Key: SPARK-45591 > URL: https://issues.apache.org/jira/browse/SPARK-45591 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45586) Reduce compilation time for large expression trees
[ https://issues.apache.org/jira/browse/SPARK-45586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45586: -- Assignee: Apache Spark > Reduce compilation time for large expression trees > -- > > Key: SPARK-45586 > URL: https://issues.apache.org/jira/browse/SPARK-45586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Kelvin Jiang >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Some rules, such as TypeCoercion, are very expensive when the query plan > contains very large expression trees. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45586) Reduce compilation time for large expression trees
[ https://issues.apache.org/jira/browse/SPARK-45586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45586: -- Assignee: (was: Apache Spark) > Reduce compilation time for large expression trees > -- > > Key: SPARK-45586 > URL: https://issues.apache.org/jira/browse/SPARK-45586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Kelvin Jiang >Priority: Major > Labels: pull-request-available > > Some rules, such as TypeCoercion, are very expensive when the query plan > contains very large expression trees. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45589) Supplementary exception class
BingKun Pan created SPARK-45589: --- Summary: Supplementary exception class Key: SPARK-45589 URL: https://issues.apache.org/jira/browse/SPARK-45589 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44649) Runtime Filter supports passing equivalent creation side expressions
[ https://issues.apache.org/jira/browse/SPARK-44649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44649. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42317 [https://github.com/apache/spark/pull/42317] > Runtime Filter supports passing equivalent creation side expressions > > > Key: SPARK-44649 > URL: https://issues.apache.org/jira/browse/SPARK-44649 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code:java} > SELECT > d_year, > i_brand_id, > i_class_id, > i_category_id, > i_manufact_id, > cs_quantity - COALESCE(cr_return_quantity, 0) AS sales_cnt, > cs_ext_sales_price - COALESCE(cr_return_amount, 0.0) AS sales_amt > FROM catalog_sales > JOIN item ON i_item_sk = cs_item_sk > JOIN date_dim ON d_date_sk = cs_sold_date_sk > LEFT JOIN catalog_returns ON (cs_order_number = cr_order_number > AND cs_item_sk = cr_item_sk) > WHERE i_category = 'Books' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44649) Runtime Filter supports passing equivalent creation side expressions
[ https://issues.apache.org/jira/browse/SPARK-44649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44649: --- Assignee: Jiaan Geng > Runtime Filter supports passing equivalent creation side expressions > > > Key: SPARK-44649 > URL: https://issues.apache.org/jira/browse/SPARK-44649 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > > {code:java} > SELECT > d_year, > i_brand_id, > i_class_id, > i_category_id, > i_manufact_id, > cs_quantity - COALESCE(cr_return_quantity, 0) AS sales_cnt, > cs_ext_sales_price - COALESCE(cr_return_amount, 0.0) AS sales_amt > FROM catalog_sales > JOIN item ON i_item_sk = cs_item_sk > JOIN date_dim ON d_date_sk = cs_sold_date_sk > LEFT JOIN catalog_returns ON (cs_order_number = cr_order_number > AND cs_item_sk = cr_item_sk) > WHERE i_category = 'Books' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
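The idea behind the feature: the join equalities in a query like the one above (`i_item_sk = cs_item_sk`, `cs_item_sk = cr_item_sk`) put expressions into equivalence classes, and a runtime (bloom) filter built for one member of a class can be pushed to any other member. A union-find sketch of the class computation (names here are illustrative, not Spark internals):

```python
def equivalence_classes(equalities):
    """Group expressions connected by equality predicates."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in equalities:
        parent[find(a)] = find(b)  # union the two classes

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())

eqs = [("cs_item_sk", "i_item_sk"), ("cs_item_sk", "cr_item_sk")]
print(equivalence_classes(eqs))  # one class containing all three keys
```

A filter created from `i_item_sk` (after the `i_category = 'Books'` predicate) can then prune both `catalog_sales` and `catalog_returns`.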
[jira] [Resolved] (SPARK-45542) Replace `setSafeMode(HdfsConstants.SafeModeAction, boolean)` with `setSafeMode(SafeModeAction, boolean)`
[ https://issues.apache.org/jira/browse/SPARK-45542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45542. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43377 [https://github.com/apache/spark/pull/43377] > Replace `setSafeMode(HdfsConstants.SafeModeAction, boolean)` with > `setSafeMode(SafeModeAction, boolean)` > > > Key: SPARK-45542 > URL: https://issues.apache.org/jira/browse/SPARK-45542 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > {code:java} > /** > * Enter, leave or get safe mode. > * > * @param action > * One of SafeModeAction.ENTER, SafeModeAction.LEAVE and > * SafeModeAction.GET. > * @param isChecked > * If true check only for Active NNs status, else check first NN's > * status. > * > * @see > org.apache.hadoop.hdfs.protocol.ClientProtocol#setSafeMode(HdfsConstants.SafeModeAction, > * boolean) > * > * @deprecated please instead use > * {@link DistributedFileSystem#setSafeMode(SafeModeAction, > boolean)}. > */ > @Deprecated > public boolean setSafeMode(HdfsConstants.SafeModeAction action, > boolean isChecked) throws IOException { > return dfs.setSafeMode(action, isChecked); > } {code} > > `setSafeMode(HdfsConstants.SafeModeAction, boolean)` is `Deprecated` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45549) Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend`
[ https://issues.apache.org/jira/browse/SPARK-45549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-45549: - Priority: Trivial (was: Minor) > Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend` > --- > > Key: SPARK-45549 > URL: https://issues.apache.org/jira/browse/SPARK-45549 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: xiaoping.huang >Priority: Trivial > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45593) Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.Intern
[ https://issues.apache.org/jira/browse/SPARK-45593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yikaifei updated SPARK-45593: - Description: Reproducing steps: first, clone spark master code, then: # Build runnable distribution from master code by : `/dev/make-distribution.sh --name ui --pip --tgz -Phive -Phive-thriftserver -Pyarn -Pconnect` # Install runnable distribution package # Run `bin/spark-sql` Got error: {code:java} 23/10/18 20:51:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Exception in thread "main" java.lang.NoClassDefFoundError: org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess at java.base/java.lang.ClassLoader.defineClass1(Native Method) at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) at java.base/java.lang.ClassLoader.defineClass1(Native Method) at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) at java.base/java.lang.ClassLoader.defineClass1(Native Method) at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511) at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515) at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168) at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079) at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011) at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034) at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010) at org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146) at org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127) at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:545) at org.apache.spark.SparkContext.(SparkContext.scala:629) at 
org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2916) at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1100) at scala.Option.getOrElse(Option.scala:201) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1094) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:64) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:441) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:177) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Created] (SPARK-45594) Auto repartition before writing data into partitioned or bucket table
Wan Kun created SPARK-45594: --- Summary: Auto repartition before writing data into partitioned or bucket table Key: SPARK-45594 URL: https://issues.apache.org/jira/browse/SPARK-45594 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Wan Kun Today, when writing data into a partitioned table, there will be at least *dynamicPartitions * shuffleNum* files; when writing data into a bucketed table, there will be at least *bucketNums * shuffleNum* files. We could shuffle by the dynamic partition or bucket columns before writing data into the table, so that each partition or bucket is written by a single task. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
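The file-count arithmetic in the proposal can be sketched directly. Without a pre-write shuffle, each shuffle task may hold rows for every dynamic partition, so the lower bound on output files is partitions times tasks; shuffling by the partition (or bucket) columns first routes each partition's rows to one task, dropping the bound to one file per partition. (Function names below are illustrative, not Spark APIs.)

```python
def files_without_repartition(num_partitions: int, num_tasks: int) -> int:
    # Each of the num_tasks writers can open a file for each partition.
    return num_partitions * num_tasks

def files_with_repartition(num_partitions: int) -> int:
    # After shuffling on the partition columns, each partition is
    # written by exactly one task.
    return num_partitions

print(files_without_repartition(200, 50))  # 10000 small files
print(files_with_repartition(200))         # 200 files
```

The trade-off is an extra shuffle before the write, which the rule would have to weigh against the small-files problem.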
[jira] [Updated] (SPARK-45593) Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.Intern
[ https://issues.apache.org/jira/browse/SPARK-45593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45593: --- Labels: pull-request-available (was: ) > Building a runnable distribution from master code running spark-sql raise > error "java.lang.ClassNotFoundException: > org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess" > --- > > Key: SPARK-45593 > URL: https://issues.apache.org/jira/browse/SPARK-45593 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: yikaifei >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Building a runnable distribution from master code running spark-sql raise > error "java.lang.ClassNotFoundException: > org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess"; > Reproducing steps, first, clone spark master code, then: > # Build runnable distribution from master code by : > `/dev/make-distribution.sh --name ui --pip --tgz -Phive -Phive-thriftserver > -Pyarn -Pconnect` > # Install runnable distribution package > # Run `bin/spark-sql` > Got error: > {code:java} > 23/10/18 20:51:46 WARN NativeCodeLoader: Unable to load native-hadoop > library for your platform... 
using builtin-java classes where applicable > Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) > at > 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) > at > org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511) > at > org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515) > at > org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168) > at > org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079) > at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011) > at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034) > at > org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010) > at >
[jira] [Assigned] (SPARK-45549) Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend`
[ https://issues.apache.org/jira/browse/SPARK-45549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45549: - Assignee: xiaoping.huang > Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend` > --- > > Key: SPARK-45549 > URL: https://issues.apache.org/jira/browse/SPARK-45549 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: xiaoping.huang >Assignee: xiaoping.huang >Priority: Trivial > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45549) Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend`
[ https://issues.apache.org/jira/browse/SPARK-45549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45549. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43383 [https://github.com/apache/spark/pull/43383] > Remove unused `numExistingExecutors` in `CoarseGrainedSchedulerBackend` > --- > > Key: SPARK-45549 > URL: https://issues.apache.org/jira/browse/SPARK-45549 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: xiaoping.huang >Assignee: xiaoping.huang >Priority: Trivial > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug
[ https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45592: --- Labels: pull-request-available (was: ) > AQE and InMemoryTableScanExec correctness bug > - > > Key: SPARK-45592 > URL: https://issues.apache.org/jira/browse/SPARK-45592 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Emil Ejbyfeldt >Priority: Major > Labels: pull-request-available > > The following query should return 100 > {code:java} > import org.apache.spark.storage.StorageLevel > val df = spark.range(0, 100, 1, 5).map(l => (l, l)) > val ee = df.select($"_1".as("src"), $"_2".as("dst")) > .persist(StorageLevel.MEMORY_AND_DISK) > ee.count() > val minNbrs1 = ee > .groupBy("src").agg(min(col("dst")).as("min_number")) > .persist(StorageLevel.MEMORY_AND_DISK) > val join = ee.join(minNbrs1, "src") > join.count(){code} > but on spark 3.5.0 there is a correctness bug causing it to return `104800` > or some other smaller value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45593) Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.Intern
yikaifei created SPARK-45593: Summary: Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess" Key: SPARK-45593 URL: https://issues.apache.org/jira/browse/SPARK-45593 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: yikaifei Fix For: 4.0.0 Reproducing steps: first, clone spark master code, then: # Build runnable distribution from master code by : `/dev/make-distribution.sh --name ui --pip --tgz -Phive -Phive-thriftserver -Pyarn -Pconnect` # Install runnable distribution package # Run `bin/spark-sql` Got error: ``` 23/10/18 20:51:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Exception in thread "main" java.lang.NoClassDefFoundError: org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess at java.base/java.lang.ClassLoader.defineClass1(Native Method) at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) at java.base/java.lang.ClassLoader.defineClass1(Native Method) at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) at java.base/java.lang.ClassLoader.defineClass1(Native Method) at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511) at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515) at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168) at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079) at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011) at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034) at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010) at org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146) at 
org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127) at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:545) at org.apache.spark.SparkContext.(SparkContext.scala:629) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2916) at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1100) at scala.Option.getOrElse(Option.scala:201) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1094) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:64) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:441) at
[jira] [Updated] (SPARK-45593) Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.Intern
[ https://issues.apache.org/jira/browse/SPARK-45593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yikaifei updated SPARK-45593: - Description: Building a runnable distribution from master code running spark-sql raise error "java.lang.ClassNotFoundException: org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess"; Reproducing steps, first, clone spark master code, then: # Build runnable distribution from master code by : `/dev/make-distribution.sh --name ui --pip --tgz -Phive -Phive-thriftserver -Pyarn -Pconnect` # Install runnable distribution package # Run `bin/spark-sql` Got error: {code:java} 23/10/18 20:51:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Exception in thread "main" java.lang.NoClassDefFoundError: org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess at java.base/java.lang.ClassLoader.defineClass1(Native Method) at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) at java.base/java.lang.ClassLoader.defineClass1(Native Method) at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) at 
java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) at java.base/java.lang.ClassLoader.defineClass1(Native Method) at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012) at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511) at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515) at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168) at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079) at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011) at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034) at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010) at org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146) at 
org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127) at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:545) at org.apache.spark.SparkContext.(SparkContext.scala:629) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2916) at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1100) at scala.Option.getOrElse(Option.scala:201) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1094) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:64) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:441) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:177) at
[jira] [Updated] (SPARK-45594) Auto repartition before writing data into partitioned or bucket table
[ https://issues.apache.org/jira/browse/SPARK-45594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-45594: Description: Now, when writing data into partitioned table, there will be at least *dynamicPartitions * ShuffleNum* files; when writing data into bucket table, there will be at least *bucketNums * shuffleNum* files. We can shuffle by the dynamic partitions or bucket columns before writing data into the table and will create ShuffleNum files. was: Now, when writing data into partitioned table, there will be at least *dynamicPartitions * ShuffleNum* files; when writing data into bucket table, there will be at least *bucketNums * shuffleNum* files. We can shuffle by the dynamic partitions or bucket columns before writing data into the table. > Auto repartition before writing data into partitioned or bucket table > -- > > Key: SPARK-45594 > URL: https://issues.apache.org/jira/browse/SPARK-45594 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Priority: Major > > Now, when writing data into partitioned table, there will be at least > *dynamicPartitions * ShuffleNum* files; when writing data into bucket table, > there will be at least *bucketNums * shuffleNum* files. > We can shuffle by the dynamic partitions or bucket columns before writing > data into the table and will create ShuffleNum files. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45595) Expose SQLSTATE in error message
[ https://issues.apache.org/jira/browse/SPARK-45595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Serge Rielau updated SPARK-45595: - Summary: Expose SQLSTATE in error message (was: Expose SQLSTATRE in errormessage) > Expose SQLSTATE in error message > > > Key: SPARK-45595 > URL: https://issues.apache.org/jira/browse/SPARK-45595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Priority: Major > > When using spark.sql.error.messageFormat in MINIMAL or STANDARD mode the > SQLSTATE is exposed; > We want to extend this to PRETTY mode, now that all errors have SQLSTATEs > We propose to trail the SQLSTATE after the text message, so it does not take > away from the reading experience of the message, while still being easily > found by tooling or humans. > [] SQLSTATE: > > Example: > {{[DIVIDE_BY_ZERO] ** Division by zero. Use `try_divide` to tolerate divisor > being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error. SQLSTATE: 22013}} > {{{}== SQL(line 1, position 8){}}}{{{}== > {}}}{{{}SELECT 1/0 > {}}}{{ ^^^}} > Other options considered have been: > {{[DIVIDE_BY_ZERO](22013) ** Division by zero. Use `try_divide` to tolerate > divisor being 0 and return NULL instead. If necessary set > "spark.sql.ansi.enabled" to "false" to bypass this error. }} > {{{}== SQL(line 1, position 8){}}}{{{}== > {}}}{{{}SELECT 1/0 > {}}}{{ ^^^}} > {{and}} > [DIVIDE_BY_ZERO] ** Division by zero. Use `try_divide` to tolerate > divisor being 0 and return NULL instead. If necessary set > "spark.sql.ansi.enabled" to "false" to bypass this error.}} > {{{}== SQL(line 1, position 8){}}}{{{}=={}}} > {{SELECT 1/0}} > {{ ^^^}} > SQLSTATE: 22013 > }}{{{}{{}}{}}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45595) Expose SQLSTATRE in errormessage
Serge Rielau created SPARK-45595: Summary: Expose SQLSTATRE in errormessage Key: SPARK-45595 URL: https://issues.apache.org/jira/browse/SPARK-45595 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Serge Rielau When using spark.sql.error.messageFormat in MINIMAL or STANDARD mode the SQLSTATE is exposed; We want to extend this to PRETTY mode, now that all errors have SQLSTATEs We propose to trail the SQLSTATE after the text message, so it does not take away from the reading experience of the message, while still being easily found by tooling or humans. [] SQLSTATE: Example: {{[DIVIDE_BY_ZERO] ** Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22013}} {{{}== SQL(line 1, position 8){}}}{{{}== {}}}{{{}SELECT 1/0 {}}}{{ ^^^}} Other options considered have been: {{[DIVIDE_BY_ZERO](22013) ** Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. }} {{{}== SQL(line 1, position 8){}}}{{{}== {}}}{{{}SELECT 1/0 {}}}{{ ^^^}} {{and}} [DIVIDE_BY_ZERO] ** Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.}} {{{}== SQL(line 1, position 8){}}}{{{}=={}}} {{SELECT 1/0}} {{ ^^^}} SQLSTATE: 22013 }}{{{}{{}}{}}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45594) Auto repartition before writing data into partitioned or bucket table
[ https://issues.apache.org/jira/browse/SPARK-45594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45594: --- Labels: pull-request-available (was: ) > Auto repartition before writing data into partitioned or bucket table > -- > > Key: SPARK-45594 > URL: https://issues.apache.org/jira/browse/SPARK-45594 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Priority: Major > Labels: pull-request-available > > Now, when writing data into partitioned table, there will be at least > *dynamicPartitions * ShuffleNum* files; when writing data into bucket table, > there will be at least *bucketNums * shuffleNum* files. > We can shuffle by the dynamic partitions or bucket columns before writing > data into the table and will create ShuffleNum files. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
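The file-count arithmetic in SPARK-45594 can be illustrated with a small standalone sketch (plain Java, not Spark APIs; all names here are made up for illustration). Each shuffle task opens one output file per dynamic partition value it receives, so spreading every partition's rows across every task yields up to dynamicPartitions * shuffleNum files, while routing rows by the partition columns first concentrates each partition in a single task:

```java
import java.util.HashSet;
import java.util.Set;

public class FileCountSketch {
    // Count distinct (task, partitionValue) pairs: each pair that receives
    // at least one row opens one output file. This is a toy model of the
    // arithmetic in the issue, not of Spark's actual writer.
    static int files(int rows, int partitions, int tasks, boolean repartitionByKey) {
        Set<Long> openFiles = new HashSet<>();
        for (int row = 0; row < rows; row++) {
            int part = row % partitions; // dynamic partition value of this row
            int task = repartitionByKey
                    ? part % tasks              // rows routed by partition key
                    : (row / partitions) % tasks; // each partition spread over all tasks
            openFiles.add((long) task * partitions + part);
        }
        return openFiles.size();
    }

    public static void main(String[] args) {
        // 10000 rows, 50 dynamic partitions, 20 shuffle tasks
        System.out.println(files(10000, 50, 20, false)); // 1000 = 50 * 20 files
        System.out.println(files(10000, 50, 20, true));  // 50 files, one per partition
    }
}
```

With the pre-write shuffle, each partition value lands in exactly one task, so the file count no longer multiplies with the shuffle parallelism.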
[jira] [Updated] (SPARK-45552) Introduce flexible parameters to assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-45552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45552: --- Labels: pull-request-available (was: ) > Introduce flexible parameters to assertDataFrameEqual > - > > Key: SPARK-45552 > URL: https://issues.apache.org/jira/browse/SPARK-45552 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Add new parameters maxErrors, showOnlyDiff, maxRowsShow, ignoreColumnNames to > the assertDataFrameEqual. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45570) Spark job hangs due to task launch thread failed to create
[ https://issues.apache.org/jira/browse/SPARK-45570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lifulong updated SPARK-45570: - Attachment: image-2023-10-18-18-18-36-132.png > Spark job hangs due to task launch thread failed to create > -- > > Key: SPARK-45570 > URL: https://issues.apache.org/jira/browse/SPARK-45570 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.5.0 > Environment: spark.speculation is use default value false > spark version 3.1.2 > >Reporter: lifulong >Priority: Major > Attachments: image-2023-10-18-18-18-36-132.png > > > spark job hangs while web ui show there is one task in running stage keep > running for multi hours, while other tasks finished in a few minutes > executor will never report task launch failed info to driver > > Below is spark task execute thread launch log: > 23/10/17 04:45:42 ERROR Inbox: An error happened while processing message in > the inbox for Executor > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378) > at org.apache.spark.executor.Executor.launchTask(Executor.scala:270) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:173) > at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45570) Spark job hangs due to task launch thread failed to create
[ https://issues.apache.org/jira/browse/SPARK-45570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776595#comment-17776595 ] lifulong commented on SPARK-45570: -- !image-2023-10-18-18-18-36-132.png! catch thread create exception from line "threadPool.execute(tr)", and do execBackend.statusUpdate(taskDescription.taskId, TaskState.FAILED, EMPTY_BYTE_BUFFER) after get exception can fix this problem in theory is this solution ok? > Spark job hangs due to task launch thread failed to create > -- > > Key: SPARK-45570 > URL: https://issues.apache.org/jira/browse/SPARK-45570 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.5.0 > Environment: spark.speculation is use default value false > spark version 3.1.2 > >Reporter: lifulong >Priority: Major > Attachments: image-2023-10-18-18-18-36-132.png > > > spark job hangs while web ui show there is one task in running stage keep > running for multi hours, while other tasks finished in a few minutes > executor will never report task launch failed info to driver > > Below is spark task execute thread launch log: > 23/10/17 04:45:42 ERROR Inbox: An error happened while processing message in > the inbox for Executor > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378) > at org.apache.spark.executor.Executor.launchTask(Executor.scala:270) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:173) > at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
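The fix sketched in the comment above — catch the thread-creation failure in launchTask and report the task as FAILED rather than leaving it "running" forever — can be modeled in plain Java. This is a hypothetical standalone analog, not Spark's Executor code: a rejected submission (or OutOfMemoryError) from the pool stands in for the failed native thread creation, and statusUpdate is an invented callback playing the role of execBackend.statusUpdate:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.BiConsumer;

public class LaunchTaskSketch {
    enum TaskState { RUNNING, FAILED }

    // Try to launch the task; if the pool cannot start or accept a worker,
    // report FAILED to the scheduler instead of silently dropping the task.
    static TaskState launchTask(ExecutorService pool, long taskId, Runnable task,
                                BiConsumer<Long, TaskState> statusUpdate) {
        try {
            pool.execute(task);
            return TaskState.RUNNING;
        } catch (RejectedExecutionException | OutOfMemoryError e) {
            // Without this, the driver would show the task as running forever.
            statusUpdate.accept(taskId, TaskState.FAILED);
            return TaskState.FAILED;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.shutdown(); // simulate an executor that can no longer start task threads
        launchTask(pool, 7L, (Runnable) () -> {},
                (id, st) -> System.out.println("task " + id + " -> " + st));
        // prints: task 7 -> FAILED
    }
}
```

The key point matches the comment: the failure path must produce an explicit status update so the driver can reschedule, rather than letting the exception die inside the RPC inbox.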
[jira] [Updated] (SPARK-45588) Minor scaladoc improvement in StreamingForeachBatchHelper
[ https://issues.apache.org/jira/browse/SPARK-45588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-45588: - Issue Type: Improvement (was: Bug) Priority: Trivial (was: Major) > Minor scaladoc improvement in StreamingForeachBatchHelper > - > > Key: SPARK-45588 > URL: https://issues.apache.org/jira/browse/SPARK-45588 > Project: Spark > Issue Type: Improvement > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Raghu Angadi >Priority: Trivial > Labels: pull-request-available > > Document RunnerCleaner in StreamingForeachBatchHelper. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug
Emil Ejbyfeldt created SPARK-45592: -- Summary: AQE and InMemoryTableScanExec correctness bug Key: SPARK-45592 URL: https://issues.apache.org/jira/browse/SPARK-45592 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Emil Ejbyfeldt The following query should return 100 {code:java} import org.apache.spark.storage.StorageLevel val df = spark.range(0, 100, 1, 5).map(l => (l, l)) val ee = df.select($"_1".as("src"), $"_2".as("dst")) .persist(StorageLevel.MEMORY_AND_DISK) ee.count() val minNbrs1 = ee .groupBy("src").agg(min(col("dst")).as("min_number")) .persist(StorageLevel.MEMORY_AND_DISK) val join = ee.join(minNbrs1, "src") join.count(){code} but on spark 3.5.0 there is a correctness bug causing it to return `104800` or some other smaller value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45596) Use java.lang.ref.Cleaner instead of org.apache.spark.sql.connect.client.util.Cleaner
Min Zhao created SPARK-45596: Summary: Use java.lang.ref.Cleaner instead of org.apache.spark.sql.connect.client.util.Cleaner Key: SPARK-45596 URL: https://issues.apache.org/jira/browse/SPARK-45596 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Min Zhao Now that we have updated the JDK to 17, we should replace this class with [[java.lang.ref.Cleaner]].
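`java.lang.ref.Cleaner` (available since JDK 9) runs a registered action once its target becomes phantom reachable, with the rule that the action must not hold a strong reference back to the target. As a rough cross-language illustration of that pattern, not the Spark Connect code itself, Python's `weakref.finalize` offers the same contract:

```python
import weakref

class Session:
    """Hypothetical stand-in for an object owning a native resource."""
    def __init__(self, name):
        self.name = name

released = []

def release(name):
    # Cleanup action: it captures only the name, never the Session
    # itself, mirroring the no-strong-reference rule of Cleaner actions.
    released.append(name)

s = Session("conn-1")
finalizer = weakref.finalize(s, release, "conn-1")

del s        # last strong reference gone; CPython runs the finalizer
finalizer()  # explicit invocation is also allowed and is idempotent
print(released)  # ['conn-1']
```

The appeal of the JDK class over a hand-rolled cleaner is the same as here: the runtime, not application code, tracks reachability and guarantees the action fires at most once.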
[jira] [Updated] (SPARK-45596) Use java.lang.ref.Cleaner instead of org.apache.spark.sql.connect.client.util.Cleaner
[ https://issues.apache.org/jira/browse/SPARK-45596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhao updated SPARK-45596: - Attachment: image-2023-10-19-02-25-57-966.png > Use java.lang.ref.Cleaner instead of > org.apache.spark.sql.connect.client.util.Cleaner > - > > Key: SPARK-45596 > URL: https://issues.apache.org/jira/browse/SPARK-45596 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Min Zhao >Priority: Minor > Attachments: image-2023-10-19-02-25-57-966.png > > > Now, we have updated JDK to 17, so should replace this class by > [[java.lang.ref.Cleaner]].
[jira] [Updated] (SPARK-45596) Use java.lang.ref.Cleaner instead of org.apache.spark.sql.connect.client.util.Cleaner
[ https://issues.apache.org/jira/browse/SPARK-45596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhao updated SPARK-45596: - Description: Now, we have updated JDK to 17, so should replace this class by [[java.lang.ref.Cleaner]]. !image-2023-10-19-02-25-57-966.png! was:Now, we have updated JDK to 17, so should replace this class by [[java.lang.ref.Cleaner]]. > Use java.lang.ref.Cleaner instead of > org.apache.spark.sql.connect.client.util.Cleaner > - > > Key: SPARK-45596 > URL: https://issues.apache.org/jira/browse/SPARK-45596 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Min Zhao >Priority: Minor > Attachments: image-2023-10-19-02-25-57-966.png > > > Now, we have updated JDK to 17, so should replace this class by > [[java.lang.ref.Cleaner]]. > > !image-2023-10-19-02-25-57-966.png!
[jira] [Created] (SPARK-45597) Support creating table using a Python data source in SQL
Allison Wang created SPARK-45597: Summary: Support creating table using a Python data source in SQL Key: SPARK-45597 URL: https://issues.apache.org/jira/browse/SPARK-45597 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Support creating a table using a Python data source in a SQL query, for instance: `CREATE TABLE tableName() USING OPTIONS `
[jira] [Created] (SPARK-45598) Delta table 3.0-rc2 not working with Spark Connect 3.5.0
Faiz Halde created SPARK-45598: -- Summary: Delta table 3.0-rc2 not working with Spark Connect 3.5.0 Key: SPARK-45598 URL: https://issues.apache.org/jira/browse/SPARK-45598 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: Faiz Halde Spark version 3.5.0 Spark Connect version 3.5.0 Delta table 3.0-rc2 When trying to run a simple job that writes to a delta table {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}} {{val data = spark.read.json("profiles.json")}} {{data.write.format("delta").save("/tmp/delta")}} {{Error log in connect client}} {{Exception in thread "main" org.apache.spark.SparkException: io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}} {{ at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}} {{ at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}} {{ at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}} {{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} {{ at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} {{ at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} {{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} {{ at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} {{ at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} {{ at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} {{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} {{ at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} {{ at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} {{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} {{ at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} {{...}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}} {{ at scala.collection.Iterator.foreach(Iterator.scala:943)}} {{ at scala.collection.Iterator.foreach$(Iterator.scala:943)}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}} {{ at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}} {{ at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}} {{ at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}} {{ at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)}} {{ at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)}} {{ at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)}} {{ at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.to(GrpcExceptionConverter.scala:46)}} {{ at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)}} {{ at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.toBuffer(GrpcExceptionConverter.scala:46)}} {{ at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:554)}} {{ at org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257)}} {{ at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221)}} {{ at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:210)}} {{ at Main$.main(Main.scala:11)}} {{ at Main.main(Main.scala)}} {{Error log in spark connect
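The `ClassCastException` on `SerializedLambda` is the usual symptom of deserializing a lambda in a JVM that cannot see the class which defined it, here suggesting the Delta UDF's defining class is missing from the executors' classpath. Serialize-by-reference has the same failure mode in other stacks; as a hedged analogy (not the Spark mechanism itself), Python's pickle records a function by name and can only rebuild it where that name resolves:

```python
import pickle

# pickle stores an ordinary function "by reference": the payload holds
# the qualified name (builtins.len), not the implementation.
payload = pickle.dumps(len)
print(b"len" in payload)  # True: only a name travels over the wire

# Loading succeeds only where that name resolves to the same definition.
# This mirrors SerializedLambda: the executor must have the defining
# class on its classpath, or deserialization fails as in the log above.
restored = pickle.loads(payload)
print(restored is len)  # True
```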
[jira] [Updated] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
[ https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated SPARK-45599: Labels: data-corruption (was: ) > Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset > -- > > Key: SPARK-45599 > URL: https://issues.apache.org/jira/browse/SPARK-45599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.3, 3.5.0 >Reporter: Robert Joseph Evans >Priority: Major > Labels: data-corruption > > I think this actually impacts all versions that have ever supported > percentile and it may impact other things because the bug is in OpenHashMap. > > I am really surprised that we caught this bug because everything has to hit > just wrong to make it happen. in python/pyspark if you run > > {code:python} > from math import * > from pyspark.sql.types import * > data = [(1.779652973678931e+173,), (9.247723870123388e-295,), > (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), > (-3.085825028509117e+74,), (-1.9569489404314425e+128,), > (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), > (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), > (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), 
(-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), > (-1.1041009815645263e+203,), (1.8461539468514548e-225,), > (-5.620339412794757e-251,), (3.5103766991437114e-60,), > (2.4925669515657655e+165,), (3.217759099462207e+108,), > (-8.796717685143486e+203,), (2.037360925124577e+292,), > (-6.542279108216022e+206,), (-7.951172614280046e-74,), > (6.226527569272003e+152,), (-5.673977270111637e-84,), > (-1.0186016078084965e-281,), (1.7976931348623157e+308,), > (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), > (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), > (1.7976931348623157e+308,), (4.3214483342777574e-117,), > (-7.973642629411105e-89,), (-1.1028137694801181e-297,), > (2.9000325280299273e-39,), (-1.077534929323113e-264,), > (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), > (-1.831402251805194e+65,), (-2.664533698035492e+203,), > (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), > (-9.607772864590422e+217,), (3.437191836077251e+209,), > (1.9846569552093057e-137,), (-3.010452936419635e-233,), > (1.4309793775440402e-87,), (-2.9383643865423363e-103,), > (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), > (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), > (2.187766760184779e+306,), (7.679268835670585e+223,), > (6.3131466321042515e+153,), (1.779652973678931e+173,), > (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), > (1.9042708096454302e+195,), (-3.085825028509117e+74,), > (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), > (2.5212410617263588e-282,), (-2.646144697462316e-35,), > (-3.468683249247593e-196,), (nan,), (None,), (nan,), > (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), 
(2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), >
[jira] [Updated] (SPARK-45230) Adjust sorter for Aggregate after SMJ
[ https://issues.apache.org/jira/browse/SPARK-45230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45230: --- Labels: pull-request-available (was: ) > Adjust sorter for Aggregate after SMJ > - > > Key: SPARK-45230 > URL: https://issues.apache.org/jira/browse/SPARK-45230 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Priority: Major > Labels: pull-request-available > > If there is an aggregate operator after the SMJ and the grouping expressions > of the aggregate operator contain all the join keys of the streamed side, we can > add a sorter on the streamed side of the SMJ, so that the aggregate can be > converted to a SortAggregate, which will be faster than HashAggregate. > For example, with table t1(a, b, c) and t2(x, y, z): > {code:java} > SELECT a, b, sum(c) > FROM t1 > JOIN t2 > ON t1.b = t2.y > GROUP BY a, b > {code} > Before this PR: > {code:java} > Scan(t1)Scan(t2) > | | > | | > Exchange 1 Exchange 2 > \ / >Sort(t1.b)Sort(t2.y) > \ / >SMJ (t1.b = t2.y) > | > | > HashAggregate > {code} > We can change Sort(t1.b) to Sort(t1.b, t1.a) in the left side of the SMJ, > and then the following aggregate could be converted to SortAggregate, which > will be faster. > {code:java} > Scan(t1)Scan(t2) > | | > | | > Exchange 1 Exchange 2 > \/ > Sort(t1.b, t1.a) Sort(t2.y) > \ / > SMJ (t1.b = t2.y) > | > | > SortAggregate > {code} > Benchmark result > {code:java} > Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.16 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Aggregate after SMJ: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > Hash aggregate after SMJ 50508 50667 > 225 0.42408.4 1.0X > Sort aggregate after SMJ 27556 27734 > 252 0.81314.0 1.8X > {code}
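The speedup comes from the streamed side already being ordered on the grouping keys: a sort-based aggregate can then finish each group the moment the key changes, holding constant state, instead of maintaining a hash table over all groups. A small Python sketch of that streaming pattern (illustrative only, not Spark's SortAggregate operator):

```python
from itertools import groupby
from operator import itemgetter

# Rows (a, b, c) already sorted by the grouping keys (a, b) -- the
# property the extra Sort(t1.b, t1.a) guarantees after the SMJ.
rows = [(1, 10, 5), (1, 10, 7), (2, 10, 1), (2, 20, 4), (2, 20, 6)]

# Sort aggregate: a single pass; each group is emitted as soon as its
# key run ends, so only one group's partial sum is in memory at a time.
result = [(key, sum(c for _, _, c in group))
          for key, group in groupby(rows, key=itemgetter(0, 1))]
print(result)  # [((1, 10), 12), ((2, 10), 1), ((2, 20), 10)]
```

A hash aggregate over the same input would build a dict of all groups before emitting anything, which is what the benchmark shows losing once the sort is free.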
[jira] [Created] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
Robert Joseph Evans created SPARK-45599: --- Summary: Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset Key: SPARK-45599 URL: https://issues.apache.org/jira/browse/SPARK-45599 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.2.3, 3.3.0 Reporter: Robert Joseph Evans I think this actually impacts all versions that have ever supported percentile and it may impact other things because the bug is in OpenHashMap. I am really surprised that we caught this bug because everything has to hit just wrong to make it happen. in python/pyspark if you run {code:python} from math import * from pyspark.sql.types import * data = [(1.779652973678931e+173,), (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), (-3.085825028509117e+74,), (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,), (-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), (-5.046677974902737e+132,), (-5.490780063080251e-09,), (1.703824427218836e-55,), (-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), (5.120795466142678e-215,), (-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), (3.4543959813437507e-304,), (-7.590734560275502e-63,), (9.376528689861087e+117,), (-2.1696969883753554e-292,), (7.227411393136537e+206,), (-2.428999624265911e-293,), (5.741383583382542e-14,), (-1.4882040107841963e+286,), (2.1973064836362255e-159,), (0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,), (-1.1041009815645263e+203,), (1.8461539468514548e-225,), (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
(2.4925669515657655e+165,), (3.217759099462207e+108,), (-8.796717685143486e+203,), (2.037360925124577e+292,), (-6.542279108216022e+206,), (-7.951172614280046e-74,), (6.226527569272003e+152,), (-5.673977270111637e-84,), (-1.0186016078084965e-281,), (1.7976931348623157e+308,), (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), (1.7976931348623157e+308,), (4.3214483342777574e-117,), (-7.973642629411105e-89,), (-1.1028137694801181e-297,), (2.9000325280299273e-39,), (-1.077534929323113e-264,), (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), (-1.831402251805194e+65,), (-2.664533698035492e+203,), (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), (-9.607772864590422e+217,), (3.437191836077251e+209,), (1.9846569552093057e-137,), (-3.010452936419635e-233,), (1.4309793775440402e-87,), (-2.9383643865423363e-103,), (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), (2.187766760184779e+306,), (7.679268835670585e+223,), (6.3131466321042515e+153,), (1.779652973678931e+173,), (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), (-3.085825028509117e+74,), (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,), (-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), (-5.046677974902737e+132,), (-5.490780063080251e-09,), (1.703824427218836e-55,), (-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), (5.120795466142678e-215,), (-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,), 
(-1.8891559842111865e+63,), (3.4543959813437507e-304,), (-7.590734560275502e-63,), (9.376528689861087e+117,), (-2.1696969883753554e-292,), (7.227411393136537e+206,), (-2.428999624265911e-293,), (5.741383583382542e-14,), (-1.4882040107841963e+286,), (2.1973064836362255e-159,), (0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,), (-1.1041009815645263e+203,), (1.8461539468514548e-225,), (-5.620339412794757e-251,), (3.5103766991437114e-60,), (2.4925669515657655e+165,), (3.217759099462207e+108,), (-8.796717685143486e+203,), (2.037360925124577e+292,), (-6.542279108216022e+206,), (-7.951172614280046e-74,), (6.226527569272003e+152,), (-5.673977270111637e-84,), (-1.0186016078084965e-281,), (1.7976931348623157e+308,), (4.205809391029644e+137,), (-9.871721037428167e+119,),
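The trigger is IEEE-754 signed zeros: `-0.0 == 0.0` compares true, yet the two values have different bit patterns, so a hash map that distinguishes them (as the OpenHashMap bug does) can split one logical key's counts across two slots and skew the percentile. A hedged Python illustration of the hazard, not Spark's OpenHashMap code:

```python
import struct

def bits(x: float) -> int:
    """Raw 64-bit IEEE-754 pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# Equal under floating-point comparison...
print(-0.0 == 0.0)  # True

# ...but distinct at the bit level: 0x0 vs 0x8000000000000000.
print(hex(bits(0.0)), hex(bits(-0.0)))

# A map keyed on the raw bits splits one logical key into two slots --
# the same shape of mis-count that corrupts the percentile result.
counts = {}
for x in [0.0, -0.0, 0.0]:
    counts[bits(x)] = counts.get(bits(x), 0) + 1
print(len(counts))  # 2 entries for what equality says is one key
```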
[jira] [Commented] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776921#comment-17776921 ] Philip Dakin commented on SPARK-44734: -- [~panbingkun] please make sure changes operate well with https://github.com/apache/spark/pull/43369. > Add documentation for type casting rules in Python UDFs/UDTFs > - > > Key: SPARK-44734 > URL: https://issues.apache.org/jira/browse/SPARK-44734 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > In addition to type mappings between Spark data types and Python data types > (SPARK-44733), we should add the type casting rules for regular and > arrow-optimized Python UDFs/UDTFs. > We currently have this table in code: > * Arrow: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329] > * Python UDF: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116] > We should add a proper documentation page for the type casting rules.
[jira] [Comment Edited] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776921#comment-17776921 ] Philip Dakin edited comment on SPARK-44734 at 10/18/23 9:47 PM: [~panbingkun] please make sure changes operate well with https://issues.apache.org/jira/browse/SPARK-44733 in [https://github.com/apache/spark/pull/43369] was (Author: JIRAUSER302581): [~panbingkun] please make sure changes operate well with https://github.com/apache/spark/pull/43369. > Add documentation for type casting rules in Python UDFs/UDTFs > - > > Key: SPARK-44734 > URL: https://issues.apache.org/jira/browse/SPARK-44734 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > In addition to type mappings between Spark data types and Python data types > (SPARK-44733), we should add the type casting rules for regular and > arrow-optimized Python UDFs/UDTFs. > We currently have this table in code: > * Arrow: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329] > * Python UDF: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116] > We should add a proper documentation page for the type casting rules.
[jira] [Assigned] (SPARK-45582) Streaming aggregation in complete mode should not refer to store instance after commit
[ https://issues.apache.org/jira/browse/SPARK-45582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-45582: Assignee: Anish Shrigondekar > Streaming aggregation in complete mode should not refer to store instance > after commit > -- > > Key: SPARK-45582 > URL: https://issues.apache.org/jira/browse/SPARK-45582 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Labels: pull-request-available > > Streaming aggregation in complete mode should not refer to store instance > after commit
[jira] [Resolved] (SPARK-45582) Streaming aggregation in complete mode should not refer to store instance after commit
[ https://issues.apache.org/jira/browse/SPARK-45582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-45582. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43413 [https://github.com/apache/spark/pull/43413] > Streaming aggregation in complete mode should not refer to store instance > after commit > -- > > Key: SPARK-45582 > URL: https://issues.apache.org/jira/browse/SPARK-45582 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Streaming aggregation in complete mode should not refer to store instance > after commit
[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone
[ https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-44526: --- Affects Version/s: 3.4.1 (was: 3.5.0) > Porting k8s PVC reuse logic to spark standalone > --- > > Key: SPARK-44526 > URL: https://issues.apache.org/jira/browse/SPARK-44526 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.4.1 >Reporter: Faiz Halde >Priority: Major > > Hi, > This ticket is meant to understand the work that would be involved in porting > the k8s PVC reuse feature onto the spark standalone cluster manager which > reuses the shuffle files present locally in the disk > We are a heavy user of spot instances and we suffer from spot terminations > impacting our long running jobs > The logic in `KubernetesLocalDiskShuffleExecutorComponents` itself is not > that much. However when I tried this on the > `LocalDiskShuffleExecutorComponents` it was not a successful experiment which > suggests there is more to recovering shuffle files > I'd like to understand what will be the work involved for this. We'll be more > than happy to contribute
[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone
[ https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-44526: --- Affects Version/s: 3.5.0 (was: 3.4.1) > Porting k8s PVC reuse logic to spark standalone > --- > > Key: SPARK-44526 > URL: https://issues.apache.org/jira/browse/SPARK-44526 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > Hi, > This ticket is meant to understand the work that would be involved in porting > the k8s PVC reuse feature onto the spark standalone cluster manager which > reuses the shuffle files present locally in the disk > We are a heavy user of spot instances and we suffer from spot terminations > impacting our long running jobs > The logic in `KubernetesLocalDiskShuffleExecutorComponents` itself is not > that much. However when I tried this on the > `LocalDiskShuffleExecutorComponents` it was not a successful experiment which > suggests there is more to recovering shuffle files > I'd like to understand what will be the work involved for this. We'll be more > than happy to contribute
[jira] [Updated] (SPARK-45598) Delta table 3.0-rc2 not working with Spark Connect 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-45598: --- Description: Spark version 3.5.0 Spark Connect version 3.5.0 Delta table 3.0-rc2 Spark connect server was started using *{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" --conf 'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}* {{Connect client depends on}} *libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"* *and the connect libraries* When trying to run a simple job that writes to a delta table {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}} {{val data = spark.read.json("profiles.json")}} {{data.write.format("delta").save("/tmp/delta")}} {{Error log in connect client}} {{Exception in thread "main" org.apache.spark.SparkException: io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}} {{ at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}} {{ at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}} {{ at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}} {{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} {{ at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} 
{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} {{ at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} {{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} {{ at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} {{ at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} {{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} {{ at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} {{ at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} {{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} {{ at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} {{...}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}} {{ at scala.collection.Iterator.foreach(Iterator.scala:943)}} {{ at scala.collection.Iterator.foreach$(Iterator.scala:943)}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}} {{ at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}} {{ at 
scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}} {{ at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}} {{ at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)}} {{ at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)}} {{ at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.to(GrpcExceptionConverter.scala:46)}} {{ at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)}} {{ at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)}} {{ at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.toBuffer(GrpcExceptionConverter.scala:46)}} {{ at
[jira] [Updated] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
[ https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated SPARK-45599: Priority: Blocker (was: Major) > Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset > -- > > Key: SPARK-45599 > URL: https://issues.apache.org/jira/browse/SPARK-45599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.2.3, 3.5.0 >Reporter: Robert Joseph Evans >Priority: Blocker > Labels: data-corruption > > I think this actually impacts all versions that have ever supported > percentile and it may impact other things because the bug is in OpenHashMap. > > I am really surprised that we caught this bug because everything has to hit > just wrong to make it happen. in python/pyspark if you run > > {code:python} > from math import * > from pyspark.sql.types import * > data = [(1.779652973678931e+173,), (9.247723870123388e-295,), > (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), > (-3.085825028509117e+74,), (-1.9569489404314425e+128,), > (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), > (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), > (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), 
(-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), > (-1.1041009815645263e+203,), (1.8461539468514548e-225,), > (-5.620339412794757e-251,), (3.5103766991437114e-60,), > (2.4925669515657655e+165,), (3.217759099462207e+108,), > (-8.796717685143486e+203,), (2.037360925124577e+292,), > (-6.542279108216022e+206,), (-7.951172614280046e-74,), > (6.226527569272003e+152,), (-5.673977270111637e-84,), > (-1.0186016078084965e-281,), (1.7976931348623157e+308,), > (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), > (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), > (1.7976931348623157e+308,), (4.3214483342777574e-117,), > (-7.973642629411105e-89,), (-1.1028137694801181e-297,), > (2.9000325280299273e-39,), (-1.077534929323113e-264,), > (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), > (-1.831402251805194e+65,), (-2.664533698035492e+203,), > (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), > (-9.607772864590422e+217,), (3.437191836077251e+209,), > (1.9846569552093057e-137,), (-3.010452936419635e-233,), > (1.4309793775440402e-87,), (-2.9383643865423363e-103,), > (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), > (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), > (2.187766760184779e+306,), (7.679268835670585e+223,), > (6.3131466321042515e+153,), (1.779652973678931e+173,), > (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), > (1.9042708096454302e+195,), (-3.085825028509117e+74,), > (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), > (2.5212410617263588e-282,), (-2.646144697462316e-35,), > (-3.468683249247593e-196,), (nan,), (None,), (nan,), > (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), 
(2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), >
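The hazard behind the report above is that -0.0 and 0.0 compare equal under IEEE 754 but carry different bit patterns, so a hash map that hashes the raw double bits (as the report says OpenHashMap effectively does) can split one logical key across two buckets. A minimal Python sketch (plain Python, not Spark code) of the mismatch:

```python
# Sketch: -0.0 == 0.0, yet their raw IEEE 754 bit patterns differ.
# A map keyed on the raw bits can therefore disagree with its own
# equality check, which is the failure mode described for OpenHashMap.
import struct

def double_bits(d: float) -> int:
    """Return the raw IEEE 754 bit pattern of a double as an int."""
    return struct.unpack("<Q", struct.pack("<d", d))[0]

assert -0.0 == 0.0                         # equal under IEEE 754 comparison
assert double_bits(-0.0) != double_bits(0.0)  # but distinct bit patterns
assert double_bits(-0.0) >> 63 == 1        # only the sign bit differs

# Python's dict hashes by value, so both zeros collapse onto one key:
counts = {}
for x in [0.0, -0.0, 0.0]:
    counts[x] = counts.get(x, 0) + 1
assert counts == {0.0: 3}                  # one bucket, count 3
```

A bits-keyed map would instead report two entries with counts 1 and 2, which is exactly the kind of split that can skew a percentile over a dataset mixing the two zeros.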
[jira] [Updated] (SPARK-45559) Support spark.read.schema(...) for Python data source API
[ https://issues.apache.org/jira/browse/SPARK-45559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45559: - Description: Support `spark.read.schema(...)` for Python data source read. Add test cases where we send the schema as a string instead of StructType, and a positive case as well as a negative case where it doesn't parse successfully with fromDDL? was:Support `spark.read.schema(...)` for Python data source read > Support spark.read.schema(...) for Python data source API > - > > Key: SPARK-45559 > URL: https://issues.apache.org/jira/browse/SPARK-45559 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Support `spark.read.schema(...)` for Python data source read. > Add test cases where we send the schema as a string instead of StructType, > and a positive case as well as a negative case where it doesn't parse > successfully with fromDDL? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.
[ https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776783#comment-17776783 ]

Bruce Robbins commented on SPARK-45583:
---------------------------------------

Strangely, I cannot reproduce. Is some setting required?
{noformat}
sql("select version()").show(false)
+----------------------------------------------+
|version()                                     |
+----------------------------------------------+
|3.5.0 ce5ddad990373636e94071e7cef2f31021add07b|
+----------------------------------------------+

scala> sql("""WITH people as ( SELECT * FROM (VALUES (1, 'Peter'), (2, 'Homer'), (3, 'Ned'), (3, 'Jenny') ) AS Idiots(id, FirstName) ), location as ( SELECT * FROM (VALUES (1, 'sample0'), (1, 'sample1'), (2, 'sample2') ) as Locations(id, address) )SELECT * FROM people FULL OUTER JOIN location ON people.id = location.id""").show(false)
+---+---------+----+-------+
|id |FirstName|id  |address|
+---+---------+----+-------+
|1  |Peter    |1   |sample0|
|1  |Peter    |1   |sample1|
|2  |Homer    |2   |sample2|
|3  |Ned      |NULL|NULL   |
|3  |Jenny    |NULL|NULL   |
+---+---------+----+-------+

scala>
{noformat}

> Spark SQL returning incorrect values for full outer join on keys with the
> same name.
>
>                 Key: SPARK-45583
>                 URL: https://issues.apache.org/jira/browse/SPARK-45583
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Huw
>            Priority: Major
>
> {{The following query gives the wrong results.}}
>
> {{WITH people as (}}
> {{  SELECT * FROM (VALUES}}
> {{    (1, 'Peter'),}}
> {{    (2, 'Homer'),}}
> {{    (3, 'Ned'),}}
> {{    (3, 'Jenny')}}
> {{  ) AS Idiots(id, FirstName)}}
> {{), location as (}}
> {{  SELECT * FROM (VALUES}}
> {{    (1, 'sample0'),}}
> {{    (1, 'sample1'),}}
> {{    (2, 'sample2')}}
> {{  ) as Locations(id, address)}}
> {{)}}
> {{SELECT}}
> {{  *}}
> {{FROM}}
> {{  people}}
> {{FULL OUTER JOIN}}
> {{  location}}
> {{ON}}
> {{  people.id = location.id}}
>
> {{We find the following table:}}
> ||id: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |null|Ned|null|null|
> |null|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|
> {{But clearly the first `id` column is wrong, the nulls should be 3.}}
> If we rename the id column in (only) the person table to pid we get the
> correct results:
> ||pid: integer||FirstName: string||id: integer||address: string||
> |2|Homer|2|sample2|
> |3|Ned|null|null|
> |3|Jenny|null|null|
> |1|Peter|1|sample0|
> |1|Peter|1|sample1|
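The reporter's observation that renaming the column fixes the output suggests the duplicate `id` name is what trips up resolution. A sketch of that renamed query (illustrative only, mirroring the workaround described in the report, with `pid` as the renamed key):

```sql
-- Workaround sketch from the report: give the two join keys distinct
-- names so the duplicate-column resolution cannot pick the wrong `id`.
WITH people AS (
  SELECT * FROM (VALUES
    (1, 'Peter'),
    (2, 'Homer'),
    (3, 'Ned'),
    (3, 'Jenny')
  ) AS Idiots(pid, FirstName)   -- renamed from `id` to `pid`
), location AS (
  SELECT * FROM (VALUES
    (1, 'sample0'),
    (1, 'sample1'),
    (2, 'sample2')
  ) AS Locations(id, address)
)
SELECT *
FROM people
FULL OUTER JOIN location
ON people.pid = location.id
```

With distinct names there is no ambiguity to resolve, and per the report the unmatched left-side rows then correctly retain `pid = 3` instead of null.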