[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685124#comment-17685124 ] Peter Toth commented on SPARK-42346:
--

[~ritikam], please use the PySpark repro in the description, or add a second row to your input_table if you use Scala. That's because Spark can optimize count distinct away for one-row local relations.

> distinct(count colname) with UNION ALL causes query analyzer bug
> ----------------------------------------------------------------
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Reporter: Robin
> Assignee: Peter Toth
> Priority: Major
> Fix For: 3.3.2, 3.4.0, 3.5.0
>
> If you combine a UNION ALL with a count(distinct colname) you get a query analyzer bug.
>
> This behaviour was introduced in 3.3.0; the bug was not present in 3.2.1.
>
> Here is a reprex in PySpark:
> {code:python}
> import pandas as pd
>
> df_pd = pd.DataFrame([
>     {'surname': 'a', 'first_name': 'b'}
> ])
> df_spark = spark.createDataFrame(df_pd)
> df_spark.createOrReplaceTempView("input_table")
> sql = """
> SELECT
>   (SELECT Count(DISTINCT first_name) FROM input_table)
>   AS distinct_value_count
> FROM input_table
> UNION ALL
> SELECT
>   (SELECT Count(DISTINCT surname) FROM input_table)
>   AS distinct_value_count
> FROM input_table """
> spark.sql(sql).toPandas()
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
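[Editor's note] A plain-Python sketch (no Spark involved; names are illustrative) of what the repro query is supposed to compute: each UNION ALL branch emits, per input row, the table-wide distinct count of one column, so the correct result for the one-row table is two rows of 1. The analyzer bug makes Spark 3.3.0 fail before producing this.

```python
# Toy model of the repro query's semantics, assuming a one-row input_table.
rows = [{'surname': 'a', 'first_name': 'b'}]

def distinct_count(rows, col):
    """COUNT(DISTINCT col) over the whole table, i.e. the scalar subquery."""
    return len({r[col] for r in rows})

# UNION ALL concatenates the two branches; each branch repeats its scalar
# subquery result once per input row.
expected = ([distinct_count(rows, 'first_name')] * len(rows) +
            [distinct_count(rows, 'surname')] * len(rows))
print(expected)  # -> [1, 1]
```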
[jira] [Updated] (SPARK-42017) df["bad_key"] does not raise AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-42017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-42017:
--
Summary: df["bad_key"] does not raise AnalysisException (was: Different error type AnalysisException vs SparkConnectAnalysisException)

> df["bad_key"] does not raise AnalysisException
> ----------------------------------------------
>
> Key: SPARK-42017
> URL: https://issues.apache.org/jira/browse/SPARK-42017
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, Tests
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> e.g.)
> {code}
> 23/01/12 14:33:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> FAILED [ 8%]
> pyspark/sql/tests/test_column.py:105 (ColumnParityTests.test_access_column)
> self = <ColumnParityTests testMethod=test_access_column>
>
>     def test_access_column(self):
>         df = self.df
>         self.assertTrue(isinstance(df.key, Column))
>         self.assertTrue(isinstance(df["key"], Column))
>         self.assertTrue(isinstance(df[0], Column))
>         self.assertRaises(IndexError, lambda: df[2])
> >       self.assertRaises(AnalysisException, lambda: df["bad_key"])
> E       AssertionError: AnalysisException not raised by
>
> ../test_column.py:112: AssertionError
> {code}
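[Editor's note] A hedged, Spark-free sketch of the failure mode in the log above: if a client resolves column names lazily, indexing a bad key succeeds locally and nothing is raised, so the `assertRaises(AnalysisException, ...)` parity check fails with "not raised". The class names here are illustrative stand-ins, not the real PySpark types.

```python
# Hypothetical stand-in for pyspark.errors.AnalysisException.
class AnalysisException(Exception):
    pass

class LazyDataFrame:
    """Toy model of a client that defers column resolution: indexing an
    unknown column succeeds eagerly and would only fail at execution time."""
    def __getitem__(self, name):
        return ('unresolved_column', name)  # no eager validation

df = LazyDataFrame()

# The parity test expects AnalysisException here; with lazy resolution
# nothing is raised, which is exactly the AssertionError in the log.
try:
    df["bad_key"]
    outcome = 'no exception'
except AnalysisException:
    outcome = 'raised'
print(outcome)  # -> no exception
```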
[jira] [Assigned] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-42368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42368:
--
Assignee: Dongjoon Hyun

> Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
> ------------------------------------------------------------
>
> Key: SPARK-42368
> URL: https://issues.apache.org/jira/browse/SPARK-42368
> Project: Spark
> Issue Type: Test
> Components: Project Infra, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
[jira] [Resolved] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-42368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42368.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39921
[https://github.com/apache/spark/pull/39921]

> Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
> ------------------------------------------------------------
>
> Key: SPARK-42368
> URL: https://issues.apache.org/jira/browse/SPARK-42368
> Project: Spark
> Issue Type: Test
> Components: Project Infra, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Fix For: 3.4.0
[jira] [Commented] (SPARK-41708) Pull v1write information to WriteFiles
[ https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685103#comment-17685103 ] Apache Spark commented on SPARK-41708:
--
User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/39922

> Pull v1write information to WriteFiles
> --------------------------------------
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Assignee: XiDuo You
> Priority: Major
> Fix For: 3.4.0
>
> Make WriteFiles hold v1 write information
[jira] [Commented] (SPARK-39851) Improve join stats estimation if one side can keep uniqueness
[ https://issues.apache.org/jira/browse/SPARK-39851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685104#comment-17685104 ] Apache Spark commented on SPARK-39851:
--
User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/39923

> Improve join stats estimation if one side can keep uniqueness
> -------------------------------------------------------------
>
> Key: SPARK-39851
> URL: https://issues.apache.org/jira/browse/SPARK-39851
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> {code:sql}
> SELECT i_item_sk ss_item_sk
> FROM item,
>      (SELECT DISTINCT iss.i_brand_id    brand_id,
>                       iss.i_class_id    class_id,
>                       iss.i_category_id category_id
>       FROM item iss) x
> WHERE i_brand_id = brand_id
>   AND i_class_id = class_id
>   AND i_category_id = category_id
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, rowCount=3.24E+7)
> +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = class_id#52)) AND (i_category_id#15 = category_id#53)), Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7)
>    :- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5)
>    :  +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5)
>    :     +- Relation spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
>    +- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, rowCount=1.37E+5)
>       +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, rowCount=2.02E+5)
>          +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5)
>             +- Relation spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, rowCount=2.02E+5)
> +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = class_id#52)) AND (i_category_id#15 = category_id#53)), Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5)
>    :- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5)
>    :  +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5)
>    :     +- Relation spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
>    +- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, rowCount=1.37E+5)
>       +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, rowCount=2.02E+5)
>          +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5)
>             +- Relation
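[Editor's note] A back-of-the-envelope sketch (plain Python, not Spark's cost-based optimizer code) of the estimate the ticket asks for: when one join side is distinct on all the join keys, each row of the other side can match at most one row, so the join output is capped by the non-distinct side's row count. The row counts are taken from the plans above.

```python
probe_rows = 2.02e5           # item side after the not-null filters
distinct_build_rows = 1.37e5  # the DISTINCT aggregate over (brand_id, class_id, category_id)

def join_rows_upper_bound(probe_rows, build_side_is_unique):
    # With uniqueness on every join key, the inner join behaves like a
    # semi-join for cardinality purposes: at most one match per probe row.
    return probe_rows if build_side_is_unique else float('inf')

print(join_rows_upper_bound(probe_rows, True))  # -> 202000.0, matching the "Expected" plan
```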
[jira] [Commented] (SPARK-41708) Pull v1write information to WriteFiles
[ https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685102#comment-17685102 ] Apache Spark commented on SPARK-41708:
--
User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/39924

> Pull v1write information to WriteFiles
> --------------------------------------
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Assignee: XiDuo You
> Priority: Major
> Fix For: 3.4.0
>
> Make WriteFiles hold v1 write information
[jira] [Assigned] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-42368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42368:
--
Assignee: Apache Spark

> Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
> ------------------------------------------------------------
>
> Key: SPARK-42368
> URL: https://issues.apache.org/jira/browse/SPARK-42368
> Project: Spark
> Issue Type: Test
> Components: Project Infra, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Assignee: Apache Spark
> Priority: Minor
[jira] [Commented] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-42368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685101#comment-17685101 ] Apache Spark commented on SPARK-42368:
--
User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39921

> Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
> ------------------------------------------------------------
>
> Key: SPARK-42368
> URL: https://issues.apache.org/jira/browse/SPARK-42368
> Project: Spark
> Issue Type: Test
> Components: Project Infra, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Priority: Minor
[jira] [Assigned] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-42368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42368:
--
Assignee: (was: Apache Spark)

> Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
> ------------------------------------------------------------
>
> Key: SPARK-42368
> URL: https://issues.apache.org/jira/browse/SPARK-42368
> Project: Spark
> Issue Type: Test
> Components: Project Infra, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Priority: Minor
[jira] [Resolved] (SPARK-41962) Update the import order of scala package in class SpecificParquetRecordReaderBase
[ https://issues.apache.org/jira/browse/SPARK-41962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41962.
--
Fix Version/s: 3.2.4, 3.3.2
Resolution: Fixed

Issue resolved by pull request 39906
[https://github.com/apache/spark/pull/39906]

> Update the import order of scala package in class SpecificParquetRecordReaderBase
> ---------------------------------------------------------------------------------
>
> Key: SPARK-41962
> URL: https://issues.apache.org/jira/browse/SPARK-41962
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: shuyouZZ
> Assignee: shuyouZZ
> Priority: Major
> Fix For: 3.2.4, 3.3.2, 3.4.0
>
> There is a checkstyle issue in class {{SpecificParquetRecordReaderBase}}: the import order of the scala package is not correct.
[jira] [Assigned] (SPARK-41962) Update the import order of scala package in class SpecificParquetRecordReaderBase
[ https://issues.apache.org/jira/browse/SPARK-41962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41962:
--
Assignee: shuyouZZ

> Update the import order of scala package in class SpecificParquetRecordReaderBase
> ---------------------------------------------------------------------------------
>
> Key: SPARK-41962
> URL: https://issues.apache.org/jira/browse/SPARK-41962
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: shuyouZZ
> Assignee: shuyouZZ
> Priority: Major
> Fix For: 3.4.0
>
> There is a checkstyle issue in class {{SpecificParquetRecordReaderBase}}: the import order of the scala package is not correct.
[jira] [Resolved] (SPARK-42306) Assign name to _LEGACY_ERROR_TEMP_1317
[ https://issues.apache.org/jira/browse/SPARK-42306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42306.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39877
[https://github.com/apache/spark/pull/39877]

> Assign name to _LEGACY_ERROR_TEMP_1317
> --------------------------------------
>
> Key: SPARK-42306
> URL: https://issues.apache.org/jira/browse/SPARK-42306
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-42306) Assign name to _LEGACY_ERROR_TEMP_1317
[ https://issues.apache.org/jira/browse/SPARK-42306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42306:
--
Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_1317
> --------------------------------------
>
> Key: SPARK-42306
> URL: https://issues.apache.org/jira/browse/SPARK-42306
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
[jira] [Updated] (SPARK-42352) Upgrade maven to 3.8.7
[ https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42352:
--
Description:
[https://maven.apache.org/docs/3.8.7/release-notes.html]

was:
[https://maven.apache.org/docs/3.8.7/release-notes.html]

change to upgrade 3.9.0

https://maven.apache.org/docs/3.9.0/release-notes.html

> Upgrade maven to 3.8.7
> ----------------------
>
> Key: SPARK-42352
> URL: https://issues.apache.org/jira/browse/SPARK-42352
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Priority: Minor
>
> [https://maven.apache.org/docs/3.8.7/release-notes.html]
[jira] [Updated] (SPARK-42352) Upgrade maven to 3.8.7
[ https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42352:
--
Summary: Upgrade maven to 3.8.7 (was: Upgrade maven to 3.9.0)

> Upgrade maven to 3.8.7
> ----------------------
>
> Key: SPARK-42352
> URL: https://issues.apache.org/jira/browse/SPARK-42352
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Priority: Minor
>
> [https://maven.apache.org/docs/3.8.7/release-notes.html]
>
> change to upgrade 3.9.0
>
> https://maven.apache.org/docs/3.9.0/release-notes.html
[jira] [Created] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
Dongjoon Hyun created SPARK-42368:
--

Summary: Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
Key: SPARK-42368
URL: https://issues.apache.org/jira/browse/SPARK-42368
Project: Spark
Issue Type: Test
Components: Project Infra, Tests
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41612.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39919
[https://github.com/apache/spark/pull/39919]

> Support Catalog.isCached
> ------------------------
>
> Key: SPARK-41612
> URL: https://issues.apache.org/jira/browse/SPARK-41612
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41600:
--
Assignee: Hyukjin Kwon

> Support Catalog.cacheTable
> --------------------------
>
> Key: SPARK-41600
> URL: https://issues.apache.org/jira/browse/SPARK-41600
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
[jira] [Assigned] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41623:
--
Assignee: Hyukjin Kwon

> Support Catalog.uncacheTable
> ----------------------------
>
> Key: SPARK-41623
> URL: https://issues.apache.org/jira/browse/SPARK-41623
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
[jira] [Assigned] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41612:
--
Assignee: Hyukjin Kwon

> Support Catalog.isCached
> ------------------------
>
> Key: SPARK-41612
> URL: https://issues.apache.org/jira/browse/SPARK-41612
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
[jira] [Resolved] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41623.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39919
[https://github.com/apache/spark/pull/39919]

> Support Catalog.uncacheTable
> ----------------------------
>
> Key: SPARK-41623
> URL: https://issues.apache.org/jira/browse/SPARK-41623
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
[jira] [Resolved] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41600.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39919
[https://github.com/apache/spark/pull/39919]

> Support Catalog.cacheTable
> --------------------------
>
> Key: SPARK-41600
> URL: https://issues.apache.org/jira/browse/SPARK-41600
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
[jira] [Created] (SPARK-42367) DataFrame.drop could handle duplicated columns
Ruifeng Zheng created SPARK-42367:
--

Summary: DataFrame.drop could handle duplicated columns
Key: SPARK-42367
URL: https://issues.apache.org/jira/browse/SPARK-42367
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng

{code:java}
>>> df.join(df2, df.name == df2.name, 'inner').show()
+---+----+------+----+
|age|name|height|name|
+---+----+------+----+
| 16| Bob|    85| Bob|
| 14| Tom|    80| Tom|
+---+----+------+----+

>>> df.join(df2, df.name == df2.name, 'inner').drop('name').show()
+---+------+
|age|height|
+---+------+
| 16|    85|
| 14|    80|
+---+------+
{code}
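[Editor's note] A toy model (plain Python, not the PySpark API) of the behavior shown above: dropping by name removes every column with that name, which is why both "name" columns disappear from the join result at once.

```python
# A join result with a duplicated column name, modeled as a name list.
columns = ['age', 'name', 'height', 'name']

def drop_by_name(cols, target):
    """Name-based drop removes all columns matching the name."""
    return [c for c in cols if c != target]

print(drop_by_name(columns, 'name'))  # -> ['age', 'height']
```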
[jira] [Resolved] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
[ https://issues.apache.org/jira/browse/SPARK-42364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42364.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39915
[https://github.com/apache/spark/pull/39915]

> Split 'pyspark.pandas.tests.test_dataframe'
> -------------------------------------------
>
> Key: SPARK-42364
> URL: https://issues.apache.org/jira/browse/SPARK-42364
> Project: Spark
> Issue Type: Test
> Components: ps, Tests
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
[ https://issues.apache.org/jira/browse/SPARK-42364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42364:
--
Assignee: Ruifeng Zheng

> Split 'pyspark.pandas.tests.test_dataframe'
> -------------------------------------------
>
> Key: SPARK-42364
> URL: https://issues.apache.org/jira/browse/SPARK-42364
> Project: Spark
> Issue Type: Test
> Components: ps, Tests
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
[jira] [Assigned] (SPARK-42363) Remove session.register_udf
[ https://issues.apache.org/jira/browse/SPARK-42363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42363:
--
Assignee: Hyukjin Kwon

> Remove session.register_udf
> ---------------------------
>
> Key: SPARK-42363
> URL: https://issues.apache.org/jira/browse/SPARK-42363
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
[jira] [Resolved] (SPARK-42363) Remove session.register_udf
[ https://issues.apache.org/jira/browse/SPARK-42363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42363.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39916
[https://github.com/apache/spark/pull/39916]

> Remove session.register_udf
> ---------------------------
>
> Key: SPARK-42363
> URL: https://issues.apache.org/jira/browse/SPARK-42363
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
[jira] [Resolved] (SPARK-42038) SPJ: Support partially clustered distribution
[ https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42038.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39633
[https://github.com/apache/spark/pull/39633]

> SPJ: Support partially clustered distribution
> ---------------------------------------------
>
> Key: SPARK-42038
> URL: https://issues.apache.org/jira/browse/SPARK-42038
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Fix For: 3.4.0
>
> Currently the storage-partitioned join requires both sides to be fully clustered on the partition values, that is, all input partitions reported by a V2 data source shall be grouped by partition values before the join happens. This can lead to data skew issues if a particular partition value is associated with a large number of rows.
>
> To combat this, we can introduce the idea of partially clustered distribution, which means that only one side of the join is required to be fully clustered, while the other side is not. This allows Spark to increase the parallelism of the join and avoid the data skew.
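[Editor's note] A toy illustration (plain Python, not Spark's implementation) of the skew problem the ticket describes: fully clustering both sides forces every row of a hot partition value into one task, while keeping only one side fully clustered lets the other side's splits for that value stay spread across tasks. The split count and row counts are made up for the example.

```python
from collections import Counter

# Partition values of the large side's rows; 'a' is heavily skewed.
left_partition_values = ['a'] * 1000 + ['b'] * 10

# Fully clustered: one task per partition value, so the 'a' task gets 1000 rows.
fully = Counter(left_partition_values)

# Partially clustered: leave the large side's splits ungrouped (say 10 splits)
# and replicate the matching fully-clustered group from the other side to each.
splits = 10
partially = [len(left_partition_values) // splits] * splits

print(fully['a'], max(partially))  # -> 1000 101
```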
[jira] [Assigned] (SPARK-42038) SPJ: Support partially clustered distribution
[ https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42038:
--
Assignee: Chao Sun

> SPJ: Support partially clustered distribution
> ---------------------------------------------
>
> Key: SPARK-42038
> URL: https://issues.apache.org/jira/browse/SPARK-42038
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
>
> Currently the storage-partitioned join requires both sides to be fully clustered on the partition values, that is, all input partitions reported by a V2 data source shall be grouped by partition values before the join happens. This can lead to data skew issues if a particular partition value is associated with a large number of rows.
>
> To combat this, we can introduce the idea of partially clustered distribution, which means that only one side of the join is required to be fully clustered, while the other side is not. This allows Spark to increase the parallelism of the join and avoid the data skew.
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, the library used by the CSV data source, was upgraded from 2.8.3 to 2.9.0 to fix some bugs. The upgrade also pulled in a new univocity-parsers feature: values in the first column that start with the comment character are now quoted. This is a breaking change for downstream users that handle a whole row as input.

For this code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test")
{code}
before Spark 3.0 the content of the output CSV files was:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is:
!image-2023-02-03-18-56-10-083.png!

Users cannot set the comment option to '\u0000' to keep the old behavior, because of the newly added `isCommentSet` check:
{code:java}
val isCommentSet = this.comment != '\u0000'

def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}
{code}
It would be better to pass the comment option through to univocity only when users set it explicitly in the CSV data source.

After this change, the behavior is as follows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\u0000def", "xyz").toDF().write.option("comment", "\u0000").csv(path)|#abc def xyz|"#abc" def xyz|#abc "def" xyz|this update differs slightly from 3.0|
|2|Seq("#abc", "\u0000def", "xyz").toDF().write.option("comment", "#").csv(path)|#abc def xyz|"#abc" def xyz|"#abc" def xyz|the same|
|3|Seq("#abc", "\u0000def", "xyz").toDF().write.csv(path)|#abc def xyz|"#abc" def xyz|"#abc" def xyz|default behavior: the same|
|4|Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.option("comment", "\u0000").csv(path)|#abc xyz|#abc \u0000def xyz|#abc xyz|this update differs slightly from 3.0|
|5|Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.option("comment", "#").csv(path)|\u0000def xyz|\u0000def xyz|\u0000def xyz|the same|
|6|Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.csv(path)|#abc xyz|#abc \u0000def xyz|#abc \u0000def xyz|default behavior: the same|
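The difference between the existing value-based check and the proposed explicit-presence check can be sketched in plain Python. This is an illustrative model only, not Spark's actual CSVOptions class; the names are hypothetical:

```python
# Illustrative model of the option-handling shape, not the real implementation.
DEFAULT_COMMENT = "\u0000"  # univocity's "no comment character" sentinel


class CsvOptions:
    def __init__(self, params):
        self.params = params  # options the user passed, e.g. {"comment": "#"}

    @property
    def comment(self):
        return self.params.get("comment", DEFAULT_COMMENT)

    def is_comment_set_by_value(self):
        # The check criticized in the ticket: comparing against the default
        # cannot distinguish "unset" from "explicitly set to \u0000".
        return self.comment != DEFAULT_COMMENT

    def is_comment_set_explicitly(self):
        # The proposed behavior: forward the option whenever the user set it.
        return "comment" in self.params


opts = CsvOptions({"comment": "\u0000"})
print(opts.is_comment_set_by_value())    # False: the explicit setting is lost
print(opts.is_comment_set_explicitly())  # True: the user's intent is visible
```

With the presence-based check, an explicit `option("comment", "\u0000")` can be forwarded to univocity and restore the pre-3.0 behavior.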
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource
> Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 > Reporter: Wei Guo > Priority: Minor > Fix For: 3.4.0 > Attachments: image-2023-02-03-18-56-01-596.png, image-2023-02-03-18-56-10-083.png
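The writer-side change behind the breakage can also be illustrated without Spark or univocity. A minimal plain-Python sketch (hypothetical helper names, using `#` as the comment character for readability) of how quoting the leading field protects it from a comment-honoring reader:

```python
COMMENT = "#"  # stand-in comment character for this sketch


def write_line(fields, quote_leading_comment):
    # Mimics the writer: univocity >= 2.9.0 quotes a first field that starts
    # with the comment character; older versions left it bare.
    out = []
    for i, f in enumerate(fields):
        if i == 0 and quote_leading_comment and f.startswith(COMMENT):
            f = '"' + f.replace('"', '""') + '"'
        out.append(f)
    return ",".join(out)


def read_lines(lines):
    # A comment-honoring reader drops unquoted lines starting with COMMENT.
    return [line for line in lines if not line.startswith(COMMENT)]


old_style = write_line(["#abc", "1"], quote_leading_comment=False)  # '#abc,1'
new_style = write_line(["#abc", "1"], quote_leading_comment=True)   # '"#abc",1'
print(read_lines([old_style]))  # []: the bare row is swallowed as a comment
print(read_lines([new_style]))  # ['"#abc",1']: the quoted row survives
```

This is why the 2.9.0 quoting is a round-trip fix for CSV readers, while breaking consumers that expect the raw unquoted value.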
[jira] [Resolved] (SPARK-42354) Upgrade Jackson to 2.14.2
[ https://issues.apache.org/jira/browse/SPARK-42354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42354. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39898 [https://github.com/apache/spark/pull/39898] > Upgrade Jackson to 2.14.2 > - > > Key: SPARK-42354 > URL: https://issues.apache.org/jira/browse/SPARK-42354 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.14.2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42354) Upgrade Jackson to 2.14.2
[ https://issues.apache.org/jira/browse/SPARK-42354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42354: - Assignee: Yang Jie > Upgrade Jackson to 2.14.2 > - > > Key: SPARK-42354 > URL: https://issues.apache.org/jira/browse/SPARK-42354 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.14.2
[jira] [Assigned] (SPARK-41716) Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py
[ https://issues.apache.org/jira/browse/SPARK-41716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41716: Assignee: (was: Apache Spark) > Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py > -- > > Key: SPARK-41716 > URL: https://issues.apache.org/jira/browse/SPARK-41716 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > _catalog_to_pandas is more about client.py. We should probably factor this > out to the client.
[jira] [Assigned] (SPARK-41716) Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py
[ https://issues.apache.org/jira/browse/SPARK-41716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41716: Assignee: Apache Spark > Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py > -- > > Key: SPARK-41716 > URL: https://issues.apache.org/jira/browse/SPARK-41716 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > _catalog_to_pandas is more about client.py. We should probably factor this > out to the client.
[jira] [Commented] (SPARK-41716) Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py
[ https://issues.apache.org/jira/browse/SPARK-41716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685048#comment-17685048 ] Apache Spark commented on SPARK-41716: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39920 > Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py > -- > > Key: SPARK-41716 > URL: https://issues.apache.org/jira/browse/SPARK-41716 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > _catalog_to_pandas is more about client.py. We should probably factor this > out to the client.
[jira] [Assigned] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42362: - Assignee: Bjørn Jørgensen > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1
[jira] [Resolved] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42362. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39912 [https://github.com/apache/spark/pull/39912] > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0 > > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1
[jira] [Commented] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685032#comment-17685032 ] Apache Spark commented on SPARK-41612: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.isCached > > > Key: SPARK-41612 > URL: https://issues.apache.org/jira/browse/SPARK-41612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Assigned] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41623: Assignee: (was: Apache Spark) > Support Catalog.uncacheTable > > > Key: SPARK-41623 > URL: https://issues.apache.org/jira/browse/SPARK-41623 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Assigned] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41612: Assignee: (was: Apache Spark) > Support Catalog.isCached > > > Key: SPARK-41612 > URL: https://issues.apache.org/jira/browse/SPARK-41612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Commented] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685030#comment-17685030 ] Apache Spark commented on SPARK-41612: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.isCached > > > Key: SPARK-41612 > URL: https://issues.apache.org/jira/browse/SPARK-41612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Commented] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685028#comment-17685028 ] Apache Spark commented on SPARK-41623: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.uncacheTable > > > Key: SPARK-41623 > URL: https://issues.apache.org/jira/browse/SPARK-41623 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Commented] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685031#comment-17685031 ] Apache Spark commented on SPARK-41612: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.isCached > > > Key: SPARK-41612 > URL: https://issues.apache.org/jira/browse/SPARK-41612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Assigned] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41612: Assignee: Apache Spark > Support Catalog.isCached > > > Key: SPARK-41612 > URL: https://issues.apache.org/jira/browse/SPARK-41612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41623: Assignee: Apache Spark > Support Catalog.uncacheTable > > > Key: SPARK-41623 > URL: https://issues.apache.org/jira/browse/SPARK-41623 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685027#comment-17685027 ] Apache Spark commented on SPARK-41623: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.uncacheTable > > > Key: SPARK-41623 > URL: https://issues.apache.org/jira/browse/SPARK-41623 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Commented] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685024#comment-17685024 ] Apache Spark commented on SPARK-41600: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.cacheTable > -- > > Key: SPARK-41600 > URL: https://issues.apache.org/jira/browse/SPARK-41600 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Assigned] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41600: Assignee: Apache Spark > Support Catalog.cacheTable > -- > > Key: SPARK-41600 > URL: https://issues.apache.org/jira/browse/SPARK-41600 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42366) Log shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685025#comment-17685025 ] Apache Spark commented on SPARK-42366: -- User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/39918 > Log shuffle data corruption diagnose cause > -- > > Key: SPARK-42366 > URL: https://issues.apache.org/jira/browse/SPARK-42366 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Minor >
[jira] [Assigned] (SPARK-42366) Log shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42366: Assignee: (was: Apache Spark) > Log shuffle data corruption diagnose cause > -- > > Key: SPARK-42366 > URL: https://issues.apache.org/jira/browse/SPARK-42366 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Minor >
[jira] [Assigned] (SPARK-42366) Log shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42366: Assignee: Apache Spark > Log shuffle data corruption diagnose cause > -- > > Key: SPARK-42366 > URL: https://issues.apache.org/jira/browse/SPARK-42366 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: Apache Spark >Priority: Minor >
[jira] [Commented] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685026#comment-17685026 ] Apache Spark commented on SPARK-41600: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.cacheTable > -- > > Key: SPARK-41600 > URL: https://issues.apache.org/jira/browse/SPARK-41600 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41600: Assignee: (was: Apache Spark) > Support Catalog.cacheTable > -- > > Key: SPARK-41600 > URL: https://issues.apache.org/jira/browse/SPARK-41600 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42366) Log shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-42366: --- Summary: Log shuffle data corruption diagnose cause (was: Log output shuffle data corruption diagnose cause) > Log shuffle data corruption diagnose cause > -- > > Key: SPARK-42366 > URL: https://issues.apache.org/jira/browse/SPARK-42366 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42366) Log output shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-42366: --- Summary: Log output shuffle data corruption diagnose cause (was: Log output shuffle data corruption diagnose causes) > Log output shuffle data corruption diagnose cause > - > > Key: SPARK-42366 > URL: https://issues.apache.org/jira/browse/SPARK-42366 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42366) Log output shuffle data corruption diagnose causes
dzcxzl created SPARK-42366: -- Summary: Log output shuffle data corruption diagnose causes Key: SPARK-42366 URL: https://issues.apache.org/jira/browse/SPARK-42366 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.2.0 Reporter: dzcxzl -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42352) Upgrade maven to 3.9.0
[ https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42352: - Description: [https://maven.apache.org/docs/3.8.7/release-notes.html] change to upgrade 3.9.0 https://maven.apache.org/docs/3.9.0/release-notes.html was:https://maven.apache.org/docs/3.8.7/release-notes.html > Upgrade maven to 3.9.0 > -- > > Key: SPARK-42352 > URL: https://issues.apache.org/jira/browse/SPARK-42352 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > [https://maven.apache.org/docs/3.8.7/release-notes.html] > > change to upgrade 3.9.0 > > https://maven.apache.org/docs/3.9.0/release-notes.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42352) Upgrade maven to 3.9.0
[ https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42352: - Summary: Upgrade maven to 3.9.0 (was: Upgrade maven to 3.8.7) > Upgrade maven to 3.9.0 > -- > > Key: SPARK-42352 > URL: https://issues.apache.org/jira/browse/SPARK-42352 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > https://maven.apache.org/docs/3.8.7/release-notes.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42365) Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
[ https://issues.apache.org/jira/browse/SPARK-42365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42365: Assignee: Apache Spark > Split 'pyspark.pandas.tests.test_ops_on_diff_frames' > > > Key: SPARK-42365 > URL: https://issues.apache.org/jira/browse/SPARK-42365 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42365) Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
[ https://issues.apache.org/jira/browse/SPARK-42365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685016#comment-17685016 ] Apache Spark commented on SPARK-42365: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39917 > Split 'pyspark.pandas.tests.test_ops_on_diff_frames' > > > Key: SPARK-42365 > URL: https://issues.apache.org/jira/browse/SPARK-42365 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42365) Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
[ https://issues.apache.org/jira/browse/SPARK-42365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42365: Assignee: (was: Apache Spark) > Split 'pyspark.pandas.tests.test_ops_on_diff_frames' > > > Key: SPARK-42365 > URL: https://issues.apache.org/jira/browse/SPARK-42365 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42365) Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
[ https://issues.apache.org/jira/browse/SPARK-42365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685015#comment-17685015 ] Apache Spark commented on SPARK-42365: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39917 > Split 'pyspark.pandas.tests.test_ops_on_diff_frames' > > > Key: SPARK-42365 > URL: https://issues.apache.org/jira/browse/SPARK-42365 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42365) Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
Ruifeng Zheng created SPARK-42365: - Summary: Split 'pyspark.pandas.tests.test_ops_on_diff_frames' Key: SPARK-42365 URL: https://issues.apache.org/jira/browse/SPARK-42365 Project: Spark Issue Type: Test Components: ps, Tests Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42363) Remove session.register_udf
[ https://issues.apache.org/jira/browse/SPARK-42363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42363: Assignee: (was: Apache Spark) > Remove session.register_udf > --- > > Key: SPARK-42363 > URL: https://issues.apache.org/jira/browse/SPARK-42363 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40532) Python version for UDF should follow the servers version
[ https://issues.apache.org/jira/browse/SPARK-40532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40532: Assignee: (was: Apache Spark) > Python version for UDF should follow the servers version > > > Key: SPARK-40532 > URL: https://issues.apache.org/jira/browse/SPARK-40532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Minor > > Currently, we artificially pin the Python version to 3.9 in the UDF > generation code, but this should actually be the correct server vs client > version. > > In addition the version should be configured as part of the function > definition proto message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42363) Remove session.register_udf
[ https://issues.apache.org/jira/browse/SPARK-42363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685005#comment-17685005 ] Apache Spark commented on SPARK-42363: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39916 > Remove session.register_udf > --- > > Key: SPARK-42363 > URL: https://issues.apache.org/jira/browse/SPARK-42363 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42363) Remove session.register_udf
[ https://issues.apache.org/jira/browse/SPARK-42363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42363: Assignee: Apache Spark > Remove session.register_udf > --- > > Key: SPARK-42363 > URL: https://issues.apache.org/jira/browse/SPARK-42363 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40532) Python version for UDF should follow the servers version
[ https://issues.apache.org/jira/browse/SPARK-40532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685004#comment-17685004 ] Apache Spark commented on SPARK-40532: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39914 > Python version for UDF should follow the servers version > > > Key: SPARK-40532 > URL: https://issues.apache.org/jira/browse/SPARK-40532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Minor > > Currently, we artificially pin the Python version to 3.9 in the UDF > generation code, but this should actually be the correct server vs client > version. > > In addition the version should be configured as part of the function > definition proto message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40532) Python version for UDF should follow the servers version
[ https://issues.apache.org/jira/browse/SPARK-40532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40532: Assignee: Apache Spark > Python version for UDF should follow the servers version > > > Key: SPARK-40532 > URL: https://issues.apache.org/jira/browse/SPARK-40532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Apache Spark >Priority: Minor > > Currently, we artificially pin the Python version to 3.9 in the UDF > generation code, but this should actually be the correct server vs client > version. > > In addition the version should be configured as part of the function > definition proto message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
[ https://issues.apache.org/jira/browse/SPARK-42364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42364: Assignee: Apache Spark > Split 'pyspark.pandas.tests.test_dataframe' > --- > > Key: SPARK-42364 > URL: https://issues.apache.org/jira/browse/SPARK-42364 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
[ https://issues.apache.org/jira/browse/SPARK-42364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685003#comment-17685003 ] Apache Spark commented on SPARK-42364: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39915 > Split 'pyspark.pandas.tests.test_dataframe' > --- > > Key: SPARK-42364 > URL: https://issues.apache.org/jira/browse/SPARK-42364 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
[ https://issues.apache.org/jira/browse/SPARK-42364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42364: Assignee: (was: Apache Spark) > Split 'pyspark.pandas.tests.test_dataframe' > --- > > Key: SPARK-42364 > URL: https://issues.apache.org/jira/browse/SPARK-42364 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
Ruifeng Zheng created SPARK-42364: - Summary: Split 'pyspark.pandas.tests.test_dataframe' Key: SPARK-42364 URL: https://issues.apache.org/jira/browse/SPARK-42364 Project: Spark Issue Type: Test Components: ps, Tests Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42363) Remove session.register_udf
Hyukjin Kwon created SPARK-42363: Summary: Remove session.register_udf Key: SPARK-42363 URL: https://issues.apache.org/jira/browse/SPARK-42363 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684992#comment-17684992 ] Ritika Maheshwari commented on SPARK-42346:
-------------------------------------------
I have Spark 3.3.0 and I do not have the 39887 fix. I am not able to reproduce this issue. Am I missing something?

scala> val df = Seq(("a","b")).toDF("surname","first_name")
df: org.apache.spark.sql.DataFrame = [surname: string, first_name: string]

scala> df.createOrReplaceTempView("input_table")

scala> spark.sql("select (Select Count(Distinct first_name) from input_table) As distinct_value_count from input_table Union all select (select count(Distinct surname) from input_table) as distinct_value_count from input_table").show()
+--------------------+
|distinct_value_count|
+--------------------+
|                   1|
|                   1|
+--------------------+

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Union
   :- Project [cast(Subquery subquery#46, [id=#114] as string) AS distinct_value_count#62]
   :  :  +- Subquery subquery#46, [id=#114]
   :  :     +- AdaptiveSparkPlan isFinalPlan=false
   :  :        +- HashAggregate(keys=[], functions=[count(first_name#12)], output=[count(DISTINCT first_name)#53L])
   :  :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#112]
   :  :              +- HashAggregate(keys=[], functions=[partial_count(first_name#12)], output=[count#67L])
   :  :                 +- LocalTableScan [first_name#12]
   :  +- LocalTableScan [_1#6, _2#7]
   +- Project [cast(Subquery subquery#48, [id=#125] as string) AS distinct_value_count#64]
      :  +- Subquery subquery#48, [id=#125]
      :     +- AdaptiveSparkPlan isFinalPlan=false
      :        +- HashAggregate(keys=[], functions=[count(surname#11)], output=[count(DISTINCT surname)#55L])
      :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#123]
      :              +- HashAggregate(keys=[], functions=[partial_count(surname#11)], output=[count#68L])
      :                 +- LocalTableScan [surname#11]
      +- LocalTableScan [_1#50, _2#51]

This is what I have in my SparkOptimizer.scala:

override def defaultBatches: Seq[Batch] = (preOptimizationBatches ++ super.defaultBatches :+
  Batch("Optimize Metadata Only Query", Once, OptimizeMetadataOnlyQuery(catalog)) :+
  Batch("PartitionPruning", Once, PartitionPruning) :+
  Batch("InjectRuntimeFilter", FixedPoint(1), InjectRuntimeFilter, RewritePredicateSubquery) :+
  Batch("MergeScalarSubqueries", Once, MergeScalarSubqueries) :+
  Batch("Pushdown Filters from PartitionPruning", fixedPoint, PushDownPredicates) :+
  Batch

> distinct(count colname) with UNION ALL causes query analyzer bug
> ----------------------------------------------------------------
>
>                 Key: SPARK-42346
>                 URL: https://issues.apache.org/jira/browse/SPARK-42346
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0, 3.4.0, 3.5.0
>            Reporter: Robin
>            Assignee: Peter Toth
>            Priority: Major
>             Fix For: 3.3.2, 3.4.0, 3.5.0
>
> If you combine a UNION ALL with a count(distinct colname) you get a query analyzer bug.
>
> This behaviour is introduced in 3.3.0. The bug was not present in 3.2.1.
>
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{ \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{ (SELECT Count(DISTINCT first_name) FROM input_table) }}
> {{ AS distinct_value_count}}
> {{FROM input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{ (SELECT Count(DISTINCT surname) FROM input_table) }}
> {{ AS distinct_value_count}}
> {{FROM input_table """}}
> {{spark.sql(sql).toPandas()}}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
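A likely reason the one-row Scala repro above does not fail: when the input is a one-row local relation, Spark can fold the COUNT(DISTINCT ...) subqueries away before the problematic plan shape ever forms, so the repro needs at least two input rows. A hedged sketch of what that would look like in a Spark shell (assumes an affected 3.3.x build; the row values are illustrative, and the exact failure mode on such a build is an analysis error rather than the `1`/`1` result shown above):

```scala
// Same query as Ritika's, but with a second row so the count distinct
// cannot be optimized out of the local relation before analysis.
val df = Seq(("a", "b"), ("c", "d")).toDF("surname", "first_name")
df.createOrReplaceTempView("input_table")

spark.sql("""
  SELECT (SELECT COUNT(DISTINCT first_name) FROM input_table) AS distinct_value_count
  FROM input_table
  UNION ALL
  SELECT (SELECT COUNT(DISTINCT surname) FROM input_table) AS distinct_value_count
  FROM input_table
""").show()
```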
[jira] [Commented] (SPARK-42268) Add UserDefinedType in protos
[ https://issues.apache.org/jira/browse/SPARK-42268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684990#comment-17684990 ] Apache Spark commented on SPARK-42268: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39913 > Add UserDefinedType in protos > - > > Key: SPARK-42268 > URL: https://issues.apache.org/jira/browse/SPARK-42268 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42268) Add UserDefinedType in protos
[ https://issues.apache.org/jira/browse/SPARK-42268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684989#comment-17684989 ] Apache Spark commented on SPARK-42268: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39913 > Add UserDefinedType in protos > - > > Key: SPARK-42268 > URL: https://issues.apache.org/jira/browse/SPARK-42268 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684952#comment-17684952 ] Apache Spark commented on SPARK-42362: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/39912 > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Minor > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42362: Assignee: (was: Apache Spark) > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Minor > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42362: Assignee: Apache Spark > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Minor > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
Bjørn Jørgensen created SPARK-42362: --- Summary: Upgrade kubernetes-client from 6.4.0 to 6.4.1 Key: SPARK-42362 URL: https://issues.apache.org/jira/browse/SPARK-42362 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Bjørn Jørgensen New version of kubernetes client Release notes https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42361) Add an option to use external storage to distribute JAR set in cluster mode on Kube
Holden Karau created SPARK-42361: Summary: Add an option to use external storage to distribute JAR set in cluster mode on Kube Key: SPARK-42361 URL: https://issues.apache.org/jira/browse/SPARK-42361 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.5.0 Reporter: Holden Karau tl;dr – sometimes the driver can get overwhelmed serving the initial jar set. You'll see a lot of "Executor fetching spark://.../jar" and then connection timed out. On YARN the jars (in cluster mode) are cached in HDFS. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
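One mitigation in the spirit of the YARN behavior described above is to stage the application jars on external storage so executors fetch them from there rather than from the driver's file server. A hedged sketch (the bucket, paths, and API-server address are placeholders, not from the ticket; `spark.kubernetes.file.upload.path` is the existing conf for uploading local dependencies to a shared store in cluster mode):

```shell
# Sketch: distribute the jar set via object storage instead of the
# driver's spark:// file server on Kubernetes in cluster mode.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.file.upload.path=s3a://my-bucket/spark-uploads \
  --jars s3a://my-bucket/deps/dep1.jar,s3a://my-bucket/deps/dep2.jar \
  s3a://my-bucket/app/app.jar
```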
[jira] [Commented] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side
[ https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684906#comment-17684906 ] Apache Spark commented on SPARK-36478: -- User 'clubycoder' has created a pull request for this issue: https://github.com/apache/spark/pull/39911 > Removes outer join if all grouping and aggregate expressions are from the > streamed side > --- > > Key: SPARK-36478 > URL: https://issues.apache.org/jira/browse/SPARK-36478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wan Kun >Priority: Minor > > Removes outer join if all grouping and aggregate expressions are from the > streamed side. > For example: > {code:java} > spark.range(200L).selectExpr("id AS a", "id as b", "id as > c").createTempView("t1") > spark.range(300L).selectExpr("id AS a").createTempView("t2") > spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a > GROUP BY t1.b").explain(true) > {code} > Current optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] > +- Project [b#3L, c#4L] >+- Join LeftOuter, (a#2L = a#10L) > :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] > : +- Range (0, 200, step=1, splits=Some(1)) > +- Project [id#8L AS a#10L] > +- Range (0, 300, step=1, splits=Some(1)) > {code} > Expected optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] > +- Project [id#274L AS b#277L, id#274L AS c#278L] >+- Range (0, 200, step=1, splits=Some(2)) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684903#comment-17684903 ] Gera Shegalov edited comment on SPARK-41793 at 2/6/23 7:38 PM: --- if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and backported to maintenance branches? was (Author: jira.shegalov): if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and probably backported to maintenance branches? > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684903#comment-17684903 ] Gera Shegalov commented on SPARK-41793: --- if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and probably backported to maintenance branches? > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684895#comment-17684895 ] manpreet singh commented on SPARK-24942: [~gurwls223] Any updates on this? We are also facing this issue. We want to use stage-level scheduling, and our jobs need barrier execution. If we cannot enable DRA, we will incur a huge infra cost for the Spark pool that is no longer used by the current stage. > Improve cluster resource management with jobs containing barrier stage > -- > > Key: SPARK-24942 > URL: https://issues.apache.org/jira/browse/SPARK-24942 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r205652317 > We shall improve cluster resource management to address the following issues: > - With dynamic resource allocation enabled, it may happen that we acquire > some executors (but not enough to launch all the tasks in a barrier stage), > later release them when the executor idle timeout expires, and then acquire > them again. > - There can be a deadlock between two concurrent applications. Each application > may acquire some resources, but not enough to launch all the tasks in a > barrier stage. After hitting the idle timeout and releasing them, they > may acquire resources again, but just continually trade resources between > each other. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
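The two-application deadlock described in the issue can be pictured with a toy pure-Python model (not Spark code; the pool size, slot counts, and function names are invented for illustration). A barrier stage launches only when all of its tasks can run at once, so two apps splitting a shared pool can trade resources forever without either reaching its threshold:

```python
def trade_resources(pool, need, rounds):
    """Two apps race for free slots each round; a barrier stage launches only
    when one app holds `need` slots. Otherwise both hit the idle timeout and
    release everything, and the cycle repeats."""
    held = [0, 0]
    history = []
    for _ in range(rounds):
        free = pool - sum(held)
        held[0] += free - free // 2  # both apps split the free slots
        held[1] += free // 2
        history.append(tuple(held))
        if max(held) >= need:
            return history           # some barrier stage can finally launch
        held = [0, 0]                # idle timeout: release without running
    return history

# 10-slot pool, two apps each needing 8 slots for a barrier stage:
# every round ends at (5, 5); resources are continually traded, nothing runs.
print(trade_resources(pool=10, need=8, rounds=3))
```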
[jira] [Assigned] (SPARK-42357) Log `exitCode` when `SparkContext.stop` starts
[ https://issues.apache.org/jira/browse/SPARK-42357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42357: - Assignee: Dongjoon Hyun > Log `exitCode` when `SparkContext.stop` starts > -- > > Key: SPARK-42357 > URL: https://issues.apache.org/jira/browse/SPARK-42357 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > {code} > 23/02/06 02:12:55 INFO SparkContext: SparkContext is stopping with exitCode 0. > {code} > {code} > Pi is roughly 3.147080 > 23/02/06 02:12:55 INFO SparkContext: SparkContext is stopping with exitCode 0. > ... > 23/02/06 02:12:55 INFO AbstractConnector: Stopped Spark@1cb72b8{HTTP/1.1, > (http/1.1)}{localhost:4040} > 23/02/06 02:12:55 INFO SparkUI: Stopped Spark web UI at http://localhost:4040 > 23/02/06 02:12:55 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 23/02/06 02:12:55 INFO MemoryStore: MemoryStore cleared > 23/02/06 02:12:55 INFO BlockManager: BlockManager stopped > 23/02/06 02:12:55 INFO BlockManagerMaster: BlockManagerMaster stopped > 23/02/06 02:12:55 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 23/02/06 02:12:55 INFO SparkContext: Successfully stopped SparkContext > 23/02/06 02:12:56 INFO ShutdownHookManager: Shutdown hook called > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42357) Log `exitCode` when `SparkContext.stop` starts
[ https://issues.apache.org/jira/browse/SPARK-42357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42357. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39900 [https://github.com/apache/spark/pull/39900] > Log `exitCode` when `SparkContext.stop` starts > -- > > Key: SPARK-42357 > URL: https://issues.apache.org/jira/browse/SPARK-42357 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > > {code} > 23/02/06 02:12:55 INFO SparkContext: SparkContext is stopping with exitCode 0. > {code} > {code} > Pi is roughly 3.147080 > 23/02/06 02:12:55 INFO SparkContext: SparkContext is stopping with exitCode 0. > ... > 23/02/06 02:12:55 INFO AbstractConnector: Stopped Spark@1cb72b8{HTTP/1.1, > (http/1.1)}{localhost:4040} > 23/02/06 02:12:55 INFO SparkUI: Stopped Spark web UI at http://localhost:4040 > 23/02/06 02:12:55 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 23/02/06 02:12:55 INFO MemoryStore: MemoryStore cleared > 23/02/06 02:12:55 INFO BlockManager: BlockManager stopped > 23/02/06 02:12:55 INFO BlockManagerMaster: BlockManagerMaster stopped > 23/02/06 02:12:55 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 23/02/06 02:12:55 INFO SparkContext: Successfully stopped SparkContext > 23/02/06 02:12:56 INFO ShutdownHookManager: Shutdown hook called > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42337) Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT
[ https://issues.apache.org/jira/browse/SPARK-42337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684856#comment-17684856 ] Apache Spark commented on SPARK-42337: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/39910 > Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT > - > > Key: SPARK-42337 > URL: https://issues.apache.org/jira/browse/SPARK-42337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT and move > the following error classes to use the new one: > * _LEGACY_ERROR_TEMP_1283 > * _LEGACY_ERROR_TEMP_1284 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42337) Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT
[ https://issues.apache.org/jira/browse/SPARK-42337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42337: Assignee: Apache Spark > Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT > - > > Key: SPARK-42337 > URL: https://issues.apache.org/jira/browse/SPARK-42337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT and move > the following error classes to use the new one: > * _LEGACY_ERROR_TEMP_1283 > * _LEGACY_ERROR_TEMP_1284 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42337) Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT
[ https://issues.apache.org/jira/browse/SPARK-42337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42337: Assignee: (was: Apache Spark) > Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT > - > > Key: SPARK-42337 > URL: https://issues.apache.org/jira/browse/SPARK-42337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT and move > the following error classes to use the new one: > * _LEGACY_ERROR_TEMP_1283 > * _LEGACY_ERROR_TEMP_1284 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42337) Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT
[ https://issues.apache.org/jira/browse/SPARK-42337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684855#comment-17684855 ] Apache Spark commented on SPARK-42337: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/39910 > Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT > - > > Key: SPARK-42337 > URL: https://issues.apache.org/jira/browse/SPARK-42337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT and move > the following error classes to use the new one: > * _LEGACY_ERROR_TEMP_1283 > * _LEGACY_ERROR_TEMP_1284 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
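For context, the check behind this error class rejects creating a persistent object (such as a view) that references a temporary object, since the persistent object would outlive the temporary one. Below is a minimal plain-Python sketch of that rule, not Spark's actual analyzer code; all names are illustrative:

```python
class AnalysisException(Exception):
    pass

def check_persistent_over_temp(obj_name, referenced, temp_objects):
    """Reject creating a persistent object that references any temporary object."""
    for ref in referenced:
        if ref in temp_objects:
            raise AnalysisException(
                "[CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT] Cannot create "
                f"persistent object {obj_name} because it references "
                f"temporary object {ref}.")

temp_views = {"input_table"}  # temp views registered in the session
check_persistent_over_temp("my_view", ["base_table"], temp_views)  # passes
try:
    check_persistent_over_temp("my_view", ["input_table"], temp_views)
except AnalysisException as e:
    print(e)
```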
[jira] [Commented] (SPARK-42287) Optimize the packaging strategy of connect client module
[ https://issues.apache.org/jira/browse/SPARK-42287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684849#comment-17684849 ] Apache Spark commented on SPARK-42287: -- User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/39866 > Optimize the packaging strategy of connect client module > > > Key: SPARK-42287 > URL: https://issues.apache.org/jira/browse/SPARK-42287 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > # `perfmark-api` is not shaded into the `connect-client-jvm` module jar, and it is > not a default dependency of Spark; we can package `perfmark-api` into the > `connect-client-jvm` module jar so users do not have to depend on it manually. > # The sbt-assembly output of the `connect-client-jvm` module packs too many jars > without relocating them (such as hadoop, rocksdb, roaringbitmap); it > should be simplified to match the Maven packaging results. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42337) Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT
[ https://issues.apache.org/jira/browse/SPARK-42337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-42337: - Summary: Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT (was: Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT) > Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT > - > > Key: SPARK-42337 > URL: https://issues.apache.org/jira/browse/SPARK-42337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT and move > the following error classes to use the new one: > * _LEGACY_ERROR_TEMP_1283 > * _LEGACY_ERROR_TEMP_1284 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41470) SPJ: Spark shouldn't assume InternalRow implements equals and hashCode
[ https://issues.apache.org/jira/browse/SPARK-41470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-41470: - Fix Version/s: 3.4.0 (was: 3.5.0) > SPJ: Spark shouldn't assume InternalRow implements equals and hashCode > -- > > Key: SPARK-41470 > URL: https://issues.apache.org/jira/browse/SPARK-41470 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Priority: Major > Fix For: 3.4.0 > > > Currently SPJ (Storage-Partitioned Join) actually assumes the {{InternalRow}} > returned by {{HasPartitionKey}} implements {{equals}} and {{{}hashCode{}}}. > We should remove this restriction. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41470) SPJ: Spark shouldn't assume InternalRow implements equals and hashCode
[ https://issues.apache.org/jira/browse/SPARK-41470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-41470: Assignee: Mars > SPJ: Spark shouldn't assume InternalRow implements equals and hashCode > -- > > Key: SPARK-41470 > URL: https://issues.apache.org/jira/browse/SPARK-41470 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Mars >Priority: Major > Fix For: 3.4.0 > > > Currently SPJ (Storage-Partitioned Join) actually assumes the {{InternalRow}} > returned by {{HasPartitionKey}} implements {{equals}} and {{{}hashCode{}}}. > We should remove this restriction. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
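The restriction removed here can be pictured in plain Python (a sketch, not Spark's Scala internals): if partition keys are objects without value-based equality and hashing, grouping by the key object silently treats equal keys as distinct, so the planner should compare projected values rather than the row objects themselves:

```python
class OpaqueRow:
    """Stands in for a row type with no value-based equals/hashCode.
    Python's default __eq__/__hash__ are identity-based, like the problem case."""
    def __init__(self, *values):
        self.values = values

rows = [OpaqueRow(1, "a"), OpaqueRow(1, "a")]

# Grouping by the row object itself treats equal partition keys as distinct...
by_object = {r for r in rows}
assert len(by_object) == 2

# ...so group by the projected key values instead of the row object.
by_values = {r.values for r in rows}
assert len(by_values) == 1
print(len(by_object), len(by_values))  # 2 1
```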