[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685124#comment-17685124 ] Peter Toth commented on SPARK-42346:
--

[~ritikam], please use the PySpark repro in the description, or add a second row to your input_table if you use Scala. That's because Spark can optimize count distinct away for one-row local relations.

> distinct(count colname) with UNION ALL causes query analyzer bug
> ----------------------------------------------------------------
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Reporter: Robin
> Assignee: Peter Toth
> Priority: Major
> Fix For: 3.3.2, 3.4.0, 3.5.0
>
> If you combine a UNION ALL with a count(distinct colname) you get a query analyzer bug.
>
> This behaviour was introduced in 3.3.0; the bug was not present in 3.2.1.
>
> Here is a reprex in PySpark:
> {code:python}
> import pandas as pd
>
> df_pd = pd.DataFrame([
>     {'surname': 'a', 'first_name': 'b'}
> ])
> df_spark = spark.createDataFrame(df_pd)
> df_spark.createOrReplaceTempView("input_table")
> sql = """
> SELECT
>   (SELECT Count(DISTINCT first_name) FROM input_table)
>   AS distinct_value_count
> FROM input_table
> UNION ALL
> SELECT
>   (SELECT Count(DISTINCT surname) FROM input_table)
>   AS distinct_value_count
> FROM input_table """
> spark.sql(sql).toPandas()
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
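[Editor's note] A plain-Python sketch (no Spark involved; names are illustrative) of what the repro query is supposed to compute: each UNION ALL branch emits, per input row, the table-wide distinct count of one column, so the correct result for the one-row table is two rows of 1. The analyzer bug makes Spark 3.3.0 fail before producing this.

```python
# Toy model of the repro query's semantics, assuming a one-row input_table.
rows = [{'surname': 'a', 'first_name': 'b'}]

def distinct_count(rows, col):
    """COUNT(DISTINCT col) over the whole table, i.e. the scalar subquery."""
    return len({r[col] for r in rows})

# UNION ALL concatenates the two branches; each branch repeats its scalar
# subquery result once per input row.
expected = ([distinct_count(rows, 'first_name')] * len(rows) +
            [distinct_count(rows, 'surname')] * len(rows))
print(expected)  # -> [1, 1]
```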
[jira] [Updated] (SPARK-42017) df["bad_key"] does not raise AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-42017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-42017:
--
Summary: df["bad_key"] does not raise AnalysisException (was: Different error type AnalysisException vs SparkConnectAnalysisException)

> df["bad_key"] does not raise AnalysisException
> ----------------------------------------------
>
> Key: SPARK-42017
> URL: https://issues.apache.org/jira/browse/SPARK-42017
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, Tests
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> e.g.)
> {code}
> 23/01/12 14:33:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> FAILED [ 8%]
> pyspark/sql/tests/test_column.py:105 (ColumnParityTests.test_access_column)
> self = <ColumnParityTests testMethod=test_access_column>
>
>     def test_access_column(self):
>         df = self.df
>         self.assertTrue(isinstance(df.key, Column))
>         self.assertTrue(isinstance(df["key"], Column))
>         self.assertTrue(isinstance(df[0], Column))
>         self.assertRaises(IndexError, lambda: df[2])
> >       self.assertRaises(AnalysisException, lambda: df["bad_key"])
> E       AssertionError: AnalysisException not raised by
>
> ../test_column.py:112: AssertionError
> {code}
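[Editor's note] A hedged, Spark-free sketch of the failure mode in the log above: if a client resolves column names lazily, indexing a bad key succeeds locally and nothing is raised, so the `assertRaises(AnalysisException, ...)` parity check fails with "not raised". The class names here are illustrative stand-ins, not the real PySpark types.

```python
# Hypothetical stand-in for pyspark.errors.AnalysisException.
class AnalysisException(Exception):
    pass

class LazyDataFrame:
    """Toy model of a client that defers column resolution: indexing an
    unknown column succeeds eagerly and would only fail at execution time."""
    def __getitem__(self, name):
        return ('unresolved_column', name)  # no eager validation

df = LazyDataFrame()

# The parity test expects AnalysisException here; with lazy resolution
# nothing is raised, which is exactly the AssertionError in the log.
try:
    df["bad_key"]
    outcome = 'no exception'
except AnalysisException:
    outcome = 'raised'
print(outcome)  # -> no exception
```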
[jira] [Assigned] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-42368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42368:
--
Assignee: Dongjoon Hyun

> Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
> ------------------------------------------------------------
>
> Key: SPARK-42368
> URL: https://issues.apache.org/jira/browse/SPARK-42368
> Project: Spark
> Issue Type: Test
> Components: Project Infra, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
[jira] [Resolved] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-42368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42368.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39921
[https://github.com/apache/spark/pull/39921]

> Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
> ------------------------------------------------------------
>
> Key: SPARK-42368
> URL: https://issues.apache.org/jira/browse/SPARK-42368
> Project: Spark
> Issue Type: Test
> Components: Project Infra, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Fix For: 3.4.0
[jira] [Commented] (SPARK-41708) Pull v1write information to WriteFiles
[ https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685103#comment-17685103 ] Apache Spark commented on SPARK-41708:
--
User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/39922

> Pull v1write information to WriteFiles
> --------------------------------------
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Assignee: XiDuo You
> Priority: Major
> Fix For: 3.4.0
>
> Make WriteFiles hold v1 write information
[jira] [Commented] (SPARK-39851) Improve join stats estimation if one side can keep uniqueness
[ https://issues.apache.org/jira/browse/SPARK-39851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685104#comment-17685104 ] Apache Spark commented on SPARK-39851:
--
User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/39923

> Improve join stats estimation if one side can keep uniqueness
> -------------------------------------------------------------
>
> Key: SPARK-39851
> URL: https://issues.apache.org/jira/browse/SPARK-39851
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> {code:sql}
> SELECT i_item_sk ss_item_sk
> FROM item,
>      (SELECT DISTINCT iss.i_brand_id    brand_id,
>                       iss.i_class_id    class_id,
>                       iss.i_category_id category_id
>       FROM item iss) x
> WHERE i_brand_id = brand_id
>   AND i_class_id = class_id
>   AND i_category_id = category_id
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, rowCount=3.24E+7)
> +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = class_id#52)) AND (i_category_id#15 = category_id#53)), Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7)
>    :- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5)
>    :  +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5)
>    :     +- Relation spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
>    +- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, rowCount=1.37E+5)
>       +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, rowCount=2.02E+5)
>          +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5)
>             +- Relation spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, rowCount=2.02E+5)
> +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = class_id#52)) AND (i_category_id#15 = category_id#53)), Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5)
>    :- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5)
>    :  +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5)
>    :     +- Relation spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5)
>    +- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, rowCount=1.37E+5)
>       +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, rowCount=2.02E+5)
>          +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5)
>             +- Relation
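[Editor's note] A back-of-the-envelope sketch (plain Python, not Spark's cost-based optimizer code) of the estimate the ticket asks for: when one join side is distinct on all the join keys, each row of the other side can match at most one row, so the join output is capped by the non-distinct side's row count. The row counts are taken from the plans above.

```python
probe_rows = 2.02e5           # item side after the not-null filters
distinct_build_rows = 1.37e5  # the DISTINCT aggregate over (brand_id, class_id, category_id)

def join_rows_upper_bound(probe_rows, build_side_is_unique):
    # With uniqueness on every join key, the inner join behaves like a
    # semi-join for cardinality purposes: at most one match per probe row.
    return probe_rows if build_side_is_unique else float('inf')

print(join_rows_upper_bound(probe_rows, True))  # -> 202000.0, matching the "Expected" plan
```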
[jira] [Commented] (SPARK-41708) Pull v1write information to WriteFiles
[ https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685102#comment-17685102 ] Apache Spark commented on SPARK-41708:
--
User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/39924

> Pull v1write information to WriteFiles
> --------------------------------------
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Assignee: XiDuo You
> Priority: Major
> Fix For: 3.4.0
>
> Make WriteFiles hold v1 write information
[jira] [Assigned] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-42368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42368:
--
Assignee: Apache Spark

> Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
> ------------------------------------------------------------
>
> Key: SPARK-42368
> URL: https://issues.apache.org/jira/browse/SPARK-42368
> Project: Spark
> Issue Type: Test
> Components: Project Infra, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Assignee: Apache Spark
> Priority: Minor
[jira] [Commented] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-42368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685101#comment-17685101 ] Apache Spark commented on SPARK-42368:
--
User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39921

> Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
> ------------------------------------------------------------
>
> Key: SPARK-42368
> URL: https://issues.apache.org/jira/browse/SPARK-42368
> Project: Spark
> Issue Type: Test
> Components: Project Infra, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Priority: Minor
[jira] [Assigned] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-42368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42368:
--
Assignee: (was: Apache Spark)

> Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
> ------------------------------------------------------------
>
> Key: SPARK-42368
> URL: https://issues.apache.org/jira/browse/SPARK-42368
> Project: Spark
> Issue Type: Test
> Components: Project Infra, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Priority: Minor
[jira] [Resolved] (SPARK-41962) Update the import order of scala package in class SpecificParquetRecordReaderBase
[ https://issues.apache.org/jira/browse/SPARK-41962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41962.
--
Fix Version/s: 3.2.4, 3.3.2
Resolution: Fixed

Issue resolved by pull request 39906
[https://github.com/apache/spark/pull/39906]

> Update the import order of scala package in class SpecificParquetRecordReaderBase
> ---------------------------------------------------------------------------------
>
> Key: SPARK-41962
> URL: https://issues.apache.org/jira/browse/SPARK-41962
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: shuyouZZ
> Assignee: shuyouZZ
> Priority: Major
> Fix For: 3.2.4, 3.3.2, 3.4.0
>
> There is a checkstyle issue in class {{SpecificParquetRecordReaderBase}}: the import order of the scala package is not correct.
[jira] [Assigned] (SPARK-41962) Update the import order of scala package in class SpecificParquetRecordReaderBase
[ https://issues.apache.org/jira/browse/SPARK-41962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41962:
--
Assignee: shuyouZZ

> Update the import order of scala package in class SpecificParquetRecordReaderBase
> ---------------------------------------------------------------------------------
>
> Key: SPARK-41962
> URL: https://issues.apache.org/jira/browse/SPARK-41962
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: shuyouZZ
> Assignee: shuyouZZ
> Priority: Major
> Fix For: 3.4.0
>
> There is a checkstyle issue in class {{SpecificParquetRecordReaderBase}}: the import order of the scala package is not correct.
[jira] [Resolved] (SPARK-42306) Assign name to _LEGACY_ERROR_TEMP_1317
[ https://issues.apache.org/jira/browse/SPARK-42306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42306.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39877
[https://github.com/apache/spark/pull/39877]

> Assign name to _LEGACY_ERROR_TEMP_1317
> --------------------------------------
>
> Key: SPARK-42306
> URL: https://issues.apache.org/jira/browse/SPARK-42306
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-42306) Assign name to _LEGACY_ERROR_TEMP_1317
[ https://issues.apache.org/jira/browse/SPARK-42306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42306:
--
Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_1317
> --------------------------------------
>
> Key: SPARK-42306
> URL: https://issues.apache.org/jira/browse/SPARK-42306
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
[jira] [Updated] (SPARK-42352) Upgrade maven to 3.8.7
[ https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42352:
--
Description:
[https://maven.apache.org/docs/3.8.7/release-notes.html]

was:
[https://maven.apache.org/docs/3.8.7/release-notes.html]

change to upgrade 3.9.0

https://maven.apache.org/docs/3.9.0/release-notes.html

> Upgrade maven to 3.8.7
> ----------------------
>
> Key: SPARK-42352
> URL: https://issues.apache.org/jira/browse/SPARK-42352
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Priority: Minor
>
> [https://maven.apache.org/docs/3.8.7/release-notes.html]
[jira] [Updated] (SPARK-42352) Upgrade maven to 3.8.7
[ https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42352:
--
Summary: Upgrade maven to 3.8.7 (was: Upgrade maven to 3.9.0)

> Upgrade maven to 3.8.7
> ----------------------
>
> Key: SPARK-42352
> URL: https://issues.apache.org/jira/browse/SPARK-42352
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Priority: Minor
>
> [https://maven.apache.org/docs/3.8.7/release-notes.html]
>
> change to upgrade 3.9.0
>
> https://maven.apache.org/docs/3.9.0/release-notes.html
[jira] [Created] (SPARK-42368) Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
Dongjoon Hyun created SPARK-42368:
--

Summary: Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
Key: SPARK-42368
URL: https://issues.apache.org/jira/browse/SPARK-42368
Project: Spark
Issue Type: Test
Components: Project Infra, Tests
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41612.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39919
[https://github.com/apache/spark/pull/39919]

> Support Catalog.isCached
> ------------------------
>
> Key: SPARK-41612
> URL: https://issues.apache.org/jira/browse/SPARK-41612
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41600:
--
Assignee: Hyukjin Kwon

> Support Catalog.cacheTable
> --------------------------
>
> Key: SPARK-41600
> URL: https://issues.apache.org/jira/browse/SPARK-41600
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
[jira] [Assigned] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41623:
--
Assignee: Hyukjin Kwon

> Support Catalog.uncacheTable
> ----------------------------
>
> Key: SPARK-41623
> URL: https://issues.apache.org/jira/browse/SPARK-41623
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
[jira] [Assigned] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41612:
--
Assignee: Hyukjin Kwon

> Support Catalog.isCached
> ------------------------
>
> Key: SPARK-41612
> URL: https://issues.apache.org/jira/browse/SPARK-41612
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
[jira] [Resolved] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41623.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39919
[https://github.com/apache/spark/pull/39919]

> Support Catalog.uncacheTable
> ----------------------------
>
> Key: SPARK-41623
> URL: https://issues.apache.org/jira/browse/SPARK-41623
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
[jira] [Resolved] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41600.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39919
[https://github.com/apache/spark/pull/39919]

> Support Catalog.cacheTable
> --------------------------
>
> Key: SPARK-41600
> URL: https://issues.apache.org/jira/browse/SPARK-41600
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
[jira] [Created] (SPARK-42367) DataFrame.drop could handle duplicated columns
Ruifeng Zheng created SPARK-42367:
--

Summary: DataFrame.drop could handle duplicated columns
Key: SPARK-42367
URL: https://issues.apache.org/jira/browse/SPARK-42367
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng

{code:java}
>>> df.join(df2, df.name == df2.name, 'inner').show()
+---+----+------+----+
|age|name|height|name|
+---+----+------+----+
| 16| Bob|    85| Bob|
| 14| Tom|    80| Tom|
+---+----+------+----+

>>> df.join(df2, df.name == df2.name, 'inner').drop('name').show()
+---+------+
|age|height|
+---+------+
| 16|    85|
| 14|    80|
+---+------+
{code}
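[Editor's note] A toy model (plain Python, not the PySpark API) of the behavior shown above: dropping by name removes every column with that name, which is why both "name" columns disappear from the join result at once.

```python
# A join result with a duplicated column name, modeled as a name list.
columns = ['age', 'name', 'height', 'name']

def drop_by_name(cols, target):
    """Name-based drop removes all columns matching the name."""
    return [c for c in cols if c != target]

print(drop_by_name(columns, 'name'))  # -> ['age', 'height']
```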
[jira] [Resolved] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
[ https://issues.apache.org/jira/browse/SPARK-42364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42364.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39915
[https://github.com/apache/spark/pull/39915]

> Split 'pyspark.pandas.tests.test_dataframe'
> -------------------------------------------
>
> Key: SPARK-42364
> URL: https://issues.apache.org/jira/browse/SPARK-42364
> Project: Spark
> Issue Type: Test
> Components: ps, Tests
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
[ https://issues.apache.org/jira/browse/SPARK-42364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42364:
--
Assignee: Ruifeng Zheng

> Split 'pyspark.pandas.tests.test_dataframe'
> -------------------------------------------
>
> Key: SPARK-42364
> URL: https://issues.apache.org/jira/browse/SPARK-42364
> Project: Spark
> Issue Type: Test
> Components: ps, Tests
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
[jira] [Assigned] (SPARK-42363) Remove session.register_udf
[ https://issues.apache.org/jira/browse/SPARK-42363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42363:
--
Assignee: Hyukjin Kwon

> Remove session.register_udf
> ---------------------------
>
> Key: SPARK-42363
> URL: https://issues.apache.org/jira/browse/SPARK-42363
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
[jira] [Resolved] (SPARK-42363) Remove session.register_udf
[ https://issues.apache.org/jira/browse/SPARK-42363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42363.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39916
[https://github.com/apache/spark/pull/39916]

> Remove session.register_udf
> ---------------------------
>
> Key: SPARK-42363
> URL: https://issues.apache.org/jira/browse/SPARK-42363
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
[jira] [Resolved] (SPARK-42038) SPJ: Support partially clustered distribution
[ https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42038.
--
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39633
[https://github.com/apache/spark/pull/39633]

> SPJ: Support partially clustered distribution
> ---------------------------------------------
>
> Key: SPARK-42038
> URL: https://issues.apache.org/jira/browse/SPARK-42038
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Fix For: 3.4.0
>
> Currently the storage-partitioned join requires both sides to be fully clustered on the partition values, that is, all input partitions reported by a V2 data source shall be grouped by partition values before the join happens. This can lead to data skew issues if a particular partition value is associated with a large number of rows.
>
> To combat this, we can introduce the idea of partially clustered distribution, which means that only one side of the join is required to be fully clustered, while the other side is not. This allows Spark to increase the parallelism of the join and avoid the data skew.
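[Editor's note] A toy illustration (plain Python, not Spark's implementation) of the skew problem the ticket describes: fully clustering both sides forces every row of a hot partition value into one task, while keeping only one side fully clustered lets the other side's splits for that value stay spread across tasks. The split count and row counts are made up for the example.

```python
from collections import Counter

# Partition values of the large side's rows; 'a' is heavily skewed.
left_partition_values = ['a'] * 1000 + ['b'] * 10

# Fully clustered: one task per partition value, so the 'a' task gets 1000 rows.
fully = Counter(left_partition_values)

# Partially clustered: leave the large side's splits ungrouped (say 10 splits)
# and replicate the matching fully-clustered group from the other side to each.
splits = 10
partially = [len(left_partition_values) // splits] * splits

print(fully['a'], max(partially))  # -> 1000 101
```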
[jira] [Assigned] (SPARK-42038) SPJ: Support partially clustered distribution
[ https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42038:
--
Assignee: Chao Sun

> SPJ: Support partially clustered distribution
> ---------------------------------------------
>
> Key: SPARK-42038
> URL: https://issues.apache.org/jira/browse/SPARK-42038
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
>
> Currently the storage-partitioned join requires both sides to be fully clustered on the partition values, that is, all input partitions reported by a V2 data source shall be grouped by partition values before the join happens. This can lead to data skew issues if a particular partition value is associated with a large number of rows.
>
> To combat this, we can introduce the idea of partially clustered distribution, which means that only one side of the join is required to be fully clustered, while the other side is not. This allows Spark to increase the parallelism of the join and avoid the data skew.
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, the library used by the CSV data source, was upgraded from 2.8.3 to 2.9.0 to fix some bugs. The upgrade also pulled in a new univocity-parsers feature: values in the first column that start with the comment character are now quoted. This is a breaking change for downstream users that handle a whole row as input.

For this code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test")
{code}
before Spark 3.0 the content of the output CSV files was:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is:
!image-2023-02-03-18-56-10-083.png!

Users cannot set the comment option to '\u0000' to keep the old behavior, because of the newly added `isCommentSet` check:
{code:java}
val isCommentSet = this.comment != '\u0000'

def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}
{code}
It would be better to pass the comment option through to univocity only when users set it explicitly in the CSV data source.

After this change, the behavior is as follows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\u0000def", "xyz").toDF().write.option("comment", "\u0000").csv(path)|#abc def xyz|"#abc" def xyz|#abc "def" xyz|this update differs slightly from 3.0|
|2|Seq("#abc", "\u0000def", "xyz").toDF().write.option("comment", "#").csv(path)|#abc def xyz|"#abc" def xyz|"#abc" def xyz|the same|
|3|Seq("#abc", "\u0000def", "xyz").toDF().write.csv(path)|#abc def xyz|"#abc" def xyz|"#abc" def xyz|default behavior: the same|
|4|Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.option("comment", "\u0000").csv(path)|#abc xyz|#abc \u0000def xyz|#abc xyz|this update differs slightly from 3.0|
|5|Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.option("comment", "#").csv(path)|\u0000def xyz|\u0000def xyz|\u0000def xyz|the same|
|6|Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.csv(path)|#abc xyz|#abc \u0000def xyz|#abc \u0000def xyz|default behavior: the same|
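The difference between the existing value-based check and the proposed explicit-presence check can be sketched in plain Python. This is an illustrative model only, not Spark's actual CSVOptions class; the names are hypothetical:

```python
# Illustrative model of the option-handling shape, not the real implementation.
DEFAULT_COMMENT = "\u0000"  # univocity's "no comment character" sentinel


class CsvOptions:
    def __init__(self, params):
        self.params = params  # options the user passed, e.g. {"comment": "#"}

    @property
    def comment(self):
        return self.params.get("comment", DEFAULT_COMMENT)

    def is_comment_set_by_value(self):
        # The check criticized in the ticket: comparing against the default
        # cannot distinguish "unset" from "explicitly set to \u0000".
        return self.comment != DEFAULT_COMMENT

    def is_comment_set_explicitly(self):
        # The proposed behavior: forward the option whenever the user set it.
        return "comment" in self.params


opts = CsvOptions({"comment": "\u0000"})
print(opts.is_comment_set_by_value())    # False: the explicit setting is lost
print(opts.is_comment_set_explicitly())  # True: the user's intent is visible
```

With the presence-based check, an explicit `option("comment", "\u0000")` can be forwarded to univocity and restore the pre-3.0 behavior.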
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource
> Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 > Reporter: Wei Guo > Priority: Minor > Fix For: 3.4.0 > Attachments: image-2023-02-03-18-56-01-596.png, image-2023-02-03-18-56-10-083.png
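The writer-side change behind the breakage can also be illustrated without Spark or univocity. A minimal plain-Python sketch (hypothetical helper names, using `#` as the comment character for readability) of how quoting the leading field protects it from a comment-honoring reader:

```python
COMMENT = "#"  # stand-in comment character for this sketch


def write_line(fields, quote_leading_comment):
    # Mimics the writer: univocity >= 2.9.0 quotes a first field that starts
    # with the comment character; older versions left it bare.
    out = []
    for i, f in enumerate(fields):
        if i == 0 and quote_leading_comment and f.startswith(COMMENT):
            f = '"' + f.replace('"', '""') + '"'
        out.append(f)
    return ",".join(out)


def read_lines(lines):
    # A comment-honoring reader drops unquoted lines starting with COMMENT.
    return [line for line in lines if not line.startswith(COMMENT)]


old_style = write_line(["#abc", "1"], quote_leading_comment=False)  # '#abc,1'
new_style = write_line(["#abc", "1"], quote_leading_comment=True)   # '"#abc",1'
print(read_lines([old_style]))  # []: the bare row is swallowed as a comment
print(read_lines([new_style]))  # ['"#abc",1']: the quoted row survives
```

This is why the 2.9.0 quoting is a round-trip fix for CSV readers, while breaking consumers that expect the raw unquoted value.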
[jira] [Resolved] (SPARK-42354) Upgrade Jackson to 2.14.2
[ https://issues.apache.org/jira/browse/SPARK-42354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42354. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39898 [https://github.com/apache/spark/pull/39898] > Upgrade Jackson to 2.14.2 > - > > Key: SPARK-42354 > URL: https://issues.apache.org/jira/browse/SPARK-42354 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.14.2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42354) Upgrade Jackson to 2.14.2
[ https://issues.apache.org/jira/browse/SPARK-42354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42354: - Assignee: Yang Jie > Upgrade Jackson to 2.14.2 > - > > Key: SPARK-42354 > URL: https://issues.apache.org/jira/browse/SPARK-42354 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.14.2
[jira] [Assigned] (SPARK-41716) Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py
[ https://issues.apache.org/jira/browse/SPARK-41716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41716: Assignee: (was: Apache Spark) > Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py > -- > > Key: SPARK-41716 > URL: https://issues.apache.org/jira/browse/SPARK-41716 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > _catalog_to_pandas is more about client.py. We should probably factor this > out to the client.
[jira] [Assigned] (SPARK-41716) Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py
[ https://issues.apache.org/jira/browse/SPARK-41716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41716: Assignee: Apache Spark > Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py > -- > > Key: SPARK-41716 > URL: https://issues.apache.org/jira/browse/SPARK-41716 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > _catalog_to_pandas is more about client.py. We should probably factor this > out to the client.
[jira] [Commented] (SPARK-41716) Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py
[ https://issues.apache.org/jira/browse/SPARK-41716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685048#comment-17685048 ] Apache Spark commented on SPARK-41716: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39920 > Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py > -- > > Key: SPARK-41716 > URL: https://issues.apache.org/jira/browse/SPARK-41716 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > _catalog_to_pandas is more about client.py. We should probably factor this > out to the client.
[jira] [Assigned] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42362: - Assignee: Bjørn Jørgensen > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1
[jira] [Resolved] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42362. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39912 [https://github.com/apache/spark/pull/39912] > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0 > > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1
[jira] [Commented] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685032#comment-17685032 ] Apache Spark commented on SPARK-41612: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.isCached > > > Key: SPARK-41612 > URL: https://issues.apache.org/jira/browse/SPARK-41612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Assigned] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41623: Assignee: (was: Apache Spark) > Support Catalog.uncacheTable > > > Key: SPARK-41623 > URL: https://issues.apache.org/jira/browse/SPARK-41623 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Assigned] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41612: Assignee: (was: Apache Spark) > Support Catalog.isCached > > > Key: SPARK-41612 > URL: https://issues.apache.org/jira/browse/SPARK-41612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Commented] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685030#comment-17685030 ] Apache Spark commented on SPARK-41612: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.isCached > > > Key: SPARK-41612 > URL: https://issues.apache.org/jira/browse/SPARK-41612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Commented] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685028#comment-17685028 ] Apache Spark commented on SPARK-41623: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.uncacheTable > > > Key: SPARK-41623 > URL: https://issues.apache.org/jira/browse/SPARK-41623 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Commented] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685031#comment-17685031 ] Apache Spark commented on SPARK-41612: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.isCached > > > Key: SPARK-41612 > URL: https://issues.apache.org/jira/browse/SPARK-41612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Assigned] (SPARK-41612) Support Catalog.isCached
[ https://issues.apache.org/jira/browse/SPARK-41612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41612: Assignee: Apache Spark > Support Catalog.isCached > > > Key: SPARK-41612 > URL: https://issues.apache.org/jira/browse/SPARK-41612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41623: Assignee: Apache Spark > Support Catalog.uncacheTable > > > Key: SPARK-41623 > URL: https://issues.apache.org/jira/browse/SPARK-41623 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-41623) Support Catalog.uncacheTable
[ https://issues.apache.org/jira/browse/SPARK-41623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685027#comment-17685027 ] Apache Spark commented on SPARK-41623: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.uncacheTable > > > Key: SPARK-41623 > URL: https://issues.apache.org/jira/browse/SPARK-41623 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Commented] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685024#comment-17685024 ] Apache Spark commented on SPARK-41600: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.cacheTable > -- > > Key: SPARK-41600 > URL: https://issues.apache.org/jira/browse/SPARK-41600 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major >
[jira] [Assigned] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41600: Assignee: Apache Spark > Support Catalog.cacheTable > -- > > Key: SPARK-41600 > URL: https://issues.apache.org/jira/browse/SPARK-41600 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42366) Log shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685025#comment-17685025 ] Apache Spark commented on SPARK-42366: -- User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/39918 > Log shuffle data corruption diagnose cause > -- > > Key: SPARK-42366 > URL: https://issues.apache.org/jira/browse/SPARK-42366 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Minor >
[jira] [Assigned] (SPARK-42366) Log shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42366: Assignee: (was: Apache Spark) > Log shuffle data corruption diagnose cause > -- > > Key: SPARK-42366 > URL: https://issues.apache.org/jira/browse/SPARK-42366 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Minor >
[jira] [Assigned] (SPARK-42366) Log shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42366: Assignee: Apache Spark > Log shuffle data corruption diagnose cause > -- > > Key: SPARK-42366 > URL: https://issues.apache.org/jira/browse/SPARK-42366 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: dzcxzl >Assignee: Apache Spark >Priority: Minor >
[jira] [Commented] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685026#comment-17685026 ] Apache Spark commented on SPARK-41600: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39919 > Support Catalog.cacheTable > -- > > Key: SPARK-41600 > URL: https://issues.apache.org/jira/browse/SPARK-41600 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41600) Support Catalog.cacheTable
[ https://issues.apache.org/jira/browse/SPARK-41600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41600: Assignee: (was: Apache Spark) > Support Catalog.cacheTable > -- > > Key: SPARK-41600 > URL: https://issues.apache.org/jira/browse/SPARK-41600 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42366) Log shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-42366: --- Summary: Log shuffle data corruption diagnose cause (was: Log output shuffle data corruption diagnose cause) > Log shuffle data corruption diagnose cause > -- > > Key: SPARK-42366 > URL: https://issues.apache.org/jira/browse/SPARK-42366 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42366) Log output shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-42366: --- Summary: Log output shuffle data corruption diagnose cause (was: Log output shuffle data corruption diagnose causes) > Log output shuffle data corruption diagnose cause > - > > Key: SPARK-42366 > URL: https://issues.apache.org/jira/browse/SPARK-42366 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42366) Log output shuffle data corruption diagnose causes
dzcxzl created SPARK-42366: -- Summary: Log output shuffle data corruption diagnose causes Key: SPARK-42366 URL: https://issues.apache.org/jira/browse/SPARK-42366 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.2.0 Reporter: dzcxzl -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42352) Upgrade maven to 3.9.0
[ https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42352: - Description: [https://maven.apache.org/docs/3.8.7/release-notes.html] change to upgrade 3.9.0 https://maven.apache.org/docs/3.9.0/release-notes.html was:https://maven.apache.org/docs/3.8.7/release-notes.html > Upgrade maven to 3.9.0 > -- > > Key: SPARK-42352 > URL: https://issues.apache.org/jira/browse/SPARK-42352 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > [https://maven.apache.org/docs/3.8.7/release-notes.html] > > change to upgrade 3.9.0 > > https://maven.apache.org/docs/3.9.0/release-notes.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42352) Upgrade maven to 3.9.0
[ https://issues.apache.org/jira/browse/SPARK-42352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42352: - Summary: Upgrade maven to 3.9.0 (was: Upgrade maven to 3.8.7) > Upgrade maven to 3.9.0 > -- > > Key: SPARK-42352 > URL: https://issues.apache.org/jira/browse/SPARK-42352 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > https://maven.apache.org/docs/3.8.7/release-notes.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42365) Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
[ https://issues.apache.org/jira/browse/SPARK-42365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42365: Assignee: Apache Spark > Split 'pyspark.pandas.tests.test_ops_on_diff_frames' > > > Key: SPARK-42365 > URL: https://issues.apache.org/jira/browse/SPARK-42365 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42365) Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
[ https://issues.apache.org/jira/browse/SPARK-42365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685016#comment-17685016 ] Apache Spark commented on SPARK-42365: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39917 > Split 'pyspark.pandas.tests.test_ops_on_diff_frames' > > > Key: SPARK-42365 > URL: https://issues.apache.org/jira/browse/SPARK-42365 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42365) Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
[ https://issues.apache.org/jira/browse/SPARK-42365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42365: Assignee: (was: Apache Spark) > Split 'pyspark.pandas.tests.test_ops_on_diff_frames' > > > Key: SPARK-42365 > URL: https://issues.apache.org/jira/browse/SPARK-42365 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42365) Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
[ https://issues.apache.org/jira/browse/SPARK-42365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685015#comment-17685015 ] Apache Spark commented on SPARK-42365: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39917 > Split 'pyspark.pandas.tests.test_ops_on_diff_frames' > > > Key: SPARK-42365 > URL: https://issues.apache.org/jira/browse/SPARK-42365 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42365) Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
Ruifeng Zheng created SPARK-42365: - Summary: Split 'pyspark.pandas.tests.test_ops_on_diff_frames' Key: SPARK-42365 URL: https://issues.apache.org/jira/browse/SPARK-42365 Project: Spark Issue Type: Test Components: ps, Tests Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42363) Remove session.register_udf
[ https://issues.apache.org/jira/browse/SPARK-42363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42363: Assignee: (was: Apache Spark) > Remove session.register_udf > --- > > Key: SPARK-42363 > URL: https://issues.apache.org/jira/browse/SPARK-42363 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40532) Python version for UDF should follow the servers version
[ https://issues.apache.org/jira/browse/SPARK-40532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40532: Assignee: (was: Apache Spark) > Python version for UDF should follow the servers version > > > Key: SPARK-40532 > URL: https://issues.apache.org/jira/browse/SPARK-40532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Minor > > Currently, we artificially pin the Python version to 3.9 in the UDF > generation code, but this should actually be the correct server vs client > version. > > In addition the version should be configured as part of the function > definition proto message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42363) Remove session.register_udf
[ https://issues.apache.org/jira/browse/SPARK-42363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685005#comment-17685005 ] Apache Spark commented on SPARK-42363: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39916 > Remove session.register_udf > --- > > Key: SPARK-42363 > URL: https://issues.apache.org/jira/browse/SPARK-42363 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42363) Remove session.register_udf
[ https://issues.apache.org/jira/browse/SPARK-42363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42363: Assignee: Apache Spark > Remove session.register_udf > --- > > Key: SPARK-42363 > URL: https://issues.apache.org/jira/browse/SPARK-42363 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40532) Python version for UDF should follow the servers version
[ https://issues.apache.org/jira/browse/SPARK-40532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685004#comment-17685004 ] Apache Spark commented on SPARK-40532: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39914 > Python version for UDF should follow the servers version > > > Key: SPARK-40532 > URL: https://issues.apache.org/jira/browse/SPARK-40532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Minor > > Currently, we artificially pin the Python version to 3.9 in the UDF > generation code, but this should actually be the correct server vs client > version. > > In addition the version should be configured as part of the function > definition proto message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40532) Python version for UDF should follow the servers version
[ https://issues.apache.org/jira/browse/SPARK-40532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40532: Assignee: Apache Spark > Python version for UDF should follow the servers version > > > Key: SPARK-40532 > URL: https://issues.apache.org/jira/browse/SPARK-40532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Apache Spark >Priority: Minor > > Currently, we artificially pin the Python version to 3.9 in the UDF > generation code, but this should actually be the correct server vs client > version. > > In addition the version should be configured as part of the function > definition proto message. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
[ https://issues.apache.org/jira/browse/SPARK-42364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42364: Assignee: Apache Spark > Split 'pyspark.pandas.tests.test_dataframe' > --- > > Key: SPARK-42364 > URL: https://issues.apache.org/jira/browse/SPARK-42364 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
[ https://issues.apache.org/jira/browse/SPARK-42364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685003#comment-17685003 ] Apache Spark commented on SPARK-42364: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39915 > Split 'pyspark.pandas.tests.test_dataframe' > --- > > Key: SPARK-42364 > URL: https://issues.apache.org/jira/browse/SPARK-42364 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
[ https://issues.apache.org/jira/browse/SPARK-42364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42364: Assignee: (was: Apache Spark) > Split 'pyspark.pandas.tests.test_dataframe' > --- > > Key: SPARK-42364 > URL: https://issues.apache.org/jira/browse/SPARK-42364 > Project: Spark > Issue Type: Test > Components: ps, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42364) Split 'pyspark.pandas.tests.test_dataframe'
Ruifeng Zheng created SPARK-42364: - Summary: Split 'pyspark.pandas.tests.test_dataframe' Key: SPARK-42364 URL: https://issues.apache.org/jira/browse/SPARK-42364 Project: Spark Issue Type: Test Components: ps, Tests Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42363) Remove session.register_udf
Hyukjin Kwon created SPARK-42363: Summary: Remove session.register_udf Key: SPARK-42363 URL: https://issues.apache.org/jira/browse/SPARK-42363 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug
[ https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684992#comment-17684992 ] Ritika Maheshwari commented on SPARK-42346:
-------------------------------------------
I have Spark 3.3.0 and I do not have the 39887 fix. I am not able to reproduce this issue. Am I missing something?

scala> val df = Seq(("a","b")).toDF("surname","first_name")
df: org.apache.spark.sql.DataFrame = [surname: string, first_name: string]

scala> df.createOrReplaceTempView("input_table")

scala> spark.sql("select (Select Count(Distinct first_name) from input_table) As distinct_value_count from input_table Union all select (select count(Distinct surname) from input_table) as distinct_value_count from input_table").show()
+--------------------+
|distinct_value_count|
+--------------------+
|                   1|
|                   1|
+--------------------+

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Union
   :- Project [cast(Subquery subquery#46, [id=#114] as string) AS distinct_value_count#62]
   :  :  +- Subquery subquery#46, [id=#114]
   :  :     +- AdaptiveSparkPlan isFinalPlan=false
   :  :        +- HashAggregate(keys=[], functions=[count(first_name#12)], output=[count(DISTINCT first_name)#53L])
   :  :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#112]
   :  :              +- HashAggregate(keys=[], functions=[partial_count(first_name#12)], output=[count#67L])
   :  :                 +- LocalTableScan [first_name#12]
   :  +- LocalTableScan [_1#6, _2#7]
   +- Project [cast(Subquery subquery#48, [id=#125] as string) AS distinct_value_count#64]
      :  +- Subquery subquery#48, [id=#125]
      :     +- AdaptiveSparkPlan isFinalPlan=false
      :        +- HashAggregate(keys=[], functions=[count(surname#11)], output=[count(DISTINCT surname)#55L])
      :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#123]
      :              +- HashAggregate(keys=[], functions=[partial_count(surname#11)], output=[count#68L])
      :                 +- LocalTableScan [surname#11]
      +- LocalTableScan [_1#50, _2#51]

This is what I have in my SparkOptimizer.scala:

override def defaultBatches: Seq[Batch] = (preOptimizationBatches ++ super.defaultBatches :+
  Batch("Optimize Metadata Only Query", Once, OptimizeMetadataOnlyQuery(catalog)) :+
  Batch("PartitionPruning", Once, PartitionPruning) :+
  Batch("InjectRuntimeFilter", FixedPoint(1), InjectRuntimeFilter, RewritePredicateSubquery) :+
  Batch("MergeScalarSubqueries", Once, MergeScalarSubqueries) :+
  Batch("Pushdown Filters from PartitionPruning", fixedPoint, PushDownPredicates) :+
  Batch

> distinct(count colname) with UNION ALL causes query analyzer bug
> ----------------------------------------------------------------
>
>                 Key: SPARK-42346
>                 URL: https://issues.apache.org/jira/browse/SPARK-42346
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0, 3.4.0, 3.5.0
>            Reporter: Robin
>            Assignee: Peter Toth
>            Priority: Major
>             Fix For: 3.3.2, 3.4.0, 3.5.0
>
> If you combine a UNION ALL with a count(distinct colname) you get a query analyzer bug.
>
> This behaviour is introduced in 3.3.0. The bug was not present in 3.2.1.
>
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{ \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{ (SELECT Count(DISTINCT first_name) FROM input_table) }}
> {{ AS distinct_value_count}}
> {{FROM input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{ (SELECT Count(DISTINCT surname) FROM input_table) }}
> {{ AS distinct_value_count}}
> {{FROM input_table """}}
> {{spark.sql(sql).toPandas()}}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
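A likely reason the one-row Scala repro above does not fail: when the input is a one-row local relation, Spark can fold the COUNT(DISTINCT ...) subqueries away before the problematic plan shape ever forms, so the repro needs at least two input rows. A hedged sketch of what that would look like in a Spark shell (assumes an affected 3.3.x build; the row values are illustrative, and the exact failure mode on such a build is an analysis error rather than the `1`/`1` result shown above):

```scala
// Same query as Ritika's, but with a second row so the count distinct
// cannot be optimized out of the local relation before analysis.
val df = Seq(("a", "b"), ("c", "d")).toDF("surname", "first_name")
df.createOrReplaceTempView("input_table")

spark.sql("""
  SELECT (SELECT COUNT(DISTINCT first_name) FROM input_table) AS distinct_value_count
  FROM input_table
  UNION ALL
  SELECT (SELECT COUNT(DISTINCT surname) FROM input_table) AS distinct_value_count
  FROM input_table
""").show()
```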
[jira] [Commented] (SPARK-42268) Add UserDefinedType in protos
[ https://issues.apache.org/jira/browse/SPARK-42268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684990#comment-17684990 ] Apache Spark commented on SPARK-42268: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39913 > Add UserDefinedType in protos > - > > Key: SPARK-42268 > URL: https://issues.apache.org/jira/browse/SPARK-42268 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42268) Add UserDefinedType in protos
[ https://issues.apache.org/jira/browse/SPARK-42268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684989#comment-17684989 ] Apache Spark commented on SPARK-42268: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39913 > Add UserDefinedType in protos > - > > Key: SPARK-42268 > URL: https://issues.apache.org/jira/browse/SPARK-42268 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684952#comment-17684952 ] Apache Spark commented on SPARK-42362: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/39912 > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Minor > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42362: Assignee: (was: Apache Spark) > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Minor > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42362: Assignee: Apache Spark > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Minor > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
Bjørn Jørgensen created SPARK-42362: --- Summary: Upgrade kubernetes-client from 6.4.0 to 6.4.1 Key: SPARK-42362 URL: https://issues.apache.org/jira/browse/SPARK-42362 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Bjørn Jørgensen New version of kubernetes client Release notes https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42361) Add an option to use external storage to distribute JAR set in cluster mode on Kube
Holden Karau created SPARK-42361: Summary: Add an option to use external storage to distribute JAR set in cluster mode on Kube Key: SPARK-42361 URL: https://issues.apache.org/jira/browse/SPARK-42361 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.5.0 Reporter: Holden Karau tl;dr – sometimes the driver can get overwhelmed serving the initial jar set. You'll see a lot of "Executor fetching spark://.../jar" and then connection timed out. On YARN the jars (in cluster mode) are cached in HDFS. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
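One mitigation in the spirit of the YARN behavior described above is to stage the application jars on external storage so executors fetch them from there rather than from the driver's file server. A hedged sketch (the bucket, paths, and API-server address are placeholders, not from the ticket; `spark.kubernetes.file.upload.path` is the existing conf for uploading local dependencies to a shared store in cluster mode):

```shell
# Sketch: distribute the jar set via object storage instead of the
# driver's spark:// file server on Kubernetes in cluster mode.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.file.upload.path=s3a://my-bucket/spark-uploads \
  --jars s3a://my-bucket/deps/dep1.jar,s3a://my-bucket/deps/dep2.jar \
  s3a://my-bucket/app/app.jar
```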
[jira] [Commented] (SPARK-36478) Removes outer join if all grouping and aggregate expressions are from the streamed side
[ https://issues.apache.org/jira/browse/SPARK-36478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684906#comment-17684906 ] Apache Spark commented on SPARK-36478: -- User 'clubycoder' has created a pull request for this issue: https://github.com/apache/spark/pull/39911 > Removes outer join if all grouping and aggregate expressions are from the > streamed side > --- > > Key: SPARK-36478 > URL: https://issues.apache.org/jira/browse/SPARK-36478 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wan Kun >Priority: Minor > > Removes outer join if all grouping and aggregate expressions are from the > streamed side. > For example: > {code:java} > spark.range(200L).selectExpr("id AS a", "id as b", "id as > c").createTempView("t1") > spark.range(300L).selectExpr("id AS a").createTempView("t2") > spark.sql("SELECT t1.b, max(t1.c) as c FROM t1 LEFT JOIN t2 ON t1.a = t2.a > GROUP BY t1.b").explain(true) > {code} > Current optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#3L], [b#3L, max(c#4L) AS c#20L] > +- Project [b#3L, c#4L] >+- Join LeftOuter, (a#2L = a#10L) > :- Project [id#0L AS a#2L, id#0L AS b#3L, id#0L AS c#4L] > : +- Range (0, 200, step=1, splits=Some(1)) > +- Project [id#8L AS a#10L] > +- Range (0, 300, step=1, splits=Some(1)) > {code} > Expected optimized plan: > {code:java} > == Optimized Logical Plan == > Aggregate [b#277L], [b#277L, max(c#278L) AS c#290L] > +- Project [id#274L AS b#277L, id#274L AS c#278L] >+- Range (0, 200, step=1, splits=Some(2)) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684903#comment-17684903 ] Gera Shegalov edited comment on SPARK-41793 at 2/6/23 7:38 PM: --- if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and backported to maintenance branches? was (Author: jira.shegalov): if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and probably backported to maintenance branches? > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684903#comment-17684903 ] Gera Shegalov commented on SPARK-41793: --- if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and probably backported to maintenance branches? > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684895#comment-17684895 ] manpreet singh commented on SPARK-24942: [~gurwls223] Any updates on this? We are also facing this issue. We want to use stage-level scheduling, and our jobs need barrier execution. If we cannot enable DRA, we will incur a huge infra cost for the Spark pool that is no longer used by the current stage. > Improve cluster resource management with jobs containing barrier stage > -- > > Key: SPARK-24942 > URL: https://issues.apache.org/jira/browse/SPARK-24942 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r205652317 > We shall improve cluster resource management to address the following issues: > - With dynamic resource allocation enabled, it may happen that we acquire > some executors (but not enough to launch all the tasks in a barrier stage), > later release them when the executor idle timeout expires, and then acquire > them again. > - There can be a deadlock between two concurrent applications. Each application > may acquire some resources, but not enough to launch all the tasks in a > barrier stage. After hitting the idle timeout and releasing them, they > may acquire resources again, but just continually trade resources between > each other. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
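The two-application deadlock described in the issue can be pictured with a toy pure-Python model (not Spark code; the pool size, slot counts, and function names are invented for illustration). A barrier stage launches only when all of its tasks can run at once, so two apps splitting a shared pool can trade resources forever without either reaching its threshold:

```python
def trade_resources(pool, need, rounds):
    """Two apps race for free slots each round; a barrier stage launches only
    when one app holds `need` slots. Otherwise both hit the idle timeout and
    release everything, and the cycle repeats."""
    held = [0, 0]
    history = []
    for _ in range(rounds):
        free = pool - sum(held)
        held[0] += free - free // 2  # both apps split the free slots
        held[1] += free // 2
        history.append(tuple(held))
        if max(held) >= need:
            return history           # some barrier stage can finally launch
        held = [0, 0]                # idle timeout: release without running
    return history

# 10-slot pool, two apps each needing 8 slots for a barrier stage:
# every round ends at (5, 5); resources are continually traded, nothing runs.
print(trade_resources(pool=10, need=8, rounds=3))
```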
[jira] [Assigned] (SPARK-42357) Log `exitCode` when `SparkContext.stop` starts
[ https://issues.apache.org/jira/browse/SPARK-42357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42357: - Assignee: Dongjoon Hyun > Log `exitCode` when `SparkContext.stop` starts > -- > > Key: SPARK-42357 > URL: https://issues.apache.org/jira/browse/SPARK-42357 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > {code} > 23/02/06 02:12:55 INFO SparkContext: SparkContext is stopping with exitCode 0. > {code} > {code} > Pi is roughly 3.147080 > 23/02/06 02:12:55 INFO SparkContext: SparkContext is stopping with exitCode 0. > ... > 23/02/06 02:12:55 INFO AbstractConnector: Stopped Spark@1cb72b8{HTTP/1.1, > (http/1.1)}{localhost:4040} > 23/02/06 02:12:55 INFO SparkUI: Stopped Spark web UI at http://localhost:4040 > 23/02/06 02:12:55 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 23/02/06 02:12:55 INFO MemoryStore: MemoryStore cleared > 23/02/06 02:12:55 INFO BlockManager: BlockManager stopped > 23/02/06 02:12:55 INFO BlockManagerMaster: BlockManagerMaster stopped > 23/02/06 02:12:55 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 23/02/06 02:12:55 INFO SparkContext: Successfully stopped SparkContext > 23/02/06 02:12:56 INFO ShutdownHookManager: Shutdown hook called > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42357) Log `exitCode` when `SparkContext.stop` starts
[ https://issues.apache.org/jira/browse/SPARK-42357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42357. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39900 [https://github.com/apache/spark/pull/39900] > Log `exitCode` when `SparkContext.stop` starts > -- > > Key: SPARK-42357 > URL: https://issues.apache.org/jira/browse/SPARK-42357 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > > {code} > 23/02/06 02:12:55 INFO SparkContext: SparkContext is stopping with exitCode 0. > {code} > {code} > Pi is roughly 3.147080 > 23/02/06 02:12:55 INFO SparkContext: SparkContext is stopping with exitCode 0. > ... > 23/02/06 02:12:55 INFO AbstractConnector: Stopped Spark@1cb72b8{HTTP/1.1, > (http/1.1)}{localhost:4040} > 23/02/06 02:12:55 INFO SparkUI: Stopped Spark web UI at http://localhost:4040 > 23/02/06 02:12:55 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 23/02/06 02:12:55 INFO MemoryStore: MemoryStore cleared > 23/02/06 02:12:55 INFO BlockManager: BlockManager stopped > 23/02/06 02:12:55 INFO BlockManagerMaster: BlockManagerMaster stopped > 23/02/06 02:12:55 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 23/02/06 02:12:55 INFO SparkContext: Successfully stopped SparkContext > 23/02/06 02:12:56 INFO ShutdownHookManager: Shutdown hook called > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42337) Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT
[ https://issues.apache.org/jira/browse/SPARK-42337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684856#comment-17684856 ] Apache Spark commented on SPARK-42337: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/39910 > Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT > - > > Key: SPARK-42337 > URL: https://issues.apache.org/jira/browse/SPARK-42337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT and move > the following error classes to use the new one: > * _LEGACY_ERROR_TEMP_1283 > * _LEGACY_ERROR_TEMP_1284 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42337) Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT
[ https://issues.apache.org/jira/browse/SPARK-42337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42337: Assignee: Apache Spark > Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT > - > > Key: SPARK-42337 > URL: https://issues.apache.org/jira/browse/SPARK-42337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT and move > the following error classes to use the new one: > * _LEGACY_ERROR_TEMP_1283 > * _LEGACY_ERROR_TEMP_1284 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42337) Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT
[ https://issues.apache.org/jira/browse/SPARK-42337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42337: Assignee: (was: Apache Spark) > Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT > - > > Key: SPARK-42337 > URL: https://issues.apache.org/jira/browse/SPARK-42337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT and move > the following error classes to use the new one: > * _LEGACY_ERROR_TEMP_1283 > * _LEGACY_ERROR_TEMP_1284 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42337) Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT
[ https://issues.apache.org/jira/browse/SPARK-42337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684855#comment-17684855 ] Apache Spark commented on SPARK-42337: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/39910 > Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT > - > > Key: SPARK-42337 > URL: https://issues.apache.org/jira/browse/SPARK-42337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT and move > the following error classes to use the new one: > * _LEGACY_ERROR_TEMP_1283 > * _LEGACY_ERROR_TEMP_1284 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
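For context, the check behind this error class rejects creating a persistent object (such as a view) that references a temporary object, since the persistent object would outlive the temporary one. Below is a minimal plain-Python sketch of that rule, not Spark's actual analyzer code; all names are illustrative:

```python
class AnalysisException(Exception):
    pass

def check_persistent_over_temp(obj_name, referenced, temp_objects):
    """Reject creating a persistent object that references any temporary object."""
    for ref in referenced:
        if ref in temp_objects:
            raise AnalysisException(
                "[CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT] Cannot create "
                f"persistent object {obj_name} because it references "
                f"temporary object {ref}.")

temp_views = {"input_table"}  # temp views registered in the session
check_persistent_over_temp("my_view", ["base_table"], temp_views)  # passes
try:
    check_persistent_over_temp("my_view", ["input_table"], temp_views)
except AnalysisException as e:
    print(e)
```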
[jira] [Commented] (SPARK-42287) Optimize the packaging strategy of connect client module
[ https://issues.apache.org/jira/browse/SPARK-42287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684849#comment-17684849 ] Apache Spark commented on SPARK-42287: -- User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/39866 > Optimize the packaging strategy of connect client module > > > Key: SPARK-42287 > URL: https://issues.apache.org/jira/browse/SPARK-42287 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > # `perfmark-api` is not shaded into the `connect-client-jvm` module jar, and it is > not a default dependency of Spark; we can package `perfmark-api` into the > `connect-client-jvm` module jar so users do not have to depend on it manually. > # The sbt-assembly output of the `connect-client-jvm` module packs too many jars > without relocating them (such as hadoop, rocksdb, roaringbitmap); it > should be simplified to match the Maven packaging results. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42337) Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT
[ https://issues.apache.org/jira/browse/SPARK-42337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-42337: - Summary: Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT (was: Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT) > Add error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT > - > > Key: SPARK-42337 > URL: https://issues.apache.org/jira/browse/SPARK-42337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Add the new error class CREATE_PERSISTENT_OBJECT_OVER_TEMP_OBJECT and move > the following error classes to use the new one: > * _LEGACY_ERROR_TEMP_1283 > * _LEGACY_ERROR_TEMP_1284 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41470) SPJ: Spark shouldn't assume InternalRow implements equals and hashCode
[ https://issues.apache.org/jira/browse/SPARK-41470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-41470: - Fix Version/s: 3.4.0 (was: 3.5.0) > SPJ: Spark shouldn't assume InternalRow implements equals and hashCode > -- > > Key: SPARK-41470 > URL: https://issues.apache.org/jira/browse/SPARK-41470 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Priority: Major > Fix For: 3.4.0 > > > Currently SPJ (Storage-Partitioned Join) actually assumes the {{InternalRow}} > returned by {{HasPartitionKey}} implements {{equals}} and {{{}hashCode{}}}. > We should remove this restriction. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41470) SPJ: Spark shouldn't assume InternalRow implements equals and hashCode
[ https://issues.apache.org/jira/browse/SPARK-41470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-41470: Assignee: Mars > SPJ: Spark shouldn't assume InternalRow implements equals and hashCode > -- > > Key: SPARK-41470 > URL: https://issues.apache.org/jira/browse/SPARK-41470 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Mars >Priority: Major > Fix For: 3.4.0 > > > Currently SPJ (Storage-Partitioned Join) actually assumes the {{InternalRow}} > returned by {{HasPartitionKey}} implements {{equals}} and {{{}hashCode{}}}. > We should remove this restriction. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
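The restriction removed here can be pictured in plain Python (a sketch, not Spark's Scala internals): if partition keys are objects without value-based equality and hashing, grouping by the key object silently treats equal keys as distinct, so the planner should compare projected values rather than the row objects themselves:

```python
class OpaqueRow:
    """Stands in for a row type with no value-based equals/hashCode.
    Python's default __eq__/__hash__ are identity-based, like the problem case."""
    def __init__(self, *values):
        self.values = values

rows = [OpaqueRow(1, "a"), OpaqueRow(1, "a")]

# Grouping by the row object itself treats equal partition keys as distinct...
by_object = {r for r in rows}
assert len(by_object) == 2

# ...so group by the projected key values instead of the row object.
by_values = {r.values for r in rows}
assert len(by_values) == 1
print(len(by_object), len(by_values))  # 2 1
```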